---
tags:
- bertopic
library_name: bertopic
---

# BERTopic_Multimodal

This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

This model was trained on 8,091 images from the Flickr 8k dataset **without** their captions. It demonstrates how BERTopic can perform topic modeling with images as the only input.

A few examples of generated topics:



## Usage

To use this model, please install BERTopic:

```
pip install -U bertopic[vision]
pip install -U safetensors
```

You can use the model as follows:

```python
from bertopic import BERTopic
topic_model = BERTopic.load("MaartenGr/BERTopic_Multimodal")

topic_model.get_topic_info()
```

You can view all information about a topic as follows:

```python
topic_id = 0  # any Topic ID listed in the topic overview
topic_model.get_topic(topic_id, full=True)
```

## Topic overview

* Number of topics: 29
* Number of training documents: 8091

<details>
  <summary>Click here for an overview of all topics.</summary>

| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | while - air - the - in - jumping | 34 | -1_while_air_the_in |
| 0 | bench - sitting - people - woman - street | 1132 | 0_bench_sitting_people_woman |
| 1 | grass - running - dog - grassy - field | 1693 | 1_grass_running_dog_grassy |
| 2 | boy - girl - little - young - holding | 1290 | 2_boy_girl_little_young |
| 3 | dog - frisbee - running - water - mouth | 1224 | 3_dog_frisbee_running_water |
| 4 | skateboard - ramp - doing - trick - cement | 415 | 4_skateboard_ramp_doing_trick |
| 5 | snow - dog - covered - running - through | 309 | 5_snow_dog_covered_running |
| 6 | mountain - range - slope - standing - person | 205 | 6_mountain_range_slope_standing |
| 7 | pool - blue - boy - toy - water | 189 | 7_pool_blue_boy_toy |
| 8 | trail - bike - down - riding - person | 166 | 8_trail_bike_down_riding |
| 9 | snowboarder - mid - jump - air - after | 126 | 9_snowboarder_mid_jump_air |
| 10 | rock - climbing - up - wall - tree | 124 | 10_rock_climbing_up_wall |
| 11 | wave - surfboard - top - riding - of | 112 | 11_wave_surfboard_top_riding |
| 12 | beach - surfboard - people - with - walking | 102 | 12_beach_surfboard_people_with |
| 13 | jumping - track - horse - racquet - dog | 98 | 13_jumping_track_horse_racquet |
| 14 | snowboard - snow - girl - hill - slope | 95 | 14_snowboard_snow_girl_hill |
| 15 | game - being - football - played - professional | 91 | 15_game_being_football_played |
| 16 | soccer - kicking - team - ball - player | 80 | 16_soccer_kicking_team_ball |
| 17 | dirt - bike - person - rider - going | 75 | 17_dirt_bike_person_rider |
| 18 | soccer - boys - field - ball - kicking | 69 | 18_soccer_boys_field_ball |
| 19 | baseball - player - bat - swinging - into | 63 | 19_baseball_player_bat_swinging |
| 20 | basketball - up - and - playing - jumping | 59 | 20_basketball_up_and_playing |
| 21 | bird - body - flying - over - long | 55 | 21_bird_body_flying_over |
| 22 | motorcycle - track - race - racer - racing | 55 | 22_motorcycle_track_race_racer |
| 23 | boat - sitting - water - lake - hose | 53 | 23_boat_sitting_water_lake |
| 24 | street - riding - down - bike - woman | 52 | 24_street_riding_down_bike |
| 25 | paddle - suit - paddling - water - in | 49 | 25_paddle_suit_paddling_water |
| 26 | pair - scissors - stage - white - shirt | 42 | 26_pair_scissors_stage_white |
| 27 | tennis - court - racket - racquet - swinging | 34 | 27_tennis_court_racket_racquet |

</details>
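
The `Label` column in the table appears to be the topic ID followed by the first four keywords, joined with underscores, and the frequencies add up to the 8091 training documents. The snippet below is a minimal sketch of how to read the table under that assumption. The helper `make_label` is hypothetical and written for illustration only. It is not part of the BERTopic API, and the naming pattern is inferred from the rows shown here rather than taken from the library.

```python
def make_label(topic_id: int, keywords: list[str], n_words: int = 4) -> str:
    """Join the topic ID and its first `n_words` keywords with underscores."""
    return "_".join([str(topic_id), *keywords[:n_words]])


# Three rows copied from the topic overview: (topic ID, keywords, frequency)
rows = [
    (-1, ["while", "air", "the", "in", "jumping"], 34),
    (1, ["grass", "running", "dog", "grassy", "field"], 1693),
    (27, ["tennis", "court", "racket", "racquet", "swinging"], 34),
]
n_documents = 8091  # number of training documents reported in the overview

for topic_id, keywords, frequency in rows:
    label = make_label(topic_id, keywords)
    share = frequency / n_documents  # fraction of all images in this topic
    print(f"{label}: {frequency} images ({share:.1%} of the dataset)")
```

Topic `-1` is the outlier topic, so its share indicates how many images were left unassigned to any cluster.
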

## Training Procedure

The data was retrieved as follows:

```python
import os
import glob
import zipfile
import numpy as np
import pandas as pd
from tqdm import tqdm
from sentence_transformers import util

# Flickr 8k images
img_folder = 'photos/'
caps_folder = 'captions/'

# Only download and extract when the image folder is missing or empty
if not os.path.exists(img_folder) or len(os.listdir(img_folder)) == 0:
    os.makedirs(img_folder, exist_ok=True)

    if not os.path.exists('Flickr8k_Dataset.zip'):  # Download the dataset if it does not exist
        util.http_get('https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip', 'Flickr8k_Dataset.zip')
        util.http_get('https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip', 'Flickr8k_text.zip')

    # Extract the images and the captions into their own folders
    for folder, file in [(img_folder, 'Flickr8k_Dataset.zip'), (caps_folder, 'Flickr8k_text.zip')]:
        with zipfile.ZipFile(file, 'r') as zf:
            for member in tqdm(zf.infolist(), desc='Extracting'):
                zf.extract(member, folder)

# Collect the paths of all extracted images
images = list(glob.glob('photos/Flicker8k_Dataset/*.jpg'))
```
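
Before training, it can be worth checking that the extraction produced the expected number of images. The following is a small sketch of such a check, not part of the original procedure. The helper `list_images` is a hypothetical name. It assumes the images sit directly in `photos/Flicker8k_Dataset` as in the glob pattern above, and the expected count of 8091 is taken from the number of training documents reported for this model. Unlike the plain `glob` call, it sorts the paths so that the order is reproducible across runs.

```python
from pathlib import Path


def list_images(folder: str, pattern: str = "*.jpg") -> list[str]:
    """Return the sorted paths of all files in `folder` that match `pattern`."""
    return sorted(str(path) for path in Path(folder).glob(pattern))


# Folder name as used in the glob pattern of the retrieval code
found = list_images("photos/Flicker8k_Dataset")
print(f"Found {len(found)} images")

# 8091 is the number of training documents reported for this model
if len(found) != 8091:
    print("Warning: expected 8091 images; the download or extraction may be incomplete")
```
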

Then, to perform topic modeling on multimodal data with BERTopic:

```python
from bertopic import BERTopic
from bertopic.backend import MultiModalBackend
from bertopic.representation import VisualRepresentation, KeyBERTInspired

# Image embedding model
embedding_model = MultiModalBackend('clip-ViT-B-32', batch_size=32)

# Image to text representation model
representation_model = {
    "Visual_Aspect": VisualRepresentation(image_to_text_model="nlpconnect/vit-gpt2-image-captioning", image_squares=True),
    "KeyBERT": KeyBERTInspired()
}

# Train our model with images only
topic_model = BERTopic(representation_model=representation_model, verbose=True, embedding_model=embedding_model, min_topic_size=30)
topics, probs = topic_model.fit_transform(documents=None, images=images)
```

The code above uses only images as input. The images are clustered, and a small subset of representative images is extracted from each cluster. These representative images are captioned with `"nlpconnect/vit-gpt2-image-captioning"`, which produces a small textual dataset on which c-TF-IDF and the additional `KeyBERTInspired` representation model are run.
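
To make the keyword step more concrete, here is a simplified, self-contained sketch of class-based TF-IDF over captions grouped by topic. It follows the weighting described in the BERTopic documentation: term frequency within a topic multiplied by `log(1 + A / f_t)`, where `A` is the average number of words per topic and `f_t` is the frequency of the term across all topics. It is not the library's implementation. Tokenization is a plain whitespace split, no normalization is applied, the function name `c_tf_idf` is chosen for illustration, and the captions are invented rather than taken from the model.

```python
import math
from collections import Counter


def c_tf_idf(captions_per_topic: dict[int, list[str]], top_n: int = 3) -> dict[int, list[str]]:
    """Rank the words of each topic with a simplified class-based TF-IDF."""
    # Treat all captions of a topic as one document and count its words
    counts = {
        topic: Counter(" ".join(captions).lower().split())
        for topic, captions in captions_per_topic.items()
    }

    # Frequency of every word across all topics
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)

    # Average number of words per topic
    avg_words = sum(total.values()) / len(counts)

    keywords = {}
    for topic, topic_counts in counts.items():
        scores = {
            word: tf * math.log(1 + avg_words / total[word])
            for word, tf in topic_counts.items()
        }
        # Highest score first; ties are broken alphabetically
        ranked = sorted(scores, key=lambda word: (-scores[word], word))
        keywords[topic] = ranked[:top_n]
    return keywords


# Hypothetical captions, invented for this example
captions = {
    0: ["a dog running through the grass", "a dog running in a field"],
    1: ["a man riding a wave on a surfboard", "a man on a surfboard in the water"],
}
print(c_tf_idf(captions))
```

Function words are not filtered in this sketch, so a frequent word such as "a" can still score highly next to the topic-specific words.
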

## Training hyperparameters

* calculate_probabilities: False
* language: None
* low_memory: False
* min_topic_size: 30
* n_gram_range: (1, 1)
* nr_topics: None
* seed_topic_list: None
* top_n_words: 10
* verbose: True

## Framework versions

* Numpy: 1.23.5
* HDBSCAN: 0.8.29
* UMAP: 0.5.3
* Pandas: 1.5.3
* Scikit-Learn: 1.2.2
* Sentence-transformers: 2.2.2
* Transformers: 4.29.2
* Numba: 0.56.4
* Plotly: 5.14.1
* Python: 3.10.10