	---
	tags:
	- bertopic
	library_name: bertopic
	---

	# BERTopic_Multimodal

	This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
	BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

	This model was trained on the 8,091 images of the Flickr 8k dataset, without their captions. It demonstrates how BERTopic can perform topic modeling with images as the only input.

	A few examples of generated topics:

	!["multimodal.png"](multimodal.png)

	## Usage

	To use this model, please install BERTopic:

	```
	pip install -U "bertopic[vision]"
	pip install -U safetensors
	```

	You can use the model as follows:

	```python
	from bertopic import BERTopic
	topic_model = BERTopic.load("MaartenGr/BERTopic_Multimodal")

	topic_model.get_topic_info()
	```

	You can view all information about a topic as follows:

	```python
	topic_id = 0  # any topic ID listed in the topic overview below
	topic_model.get_topic(topic_id, full=True)
	```

	## Topic overview

	* Number of topics: 29 (including the outlier topic `-1`)
	* Number of training documents: 8091

	<details>
	<summary>Click here for an overview of all topics.</summary>

	| Topic ID | Topic Keywords | Topic Frequency | Label |
	|----------|----------------|-----------------|-------|
	| -1 | while - air - the - in - jumping | 34 | -1_while_air_the_in |
	| 0 | bench - sitting - people - woman - street | 1132 | 0_bench_sitting_people_woman |
	| 1 | grass - running - dog - grassy - field | 1693 | 1_grass_running_dog_grassy |
	| 2 | boy - girl - little - young - holding | 1290 | 2_boy_girl_little_young |
	| 3 | dog - frisbee - running - water - mouth | 1224 | 3_dog_frisbee_running_water |
	| 4 | skateboard - ramp - doing - trick - cement | 415 | 4_skateboard_ramp_doing_trick |
	| 5 | snow - dog - covered - running - through | 309 | 5_snow_dog_covered_running |
	| 6 | mountain - range - slope - standing - person | 205 | 6_mountain_range_slope_standing |
	| 7 | pool - blue - boy - toy - water | 189 | 7_pool_blue_boy_toy |
	| 8 | trail - bike - down - riding - person | 166 | 8_trail_bike_down_riding |
	| 9 | snowboarder - mid - jump - air - after | 126 | 9_snowboarder_mid_jump_air |
	| 10 | rock - climbing - up - wall - tree | 124 | 10_rock_climbing_up_wall |
	| 11 | wave - surfboard - top - riding - of | 112 | 11_wave_surfboard_top_riding |
	| 12 | beach - surfboard - people - with - walking | 102 | 12_beach_surfboard_people_with |
	| 13 | jumping - track - horse - racquet - dog | 98 | 13_jumping_track_horse_racquet |
	| 14 | snowboard - snow - girl - hill - slope | 95 | 14_snowboard_snow_girl_hill |
	| 15 | game - being - football - played - professional | 91 | 15_game_being_football_played |
	| 16 | soccer - kicking - team - ball - player | 80 | 16_soccer_kicking_team_ball |
	| 17 | dirt - bike - person - rider - going | 75 | 17_dirt_bike_person_rider |
	| 18 | soccer - boys - field - ball - kicking | 69 | 18_soccer_boys_field_ball |
	| 19 | baseball - player - bat - swinging - into | 63 | 19_baseball_player_bat_swinging |
	| 20 | basketball - up - and - playing - jumping | 59 | 20_basketball_up_and_playing |
	| 21 | bird - body - flying - over - long | 55 | 21_bird_body_flying_over |
	| 22 | motorcycle - track - race - racer - racing | 55 | 22_motorcycle_track_race_racer |
	| 23 | boat - sitting - water - lake - hose | 53 | 23_boat_sitting_water_lake |
	| 24 | street - riding - down - bike - woman | 52 | 24_street_riding_down_bike |
	| 25 | paddle - suit - paddling - water - in | 49 | 25_paddle_suit_paddling_water |
	| 26 | pair - scissors - stage - white - shirt | 42 | 26_pair_scissors_stage_white |
	| 27 | tennis - court - racket - racquet - swinging | 34 | 27_tennis_court_racket_racquet |

	</details>
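
As a quick sanity check on the table above, the topic frequencies can be summed and compared with the reported number of training documents. The following is a minimal sketch with the frequencies copied by hand from the table; the variable names are illustrative and not part of BERTopic.

```python
# Topic frequencies copied from the overview table, keyed by topic ID.
# Topic -1 is the outlier topic.
topic_frequency = {
    -1: 34, 0: 1132, 1: 1693, 2: 1290, 3: 1224, 4: 415, 5: 309,
    6: 205, 7: 189, 8: 166, 9: 126, 10: 124, 11: 112, 12: 102,
    13: 98, 14: 95, 15: 91, 16: 80, 17: 75, 18: 69, 19: 63,
    20: 59, 21: 55, 22: 55, 23: 53, 24: 52, 25: 49, 26: 42, 27: 34,
}

total = sum(topic_frequency.values())

# Share of images assigned to the outlier topic.
outlier_share = topic_frequency[-1] / total

# Share of images covered by the four largest topics.
largest_four = sorted(topic_frequency.values(), reverse=True)[:4]
top_four_share = sum(largest_four) / total

print(f"topics: {len(topic_frequency)}")
print(f"documents: {total}")
print(f"outlier share: {outlier_share:.2%}")
print(f"share of four largest topics: {top_four_share:.2%}")
```

The frequencies sum to 8091, matching the number of training documents, and the four largest topics (0 to 3) account for roughly two thirds of the images.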

	## Training Procedure

	The data was retrieved as follows:

	```python
	import os
	import glob
	import zipfile
	import numpy as np
	import pandas as pd
	from tqdm import tqdm
	from sentence_transformers import util

	# Flickr 8k images
	img_folder = 'photos/'
	caps_folder = 'captions/'

	# Download and extract only if the image folder is missing or empty
	if not os.path.exists(img_folder) or len(os.listdir(img_folder)) == 0:
	    os.makedirs(img_folder, exist_ok=True)

	    if not os.path.exists('Flickr8k_Dataset.zip'):  # Download the dataset if it does not exist
	        util.http_get('https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip', 'Flickr8k_Dataset.zip')
	        util.http_get('https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip', 'Flickr8k_text.zip')

	    # Extract the images into photos/ and the captions into captions/
	    for folder, file in [(img_folder, 'Flickr8k_Dataset.zip'), (caps_folder, 'Flickr8k_text.zip')]:
	        with zipfile.ZipFile(file, 'r') as zf:
	            for member in tqdm(zf.infolist(), desc='Extracting'):
	                zf.extract(member, folder)

	# Collect the paths of all extracted images
	images = list(glob.glob('photos/Flicker8k_Dataset/*.jpg'))
	```
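
Before training, it can help to confirm that the extraction produced the expected number of images. The sketch below uses a hypothetical helper, `count_images`, which is not part of BERTopic. The folder path is the one used in the snippet above, and the expected count of 8091 comes from the number of training documents reported in this card.

```python
import glob
import os


def count_images(folder, pattern="*.jpg"):
    """Return the number of files in `folder` that match `pattern`."""
    return len(glob.glob(os.path.join(folder, pattern)))


if __name__ == "__main__":
    # Folder name taken from the glob pattern in the retrieval snippet.
    n_images = count_images("photos/Flicker8k_Dataset")
    print(f"Found {n_images} images")
```

A count well below 8091 would suggest an incomplete download or extraction.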

	Then, to perform topic modeling on multimodal data with BERTopic:

	```python
	from bertopic import BERTopic
	from bertopic.backend import MultiModalBackend
	from bertopic.representation import VisualRepresentation, KeyBERTInspired

	# Image embedding model
	embedding_model = MultiModalBackend('clip-ViT-B-32', batch_size=32)

	# Image to text representation model
	representation_model = {
	"Visual_Aspect": VisualRepresentation(image_to_text_model="nlpconnect/vit-gpt2-image-captioning", image_squares=True),
	"KeyBERT": KeyBERTInspired()
	}

	# Train our model with images only
	topic_model = BERTopic(representation_model=representation_model, verbose=True, embedding_model=embedding_model, min_topic_size=30)
	topics, probs = topic_model.fit_transform(documents=None, images=images)
	```

	As the code above shows, images are the only input. BERTopic clusters the images and extracts a small subset of representative images from each cluster. These representative images are captioned with `"nlpconnect/vit-gpt2-image-captioning"`, which yields a small textual dataset over which c-TF-IDF and the additional `KeyBERTInspired` representation model are run.

	## Training hyperparameters

	* calculate_probabilities: False
	* language: None
	* low_memory: False
	* min_topic_size: 30
	* n_gram_range: (1, 1)
	* nr_topics: None
	* seed_topic_list: None
	* top_n_words: 10
	* verbose: True

	## Framework versions

	* Numpy: 1.23.5
	* HDBSCAN: 0.8.29
	* UMAP: 0.5.3
	* Pandas: 1.5.3
	* Scikit-Learn: 1.2.2
	* Sentence-transformers: 2.2.2
	* Transformers: 4.29.2
	* Numba: 0.56.4
	* Plotly: 5.14.1
	* Python: 3.10.10