---
language: en
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
tags:
- transformers
- llama
- multimodal
- vision-language
- newsmedia
- disinformation-detection
- text-classification
- image-classification
datasets:
- vector-institute/newsmediabias-plus
---

# LLaMA 3.2 Multimodal News Media Bias Detector

**Developed by the Vector Institute**

## Model Overview

The LLaMA 3.2 Multimodal News Media Bias Detector is a fine-tuned version of the LLaMA 3.2 Vision-Instruct model, designed to assess potential disinformation in news media articles by analyzing both text and image content. The model identifies the presence of rhetorical techniques commonly associated with disinformation and classifies articles as 'Likely' or 'Unlikely' to be disinformation.

## Model Details

- **Base Model**: [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
- **Architecture**: Multimodal (text and image)
- **Model Size**: 11B parameters
- **Quantization**: 4-bit quantization (`nf4`) using `bitsandbytes` for efficient inference
- **Framework**: [Hugging Face Transformers](https://github.com/huggingface/transformers)
- **Fine-tuning**: Fine-tuned with the [PEFT](https://github.com/huggingface/peft) library using LoRA adapters

## Intended Use

### Primary Use Cases

- **Disinformation Detection**: Analyze news articles and accompanying images to detect potential disinformation.
- **Media Analysis**: Assist researchers and analysts in studying bias and rhetorical techniques in media content.
- **Educational Tool**: Serve as a resource for learning about multimodal models and disinformation detection.

### Limitations

- **Not a Fact-Checker**: The model does not verify factual accuracy; it assesses rhetorical techniques.
- **Bias in Training Data**: The model's judgments may reflect biases present in the training data.
- **Language and Domain Limitations**: Primarily trained on English-language news articles; performance may degrade on other languages or domains.

## How to Use

### Installation

Ensure you have the required libraries installed (`pandas` and `pillow` are needed for the data-loading example below):

```bash
pip install transformers accelerate bitsandbytes pandas pillow
```

### Loading the Model

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

merged_model_save_path = "vector-institute/Llama3.2-Multimodal-Newsmedia-Bias-Detector"

# Load the quantization config used during saving
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the merged model
model = AutoModelForVision2Seq.from_pretrained(
    merged_model_save_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)
model.eval()

# Load the processor
processor = AutoProcessor.from_pretrained(merged_model_save_path)
```

### Preparing the Data

```python
import os

import pandas as pd
from PIL import Image

# Path to your dataset CSV and image folder
dataset_csv_path = 'sample_dataset.csv'
image_folder_path = 'sampled_images'

# Load the DataFrame
df = pd.read_csv(dataset_csv_path)

# Function to prepare samples from the DataFrame
def prepare_samples_from_dataframe(df, image_folder_path):
    samples = []
    for index, row in df.iterrows():
        unique_id = row['unique_id']  # Replace with the column that contains the image identifier
        text_content = row['first_paragraph']  # Replace with your text column name

        # Find the image file
        possible_extensions = ['jpg', 'jpeg', 'png']
        image_path = None
        for ext in possible_extensions:
            img_path = os.path.join(image_folder_path, f"{unique_id}.{ext}")
            if os.path.exists(img_path):
                image_path = img_path
                break

        if image_path is None:
            print(f"No image found for ID {unique_id}")
            continue

        # Load the image and downscale it
        image = Image.open(image_path).convert("RGB")
        max_size = (224, 224)
        image.thumbnail(max_size, Image.Resampling.LANCZOS)

        # Prepare the prompt text
        sample_text = (
            "Assess the text and image below for potential disinformation (try finding deliberately misleading or biased information) by identifying the presence of rhetorical techniques listed.\n"
            "If you find any of the listed rhetorical techniques, then the article is likely disinformation; if not, it is likely not disinformation.\n\n"
            "Rhetorical Techniques Checklist:\n"
            "- Emotional Appeal: Uses language that intentionally invokes extreme emotions like fear or anger, aiming to distract from lack of factual backing.\n"
            "- Exaggeration and Hyperbole: Makes claims that are unsupported by evidence, or presents normal situations as extraordinary to manipulate perceptions.\n"
            "- Bias and Subjectivity: Presents information in a way that unreasonably favors one perspective, omitting key facts that might provide balance.\n"
            "- Repetition: Uses repeated messaging of specific points or misleading statements to embed a biased viewpoint in the reader's mind.\n"
            "- Specific Word Choices: Employs emotionally charged or misleading terms to sway opinions subtly, often in a manipulative manner.\n"
            "- Appeals to Authority: References authorities who lack relevant expertise or cites sources that do not have the credentials to be considered authoritative in the context.\n"
            "- Lack of Verifiable Sources: Relies on sources that either cannot be verified or do not exist, suggesting a fabrication of information.\n"
            "- Logical Fallacies: Engages in flawed reasoning such as circular reasoning, strawman arguments, or ad hominem attacks that undermine logical debate.\n"
            "- Conspiracy Theories: Propagates theories that lack proof and often contain elements of paranoia or implausible scenarios as facts.\n"
            "- Inconsistencies and Factual Errors: Contains multiple contradictions or factual inaccuracies that are easily disprovable, indicating a lack of concern for truth.\n"
            "- Selective Omission: Deliberately leaves out crucial information that is essential for a fair understanding of the topic, skewing perception.\n"
            "- Manipulative Framing: Frames issues in a way that leaves out alternative perspectives or possible explanations, focusing only on aspects that support a biased narrative.\n\n"
            f"{text_content}\n\n"
            "Please **only** provide your answer in the format: 'Classification: Likely' or 'Classification: Unlikely'. Do not include any additional text or explanation."
        )

        sample = {
            "unique_id": unique_id,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": sample_text},
                        {"type": "image", "image": image},
                    ],
                }
            ],
        }

        samples.append(sample)

    return samples
```

### Running Inference

```python
import re

def extract_assistant_reply(generated_text):
    # The decoded output contains the full chat transcript;
    # return everything after the 'assistant' turn marker
    lines = generated_text.strip().split('\n')
    for idx, line in enumerate(lines):
        if line.strip().lower() == 'assistant':
            return '\n'.join(lines[idx + 1:]).strip()
    return lines[-1].strip()

def extract_classification(assistant_reply):
    match = re.search(r'Classification:\s*(Likely|Unlikely)', assistant_reply, re.IGNORECASE)
    if match:
        return match.group(1).capitalize()
    # Fall back to finding 'Likely' or 'Unlikely' anywhere in the reply
    match = re.search(r'\b(Likely|Unlikely)\b', assistant_reply, re.IGNORECASE)
    if match:
        return match.group(1).capitalize()
    return 'Unknown'

def generate_prediction(sample):
    # Prepare the input text and image
    texts = processor.apply_chat_template(sample["messages"], tokenize=False)
    image_input = sample["messages"][0]["content"][1]["image"]

    inputs = processor(text=texts, images=image_input, return_tensors="pt", padding=True)
    # With device_map="auto", move the inputs to the device holding the model's embeddings
    inputs = inputs.to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=20,
            do_sample=False,
            num_beams=1,
        )

    generated_texts = processor.batch_decode(outputs, skip_special_tokens=True)
    return extract_assistant_reply(generated_texts[0])

# Prepare the samples from the DataFrame
samples = prepare_samples_from_dataframe(df, image_folder_path)

# Limit to a subset if desired
samples_to_infer = samples[:5]  # For example, take the first 5 samples

# Run inference and collect results
results = []

for sample in samples_to_infer:
    assistant_reply = generate_prediction(sample)
    predicted_label = extract_classification(assistant_reply)

    results.append({
        'unique_id': sample['unique_id'],
        'assistant_reply': assistant_reply,
        'predicted_label': predicted_label,
        # Add any other fields you need from the sample or DataFrame
    })

    # Display the results
    print(f"Sample ID: {sample['unique_id']}")
    print("Assistant's Reply:")
    print(assistant_reply)
    print(f"Predicted Label: {predicted_label}")
    print("-" * 50)

results_df = pd.DataFrame(results)
# Save to CSV if desired
results_df.to_csv('inference_results.csv', index=False)
```

### Example Output

```
Assistant's Reply:
Classification: Likely
```

## Training Data

The model was fine-tuned on a custom dataset of news articles and images labeled for potential disinformation based on the presence of specific rhetorical techniques. The dataset includes balanced samples of articles classified as 'Likely' or 'Unlikely' to be disinformation.
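
The card's metadata points to the [NewsMediaBias-Plus](https://huggingface.co/datasets/vector-institute/newsmediabias-plus) dataset. Below is a minimal sketch of pulling it from the Hub, assuming it loads with the standard `datasets` API; the split name and field layout are assumptions, so check the dataset card for the exact schema:

```python
# Minimal sketch: load the NewsMediaBias-Plus dataset from the Hugging Face Hub.
# The "train" split and the inspected fields are assumptions -- consult the
# dataset card for the actual splits and column names.
from datasets import load_dataset

ds = load_dataset("vector-institute/newsmediabias-plus", split="train")
print(ds.column_names)  # inspect the available fields
print(ds[0])            # look at one record
```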

## Training Procedure

- **Fine-Tuning**: The model was fine-tuned using the PEFT library with LoRA adapters to efficiently adapt the base model to the disinformation detection task (see the sketch after this list).
- **Quantization**: 4-bit quantization was applied using `bitsandbytes` to optimize inference performance.
- **Hyperparameters**: Standard fine-tuning hyperparameters were used, adjusted for performance on the custom dataset.
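
A minimal sketch of the LoRA setup described above, using the `peft` API. The rank, scaling factor, dropout, and target modules shown here are hypothetical placeholders, since the exact adapter configuration is not published:

```python
# Hypothetical LoRA configuration -- rank, alpha, dropout, and target modules
# are illustrative placeholders, not the published training settings.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,                    # adapter dropout (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)

# `model` is the 4-bit base model loaded as in "Loading the Model" above
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only the adapter weights are trainable
```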

## Evaluation

The model was evaluated on a balanced test set with the following metrics:

- **Accuracy**: 88%
- **Precision**: 78%
- **Recall**: 75%

*Note: These figures are preliminary and may be updated as evaluation of the model continues.*
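
To run the same kind of evaluation on your own labeled split, a minimal sketch using scikit-learn is shown below. The `gold_label` column is a hypothetical name for your ground-truth annotations; rename it to match your data:

```python
# Minimal sketch: score saved predictions against gold labels.
# Requires scikit-learn (pip install scikit-learn). 'gold_label' is a
# hypothetical column of ground-truth 'Likely'/'Unlikely' annotations.
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

preds = pd.read_csv('inference_results.csv')
gold = pd.read_csv('sample_dataset.csv')[['unique_id', 'gold_label']]
scored = preds.merge(gold, on='unique_id')

# Drop rows where no classification could be parsed from the reply
scored = scored[scored['predicted_label'] != 'Unknown']

y_true = scored['gold_label'].str.capitalize()
y_pred = scored['predicted_label']

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, pos_label='Likely'))
print("Recall:   ", recall_score(y_true, y_pred, pos_label='Likely'))
```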
282
+ ## Limitations and Biases
283
+
284
+ - **Biases in Data**: The model's performance and judgments are influenced by the biases present in the training data.
285
+ - **False Positives/Negatives**: The model may incorrectly classify some articles due to subtle rhetorical techniques or lack thereof.
286
+ - **Ethical Considerations**: Users should be cautious when interpreting the model's outputs and consider them as part of a broader analysis.
287
+
288
+ ## Ethical Considerations
289
+
290
+ - **User Responsibility**: The model should not be used as the sole basis for making critical decisions.
291
+ - **Data Privacy**: Ensure compliance with data privacy laws when using the model with user data.
292
+ - **Misuse Prevention**: The model should not be used to unjustly label or discriminate against individuals or groups.

## License

This model is licensed under the Apache 2.0 License.

---

**Disclaimer**: The model is provided "as is" without any warranty. The developers are not responsible for any consequences arising from the use of this model.