The API returns a tuple with two elements:

- **Extracted Text (`str`)**: The text extracted from the image.
- **Spam Classification (`str`)**: The classification result ("Spam" or "Not Spam").
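
For illustration, consuming that tuple might look like the sketch below; `process_image` is a hypothetical name standing in for the project's actual entry point:

```python
# Hypothetical usage of the API described above: `process_image` is a
# placeholder name (not the project's confirmed function) assumed to return
# the (extracted_text, spam_classification) tuple.
extracted_text, spam_classification = process_image("screenshot.png")
print(extracted_text)       # the text recovered from the image
print(spam_classification)  # "Spam" or "Not Spam"
```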

---

# Chosen LLM and Justification

We have chosen **DistilBERT** as the foundation LLM for text classification because of its efficiency, lightweight architecture, and strong performance on natural language processing (NLP) tasks. DistilBERT is a distilled version of BERT that retains roughly 97% of BERT's language-understanding performance while being about 60% faster and requiring significantly fewer computational resources. This makes it well suited to classifying extracted text as spam or not spam in real-time OCR applications.

---

## Steps for Fine-Tuning or Prompt Engineering

### Data Preparation:
- Gather a dataset of spam and non-spam text samples.
- Preprocess the text (cleaning, tokenization, and padding).
- Split the data into training and validation sets (a sketch of these steps follows this list).
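
A minimal sketch of these preparation steps, assuming a small in-memory dataset (the sample texts and variable names are illustrative):

```python
# Illustrative data preparation: clean, split, then tokenize with padding.
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizerFast

texts = [
    "WIN A FREE PRIZE NOW!!!",
    "Meeting moved to 3pm tomorrow.",
    "Claim your reward: click here",
    "Lunch at noon?",
]  # placeholder samples
labels = [1, 0, 1, 0]  # 1 = Spam, 0 = Not Spam

# Cleaning: normalize case and collapse whitespace.
texts = [" ".join(t.lower().split()) for t in texts]

# Split into training and validation sets.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

# Tokenization and padding via the DistilBERT tokenizer.
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
train_encodings = tokenizer(train_texts, padding=True, truncation=True)
val_encodings = tokenizer(val_texts, padding=True, truncation=True)
```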

### Fine-Tuning DistilBERT:
1. Load the pre-trained DistilBERT model.
2. Apply transfer learning by training the model on the spam dataset.
3. Use a classification head (a fully connected layer) on top of DistilBERT for binary classification.
4. Train with cross-entropy loss and optimize with AdamW.
5. Evaluate performance using precision, recall, and F1-score; a condensed sketch of these steps follows.
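
A condensed, hedged sketch of steps 1-5 (the toy batch and hyperparameters are illustrative, not the project's actual training setup):

```python
# Minimal fine-tuning loop: DistilBERT plus its built-in sequence-
# classification head, trained with cross-entropy via AdamW.
import torch
from torch.optim import AdamW
from sklearn.metrics import precision_recall_fscore_support
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

train_texts = ["WIN A FREE PRIZE NOW!!!", "Meeting moved to 3pm tomorrow."]  # placeholder
train_labels = torch.tensor([1, 0])  # 1 = Spam, 0 = Not Spam

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
# num_labels=2 attaches a fully connected binary-classification head (step 3).
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

batch = tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):  # step 2: transfer learning on the spam dataset
    optimizer.zero_grad()
    # Passing labels makes the model return cross-entropy loss (step 4).
    loss = model(**batch, labels=train_labels).loss
    loss.backward()
    optimizer.step()

# Step 5: evaluate with precision, recall, and F1 (shown on the training batch
# for brevity; in practice, score the held-out validation set).
model.eval()
with torch.no_grad():
    preds = model(**batch).logits.argmax(dim=-1)
precision, recall, f1, _ = precision_recall_fscore_support(
    train_labels, preds, average="binary", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```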

### Prompt Engineering (Alternative Approach):
- If fine-tuning is not preferred, use predefined prompts with a larger LLM (e.g., GPT) to classify text dynamically.
- Example prompt: see the template below.
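
One plausible template (the exact wording is an assumption; `{extracted_text}` marks where the OCR output would be substituted):

```text
Classify the following message as "Spam" or "Not Spam".
Respond with only the label.

Message: {extracted_text}
```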

---

## Integration with OCR Output

- Once text is extracted using OCR (PaddleOCR, EasyOCR, or KerasOCR), it is passed to the DistilBERT model for classification.
- The classification result is appended to the OCR output and stored in `ocr_results.json` and `ocr_results.csv`.
- The system updates the UI in real time via **Gradio** to display the extracted text along with its classification label (a storage sketch follows this list).
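
Illustrative glue code for this flow, assuming a fine-tuned checkpoint saved locally (`./distilbert-spam` is a hypothetical path; the output file names come from the list above):

```python
# Classify OCR-extracted text and append the result to the shared result files.
import csv
import json
from pathlib import Path

from transformers import pipeline

# Hypothetical checkpoint directory; substitute the real fine-tuned model.
classifier = pipeline("text-classification", model="./distilbert-spam")

def classify_and_store(extracted_text: str) -> str:
    label = classifier(extracted_text)[0]["label"]  # e.g. "Spam" / "Not Spam"
    record = {"text": extracted_text, "classification": label}

    # Append to ocr_results.json.
    json_path = Path("ocr_results.json")
    results = json.loads(json_path.read_text()) if json_path.exists() else []
    results.append(record)
    json_path.write_text(json.dumps(results, indent=2))

    # Append to ocr_results.csv, writing the header on first use.
    csv_path = Path("ocr_results.csv")
    write_header = not csv_path.exists()
    with csv_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "classification"])
        if write_header:
            writer.writeheader()
        writer.writerow(record)
    return label
```

A function like this could also serve as the Gradio callback so the label is displayed next to the extracted text in the UI.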

---

## Security and Evaluation Strategies

### Security Measures:
- Sanitize input data to prevent injection attacks.
- Implement rate limiting to prevent abuse of the API.
- Store results securely, ensuring sensitive data is not exposed.

### Evaluation Strategies:
- Perform cross-validation to assess model robustness (see the sketch after this list).
- Continuously monitor classification accuracy on new incoming data.
- Implement feedback mechanisms for users to report misclassifications and improve the model.
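
A small sketch of the cross-validation idea; a lightweight TF-IDF baseline stands in here, since k-fold on DistilBERT itself means re-running fine-tuning once per fold (the dataset and scoring choice are illustrative):

```python
# 5-fold cross-validation on a stand-in baseline classifier; for DistilBERT,
# each fold would instead get a full fine-tuning run.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

texts = ["WIN A FREE PRIZE NOW!!!", "Meeting moved to 3pm tomorrow."] * 10  # placeholder
labels = np.array([1, 0] * 10)

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(
    baseline, texts, labels, cv=StratifiedKFold(n_splits=5), scoring="f1"
)
print(f"F1 per fold: {scores.round(3)}, mean: {scores.mean():.3f}")
```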

---

This integration of OCR and an LLM provides an efficient, scalable, and accurate system for spam classification of text extracted from images.