The API returns a tuple with two elements:
- **Extracted Text (`str`)**: The text extracted from the image.
- **Spam Classification (`str`)**: The classification result ("Spam" or "Not Spam"); a usage sketch follows this list.
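As a minimal illustration of consuming this return value (the `process_image` entry point is a hypothetical stand-in, not the actual API name):

```python
# Hypothetical entry point; the real function name in this Space may differ.
extracted_text, spam_label = process_image("example_receipt.png")

print(extracted_text)  # the OCR'd text from the image
print(spam_label)      # either "Spam" or "Not Spam"
```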
---
# Chosen LLM and Justification
We have chosen **DistilBERT** as the underlying language model for text classification due to its efficiency, lightweight architecture, and strong performance on natural language processing (NLP) tasks. DistilBERT is a distilled version of BERT that retains 97% of BERT's performance while being 60% faster and requiring significantly fewer computational resources. This makes it well suited to classifying extracted text as spam or not spam in real-time OCR applications.
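As a minimal sketch using the Hugging Face `transformers` library (the `distilbert-base-uncased` checkpoint name is an assumption), the model can be loaded with a two-label classification head:

```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

# Pre-trained DistilBERT with a randomly initialized two-label head
# (0 = Not Spam, 1 = Spam); the head is trained during fine-tuning.
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
```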
---
## Steps for Fine-Tuning or Prompt Engineering
### Data Preparation:
- Gather a dataset of spam and non-spam text samples.
- Preprocess the text (cleaning, tokenization, and padding).
- Split the data into training and validation sets (a minimal preparation sketch follows this list).
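A minimal sketch of these steps, assuming `texts` and `labels` lists already exist and reusing the tokenizer loaded above (scikit-learn's `train_test_split` handles the split):

```python
from sklearn.model_selection import train_test_split

# texts: list[str], labels: list[int] with 1 = spam, 0 = not spam (assumed inputs).
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Tokenize with truncation and padding so every sample has the same length.
train_enc = tokenizer(train_texts, truncation=True, padding=True, max_length=128)
val_enc = tokenizer(val_texts, truncation=True, padding=True, max_length=128)
```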
### Fine-Tuning DistilBERT:
1. Load the pre-trained DistilBERT model.
2. Apply transfer learning by training the model on the spam dataset.
3. Use a classification head (fully connected layer) on top of DistilBERT for binary classification.
4. Implement cross-entropy loss and optimize with AdamW.
5. Evaluate performance using precision, recall, and F1-score. (A condensed sketch of these steps follows.)
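A condensed sketch of steps 1-5 using the `transformers` `Trainer`, continuing from the tokenized data above. Hyperparameters are illustrative; the `Trainer` optimizes with AdamW by default, and the sequence-classification model computes cross-entropy loss internally when labels are supplied:

```python
import torch
from torch.utils.data import Dataset
from transformers import Trainer, TrainingArguments

class SpamDataset(Dataset):
    """Pairs the tokenized encodings with their labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

args = TrainingArguments(
    output_dir="distilbert-spam",      # illustrative output path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)
trainer = Trainer(
    model=model,                       # the DistilBERT model loaded earlier
    args=args,
    train_dataset=SpamDataset(train_enc, train_labels),
    eval_dataset=SpamDataset(val_enc, val_labels),
)
trainer.train()
```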
### Prompt Engineering (Alternative Approach):
- If fine-tuning is not preferred, use predefined prompts with a larger LLM (e.g., GPT) to classify text dynamically.
- Example prompt (the wording below is illustrative; adapt it to the chosen LLM):
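  ```
  Classify the following text as "Spam" or "Not Spam".
  Respond with the label only.

  Text: "{extracted_text}"
  Label:
  ```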
---
## Integration with OCR Output
- Once text is extracted using OCR (PaddleOCR, EasyOCR, or KerasOCR), it is passed to the DistilBERT model for classification.
- The classification result is appended to the OCR output and stored in `ocr_results.json` and `ocr_results.csv`.
- The system updates the UI in real time via **Gradio** to display the extracted text alongside its classification label (a pipeline sketch follows this list).
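A minimal sketch of the hand-off from OCR output to the classifier and on to storage, reusing the `tokenizer` and `model` objects from above. The storage layout (one JSON record per line, one CSV row per result) is an assumption, not the confirmed file format:

```python
import csv
import json

import torch

def classify_text(text: str) -> str:
    """Run the fine-tuned DistilBERT model on one piece of OCR output."""
    inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return "Spam" if logits.argmax(dim=-1).item() == 1 else "Not Spam"

def store_result(image_name: str, text: str, label: str) -> None:
    """Append one classified record to ocr_results.json and ocr_results.csv."""
    record = {"image": image_name, "text": text, "classification": label}
    with open("ocr_results.json", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    with open("ocr_results.csv", "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([image_name, text, label])
```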
---
## Security and Evaluation Strategies
### Security Measures:
- Sanitize input data to prevent injection attacks.
- Implement rate limiting to prevent abuse of the API.
- Store results securely, ensuring sensitive data is not exposed. (A sketch of the first two measures follows this list.)
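One possible sketch of input sanitization and a per-client sliding-window rate limit; the thresholds are illustrative, not values taken from this project:

```python
import time
from collections import defaultdict, deque

MAX_TEXT_LEN = 5_000            # illustrative cap on input size
WINDOW_S, MAX_CALLS = 60, 30    # at most 30 requests per minute per client

_recent_calls = defaultdict(deque)

def sanitize(text: str) -> str:
    """Drop non-printable characters and cap length before text reaches the model."""
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    return cleaned[:MAX_TEXT_LEN]

def allow_request(client_id: str) -> bool:
    """Sliding-window rate limiter: reject calls beyond MAX_CALLS per WINDOW_S."""
    now = time.time()
    calls = _recent_calls[client_id]
    while calls and now - calls[0] > WINDOW_S:
        calls.popleft()
    if len(calls) >= MAX_CALLS:
        return False
    calls.append(now)
    return True
```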
### Evaluation Strategies:
- Perform cross-validation to assess model robustness.
- Continuously monitor classification accuracy on new incoming data.
- Implement feedback mechanisms for users to report misclassifications and improve the model (a metrics sketch follows this list).
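For the monitoring side, a minimal sketch of computing the precision, recall, and F1 metrics on a validation split, assuming `preds` holds the model's predicted labels:

```python
from sklearn.metrics import precision_recall_fscore_support

# val_labels: true labels, preds: predicted labels (assumed available).
precision, recall, f1, _ = precision_recall_fscore_support(
    val_labels, preds, average="binary"
)
print(f"precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
```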
---
This integration of OCR and an LLM provides an efficient, scalable, and accurate system for classifying spam in text extracted from images.