Update README.md
README.md CHANGED
@@ -113,9 +113,9 @@ The API returns a tuple with two elements:
 
 # Chosen LLM and Justification
 
-We have chosen **DistilBERT** as the foundational LLM for text classification du…
+I have chosen **DistilBERT** as the foundational LLM for text classification due to its efficiency, lightweight architecture, and high performance in natural language processing (NLP) tasks. DistilBERT is a distilled version of BERT that retains 97% of BERT’s performance while being 60% faster and requiring significantly fewer computational resources. This makes it ideal for classifying extracted text as spam or not spam in real-time OCR applications.
+[reference](https://arxiv.org/pdf/1910.01108)
 
----
 
 ## Steps for Fine-Tuning or Prompt Engineering
 
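For illustration (this is not part of the diff itself), loading DistilBERT for the binary spam / not-spam task with the Hugging Face `transformers` library could look like the sketch below; the checkpoint name and label mapping are assumptions, not values taken from this repository.

```python
# Illustrative only: load DistilBERT for binary (spam / not spam) sequence
# classification. Checkpoint name and label mapping are assumed, not from the repo.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "distilbert-base-uncased"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,                                # spam vs. not spam
    id2label={0: "not spam", 1: "spam"},
    label2id={"not spam": 0, "spam": 1},
)

# Classify a piece of OCR-extracted text (prediction is arbitrary until the
# classification head has been fine-tuned).
text = "Congratulations! You have won a free prize, click here."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
predicted_id = model(**inputs).logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])
```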
@@ -131,12 +131,6 @@ We have chosen **DistilBERT** as the foundational LLM for text classification du
 4. Implement cross-entropy loss and optimize with AdamW.
 5. Evaluate performance using precision, recall, and F1-score.
 
-### Prompt Engineering (Alternative Approach):
-- If fine-tuning is not preferred, use predefined prompts with a larger LLM (e.g., GPT) to classify text dynamically.
-- Example prompt:
-
-
----
 
 ## Integration with OCR Output
 
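Steps 4–5 above could translate into a minimal PyTorch loop along these lines; `train_loader`, `val_texts`, `val_labels`, and the learning rate are placeholders rather than values from this project.

```python
# Minimal sketch of fine-tuning steps 4-5: cross-entropy loss (computed by the
# model when `labels` are supplied), AdamW optimisation, then precision/recall/F1.
# `train_loader`, `val_texts` and `val_labels` are hypothetical placeholders for
# the project's own data pipeline; hyperparameters are assumed, not from the repo.
import torch
from sklearn.metrics import precision_recall_fscore_support
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in train_loader:            # batches of input_ids, attention_mask, labels
    outputs = model(**batch)          # cross-entropy loss is returned via `labels`
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    enc = tokenizer(val_texts, padding=True, truncation=True, return_tensors="pt")
    preds = model(**enc).logits.argmax(dim=-1).numpy()

precision, recall, f1, _ = precision_recall_fscore_support(val_labels, preds, average="binary")
print(f"precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
```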
@@ -144,7 +138,6 @@ We have chosen **DistilBERT** as the foundational LLM for text classification du
 - The classification result is appended to the OCR output and stored in `ocr_results.json` and `ocr_results.csv`.
 - The system updates the UI in real-time via **Gradio** to display extracted text along with the classification label.
 
----
 
 ## Security and Evaluation Strategies
 
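As a sketch of the first bullet above, appending a classification label to an OCR record and persisting it to `ocr_results.json` and `ocr_results.csv` might look like this; the record fields and the `classify_text` helper are illustrative assumptions, not the repository's actual schema.

```python
# Illustrative sketch: attach a classification label to an OCR record and
# persist it to ocr_results.json / ocr_results.csv. Field names and the
# classifier call are assumptions, not this repository's actual code.
import csv
import json
from pathlib import Path

def append_result(record: dict, json_path="ocr_results.json", csv_path="ocr_results.csv"):
    """Append one {image, text, label} record to both output files."""
    # JSON: keep a single list of records.
    path = Path(json_path)
    records = json.loads(path.read_text()) if path.exists() else []
    records.append(record)
    path.write_text(json.dumps(records, indent=2))

    # CSV: append a row, writing the header only once.
    new_file = not Path(csv_path).exists()
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(record.keys()))
        if new_file:
            writer.writeheader()
        writer.writerow(record)

# Example usage with a hypothetical classifier result.
text = "You have been selected for a limited-time offer!"
label = "spam"  # e.g. classify_text(text)
append_result({"image": "sample.png", "text": text, "label": label})
```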
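The real-time **Gradio** display from the second bullet could be wired up roughly as follows; `run_ocr` and `classify_text` are hypothetical stand-ins for the project's own OCR and classification functions.

```python
# Rough sketch of a Gradio UI showing the extracted text next to the
# spam / not-spam label. `run_ocr` and `classify_text` are hypothetical
# placeholders for the project's OCR and classification steps.
import gradio as gr

def run_ocr(image_path: str) -> str:
    return "extracted text goes here"     # placeholder OCR step

def classify_text(text: str) -> str:
    return "not spam"                     # placeholder classifier

def process(image_path: str):
    text = run_ocr(image_path)
    return text, classify_text(text)

demo = gr.Interface(
    fn=process,
    inputs=gr.Image(type="filepath"),
    outputs=[gr.Textbox(label="Extracted text"), gr.Label(label="Classification")],
)

if __name__ == "__main__":
    demo.launch()
```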
@@ -156,10 +149,4 @@ We have chosen **DistilBERT** as the foundational LLM for text classification du
 ### Evaluation Strategies:
 - Perform cross-validation to assess model robustness.
 - Continuously monitor classification accuracy on new incoming data.
-- Implement feedback mechanisms for users to report misclassifications and improve the model.
-
----
-
-This integration of OCR and LLM ensures an efficient, scalable, and accurate system for spam classification of text extracted from images.
-
-
+- Implement feedback mechanisms for users to report misclassifications and improve the model.
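To make the cross-validation bullet concrete, the sketch below runs stratified k-fold scoring; a TF-IDF + logistic-regression baseline stands in for the fine-tuned DistilBERT model purely to keep the example short and runnable, and the toy texts and labels are placeholders.

```python
# Sketch of k-fold cross-validation for the spam classifier. A TF-IDF +
# logistic-regression baseline stands in for DistilBERT only to keep the
# example small; texts/labels are toy placeholder data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

texts = [
    "win a free cruise now", "claim your prize today",
    "meeting moved to 3pm", "please review the attached report",
    "limited offer, act fast", "lunch tomorrow?",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = not spam

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1")
print("F1 per fold:", scores, "mean:", scores.mean())
```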