The API returns a tuple with two elements:
- **Extracted Text (`str`)**: The text extracted from the image.
- **Spam Classification (`str`)**: The classification result ("Spam" or "Not Spam"); a usage sketch follows this list.
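As a minimal illustration of consuming this return value (the `process_image` entry point is a hypothetical stand-in, not the actual API name):

```python
# Hypothetical entry point; the real function name in this Space may differ.
extracted_text, spam_label = process_image("example_receipt.png")

print(extracted_text)  # the OCR'd text from the image
print(spam_label)      # either "Spam" or "Not Spam"
```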
---
# Chosen LLM and Justification
We have chosen **DistilBERT** as the underlying language model for text classification due to its efficiency, lightweight architecture, and strong performance on natural language processing (NLP) tasks. DistilBERT is a distilled version of BERT that retains 97% of BERT's performance while being 60% faster and requiring significantly fewer computational resources. This makes it well suited to classifying extracted text as spam or not spam in real-time OCR applications.
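As a minimal sketch using the Hugging Face `transformers` library (the `distilbert-base-uncased` checkpoint name is an assumption), the model can be loaded with a two-label classification head:

```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

# Pre-trained DistilBERT with a randomly initialized two-label head
# (0 = Not Spam, 1 = Spam); the head is trained during fine-tuning.
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
```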
---
## Steps for Fine-Tuning or Prompt Engineering
### Data Preparation:
- Gather a dataset of spam and non-spam text samples.
- Preprocess the text (cleaning, tokenization, and padding).
- Split the data into training and validation sets (a minimal preparation sketch follows this list).
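A minimal sketch of these steps, assuming `texts` and `labels` lists already exist and reusing the tokenizer loaded above (scikit-learn's `train_test_split` handles the split):

```python
from sklearn.model_selection import train_test_split

# texts: list[str], labels: list[int] with 1 = spam, 0 = not spam (assumed inputs).
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Tokenize with truncation and padding so every sample has the same length.
train_enc = tokenizer(train_texts, truncation=True, padding=True, max_length=128)
val_enc = tokenizer(val_texts, truncation=True, padding=True, max_length=128)
```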
### Fine-Tuning DistilBERT:
1. Load the pre-trained DistilBERT model.
2. Apply transfer learning by training the model on the spam dataset.
3. Use a classification head (fully connected layer) on top of DistilBERT for binary classification.
4. Implement cross-entropy loss and optimize with AdamW.
5. Evaluate performance using precision, recall, and F1-score. (A condensed sketch of these steps follows.)
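A condensed sketch of steps 1-5 using the `transformers` `Trainer`, continuing from the tokenized data above. Hyperparameters are illustrative; the `Trainer` optimizes with AdamW by default, and the sequence-classification model computes cross-entropy loss internally when labels are supplied:

```python
import torch
from torch.utils.data import Dataset
from transformers import Trainer, TrainingArguments

class SpamDataset(Dataset):
    """Pairs the tokenized encodings with their labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

args = TrainingArguments(
    output_dir="distilbert-spam",      # illustrative output path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)
trainer = Trainer(
    model=model,                       # the DistilBERT model loaded earlier
    args=args,
    train_dataset=SpamDataset(train_enc, train_labels),
    eval_dataset=SpamDataset(val_enc, val_labels),
)
trainer.train()
```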
### Prompt Engineering (Alternative Approach):
- If fine-tuning is not preferred, use predefined prompts with a larger LLM (e.g., GPT) to classify text dynamically.
- Example prompt (the wording below is illustrative; adapt it to the chosen LLM):
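  ```
  Classify the following text as "Spam" or "Not Spam".
  Respond with the label only.

  Text: "{extracted_text}"
  Label:
  ```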
---
## Integration with OCR Output
- Once text is extracted using OCR (PaddleOCR, EasyOCR, or KerasOCR), it is passed to the DistilBERT model for classification.
- The classification result is appended to the OCR output and stored in `ocr_results.json` and `ocr_results.csv`.
- The system updates the UI in real time via **Gradio** to display the extracted text alongside its classification label (a pipeline sketch follows this list).
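A minimal sketch of the hand-off from OCR output to the classifier and on to storage, reusing the `tokenizer` and `model` objects from above. The storage layout (one JSON record per line, one CSV row per result) is an assumption, not the confirmed file format:

```python
import csv
import json

import torch

def classify_text(text: str) -> str:
    """Run the fine-tuned DistilBERT model on one piece of OCR output."""
    inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return "Spam" if logits.argmax(dim=-1).item() == 1 else "Not Spam"

def store_result(image_name: str, text: str, label: str) -> None:
    """Append one classified record to ocr_results.json and ocr_results.csv."""
    record = {"image": image_name, "text": text, "classification": label}
    with open("ocr_results.json", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    with open("ocr_results.csv", "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([image_name, text, label])
```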
---
## Security and Evaluation Strategies
### Security Measures:
- Sanitize input data to prevent injection attacks.
- Implement rate limiting to prevent abuse of the API.
- Store results securely, ensuring sensitive data is not exposed. (A sketch of the first two measures follows this list.)
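One possible sketch of input sanitization and a per-client sliding-window rate limit; the thresholds are illustrative, not values taken from this project:

```python
import time
from collections import defaultdict, deque

MAX_TEXT_LEN = 5_000            # illustrative cap on input size
WINDOW_S, MAX_CALLS = 60, 30    # at most 30 requests per minute per client

_recent_calls = defaultdict(deque)

def sanitize(text: str) -> str:
    """Drop non-printable characters and cap length before text reaches the model."""
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    return cleaned[:MAX_TEXT_LEN]

def allow_request(client_id: str) -> bool:
    """Sliding-window rate limiter: reject calls beyond MAX_CALLS per WINDOW_S."""
    now = time.time()
    calls = _recent_calls[client_id]
    while calls and now - calls[0] > WINDOW_S:
        calls.popleft()
    if len(calls) >= MAX_CALLS:
        return False
    calls.append(now)
    return True
```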
### Evaluation Strategies:
- Perform cross-validation to assess model robustness.
- Continuously monitor classification accuracy on new incoming data.
- Implement feedback mechanisms for users to report misclassifications and improve the model (a metrics sketch follows this list).
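For the monitoring side, a minimal sketch of computing the precision, recall, and F1 metrics on a validation split, assuming `preds` holds the model's predicted labels:

```python
from sklearn.metrics import precision_recall_fscore_support

# val_labels: true labels, preds: predicted labels (assumed available).
precision, recall, f1, _ = precision_recall_fscore_support(
    val_labels, preds, average="binary"
)
print(f"precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
```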
---
This integration of OCR and an LLM provides an efficient, scalable, and accurate system for classifying spam in text extracted from images.