---
title: OCR + LLM
emoji: 🔎
colorFrom: pink
colorTo: gray
sdk: gradio
sdk_version: 5.16.0
app_file: app.py
pinned: false
short_description: Technical Assessment
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# OCR LLM Classifier

This project provides a simple interface for Optical Character Recognition (OCR) and spam classification using deep learning models. It supports three OCR methods (PaddleOCR, EasyOCR, and KerasOCR) and uses a DistilBERT model to classify the extracted text as "Spam" or "Not Spam."

## Features

- Extract text from images using OCR.
- Classify extracted text as either "Spam" or "Not Spam."

## How It Works

1. **OCR**: The app uses one of the three OCR methods to extract text from the uploaded image:
   - **PaddleOCR**
   - **EasyOCR**
   - **KerasOCR**
2. **Classification**: The extracted text is passed to a pre-trained DistilBERT model that classifies the text as either "Spam" or "Not Spam."

## Installation

To get started with this project, follow these steps:

### 1. Clone the Repository

```bash
git clone https://github.com/yourusername/ocr-llm-test.git
cd ocr-llm-test
```

### 2. Install Dependencies

You can install the required dependencies using pip:

```bash
pip install -r requirements.txt
```

### 3. Run the App

To run the Gradio interface locally, execute:

```bash
python app.py
```

Once the app is running, it will be accessible through your web browser at [http://localhost:7860](http://localhost:7860).

## API Documentation

### 1. API Endpoint

The main endpoint for this API is `/predict`.

### 2. API Call Example

#### Install the Python Client

If you don't already have it installed, run the following command:

```bash
pip install gradio_client
```

#### Make an API Call

```python
from gradio_client import Client, handle_file

# Connect to the hosted Space.
client = Client("winamnd/ocr-llm-test")

# Use the `resolve` URL so the raw image is downloaded;
# the `blob` URL returns the HTML file-viewer page instead.
result = client.predict(
    method="PaddleOCR",
    img=handle_file(
        "https://huggingface.co/spaces/winamnd/ocr-llm-test/resolve/main/sample_images/sample2.png"
    ),
    api_name="/predict",
)

# `result` is a tuple: (extracted text, spam classification).
print(result)
```

### 3. Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `method` | `Literal['PaddleOCR', 'EasyOCR', 'KerasOCR', 'TesseractOCR']` | The OCR method used for text extraction. Default is "PaddleOCR". |
| `img` | `dict` | The image input, which can be provided as a URL, path, or base64 encoded image. |

#### Image Input Details

- **path**: Path to a local file.
- **url**: Publicly available URL for the image.
- **size**: The size of the image (in bytes).
- **orig_name**: Original filename.
- **mime_type**: MIME type of the image.
- **is_stream**: Always set to False.
- **meta**: Metadata.

### 4. Returns

The API returns a tuple with two elements:

- **Extracted Text (`str`)**: The text extracted from the image.
- **Spam Classification (`str`)**: The classification result ("Spam" or "Not Spam").

---

# Chosen LLM and Justification

I have chosen **DistilBERT** as the foundational LLM for text classification due to its efficiency, lightweight architecture, and high performance in natural language processing (NLP) tasks. DistilBERT is a distilled version of BERT that retains 97% of BERT’s language-understanding performance while being 40% smaller and 60% faster. This makes it well suited to classifying extracted text as spam or not spam in real-time OCR applications. [reference](https://arxiv.org/pdf/1910.01108)

## Steps for Fine-Tuning or Prompt Engineering

### Data Preparation:

- Gather a dataset of spam and non-spam text samples.
- Preprocess the text (cleaning, tokenization, and padding).
- Split data into training and validation sets.

### Fine-Tuning DistilBERT:

1. Load the pre-trained DistilBERT model.
2. Apply transfer learning by training the model on the spam dataset.
3. Use a classification head (fully connected layer) on top of DistilBERT for binary classification.
4. Implement cross-entropy loss and optimize with AdamW.
5. Evaluate performance using precision, recall, and F1-score.

## Integration with OCR Output

- Once text is extracted using OCR (PaddleOCR, EasyOCR, or KerasOCR), it is passed to the DistilBERT model for classification.
- The classification result is appended to the OCR output and stored in `ocr_results.json` and `ocr_results.csv`.
- The system updates the UI in real time via **Gradio** to display the extracted text along with the classification label.

## Security and Evaluation Strategies

### Security Measures:

- Sanitize input data to prevent injection attacks.
- Implement rate limiting to prevent abuse of the API.
- Store results securely, ensuring sensitive data is not exposed.

### Evaluation Strategies:

- Perform cross-validation to assess model robustness.
- Continuously monitor classification accuracy on new incoming data.
- Implement feedback mechanisms for users to report misclassifications and improve the model.
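
The evaluation points above (precision, recall, F1-score, and cross-validation) can be sketched in plain Python. This is a minimal illustration, not the project's actual evaluation code: the function names `binary_metrics` and `k_fold_indices` are hypothetical, the labels are assumed to be the strings "Spam" and "Not Spam" returned by the API, and the folds are contiguous, so the data is assumed to be shuffled beforehand.

```python
from collections import Counter

SPAM, NOT_SPAM = "Spam", "Not Spam"  # label strings assumed from the API's return values


def binary_metrics(y_true, y_pred, positive=SPAM):
    """Return precision, recall, and F1 for the `positive` label (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            counts["tp"] += 1  # correctly flagged as spam
        elif pred == positive:
            counts["fp"] += 1  # flagged as spam, but was not
        elif truth == positive:
            counts["fn"] += 1  # spam that was missed
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    # Fall back to 0.0 when a denominator is zero (e.g. no positive predictions).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) lists for k contiguous folds (hypothetical helper)."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


if __name__ == "__main__":
    # Toy labels for illustration only; these are not real results.
    y_true = [SPAM, SPAM, NOT_SPAM, NOT_SPAM, SPAM]
    y_pred = [SPAM, NOT_SPAM, NOT_SPAM, SPAM, SPAM]
    print(binary_metrics(y_true, y_pred))
    for train_idx, val_idx in k_fold_indices(len(y_true), k=2):
        print("train:", train_idx, "val:", val_idx)
```

In a cross-validation run, each `(train_idx, val_idx)` pair would select the samples used to fine-tune and then score one model, and the per-fold metrics would be averaged. The same `binary_metrics` call could also be applied to user-reported misclassifications to track accuracy on new data.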