---
base_model:
- unsloth/Qwen2-VL-2B-Instruct-unsloth-bnb-4bit
tags:
- transformers
- unsloth
- qwen2_vl
- trl
- ocr
license: apache-2.0
language:
- ar
metrics:
- bleu
- wer
- cer
pipeline_tag: image-text-to-text
library_name: peft
---

# Qari-OCR-Arabic-0.2.2.1-VL-2B-Instruct Model


## Model Overview

This model is a fine-tuned version of [unsloth/Qwen2-VL-2B-Instruct](https://huggingface.co/unsloth/Qwen2-VL-2B-Instruct-unsloth-bnb-4bit) on an Arabic OCR dataset. It is optimized to perform Arabic Optical Character Recognition (OCR) for full-page text.


## Key Features

- **Superior Accuracy**: Achieves state-of-the-art performance metrics for Arabic OCR
- **Diacritics Support**: Full recognition of Arabic diacritical marks (tashkeel) including fatḥah, kasrah, ḍammah, sukūn, shadda, and tanwin forms - a strength confirmed by evaluation on a primarily diacritical text dataset
- **Multiple Font Support**: Works across a variety of Arabic font styles
- **Layout Flexibility**: Handles different document layouts and formats


![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/HuUcfziXcDT_2kwDoz5qH.png)

## Model Details
- **Base Model**: Qwen2 VL
- **Fine-tuning Dataset**: Arabic OCR dataset
- **Objective**: Extract full-page Arabic text with high accuracy
- **Languages**: Arabic
- **Tasks**: OCR (Optical Character Recognition)
- **Dataset size**: 50,000 records
- **Epochs**: 1


## Evaluation Metrics
Performance is evaluated using three standard metrics (a short computation sketch follows this list):
- **Word Error Rate (WER)**: Measures word-level accuracy (lower is better)
- **Character Error Rate (CER)**: Measures character-level accuracy (lower is better)
- **BLEU Score**: Measures how closely the recognized text matches the reference (higher is better)
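
The evaluation script itself is not published with this card; the snippet below is only a minimal sketch of one common way to compute these metrics, assuming the `jiwer` and `evaluate` packages and a single reference/hypothesis pair.

```python
# pip install jiwer evaluate
import jiwer
import evaluate

reference = "النص المرجعي للصفحة"    # ground-truth transcription of a page
hypothesis = "النص المرجعي للصفحه"   # text produced by the OCR model

wer = jiwer.wer(reference, hypothesis)   # word error rate, lower is better
cer = jiwer.cer(reference, hypothesis)   # character error rate, lower is better

bleu = evaluate.load("bleu")             # BLEU on a 0-1 scale, higher is better
bleu_score = bleu.compute(predictions=[hypothesis], references=[[reference]])["bleu"]

print(f"WER: {wer:.3f}  CER: {cer:.3f}  BLEU: {bleu_score:.3f}")
```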

### Results

| Model | WER ↓ | CER ↓ | BLEU ↑ |
|-------|-------|-------|--------|
| **Qari-OCR-0.2.2.1-VL-2B-Instruct** | **0.221** | **0.059** | **0.597** |
| AIN 8B | 0.757 | 0.309 | 0.103 |
| Qari-OCR-0.1-VL-2B-Instruct | 1.294 | 0.770 | 0.022 |
| easyOCR | 1.004 | 0.648 | 0.005 |
| pytesseract | 0.990 | 0.911 | <0.001 |

### WER Comparison

<img src="https://cdn-uploads.huggingface.co/production/uploads/630535e0c7fed54edfaa1a75/Artnw-bVJuSaO_vnLeupE.png" height="400px"/>

### CER Comparison


<img src="https://cdn-uploads.huggingface.co/production/uploads/630535e0c7fed54edfaa1a75/GihjVBk32SCyFCpJ81AEX.png" height="400px"/>


### BLEU Score Comparison

<img src="https://cdn-uploads.huggingface.co/production/uploads/630535e0c7fed54edfaa1a75/HOOlFw5l_Os3dyyKmTXUs.png" height="400px"/>


## Training and Evaluation Details

### Training Data
The model was trained using the following specifications (an illustrative page-rendering sketch follows the lists):

- **Font Sizes**: 14, 16, 18, 20, 24, 32, 40 pt
- **Page Layouts**:
  - A4 (210mm × 297mm)
  - Letter (216mm × 279mm)
  - Small (105mm × 148mm)
  - Square (1080px × 1080px)
  - OneLine (210mm × 10mm)
- **Arabic Fonts Used**:
  - IBM Plex Sans Arabic
  - KFGQPCUthman Taha Naskh
  - Scheherazade New
  - Amiri
  - Madina
  - Diwani Letter
  - Tajawal
  - Cairo
  - Lateef
  - Almarai
  - AlQalam Quran
  - Noto Naskh Arabic
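
The card does not say which tooling rendered the training pages, so the snippet below is only a rough sketch of how one synthetic page at the listed sizes and fonts could be produced with Pillow plus `arabic-reshaper` and `python-bidi`; the font file name and sample text are assumptions, not artifacts of the released pipeline.

```python
# pip install pillow arabic-reshaper python-bidi
from PIL import Image, ImageDraw, ImageFont
import arabic_reshaper
from bidi.algorithm import get_display

# A4 page (210 mm x 297 mm) at 150 dpi is roughly 1240 x 1754 px
page = Image.new("RGB", (1240, 1754), "white")
draw = ImageDraw.Draw(page)

# 18 pt at 150 dpi is about 37 px; "Amiri-Regular.ttf" is a hypothetical local font path
font = ImageFont.truetype("Amiri-Regular.ttf", size=37)

text = "مثال على سطر عربي مع التشكيل"
shaped = get_display(arabic_reshaper.reshape(text))  # shape letters and apply RTL ordering

# Anchor the line at its right edge, as Arabic text is laid out right-to-left
draw.text((1240 - 60, 60), shaped, font=font, fill="black", anchor="ra")
page.save("synthetic_page.png")
```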

### Limitations
Based on the training specifications, the model has the following limitations:

1. **Font Size Constraints**: May have reduced accuracy with very small (< 14pt) or very large (> 40pt) text
2. **Font Coverage**: Performance may degrade on uncommon Arabic fonts not represented in the training data
3. **Diacritics Complexity**: While the model supports diacritics (tashkeel), extremely dense or unconventional diacritical mark combinations may reduce accuracy
4. **Layout Sensitivity**: May have difficulty with complex multi-column layouts or unconventional page formats
5. **Handwriting Recognition**: Limited capability with handwritten text as training focused on digital fonts
6. **Decorative Text**: May struggle with highly stylized or decorative Arabic calligraphy
7. **Background Complexity**: Optimized for clear backgrounds; performance may degrade with complex or textured backgrounds
8. **Text Degradation**: May have challenges with severely degraded, blurry, or low-resolution text
9. **Non-standard Orientations**: Primarily designed for horizontally oriented text; may struggle with vertical or diagonal text

### Evaluation Method
Evaluation was performed on a diverse dataset of Arabic text images, **primarily featuring diacritical marks (tashkeel)**, measuring:
- **Word Error Rate (WER)**: The percentage of incorrectly recognized words
- **Character Error Rate (CER)**: The percentage of incorrectly recognized characters
- **BLEU Score**: A measure of n-gram overlap with the reference text; higher scores indicate better overall text recognition


## How to Use

[Try Qari v0.2.2.1 - Google Colab](https://colab.research.google.com/github/NAMAA-ORG/public-notebooks/blob/main/Qari_V0.2.2.1_Free_Colab.ipynb)

You can load this model using the `transformers` and `qwen_vl_utils` libraries:
```
!pip install -U transformers qwen_vl_utils "accelerate>=0.26.0" peft
!pip install -U bitsandbytes
```

```python
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
import torch
import os
from qwen_vl_utils import process_vision_info



model_name = "NAMAA-Space/Qari-OCR-0.2.2.1-Arabic-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
                model_name,
                torch_dtype="auto",
                device_map="auto"
            )
processor = AutoProcessor.from_pretrained(model_name)
max_tokens = 2000

prompt = "Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. Just return the plain text representation of this document as if you were reading it naturally. Do not hallucinate."

image = Image.open("/path/to/your/page.png")  # load the document page you want to OCR
src = "image.png"                             # temporary copy referenced below via a file:// URI
image.save(src)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"file://{src}"},
            {"type": "text", "text": prompt},
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # move the inputs to the same device as the model
generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
os.remove(src)  # clean up the temporary image copy
print(output_text)

```
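
The install step above pulls in `bitsandbytes`, but the example loads the checkpoint with `torch_dtype="auto"`. If GPU memory is tight, the same model can likely be loaded in 4-bit with a standard `BitsAndBytesConfig`; this is a sketch under that assumption, not a configuration published with the model.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

model_name = "NAMAA-Space/Qari-OCR-0.2.2.1-Arabic-2B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)
# The rest of the generation code above stays the same; just keep the inputs on
# the same device as the quantized model (e.g. inputs.to(model.device)).
```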

## License
This model follows the licensing terms of the original Qwen2 VL model. Please review the terms before using it commercially.

## Citation

If you use this model in your research, please cite:

```
@misc{QariOCR2025,
  title={Qari-OCR v0.2.2.1: A High-Accuracy Model for Arabic Optical Character Recognition},
  author={NAMAA},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/NAMAA-Space/Qari-OCR-0.2.2.1-VL-2B-Instruct}},
  note={Accessed: 2025-04-01}
}
```