---
license: apache-2.0
datasets:
- linxy/LaTeX_OCR
- prithivMLmods/Img2Text-Plaintext-Retrieval
- prithivMLmods/Img2Text-Algorithm-Retrieval
- unsloth/LaTeX_OCR
- mychen76/invoices-and-receipts_ocr_v1
language:
- en
base_model:
- Qwen/Qwen2-VL-2B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- OCR
- KIE
- Key Information Extraction
- Messy Handwriting Recognition
- text-generation-inference
- VLM
- Callisto
- OCR#3
- RAG
- 2B
---

# **Callisto-OCR3-2B-Instruct [ VL / OCR ]**

![Callisto.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/8S9-RGoCAfSqzMHvuAWm-.png)

> [!NOTE]
>  The **Callisto-OCR3-2B-Instruct** model is a fine-tuned version of *Qwen2-VL-2B-Instruct*, specifically optimized for *messy handwriting recognition*, *Optical Character Recognition (OCR)*, *English language understanding*, and *math problem solving with LaTeX formatting*. This model integrates a conversational approach with visual and textual understanding to handle multi-modal tasks effectively.

[![Open Demo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://huggingface.co/prithivMLmods/Callisto-OCR3-2B-Instruct/blob/main/Callisto-OCR3-2B-Instruct-Demo/Callisto_OCR3_2B_Instruct.ipynb)


#### Key Enhancements:

* **State-of-the-art understanding of images across resolutions and aspect ratios**: Callisto-OCR3 achieves state-of-the-art performance on visual understanding benchmarks such as MathVista, DocVQA, RealWorldQA, and MTVQA.

* **Enhanced Handwriting OCR**: Optimized for recognizing and interpreting **messy handwriting** with high accuracy, making it ideal for digitizing handwritten documents and notes.

* **Understanding videos of 20+ minutes**: Callisto-OCR3 can process long videos for high-quality video-based question answering, transcription, and content generation (see the video input sketch after this list).

* **Agent capabilities for mobile phones, robots, and other devices**: with its reasoning and decision-making abilities, Callisto-OCR3 can be integrated with such devices to perform automated tasks based on visual and textual input.

* **Multilingual Support**: Besides English and Chinese, Callisto-OCR3 supports text recognition inside images in multiple languages, including European languages, Japanese, Korean, Arabic, and Vietnamese.
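
For video inputs, Callisto-OCR3 follows the base Qwen2-VL message convention. Below is a minimal sketch; the file path and `fps` value are placeholders, not values shipped with this model:

```python
# Video question answering: pass a video file (or a list of frames) as a message item.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/lecture_recording.mp4",  # placeholder path
                "fps": 1.0,  # sample one frame per second
            },
            {"type": "text", "text": "Transcribe the handwritten notes shown in this video."},
        ],
    }
]
# Process with process_vision_info(messages) and generate as shown in "How to Use" below.
```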

### How to Use

```python
import torch  # only needed for the flash_attention_2 variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Callisto-OCR3-2B-Instruct", torch_dtype="auto", device_map="auto"
)

# Optionally enable flash_attention_2 for faster inference and lower memory use
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "prithivMLmods/Callisto-OCR3-2B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("prithivMLmods/Callisto-OCR3-2B-Instruct")

# Customize visual token range for speed-memory balance
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Recognize the handwriting in this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
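
The same pipeline covers the key information extraction (KIE) use case listed in the model tags. The snippet below is a minimal variation, with a placeholder local file path and an illustrative prompt:

```python
# Key information extraction from a receipt or invoice image.
# The file path is a placeholder; qwen_vl_utils also accepts local "file://" URIs.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/receipt.jpg"},
            {"type": "text", "text": "Extract the vendor name, date, and total amount as JSON."},
        ],
    }
]
# Re-run the preparation and generation steps above with these messages.
```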

### Buffering Output

The snippet below accumulates streamed text and strips residual special tokens. It assumes a `transformers` `TextIteratorStreamer` named `streamer` and is meant to run inside a generator function (for example, a Gradio streaming callback):

```python
buffer = ""
for new_text in streamer:
    buffer += new_text
    # Remove <|im_end|> or similar special tokens from the output
    buffer = buffer.replace("<|im_end|>", "")
    yield buffer
```
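
For reference, here is a minimal sketch of how such a streamer can be wired up, reusing the `model`, `processor`, and `inputs` objects from the example above; the `stream_response` wrapper is a hypothetical name, not part of this model's API:

```python
from threading import Thread

from transformers import TextIteratorStreamer


def stream_response(model, processor, inputs, max_new_tokens=128):
    # Decode tokens incrementally as generation proceeds
    streamer = TextIteratorStreamer(
        processor.tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    # Run generate() in a background thread so the stream can be consumed here
    Thread(
        target=model.generate,
        kwargs=dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens),
    ).start()
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        yield buffer
```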

### **Key Features**

1. **Advanced Handwriting OCR:**  
   - Excels at recognizing and transcribing **messy and cursive handwriting** into digital text with high accuracy.

2. **Vision-Language Integration:**  
   - Combines **image understanding** with **natural language processing** to convert images into text.  

3. **Optical Character Recognition (OCR):**  
   - Extracts and processes textual information from images with precision.

4. **Math and LaTeX Support:**  
   - Solves math problems and outputs equations in **LaTeX format**.

5. **Conversational Capabilities:**  
   - Designed to handle **multi-turn interactions** with context-aware responses (a multi-turn sketch follows this list).

6. **Image-Text-to-Text Generation:**  
   - Inputs can include **images, text, or a combination**, and the model generates descriptive or problem-solving text.
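
As a sketch of the multi-turn usage from point 5 (the follow-up wording and the reuse of `output_text` are illustrative assumptions), prior turns are simply appended to `messages` before re-applying the chat template:

```python
# Multi-turn follow-up: append the first answer, then ask a new question.
messages += [
    {"role": "assistant", "content": [{"type": "text", "text": output_text[0]}]},
    {
        "role": "user",
        "content": [{"type": "text", "text": "Now render any equations you found as LaTeX."}],
    },
]
# Re-apply processor.apply_chat_template(...) and model.generate(...) as in "How to Use".
```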