---
license: apache-2.0
datasets:
- linxy/LaTeX_OCR
- prithivMLmods/Img2Text-Plaintext-Retrieval
- prithivMLmods/Img2Text-Algorithm-Retrieval
- unsloth/LaTeX_OCR
- mychen76/invoices-and-receipts_ocr_v1
language:
- en
base_model:
- Qwen/Qwen2-VL-2B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- OCR
- KIE
- Key Information Extraction
- Messy Handwriting Recognition
- text-generation-inference
- VLM
- Callisto
- OCR#3
- RAG
- 2B
---
# **Callisto-OCR3-2B-Instruct [ VL / OCR ]**

> [!Note]
> The **Callisto-OCR3-2B-Instruct** model is a fine-tuned version of *Qwen2-VL-2B-Instruct*, specifically optimized for *messy handwriting recognition*, *Optical Character Recognition (OCR)*, *English language understanding*, and *math problem solving with LaTeX formatting*. This model integrates a conversational approach with visual and textual understanding to handle multi-modal tasks effectively.

[Demo notebook](https://huggingface.co/prithivMLmods/Callisto-OCR3-2B-Instruct/blob/main/Callisto-OCR3-2B-Instruct-Demo/Callisto_OCR3_2B_Instruct.ipynb)

#### Key Enhancements:
* **SoTA understanding of images of various resolutions and aspect ratios**: Callisto-OCR3 achieves state-of-the-art performance on visual understanding benchmarks such as MathVista, DocVQA, RealWorldQA, and MTVQA.
* **Enhanced Handwriting OCR**: Optimized for recognizing and interpreting **messy handwriting** with high accuracy, making it ideal for digitizing handwritten documents and notes.
* **Understanding videos of 20+ minutes**: Callisto-OCR3 can process long videos, enabling high-quality video-based question answering, transcription, and content generation.
* **Agent that can operate mobile phones, robots, and other devices**: With advanced reasoning and decision-making, Callisto-OCR3 can be integrated with mobile phones, robots, and other devices to perform automated tasks based on visual and textual input.
* **Multilingual Support**: Besides English and Chinese, Callisto-OCR3 supports text recognition inside images in multiple languages, including European languages, Japanese, Korean, Arabic, and Vietnamese.
### How to Use
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Callisto-OCR3-2B-Instruct", torch_dtype="auto", device_map="auto"
)

# Enable flash_attention_2 for better acceleration and memory optimization
# (requires `import torch` for torch.bfloat16)
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "prithivMLmods/Callisto-OCR3-2B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("prithivMLmods/Callisto-OCR3-2B-Instruct")

# Customize visual token range for speed-memory balance
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Recognize the handwriting in this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was loaded on
inputs = inputs.to(model.device)

# Inference: Generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
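
The commented `min_pixels` / `max_pixels` lines in the example above are written as multiples of `28*28`, which suggests a budget of roughly one visual token per 28x28 pixel block. The sketch below estimates how many visual tokens an image of a given size would use under that assumption. It only approximates the resizing the processor performs and is not its actual code; `pixel_budget` and `estimate_visual_tokens` are illustrative names, not part of `transformers` or `qwen_vl_utils`.

```python
import math

PATCH = 28  # assumption: one visual token per 28x28 pixel block


def pixel_budget(min_tokens=256, max_tokens=1280):
    """Convert a visual-token range into min_pixels / max_pixels values."""
    return min_tokens * PATCH * PATCH, max_tokens * PATCH * PATCH


def estimate_visual_tokens(height, width, min_pixels, max_pixels):
    """Estimate the token count after resizing an image into the pixel budget."""
    # Snap each side to the nearest multiple of PATCH
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink while keeping the aspect ratio, rounding down
        scale = math.sqrt((height * width) / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: enlarge while keeping the aspect ratio, rounding up
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return (h // PATCH) * (w // PATCH)


min_pixels, max_pixels = pixel_budget()
print(min_pixels, max_pixels)  # 200704 1003520
print(estimate_visual_tokens(560, 840, min_pixels, max_pixels))  # 600
```

A lower `max_pixels` reduces memory use and latency at the cost of detail, which can matter for small or dense handwriting.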
### Buffering Output
```python
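def stream_output(streamer):
    """Yield the accumulated text as chunks arrive from `streamer`.

    `streamer` is any iterable of text chunks, e.g. a `TextIteratorStreamer`.
    """
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        # Remove <|im_end|> or similar tokens from the output
        buffer = buffer.replace("<|im_end|>", "")
        yield buffer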
```
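
The buffering loop above assumes a `streamer` that yields text chunks while generation is still running. One way to arrange that, sketched here under assumptions, is to run `model.generate` in a background thread and read from the streamer in the foreground. `stream_generate` is an illustrative helper, and `FakeStreamer` and `FakeModel` are hypothetical stand-ins so the snippet runs without downloading the model.

```python
import queue
from threading import Thread


def stream_generate(model, streamer, inputs, max_new_tokens=128, stop_token="<|im_end|>"):
    """Run `model.generate` in a background thread and yield the growing text."""
    generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=max_new_tokens)
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        # Strip the stop token once it has fully arrived in the buffer
        buffer = buffer.replace(stop_token, "")
        yield buffer
    thread.join()


# Stand-ins (hypothetical) that mimic a streamer and a model for a dry run
class FakeStreamer:
    """Queue-backed iterator: the producer calls put()/end(), the consumer iterates."""

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(None)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get(timeout=10)
        if item is None:
            raise StopIteration
        return item


class FakeModel:
    def generate(self, streamer=None, **kwargs):
        # Emit a few chunks, ending with the stop token
        for chunk in ["Dear ", "Sam,", "<|im_end|>"]:
            streamer.put(chunk)
        streamer.end()


partials = list(stream_generate(FakeModel(), FakeStreamer(), {"input_ids": [[1, 2, 3]]}))
print(partials[-1])  # Dear Sam,
```

With the real model, `streamer` would typically be a `TextIteratorStreamer` from `transformers`, constructed with the processor's tokenizer and `skip_prompt=True`, and `inputs` would be the processor output from the inference example.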
### **Key Features**
1. **Advanced Handwriting OCR:**
   - Excels at recognizing and transcribing **messy and cursive handwriting** into digital text with high accuracy.
2. **Vision-Language Integration:**
   - Combines **image understanding** with **natural language processing** to convert images into text.
3. **Optical Character Recognition (OCR):**
   - Extracts and processes textual information from images with precision.
4. **Math and LaTeX Support:**
   - Solves math problems and outputs equations in **LaTeX format**.
5. **Conversational Capabilities:**
   - Designed to handle **multi-turn interactions**, providing context-aware responses.
6. **Image-Text-to-Text Generation:**
   - Inputs can include **images, text, or a combination**, and the model generates descriptive or problem-solving text.
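
To illustrate the multi-turn and mixed image/text inputs listed above, the sketch below builds a `messages` list in the same structure used by the inference example. `build_messages` is an illustrative helper rather than part of any library, the image path and the assistant reply are hypothetical, and representing assistant turns as a list of text items is an assumption about what the chat template accepts.

```python
def build_messages(turns):
    """Build a chat `messages` list from (role, image, text) turns.

    `image` is a URL or path, or None for a text-only turn.
    """
    messages = []
    for role, image, text in turns:
        content = []
        if image is not None:
            content.append({"type": "image", "image": image})
        content.append({"type": "text", "text": text})
        messages.append({"role": role, "content": content})
    return messages


messages = build_messages([
    # Hypothetical image path and conversation, for illustration only
    ("user", "file:///path/to/notes.jpg", "Transcribe the handwriting in this image."),
    ("assistant", None, "x^2 + 2x + 1 = 0"),
    ("user", None, "Solve the equation and write the steps in LaTeX."),
])
print(len(messages))  # 3
```

The resulting list can be passed to `processor.apply_chat_template` and `process_vision_info` in place of the single-turn `messages` from the inference example.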