---
base_model: Qwen/Qwen2.5-VL-3B-Instruct
library_name: peft
datasets:
- llamaindex/vdr-multilingual-train
- nomic-ai/colpali_train_set_split_by_source
language:
- en
- it
- fr
- de
- es
pipeline_tag: visual-document-retrieval
tags:
- vidore
- colpali
- multimodal_embedding
- multilingual_embedding
- Text-to-Visual Document (T→VD) retrieval
---
# ColNomic Embed Multimodal 3B: State-of-the-Art Visual Document Retrieval
`colnomic-embed-multimodal-3b` is a state-of-the-art multi-vector multimodal embedding model that excels at visual document retrieval tasks:

- **High Performance**: Achieves 61.2 NDCG@5 on Vidore-v2, outperforming every other model except ColNomic Embed Multimodal 7B
- **Unified Text-Image Encoding**: Directly encodes interleaved text and images without complex preprocessing
- **Advanced Architecture**: 3B-parameter multimodal embedding model
- **Open Weights**: Model weights are available for research use
## Performance
| Model | Avg. | ESG Restaurant Human | Econ Macro Multi. | AXA Multi. | MIT Bio | ESG Restaurant Synth. | ESG Restaurant Synth. Multi. | MIT Bio Multi. | AXA | Econ. Macro |
|-------|------|----------------------|-------------------|------------|---------|----------------------|----------------------------|---------------|-----|------------|
| [ColNomic Embed Multimodal 7B](https://huggingface.co/nomic-ai/colnomic-embed-multimodal-7b)| 62.7 | 73.9 | 54.7 | 61.3 | 66.1 | 57.3 | 56.7 | 64.2 | 68.3 | 61.6 |
| **ColNomic Embed Multimodal 3B** | 61.2 | 65.8 | 55.4 | 61.0 | 63.5 | 56.6 | 57.2 | 62.5 | 68.8 | 60.2 |
| T-Systems ColQwen2.5-3B | 59.9 | 72.1 | 51.2 | 60.0 | 65.3 | 51.7 | 53.3 | 61.7 | 69.3 | 54.8 |
| [Nomic Embed Multimodal 7B](https://huggingface.co/nomic-ai/nomic-embed-multimodal-7b) | 59.7 | 65.7 | 57.7 | 59.3 | 64.0 | 49.2 | 51.9 | 61.2 | 66.3 | 63.1 |
| GME Qwen2 7B | 59.0 | 65.8 | 56.2 | 55.4 | 64.0 | 54.3 | 56.7 | 55.1 | 60.7 | 62.9 |
| [Nomic Embed Multimodal 3B](https://huggingface.co/nomic-ai/nomic-embed-multimodal-3b) | 58.8 | 59.8 | 57.5 | 58.8 | 62.5 | 49.4 | 49.4 | 58.6 | 69.6 | 63.5 |
| Llama Index vdr-2b-multi-v1 | 58.4 | 63.1 | 52.8 | 61.0 | 60.6 | 50.3 | 51.2 | 56.9 | 68.8 | 61.2 |
| Voyage Multimodal 3 | 55.0 | 56.1 | 55.0 | 59.5 | 56.4 | 47.2 | 46.2 | 51.5 | 64.1 | 58.8 |
## Getting Started
To use `colnomic-embed-multimodal-3b`, first install the `colpali-engine` package from source:
```bash
pip install git+https://github.com/illuin-tech/colpali.git
```
```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

model_name = "nomic-ai/colnomic-embed-multimodal-3b"

# Load the model and processor (flash attention is used when available)
model = ColQwen2_5.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()
processor = ColQwen2_5_Processor.from_pretrained(model_name)

# Your inputs: document page images and text queries
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year's financial performance?",
]

# Preprocess the inputs and move them to the model's device
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass: one multi-vector embedding per image and per query
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction (MaxSim) similarity between every query and every image
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
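The returned `scores` is a tensor of shape `(num_queries, num_images)`, where each entry is a late-interaction similarity, so ranking candidates for a query is a simple sort. A minimal sketch (variable names are illustrative):

```python
# Rank the images for each query by similarity and report the best match.
for query, query_scores in zip(queries, scores):
    best = query_scores.argmax().item()
    print(f"{query!r} -> image {best} (score {query_scores[best].item():.2f})")
```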
## Model Architecture
- **Total Parameters**: 3B
- **Training Approach**: Fine-tuned from Qwen2.5-VL 3B Instruct
- **Architecture Type**: Vision-Language Model with unified text and image input processing
- **Key Innovations**:
- Same-source sampling to create harder in-batch negatives
- Multi-vector output option for enhanced performance, scored with the MaxSim operator (see the sketch after this list)
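The multi-vector output is scored with the late-interaction MaxSim operator from ColBERT/ColPali: every query-token embedding is compared against every document-patch embedding, and the per-token maxima are summed. A minimal sketch of that computation, assuming per-token embeddings of shape `(num_tokens, dim)` (in practice `processor.score_multi_vector` does this for you in batched form):

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim)
    sim = query_emb @ doc_emb.T          # all-pairs token similarity
    return sim.max(dim=1).values.sum()   # best doc match per query token, summed
```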
## Integration with RAG Workflows
ColNomic Embed Multimodal 3B integrates seamlessly with Retrieval Augmented Generation (RAG) workflows (an indexing sketch follows this list):
1. **Direct Document Embedding**: Skip OCR and complex processing by directly embedding document page images
2. **Faster Processing**: Eliminate preprocessing steps for quicker indexing
3. **More Complete Information**: Capture both textual and visual cues in a single embedding
4. **Simple Implementation**: Use the same API for both text and images
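As a sketch of what indexing might look like in such a workflow (the `embed_pages` helper and its batching scheme are illustrative assumptions, not part of `colpali-engine`):

```python
import torch

def embed_pages(page_images, model, processor, batch_size=4):
    """Embed document page images in small batches; returns one
    multi-vector embedding tensor per page, moved to the CPU."""
    embeddings = []
    for i in range(0, len(page_images), batch_size):
        batch = processor.process_images(page_images[i:i + batch_size]).to(model.device)
        with torch.no_grad():
            embeddings.extend(emb.cpu() for emb in model(**batch))
    return embeddings
```

At query time, embed queries with `processor.process_queries` and rank pages with `processor.score_multi_vector`, exactly as in Getting Started.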
## Recommended Use Cases
The model excels at handling real-world document retrieval scenarios that challenge traditional text-only systems:
- **Research Papers**: Capture equations, diagrams, and tables
- **Technical Documentation**: Encode code blocks, flowcharts, and screenshots
- **Product Catalogs**: Represent images, specifications, and pricing tables
- **Financial Reports**: Embed charts, graphs, and numerical data
- **Visually Rich Content**: Where layout and visual information are important
- **Multilingual Documents**: Where visual context provides important cues
## Training Details
ColNomic Embed Multimodal 3B was developed through two key innovations:
1. **Sampling From the Same Source**: Forcing each training batch to draw from a single dataset source creates harder in-batch negatives and prevents the model from learning dataset artifacts (a sampler sketch follows this list).
2. **Multi-Vector Configuration**: The multi-vector variant produces one embedding per token and achieves higher retrieval performance than the dense (single-vector) variant.
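The training code is not part of this release, but the same-source idea is easy to illustrate. A hedged sketch, assuming each example record carries a `source` field naming its originating dataset:

```python
import random
from collections import defaultdict

def same_source_batches(examples, batch_size):
    """Group examples by dataset source so every batch is drawn from a
    single source; in-batch negatives then share domain and layout."""
    by_source = defaultdict(list)
    for ex in examples:
        by_source[ex["source"]].append(ex)
    batches = []
    for group in by_source.values():
        random.shuffle(group)
        batches += [group[i:i + batch_size] for i in range(0, len(group), batch_size)]
    random.shuffle(batches)
    return batches
```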
## Limitations
- Performance may vary when processing documents with unconventional layouts or unusual visual elements
- While it handles multiple languages, performance is strongest on English content
- Processing very large or complex documents may require dividing them into smaller chunks (a page-splitting sketch follows this list)
- Performance on documents with handwriting or heavily stylized fonts may be reduced
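For the chunking point above, one pragmatic approach is to render each page of a long PDF as a separate image and embed the pages independently; a sketch assuming `pdf2image` (and its poppler dependency) is installed:

```python
from pdf2image import convert_from_path

# Render each page to a PIL image at modest resolution, then embed
# the pages in small batches as in the Getting Started example.
pages = convert_from_path("report.pdf", dpi=150)
```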
## Join the Nomic Community
- Nomic Embed Ecosystem: [https://www.nomic.ai/embed](https://www.nomic.ai/embed)
- Website: [https://nomic.ai](https://nomic.ai)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
## Citation
If you find this model useful in your research or applications, please consider citing:
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
@misc{ma2024unifyingmultimodalretrievaldocument,
  title={Unifying Multimodal Retrieval via Document Screenshot Embedding},
  author={Xueguang Ma and Sheng-Chieh Lin and Minghan Li and Wenhu Chen and Jimmy Lin},
  year={2024},
  eprint={2406.11251},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2406.11251},
}
@misc{nomicembedmultimodal2025,
  title={Nomic Embed Multimodal: Interleaved Text, Image, and Screenshots for Visual Document Retrieval},
  author={Nomic Team},
  year={2025},
  publisher={Nomic AI},
  url={https://nomic.ai/blog/posts/nomic-embed-multimodal},
}
```