Adding LongEmbed results (and license to metadata)
README.md (CHANGED)
@@ -10,6 +10,7 @@ tags:
- loss:Distillation
base_model: Alibaba-NLP/gte-modernbert-base
pipeline_tag: sentence-similarity
+license: apache-2.0
library_name: PyLate
metrics:
- MaxSim_accuracy@1

@@ -778,6 +779,21 @@
- **Language:** English
- **License:** Apache 2.0

### Document length

GTE-ModernColBERT was trained with knowledge distillation on MS MARCO with a document length of 300 tokens, which explains its default document length value.

However, as illustrated in the ModernBERT paper, ColBERT models can generalize to document lengths well beyond their training length, and GTE-ModernColBERT yields results well above SOTA on long-context embedding benchmarks; see the [LongEmbed results](#longembed-benchmark).

Simply adapt the document length parameter to your needs when loading the model:
```python
from pylate import models

# Load the model with a longer document length than the 300-token default.
model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    document_length=8192,
)
```

ModernBERT itself was only trained on an 8K context length, but GTE-ModernColBERT seems to generalize to even larger context sizes. This is not guaranteed, so please run your own benchmarks!
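
As a quick end-to-end check of the long-document path, you can encode a query together with a few candidate documents and rerank them with PyLate's late-interaction (MaxSim) scoring. The snippet below is a minimal sketch based on PyLate's `rank.rerank` helper (see the PyLate documentation under Model Sources); the query, document texts, and ids are placeholders.

```python
from pylate import models, rank

# Same loading call as above: extend the document length beyond the 300-token default.
model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    document_length=8192,
)

# Placeholder query and candidate documents (one list of candidates per query).
queries = ["What does the report conclude about long-context retrieval?"]
documents = [["<long document 1>", "<long document 2>"]]
documents_ids = [["doc-1", "doc-2"]]

# Encode queries and documents into per-token (multi-vector) embeddings.
queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)

# Score every candidate with MaxSim and return them reordered per query.
reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
print(reranked)
```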
### Model Sources
- **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)

@@ -957,6 +973,27 @@
| ColBERT-small (rerun) | 53.35 | 41.01 | 36.86 | 83.14 | 24.95 | 46.76 | 87.89 | 18.72 | 74.02 | 59.42 | 32.83 | 76.88 | 46.36 | 39.36 | 88.66 | 43.44 |

#### LongEmbed Benchmark

GTE-ModernColBERT was trained with knowledge distillation on MS MARCO with a document length of 300 tokens, which explains its default document length value.

However, as illustrated in the ModernBERT paper, ColBERT models can generalize to document lengths well beyond their training length, and GTE-ModernColBERT yields results well above SOTA (almost 10 points above the previous SOTA) on the LongEmbed long-context embedding benchmark:

| Model | Mean | LEMBNarrativeQARetrieval | LEMBNeedleRetrieval | LEMBPasskeyRetrieval | LEMBQMSumRetrieval | LEMBSummScreenFDRetrieval | LEMBWikimQARetrieval |
|----------------------------------------------|-----------|--------------------------|---------------------|----------------------|--------------------|---------------------------|----------------------|
| GTE-ModernColBERT (with 32k document length) | **88.39** | **78.82** | **92.5** | 92 | **72.17** | 94.98 | **99.87** |
| voyage-multilingual-2 | 79.17 | 64.694 | 75.25 | **97** | 51.495 | **99.105** | 87.489 |
| inf-retriever-v1 | 73.19 | 60.702 | 61.5 | 78.75 | 55.072 | 97.387 | 85.751 |
| snowflake-arctic-embed-l-v2.0 | 63.73 | 43.632 | 50.25 | 77.25 | 40.04 | 96.383 | 74.843 |
| gte-multilingual-base | 62.12 | 52.358 | 42.25 | 55.5 | 43.033 | 95.499 | 84.078 |
| jasper_en_vision_language_v1 | 60.93 | 37.928 | 55 | 62.25 | 41.186 | 97.206 | 72.025 |
| bge-m3 | 58.73 | 45.761 | 40.25 | 59 | 35.543 | 94.089 | 77.726 |
| jina-embeddings-v3 | 55.66 | 34.297 | 64 | 38 | 39.337 | 92.334 | 66.018 |
| e5-base-4k | 54.51 | 30.03 | 37.75 | 65.25 | 31.268 | 93.868 | 68.875 |
| gte-Qwen2-7B-instruct | 47.24 | 45.46 | 31 | 38.5 | 31.272 | 76.08 | 61.151 |

ModernBERT itself was only trained on an 8K context length, but GTE-ModernColBERT seems to generalize to even larger context sizes. This is not guaranteed, so please run your own benchmarks!
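
As a starting point for such benchmarks, the sketch below indexes a couple of long documents with the model loaded at a 32k document length (32768 tokens is assumed here for the "32k" setting used in the table) and runs a single retrieval query through PyLate's Voyager index and ColBERT retriever. This is only a sanity-check sketch with placeholder texts, not the harness used to produce the numbers above.

```python
from pylate import indexes, models, retrieve

# Load the model at a 32k document length (32768 tokens assumed for the "32k" setting).
model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    document_length=32768,
)

# Build a small Voyager index and a ColBERT retriever on top of it.
index = indexes.Voyager(index_folder="pylate-index", index_name="longembed-check", override=True)
retriever = retrieve.ColBERT(index=index)

# Placeholder long documents; in a real check these would be book-length passages.
documents_ids = ["doc-1", "doc-2"]
documents = ["<long document 1>", "<long document 2>"]

# Encode and index the documents.
documents_embeddings = model.encode(documents, is_query=False)
index.add_documents(documents_ids=documents_ids, documents_embeddings=documents_embeddings)

# Retrieve the top documents for a single query.
queries_embeddings = model.encode(["Where is the passkey hidden?"], is_query=True)
results = retriever.retrieve(queries_embeddings=queries_embeddings, k=2)
print(results)
```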
#### PyLate Information Retrieval