NohTow committed
Commit 7936b9a · verified · 1 Parent(s): c4c9666

Adding LongEmbed results (and license to metadata)

Files changed (1): README.md (+37, -0)
README.md CHANGED
@@ -10,6 +10,7 @@ tags:
  - loss:Distillation
  base_model: Alibaba-NLP/gte-modernbert-base
  pipeline_tag: sentence-similarity
+ license: apache-2.0
  library_name: PyLate
  metrics:
  - MaxSim_accuracy@1
 
@@ -778,6 +779,21 @@ This is a [PyLate](https://github.com/lightonai/pylate) model trained on the [ms
  - **Language:** English
  - **License:** Apache 2.0
 
+ ### Document length
+ GTE-ModernColBERT has been trained with knowledge distillation on MS MARCO with a document length of 300 tokens, which explains its default document length value.
+
+ However, as illustrated in the ModernBERT paper, ColBERT models can generalize to document lengths far beyond their training length, and GTE-ModernColBERT actually yields results well above SOTA on long-context embedding benchmarks; see the [LongEmbed results](#longembed-benchmark).
+
+ Simply adapt the document length parameter to your needs when loading the model:
+ ```python
+ from pylate import models
+
+ model = models.ColBERT(
+     model_name_or_path="lightonai/GTE-ModernColBERT-v1",
+     document_length=8192,
+ )
+ ```
+ ModernBERT itself has only been trained on an 8K context length, but GTE-ModernColBERT seems to generalize to even bigger context sizes, though this is not guaranteed, so please run your own benchmarks!
+
+
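For example, here is a minimal sketch of encoding longer inputs once the document length has been raised; the example texts are hypothetical, and the `encode` call follows PyLate's usual multi-vector API:

```python
from pylate import models

# Load the checkpoint with a larger document length (adjust to your corpus).
model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    document_length=8192,
)

# Hypothetical long inputs, purely for illustration.
documents = ["A very long report spanning thousands of tokens ..."]
queries = ["What does the report conclude?"]

# Each input is encoded into a bag of token-level embeddings,
# ready for MaxSim-style late-interaction scoring or indexing.
documents_embeddings = model.encode(documents, is_query=False)
queries_embeddings = model.encode(queries, is_query=True)
```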
  ### Model Sources
 
  - **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
 
@@ -957,6 +973,27 @@ For this reason, we also report the results of ColBERT-small in the same setup w
  | ColBERT-small (rerun) | 53.35 | 41.01 | 36.86 | 83.14 | 24.95 | 46.76 | 87.89 | 18.72 | 74.02 | 59.42 | 32.83 | 76.88 | 46.36 | 39.36 | 88.66 | 43.44 |
 
+ #### LongEmbed Benchmark
+ GTE-ModernColBERT has been trained with knowledge distillation on MS MARCO with a document length of 300 tokens, which explains its default document length value.
+ However, as illustrated in the ModernBERT paper, ColBERT models can generalize to document lengths far beyond their training length, and GTE-ModernColBERT yields results well above SOTA (almost 10 points above the previous SOTA) on this long-context embedding benchmark:
+
+ | Model | Mean | LEMBNarrativeQARetrieval | LEMBNeedleRetrieval | LEMBPasskeyRetrieval | LEMBQMSumRetrieval | LEMBSummScreenFDRetrieval | LEMBWikimQARetrieval |
+ |----------------------------------------------|-----------|--------------------------|---------------------|----------------------|--------------------|---------------------------|----------------------|
+ | GTE-ModernColBERT (with 32k document length) | **88.39** | **78.82** | **92.5** | 92 | **72.17** | 94.98 | **99.87** |
+ | voyage-multilingual-2 | 79.17 | 64.69 | 75.25 | **97** | 51.50 | **99.11** | 87.49 |
+ | inf-retriever-v1 | 73.19 | 60.70 | 61.5 | 78.75 | 55.07 | 97.39 | 85.75 |
+ | snowflake-arctic-embed-l-v2.0 | 63.73 | 43.63 | 50.25 | 77.25 | 40.04 | 96.38 | 74.84 |
+ | gte-multilingual-base | 62.12 | 52.36 | 42.25 | 55.5 | 43.03 | 95.50 | 84.08 |
+ | jasper_en_vision_language_v1 | 60.93 | 37.93 | 55 | 62.25 | 41.19 | 97.21 | 72.03 |
+ | bge-m3 | 58.73 | 45.76 | 40.25 | 59 | 35.54 | 94.09 | 77.73 |
+ | jina-embeddings-v3 | 55.66 | 34.30 | 64 | 38 | 39.34 | 92.33 | 66.02 |
+ | e5-base-4k | 54.51 | 30.03 | 37.75 | 65.25 | 31.27 | 93.87 | 68.88 |
+ | gte-Qwen2-7B-instruct | 47.24 | 45.46 | 31 | 38.5 | 31.27 | 76.08 | 61.15 |
+
+ ModernBERT itself has only been trained on an 8K context length, but GTE-ModernColBERT seems to generalize to even bigger context sizes, though this is not guaranteed, so please run your own benchmarks!
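For reference, a minimal sketch of the setting used for the first row above, assuming "32k document length" corresponds to 32,768 tokens:

```python
from pylate import models

# Assumed LongEmbed setting: document length of 32k (taken here as 32,768 tokens).
model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    document_length=32768,
)
```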
 
  #### PyLate Information Retrieval