Adding LongEmbed results (and license to metadata)
README.md (CHANGED)
@@ -10,6 +10,7 @@ tags:
- loss:Distillation
base_model: Alibaba-NLP/gte-modernbert-base
pipeline_tag: sentence-similarity
+license: apache-2.0
library_name: PyLate
metrics:
- MaxSim_accuracy@1

@@ -778,6 +779,21 @@
- **Language:** English
- **License:** Apache 2.0

### Document length

GTE-ModernColBERT was trained with knowledge distillation on MS MARCO with a document length of 300 tokens, which explains its default document length value.

However, as illustrated in the ModernBERT paper, ColBERT models can generalize to document lengths well beyond their training length, and GTE-ModernColBERT yields results well above SOTA on long-context embedding benchmarks; see the [LongEmbed results](#longembed-benchmark).

Simply adapt the document length parameter to your needs when loading the model:
```python
from pylate import models

# Load the model with a longer document length than the 300-token default.
model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    document_length=8192,
)
```

ModernBERT itself was only trained on an 8K context length, but GTE-ModernColBERT seems to generalize to even larger context sizes. This is not guaranteed, so please run your own benchmarks!
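
As a quick end-to-end check of the long-document path, you can encode a query together with a few candidate documents and rerank them with PyLate's late-interaction (MaxSim) scoring. The snippet below is a minimal sketch based on PyLate's `rank.rerank` helper (see the PyLate documentation under Model Sources); the query, document texts, and ids are placeholders.

```python
from pylate import models, rank

# Same loading call as above: extend the document length beyond the 300-token default.
model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    document_length=8192,
)

# Placeholder query and candidate documents (one list of candidates per query).
queries = ["What does the report conclude about long-context retrieval?"]
documents = [["<long document 1>", "<long document 2>"]]
documents_ids = [["doc-1", "doc-2"]]

# Encode queries and documents into per-token (multi-vector) embeddings.
queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)

# Score every candidate with MaxSim and return them reordered per query.
reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
print(reranked)
```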
### Model Sources
- **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)

@@ -957,6 +973,27 @@
| ColBERT-small (rerun) | 53.35 | 41.01 | 36.86 | 83.14 | 24.95 | 46.76 | 87.89 | 18.72 | 74.02 | 59.42 | 32.83 | 76.88 | 46.36 | 39.36 | 88.66 | 43.44 |

#### LongEmbed Benchmark

GTE-ModernColBERT was trained with knowledge distillation on MS MARCO with a document length of 300 tokens, which explains its default document length value.

However, as illustrated in the ModernBERT paper, ColBERT models can generalize to document lengths well beyond their training length, and GTE-ModernColBERT yields results well above SOTA (almost 10 points above the previous SOTA) on the LongEmbed long-context embedding benchmark:

| Model | Mean | LEMBNarrativeQARetrieval | LEMBNeedleRetrieval | LEMBPasskeyRetrieval | LEMBQMSumRetrieval | LEMBSummScreenFDRetrieval | LEMBWikimQARetrieval |
|----------------------------------------------|-----------|--------------------------|---------------------|----------------------|--------------------|---------------------------|----------------------|
| GTE-ModernColBERT (with 32k document length) | **88.39** | **78.82** | **92.5** | 92 | **72.17** | 94.98 | **99.87** |
| voyage-multilingual-2 | 79.17 | 64.694 | 75.25 | **97** | 51.495 | **99.105** | 87.489 |
| inf-retriever-v1 | 73.19 | 60.702 | 61.5 | 78.75 | 55.072 | 97.387 | 85.751 |
| snowflake-arctic-embed-l-v2.0 | 63.73 | 43.632 | 50.25 | 77.25 | 40.04 | 96.383 | 74.843 |
| gte-multilingual-base | 62.12 | 52.358 | 42.25 | 55.5 | 43.033 | 95.499 | 84.078 |
| jasper_en_vision_language_v1 | 60.93 | 37.928 | 55 | 62.25 | 41.186 | 97.206 | 72.025 |
| bge-m3 | 58.73 | 45.761 | 40.25 | 59 | 35.543 | 94.089 | 77.726 |
| jina-embeddings-v3 | 55.66 | 34.297 | 64 | 38 | 39.337 | 92.334 | 66.018 |
| e5-base-4k | 54.51 | 30.03 | 37.75 | 65.25 | 31.268 | 93.868 | 68.875 |
| gte-Qwen2-7B-instruct | 47.24 | 45.46 | 31 | 38.5 | 31.272 | 76.08 | 61.151 |

ModernBERT itself was only trained on an 8K context length, but GTE-ModernColBERT seems to generalize to even larger context sizes. This is not guaranteed, so please run your own benchmarks!
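
As a starting point for such benchmarks, the sketch below indexes a couple of long documents with the model loaded at a 32k document length (32768 tokens is assumed here for the "32k" setting used in the table) and runs a single retrieval query through PyLate's Voyager index and ColBERT retriever. This is only a sanity-check sketch with placeholder texts, not the harness used to produce the numbers above.

```python
from pylate import indexes, models, retrieve

# Load the model at a 32k document length (32768 tokens assumed for the "32k" setting).
model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    document_length=32768,
)

# Build a small Voyager index and a ColBERT retriever on top of it.
index = indexes.Voyager(index_folder="pylate-index", index_name="longembed-check", override=True)
retriever = retrieve.ColBERT(index=index)

# Placeholder long documents; in a real check these would be book-length passages.
documents_ids = ["doc-1", "doc-2"]
documents = ["<long document 1>", "<long document 2>"]

# Encode and index the documents.
documents_embeddings = model.encode(documents, is_query=False)
index.add_documents(documents_ids=documents_ids, documents_embeddings=documents_embeddings)

# Retrieve the top documents for a single query.
queries_embeddings = model.encode(["Where is the passkey hidden?"], is_query=True)
results = retriever.retrieve(queries_embeddings=queries_embeddings, k=2)
print(results)
```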
#### PyLate Information Retrieval