|
--- |
|
datasets: |
|
- BAAI/Infinity-Instruct |
|
- opencsg/chinese-fineweb-edu |
|
language: |
|
- zh |
|
pipeline_tag: sentence-similarity |
|
library_name: sentence-transformers |
|
license: mit |
|
--- |
|
## Introduction |
|
|
|
This model was trained by [richinfoai](https://www.richinfo.cn/). |
|
Following the approach of the [Stella and Jasper models](https://arxiv.org/pdf/2412.19048), we performed distillation training from

[lier007/xiaobu-embedding-v2](https://huggingface.co/lier007/xiaobu-embedding-v2),

[dunzhang/stella-large-zh-v3-1792d](https://huggingface.co/dunzhang/stella-large-zh-v3-1792d)

and [BAAI/bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2).

Thanks to the strong performance of these teacher models, our model achieves excellent results on MTEB(cmn, v1).
|
|
|
We believe this model once again demonstrates the effectiveness of distillation learning.

In the future, we will train more bilingual embedding models based on a variety of strong embedding training methods.
|
|
|
## Methods |
|
|
|
### Stage1 |
|
|
|
We use [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) |
|
and [opencsg/chinese-fineweb-edu](https://huggingface.co/datasets/opencsg/chinese-fineweb-edu) |
|
as training data to distill from the three teacher models above.

In this stage, we use only a cosine loss.
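A minimal sketch of this kind of cosine-distillation objective. The function name and the random tensors are our own illustration, not the actual training code:

```python
import torch
import torch.nn.functional as F


def cosine_distill_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """Cosine loss: pull each student vector toward the matching teacher vector.

    student_emb: (batch, dim) embeddings from the student model
    teacher_emb: (batch, dim) embeddings from a teacher, treated as fixed targets
    """
    cos = F.cosine_similarity(student_emb, teacher_emb.detach(), dim=-1)
    return (1.0 - cos).mean()  # 0 when student and teacher directions coincide


# Toy example with random vectors standing in for real model outputs
student = torch.randn(4, 1792, requires_grad=True)
teacher = torch.randn(4, 1792)
loss = cosine_distill_loss(student, teacher)
loss.backward()  # gradients flow only into the student side
```

Because the teacher embeddings are detached, only the student receives gradient updates; with several teachers, a loss term per teacher can simply be summed.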
|
|
|
### Stage2 |
|
|
|
The objective of stage 2 is to reduce the embedding dimensionality.

We use the same training data as in stage 1 with a `similarity loss`. After stage 2, the output dimension of our model is 1792.
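One plausible reading of this stage, sketched below under our own assumptions: a learned linear projection cuts the dimension, trained so that the student's pairwise similarity matrix matches the teacher's. The layer sizes and the exact loss wiring are illustrative, not the released training code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, out_dim = 4096, 1792  # illustrative input size; out_dim matches the released model
proj = nn.Linear(hidden_dim, out_dim)


def similarity_loss(student_full: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """Match the student's pairwise cosine-similarity matrix (after projection)
    to the teacher's, so similarity structure survives the dimension cut."""
    s = F.normalize(proj(student_full), dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    return F.mse_loss(s @ s.T, (t @ t.T).detach())


# Toy batch: random vectors in place of real student/teacher embeddings
student_full = torch.randn(8, hidden_dim)
teacher_emb = torch.randn(8, out_dim)
loss = similarity_loss(student_full, teacher_emb)
loss.backward()  # updates the projection, not the teacher
```

Matching the similarity matrix rather than the raw vectors lets the projected 1792-dimensional space preserve the relative geometry that retrieval depends on.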
|
|
|
## Usage |
|
|
|
This model does not require instruction prefixes; you can use it directly with `SentenceTransformer`:
|
|
|
```python |
|
import os

# Optional: route downloads through a mirror if huggingface.co is unreachable
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
|
from sentence_transformers import SentenceTransformer |
|
|
|
text_encoder = SentenceTransformer("richinfoai/ritrieve_zh_v1") |
|
texts = [

    "什么是人工智能",  # "What is artificial intelligence?"

    "介绍一下主流的LLM",  # "Introduce the mainstream LLMs"

    # "AI is a computer system that simulates human intelligence and can perform tasks
    # such as learning, reasoning, and decision-making. It achieves automation through
    # algorithms and big data, and is widely used across industries."
    "人工智能(AI)是模拟人类智能的计算机系统,能够执行学习、推理和决策等任务。它通过算法和大数据实现自动化,广泛应用于各行各业。"

]
|
vectors = text_encoder.encode(texts, normalize_embeddings=True) |
|
print(vectors @ vectors.T) |
|
# [[0.9999999 0.67707014 0.91421044] |
|
# [0.67707014 0.9999998 0.6353945 ] |
|
# [0.91421044 0.6353945 1.0000001 ]] |
|
|
|
``` |