|
--- |
|
datasets: |
|
- BAAI/Infinity-Instruct |
|
- opencsg/chinese-fineweb-edu |
|
language: |
|
- zh |
|
pipeline_tag: sentence-similarity |
|
library_name: sentence-transformers |
|
license: mit |
|
--- |
|
## Introduction |
|
|
|
This model was trained by [richinfoai](https://www.richinfo.cn/). |
|
Following the approach of the [Stella and Jasper models](https://arxiv.org/pdf/2412.19048), we performed distillation training from

[lier007/xiaobu-embedding-v2](https://huggingface.co/lier007/xiaobu-embedding-v2),

[dunzhang/stella-large-zh-v3-1792d](https://huggingface.co/dunzhang/stella-large-zh-v3-1792d)

and [BAAI/bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2).

Thanks to the strong performance of these teacher models, our model achieves excellent results on MTEB(cmn, v1).
|
|
|
We believe this model once again demonstrates the effectiveness of distillation learning.

In the future, we will train more bilingual embedding models based on a variety of strong embedding training methods.
|
|
|
## Methods |
|
|
|
### Stage1 |
|
|
|
We use [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) |
|
and [opencsg/chinese-fineweb-edu](https://huggingface.co/datasets/opencsg/chinese-fineweb-edu) |
|
as training data to distill from the three teacher models above.

In this stage, we use only a cosine loss.
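A minimal sketch of this kind of cosine-distillation objective. The function name and the random tensors are our own illustration, not the actual training code:

```python
import torch
import torch.nn.functional as F


def cosine_distill_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """Cosine loss: pull each student vector toward the matching teacher vector.

    student_emb: (batch, dim) embeddings from the student model
    teacher_emb: (batch, dim) embeddings from a teacher, treated as fixed targets
    """
    cos = F.cosine_similarity(student_emb, teacher_emb.detach(), dim=-1)
    return (1.0 - cos).mean()  # 0 when student and teacher directions coincide


# Toy example with random vectors standing in for real model outputs
student = torch.randn(4, 1792, requires_grad=True)
teacher = torch.randn(4, 1792)
loss = cosine_distill_loss(student, teacher)
loss.backward()  # gradients flow only into the student side
```

Because the teacher embeddings are detached, only the student receives gradient updates; with several teachers, a loss term per teacher can simply be summed.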
|
|
|
### Stage2 |
|
|
|
The objective of stage 2 is to reduce the embedding dimensionality.

We use the same training data as in stage 1 with a `similarity loss`. After stage 2, the output dimension of our model is 1792.
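One plausible reading of this stage, sketched below under our own assumptions: a learned linear projection cuts the dimension, trained so that the student's pairwise similarity matrix matches the teacher's. The layer sizes and the exact loss wiring are illustrative, not the released training code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, out_dim = 4096, 1792  # illustrative input size; out_dim matches the released model
proj = nn.Linear(hidden_dim, out_dim)


def similarity_loss(student_full: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """Match the student's pairwise cosine-similarity matrix (after projection)
    to the teacher's, so similarity structure survives the dimension cut."""
    s = F.normalize(proj(student_full), dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    return F.mse_loss(s @ s.T, (t @ t.T).detach())


# Toy batch: random vectors in place of real student/teacher embeddings
student_full = torch.randn(8, hidden_dim)
teacher_emb = torch.randn(8, out_dim)
loss = similarity_loss(student_full, teacher_emb)
loss.backward()  # updates the projection, not the teacher
```

Matching the similarity matrix rather than the raw vectors lets the projected 1792-dimensional space preserve the relative geometry that retrieval depends on.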
|
|
|
## Usage |
|
|
|
This model does not require instruction prefixes; you can use it directly with `SentenceTransformer`:
|
|
|
```python |
|
import os

# Optional: route downloads through a mirror if huggingface.co is unreachable
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
|
from sentence_transformers import SentenceTransformer |
|
|
|
text_encoder = SentenceTransformer("richinfoai/ritrieve_zh_v1") |
|
texts = [

    "什么是人工智能",  # "What is artificial intelligence?"

    "介绍一下主流的LLM",  # "Introduce the mainstream LLMs"

    # "AI is a computer system that simulates human intelligence and can perform tasks
    # such as learning, reasoning, and decision-making. It achieves automation through
    # algorithms and big data, and is widely used across industries."
    "人工智能(AI)是模拟人类智能的计算机系统,能够执行学习、推理和决策等任务。它通过算法和大数据实现自动化,广泛应用于各行各业。"

]
|
vectors = text_encoder.encode(texts, normalize_embeddings=True) |
|
print(vectors @ vectors.T) |
|
# [[0.9999999 0.67707014 0.91421044] |
|
# [0.67707014 0.9999998 0.6353945 ] |
|
# [0.91421044 0.6353945 1.0000001 ]] |
|
|
|
``` |