---
license: apache-2.0
language:
- en
base_model:
- intfloat/e5-base-unsupervised
pipeline_tag: sentence-similarity
---

# cadet-embed-base-v1

**cadet-embed-base-v1** is a BERT-base embedding model fine-tuned **from `intfloat/e5-base-unsupervised`** with

* **cross-encoder listwise distillation** (teachers: `RankT5-3B` and `BAAI/bge-reranker-v2.5-gemma2-lightweight`)
* **purely synthetic queries** (generated by Llama-3.1 8B: questions, claims, titles, keywords, and zero-shot & few-shot web queries)

over 400k passages in total from the MSMARCO, DBpedia, and Wikipedia corpora.

The result: highly effective BERT-base retrieval.

We provide our training code and the scripts used to generate the synthetic queries at https://github.com/manveertamber/cadet-dense-retrieval.

---

## Quick start

As with the E5 family, queries are prefixed with `query: ` and passages with `passage: `.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("manveertamber/cadet-embed-base-v1")

query = "query: capital of France"

passages = [
    "passage: Paris is the capital and largest city of France.",
    "passage: Berlin is known for its vibrant art scene.",
    "passage: The Eiffel Tower is located in Paris, France."
]

# Encode (normalized embeddings, so the dot product equals cosine similarity)
q_emb = model.encode(query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)  # shape (n_passages, dim)

scores = np.dot(p_embs, q_emb)  # shape (n_passages,)

# Rank passages by score
for passage, score in sorted(zip(passages, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}\t{passage}")
```

If you use this model, please cite:

```bibtex
@article{tamber2025conventionalcontrastivelearningfalls,
  title={Conventional Contrastive Learning Often Falls Short: Improving Dense Retrieval with Cross-Encoder Listwise Distillation and Synthetic Data},
  author={Manveer Singh Tamber and Suleman Kazi and Vivek Sourabh and Jimmy Lin},
  journal={arXiv:2505.19274},
  year={2025}
}
```
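
## Retrieval over a larger corpus (sketch)

For more than a handful of passages, the embeddings can be dropped into any vector index. Below is a minimal sketch using FAISS; the exact-search `IndexFlatIP` index and the toy corpus are illustrative choices, not part of the model or the training repository.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("manveertamber/cadet-embed-base-v1")

# Toy corpus; remember the "passage: " prefix
corpus = [
    "passage: Paris is the capital and largest city of France.",
    "passage: Berlin is known for its vibrant art scene.",
    "passage: The Eiffel Tower is located in Paris, France.",
]

# Encode and L2-normalize so inner product equals cosine similarity
doc_embs = model.encode(corpus, normalize_embeddings=True, batch_size=64)
doc_embs = np.asarray(doc_embs, dtype="float32")

# Exact inner-product index (an ANN index such as HNSW can be swapped in for large corpora)
index = faiss.IndexFlatIP(doc_embs.shape[1])
index.add(doc_embs)

# Search with the "query: " prefix
q_emb = model.encode(["query: capital of France"], normalize_embeddings=True)
q_emb = np.asarray(q_emb, dtype="float32")

scores, ids = index.search(q_emb, k=3)
for rank, (i, s) in enumerate(zip(ids[0], scores[0]), start=1):
    print(f"{rank}. {s:.3f}\t{corpus[i]}")
```

Because the embeddings are normalized, inner-product search here matches the cosine-style dot-product scoring in the quick start.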
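
## Listwise distillation at a glance (sketch)

For readers curious about the training objective named above, the snippet below sketches cross-encoder listwise distillation in its generic form: the student's query-passage scores are softmax-normalized over the candidate list and pulled toward the teacher's distribution with a KL-divergence loss. This is only an illustration of the technique; the actual training code, teacher-score handling, and hyperparameters are in the repository linked above.

```python
import torch
import torch.nn.functional as F

def listwise_distillation_loss(student_scores: torch.Tensor,
                               teacher_scores: torch.Tensor,
                               temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over a list of candidate passages per query.

    student_scores, teacher_scores: (batch, n_candidates) relevance scores,
    e.g. query-passage dot products for the student and cross-encoder
    scores for the teacher.
    """
    student_log_probs = F.log_softmax(student_scores / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Toy example: 2 queries, 4 candidate passages each (hypothetical scores)
student = torch.randn(2, 4, requires_grad=True)
teacher = torch.randn(2, 4)
loss = listwise_distillation_loss(student, teacher)
loss.backward()
```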