zELO: ELO-inspired Training Method for Rerankers and Embedding Models
Abstract
We introduce a novel training methodology named zELO, which optimizes retrieval performance via the analysis that ranking tasks are statistically equivalent to a Thurstone model. Based on the zELO method, we use unsupervised data to train a suite of state-of-the-art open-weight reranker models: zerank-1 and zerank-1-small. These models achieve the highest retrieval scores in multiple domains, including finance, legal, code, and STEM, outperforming closed-source proprietary rerankers on both NDCG@10 and Recall. They also demonstrate strong versatility, maintaining their zero-shot performance on out-of-domain and private customer datasets. The models were trained end-to-end from unannotated queries and documents, using 112,000 queries with 100 documents per query, in under 10,000 H100-hours.
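The abstract's core observation is that ranking can be cast as a Thurstone model, where each document has a latent score and the probability that document i beats document j in a pairwise comparison is Φ(s_i − s_j). A minimal sketch of fitting such scores from pairwise outcomes by maximum likelihood (the input format and fitting details here are illustrative assumptions, not the paper's exact pipeline):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_thurstone(n_docs, pairs):
    """Fit latent scores s so that P(i beats j) = Phi(s_i - s_j).

    `pairs` is a hypothetical list of (winner, loser) document indices
    collected from pairwise comparisons.
    """
    def neg_log_likelihood(s):
        ll = 0.0
        for w, l in pairs:
            # Probability the observed winner beats the observed loser.
            ll += np.log(norm.cdf(s[w] - s[l]) + 1e-12)
        return -ll

    res = minimize(neg_log_likelihood, np.zeros(n_docs), method="L-BFGS-B")
    # Center the scores: the model is invariant to a constant shift.
    return res.x - res.x.mean()

# Toy example: doc 0 mostly beats doc 1, doc 1 mostly beats doc 2.
pairs = [(0, 1)] * 4 + [(1, 0)] + [(1, 2)] * 4 + [(2, 1)]
scores = fit_thurstone(3, pairs)
print(scores[0] > scores[1] > scores[2])  # monotone ordering recovered
```

The recovered pointwise scores can then serve as regression targets for a compact reranker, which is the Pairwise → Elo → Pointwise distillation the authors describe.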
Community
Presentation from the authors:
In chess, Elo scores predict who wins a face-off. @ZeroEntropy used the same trick for retrieval:
Let documents “battle” pairwise, turn the outcomes into Elo scores, then train a compact pointwise reranker.
We put everything in our new paper: the weights, the math, the RL loop, the ablations, the dead ends.
It’s out now on arXiv, and here is a sneak peek:
✅ What worked:
• The final reranker beat bigger, closed models on accuracy and latency.
• Pairwise → Elo → Pointwise gave better calibration than raw binary labels.
• Cycle sampling cut O(n²) pairwise inference cost to O(n) with no quality loss.
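The cycle-sampling point can be sketched as follows: instead of comparing all n(n−1)/2 pairs, string the documents into a few random cycles and compare only adjacent documents, so the number of comparisons grows linearly while the comparison graph stays connected. This is a hypothetical reconstruction of the idea; the paper's exact sampling scheme may differ.

```python
import random

def cycle_sample_pairs(n_docs, n_cycles=3, seed=0):
    """Sample O(n) comparison pairs by chaining documents into random cycles.

    Each of `n_cycles` random permutations contributes n adjacent pairs
    (including the pair that closes the loop), so the total is
    n_cycles * n_docs instead of n_docs * (n_docs - 1) / 2.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_cycles):
        order = list(range(n_docs))
        rng.shuffle(order)
        pairs.extend((order[i], order[(i + 1) % n_docs]) for i in range(n_docs))
    return pairs

pairs = cycle_sample_pairs(100, n_cycles=3)
print(len(pairs))  # 300 pairs instead of 4950
```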
❌ What didn’t work:
• Human annotations: too sparse, too noisy. Pairwise with LLMs scaled better.
• Raw Elo alone: needed calibration across queries; otherwise scores drifted.
• Listwise LLM reranking: slower, less consistent, and not worth the cost.
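On the calibration point above: Elo-style scores fitted independently per query live on different scales, so one plausible fix is to standardize scores within each query before pooling them into training targets. This is a minimal sketch of one such calibration, not necessarily the scheme the paper uses:

```python
import numpy as np

def calibrate_scores(per_query_scores):
    """Standardize Elo-style scores within each query.

    Maps each query's scores to zero mean and unit variance so that
    targets are comparable across queries (one plausible calibration;
    the paper's exact approach may differ).
    """
    out = {}
    for qid, scores in per_query_scores.items():
        s = np.asarray(scores, dtype=float)
        out[qid] = (s - s.mean()) / (s.std() + 1e-8)
    return out

# Two queries whose raw Elo scales differ by orders of magnitude
# end up on the same standardized scale.
cal = calibrate_scores({"q1": [1500.0, 1600.0, 1400.0], "q2": [0.0, 1.0, -1.0]})
print(cal["q1"], cal["q2"])  # identical standardized values
```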
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- BiXSE: Improving Dense Retrieval via Probabilistic Graded Relevance Distillation (2025)
- How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models (2025)
- ERank: Fusing Supervised Fine-Tuning and Reinforcement Learning for Effective and Efficient Text Reranking (2025)
- DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation (2025)
- From Ranking to Selection: A Simple but Efficient Dynamic Passage Selector for Retrieval Augmented Generation (2025)
- RRRA: Resampling and Reranking through a Retriever Adapter (2025)
- Granite Embedding R2 Models (2025)