arxiv:2509.12541

zELO: ELO-inspired Training Method for Rerankers and Embedding Models

Published on Sep 16 · Submitted by Thomas Wolf on Sep 17

Abstract

We introduce a novel training methodology named zELO, which optimizes retrieval performance via the analysis that ranking tasks are statistically equivalent to a Thurstone model. Based on the zELO method, we use unsupervised data to train a suite of state-of-the-art open-weight reranker models: zerank-1 and zerank-1-small. These models achieve the highest retrieval scores in multiple domains, including finance, legal, code, and STEM, outperforming closed-source proprietary rerankers on both NDCG@10 and Recall. They also demonstrate strong versatility, maintaining their zero-shot performance on out-of-domain and private customer datasets. The training data comprised 112,000 queries with 100 documents per query, and the models were trained end-to-end from unannotated queries and documents in under 10,000 H100-hours.
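
For readers unfamiliar with the Thurstone model the abstract invokes, the standard formulation is below (textbook notation, not symbols from the paper); Elo replaces the normal CDF with a logistic curve, which is why the two views line up.

% Thurstone model (Case V): item i beats item j with probability given by
% the normal CDF of the latent-quality gap (\mu_i = latent quality of item i,
% \Phi = standard normal CDF):
P(i \succ j) = \Phi\left( \frac{\mu_i - \mu_j}{\sigma} \right)

% Elo's logistic counterpart, with ratings R_i on the usual 400-point scale:
P(i \succ j) = \frac{1}{1 + 10^{(R_j - R_i)/400}}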

Community

Presentation from the authors:

In chess, Elo scores predict who wins a face-off. @ZeroEntropy used the same trick for retrieval:

Let documents “battle” pairwise, turn the outcomes into Elo scores, then train a compact pointwise reranker.
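
A rough sketch of that loop in Python (the function name and the K constant are illustrative, not taken from the paper or the released models): pairwise battle outcomes are folded into per-document Elo ratings, which can then serve as soft relevance targets for a pointwise reranker.

from collections import defaultdict

K = 32.0  # illustrative Elo step size, not a value from the paper

def elo_from_battles(doc_ids, battles, rounds=10):
    """Fit per-document Elo ratings from pairwise outcomes.

    battles: list of (winner_id, loser_id) pairs, e.g. produced by an
    LLM judge comparing two documents for the same query.
    """
    rating = defaultdict(float)  # every document starts at 0
    for _ in range(rounds):  # several passes let the ratings settle
        for winner, loser in battles:
            # expected win probability for the winner under the logistic model
            expected = 1.0 / (1.0 + 10.0 ** ((rating[loser] - rating[winner]) / 400.0))
            rating[winner] += K * (1.0 - expected)
            rating[loser] -= K * (1.0 - expected)
    return {d: rating[d] for d in doc_ids}

Training then regresses a pointwise scorer f(query, doc) onto the resulting per-query Elo targets instead of onto binary relevance labels.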

We put everything in our new paper: the weights, the math, the RL loop, the ablations, the dead ends.

It’s out now on arXiv, and here is a sneak peek:

✅ What worked:
• The final reranker beat bigger, closed models on accuracy and latency.
• Pairwise → Elo → Pointwise gave better calibration than raw binary labels.
• Cycle sampling cut O(n²) pairwise inference cost to O(n) with no quality loss (see the sketch after this list).
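
One plausible reading of cycle sampling (an assumption on our part; the exact scheme is in the paper): judge only the pairs along a few random cycles through the candidate set, so every document stays connected in the comparison graph while the pairwise judge runs O(n) times instead of O(n²).

import random

def cycle_pairs(doc_ids, num_cycles=2, seed=0):
    """Sample O(n) comparison pairs per cycle instead of all O(n^2) pairs.

    Each cycle is a random permutation of the documents with consecutive
    entries paired (ends joined), so every document takes part in two
    comparisons per cycle and the comparison graph stays connected.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_cycles):
        order = list(doc_ids)
        rng.shuffle(order)
        for i in range(len(order)):
            pairs.append((order[i], order[(i + 1) % len(order)]))
    return pairs

# Example: 100 candidates with num_cycles=2 yields 200 judged pairs,
# versus 4,950 for exhaustive pairwise comparison.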

❌ What didn’t work:
• Human annotations: too sparse, too noisy. Pairwise with LLMs scaled better.
• Raw Elo alone: needed calibration across queries; otherwise scores drifted (one possible fix is sketched after this list).
• Listwise LLM reranking: slower, less consistent, and not worth the cost.
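
On the calibration point above: one simple form it could take (again an assumption, not necessarily the paper's method) is standardizing Elo ratings within each query before using them as targets, since raw Elo is only identified up to a per-query shift.

import statistics

def calibrate_per_query(elo_by_query):
    """Z-score Elo ratings within each query.

    elo_by_query: {query_id: {doc_id: elo_score}}. Standardizing makes
    the scores comparable across queries before they are used as
    training targets.
    """
    calibrated = {}
    for qid, scores in elo_by_query.items():
        values = list(scores.values())
        mu = statistics.fmean(values)
        sigma = statistics.pstdev(values) or 1.0  # guard against constant lists
        calibrated[qid] = {d: (s - mu) / sigma for d, s in scores.items()}
    return calibrated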
