Winnow HeLa Single Shot Probability Calibrator
Winnow recalibrates confidence scores and provides FDR control for de novo peptide sequencing (DNS) workflows.
This repository contains the calibrator trained on HeLa Single Shot data as referenced in our paper: De novo peptide sequencing rescoring and FDR estimation with Winnow.
- Intended inputs: spectrum input data and corresponding MS/MS PSM results produced by InstaNovo
- Outputs: calibrated per-PSM probabilities in the calibrated_confidence column.
What’s inside
- calibrator.pkl: trained classifier
- scaler.pkl: feature standardiser
- irt_predictor.pkl: Prosit iRT regressor used by RT features
How to use
Python
from pathlib import Path
from huggingface_hub import snapshot_download
from winnow.calibration.calibrator import ProbabilityCalibrator
from winnow.datasets.data_loaders import InstaNovoDatasetLoader
from winnow.scripts.main import filter_dataset
from winnow.fdr.nonparametric import NonParametricFDRControl
# 1) Download model files
helaqc_model = Path("helaqc_model")
snapshot_download(
    repo_id="InstaDeepAI/winnow-helaqc-model",
    allow_patterns=["*.pkl"],
    repo_type="model",
    local_dir=helaqc_model,
)
# 2) Load calibrator
calibrator = ProbabilityCalibrator.load(helaqc_model)
# 3) Load your dataset (InstaNovo-style config)
dataset = InstaNovoDatasetLoader().load(
    data_path="path_to_spectrum_data.parquet",
    predictions_path="path_to_instanovo_predictions.csv",
)
dataset = filter_dataset(dataset) # standard Winnow filtering
# 4) Predict calibrated confidences
calibrator.predict(dataset) # adds dataset.metadata["calibrated_confidence"]
# 5) Optional: FDR control on calibrated confidence
fdr = NonParametricFDRControl()
fdr.fit(dataset.metadata["calibrated_confidence"])
cutoff = fdr.get_confidence_cutoff(0.05) # 5% FDR cutoff
dataset.metadata["keep@5%"] = dataset.metadata["calibrated_confidence"] >= cutoff
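To make step 5 more concrete, here is a minimal sketch of how a confidence cutoff can be derived from calibrated probabilities. It assumes each calibrated confidence is the probability that the PSM is correct, so the expected error of a PSM is one minus its confidence. The function name estimate_confidence_cutoff is hypothetical, and this is an illustration of the idea only; it is not claimed to be how NonParametricFDRControl is implemented.

```python
def estimate_confidence_cutoff(confidences, fdr_threshold):
    """Return the lowest confidence whose accepted set has estimated FDR <= fdr_threshold.

    Assumption: each confidence is a calibrated probability that the PSM is
    correct, so the estimated FDR of a set is the mean of (1 - confidence).
    Returns None when no cutoff meets the threshold.
    """
    # Rank PSMs from most to least confident.
    ranked = sorted((float(c) for c in confidences), reverse=True)

    cutoff = None
    expected_errors = 0.0
    for k, conf in enumerate(ranked, start=1):
        expected_errors += 1.0 - conf
        # Estimated FDR when accepting the k most confident PSMs.
        if expected_errors / k <= fdr_threshold:
            cutoff = conf
        else:
            # The running estimate never decreases as lower-confidence PSMs
            # are added, so the first failure ends the search.
            break
    return cutoff


# Hypothetical confidences, for illustration only.
example = [0.99, 0.98, 0.95, 0.60, 0.30]
print(estimate_confidence_cutoff(example, 0.05))
```

PSMs with a confidence at or above the returned value would be kept. Ties at the cutoff are not treated specially in this sketch.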
CLI
# After `pip install winnow`
winnow predict \
--data-source instanovo \
--dataset-config-path config_with_dataset_paths.yaml \
--model-folder helaqc_model \
--method winnow \
--fdr-threshold 0.05 \
--confidence-column calibrated_confidence \
--output-path outputs/winnow_predictions.csv
Inputs and outputs
Required columns for calibration:
Spectrum data (*.parquet)
- spectrum_id (string): unique spectrum identifier
- sequence (string): ground truth peptide sequence from database search (optional)
- retention_time (float): retention time (seconds)
- precursor_mass (float): mass of the precursor ion (from MS1)
- mz_array (list[float]): mass-to-charge values of the MS2 spectrum
- intensity_array (list[float]): intensity values of the MS2 spectrum
- precursor_charge (int): charge of the precursor (from MS1)
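Before running calibration, it can help to confirm that a spectrum file carries the columns listed above. The following is a small sketch of such a check; the helper missing_spectrum_columns and the constant REQUIRED_SPECTRUM_COLUMNS are hypothetical names introduced here and are not part of Winnow. The sequence column is left out of the check because it is marked optional.

```python
# Column names taken from the spectrum data list above; `sequence` is optional.
REQUIRED_SPECTRUM_COLUMNS = (
    "spectrum_id",
    "retention_time",
    "precursor_mass",
    "mz_array",
    "intensity_array",
    "precursor_charge",
)


def missing_spectrum_columns(columns):
    """Return the required spectrum columns absent from `columns`, in list order."""
    present = set(columns)
    return [name for name in REQUIRED_SPECTRUM_COLUMNS if name not in present]


# Hypothetical column set, for illustration only.
print(missing_spectrum_columns(["spectrum_id", "mz_array", "intensity_array"]))
```

With pandas installed, the column names of a real file could be obtained with pd.read_parquet("path_to_spectrum_data.parquet").columns and passed to the helper.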
Beam predictions (*_beams.csv)
- spectrum_id (string)
- sequence (string): ground truth peptide sequence from database search (optional)
- preds (string): top prediction, untokenised sequence
- preds_tokenised (string): comma-separated tokens for the top prediction
- log_probs (float): top prediction log probability
- preds_beam_k (string): untokenised sequence for beam k (k≥0)
- log_probs_beam_k (float)
- token_log_probs_k (string/list-encoded): per-token log probabilities for beam k
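As a rough illustration of how two of these fields can be read, the sketch below turns a log probability into a probability and splits a comma-separated token string. Both helper names are hypothetical, the example values are made up, and the exact spacing around the commas in real files is an assumption. Exponentiation is the generic relation between a log probability and a probability; the sketch does not describe how Winnow itself consumes these columns.

```python
import math


def raw_confidence(log_prob):
    """Convert a log probability (<= 0) into a probability in (0, 1]."""
    return math.exp(log_prob)


def split_tokens(preds_tokenised):
    """Split a comma-separated token string, ignoring surrounding whitespace."""
    return [tok.strip() for tok in preds_tokenised.split(",") if tok.strip()]


# Hypothetical row values, for illustration only.
print(raw_confidence(-0.105))
print(split_tokens("P, E, P, T, I, D, E"))
```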
Output columns (added by Winnow's calibrator on predict):
- calibrated_confidence: calibrated probability
- Optional (if requested): psm_pep, psm_fdr, psm_qvalue
- All input columns are retained in-place
Training data
- This calibrator was trained on the HeLa single-shot dataset (PXD044934).
- All default features were enabled for the training of this model.
- Predictions were obtained using InstaNovo v1.1.1 with knapsack beam search set to 50 beams.
Citation
If you use winnow in your research, please cite our preprint: De novo peptide sequencing rescoring and FDR estimation with Winnow
@article{mabona2025novopeptidesequencingrescoring,
title = {De novo peptide sequencing rescoring and FDR estimation with Winnow},
author = {Amandla Mabona and Jemma Daniel and Henrik Servais Janssen Knudsen and
Rachel Catzel and Kevin Michael Eloff and Erwin M. Schoof and Nicolas
Lopez Carranza and Timothy P. Jenkins and Jeroen Van Goey and
Konstantinos Kalogeropoulos},
year = {2025},
eprint = {2509.24952},
archivePrefix = {arXiv},
primaryClass = {q-bio.QM},
url = {https://arxiv.org/abs/2509.24952},
}
If you use this calibrator trained on HeLa Single Shot data, please cite:
@misc{instadeep_ltd_2025,
author = { InstaDeep Ltd },
title = { winnow-helaqc-model (Revision b826cbb) },
year = 2025,
url = { https://huggingface.co/InstaDeepAI/winnow-helaqc-model },
doi = { 10.57967/hf/6612 },
publisher = { Hugging Face }
}
If you use the InstaNovo model to generate predictions, please also cite: InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale proteomics experiments
@article{eloff_kalogeropoulos_2025_instanovo,
title = {InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale
proteomics experiments},
author = {Eloff, Kevin and Kalogeropoulos, Konstantinos and Mabona, Amandla and Morell,
Oliver and Catzel, Rachel and Rivera-de-Torre, Esperanza and Berg Jespersen,
Jakob and Williams, Wesley and van Beljouw, Sam P. B. and Skwark, Marcin J.
and Laustsen, Andreas Hougaard and Brouns, Stan J. J. and Ljungars,
Anne and Schoof, Erwin M. and Van Goey, Jeroen and auf dem Keller, Ulrich and
Beguir, Karim and Lopez Carranza, Nicolas and Jenkins, Timothy P.},
year = 2025,
month = {Mar},
day = 31,
journal = {Nature Machine Intelligence},
doi = {10.1038/s42256-025-01019-5},
issn = {2522-5839},
url = {https://doi.org/10.1038/s42256-025-01019-5}
}
Contact
For issues with dataset structure or usage in Winnow, please open an issue on the Winnow GitHub: https://github.com/instadeepai/winnow