DeepTaxa: Hybrid CNN-BERT Model (April 2025)

DeepTaxa is a deep learning framework for hierarchical taxonomy classification of 16S rRNA gene sequences. This repository hosts the pre-trained hybrid CNN-BERT model, which combines a convolutional neural network (CNN) with a BERT encoder to make high-accuracy predictions across seven taxonomic levels: domain, phylum, class, order, family, genus, and species.

Model Details

  • Architecture: HybridCNNBERTClassifier (CNN + BERT)
  • Tokenizer: zhihan1996/DNABERT-2-117M
  • Training Data: Greengenes dataset (2024-09 split)
  • Levels Predicted: 7 (Domain: 2 labels, Phylum: 106, Class: 244, Order: 630, Family: 1,353, Genus: 4,798, Species: 10,547)
  • Total Parameters: 72,635,154
  • Max Sequence Length: 512
  • Dropout Probability: 0.2
  • License: MIT
  • Version: April 2025
  • File: deeptaxa_april_2025.pt
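
As a quick check that the tokenizer listed above is available locally, a 16S sequence can be tokenized to the model's 512-token limit with the Hugging Face transformers library. This is an illustration only, not the DeepTaxa API; the example sequence is arbitrary:

from transformers import AutoTokenizer

# Load the DNABERT-2 tokenizer used by this checkpoint.
tokenizer = AutoTokenizer.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True
)

# Arbitrary 16S rRNA fragment, for illustration only.
sequence = "AGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGA"

encoded = tokenizer(
    sequence,
    truncation=True,
    max_length=512,  # matches the model's max sequence length
    return_tensors="pt",
)
print(encoded["input_ids"].shape)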

Usage

Download the Model

To get started, download the pre-trained model file deeptaxa_april_2025.pt from this repository:

  • Manual Download: Visit https://huggingface.co/systems-genomics-lab/deeptaxa, click on the "Files and versions" tab, and download deeptaxa_april_2025.pt (871 MB).
  • Command Line (wget):
    wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
    
  • Command Line (git clone):
    # Hugging Face stores large files with Git LFS, so install it before cloning
    git lfs install
    git clone https://huggingface.co/systems-genomics-lab/deeptaxa
    cd deeptaxa
    # deeptaxa_april_2025.pt is now in the current directory
    
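  • Python (huggingface_hub) — an alternative sketch using the huggingface_hub client library, which downloads into the local Hugging Face cache by default:
    from huggingface_hub import hf_hub_download

    # Download deeptaxa_april_2025.pt into the local Hugging Face cache.
    checkpoint_path = hf_hub_download(
        repo_id="systems-genomics-lab/deeptaxa",
        filename="deeptaxa_april_2025.pt",
    )
    print(checkpoint_path)  # local path to pass to the DeepTaxa CLI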

Run Predictions

Once downloaded, use the model with the DeepTaxa CLI:

python -m deeptaxa.cli predict \
  --fasta-file /path/to/sequences.fna.gz \
  --checkpoint deeptaxa_april_2025.pt

Full instructions are available in the GitHub repository.
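
The checkpoint itself can also be inspected with plain PyTorch, for example to confirm the download. This is a minimal sketch assuming deeptaxa_april_2025.pt is a standard torch.save archive; what it actually contains is not documented in this card:

import torch

# Load the archive on CPU. weights_only=False permits non-tensor objects;
# recent PyTorch versions default to a stricter setting.
checkpoint = torch.load("deeptaxa_april_2025.pt", map_location="cpu", weights_only=False)

# If the archive is a dictionary, list its top-level keys to see what it provides.
if isinstance(checkpoint, dict):
    for key in checkpoint:
        print(key)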

Training Details

  • Dataset: 161,866 training sequences, 40,467 validation sequences from Greengenes (gg_2024_09_training.fna.gz, gg_2024_09_training.tsv.gz)
  • Hyperparameters:
    • Learning Rate: 0.0001
    • Batch Size: 16
    • Epochs: 10
    • Optimizer: AdamW (lr=0.0001, betas=[0.9, 0.999], weight_decay=0.01)
    • Focal Loss Gamma: 2.0
    • Level Weights: [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0] (combined with the focal term as sketched after this list)
  • Training Time: ~21 minutes (1,254 seconds) on an NVIDIA A40 GPU
  • Training Date: 2025-04-04
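
The focal loss and level weights above combine into a single training objective: one focal term per taxonomic level, scaled by that level's weight and summed. The snippet below is a minimal sketch of that idea, not the exact DeepTaxa implementation (which lives in the GitHub repository); tensor shapes and function names are illustrative:

import torch
import torch.nn.functional as F

LEVEL_WEIGHTS = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]  # domain ... species
GAMMA = 2.0

def focal_loss(logits, targets, gamma=GAMMA):
    # Standard focal loss built on cross-entropy: down-weight easy examples.
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)  # model's probability for the true class
    return ((1.0 - pt) ** gamma * ce).mean()

def multi_level_loss(per_level_logits, per_level_targets):
    # per_level_logits: one [batch, n_classes] tensor per taxonomic level
    # per_level_targets: one [batch] label tensor per taxonomic level
    total = torch.zeros(())
    for weight, logits, targets in zip(LEVEL_WEIGHTS, per_level_logits, per_level_targets):
        total = total + weight * focal_loss(logits, targets)
    return total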

Performance

Validation metrics (on 40,467 sequences):

Level     Accuracy   Precision   Recall    F1-Score
Domain    99.99%     99.99%      99.99%    99.99%
Phylum    99.92%     99.92%      99.92%    99.92%
Class     99.89%     99.85%      99.89%    99.87%
Order     99.72%     99.64%      99.72%    99.67%
Family    99.51%     99.32%      99.51%    99.40%
Genus     98.33%     97.89%      98.33%    98.01%
Species   95.29%     94.34%      95.29%    94.56%

  • Training Loss: 0.283
  • Validation Loss: 0.606
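
Whether the scores above are macro- or weighted-averaged is not stated in this card. As an illustration of how the four columns can be computed for a single level (not the exact DeepTaxa evaluation code), this sketch uses scikit-learn with weighted averaging and dummy labels:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Dummy genus-level labels, for illustration only.
y_true = ["g__Bacillus", "g__Bacillus", "g__Escherichia", "g__Pseudomonas"]
y_pred = ["g__Bacillus", "g__Escherichia", "g__Escherichia", "g__Pseudomonas"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")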

Intended Use

  • Taxonomic classification of 16S rRNA gene sequences in microbiome research and microbial ecology.

Limitations

  • A GPU is recommended for inference (the model was trained on an NVIDIA A40).
  • Precision is lower at the species level because of the large label space (10,547 classes).

Citation

If you use this model in your research, please cite:

@software{DeepTaxa,
  author = {{Systems Genomics Lab}},
  title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/systems-genomics-lab/deeptaxa},
}

Contact

Open an issue on GitHub for support.
