DeepTaxa: Hybrid CNN-BERT Model (April 2025)

DeepTaxa is a deep learning framework for hierarchical taxonomy classification of 16S rRNA gene sequences. This repository hosts the pre-trained hybrid CNN-BERT model, combining convolutional neural networks (CNNs) and BERT for high-accuracy predictions across seven taxonomic levels: domain, phylum, class, order, family, genus, and species.

Model Details

Architecture: HybridCNNBERTClassifier (CNN + BERT)
Tokenizer: zhihan1996/DNABERT-2-117M
Training Data: Greengenes dataset (2024-09 split)
Levels Predicted: 7 (Domain: 2 labels, Phylum: 106, Class: 244, Order: 630, Family: 1353, Genus: 4798, Species: 10547)
Total Parameters: 72,635,154
Max Sequence Length: 512
Dropout Probability: 0.2
License: MIT
Version: April 2025
File: deeptaxa_april_2025.pt
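The parameter count above can be related to the 871 MB file size quoted in the download instructions with a little arithmetic. The sketch below assumes float32 storage and decimal megabytes; the reading that the checkpoint also carries AdamW optimizer state is a guess that happens to fit the numbers, not something the model card confirms.

```python
# Back-of-the-envelope size check for the checkpoint.
# Assumptions (not stated in the model card): float32 weights, MB = 10**6 bytes.
TOTAL_PARAMS = 72_635_154   # "Total Parameters" listed above
BYTES_PER_FLOAT32 = 4

weights_mb = TOTAL_PARAMS * BYTES_PER_FLOAT32 / 1e6

# AdamW keeps two moment estimates per parameter, so a checkpoint that also
# stores optimizer state holds roughly three float32 values per parameter.
weights_plus_adamw_mb = 3 * weights_mb

print(f"weights only:          {weights_mb:.1f} MB")
print(f"weights + AdamW state: {weights_plus_adamw_mb:.1f} MB")
```

Under these assumptions the weights alone come to about 290.5 MB and weights plus optimizer state to about 871.6 MB, so the model itself may need considerably less memory than the file size suggests.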

Usage

Download the Model

To get started, download the pre-trained model file deeptaxa_april_2025.pt from this repository:

Manual Download: Visit https://huggingface.co/systems-genomics-lab/deeptaxa, click on the "Files and versions" tab, and download deeptaxa_april_2025.pt (871 MB).

Command Line (wget):

wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt

Command Line (git clone):

git clone https://huggingface.co/systems-genomics-lab/deeptaxa
cd deeptaxa
# The model file is now in the current directory
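For scripted setups, the same download can be done from Python. This is a minimal sketch using only the standard library and the URL from the wget command above; the function names `model_url` and `download_model` are illustrative and not part of DeepTaxa.

```python
import urllib.request
from pathlib import Path

REPO_URL = "https://huggingface.co/systems-genomics-lab/deeptaxa"
MODEL_FILE = "deeptaxa_april_2025.pt"


def model_url(revision: str = "main") -> str:
    """Build the direct-download URL used by the wget command."""
    return f"{REPO_URL}/resolve/{revision}/{MODEL_FILE}"


def download_model(dest_dir: str = ".") -> Path:
    """Download the checkpoint into dest_dir unless it is already there."""
    dest = Path(dest_dir) / MODEL_FILE
    if dest.exists():
        return dest
    # Write to a temporary name first so an interrupted download is never
    # mistaken for a complete checkpoint on the next run.
    tmp = dest.with_name(dest.name + ".part")
    urllib.request.urlretrieve(model_url(), tmp)
    tmp.replace(dest)
    return dest


# Usage (downloads 871 MB on first call):
#     checkpoint_path = download_model()
```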

Run Predictions

Once downloaded, use the model with the DeepTaxa CLI:

python -m deeptaxa.cli predict \
  --fasta-file /path/to/sequences.fna.gz \
  --checkpoint deeptaxa_april_2025.pt

Full instructions are available in the GitHub repository: https://github.com/systems-genomics-lab/deeptaxa
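The prediction command can also be driven from a Python script, for example when classifying several FASTA files in a loop. The sketch below uses only the two flags shown in the command above and assumes the deeptaxa package is installed in the current interpreter; `build_predict_command` and `run_predict` are hypothetical helper names, not DeepTaxa functions.

```python
import subprocess
import sys


def build_predict_command(fasta_file: str, checkpoint: str) -> list[str]:
    """Assemble the `deeptaxa.cli predict` call as an argument list."""
    return [
        sys.executable, "-m", "deeptaxa.cli", "predict",
        "--fasta-file", fasta_file,
        "--checkpoint", checkpoint,
    ]


def run_predict(fasta_file: str, checkpoint: str) -> subprocess.CompletedProcess:
    """Run the CLI; raises CalledProcessError if it exits with an error."""
    return subprocess.run(build_predict_command(fasta_file, checkpoint), check=True)


# Usage:
#     run_predict("/path/to/sequences.fna.gz", "deeptaxa_april_2025.pt")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.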

Training Details

Dataset: 161,866 training sequences, 40,467 validation sequences from Greengenes (gg_2024_09_training.fna.gz, gg_2024_09_training.tsv.gz)
Hyperparameters:
- Learning Rate: 0.0001
- Batch Size: 16
- Epochs: 10
- Optimizer: AdamW (lr=0.0001, betas=[0.9, 0.999], weight_decay=0.01)
- Focal Loss Gamma: 2.0
- Level Weights: [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]
Training Time: ~21 minutes (1,254 seconds) on NVIDIA A40 GPU
Training Date: 2025-04-04
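To make the Focal Loss Gamma and Level Weights settings concrete, here is a minimal sketch of how a focal loss and per-level weights could combine for a single sequence. It is written in plain Python so it runs without PyTorch. How DeepTaxa actually combines the levels (for example, a weighted sum versus a normalized average, or an additional class-balancing term) is not stated here, so the weighted sum below is an assumption.

```python
import math

GAMMA = 2.0
# One weight per level, ordered domain -> species, as listed above.
LEVEL_WEIGHTS = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]


def focal_loss(p_true: float, gamma: float = GAMMA) -> float:
    """Focal loss for one prediction.

    p_true is the predicted probability of the correct class, in (0, 1].
    The (1 - p_true)**gamma factor shrinks the loss of confident, correct
    predictions so that hard examples dominate training.
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def hierarchical_loss(p_true_per_level, weights=LEVEL_WEIGHTS, gamma=GAMMA):
    """Weighted sum of per-level focal losses (assumed combination rule)."""
    if len(p_true_per_level) != len(weights):
        raise ValueError("expected one probability per taxonomic level")
    return sum(w * focal_loss(p, gamma) for w, p in zip(weights, p_true_per_level))


# Example: confident at high ranks, uncertain at genus and species.
probs = [0.99, 0.99, 0.98, 0.97, 0.95, 0.80, 0.60]
print(hierarchical_loss(probs))
```

With gamma set to 0 the focal term reduces to ordinary cross-entropy. The increasing weights make errors at the genus and species levels count more than errors at the domain level.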

Performance

Validation metrics (on 40,467 sequences):

Level	Accuracy	Precision	Recall	F1-Score
Domain	99.99%	99.99%	99.99%	99.99%
Phylum	99.92%	99.92%	99.92%	99.92%
Class	99.89%	99.85%	99.89%	99.87%
Order	99.72%	99.64%	99.72%	99.67%
Family	99.51%	99.32%	99.51%	99.40%
Genus	98.33%	97.89%	98.33%	98.01%
Species	95.29%	94.34%	95.29%	94.56%

Training Loss: 0.283
Validation Loss: 0.606
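The table does not state how precision, recall, and F1 are averaged across classes. Recall equals accuracy in every row, which is what support-weighted averaging produces, so that is a plausible reading. The sketch below computes the four metrics under that assumption on a toy example; `weighted_metrics` is an illustrative helper, not DeepTaxa code.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = correct[label]
        p = tp / predicted[label] if predicted[label] else 0.0
        r = tp / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n         # class weight = share of true examples
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example with two classes.
metrics = weighted_metrics(["a", "a", "a", "b"], ["a", "a", "b", "b"])
print(metrics)
```

Weighted recall sums the correct predictions of each class and divides by the total number of examples, which is exactly the definition of accuracy. Weighted precision and F1 can differ from it, as they do in the table.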

Intended Use

Taxonomy classification in microbiome research and microbial ecology.

Limitations

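A GPU is recommended; the model was trained on an NVIDIA A40.
Species-level predictions are less reliable than those at higher ranks (95.29% accuracy and 94.34% precision on the validation set), likely because of the large label space (10,547 classes).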

Citation

If you use this model in your research, please cite:

@software{DeepTaxa,
  author = {{Systems Genomics Lab}},
  title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/systems-genomics-lab/deeptaxa},
}

Contact

For support, open an issue on GitHub: https://github.com/systems-genomics-lab/deeptaxa/issues

Acknowledgements

Dr. Olaitan I. Awe and the Omics Codeathon team for their mentorship and contributions.
Hugging Face for providing a platform to host datasets and models.
The High-Performance Computing Team of the School of Sciences and Engineering (SSE) at the American University in Cairo (AUC) for their support and for granting access to GPU resources that enabled this work.
