DeepTaxa: Hybrid CNN-BERT Model (April 2025)
DeepTaxa is a deep learning framework for hierarchical taxonomy classification of 16S rRNA gene sequences. This repository hosts the pre-trained hybrid CNN-BERT model, combining convolutional neural networks (CNNs) and BERT for high-accuracy predictions across seven taxonomic levels: domain, phylum, class, order, family, genus, and species.
Model Details
- Architecture: HybridCNNBERTClassifier (CNN + BERT)
- Tokenizer: zhihan1996/DNABERT-2-117M
- Training Data: Greengenes dataset (2024-09 split)
- Levels Predicted: 7 (Domain: 2 labels, Phylum: 106, Class: 244, Order: 630, Family: 1353, Genus: 4798, Species: 10547)
- Total Parameters: 72,635,154
- Max Sequence Length: 512
- Dropout Probability: 0.2
- License: MIT
- Version: April 2025
- File: deeptaxa_april_2025.pt
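As a rough sanity check on the figures above, the sketch below sums the label counts over the seven levels and estimates the size of the weights alone. It assumes the parameters are stored as 32-bit floats, which the model card does not state. The 871 MB checkpoint is roughly three times that estimate, which would be consistent with a file that also holds optimizer state, but the card does not say what the file contains.

```python
# Values copied from the model details list above
LABELS_PER_LEVEL = {
    "domain": 2, "phylum": 106, "class": 244, "order": 630,
    "family": 1353, "genus": 4798, "species": 10547,
}
TOTAL_PARAMETERS = 72_635_154
BYTES_PER_FLOAT32 = 4  # assumption: weights are stored as 32-bit floats

# Total number of distinct labels the model can emit across all levels
total_labels = sum(LABELS_PER_LEVEL.values())

# Size of the weights alone under the float32 assumption
fp32_bytes = TOTAL_PARAMETERS * BYTES_PER_FLOAT32

print(f"labels across all levels: {total_labels:,}")
print(f"species share of labels: {LABELS_PER_LEVEL['species'] / total_labels:.1%}")
print(f"float32 weights alone: {fp32_bytes / 1e6:.1f} MB")
```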
Usage
Download the Model
To get started, download the pre-trained model file deeptaxa_april_2025.pt from this repository:
- Manual Download: Visit https://huggingface.co/systems-genomics-lab/deeptaxa, click on the "Files and versions" tab, and download deeptaxa_april_2025.pt (871 MB).
- Command Line (wget):
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
- Command Line (git clone):
git clone https://huggingface.co/systems-genomics-lab/deeptaxa
cd deeptaxa
# The model file is now in the current directory
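The same download can be scripted from Python. The following is a minimal sketch using only the standard library. The helper names `model_url` and `download_model` are hypothetical and not part of DeepTaxa, and the URL pattern is simply the one used in the wget command above.

```python
import urllib.request
from pathlib import Path

REPO_ID = "systems-genomics-lab/deeptaxa"
FILENAME = "deeptaxa_april_2025.pt"


def model_url(repo_id=REPO_ID, filename=FILENAME, revision="main"):
    """Build the direct download URL, mirroring the wget command above."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_model(dest_dir=".", overwrite=False):
    """Download the checkpoint into dest_dir; skip if it is already there."""
    dest = Path(dest_dir) / FILENAME
    if dest.exists() and not overwrite:
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(model_url(), dest)  # large file (871 MB)
    return dest


# Only prints the URL; call download_model() explicitly to fetch the file.
print(model_url())
```

Calling `download_model("models")` would place the file under `models/` and return its path, which can then be passed to the CLI as the checkpoint.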
Run Predictions
Once downloaded, use the model with the DeepTaxa CLI:
python -m deeptaxa.cli predict \
--fasta-file /path/to/sequences.fna.gz \
--checkpoint deeptaxa_april_2025.pt
Full instructions are available on the GitHub repository (https://github.com/systems-genomics-lab/deeptaxa).
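When predictions are part of a larger Python pipeline, the CLI call above can be wrapped with `subprocess`. This is a sketch only: `build_predict_command` and `run_predict` are hypothetical helper names, and the wrapper uses just the two flags shown above, since no other options are documented here.

```python
import subprocess
import sys


def build_predict_command(fasta_file, checkpoint="deeptaxa_april_2025.pt"):
    """Assemble the documented CLI call as an argument list."""
    return [
        sys.executable, "-m", "deeptaxa.cli", "predict",
        "--fasta-file", str(fasta_file),
        "--checkpoint", str(checkpoint),
    ]


def run_predict(fasta_file, checkpoint="deeptaxa_april_2025.pt"):
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_predict_command(fasta_file, checkpoint), check=True)


# Show the command without executing it
cmd = build_predict_command("/path/to/sequences.fna.gz")
print(" ".join(cmd[1:]))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.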
Training Details
- Dataset: 161,866 training sequences, 40,467 validation sequences from Greengenes (gg_2024_09_training.fna.gz, gg_2024_09_training.tsv.gz)
- Hyperparameters:
  - Learning Rate: 0.0001
  - Batch Size: 16
  - Epochs: 10
  - Optimizer: AdamW (lr=0.0001, betas=[0.9, 0.999], weight_decay=0.01)
  - Focal Loss Gamma: 2.0
  - Level Weights: [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]
- Training Time: ~21 minutes (1,254 seconds) on NVIDIA A40 GPU
- Timestamp: Trained on 2025-04-04
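To illustrate how the focal loss gamma and the level weights might interact, the sketch below uses the standard focal loss form, -(1 - p)^gamma * log(p), where p is the probability assigned to the correct label. It assumes the per-level losses are multiplied by their level weights and summed. The source does not state the exact combination rule, and the probabilities are made up for illustration.

```python
import math

LEVELS = ["domain", "phylum", "class", "order", "family", "genus", "species"]
LEVEL_WEIGHTS = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]  # from the hyperparameter list
GAMMA = 2.0                                          # focal loss gamma from the list


def focal_loss(p_true, gamma=GAMMA):
    """Focal loss for one example, given the probability of the true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def hierarchical_loss(p_true_per_level, weights=LEVEL_WEIGHTS, gamma=GAMMA):
    """Weighted sum of per-level focal losses (assumed combination rule)."""
    return sum(w * focal_loss(p, gamma) for w, p in zip(weights, p_true_per_level))


# Hypothetical probabilities assigned to the correct label at each level
probs = [0.999, 0.99, 0.98, 0.97, 0.95, 0.90, 0.70]
for level, weight, p in zip(LEVELS, LEVEL_WEIGHTS, probs):
    print(f"{level:8s} weight={weight:.1f} "
          f"focal={focal_loss(p):.6f} cross_entropy={-math.log(p):.6f}")
print(f"total weighted loss: {hierarchical_loss(probs):.6f}")
```

With gamma = 2.0, confidently correct predictions contribute almost nothing, so under these assumptions the loss is dominated by hard examples at the heavily weighted genus and species levels.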
Performance
Validation metrics (on 40,467 sequences):
Level | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
Domain | 99.99% | 99.99% | 99.99% | 99.99% |
Phylum | 99.92% | 99.92% | 99.92% | 99.92% |
Class | 99.89% | 99.85% | 99.89% | 99.87% |
Order | 99.72% | 99.64% | 99.72% | 99.67% |
Family | 99.51% | 99.32% | 99.51% | 99.40% |
Genus | 98.33% | 97.89% | 98.33% | 98.01% |
Species | 95.29% | 94.34% | 95.29% | 94.56% |
- Training Loss: 0.283
- Validation Loss: 0.606
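In the table above, recall equals accuracy at every level, which is what support-weighted averaging over classes produces. Assuming that is how the metrics were computed (the source does not say), the following sketch shows the calculation on a toy example. The function name `weighted_metrics` and the labels are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall and F1, plus plain accuracy."""
    n = len(y_true)
    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = correct[label]
        p = tp / predicted[label] if predicted[label] else 0.0
        r = tp / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n         # class weight = share of true examples
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / n
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy genus-level labels, made up for illustration
y_true = ["Bacillus", "Bacillus", "Bacillus", "Vibrio", "Vibrio", "Nostoc"]
y_pred = ["Bacillus", "Bacillus", "Vibrio", "Vibrio", "Vibrio", "Bacillus"]
metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```

Under this averaging, F1 is a weighted mean of per-class F1 scores, so it need not equal the harmonic mean of the reported precision and recall.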
Intended Use
- Taxonomy classification in microbiome research and microbial ecology.
Limitations
- GPU recommended (trained on NVIDIA A40).
- Accuracy and precision are lowest at the species level (95.29% and 94.34%) due to the large number of labels (10,547 classes).
Citation
If you use this model in your research, please cite:
@software{DeepTaxa,
author = {{Systems Genomics Lab}},
title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
year = {2025},
publisher = {GitHub},
url = {https://github.com/systems-genomics-lab/deeptaxa},
}
Contact
Open an issue on GitHub (https://github.com/systems-genomics-lab/deeptaxa) for support.
Acknowledgements
- Dr. Olaitan I. Awe and the Omics Codeathon team for their mentorship and contributions.
- Hugging Face for providing a platform to host datasets and models.
- The High-Performance Computing Team of the School of Sciences and Engineering (SSE) at the American University in Cairo (AUC) for their support and for granting access to GPU resources that enabled this work.