UME: Universal Molecular Encoder
UME is a multi-modal transformer model for encoding molecular sequences across different modalities:
- Proteins (amino acid sequences)
- Chemicals (SMILES strings)
- DNA/RNA (nucleotide sequences)
This model uses a unified architecture to understand and encode molecular information across these different domains.
Quick Start
from transformers import AutoModel, AutoTokenizer
# Load model and tokenizer
model = AutoModel.from_pretrained("karina-zadorozhny/ume-base", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("karina-zadorozhny/ume-base", trust_remote_code=True)
# Example protein sequence
protein = "MKTVRQERLKSIVRILERSKEPVSGAQL"
inputs = tokenizer(protein, modality="amino_acid", return_tensors="pt", padding=True)
outputs = model(**inputs)
# Get embeddings
embeddings = outputs.last_hidden_state[:, 0] # CLS token embedding
print(f"Protein embedding shape: {embeddings.shape}")
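For pure embedding extraction, it is standard PyTorch practice (not UME-specific) to switch the model to eval mode and disable gradient tracking:

import torch

# Standard inference setup: disable dropout and gradient tracking
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state[:, 0]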
Multi-Modal Usage
Protein Sequences
protein = "MKTVRQERLKSIVRILERSKEPVSGAQL"
inputs = tokenizer([protein], modality="amino_acid", return_tensors="pt")
outputs = model(**inputs)
Chemical Structures (SMILES)
aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
inputs = tokenizer([aspirin], modality="SMILES", return_tensors="pt")
outputs = model(**inputs)
DNA/RNA Sequences
dna = "ATGCGTACGTAGCTACGTACG"
inputs = tokenizer([dna], modality="nucleotide", return_tensors="pt")
outputs = model(**inputs)
Automatic Modality Detection
# UME can automatically detect the modality
sequences = [
    "MKTVRQERLKSIVRILERSKEPVSGAQL",  # protein
    "CC(=O)OC1=CC=CC=C1C(=O)O",      # SMILES
    "ATGCGTACGTAGCTACGTACG",         # DNA
]
for seq in sequences:
    inputs = tokenizer([seq], modality=None, return_tensors="pt")  # Auto-detect
    outputs = model(**inputs)
    print(f"Sequence: {seq[:20]}...")
    print(f"Embedding: {outputs.last_hidden_state.shape}")
Use Cases
Molecular Similarity
import torch

def get_embedding(sequence, modality=None):
    inputs = tokenizer([sequence], modality=modality, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0]  # CLS token

# Compare two proteins
protein1 = "MKTVRQERLKSIVRILERSKEPVSGAQL"
protein2 = "MKTVRQERLKSIVRILERSKEPVSGAQT"  # Differs by one residue
emb1 = get_embedding(protein1, "amino_acid")
emb2 = get_embedding(protein2, "amino_acid")

# Cosine similarity
similarity = torch.nn.functional.cosine_similarity(emb1, emb2)
print(f"Protein similarity: {similarity.item():.3f}")
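Because the model card also lists cross-modal retrieval as a use case, the same helper can compare sequences from different modalities. The aspirin SMILES below is only an illustrative query, and raw cross-modal cosine scores may need task-specific calibration:

# Cross-modal comparison with the same helper (illustrative sketch)
aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
protein_emb = get_embedding(protein1, "amino_acid")
smiles_emb = get_embedding(aspirin, "SMILES")
cross_sim = torch.nn.functional.cosine_similarity(protein_emb, smiles_emb)
print(f"Protein vs. aspirin similarity: {cross_sim.item():.3f}")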
Batch Processing
# Process multiple sequences at once
sequences = [
    "MKTVRQERLKSIVRILERSKEPVSGAQL",
    "AVKTVRQERLKSIVRILERSKEPVSGAQL",
    "MKTVRQERLKSIVRILERSKEPVSGAQL",
]
inputs = tokenizer(sequences, modality="amino_acid", return_tensors="pt", padding=True)
outputs = model(**inputs)
batch_embeddings = outputs.last_hidden_state[:, 0] # Shape: [batch_size, hidden_size]
print(f"Batch embeddings shape: {batch_embeddings.shape}")
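Because padded positions also receive hidden states, a mean-pooled alternative to the CLS embedding should mask them out. This is a minimal sketch assuming the tokenizer returns a standard attention_mask:

import torch

# Mean-pool token embeddings while ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1).float()       # [batch, seq_len, 1]
summed = (outputs.last_hidden_state * mask).sum(dim=1)      # [batch, hidden_size]
mean_embeddings = summed / mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
print(f"Mean-pooled batch embeddings: {mean_embeddings.shape}")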
Model Architecture
- Base Architecture: ModernBERT-based transformer
- Model Size: 11.3M parameters (mini variant)
- Hidden Size: 384
- Layers: 6
- Attention Heads: 6
- Max Sequence Length: 8,192 tokens
- Vocabulary Size: 100 tokens (unified across modalities)
- Position Embeddings: RoPE (Rotary Position Embeddings)
- Attention: Local-Global Alternating Attention with sliding window
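These hyperparameters can also be read directly from the checkpoint's configuration. The attribute names below follow the usual Hugging Face conventions (hidden_size, num_hidden_layers, ...); UME's custom config class may expose slightly different names:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("karina-zadorozhny/ume-base", trust_remote_code=True)
# Standard Hugging Face attribute names; adjust if UME's config differs
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)
print(config.vocab_size, getattr(config, "max_position_embeddings", None))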
Model Variants
| Model | Parameters | Layers | Hidden Size | Use Case |
|---|---|---|---|---|
| UME-mini | 11.3M | 6 | 384 | Fast inference, prototyping |
| UME-small | ~90M | 12 | 768 | Balanced performance |
| UME-medium | ~480M | 24 | 1280 | High accuracy |
| UME-large | ~740M | 24 | 1600 | Best performance |
Currently only UME-mini is available publicly.
Advanced Usage
Masked Language Modeling
from transformers import AutoModelForMaskedLM
# Load for MLM
mlm_model = AutoModelForMaskedLM.from_pretrained("karina-zadorozhny/ume-base", trust_remote_code=True)
# Protein with masked token
protein_masked = "MKTVRQERLK<mask>IVRILERSKEPVSGAQL"
inputs = tokenizer([protein_masked], modality="amino_acid", return_tensors="pt")
# Predict masked token
outputs = mlm_model(**inputs)
logits = outputs.logits
# Get predictions for masked position
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = logits[0, mask_token_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print(f"Predicted amino acid: {predicted_token}")
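To inspect more than the single best prediction, the logits at the masked position can be converted to probabilities and the top-k candidates decoded. This is a hedged follow-up; the decoded token strings depend on UME's custom vocabulary:

import torch

# Top-5 candidate tokens at the masked position
probs = torch.softmax(logits[0, mask_token_index], dim=-1)  # [num_masks, vocab_size]
top_probs, top_ids = probs.topk(5, dim=-1)
for token_id, p in zip(top_ids[0], top_probs[0]):
    print(f"{tokenizer.decode([int(token_id)])}: {p.item():.3f}")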
Feature Extraction
# Extract features from all layers
outputs = model(**inputs, output_hidden_states=True)
# Get representations from each layer
all_layer_embeddings = outputs.hidden_states # Tuple of [batch_size, seq_len, hidden_size] tensors (embedding layer + each transformer layer)
# Different layers can be useful for different purposes
early_features = all_layer_embeddings[2] # Earlier layer - more local, low-level patterns
late_features = all_layer_embeddings[-1] # Last layer - more abstract, task-relevant features
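A common feature-extraction heuristic (not specific to UME) is to average the last few hidden layers rather than taking only the final one:

import torch

# Average the last four hidden states for a smoother token representation
stacked = torch.stack(all_layer_embeddings[-4:], dim=0)  # [4, batch, seq_len, hidden]
pooled_layers = stacked.mean(dim=0)                      # [batch, seq_len, hidden]
sequence_embedding = pooled_layers[:, 0]                 # CLS position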
Training Data
UME was trained on a large corpus of molecular sequences including:
- Protein sequences from UniProt
- Chemical structures from ChEMBL
- DNA/RNA sequences from various genomic databases
Performance
UME achieves competitive performance on various molecular understanding tasks:
- Protein function prediction
- Chemical property prediction
- Molecular similarity search
- Cross-modal molecular retrieval
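As a concrete sketch of similarity search, the snippet below ranks a few candidate SMILES strings against a query by cosine similarity. The candidate molecules are arbitrary illustrative examples, not benchmark data from the model card:

import torch

query = "CC(=O)OC1=CC=CC=C1C(=O)O"        # aspirin (illustrative query)
candidates = [
    "CC(=O)NC1=CC=C(O)C=C1",              # paracetamol (illustrative)
    "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",      # ibuprofen (illustrative)
    "C1=CC=CC=C1",                        # benzene (illustrative)
]

def embed_smiles(smiles):
    inputs = tokenizer([smiles], modality="SMILES", return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state[:, 0]

query_emb = embed_smiles(query)
scores = [torch.nn.functional.cosine_similarity(query_emb, embed_smiles(c)).item() for c in candidates]
for smiles, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {smiles}")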
Requirements
torch >= 1.9.0
transformers >= 4.20.0
Citation
If you use UME in your research, please cite:
@misc{ume2024,
title={UME: Universal Molecular Encoder for Multi-Modal Molecular Understanding},
author={UME Team},
year={2024},
url={https://github.com/your-org/ume}
}
Contributing
We welcome contributions! Please see our contributing guidelines for details.
License
This model is released under the MIT License. See LICENSE for details.
Links
- Paper: [Coming soon]
- Code: GitHub Repository
- Documentation: Full Documentation
Note: This model requires trust_remote_code=True due to the custom tokenizer that handles multi-modal molecular sequences. The tokenizer is fully self-contained and does not require external dependencies beyond standard Python libraries.