UME: Universal Molecular Encoder

UME is a multi-modal transformer model for encoding molecular sequences across different modalities:

  • Proteins (amino acid sequences)
  • Chemicals (SMILES strings)
  • DNA/RNA (nucleotide sequences)

This model uses a unified architecture to understand and encode molecular information across these different domains.

🚀 Quick Start

from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("karina-zadorozhny/ume-base", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("karina-zadorozhny/ume-base", trust_remote_code=True)

# Example protein sequence
protein = "MKTVRQERLKSIVRILERSKEPVSGAQL"
inputs = tokenizer(protein, modality="amino_acid", return_tensors="pt", padding=True)
outputs = model(**inputs)

# Get embeddings
embeddings = outputs.last_hidden_state[:, 0]  # CLS token embedding
print(f"Protein embedding shape: {embeddings.shape}")

🧬 Multi-Modal Usage

Protein Sequences

protein = "MKTVRQERLKSIVRILERSKEPVSGAQL"
inputs = tokenizer([protein], modality="amino_acid", return_tensors="pt")
outputs = model(**inputs)

Chemical Structures (SMILES)

aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
inputs = tokenizer([aspirin], modality="SMILES", return_tensors="pt")
outputs = model(**inputs)

DNA/RNA Sequences

dna = "ATGCGTACGTAGCTACGTACG"
inputs = tokenizer([dna], modality="nucleotide", return_tensors="pt")
outputs = model(**inputs)

Automatic Modality Detection

# UME can automatically detect the modality
sequences = [
    "MKTVRQERLKSIVRILERSKEPVSGAQL",  # protein
    "CC(=O)OC1=CC=CC=C1C(=O)O",      # SMILES
    "ATGCGTACGTAGCTACGTACG"          # DNA
]

for seq in sequences:
    inputs = tokenizer([seq], modality=None, return_tensors="pt")  # Auto-detect
    outputs = model(**inputs)
    print(f"Sequence: {seq[:20]}...")
    print(f"Embedding: {outputs.last_hidden_state.shape}")

🔬 Use Cases

Molecular Similarity

import torch

def get_embedding(sequence, modality=None):
    inputs = tokenizer([sequence], modality=modality, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0]  # CLS token

# Compare two proteins
protein1 = "MKTVRQERLKSIVRILERSKEPVSGAQL"
protein2 = "MKTVRQERLKSIVRILERSKEPVSGAQT"  # Similar but different

emb1 = get_embedding(protein1, "amino_acid")
emb2 = get_embedding(protein2, "amino_acid")

# Cosine similarity
similarity = torch.nn.functional.cosine_similarity(emb1, emb2)
print(f"Protein similarity: {similarity.item():.3f}")

Batch Processing

# Process multiple sequences at once
sequences = [
    "MKTVRQERLKSIVRILERSKEPVSGAQL",
    "AVKTVRQERLKSIVRILERSKEPVSGAQL", 
    "MKTVRQERLKSIVRILERSKEPVSGAQL"
]

inputs = tokenizer(sequences, modality="amino_acid", return_tensors="pt", padding=True)
outputs = model(**inputs)

batch_embeddings = outputs.last_hidden_state[:, 0]  # Shape: [batch_size, hidden_size]
print(f"Batch embeddings shape: {batch_embeddings.shape}")

📊 Model Architecture

  • Base Architecture: ModernBERT-based transformer
  • Model Size: 11.3M parameters (mini variant)
  • Hidden Size: 384
  • Layers: 6
  • Attention Heads: 6
  • Max Sequence Length: 8,192 tokens
  • Vocabulary Size: 100 tokens (unified across modalities)
  • Position Embeddings: RoPE (Rotary Position Embeddings)
  • Attention: Local-Global Alternating Attention with sliding window
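
These values can also be checked programmatically from the downloaded checkpoint. A minimal sketch, assuming the custom UME config follows standard Hugging Face attribute names (hidden_size, num_hidden_layers, and so on); the getattr guards are there because the actual names may differ:

from transformers import AutoConfig

# Load the configuration shipped with the checkpoint (trust_remote_code is
# required for the custom UME config class).
config = AutoConfig.from_pretrained("karina-zadorozhny/ume-base", trust_remote_code=True)

# Attribute names below are assumptions based on common Hugging Face
# conventions; fall back to None if the custom config uses different names.
print(getattr(config, "hidden_size", None))           # expected: 384
print(getattr(config, "num_hidden_layers", None))     # expected: 6
print(getattr(config, "num_attention_heads", None))   # expected: 6
print(getattr(config, "vocab_size", None))            # expected: 100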

🎯 Model Variants

Model        Parameters   Layers   Hidden Size   Use Case
UME-mini     11.3M        6        384           Fast inference, prototyping
UME-small    ~90M         12       768           Balanced performance
UME-medium   ~480M        24       1280          High accuracy
UME-large    ~740M        24       1600          Best performance

Currently, only UME-mini is publicly available.

βš™οΈ Advanced Usage

Masked Language Modeling

from transformers import AutoModelForMaskedLM

# Load for MLM
mlm_model = AutoModelForMaskedLM.from_pretrained("karina-zadorozhny/ume-base", trust_remote_code=True)

# Protein with masked token
protein_masked = "MKTVRQERLK<mask>IVRILERSKEPVSGAQL"
inputs = tokenizer([protein_masked], modality="amino_acid", return_tensors="pt")

# Predict masked token
outputs = mlm_model(**inputs)
logits = outputs.logits

# Get predictions for masked position
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = logits[0, mask_token_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)

print(f"Predicted amino acid: {predicted_token}")

Feature Extraction

# Extract features from all layers
outputs = model(**inputs, output_hidden_states=True)

# Get representations from each layer
all_layer_embeddings = outputs.hidden_states  # List of [batch_size, seq_len, hidden_size]

# Use different layers for different purposes
early_features = all_layer_embeddings[2]   # Earlier layer - lower-level sequence features
late_features = all_layer_embeddings[-1]   # Final layer - more abstract representations
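
Besides the CLS token, mean pooling over non-padding positions is a common alternative for sequence-level embeddings. A minimal sketch, assuming the tokenizer returns a standard attention_mask alongside input_ids:

# Average the final hidden states over non-padding positions
mask = inputs["attention_mask"].unsqueeze(-1).float()    # [batch, seq_len, 1]
summed = (outputs.last_hidden_state * mask).sum(dim=1)   # [batch, hidden_size]
mean_pooled = summed / mask.sum(dim=1).clamp(min=1e-9)   # [batch, hidden_size]
print(f"Mean-pooled embedding shape: {mean_pooled.shape}")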

🧪 Training Data

UME was trained on a large corpus of molecular sequences including:

  • Protein sequences from UniProt
  • Chemical structures from ChEMBL
  • DNA/RNA sequences from various genomic databases

📈 Performance

UME achieves competitive performance on various molecular understanding tasks:

  • Protein function prediction
  • Chemical property prediction
  • Molecular similarity search
  • Cross-modal molecular retrieval
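
As an illustration of similarity search, the sketch below reuses the get_embedding helper from the Use Cases section to rank a small reference set against a query. How directly comparable cross-modal scores are depends on how the modalities were aligned during training.

import torch
import torch.nn.functional as F

# Rank a small reference set against a query by cosine similarity,
# reusing get_embedding (with modality auto-detection) from above
references = [
    "MKTVRQERLKSIVRILERSKEPVSGAQL",  # protein
    "CC(=O)OC1=CC=CC=C1C(=O)O",      # SMILES (aspirin)
    "ATGCGTACGTAGCTACGTACG",         # DNA
]
query = "MKTVRQERLKSIVRILERSKEPVSGAQT"

ref_embs = torch.cat([get_embedding(seq) for seq in references], dim=0)  # [3, hidden_size]
query_emb = get_embedding(query)                                         # [1, hidden_size]

scores = F.cosine_similarity(query_emb, ref_embs)                        # [3]
for seq, score in zip(references, scores.tolist()):
    print(f"{score:.3f}  {seq[:20]}...")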

🛠️ Requirements

torch >= 1.9.0
transformers >= 4.20.0

📚 Citation

If you use UME in your research, please cite:

@misc{ume2024,
  title={UME: Universal Molecular Encoder for Multi-Modal Molecular Understanding},
  author={UME Team},
  year={2024},
  url={https://github.com/your-org/ume}
}

🤝 Contributing

We welcome contributions! Please see our contributing guidelines for details.

📄 License

This model is released under the MIT License. See LICENSE for details.


Note: This model requires trust_remote_code=True due to the custom tokenizer that handles multi-modal molecular sequences. The tokenizer is fully self-contained and does not require external dependencies beyond standard Python libraries.
