# AiDaeng-Thai-RoPE
A Thai language transformer model with Rotary Position Embedding (RoPE) for enhanced long-context understanding and multilingual capabilities.
## Model Description
AiDaeng-Thai-RoPE is an advanced Thai language model that uses Rotary Position Embedding (RoPE) instead of traditional absolute positional embeddings. This allows the model to better extrapolate to sequences longer than those seen during training, making it particularly effective for long-context tasks.
## Key Features
- **Long Context Support**: Can process up to 2048 tokens (approximately 1200-1500 Thai words)
- **RoPE Implementation**: Rotary Position Embedding for better position generalization
- **Multilingual Training**: Trained on a multilingual dataset including Thai, English, and Chinese
- **Confidence Scoring**: Built-in confidence mechanism for uncertainty detection
- **Reasoning Enhancement**: Configurable reasoning effort for different task complexities
- **Custom Architecture**: Uses ThaiTransformerModel with specialized Thai language optimizations
## Important Notes

⚠️ **This model uses a custom architecture and cannot be loaded with `AutoModelForCausalLM`.**

If you get an error like `KeyError: 'thai_transformer'`, you must use the custom model class:

```python
# ❌ This will NOT work:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE")

# ✅ Use this instead:
from src.hf_model import ThaiTransformerModel
model = ThaiTransformerModel.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE")
```
## What's New in v2.0

- ✅ **Fixed tokenizer issues**: Resolved PyDecoderWrapper errors for better compatibility
- ✅ **Extended context window**: Now supports 2048 tokens (up from 256)
- ✅ **Improved model architecture**: Better RoPE implementation and confidence scoring
- ✅ **Enhanced documentation**: Comprehensive usage examples and troubleshooting
## How to Use
### Option 1: Clone Repository (Recommended)

```bash
# Clone the repository
git clone https://huggingface.co/JonusNattapong/AiDaeng-Thai-RoPE
cd AiDaeng-Thai-RoPE

# Install dependencies
pip install -r requirements.txt
```

```python
# Use the model
from transformers import PreTrainedTokenizerFast
from src.hf_model import ThaiTransformerModel
import torch

# Load model and tokenizer from the local clone
tokenizer = PreTrainedTokenizerFast.from_pretrained(".")
model = ThaiTransformerModel.from_pretrained(".")

# Generate text
text = "สวัสดีครับ"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Option 3: Direct Usage (Recommended for most users)

```python
from transformers import AutoTokenizer
import requests
import os

# Step 1: Download the source files (required for the custom model class)
os.makedirs('src', exist_ok=True)
files_to_download = [
    'https://huggingface.co/JonusNattapong/AiDaeng-Thai-RoPE/raw/main/src/hf_model.py',
    'https://huggingface.co/JonusNattapong/AiDaeng-Thai-RoPE/raw/main/src/__init__.py'
]
for url in files_to_download:
    filename = url.split('/')[-1]
    response = requests.get(url)
    with open(f'src/{filename}', 'w', encoding='utf-8') as f:
        f.write(response.text)

# Step 2: Load the tokenizer (use the fast tokenizer to avoid issues)
tokenizer = AutoTokenizer.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE", use_fast=True)

# Step 3: Load the custom model (NOT AutoModelForCausalLM)
from src.hf_model import ThaiTransformerModel
model = ThaiTransformerModel.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE")

# Step 4: Generate text
text = "สวัสดีครับ"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
A complete minimal example, assuming the `src/` files are available locally:

```python
from transformers import PreTrainedTokenizerFast
from src.hf_model import ThaiTransformerModel

# Load model and tokenizer
model_path = "JonusNattapong/AiDaeng-Thai-RoPE"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_path)
model = ThaiTransformerModel.from_pretrained(model_path)

# Prepare input
text = "สวัสดีครับ ผมอยากเรียนรู้เกี่ยวกับ AI"
inputs = tokenizer(text, return_tensors="pt")

# Generate response
generated = model.generate(**inputs, max_length=50, do_sample=True, temperature=0.7)
response = tokenizer.decode(generated.squeeze(), skip_special_tokens=True)
print(response)
```
## Recommended Generation Parameters

For best results, use these parameters:

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=50,        # Generate up to 50 new tokens
    min_new_tokens=5,         # Ensure at least 5 new tokens
    do_sample=True,           # Enable sampling
    temperature=0.8,          # Creativity level (0.7-1.0)
    top_p=0.9,                # Nucleus sampling
    top_k=40,                 # Top-k sampling
    repetition_penalty=1.1,   # Reduce repetition
    no_repeat_ngram_size=3,   # Prevent repeating phrases
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id
)
```
## Current Limitations

⚠️ **Generation Quality**: This model was primarily trained on structured tasks (QA, math, reasoning). For open-ended conversation, results may be limited.

**Recommended Use Cases:**
- ✅ Question answering
- ✅ Math problem solving
- ✅ Reasoning tasks
- ✅ Structured text completion
- ⚠️ Open-ended conversation (limited)
- ⚠️ Creative writing (limited)
## Improving Generation Quality
To improve results for your use case:
- Fine-tune on conversational data for chat applications
- Adjust temperature: Higher (0.8-1.0) for creativity, lower (0.3-0.7) for consistency (see the sketch after this list)
- Use task-specific prompts that match training data patterns
- Experiment with different generation parameters
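
To illustrate the temperature trade-off, here is a minimal sketch comparing a conservative and a creative sampling run. It assumes `model` and `tokenizer` are loaded as shown in "How to Use"; the prompt and parameter values are arbitrary examples, not tuned recommendations:

```python
# Compare conservative vs. creative sampling settings on the same prompt.
prompt = "สวัสดีครับ"
inputs = tokenizer(prompt, return_tensors="pt")

for temperature in (0.3, 0.9):  # low = more consistent, high = more creative
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=temperature,
        top_p=0.9,
        repetition_penalty=1.1,
    )
    print(f"temperature={temperature}:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```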
## Long Context Processing

```python
# Process long documents (up to 2048 tokens)
long_text = "..."  # Your long Thai text
inputs = tokenizer(long_text, return_tensors="pt", max_length=2048, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    # Process outputs for summarization, analysis, etc.
```
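
For documents longer than the 2048-token window, one common workaround is to split the token sequence into chunks and run the model on each chunk separately. The sketch below is an illustration only: it assumes the `model`/`tokenizer` objects from the examples above and that the model's forward pass accepts `input_ids` directly; the chunking strategy is not part of the model itself.

```python
import torch

def process_in_chunks(long_text, tokenizer, model, max_len=2048):
    """Run the model over a long document in fixed-size token chunks."""
    # Tokenize once without truncation, then slice the ids into windows.
    ids = tokenizer(long_text, return_tensors="pt").input_ids[0]
    chunk_outputs = []
    for start in range(0, ids.size(0), max_len):
        chunk = ids[start:start + max_len].unsqueeze(0)
        with torch.no_grad():
            # Assumption: the custom model accepts input_ids like a standard HF model.
            chunk_outputs.append(model(input_ids=chunk))
    return chunk_outputs  # one model output per chunk
```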
## Training Details
### Training Data
- **Primary Dataset**: Custom multilingual knowledge dataset
- **Languages**: Thai, English, Chinese
- **Domains**: Mathematics, Science, History, General Knowledge, Logic
- **Special Features**: Includes "unknown response" examples for uncertainty training
### Training Procedure
- **Architecture**: Transformer with RoPE positional embeddings
- **Training Steps**: 100 steps with gradient accumulation
- **Batch Size**: 2 with 4-step gradient accumulation (effective batch size 8; see the sketch after this list)
- **Learning Rate**: 1e-5 with warmup
- **Max Sequence Length**: 1024 tokens during training
- **Optimizer**: AdamW
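
For reference, the effective batch size of 8 comes from the per-step batch size (2) multiplied by the gradient-accumulation steps (4). Below is a minimal PyTorch sketch of that accumulation pattern; the dataloader, model, and loss here are placeholders, not the actual training script:

```python
import torch

accumulation_steps = 4  # 2 samples/step x 4 steps = effective batch size 8
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):   # placeholder dataloader
    outputs = model(**batch)                       # assumes the batch includes labels and the model returns a loss
    loss = outputs.loss / accumulation_steps       # scale so accumulated gradients average correctly
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```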
### Hyperparameters
- **Model Size**: ~68M parameters
- **Hidden Size**: 384
- **Number of Heads**: 6
- **Number of Layers**: 6
- **Vocabulary Size**: 44,216
- **Max Position Embeddings**: 2048
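
Summarized as a configuration dictionary (the field names here are illustrative; the actual keys in the repository's `config.json` may differ):

```python
# Illustrative summary of the hyperparameters listed above.
model_config = {
    "hidden_size": 384,
    "num_attention_heads": 6,
    "num_hidden_layers": 6,
    "vocab_size": 44216,
    "max_position_embeddings": 2048,
}
# Each attention head therefore operates on 384 / 6 = 64 dimensions.
```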
## Technical Specifications
### Architecture Details
- **Position Embeddings**: Rotary Position Embedding (RoPE)
- **Attention**: Multi-head self-attention with causal masking
- **Feed Forward**: Standard transformer FFN with GELU activation
- **Normalization**: Layer normalization
- **Output Heads**: Language modeling head + confidence scoring head
### RoPE Implementation
The model uses RoPE with dynamic sequence length handling, allowing it to process inputs longer than the training context effectively.
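
To make the idea concrete, here is a generic RoPE sketch (not the model's actual code from `src/hf_model.py`): each pair of dimensions in a query/key head is rotated by an angle that depends on the token position, and because the angles are computed on the fly from the sequence length, the same function works for inputs longer than those seen in training.

```python
import torch

def apply_rope(x, base=10000.0):
    """Generic rotary position embedding for a tensor of shape (seq_len, num_heads, head_dim)."""
    seq_len, _, head_dim = x.shape
    # One rotation frequency per pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)   # (seq_len, head_dim / 2)
    cos = angles.cos()[:, None, :]               # broadcast over heads
    sin = angles.sin()[:, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]          # even / odd dimension pairs
    # Rotate each (x1, x2) pair by its position-dependent angle.
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```

In practice the rotation is applied to the query and key tensors inside each attention layer before the dot-product attention is computed.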
### Confidence Mechanism
A separate confidence head provides uncertainty estimates for generated responses, enabling the model to admit ignorance when appropriate.
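
A hedged sketch of how such a head could be used at inference time is shown below. The `outputs.confidence` attribute and the 0.5 threshold are assumptions made for illustration; check `src/hf_model.py` for the actual output fields.

```python
import torch

UNCERTAIN_REPLY = "I'm not sure about that."  # fallback when the model is not confident

def answer_with_confidence(question, tokenizer, model, threshold=0.5):
    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Hypothetical: assume the forward pass exposes a per-sequence confidence score.
    confidence = float(outputs.confidence.mean())
    if confidence < threshold:
        return UNCERTAIN_REPLY
    generated = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```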
## Performance
### Benchmarks
- **Context Length**: Successfully processes up to 2048 tokens
- **Multilingual Capability**: Trained on Thai-English-Chinese parallel data
- **Reasoning Tasks**: Enhanced performance on logical reasoning with configurable effort
### Evaluation Results
- **Training Loss**: Converged to ~4.45 after 100 steps
- **Confidence Calibration**: Effective uncertainty detection for unknown queries
## Ethical Considerations
### Responsible AI
- **Uncertainty Awareness**: Model can express uncertainty for unfamiliar topics
- **Bias Mitigation**: Trained on diverse knowledge domains
- **Safety Features**: Confidence thresholding prevents overconfident incorrect responses
### Intended Users
- Researchers and developers working with Thai NLP
- Educational institutions
- Companies building Thai language applications
- Individual developers interested in multilingual AI
## Troubleshooting
### Common Issues
#### 1. `AutoModelForCausalLM` Loading Error
**Error**: `KeyError: 'thai_transformer'` or similar when using `AutoModelForCausalLM`
**Solution**: This model uses a custom architecture. Always use `ThaiTransformerModel` instead:
```python
# ❌ Wrong
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE")
# ✅ Correct
from src.hf_model import ThaiTransformerModel
model = ThaiTransformerModel.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE", ignore_mismatched_sizes=True)
```

#### 2. Tokenizer Loading Issues

**Error**: `Exception: data did not match any variant of untagged enum PyDecoderWrapper`

**Solution**: Use the fast tokenizer instead of the slow tokenizer:

```python
# ✅ Recommended
tokenizer = AutoTokenizer.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE", use_fast=True)

# ❌ May cause issues
tokenizer = AutoTokenizer.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE", use_fast=False)
```

#### 3. Rotary Position Embedding Size Mismatch

**Error**: `size mismatch for rotary_pos_emb.sin`

**Solution**: Add `ignore_mismatched_sizes=True` to the `from_pretrained` call:

```python
model = ThaiTransformerModel.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE", ignore_mismatched_sizes=True)
```

This is safe because the RoPE embeddings are fixed and do not affect model performance.

#### 4. CUDA Out of Memory

**Solution**: Use smaller batch sizes or CPU inference:

```python
# For CPU
model = ThaiTransformerModel.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE", device_map="cpu")
```

#### 5. Long Generation Times

**Solution**: Use a shorter `max_length` and adjust the temperature:

```python
outputs = model.generate(**inputs, max_length=100, temperature=0.8, do_sample=True)
```

### Getting Help

If you encounter issues not covered here:

- Check that you're using the latest version of `transformers` (`pip install --upgrade transformers`)
- Ensure you have downloaded the source files
- Try the examples in this README exactly as written
- Open an issue on the repository with your error message and code
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{aidaeng-thai-rope,
  title={AiDaeng-Thai-RoPE: A Long-Context Thai Language Model with Rotary Position Embedding},
  author={JonusNattapong},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/JonusNattapong/AiDaeng-Thai-RoPE}
}
```
## License

This model is released under the Apache 2.0 License. See the LICENSE file for details.

## Contact

For questions or issues, please open an issue on the GitHub repository or contact the maintainers.

## Acknowledgments

- Built upon the transformer architecture
- RoPE implementation inspired by recent advances in positional embeddings
- Training data includes contributions from various open knowledge sources