# AiDaeng-Thai-RoPE
A Thai language transformer model with Rotary Position Embedding (RoPE) for enhanced long-context understanding and multilingual capabilities.
## Model Description
AiDaeng-Thai-RoPE is an advanced Thai language model that uses Rotary Position Embedding (RoPE) instead of traditional absolute positional embeddings. This allows the model to better extrapolate to sequences longer than those seen during training, making it particularly effective for long-context tasks.
## Key Features
- **Long Context Support**: Can process up to 2048 tokens (approximately 1200-1500 Thai words)
- **RoPE Implementation**: Rotary Position Embedding for better position generalization
- **Multilingual Training**: Trained on a multilingual dataset including Thai, English, and Chinese
- **Confidence Scoring**: Built-in confidence mechanism for uncertainty detection
- **Reasoning Enhancement**: Configurable reasoning effort for different task complexities
- **Custom Architecture**: Uses ThaiTransformerModel with specialized Thai language optimizations
## Important Notes

⚠️ **This model uses a custom architecture and cannot be loaded with `AutoModelForCausalLM`.**

If you get an error like `KeyError: 'thai_transformer'`, you must use the custom model class:

```python
# ❌ This will NOT work:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE")

# ✅ Use this instead:
from src.hf_model import ThaiTransformerModel
model = ThaiTransformerModel.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE")
```
## What's New in v2.0

- ✅ **Fixed tokenizer issues**: Resolved PyDecoderWrapper errors for better compatibility
- ✅ **Extended context window**: Now supports 2048 tokens (up from 256)
- ✅ **Improved model architecture**: Better RoPE implementation and confidence scoring
- ✅ **Enhanced documentation**: Comprehensive usage examples and troubleshooting
## How to Use
### Option 1: Clone Repository (Recommended)

```bash
# Clone the repository
git clone https://huggingface.co/JonusNattapong/AiDaeng-Thai-RoPE
cd AiDaeng-Thai-RoPE

# Install dependencies
pip install -r requirements.txt
```

```python
# Use the model
from transformers import PreTrainedTokenizerFast
from src.hf_model import ThaiTransformerModel
import torch

# Load model and tokenizer from the local clone
tokenizer = PreTrainedTokenizerFast.from_pretrained(".")
model = ThaiTransformerModel.from_pretrained(".")

# Generate text
text = "สวัสดีครับ"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Option 3: Direct Usage (Recommended for most users)

```python
from transformers import AutoTokenizer
import requests
import os

# Step 1: Download the source files (required for the custom model class)
os.makedirs('src', exist_ok=True)
files_to_download = [
    'https://huggingface.co/JonusNattapong/AiDaeng-Thai-RoPE/raw/main/src/hf_model.py',
    'https://huggingface.co/JonusNattapong/AiDaeng-Thai-RoPE/raw/main/src/__init__.py'
]
for url in files_to_download:
    filename = url.split('/')[-1]
    response = requests.get(url)
    with open(f'src/{filename}', 'w', encoding='utf-8') as f:
        f.write(response.text)

# Step 2: Load the tokenizer (use the fast tokenizer to avoid issues)
tokenizer = AutoTokenizer.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE", use_fast=True)

# Step 3: Load the custom model (NOT AutoModelForCausalLM)
from src.hf_model import ThaiTransformerModel
model = ThaiTransformerModel.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE")

# Step 4: Generate text
text = "สวัสดีครับ"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
A complete minimal example, assuming the `src/` files are available locally:

```python
from transformers import PreTrainedTokenizerFast
from src.hf_model import ThaiTransformerModel

# Load model and tokenizer
model_path = "JonusNattapong/AiDaeng-Thai-RoPE"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_path)
model = ThaiTransformerModel.from_pretrained(model_path)

# Prepare input
text = "สวัสดีครับ ผมอยากเรียนรู้เกี่ยวกับ AI"
inputs = tokenizer(text, return_tensors="pt")

# Generate response
generated = model.generate(**inputs, max_length=50, do_sample=True, temperature=0.7)
response = tokenizer.decode(generated.squeeze(), skip_special_tokens=True)
print(response)
```
## Recommended Generation Parameters

For best results, use these parameters:

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=50,        # Generate up to 50 new tokens
    min_new_tokens=5,         # Ensure at least 5 new tokens
    do_sample=True,           # Enable sampling
    temperature=0.8,          # Creativity level (0.7-1.0)
    top_p=0.9,                # Nucleus sampling
    top_k=40,                 # Top-k sampling
    repetition_penalty=1.1,   # Reduce repetition
    no_repeat_ngram_size=3,   # Prevent repeating phrases
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id
)
```
## Current Limitations

⚠️ **Generation Quality**: This model was primarily trained on structured tasks (QA, math, reasoning). For open-ended conversation, results may be limited.

**Recommended Use Cases:**
- ✅ Question answering
- ✅ Math problem solving
- ✅ Reasoning tasks
- ✅ Structured text completion
- ⚠️ Open-ended conversation (limited)
- ⚠️ Creative writing (limited)
## Improving Generation Quality
To improve results for your use case:
- Fine-tune on conversational data for chat applications
- Adjust temperature: Higher (0.8-1.0) for creativity, lower (0.3-0.7) for consistency (see the sketch after this list)
- Use task-specific prompts that match training data patterns
- Experiment with different generation parameters
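
To illustrate the temperature trade-off, here is a minimal sketch comparing a conservative and a creative sampling run. It assumes `model` and `tokenizer` are loaded as shown in "How to Use"; the prompt and parameter values are arbitrary examples, not tuned recommendations:

```python
# Compare conservative vs. creative sampling settings on the same prompt.
prompt = "สวัสดีครับ"
inputs = tokenizer(prompt, return_tensors="pt")

for temperature in (0.3, 0.9):  # low = more consistent, high = more creative
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=temperature,
        top_p=0.9,
        repetition_penalty=1.1,
    )
    print(f"temperature={temperature}:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```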
## Long Context Processing

```python
# Process long documents (up to 2048 tokens)
long_text = "..."  # Your long Thai text
inputs = tokenizer(long_text, return_tensors="pt", max_length=2048, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    # Process outputs for summarization, analysis, etc.
```
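
For documents longer than the 2048-token window, one common workaround is to split the token sequence into chunks and run the model on each chunk separately. The sketch below is an illustration only: it assumes the `model`/`tokenizer` objects from the examples above and that the model's forward pass accepts `input_ids` directly; the chunking strategy is not part of the model itself.

```python
import torch

def process_in_chunks(long_text, tokenizer, model, max_len=2048):
    """Run the model over a long document in fixed-size token chunks."""
    # Tokenize once without truncation, then slice the ids into windows.
    ids = tokenizer(long_text, return_tensors="pt").input_ids[0]
    chunk_outputs = []
    for start in range(0, ids.size(0), max_len):
        chunk = ids[start:start + max_len].unsqueeze(0)
        with torch.no_grad():
            # Assumption: the custom model accepts input_ids like a standard HF model.
            chunk_outputs.append(model(input_ids=chunk))
    return chunk_outputs  # one model output per chunk
```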
## Training Details
### Training Data
- **Primary Dataset**: Custom multilingual knowledge dataset
- **Languages**: Thai, English, Chinese
- **Domains**: Mathematics, Science, History, General Knowledge, Logic
- **Special Features**: Includes "unknown response" examples for uncertainty training
### Training Procedure
- **Architecture**: Transformer with RoPE positional embeddings
- **Training Steps**: 100 steps with gradient accumulation
- **Batch Size**: 2 with 4-step gradient accumulation (effective batch size 8; see the sketch after this list)
- **Learning Rate**: 1e-5 with warmup
- **Max Sequence Length**: 1024 tokens during training
- **Optimizer**: AdamW
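
For reference, the effective batch size of 8 comes from the per-step batch size (2) multiplied by the gradient-accumulation steps (4). Below is a minimal PyTorch sketch of that accumulation pattern; the dataloader, model, and loss here are placeholders, not the actual training script:

```python
import torch

accumulation_steps = 4  # 2 samples/step x 4 steps = effective batch size 8
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):   # placeholder dataloader
    outputs = model(**batch)                       # assumes the batch includes labels and the model returns a loss
    loss = outputs.loss / accumulation_steps       # scale so accumulated gradients average correctly
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```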
### Hyperparameters
- **Model Size**: ~68M parameters
- **Hidden Size**: 384
- **Number of Heads**: 6
- **Number of Layers**: 6
- **Vocabulary Size**: 44,216
- **Max Position Embeddings**: 2048
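
Summarized as a configuration dictionary (the field names here are illustrative; the actual keys in the repository's `config.json` may differ):

```python
# Illustrative summary of the hyperparameters listed above.
model_config = {
    "hidden_size": 384,
    "num_attention_heads": 6,
    "num_hidden_layers": 6,
    "vocab_size": 44216,
    "max_position_embeddings": 2048,
}
# Each attention head therefore operates on 384 / 6 = 64 dimensions.
```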
## Technical Specifications
### Architecture Details
- **Position Embeddings**: Rotary Position Embedding (RoPE)
- **Attention**: Multi-head self-attention with causal masking
- **Feed Forward**: Standard transformer FFN with GELU activation
- **Normalization**: Layer normalization
- **Output Heads**: Language modeling head + confidence scoring head
### RoPE Implementation
The model uses RoPE with dynamic sequence length handling, allowing it to process inputs longer than the training context effectively.
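
To make the idea concrete, here is a generic RoPE sketch (not the model's actual code from `src/hf_model.py`): each pair of dimensions in a query/key head is rotated by an angle that depends on the token position, and because the angles are computed on the fly from the sequence length, the same function works for inputs longer than those seen in training.

```python
import torch

def apply_rope(x, base=10000.0):
    """Generic rotary position embedding for a tensor of shape (seq_len, num_heads, head_dim)."""
    seq_len, _, head_dim = x.shape
    # One rotation frequency per pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)   # (seq_len, head_dim / 2)
    cos = angles.cos()[:, None, :]               # broadcast over heads
    sin = angles.sin()[:, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]          # even / odd dimension pairs
    # Rotate each (x1, x2) pair by its position-dependent angle.
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```

In practice the rotation is applied to the query and key tensors inside each attention layer before the dot-product attention is computed.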
### Confidence Mechanism
A separate confidence head provides uncertainty estimates for generated responses, enabling the model to admit ignorance when appropriate.
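
A hedged sketch of how such a head could be used at inference time is shown below. The `outputs.confidence` attribute and the 0.5 threshold are assumptions made for illustration; check `src/hf_model.py` for the actual output fields.

```python
import torch

UNCERTAIN_REPLY = "I'm not sure about that."  # fallback when the model is not confident

def answer_with_confidence(question, tokenizer, model, threshold=0.5):
    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Hypothetical: assume the forward pass exposes a per-sequence confidence score.
    confidence = float(outputs.confidence.mean())
    if confidence < threshold:
        return UNCERTAIN_REPLY
    generated = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```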
## Performance
### Benchmarks
- **Context Length**: Successfully processes up to 2048 tokens
- **Multilingual Capability**: Trained on Thai-English-Chinese parallel data
- **Reasoning Tasks**: Enhanced performance on logical reasoning with configurable effort
### Evaluation Results
- **Training Loss**: Converged to ~4.45 after 100 steps
- **Confidence Calibration**: Effective uncertainty detection for unknown queries
## Ethical Considerations
### Responsible AI
- **Uncertainty Awareness**: Model can express uncertainty for unfamiliar topics
- **Bias Mitigation**: Trained on diverse knowledge domains
- **Safety Features**: Confidence thresholding prevents overconfident incorrect responses
### Intended Users
- Researchers and developers working with Thai NLP
- Educational institutions
- Companies building Thai language applications
- Individual developers interested in multilingual AI
## Troubleshooting
### Common Issues
#### 1. `AutoModelForCausalLM` Loading Error
**Error**: `KeyError: 'thai_transformer'` or similar when using `AutoModelForCausalLM`
**Solution**: This model uses a custom architecture. Always use `ThaiTransformerModel` instead:
```python
# ❌ Wrong
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE")
# ✅ Correct
from src.hf_model import ThaiTransformerModel
model = ThaiTransformerModel.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE", ignore_mismatched_sizes=True)
```

#### 2. Tokenizer Loading Issues

**Error**: `Exception: data did not match any variant of untagged enum PyDecoderWrapper`

**Solution**: Use the fast tokenizer instead of the slow tokenizer:

```python
# ✅ Recommended
tokenizer = AutoTokenizer.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE", use_fast=True)

# ❌ May cause issues
tokenizer = AutoTokenizer.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE", use_fast=False)
```

#### 3. Rotary Position Embedding Size Mismatch

**Error**: `size mismatch for rotary_pos_emb.sin`

**Solution**: Add `ignore_mismatched_sizes=True` to the `from_pretrained` call:

```python
model = ThaiTransformerModel.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE", ignore_mismatched_sizes=True)
```

This is safe because the RoPE embeddings are fixed and do not affect model performance.

#### 4. CUDA Out of Memory

**Solution**: Use smaller batch sizes or CPU inference:

```python
# For CPU
model = ThaiTransformerModel.from_pretrained("JonusNattapong/AiDaeng-Thai-RoPE", device_map="cpu")
```

#### 5. Long Generation Times

**Solution**: Use a shorter `max_length` and adjust the temperature:

```python
outputs = model.generate(**inputs, max_length=100, temperature=0.8, do_sample=True)
```

### Getting Help

If you encounter issues not covered here:

- Check that you're using the latest version of `transformers` (`pip install --upgrade transformers`)
- Ensure you have downloaded the source files
- Try the examples in this README exactly as written
- Open an issue on the repository with your error message and code
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{aidaeng-thai-rope,
  title={AiDaeng-Thai-RoPE: A Long-Context Thai Language Model with Rotary Position Embedding},
  author={JonusNattapong},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/JonusNattapong/AiDaeng-Thai-RoPE}
}
```
## License

This model is released under the Apache 2.0 License. See the LICENSE file for details.

## Contact

For questions or issues, please open an issue on the GitHub repository or contact the maintainers.

## Acknowledgments

- Built upon the transformer architecture
- RoPE implementation inspired by recent advances in positional embeddings
- Training data includes contributions from various open knowledge sources