---
datasets:
- Rainnighttram/Scam_detect_50
language:
- en
- zh
base_model:
- meta-llama/Llama-3.2-1B
---

# Scam SMS Detection Model (Llama 3.2 1B Fine-tuned)

A fine-tuned Llama 3.2 1B model specifically designed to detect and classify scam SMS messages in Hong Kong, with support for both Traditional Chinese and English text.

## 🚀 Model Overview

This model is based on Meta's Llama 3.2 1B and has been fine-tuned using the MLX framework on a carefully curated dataset of SMS messages collected in Hong Kong. The model can effectively distinguish between legitimate and fraudulent SMS messages in both Traditional Chinese and English.

### Key Features

- **Bilingual Support**: Traditional Chinese and English
- **Lightweight**: 1B parameters for efficient deployment
- **Cross-Platform**: GGUF format optimized for llama.cpp deployment
- **Local Processing**: No internet connection required for inference

## 📊 Model Details

| Specification | Details |
|---------------|---------|
| Base Model | Meta Llama 3.2 1B |
| Fine-tuning Framework | MLX |
| Model Format | GGUF |
| Languages | Traditional Chinese, English |
| Training Data | Self-collected Hong Kong SMS samples |
| Model Size | ~2.5GB |
| Context Length | 8,192 tokens |

## 🛠 Requirements

### Software Dependencies

- **llama.cpp** (model engine)
- **Python 3.8+** (for preprocessing scripts)

### Hardware Requirements

- **Minimum RAM**: 8GB
- **Recommended RAM**: 16GB+
- **Storage**: 3GB free space

## 📱 Installation & Deployment

### Desktop/Server Deployment

1. **Install llama.cpp**

   ```bash
   git clone https://github.com/ggerganov/llama.cpp.git
   cd llama.cpp
   make
   ```

   Note: recent llama.cpp releases build with CMake instead of `make` and name the CLI binary `llama-cli` rather than `main`; adjust the commands below if your build differs.

2. **Download the model**

   ```bash
   # Download the model file (replace with the actual download link)
   wget [MODEL_DOWNLOAD_URL] -O scam_sms_detector.gguf
   ```

3. **Run inference**

   ```bash
   # The Chinese prompt reads: "Classify this SMS: Congratulations, you've won! Please click the link to claim your prize"
   ./main -m scam_sms_detector.gguf -p "Classify this SMS: 恭喜您中獎了!請點擊鏈接領取獎金" -n 50
   ```

   If you prefer to call the model from Python instead of the shell, see the sketch below.
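### Python Bindings (Optional)

As an alternative to the CLI, the same GGUF file can be loaded through the `llama-cpp-python` bindings. A minimal sketch, assuming `llama-cpp-python` is installed (`pip install llama-cpp-python`) and `scam_sms_detector.gguf` is in the working directory:

```python
from llama_cpp import Llama

# Load the GGUF model; n_ctx matches the 8,192-token context length listed above.
llm = Llama(model_path="scam_sms_detector.gguf", n_ctx=8192, verbose=False)

# Chinese test message: "Congratulations, you've won! Please click the link to claim your prize"
sms = "恭喜您中獎了!請點擊鏈接領取獎金"

result = llm(
    f"Classify this SMS as SCAM or LEGITIMATE: {sms}",
    max_tokens=30,
    temperature=0.0,  # deterministic output is preferable for classification
)
print(result["choices"][0]["text"].strip())
```

Keeping the model resident in a single `Llama` instance avoids the per-call startup cost of the subprocess approach used in the batch example below.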
## 🔧 Usage Examples

### Basic Classification

```bash
# English SMS
./main -m scam_sms_detector.gguf -p "Classify: Congratulations! You've won \$10,000. Click here to claim your prize!" -n 30

# Traditional Chinese SMS ("Classify this SMS: Your bank account has been frozen; please click the link immediately to verify your identity")
./main -m scam_sms_detector.gguf -p "分類此短信:您的銀行賬戶已被凍結,請立即點擊鏈接驗證身份" -n 30
```

### Batch Processing

```python
import subprocess

def classify_sms(text):
    """Classify a single SMS by shelling out to the llama.cpp CLI."""
    cmd = [
        "./main", "-m", "scam_sms_detector.gguf",
        "-p", f"Classify this SMS as SCAM or LEGITIMATE: {text}",
        "-n", "10"
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # stdout contains the full llama.cpp output (prompt echo plus generation);
    # trim or parse it further if you only want the label.
    return result.stdout.strip()

# Example usage
messages = [
    "Your package is ready for delivery. Track: https://bit.ly/track123",
    "Meeting scheduled for 3 PM tomorrow in conference room A",
    "恭喜!您已被選中獲得免費iPhone,請點擊領取"  # "Congratulations! You've been selected for a free iPhone; click to claim"
]

for msg in messages:
    classification = classify_sms(msg)
    print(f"Message: {msg}")
    print(f"Classification: {classification}\n")
```

### API Integration

```python
# Simple Flask API wrapper
from flask import Flask, request, jsonify
import subprocess

app = Flask(__name__)

@app.route('/classify', methods=['POST'])
def classify_sms():
    data = request.json
    sms_text = data.get('text', '')

    cmd = [
        "./main", "-m", "scam_sms_detector.gguf",
        "-p", f"Classify: {sms_text}",
        "-n", "20"
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)

    return jsonify({
        'text': sms_text,
        'classification': result.stdout.strip(),
        'confidence': 'high'  # You may want to implement confidence scoring
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```

## 📈 Performance & Capabilities

### Language Support

- **Traditional Chinese**: Optimized for Hong Kong usage patterns
- **English**: Standard international English
- **Mixed Language**: Can handle code-switching between Chinese and English

### Expected Performance

- **Memory Usage**: ~3GB RAM during inference

## ⚠️ Limitations

1. **Regional Specificity**: Optimized for Hong Kong SMS patterns; may need retraining for other regions
2. **Language Support**: Limited to Traditional Chinese and English
3. **Context Dependency**: May require additional context for borderline cases
4. **Update Frequency**: Scam patterns evolve; periodic retraining is recommended
5. **Legal Compliance**: Users are responsible for compliance with local privacy laws

## 🤝 Contributing

Contributions to improve the model are welcome:

1. **Data Collection**: Help expand the training dataset
2. **Bug Reports**: Report issues or false classifications
3. **Feature Requests**: Suggest improvements or new capabilities

### Acknowledgments

- Meta AI for the Llama 3.2 base model
- Apple MLX team for the fine-tuning framework
- Georgi Gerganov for llama.cpp