---
datasets:
- Rainnighttram/Scam_detect_50
language:
- en
- zh
base_model:
- meta-llama/Llama-3.2-1B
---

# Scam SMS Detection Model (Llama 3.2 1B Fine-tuned)

A fine-tuned Llama 3.2 1B model specifically designed to detect and classify scam SMS messages in Hong Kong, with support for both Traditional Chinese and English text.

## 🚀 Model Overview

This model is based on Meta's Llama 3.2 1B and has been fine-tuned using the MLX framework on a carefully curated dataset of SMS messages collected in Hong Kong. The model can effectively distinguish between legitimate and fraudulent SMS messages in both Traditional Chinese and English.

### Key Features

- **Bilingual Support**: Traditional Chinese and English
- **Lightweight**: 1B parameters for efficient deployment
- **Cross-Platform**: GGUF format optimized for llama.cpp deployment
- **Local Processing**: No internet connection required for inference

## 📊 Model Details

| Specification | Details |
|---------------|---------|
| Base Model | Meta Llama 3.2 1B |
| Fine-tuning Framework | MLX |
| Model Format | GGUF |
| Languages | Traditional Chinese, English |
| Training Data | Self-collected Hong Kong SMS samples |
| Model Size | ~2.5GB |
| Context Length | 8,192 tokens |

## 🛠 Requirements

### Software Dependencies

- **llama.cpp** (model engine)
- **Python 3.8+** (for preprocessing scripts)

### Hardware Requirements

- **Minimum RAM**: 8GB
- **Recommended RAM**: 16GB+
- **Storage**: 3GB free space

## 📱 Installation & Deployment

### Desktop/Server Deployment

1. **Install llama.cpp**

   ```bash
   git clone https://github.com/ggerganov/llama.cpp.git
   cd llama.cpp
   make
   ```

   Note: recent llama.cpp releases build with CMake instead of `make` and name the CLI binary `llama-cli` rather than `main`; adjust the commands below if your build differs.

2. **Download the model**

   ```bash
   # Download the model file (replace with the actual download link)
   wget [MODEL_DOWNLOAD_URL] -O scam_sms_detector.gguf
   ```

3. **Run inference**

   ```bash
   # The Chinese prompt reads: "Classify this SMS: Congratulations, you've won! Please click the link to claim your prize"
   ./main -m scam_sms_detector.gguf -p "Classify this SMS: 恭喜您中獎了!請點擊鏈接領取獎金" -n 50
   ```

   If you prefer to call the model from Python instead of the shell, see the sketch below.
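### Python Bindings (Optional)

As an alternative to the CLI, the same GGUF file can be loaded through the `llama-cpp-python` bindings. A minimal sketch, assuming `llama-cpp-python` is installed (`pip install llama-cpp-python`) and `scam_sms_detector.gguf` is in the working directory:

```python
from llama_cpp import Llama

# Load the GGUF model; n_ctx matches the 8,192-token context length listed above.
llm = Llama(model_path="scam_sms_detector.gguf", n_ctx=8192, verbose=False)

# Chinese test message: "Congratulations, you've won! Please click the link to claim your prize"
sms = "恭喜您中獎了!請點擊鏈接領取獎金"

result = llm(
    f"Classify this SMS as SCAM or LEGITIMATE: {sms}",
    max_tokens=30,
    temperature=0.0,  # deterministic output is preferable for classification
)
print(result["choices"][0]["text"].strip())
```

Keeping the model resident in a single `Llama` instance avoids the per-call startup cost of the subprocess approach used in the batch example below.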
## 🔧 Usage Examples

### Basic Classification

```bash
# English SMS
./main -m scam_sms_detector.gguf -p "Classify: Congratulations! You've won \$10,000. Click here to claim your prize!" -n 30

# Traditional Chinese SMS ("Classify this SMS: Your bank account has been frozen; please click the link immediately to verify your identity")
./main -m scam_sms_detector.gguf -p "分類此短信:您的銀行賬戶已被凍結,請立即點擊鏈接驗證身份" -n 30
```

### Batch Processing

```python
import subprocess

def classify_sms(text):
    """Classify a single SMS by shelling out to the llama.cpp CLI."""
    cmd = [
        "./main", "-m", "scam_sms_detector.gguf",
        "-p", f"Classify this SMS as SCAM or LEGITIMATE: {text}",
        "-n", "10"
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # stdout contains the full llama.cpp output (prompt echo plus generation);
    # trim or parse it further if you only want the label.
    return result.stdout.strip()

# Example usage
messages = [
    "Your package is ready for delivery. Track: https://bit.ly/track123",
    "Meeting scheduled for 3 PM tomorrow in conference room A",
    "恭喜!您已被選中獲得免費iPhone,請點擊領取"  # "Congratulations! You've been selected for a free iPhone; click to claim"
]

for msg in messages:
    classification = classify_sms(msg)
    print(f"Message: {msg}")
    print(f"Classification: {classification}\n")
```

### API Integration

```python
# Simple Flask API wrapper
from flask import Flask, request, jsonify
import subprocess

app = Flask(__name__)

@app.route('/classify', methods=['POST'])
def classify_sms():
    data = request.json
    sms_text = data.get('text', '')

    cmd = [
        "./main", "-m", "scam_sms_detector.gguf",
        "-p", f"Classify: {sms_text}",
        "-n", "20"
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)

    return jsonify({
        'text': sms_text,
        'classification': result.stdout.strip(),
        'confidence': 'high'  # You may want to implement confidence scoring
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```

## 📈 Performance & Capabilities

### Language Support

- **Traditional Chinese**: Optimized for Hong Kong usage patterns
- **English**: Standard international English
- **Mixed Language**: Can handle code-switching between Chinese and English

### Expected Performance

- **Memory Usage**: ~3GB RAM during inference

## ⚠️ Limitations

1. **Regional Specificity**: Optimized for Hong Kong SMS patterns; may need retraining for other regions
2. **Language Support**: Limited to Traditional Chinese and English
3. **Context Dependency**: May require additional context for borderline cases
4. **Update Frequency**: Scam patterns evolve; periodic retraining is recommended
5. **Legal Compliance**: Users are responsible for compliance with local privacy laws

## 🤝 Contributing

Contributions to improve the model are welcome:

1. **Data Collection**: Help expand the training dataset
2. **Bug Reports**: Report issues or false classifications
3. **Feature Requests**: Suggest improvements or new capabilities

### Acknowledgments

- Meta AI for the Llama 3.2 base model
- Apple MLX team for the fine-tuning framework
- Georgi Gerganov for llama.cpp