---
license: apache-2.0
language:
- km
metrics:
- accuracy
base_model:
- google/mt5-small
pipeline_tag: summarization
library_name: transformers
---
# Khmer mT5 Summarization Model (1024 Tokens) - V2

## Introduction

This repository contains an improved version of the Khmer mT5 summarization model, **songhieng/khmer-mt5-summarization-1024tk-V2**. This version has been trained on an expanded dataset, including data from [kimleang123/rfi_news](https://huggingface.co/datasets/kimleang123/rfi_news), allowing for improved summarization performance on Khmer text.

## Model Details

- **Base Model:** `google/mt5-small`
- **Fine-tuned for:** Khmer text summarization with extended input length
- **Training Dataset:** `kimleang123/rfi_news` + previous dataset
- **Framework:** Hugging Face `transformers`
- **Task Type:** Sequence-to-Sequence (Seq2Seq)
- **Input:** Khmer text (articles, paragraphs, or documents) up to 1024 tokens
- **Output:** Summarized Khmer text
- **Training Hardware:** GPU (Tesla T4)
- **Evaluation Metric:** ROUGE Score
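
Inputs beyond the 1024-token cap are simply truncated by the tokenizer. For longer documents, one common workaround is to split the token ids into overlapping chunks, summarize each chunk, and join the partial summaries. A minimal sketch of such a splitter (plain Python, tokenizer-agnostic; the helper name `chunk_token_ids` is illustrative, not part of this model's API):

```python
def chunk_token_ids(ids, max_len=1024, overlap=64):
    """Split a token-id sequence into chunks of at most max_len ids.

    Consecutive chunks share `overlap` ids so a little context survives the
    boundary; each chunk can then be summarized separately.
    """
    if max_len <= overlap:
        raise ValueError("max_len must be larger than overlap")
    chunks = []
    start = 0
    while start < len(ids):
        chunks.append(ids[start:start + max_len])
        if start + max_len >= len(ids):
            break
        start += max_len - overlap
    return chunks

# Example: 2,500 ids with a 1,024 budget and 64-id overlap -> 3 chunks
parts = chunk_token_ids(list(range(2500)))
print([len(p) for p in parts])  # [1024, 1024, 580]
```

In practice you would feed each chunk through `model.generate` and concatenate the decoded summaries, accepting that cross-chunk coherence may suffer.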

## Installation & Setup

### 1️⃣ Install Dependencies

Ensure you have `transformers`, `torch`, and `datasets` installed (plus `evaluate` and `rouge_score` if you want to run the evaluation section below):

```bash
pip install transformers torch datasets evaluate rouge_score
```

### 2️⃣ Load the Model

To load and use the fine-tuned model:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "songhieng/khmer-mt5-summarization-1024tk-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```

## How to Use

### 1️⃣ Using Python Code

```python
def summarize_khmer(text, max_length=150):
    input_text = f"summarize: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=1024)
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

khmer_text = "αž€αž˜αŸ’αž–αž»αž‡αžΆαž˜αžΆαž“αž”αŸ’αžšαž‡αžΆαž‡αž“αž”αŸ’αžšαž˜αžΆαžŽ ៑៦ αž›αžΆαž“αž“αžΆαž€αŸ‹ αž αžΎαž™αžœαžΆαž‚αžΊαž‡αžΆαž”αŸ’αžšαž‘αŸαžŸαž“αŸ…αžαŸ†αž”αž“αŸ‹αž’αžΆαžŸαŸŠαžΈαž’αžΆαž‚αŸ’αž“αŸαž™αŸαŸ”"
summary = summarize_khmer(khmer_text)
print("Khmer Summary:", summary)
```

### 2️⃣ Using Hugging Face Pipeline

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization-1024tk-V2")
khmer_text = "αž€αž˜αŸ’αž–αž»αž‡αžΆαž˜αžΆαž“αž”αŸ’αžšαž‡αžΆαž‡αž“αž”αŸ’αžšαž˜αžΆαžŽ ៑៦ αž›αžΆαž“αž“αžΆαž€αŸ‹ αž αžΎαž™αžœαžΆαž‚αžΊαž‡αžΆαž”αŸ’αžšαž‘αŸαžŸαž“αŸ…αžαŸ†αž”αž“αŸ‹αž’αžΆαžŸαŸŠαžΈαž’αžΆαž‚αŸ’αž“αŸαž™αŸαŸ”"
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
print("Khmer Summary:", summary[0]['summary_text'])
```

### 3️⃣ Deploy as an API using FastAPI

```python
from fastapi import FastAPI
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the model once at startup, not per request
model_name = "songhieng/khmer-mt5-summarization-1024tk-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

app = FastAPI()

@app.post("/summarize/")
def summarize(text: str):
    inputs = tokenizer(f"summarize: {text}", return_tensors="pt", truncation=True, max_length=1024)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}

# Run with: uvicorn filename:app --reload
```

## Model Evaluation

The model was evaluated using **ROUGE scores**, which measure the similarity between the generated summaries and the reference summaries.

```python
import evaluate  # `datasets.load_metric` has been removed; use the `evaluate` library

rouge = evaluate.load("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# Pass `compute_metrics` to a Seq2SeqTrainer, then:
trainer.evaluate()
```
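
For intuition, ROUGE-1 F1 can be illustrated with a small pure-Python sketch. This toy version counts overlapping unigrams in whitespace-tokenized strings only; real ROUGE implementations handle stemming, multiple n-gram orders, and languages like Khmer that are not whitespace-segmented:

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Toy unigram-overlap F1 between two whitespace-tokenized strings."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the cat sat", "the cat sat on the mat"), 3))  # 0.667
```

Higher scores mean the generated summary shares more vocabulary with the reference, which is why ROUGE is a convenient (if imperfect) proxy for summary quality.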

## Saving & Uploading the Model

After fine-tuning, the model can be uploaded to the Hugging Face Hub:

```python
model.push_to_hub("songhieng/khmer-mt5-summarization-1024tk-V2")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization-1024tk-V2")
```

To download it later:

```python
model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization-1024tk-V2")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization-1024tk-V2")
```

## Summary

| **Feature**           | **Details**                                     |
|-----------------------|-------------------------------------------------|
| **Base Model**        | `google/mt5-small`                              |
| **Task**              | Summarization                                   |
| **Language**          | Khmer (αžαŸ’αž˜αŸ‚αžš)                                   |
| **Dataset**           | `kimleang123/rfi_news` + previous dataset       |
| **Framework**         | Hugging Face Transformers                       |
| **Evaluation Metric** | ROUGE Score                                     |
| **Deployment**        | Hugging Face Model Hub, API (FastAPI), Python Code |

## Contributing

Contributions are welcome! Feel free to **open issues or submit pull requests** if you have any improvements or suggestions.

### Contact

If you have any questions, feel free to reach out via [Hugging Face Discussions](https://huggingface.co/) or create an issue in the repository.

**Built for the Khmer NLP Community**