developerPushkal committed on
Commit b82fb3f · verified
1 Parent(s): ab26254

Create README.md

Files changed (1)
  1. README.md +110 -0
README.md ADDED
@@ -0,0 +1,110 @@
+ # Grammar Correction with Text-to-Text Transfer Transformer
+
+ ## 📌 Overview
+
+ This repository hosts a quantized version of the T5 model (`t5-small`) fine-tuned for grammar correction. The model was trained on the JFLEG dataset from Hugging Face to correct grammatical errors in input text. It is quantized to Float16 (FP16) to reduce model size and speed up inference, at the cost of a possible slight loss in accuracy.
+
+ ## 🏗 Model Details
+
+ - **Model Architecture:** t5-small
+ - **Task:** Grammar Correction
+ - **Dataset:** Hugging Face's `jfleg`
+ - **Quantization:** Float16 (FP16) for optimized inference
+ - **Fine-tuning Framework:** Hugging Face Transformers
+
+ ## 🚀 Usage
+
+ ### Installation
+
+ ```bash
+ pip install transformers torch sentencepiece
+ ```
+
+ ### Loading the Model
+
+ ```python
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
+ import torch
+
+ # Use the GPU when one is available, otherwise fall back to the CPU.
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ # Download the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub.
+ model_name = "AventIQ-AI/t5-small-grammar-correction"
+ model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)
+ tokenizer = T5Tokenizer.from_pretrained(model_name)
+ ```
+
+ ### Grammar Correction Inference
+
+ ```python
+ def correct_grammar(text, model, tokenizer, device):
+     # Prepend the task prefix before tokenizing the input.
+     prefix = "correct grammar: "
+     input_text = prefix + text
+     input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
+     # Beam search with 5 beams, capped at 128 tokens.
+     outputs = model.generate(
+         input_ids,
+         max_length=128,
+         num_beams=5,
+         early_stopping=True,
+     )
+     # Decode the best beam and drop special tokens such as padding and EOS.
+     corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+     return corrected_text
+
+ # 🔍 Test Example
+ test_sentences = [
+     "He go to the store yesterday.",
+     "They was running in the park.",
+     "She dont like pizza.",
+     "We has completed the project already.",
+ ]
+ for sentence in test_sentences:
+     corrected = correct_grammar(sentence, model, tokenizer, device)
+     print(f"Original: {sentence}")
+     print(f"Corrected: {corrected}")
+     print("---")
+ ```
+
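The loop above calls `generate` once per sentence. As a minimal sketch, assuming the same `correct grammar: ` prefix and generation settings, several sentences could be corrected in a single call. The helper name `correct_grammar_batch` and its keyword defaults are illustrative and not part of this repository.

```python
def correct_grammar_batch(sentences, model, tokenizer, device,
                          prefix="correct grammar: ", max_length=128, num_beams=5):
    """Correct several sentences with one generate() call (illustrative helper)."""
    inputs = [prefix + sentence for sentence in sentences]
    # Pad to the longest input so the batch forms one rectangular tensor;
    # truncate anything longer than max_length tokens.
    encoded = tokenizer(
        inputs,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
    ).to(device)
    outputs = model.generate(
        **encoded,  # forwards input_ids and attention_mask
        max_length=max_length,
        num_beams=num_beams,
        early_stopping=True,
    )
    # Decode every sequence in the batch, dropping padding and EOS tokens.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With the `model`, `tokenizer`, and `device` objects from the loading step, `correct_grammar_batch(test_sentences, model, tokenizer, device)` would return one corrected string per input sentence. Passing the attention mask matters here because padded positions should be ignored during generation.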
+ ## 📊 Evaluation Metric: BLEU Score
+
+ For grammar correction, a high BLEU score indicates that the model’s corrected sentences closely match human-annotated corrections.
+
+ ### **Interpreting Our BLEU Score**
+
+ Our model achieved a **BLEU score of 0.8888** (on a 0 to 1 scale), which indicates:
+
+ - ✅ **Good grammar correction ability**
+ - ✅ **Moderate sentence fluency**
+
+ BLEU combines the **clipped 1-gram, 2-gram, 3-gram, and 4-gram precisions** of the model’s output against the reference sentence and applies a **brevity penalty** when the output is shorter than the reference.
+
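To make the BLEU description concrete, here is a minimal pure-Python sketch of single-reference, unsmoothed sentence BLEU on whitespace tokens. It is an illustration only: the function names are hypothetical, and the score reported in this README was not necessarily computed this way (evaluation libraries typically add their own tokenization and smoothing).

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU of one candidate against one reference, in [0, 1]."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: 1 when the candidate is at least as long as the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


# A candidate that drops the last word keeps perfect n-gram precision
# but is penalized for brevity: exp(1 - 6/5), roughly 0.8187.
print(sentence_bleu("He went to the store", "He went to the store yesterday"))
```

The example shows why the brevity penalty exists: without it, a model could score perfectly by emitting only the words it is sure about.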
+ ### **BLEU Score Ranges for Grammar Correction**
+
+ | BLEU Score | Interpretation |
+ | --- | --- |
+ | **0.8 - 1.0** | Near-perfect corrections, closely matching human annotations. ✅ _(Our Model)_ |
+ | **0.7 - 0.8** | High-quality corrections, minor variations in phrasing. |
+ | **0.6 - 0.7** | Good corrections, but with some grammatical errors or missing words. |
+ | **0.5 - 0.6** | Decent corrections, noticeable mistakes, lacks fluency. |
+ | **Below 0.5** | Needs improvement, frequent incorrect corrections. |
+
+
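The bands in the table above can be turned into a small lookup. This is a sketch under one assumption that the table does not state: each band includes its lower bound, so a score of exactly 0.8 falls in the top band. The names `BLEU_BANDS` and `interpret_bleu` are illustrative.

```python
# (lower bound, interpretation) pairs copied from the table, highest band first.
BLEU_BANDS = [
    (0.8, "Near-perfect corrections, closely matching human annotations."),
    (0.7, "High-quality corrections, minor variations in phrasing."),
    (0.6, "Good corrections, but with some grammatical errors or missing words."),
    (0.5, "Decent corrections, noticeable mistakes, lacks fluency."),
]
BELOW_ALL_BANDS = "Needs improvement, frequent incorrect corrections."


def interpret_bleu(score):
    """Map a BLEU score in [0, 1] to the interpretation from the table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("BLEU score must be between 0 and 1")
    for lower_bound, interpretation in BLEU_BANDS:
        if score >= lower_bound:  # assumption: lower bound is inclusive
            return interpretation
    return BELOW_ALL_BANDS


print(interpret_bleu(0.8888))
```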
+ ## ⚡ Quantization Details
+
+ Post-training quantization was applied using PyTorch's built-in quantization framework. The model weights were converted to Float16 (FP16), which halves their storage size relative to FP32 and can improve inference efficiency, with only a small expected loss in accuracy.
+
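The trade-off described above can be illustrated without loading the model. The sketch below uses Python's `struct` module, whose `e` format is IEEE 754 half precision, to show the rounding FP16 introduces, and estimates weight storage assuming roughly 60 million parameters for `t5-small`. That parameter count is a rounded assumption, not a figure taken from this repository, and the helper names are illustrative.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def weights_size_mb(num_params, bytes_per_param):
    """Approximate storage for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


APPROX_T5_SMALL_PARAMS = 60_000_000  # assumption: rounded parameter count

print("FP32 weights (MB):", weights_size_mb(APPROX_T5_SMALL_PARAMS, 4))
print("FP16 weights (MB):", weights_size_mb(APPROX_T5_SMALL_PARAMS, 2))

# FP16 keeps about three significant decimal digits, so most values shift slightly.
for value in (0.1, 3.14159, 1000.5):
    print(value, "->", to_fp16(value))
```

Halving the bytes per weight halves storage, while the per-weight rounding error is the source of the slight accuracy loss mentioned under Limitations.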
+ ## 📂 Repository Structure
+
+ ```
+ .
+ ├── model/               # Contains the quantized model files
+ ├── tokenizer_config/    # Tokenizer configuration and vocabulary files
+ ├── model.safetensors    # Quantized model weights
+ └── README.md            # Model documentation
+ ```
+
+ ## ⚠️ Limitations
+
+ - The model may struggle with highly ambiguous sentences.
+ - Quantization may lead to slight degradation in accuracy compared to full-precision models.
+ - Performance may vary across different writing styles and sentence structures.
+
+ ## 🤝 Contributing
+
+ Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.