---
license: cc-by-nc-4.0
language:
- ru
base_model:
- ai-forever/FRED-T5-1.7B
---

# Grammatical Error Detection for Russian

This model detects grammatical errors, misspellings, and typos in Russian text, as detailed in the preprint [arXiv:2505.04507v1](https://arxiv.org/abs/2505.04507).

## Model Functionality

- **Task:** Binary classification for the presence of defects in text.
- **Input:** Russian text, which can range from a single sentence to a full paragraph.
- **Output:** `True` if any errors are detected; `False` if the text is error-free.

## Key Notes

- The model is designed to handle multi-sentence context.
- It only performs detection and provides a binary output. It does not locate the specific errors or suggest corrections.

## Usage Example

```python
import torch
import transformers

ged_model_path = "inkoziev/ged-FRED-T5-1.7B"
ged_tokenizer = transformers.AutoTokenizer.from_pretrained(ged_model_path)
ged_model = transformers.T5ForConditionalGeneration.from_pretrained(ged_model_path, device_map='cuda:0', torch_dtype=torch.half)

# Input texts to check for grammatical or orthographic defects.
input_texts = ["Расскажу как настроить плагин и сделать быструю домашнюю страничку, расскажу как найти продвинутые гайды для создания сверх-эстетичной домашней страничке.",
               "Расскажу, как настроить плагин и сделать быструю домашнюю страничку, расскажу, как найти продвинутые гайды для создания сверхэстетичной домашней странички."]

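# Build the instruction prompt. In English it reads: "Analyze the text below.
# Identify grammatical and orthographic errors in it. If there is at least one
# such error, output 'True'. If the text contains no such errors, output 'False'."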
prepend_prompt = "<LM>Проанализируй заданный ниже текст. Идентифицируй в нем грамматические и орфографические ошибки. Если есть хотя бы одна такая ошибка, то выведи 'True'. Если текст не содержит грамматических и орфографических ошибок, выведи 'False'.\n\nТекст: "

xx = ged_tokenizer([(prepend_prompt + input_text) for input_text in input_texts],
                   truncation=False,
                   padding="longest",
                   return_tensors='pt').to(ged_model.device)
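# Pass the attention mask explicitly so that padding tokens in the batch are ignored.
out_ids = ged_model.generate(input_ids=xx.input_ids, attention_mask=xx.attention_mask, eos_token_id=ged_tokenizer.eos_token_id, max_length=5)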

# The model returns "True" or "False" text for each input row.
eos_id = ged_tokenizer.eos_token_id
for input_text, token_ids in zip(input_texts, out_ids.cpu().tolist()):
    # Skip the leading decoder start token and cut at EOS if it was generated.
    end = token_ids.index(eos_id) if eos_id in token_ids else len(token_ids)
    has_defects = ged_tokenizer.decode(token_ids[1:end])
    print(f"{input_text} ==> {has_defects}")
```

The output should be:

```
Расскажу как настроить плагин и сделать быструю домашнюю страничку, расскажу как найти продвинутые гайды для создания сверх-эстетичной домашней страничке. ==> True
Расскажу, как настроить плагин и сделать быструю домашнюю страничку, расскажу, как найти продвинутые гайды для создания сверхэстетичной домашней странички. ==> False
```
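
Since the model emits its verdict as text, downstream code usually wants a real boolean. The helper below is a minimal sketch of that conversion; `parse_ged_label` is a hypothetical name, not part of the model or of `transformers`, and it assumes the decoded string is exactly `True` or `False`, possibly surrounded by whitespace. Anything else is reported as `None` rather than guessed.

```python
from typing import Optional


def parse_ged_label(decoded: str) -> Optional[bool]:
    """Map the decoded model output to a bool (hypothetical helper).

    Returns True if the model reported defects, False if it reported none,
    and None if the text is neither "True" nor "False".
    """
    label = decoded.strip()
    if label == "True":
        return True
    if label == "False":
        return False
    return None  # Unexpected output: let the caller decide how to handle it.
```

In the loop above, this would be called as `parse_ged_label(has_defects)`; treating `None` as a separate case keeps truncated or malformed generations from being silently counted as "no errors".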

## Metrics

The model was evaluated on the RUPOR dataset (which is not yet public) and on several open-source datasets:

| Domain                   | Population | F<sub>1</sub> | F<sub>0.5</sub> | Precision | Recall |
| ------------------------ | ---------- | ------------- | --------------- | --------- | ------ |
| RUPOR poetry             | 4508       | 0.838         | 0.88            | 0.911     | 0.776  |
| RUPOR prose              | 3998       | 0.882         | 0.911           | 0.932     | 0.837  |
| RuCoLa                   | 3998       | 0.268         | 0.464           | 0.905     | 0.158  |
| RuBLiMP                  | 3992       | 0.948         | 0.943           | 0.94      | 0.956  |
| rlc-toloka (ru)          | 3992       | 0.801         | 0.846           | 0.878     | 0.737  |
| SAGE MultidomainGold     | 2133       | 0.858         | 0.911           | 0.951     | 0.782  |
| SAGE RUSpellRU           | 1233       | 0.941         | 0.963           | 0.979     | 0.906  |
| SAGE MedSpellchecker     | 325        | 0.969         | 0.983           | 0.993     | 0.947  |
| SAGE GitHubTypoCorpusRu  | 307        | 0.782         | 0.84            | 0.883     | 0.702  |
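
The F<sub>1</sub> and F<sub>0.5</sub> columns follow from precision and recall through the standard F<sub>β</sub> formula, F<sub>β</sub> = (1 + β²)·P·R / (β²·P + R). The sketch below recomputes them for one row as a sanity check; `f_beta` is an illustrative helper, not code from the evaluation pipeline. Because the table's precision and recall are already rounded, recomputed values can differ from the published ones in the last digit.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    denominator = beta**2 * precision + recall
    if denominator == 0.0:
        return 0.0  # Convention when both precision and recall are zero.
    return (1.0 + beta**2) * precision * recall / denominator


# "RUPOR poetry" row: Precision = 0.911, Recall = 0.776.
precision, recall = 0.911, 0.776
print(round(f_beta(precision, recall, beta=1.0), 3))  # 0.838, as in the table
print(round(f_beta(precision, recall, beta=0.5), 3))  # 0.88, as in the table
```

F<sub>0.5</sub> weights precision more heavily than recall, which is why rows with high precision but low recall, such as RuCoLa, score noticeably higher on F<sub>0.5</sub> than on F<sub>1</sub>.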