Grammatical Error Detection for Russian

This model detects grammatical errors, misspellings, and typos in Russian text, as detailed in the preprint arXiv:2505.04507v1.

Model Functionality

Task: Binary classification of whether a text contains defects (grammatical errors, misspellings, or typos).
Input: Russian text, which can range from a single sentence to a full paragraph.
Output: The text "True" if any errors are detected; "False" if the text is judged error-free.

Key Notes

The model is designed to handle multi-sentence context.
It only performs detection and provides a binary output. It does not locate the specific errors or suggest corrections.
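
Because the verdict is generated as text rather than returned as a Python boolean, downstream code usually has to convert it. The following is a minimal sketch, assuming the decoded output is exactly "True" or "False" as in the usage example; `parse_verdict` is a hypothetical helper and not part of the model or of transformers.

```python
def parse_verdict(decoded: str) -> bool:
    """Map the model's decoded text to a bool.

    Hypothetical helper: assumes the only valid outputs are "True" and "False".
    Anything else raises, so that a malformed generation is not silently
    treated as "no defects".
    """
    verdict = decoded.strip()
    if verdict == "True":
        return True
    if verdict == "False":
        return False
    raise ValueError(f"Unexpected model output: {decoded!r}")
```

Raising on unexpected text is a deliberate choice here: defaulting to False would silently count a malformed generation as error-free text.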

Usage example

import torch
import transformers

# Load the tokenizer and the model (half precision, first CUDA device).
ged_model_path = "inkoziev/ged-FRED-T5-1.7B"
ged_tokenizer = transformers.AutoTokenizer.from_pretrained(ged_model_path)
ged_model = transformers.T5ForConditionalGeneration.from_pretrained(ged_model_path,
                                                                    device_map='cuda:0',
                                                                    torch_dtype=torch.half)

# Input texts to check for grammatical or spelling defects.
# The first text contains errors (missing commas, a misspelled compound, a wrong case ending);
# the second is its corrected version.
input_texts = ["Расскажу как настроить плагин и сделать быструю домашнюю страничку, расскажу как найти продвинутые гайды для создания сверх-эстетичной домашней страничке.",
               "Расскажу, как настроить плагин и сделать быструю домашнюю страничку, расскажу, как найти продвинутые гайды для создания сверхэстетичной домашней странички."]

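# Instruction prompt (in Russian) that is prepended to every input text. English gloss:
# "Analyze the text given below. Identify grammatical and spelling errors in it.
# If there is at least one such error, output 'True'. If the text contains no
# grammatical or spelling errors, output 'False'." followed by the label "Text: ".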
prepend_prompt = "<LM>Проанализируй заданный ниже текст. Идентифицируй в нем грамматические и орфографические ошибки. Если есть хотя бы одна такая ошибка, то выведи 'True'. Если текст не содержит грамматических и орфографических ошибок, выведи 'False'.\n\nТекст: "

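# Tokenize the whole batch, padding shorter rows to the length of the longest one.
batch = ged_tokenizer([(prepend_prompt + input_text) for input_text in input_texts],
                      truncation=False,
                      padding="longest",
                      return_tensors='pt').to(ged_model.device)
# Pass the attention mask explicitly so that padding tokens are ignored.
out_ids = ged_model.generate(input_ids=batch.input_ids,
                             attention_mask=batch.attention_mask,
                             eos_token_id=ged_tokenizer.eos_token_id,
                             max_length=5)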

# The model returns "True" or "False" text for each input row.
for input_text, ids in zip(input_texts, out_ids.cpu().tolist()):
    # Drop the leading decoder start token and everything from the EOS token onward.
    verdict_ids = ids[1:ids.index(ged_tokenizer.eos_token_id)]
    has_defects = ged_tokenizer.decode(verdict_ids)
    print(f"{input_text} ==> {has_defects}")

The output should be:

Расскажу как настроить плагин и сделать быструю домашнюю страничку, расскажу как найти продвинутые гайды для создания сверх-эстетичной домашней страничке. ==> True
Расскажу, как настроить плагин и сделать быструю домашнюю страничку, расскажу, как найти продвинутые гайды для создания сверхэстетичной домашней странички. ==> False
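
The usage example sends all texts in a single padded batch, which may not fit in GPU memory for long lists. The steps above can be wrapped in a small batching loop, sketched below under stated assumptions: `detect_in_batches` and its `generate_verdicts` callable are hypothetical names, and `generate_verdicts` is assumed to wrap the tokenize, generate, and decode steps of the usage example, returning one decoded string per prompt. The default batch size of 8 is an arbitrary placeholder.

```python
from typing import Callable, List, Sequence


def detect_in_batches(texts: Sequence[str],
                      prepend_prompt: str,
                      generate_verdicts: Callable[[List[str]], List[str]],
                      batch_size: int = 8) -> List[str]:
    """Return one decoded verdict per input text, in input order.

    generate_verdicts is a hypothetical callable: it receives the prompts of
    one batch and must return the decoded model output for each of them.
    """
    verdicts: List[str] = []
    for start in range(0, len(texts), batch_size):
        # Build the prompts for the current slice of texts.
        prompts = [prepend_prompt + text for text in texts[start:start + batch_size]]
        batch_verdicts = generate_verdicts(prompts)
        if len(batch_verdicts) != len(prompts):
            raise ValueError("generate_verdicts must return one verdict per prompt")
        verdicts.extend(batch_verdicts)
    return verdicts
```

Keeping the model call behind a callable means the batching logic can be exercised with a stub function, without loading the 1.7B model.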

Metrics

The model was evaluated on the RUPOR dataset (which is not yet public) and on several open-source datasets:

Domain	Population	F1	F0.5	Precision	Recall
RUPOR poetry	4508	0.838	0.88	0.911	0.776
RUPOR prose	3998	0.882	0.911	0.932	0.837
RuCoLa	3998	0.268	0.464	0.905	0.158
RuBLiMP	3992	0.948	0.943	0.94	0.956
rlc-toloka (ru)	3992	0.801	0.846	0.878	0.737
SAGE MultidomainGold	2133	0.858	0.911	0.951	0.782
SAGE RUSpellRU	1233	0.941	0.963	0.979	0.906
SAGE MedSpellchecker	325	0.969	0.983	0.993	0.947
SAGE GitHubTypoCorpusRu	307	0.782	0.84	0.883	0.702
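
The two F-score columns follow from precision and recall through the standard F-beta formula, where beta = 0.5 weights precision more heavily than recall. The sketch below recomputes both scores for the RUPOR poetry row; `f_beta` is an illustrative helper, and because the table's precision and recall are already rounded, recomputed values for other rows may differ from the table in the last digit.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# RUPOR poetry row from the table above.
precision, recall = 0.911, 0.776
print(round(f_beta(precision, recall, beta=1.0), 3))  # 0.838
print(round(f_beta(precision, recall, beta=0.5), 3))  # 0.88
```

This also explains the RuCoLa row: with high precision (0.905) and low recall (0.158), F0.5 is noticeably higher than F1 because it penalizes the missed errors less.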
