Aleph-Alpha-GermanWeb-Grammar-Classifier-fastText

Aleph-Alpha-GermanWeb-Grammar-Classifier-fastText is a model that was used in the creation of Aleph-Alpha-GermanWeb, a new German-language dataset that combines heuristic and model-based filtering techniques with synthetic data generation to achieve state-of-the-art performance on German-language benchmarks.

Here we provide one of our quality classification models, a fastText model, along with inference code. This model is released as part of a collection of four text quality classification models.

To train Aleph-Alpha-GermanWeb-Grammar-Classifier-fastText, we used LanguageTool to annotate a random subset of 400,000 German FineWeb2 documents with the DE_AGREEMENT rule, which flags text passages containing grammatical agreement errors. From the annotated documents, we randomly selected 75,000 without any flagged grammar error as high-quality examples and 75,000 with at least one flagged grammar error as low-quality examples.
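
The sampling step described above can be sketched as follows. This is a minimal illustration, not the original training code: the function name `build_fasttext_lines`, the `low_quality` label name, and the seed are assumptions. Only the `high_quality` label and the newline replacement appear in the example snippet further down.

```python
import random


def build_fasttext_lines(clean_docs, error_docs, n_per_class, seed=0):
    """Sample a balanced set and format it as fastText supervised training lines.

    clean_docs: documents with no flagged grammar error
    error_docs: documents with at least one flagged grammar error
    """
    rng = random.Random(seed)
    high = rng.sample(clean_docs, n_per_class)
    low = rng.sample(error_docs, n_per_class)

    def to_line(label, doc):
        # fastText expects one example per line, so newlines are replaced by spaces.
        return "__label__" + label + " " + doc.replace("\n", " ")

    lines = [to_line("high_quality", d) for d in high]
    # "low_quality" is an assumed label name; the source only shows "high_quality".
    lines += [to_line("low_quality", d) for d in low]
    rng.shuffle(lines)
    return lines
```

With the numbers given above, `n_per_class` would be 75,000, yielding 150,000 labelled lines.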

We trained Aleph-Alpha-GermanWeb-Grammar-Classifier-fastText on 95% of the data to classify the high- and low-quality examples and used the remaining 5% for validation, reaching a precision of 63% and a recall of 63% on the validation set.
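
As a rough sketch of the split and the metrics, the code below holds out 5% of the labelled lines and computes precision and recall for the `high_quality` class. The helper names, the seed, and the choice of `high_quality` as the positive class are assumptions; how the reported figures were actually computed is not stated here.

```python
import random


def split_train_valid(lines, valid_fraction=0.05, seed=0):
    """Shuffle the labelled lines and hold out a fraction for validation."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(round(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


def precision_recall(gold, predicted, positive="high_quality"):
    """Precision and recall for one class, given parallel lists of labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

For the 150,000 examples described above, this split gives 142,500 training and 7,500 validation documents. If the reported figures are instead fastText's built-in precision@1 and recall@1 (an assumption), the two values are equal by construction whenever each document carries exactly one label.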

Further details can be found in our accompanying paper (link to paper coming soon).

Example Snippet

import fasttext
from huggingface_hub import hf_hub_download


model_path = hf_hub_download(repo_id="Aleph-Alpha/Aleph-Alpha-GermanWeb-Grammar-Classifier-fastText", filename="model.bin")
model = fasttext.load_model(model_path)

text = "Das ist ein Beispieltext, um die Grammatik zu überprüfen."

pre_processed_document = text.replace("\n", " ")

predicted_class, prob = model.predict(pre_processed_document)
predicted_label = predicted_class[0].replace("__label__", "")
document_score = prob[0]
# Similar to https://github.com/NVIDIA/NeMo-Curator/blob/31c8171434205e62f6a7d38565ffd9cb4c2806b7/nemo_curator/filters/classifier_filter.py#L47 ,
# the document score is the probability of the predicted class if the predicted label is "high_quality";
# otherwise it is 1 - that probability.

if predicted_label != "high_quality":
    document_score = 1 - document_score

print(predicted_label, document_score)
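
To score several documents at once, the same logic can be wrapped in a small helper. This is a sketch under assumptions: `high_quality_score` and `score_documents` are hypothetical names, and `predict` is expected to behave like `model.predict` called with a list of strings, returning one list of labels and one sequence of probabilities per document.

```python
def high_quality_score(label, prob):
    """Convert a top-1 prediction into a score for the 'high_quality' class."""
    name = label.replace("__label__", "")
    prob = float(prob)
    # For any other predicted label, use the complement of its probability.
    return prob if name == "high_quality" else 1.0 - prob


def score_documents(predict, documents):
    """Score a list of documents using a fastText-style predict callable."""
    # fastText reads one example per line, so newlines are replaced first.
    cleaned = [doc.replace("\n", " ") for doc in documents]
    labels, probs = predict(cleaned)
    return [high_quality_score(l[0], p[0]) for l, p in zip(labels, probs)]
```

With the model loaded as above, a call such as `score_documents(model.predict, [text])` would return one score per document. Any cut-off applied to these scores is a separate choice and is not specified here.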
