Model Card for impresso-project/language-identifier

Overview

impresso-project/language-identifier is a multilingual language identification model fine-tuned for historical newspaper content. It supports German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb), the core languages of the Impresso Project, which analyzes historical media across national and linguistic borders.

The model is adapted to the short, OCR-noisy, and fragmentary inputs typical of digitized historical texts.

Model Details

Model Description

  • Developed by: University of Zurich (UZH), as part of the Impresso team. Impresso is an interdisciplinary project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation (grants CRSII5_173719 and CRSII5_213585) and the Luxembourg National Research Fund (grant No. 17498891).
  • Model type: Language identification using a transformer-based classification architecture
  • Languages: French, German, English, Italian, Luxembourgish
  • License: AGPL-3.0
  • Finetuned from: Custom model trained on historical newspaper data from the Impresso corpus

How to Use

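from transformers import pipeline

MODEL_NAME = "impresso-project/language-identifier"

# "langident" is a custom pipeline task defined in the model repository,
# so trust_remote_code=True is needed to load it.
lang_pipeline = pipeline(
    "langident",
    model=MODEL_NAME,
    trust_remote_code=True,
    device="cpu",
)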

text = """En l'an 1348, au plus fort des ravages de la peste noire à travers
l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et
face à une opportunité."""

langs = lang_pipeline(text)
print(langs)

Output Format

The output is a single dictionary with the predicted language and confidence score:

{
  "language": "fr",
  "score": 1.0
}
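
Because each prediction is a plain dictionary, downstream code can gate on the confidence score. The following is a minimal sketch, assuming the output shape shown above. The helper name `accept_prediction`, the 0.9 threshold, and the fallback label `"und"` (undetermined) are illustrative choices and not part of the model's API.

```python
from typing import Mapping

# Language codes listed in this model card.
SUPPORTED = {"de", "fr", "it", "en", "lb"}


def accept_prediction(
    pred: Mapping[str, object],
    min_score: float = 0.9,   # hypothetical threshold, tune on your own data
    fallback: str = "und",    # hypothetical label for rejected predictions
) -> str:
    """Return the predicted language code, or `fallback` if the
    prediction is weak or names an unsupported language."""
    lang = pred.get("language")
    score = pred.get("score", 0.0)
    if lang in SUPPORTED and isinstance(score, (int, float)) and score >= min_score:
        return str(lang)
    return fallback


print(accept_prediction({"language": "fr", "score": 1.0}))   # fr
print(accept_prediction({"language": "lb", "score": 0.42}))  # und
```

In practice the dictionary would come from `lang_pipeline(text)`. A suitable threshold depends on the corpus and should be checked against a labeled sample.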

Use Cases

  • Preprocessing for OCR and NLP tasks on historical corpora
  • Document and segment-level language tagging
  • Filtering and sorting multilingual newspaper archives
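
As an illustration of the filtering and sorting use case, the sketch below buckets text segments by predicted language. It assumes only that the identifier is a callable returning a dictionary with a `"language"` key, as in the output format above. `group_by_language` and the stand-in `fake_identify` are hypothetical names used so the example runs without downloading the model.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Mapping


def group_by_language(
    texts: Iterable[str],
    identify: Callable[[str], Mapping[str, object]],
) -> Dict[str, List[str]]:
    """Bucket texts by the language code returned by `identify`."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        pred = identify(text)
        groups[str(pred["language"])].append(text)
    return dict(groups)


def fake_identify(text: str) -> Mapping[str, object]:
    """Toy stand-in for the real pipeline; replace with `lang_pipeline`."""
    return {"language": "fr" if "le " in text.lower() else "de", "score": 1.0}


segments = ["Le Royaume de France", "Die Zeitung von gestern"]
print(group_by_language(segments, fake_identify))
```

Passing `lang_pipeline` in place of `fake_identify` should work as long as the pipeline returns one dictionary per input string, as described in this card.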

Limitations

  • Works best on sentence- or paragraph-length texts
  • May struggle with code-switching or OCR-degraded text that mixes languages
  • Primarily optimized for Impresso-like sources (19th–20th century newspapers)
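
One way to work within these limitations is to skip very short fragments and derive a document-level label from the remaining paragraphs. The sketch below uses a length-weighted vote. This is an assumed aggregation strategy, not the method used by the Impresso pipeline, and `document_language`, `min_chars`, and `toy_identify` are hypothetical names.

```python
from collections import Counter
from typing import Callable, Iterable, Mapping, Optional


def document_language(
    paragraphs: Iterable[str],
    identify: Callable[[str], Mapping[str, object]],
    min_chars: int = 20,  # hypothetical cutoff for "too short to trust"
) -> Optional[str]:
    """Return the language with the most characters of support,
    ignoring paragraphs shorter than `min_chars`."""
    votes: Counter = Counter()
    for paragraph in paragraphs:
        paragraph = paragraph.strip()
        if len(paragraph) < min_chars:
            continue  # very short fragments are skipped rather than classified
        pred = identify(paragraph)
        votes[str(pred["language"])] += len(paragraph)
    if not votes:
        return None  # nothing long enough to classify
    return votes.most_common(1)[0][0]


def toy_identify(text: str) -> Mapping[str, object]:
    """Toy stand-in for the real pipeline; replace with `lang_pipeline`."""
    return {"language": "de" if "der " in text.lower() else "fr", "score": 1.0}


doc = [
    "Der Bundesrat hat gestern in Bern getagt.",
    "Oui.",
    "La séance est levée à midi.",
]
print(document_language(doc, toy_identify))  # de
```

For documents that mix languages, inspecting the full vote counts instead of only the top label may be more informative.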

Installation

pip install transformers floret

