Model Card for impresso-project/language-identifier
Overview
impresso-project/language-identifier is a multilingual language identification model fine-tuned for use on historical newspaper content. It supports German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb), the core languages of the Impresso Project, which focuses on analyzing historical media across national and linguistic borders.
The model has been adapted to the short, OCR-noisy, and fragmentary inputs typical of digitized historical texts.
Model Details
Model Description
- Developed by: The Impresso team at the University of Zurich (UZH). Impresso is an interdisciplinary research project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation (CRSII5_173719, CRSII5_213585) and the Luxembourg National Research Fund (grant No. 17498891).
- Model type: Language identification using a transformer-based classification architecture
- Languages: French, German, English, Italian, Luxembourgish
- License: AGPL-3.0
- Finetuned from: Custom model trained on historical newspaper data from the Impresso corpus
How to Use
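from transformers import pipeline
MODEL_NAME = "impresso-project/language-identifier"
# "langident" is not a built-in transformers task; it is loaded from the
# model repository, which is why trust_remote_code=True is set.
lang_pipeline = pipeline(
    "langident",
    model=MODEL_NAME,
    trust_remote_code=True,
    device="cpu",
)
text = """En l'an 1348, au plus fort des ravages de la peste noire à travers
l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et
face à une opportunité."""
# Run language identification on a single text and print the result.
langs = lang_pipeline(text)
print(langs)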
Output Format
The output is a single dictionary with the predicted language and confidence score:
{
    "language": "fr",
    "score": 1.0
}
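A downstream step will often want to act on the confidence score rather than accept every prediction. The following is a minimal sketch, assuming the output has exactly the dictionary shape shown above. The helper name accept_prediction, the 0.9 cutoff, and the "und" fallback (the ISO 639-2 code for "undetermined") are illustrative choices and are not part of the model or its pipeline.

```python
from typing import Any, Mapping

# Language codes listed in this model card.
SUPPORTED_LANGUAGES = {"de", "fr", "it", "en", "lb"}


def accept_prediction(
    prediction: Mapping[str, Any],
    min_score: float = 0.9,  # hypothetical cutoff, tune it on your own data
    fallback: str = "und",   # ISO 639-2 code for "undetermined"
) -> str:
    """Return the predicted language code, or `fallback` if the score is
    below `min_score` or the code is outside the supported set."""
    language = prediction.get("language")
    score = float(prediction.get("score", 0.0))
    if language in SUPPORTED_LANGUAGES and score >= min_score:
        return language
    return fallback


print(accept_prediction({"language": "fr", "score": 1.0}))   # fr
print(accept_prediction({"language": "fr", "score": 0.42}))  # und
```

In practice the dictionary would come from lang_pipeline(text). A suitable cutoff depends on the corpus and is best chosen by inspecting a labelled sample.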
Use Cases
- Preprocessing for OCR and NLP tasks on historical corpora
- Document and segment-level language tagging
- Filtering and sorting multilingual newspaper archives
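The filtering and sorting use case can be sketched as a small grouping routine. This is an illustration only: group_by_language and stub_predict are hypothetical names, and the stub is a toy keyword rule standing in for the real pipeline so that the example runs without downloading the model. It assumes the predictor returns a dictionary with a "language" key, as described under Output Format.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Mapping

# Any callable that maps a text to a {"language": ..., "score": ...} mapping.
Predictor = Callable[[str], Mapping[str, Any]]


def group_by_language(
    segments: Iterable[str], predict: Predictor
) -> Dict[str, List[str]]:
    """Bucket text segments by their predicted language code."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for segment in segments:
        groups[predict(segment)["language"]].append(segment)
    return dict(groups)


def stub_predict(text: str) -> Dict[str, Any]:
    """Hypothetical stand-in for lang_pipeline, for demonstration only."""
    language = "de" if " der " in f" {text.lower()} " else "fr"
    return {"language": language, "score": 1.0}


segments = ["Der Zug fährt ab.", "Le train part.", "In der Stadt."]
groups = group_by_language(segments, stub_predict)
for language in sorted(groups):
    print(language, groups[language])
```

To use the actual model, pass lang_pipeline in place of stub_predict.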
Limitations
- Works best on sentence- or paragraph-length texts
- May struggle with code-switching or OCR-degraded text that mixes languages
- Primarily optimized for Impresso-like sources (19th–20th century newspapers)
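One way to work within these limitations is to classify a document paragraph by paragraph, skip very short fragments, and report the share of each language instead of a single label. The sketch below assumes a predictor that returns the dictionary described under Output Format. The names language_shares and fake_predict, and the 40-character minimum, are hypothetical choices for illustration, not recommendations from the model authors.

```python
from collections import Counter
from typing import Any, Callable, Dict, Iterable, Mapping

Predictor = Callable[[str], Mapping[str, Any]]


def language_shares(
    paragraphs: Iterable[str],
    predict: Predictor,
    min_chars: int = 40,  # hypothetical minimum length, adjust to your data
) -> Dict[str, float]:
    """Return each language's score-weighted share of a document.

    Paragraphs shorter than `min_chars` are skipped, since very short
    fragments are harder to classify reliably.
    """
    totals: Counter = Counter()
    for paragraph in paragraphs:
        text = paragraph.strip()
        if len(text) < min_chars:
            continue
        prediction = predict(text)
        totals[prediction["language"]] += float(prediction["score"])
    weight = sum(totals.values())
    if weight == 0:
        return {}
    return {language: value / weight for language, value in totals.items()}


# Toy data with hand-assigned labels, standing in for real model output.
labelled = {
    "Le conseil municipal s'est réuni hier soir pour voter le budget.": "fr",
    "La séance a été levée à minuit après de longs débats.": "fr",
    "Der Gemeinderat hat gestern Abend über das Budget abgestimmt.": "de",
    "Page 3": "en",  # too short, skipped before prediction
}


def fake_predict(text: str) -> Dict[str, Any]:
    """Hypothetical stand-in for lang_pipeline, for demonstration only."""
    return {"language": labelled[text], "score": 1.0}


shares = language_shares(labelled.keys(), fake_predict)
print(shares)
```

A document whose largest share is well below 1.0 is a candidate for mixed-language content and may deserve segment-level tags rather than one document-level tag.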
Installation
pip install transformers floret
Contact
- Website: https://impresso-project.ch