---
library_name: transformers
language:
- fr
- de
license: gpl-3.0
tags:
- ocr
- bloomfilter
- unigram
- impresso
- quality-assessment
- v1.0.6
---

# Model Card for `impresso-project/ocr-quality-assessor-unigram-light`

## Overview

This model is a **lightweight OCR quality assessor** for historical French and German texts. It is a streamlined version of the original [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram), now accessible via a Hugging Face `pipeline` for convenient integration into downstream tasks.

It uses **Bloom filters** containing known word unigrams to evaluate text quality by measuring the proportion of known words in OCR output. It is part of the [Impresso Project](https://impresso-project.ch), which develops tools for media archive processing and exploration.

## Model Details

### Model Description

- **Developed by:** University of Zurich (UZH), as part of the [Impresso team](https://impresso-project.ch). Impresso is an interdisciplinary project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
- **Model type:** Bloom filter–based scoring via a Transformers-compatible pipeline
- **Languages:** French (fr), German (de)
- **License:** GPL-3.0
- **Base resource:** [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram)
- **Interface:** Hugging Face `transformers` pipeline
- **Input format:** Raw text string
- **Output format:** Float score representing OCR quality

## How to Use

```python
from transformers import pipeline

MODEL_NAME = "impresso-project/ocr-quality-assessor-unigram-light"

# trust_remote_code=True is required because the custom pipeline code
# is loaded from the model repository.
ocrqa_pipeline = pipeline(
    "ocr-qa-assessment",
    model=MODEL_NAME,
    trust_remote_code=True,
    device="cpu",
)

sentence = """En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité."""

score = ocrqa_pipeline(sentence)
print(score)
```

## Output Format

Returns the proportion of known tokens as a float, under the key `ocr_quality_score`:

```python
{'ocr_quality_score': 0.76}
```

## Use Cases

- OCR pipeline evaluation and quality diagnostics
- Automated scoring of OCR segments or lines
- Quick feedback in web-based transcription and correction tools

## Dataset and Preprocessing

The Bloom filters used internally are derived from:

- Wikipedia dumps (historical and modern)
- Impresso-specific lexical resources

Text normalization includes:

- Unicode NFKC normalization
- Digit masking (digits replaced by `0`)
- Punctuation and symbol removal
- Lowercasing

## Limitations

- Currently supports only **French** and **German**
- Does not provide spell correction suggestions
- False positives are possible: a Bloom filter can occasionally report an unknown word as known, which may slightly inflate the score
- Quality score is approximate and works best at the **segment** or **line** level

## Environmental Impact

- **Hardware:** Standard laptop / CPU inference
- **Training:** Reuse of existing Bloom filters; minimal additional compute
- **Estimated Emissions:** < 0.01 kg CO₂eq

## Contact

- Website: [https://impresso-project.ch](https://impresso-project.ch)
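
## Appendix: Scoring Sketch

The scoring idea from the Overview and the normalization steps listed under Dataset and Preprocessing can be illustrated with a minimal sketch. It is not the pipeline's actual implementation. The function names `normalize` and `known_word_ratio` are hypothetical, a plain Python `set` stands in for the Bloom filter, and the order of the normalization steps, the whitespace tokenization, and the score of `0.0` for empty input are all assumptions. The card also does not say whether the real pipeline counts every token or only unique word types; this sketch counts every token.

```python
import re
import unicodedata
from typing import AbstractSet


def normalize(text: str) -> str:
    """Apply the four normalization steps named in the card (order assumed)."""
    text = unicodedata.normalize("NFKC", text)  # Unicode NFKC normalization
    text = text.lower()                         # lowercasing
    text = re.sub(r"\d", "0", text)             # digit masking: every digit -> "0"
    # Replace punctuation and symbols (including "_") with a space.
    text = re.sub(r"[^\w\s]|_", " ", text)
    return text


def known_word_ratio(text: str, known_words: AbstractSet[str]) -> float:
    """Share of whitespace-separated tokens found in `known_words`.

    `known_words` is a plain set standing in for a Bloom filter (assumption).
    """
    tokens = normalize(text).split()
    if not tokens:
        return 0.0  # assumed convention for empty input
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


# Toy lexicon for illustration only; the real filters hold large word lists.
lexicon = {"en", "l", "an", "0000", "le", "royaume", "de", "france"}

print(known_word_ratio("En l'an 1348, le Royaume de France", lexicon))  # 1.0
print(known_word_ratio("En l'an 1348, lc Rovaume dc Francc", lexicon))  # 0.5
```

A real Bloom filter answers the same membership question in far less memory than a set, at the cost of occasional false positives, which is why the score should be read as approximate.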