---
library_name: transformers
language:
- fr
- de
license: gpl-3.0
tags:
- ocr
- bloomfilter
- unigram
- impresso
- quality-assessment
- v1.0.6
---

# Model Card for `impresso-project/ocr-quality-assessor-unigram-light`

## Overview

This model is a **lightweight OCR quality assessor** for historical French and German texts. It is a streamlined version of the original [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram), now accessible via a Hugging Face `pipeline` for convenient integration into downstream tasks.

It uses **Bloom filters** containing known word unigrams to evaluate text quality by measuring the proportion of known vs. unknown words in OCR outputs. It is part of the [Impresso Project](https://impresso-project.ch), which develops tools for media archive processing and exploration.
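The idea of scoring text by the share of known words can be sketched in a few lines. The snippet below is an illustration only, not the model's actual implementation: `ToyBloomFilter`, `known_word_ratio`, the filter size, and the hashing scheme are all hypothetical, and whether the real pipeline counts repeated tokens or unique ones is not stated in this card.

```python
import hashlib


class ToyBloomFilter:
    """Minimal Bloom filter for illustration; not the filter shipped with the model."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, word):
        # Derive num_hashes positions by salting a SHA-256 hash with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos] = 1

    def __contains__(self, word):
        # True if every position is set; may be a false positive, never a false negative.
        return all(self.bits[pos] for pos in self._positions(word))


def known_word_ratio(tokens, bloom):
    """Proportion of tokens found in the filter (0.0 for empty input)."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in bloom)
    return known / len(tokens)


# Hypothetical four-word lexicon standing in for the real unigram lists.
lexicon = ToyBloomFilter()
for word in ["le", "royaume", "de", "france"]:
    lexicon.add(word)

tokens = ["le", "royaume", "de", "frcnce"]  # last token simulates an OCR error
ratio = known_word_ratio(tokens, lexicon)
print(ratio)
```

The three correctly spelled tokens are always found, so the ratio is at least 0.75; the misspelled token is counted as known only in the rare case of a hash collision.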

## Model Details
### Model Description

- **Developed by:** University of Zurich (UZH) members of the [Impresso team](https://impresso-project.ch). Impresso is an interdisciplinary research project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
- **Model type:** Bloom filter–based scoring via a Transformers-compatible pipeline
- **Languages:** French (fr), German (de)
- **License:** GPL-3.0
- **Base resource:** [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram)
- **Interface:** Hugging Face `transformers` pipeline
- **Input format:** Raw text string
- **Output format:** Dictionary with a float score (`ocr_quality_score`) representing OCR quality

## How to Use

```python
from transformers import pipeline

MODEL_NAME = "impresso-project/ocr-quality-assessor-unigram-light"

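# trust_remote_code=True is required because the custom pipeline code
# is loaded from the model repository rather than from transformers itself.
ocrqa_pipeline = pipeline(
    "ocr-qa-assessment",
    model=MODEL_NAME,
    trust_remote_code=True,
    device="cpu",
)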

sentence = """En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe,
          le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité."""

score = ocrqa_pipeline(sentence)
print(score)
```

## Output Format

Returns a dictionary whose `ocr_quality_score` field is a float indicating the proportion of known tokens:

```python
{'ocr_quality_score': 0.76}
```
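A downstream step might read the score and flag weak segments. The following is a minimal sketch assuming each result has the dictionary shape shown above; the helper `flag_low_quality` and the `0.5` threshold are hypothetical and should be tuned on your own data.

```python
def flag_low_quality(results, threshold=0.5):
    """Return the indices of results whose score falls below the threshold."""
    return [
        index
        for index, result in enumerate(results)
        if result["ocr_quality_score"] < threshold
    ]


# Example results in the documented output shape (values are illustrative).
results = [
    {"ocr_quality_score": 0.76},
    {"ocr_quality_score": 0.31},
]
flagged = flag_low_quality(results)
print(flagged)  # [1]
```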

## Use Cases

- OCR pipeline evaluation and quality diagnostics
- Automated scoring of OCR segments or lines
- Quick feedback in web-based transcription and correction tools

## Dataset and Preprocessing

The Bloom filters used internally are derived from:
- Wikipedia dumps (historical and modern)
- Impresso-specific lexical resources

Text normalization includes:
- Unicode NFKC normalization
- Digit masking (each digit replaced with `0`)
- Punctuation and symbol removal
- Lowercasing
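
The normalization steps listed above could look roughly like this. It is a sketch under assumptions, not the model's actual preprocessing code: the function name `normalize_text`, the exact regular expressions, and the choice to replace punctuation with a space (so that `l'an` splits into two tokens) are all illustrative.

```python
import re
import unicodedata


def normalize_text(text):
    """Apply the four listed steps and return whitespace-separated tokens."""
    # 1. Unicode NFKC normalization (e.g. ligatures become plain letters).
    text = unicodedata.normalize("NFKC", text)
    # 2. Digit masking: every digit becomes "0".
    text = re.sub(r"\d", "0", text)
    # 3. Punctuation and symbol removal (assumption: replaced by a space).
    text = re.sub(r"[^\w\s]|_", " ", text)
    # 4. Lowercasing.
    text = text.lower()
    return text.split()


tokens = normalize_text("L'an 1348, à Paris!")
print(tokens)  # ['l', 'an', '0000', 'à', 'paris']
```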

## Limitations

- Currently supports only **French** and **German**
- Does not provide spell correction suggestions
- False positives are possible: a Bloom filter can report an unknown word as known, which may slightly inflate the score
- Quality score is approximate and works best at the **segment** or **line** level
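
To get a feel for how often a Bloom filter reports an unknown word as known, the standard approximation `(1 - e^(-k*n/m))^k` can be evaluated directly. The parameters below are hypothetical and chosen only for illustration; the actual sizes of the filters behind this model are not stated in this card.

```python
import math


def bloom_false_positive_rate(num_items, size_bits, num_hashes):
    """Approximate false-positive probability of a Bloom filter.

    num_items  (n): number of words inserted
    size_bits  (m): number of bits in the filter
    num_hashes (k): number of hash functions
    """
    if size_bits <= 0 or num_hashes <= 0:
        raise ValueError("size_bits and num_hashes must be positive")
    return (1.0 - math.exp(-num_hashes * num_items / size_bits)) ** num_hashes


# Hypothetical setup: 1M words, 10 bits per word, 7 hash functions.
rate = bloom_false_positive_rate(
    num_items=1_000_000, size_bits=10_000_000, num_hashes=7
)
print(f"{rate:.4f}")  # about 0.008
```

With these illustrative settings, fewer than one in a hundred unknown words would be mistaken for a known one.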

## Environmental Impact

- **Hardware:** Standard laptop / CPU inference
- **Training:** Reuse of existing Bloom filters; minimal additional compute
- **Estimated Emissions:** < 0.01 kg CO₂eq

## Contact

- Website: [https://impresso-project.ch](https://impresso-project.ch)

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
</p>