Papers
arxiv:2510.04003

Enhancing OCR for Sino-Vietnamese Language Processing via Fine-tuned PaddleOCRv5

Published on Oct 5
Authors:
,

Abstract

A fine-tuning approach for PaddleOCRv5 improves character recognition accuracy on Han-Nom texts, particularly in noisy images, and supports downstream applications like semantic alignment and historical linguistics.

AI-generated summary

Recognizing and processing Classical Chinese (Han-Nom) texts play a vital role in digitizing Vietnamese historical documents and enabling cross-lingual semantic research. However, existing OCR systems struggle with degraded scans, non-standard glyphs, and handwriting variations common in ancient sources. In this work, we propose a fine-tuning approach for PaddleOCRv5 to improve character recognition on Han-Nom texts. We retrain the text recognition module using a curated subset of ancient Vietnamese Chinese manuscripts, supported by a full training pipeline covering preprocessing, LMDB conversion, evaluation, and visualization. Experimental results show a significant improvement over the base model, with exact accuracy increasing from 37.5 percent to 50.0 percent, particularly under noisy image conditions. Furthermore, we develop an interactive demo that visually compares pre- and post-fine-tuning recognition results, facilitating downstream applications such as Han-Vietnamese semantic alignment, machine translation, and historical linguistics research. The demo is available at https://huggingface.co/spaces/MinhDS/Fine-tuned-PaddleOCRv5.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.04003 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.04003 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.04003 in a Space README.md to link it from this page.

Collections including this paper 1