# Unigram Tokenizer for English-Hindi Translation Research
This repository contains a Unigram tokenizer trained from scratch on a parallel English-Hindi corpus. It was developed as part of a research project to analyze the impact of different subword tokenization strategies (BPE, WordPiece, and Unigram) on the performance, training stability, and inference latency of a Neural Machine Translation (NMT) model.
**Research Finding:** This Unigram tokenizer demonstrated high efficiency and stability, performing on par with WordPiece and clearly outperforming BPE for this task. It enabled stable model training and is expected to yield a model with low inference latency and high-quality translations.
## Tokenizer Details
This is a `PreTrainedTokenizerFast` object wrapping a tokenizer from the `tokenizers` library.
- Developed by: PABITRA07
- Model type: Unigram Language Model Tokenizer
- Language(s) (NLP): English (en), Hindi (hi)
- License: apache-2.0
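For reference, below is a minimal sketch of how a Unigram tokenizer like this one can be trained from scratch with the `tokenizers` library and wrapped for `transformers`. The corpus file names, vocabulary size, and special tokens are illustrative assumptions, not the exact settings used for this repository.

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers, trainers
from tokenizers.models import Unigram
from transformers import PreTrainedTokenizerFast

# Start from an empty Unigram model.
tokenizer = Tokenizer(Unigram())
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

# Hypothetical settings: the vocab size, special tokens, and corpus
# paths below are assumptions for illustration only.
trainer = trainers.UnigramTrainer(
    vocab_size=32000,
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
    unk_token="<unk>",
)
tokenizer.train(["corpus.en", "corpus.hi"], trainer)
tokenizer.save("tokenizer.json")

# Wrap the trained tokenizer for use with the transformers library.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    unk_token="<unk>",
    pad_token="<pad>",
)
```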
## Uses
This tokenizer is intended to be used with a Transformer-based model for English-to-Hindi machine translation. It is designed to produce efficient, semantically meaningful subword units that lead to stable training and fast inference.
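As an illustration, here is a minimal, hypothetical sketch of encoding a parallel English-Hindi pair for an encoder-decoder setup; the model ID, sentences, and length limit are placeholders, and the NMT model itself is not part of this repository.

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "your-hf-username/your-unigram-repo-name"  # placeholder model ID
)

# Encode the source (English) and target (Hindi) sides of a parallel pair.
# max_length=128 is an illustrative choice, not a repository setting.
source = tokenizer("This is a test.", truncation=True, max_length=128)
target = tokenizer("यह एक परीक्षण है।", truncation=True, max_length=128)

# In a seq2seq training setup, target["input_ids"] would serve as the labels.
print(source["input_ids"], target["input_ids"])
```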
## How to Get Started with the Model
You can load and use this tokenizer directly in your Python code with the `transformers` library:
```python
from transformers import PreTrainedTokenizerFast

# Replace with your actual model ID
REPO_ID = "your-hf-username/your-unigram-repo-name"

tokenizer = PreTrainedTokenizerFast.from_pretrained(REPO_ID)

# Example usage
english_text = "This is a test of the Unigram tokenizer."
hindi_text = "यह यूनिग्राम टोकनाइज़र का एक परीक्षण है।"

encoded_en = tokenizer(english_text)
encoded_hi = tokenizer(hindi_text)

print("Encoded English:", encoded_en["input_ids"])
print("Decoded English:", tokenizer.decode(encoded_en["input_ids"]))
print("Encoded Hindi:", encoded_hi["input_ids"])
print("Decoded Hindi:", tokenizer.decode(encoded_hi["input_ids"]))
```