Unigram Tokenizer for English-Hindi Translation Research

This repository contains a Unigram tokenizer trained from scratch on a parallel English-Hindi corpus. It was developed as part of a research project to analyze the impact of different subword tokenization strategies (BPE, WordPiece, and Unigram) on the performance, training stability, and inference latency of a Neural Machine Translation (NMT) model.

Research Finding: This Unigram tokenizer demonstrated high efficiency and stability, performing on par with WordPiece and proving to be a vastly superior alternative to BPE for this task. It enabled stable model training and is expected to yield a model with low inference latency and high-quality translations.

Tokenizer Details

This is a PreTrainedTokenizerFast object, wrapping a fast Rust-backed tokenizer from the Hugging Face tokenizers library.

  • Developed by: PABITRA07
  • Model type: Unigram Language Model Tokenizer
  • Language(s) (NLP): English (en), Hindi (hi)
  • License: apache-2.0

Uses

This tokenizer is intended to be used with a Transformer-based model for English-to-Hindi machine translation. It is designed to produce efficient, semantically meaningful subword units that lead to stable training and fast inference.
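For background on how a Unigram tokenizer chooses its subword units: given a vocabulary of pieces with learned log-probabilities, it segments text into the sequence of pieces whose summed log-probability is highest, typically via Viterbi decoding. A minimal self-contained sketch (the toy vocabulary and scores below are illustrative only, not taken from this tokenizer):

```python
import math

# Toy vocabulary with illustrative log-probabilities (NOT this tokenizer's
# actual vocabulary or scores -- purely for demonstration).
VOCAB = {
    "to": math.log(0.08),
    "ken": math.log(0.02),
    "token": math.log(0.05),
    "izer": math.log(0.03),
    "ize": math.log(0.01),
    "r": math.log(0.005),
}

def unigram_segment(text, vocab, max_piece_len=10):
    """Return the segmentation of `text` that maximizes the sum of log-probs."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i] = best score for text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)              # back[i] = start index of the last piece
    for i in range(1, n + 1):
        for j in range(max(0, i - max_piece_len), i):
            piece = text[j:i]
            if piece in vocab and best[j] + vocab[piece] > best[i]:
                best[i] = best[j] + vocab[piece]
                back[i] = j
    # Walk the backpointers to recover the winning segmentation.
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

print(unigram_segment("tokenizer", VOCAB))  # → ['token', 'izer']
```

Here "token" + "izer" beats "to" + "ken" + "izer" because fewer, higher-probability pieces score better; this preference for frequent, meaningful units is what the tokenizer above learns from the parallel corpus.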

How to Get Started with the Model

You can load and use this tokenizer directly in your Python code using the transformers library.

from transformers import PreTrainedTokenizerFast

# Replace with your actual model ID
REPO_ID = "your-hf-username/your-unigram-repo-name" 
tokenizer = PreTrainedTokenizerFast.from_pretrained(REPO_ID)

# Example Usage
english_text = "This is a test of the Unigram tokenizer."
hindi_text = "यह यूनिग्राम टोकनाइज़र का एक परीक्षण है।"

encoded_en = tokenizer(english_text)
encoded_hi = tokenizer(hindi_text)

print("Encoded English:", encoded_en["input_ids"])
print("Decoded English:", tokenizer.decode(encoded_en["input_ids"]))
print("Encoded Hindi:", encoded_hi["input_ids"])
print("Decoded Hindi:", tokenizer.decode(encoded_hi["input_ids"]))