# BPE Tokenizer for English-Hindi Translation Research
This repository contains a Byte-Pair Encoding (BPE) tokenizer trained from scratch on a parallel English-Hindi corpus. It was developed as part of a research project to analyze the impact of different subword tokenization strategies (BPE vs. WordPiece) on the performance, training stability, and inference latency of a Neural Machine Translation (NMT) model.
> **Research Finding:** This BPE tokenizer was found to be computationally inefficient for the task, leading to training failures and poor translation quality (BLEU score: 1.99). It serves as a benchmark to demonstrate the superiority of more advanced tokenizers like WordPiece for this specific use case.
## Tokenizer Details
This is a `PreTrainedTokenizerFast` object, wrapping a tokenizer from the `tokenizers` library.
- **Developed by:** PABITRA07
- **Model type:** BPE (Byte-Pair Encoding) tokenizer
- **Language(s) (NLP):** English (en), Hindi (hi)
- **License:** apache-2.0
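For readers who want to reproduce a tokenizer like this one, the sketch below shows the general pattern of training a BPE model from scratch with the `tokenizers` library. The tiny two-sentence corpus, the vocabulary size, and the special-token names are illustrative placeholders, not the actual training configuration used in this project.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical tiny stand-in corpus; the real tokenizer was trained on a
# full parallel English-Hindi corpus (not included here).
corpus = [
    "This is a test of the BPE tokenizer.",
    "यह बीपीई टोकनाइज़र का एक परीक्षण है।",
]

# Build a BPE model and split on whitespace/punctuation before learning merges.
bpe = Tokenizer(BPE(unk_token="[UNK]"))
bpe.pre_tokenizer = Whitespace()

# vocab_size and special tokens here are example values, not the project's.
trainer = BpeTrainer(vocab_size=500,
                     special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
bpe.train_from_iterator(corpus, trainer)

enc = bpe.encode("This is a test.")
print(enc.tokens)  # learned subword pieces for the sentence
```

The trained `Tokenizer` object can then be wrapped in a `PreTrainedTokenizerFast` (via its `tokenizer_object` argument) and pushed to the Hub, which is how a repository like this one is typically produced.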
### Tokenizer Sources

## Uses
This tokenizer is intended to be used with a Transformer-based model for English-to-Hindi machine translation. Its primary use in our research was to serve as a baseline to highlight the performance gains achieved with a more efficient WordPiece tokenizer.
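One common way to quantify the kind of tokenizer efficiency gap discussed above is *fertility*: the average number of subword tokens produced per whitespace-delimited word. The helper below is a minimal, library-free sketch of that metric; the two stand-in tokenizer functions are illustrative only and do not represent this repository's BPE or the WordPiece baseline.

```python
def fertility(tokenize, sentences):
    """Average number of tokens produced per whitespace-delimited word."""
    total_tokens = sum(len(tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

# Hypothetical extremes for illustration: a word-level splitter (fertility 1.0)
# and a character-level splitter (much higher fertility). A real comparison
# would pass tokenizer.tokenize from each trained tokenizer instead.
word_level = lambda s: s.split()
char_level = lambda s: [c for c in s if not c.isspace()]

sents = ["This is a test of the BPE tokenizer."]
print(fertility(word_level, sents))  # 1.0
print(fertility(char_level, sents))  # 3.625
```

A higher fertility means longer input sequences for the NMT model, which directly inflates training time and inference latency.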
## How to Get Started with the Model
You can load and use this tokenizer directly in your Python code using the `transformers` library.
```python
from transformers import PreTrainedTokenizerFast

# Replace with your actual model ID
REPO_ID = "your-hf-username/your-bpe-repo-name"
tokenizer = PreTrainedTokenizerFast.from_pretrained(REPO_ID)

# Example usage
english_text = "This is a test of the BPE tokenizer."
hindi_text = "यह बीपीई टोकनाइज़र का एक परीक्षण है।"

encoded_en = tokenizer(english_text)
encoded_hi = tokenizer(hindi_text)

print("Encoded English:", encoded_en["input_ids"])
print("Decoded English:", tokenizer.decode(encoded_en["input_ids"]))
```
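For NMT training or batched inference you would normally encode several sentences at once with padding, so that shorter sequences are padded to the batch's longest sequence and masked out. The sketch below builds a small throwaway BPE tokenizer in memory purely so the example is self-contained and runnable; in practice you would use the tokenizer loaded with `from_pretrained` above.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Throwaway tokenizer trained on one sentence, for illustration only.
core = Tokenizer(BPE(unk_token="[UNK]"))
core.pre_tokenizer = Whitespace()
core.train_from_iterator(
    ["This is a test of the BPE tokenizer."],
    BpeTrainer(vocab_size=200, special_tokens=["[UNK]", "[PAD]"]),
)
tokenizer = PreTrainedTokenizerFast(tokenizer_object=core,
                                    unk_token="[UNK]", pad_token="[PAD]")

# Batch-encode with padding, as an NMT data loader would.
batch = tokenizer(
    ["This is a test.", "This is a longer test of the tokenizer."],
    padding=True,
)
print(batch["input_ids"])       # padded to equal length
print(batch["attention_mask"])  # 0 marks padding positions
```

The `attention_mask` is what lets the downstream Transformer ignore the `[PAD]` positions during training and inference.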