# BPE Tokenizer for English-Hindi Translation Research
This repository contains a Byte-Pair Encoding (BPE) tokenizer trained from scratch on a parallel English-Hindi corpus. It was developed as part of a research project to analyze the impact of different subword tokenization strategies (BPE vs. WordPiece) on the performance, training stability, and inference latency of a Neural Machine Translation (NMT) model.
> **Research Finding:** This BPE tokenizer was found to be computationally inefficient for the task, leading to training failures and poor translation quality (BLEU score: 1.99). It serves as a benchmark to demonstrate the superiority of more advanced tokenizers like WordPiece for this specific use case.
## Tokenizer Details
This is a `PreTrainedTokenizerFast` object, wrapping a tokenizer from the `tokenizers` library.
- **Developed by:** PABITRA07
- **Model type:** BPE (Byte-Pair Encoding) tokenizer
- **Language(s) (NLP):** English (en), Hindi (hi)
- **License:** apache-2.0
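For readers who want to reproduce a tokenizer like this one, the sketch below shows the general pattern of training a BPE model from scratch with the `tokenizers` library. The tiny two-sentence corpus, the vocabulary size, and the special-token names are illustrative placeholders, not the actual training configuration used in this project.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical tiny stand-in corpus; the real tokenizer was trained on a
# full parallel English-Hindi corpus (not included here).
corpus = [
    "This is a test of the BPE tokenizer.",
    "यह बीपीई टोकनाइज़र का एक परीक्षण है।",
]

# Build a BPE model and split on whitespace/punctuation before learning merges.
bpe = Tokenizer(BPE(unk_token="[UNK]"))
bpe.pre_tokenizer = Whitespace()

# vocab_size and special tokens here are example values, not the project's.
trainer = BpeTrainer(vocab_size=500,
                     special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
bpe.train_from_iterator(corpus, trainer)

enc = bpe.encode("This is a test.")
print(enc.tokens)  # learned subword pieces for the sentence
```

The trained `Tokenizer` object can then be wrapped in a `PreTrainedTokenizerFast` (via its `tokenizer_object` argument) and pushed to the Hub, which is how a repository like this one is typically produced.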
### Tokenizer Sources

## Uses
This tokenizer is intended to be used with a Transformer-based model for English-to-Hindi machine translation. Its primary use in our research was to serve as a baseline to highlight the performance gains achieved with a more efficient WordPiece tokenizer.
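One common way to quantify the kind of tokenizer efficiency gap discussed above is *fertility*: the average number of subword tokens produced per whitespace-delimited word. The helper below is a minimal, library-free sketch of that metric; the two stand-in tokenizer functions are illustrative only and do not represent this repository's BPE or the WordPiece baseline.

```python
def fertility(tokenize, sentences):
    """Average number of tokens produced per whitespace-delimited word."""
    total_tokens = sum(len(tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

# Hypothetical extremes for illustration: a word-level splitter (fertility 1.0)
# and a character-level splitter (much higher fertility). A real comparison
# would pass tokenizer.tokenize from each trained tokenizer instead.
word_level = lambda s: s.split()
char_level = lambda s: [c for c in s if not c.isspace()]

sents = ["This is a test of the BPE tokenizer."]
print(fertility(word_level, sents))  # 1.0
print(fertility(char_level, sents))  # 3.625
```

A higher fertility means longer input sequences for the NMT model, which directly inflates training time and inference latency.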
## How to Get Started with the Model
You can load and use this tokenizer directly in your Python code using the `transformers` library.
```python
from transformers import PreTrainedTokenizerFast

# Replace with your actual model ID
REPO_ID = "your-hf-username/your-bpe-repo-name"
tokenizer = PreTrainedTokenizerFast.from_pretrained(REPO_ID)

# Example usage
english_text = "This is a test of the BPE tokenizer."
hindi_text = "यह बीपीई टोकनाइज़र का एक परीक्षण है।"

encoded_en = tokenizer(english_text)
encoded_hi = tokenizer(hindi_text)

print("Encoded English:", encoded_en["input_ids"])
print("Decoded English:", tokenizer.decode(encoded_en["input_ids"]))
```
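For NMT training or batched inference you would normally encode several sentences at once with padding, so that shorter sequences are padded to the batch's longest sequence and masked out. The sketch below builds a small throwaway BPE tokenizer in memory purely so the example is self-contained and runnable; in practice you would use the tokenizer loaded with `from_pretrained` above.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Throwaway tokenizer trained on one sentence, for illustration only.
core = Tokenizer(BPE(unk_token="[UNK]"))
core.pre_tokenizer = Whitespace()
core.train_from_iterator(
    ["This is a test of the BPE tokenizer."],
    BpeTrainer(vocab_size=200, special_tokens=["[UNK]", "[PAD]"]),
)
tokenizer = PreTrainedTokenizerFast(tokenizer_object=core,
                                    unk_token="[UNK]", pad_token="[PAD]")

# Batch-encode with padding, as an NMT data loader would.
batch = tokenizer(
    ["This is a test.", "This is a longer test of the tokenizer."],
    padding=True,
)
print(batch["input_ids"])       # padded to equal length
print(batch["attention_mask"])  # 0 marks padding positions
```

The `attention_mask` is what lets the downstream Transformer ignore the `[PAD]` positions during training and inference.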