CentralBank-BERT: A Domain-Adapted Masked Language Model for Central Bank Communication
CentralBank-BERT is a domain-adapted masked language model based on bert-base-uncased, pretrained on more than 66 million tokens across over 2 million sentences extracted from central bank speeches published by the Bank for International Settlements (1996–2024).
The model is optimized for masked-token prediction in the specialized domains of monetary policy, financial regulation, and macroeconomic communication, supporting deeper contextual understanding of central banking discourse and financial narratives.
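For example, the model can be queried through the transformers fill-mask pipeline. The snippet below is a minimal sketch; the bilalzafar/CentralBank-BERT repository ID is an assumption inferred from the naming of the downstream models listed later in this card.

```python
from transformers import pipeline

# Load the domain-adapted MLM as a fill-mask pipeline
# (repository ID assumed, not stated explicitly in this card).
fill_mask = pipeline("fill-mask", model="bilalzafar/CentralBank-BERT")

# Predict the masked token in a monetary-policy style sentence.
predictions = fill_mask("The central bank raised the policy [MASK] by 25 basis points.")
for p in predictions:
    print(f"{p['token_str']:>12}  score={p['score']:.3f}")
```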
Dataset Summary
- Source: BIS Central Bank Speeches (1996–2024)
- Total Speeches: 19,609
- MLM Sentences: 2,087,615 (~2.09M)
- Total Tokens: 66,359,113 (~66.36M)
- Avg. Tokens per Sentence: 31.79
Model & Training Details
Category | Details |
---|---|
Tokenizer | BertTokenizerFast (base: bert-base-uncased); vocab size: 30,522; max sequence length: 128 |
Model | BertForMaskedLM (initialized from bert-base-uncased); total params: 109,514,298 (~109.5M); trainable params: 109,514,298 |
Training Setup | Epochs: 1; batch size (per device): 16; gradient accumulation: 2; effective batch size: 32; MLM probability: 15% |
Hardware | NVIDIA Tesla P100 (Kaggle); mixed precision: fp16 |
Training Duration | ~8 hrs 18 min (start: 2025-07-19 17:17, end: 2025-07-20 01:35) |
Evaluation Results | Perplexity: bert-base-uncased 13.06 vs. CentralBank-BERT 4.66 |
The lower perplexity indicates a substantially better fit to domain-specific central bank language.
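For reference, the configuration summarized in the table above corresponds roughly to the following Hugging Face Trainer setup. This is a minimal sketch, not the exact training script; the placeholder sentences stand in for the full BIS corpus.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForMaskedLM,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Placeholder sentences; the actual run used ~2.09M BIS speech sentences.
texts = [
    "The central bank raised the policy rate by 25 basis points.",
    "Inflation expectations remain well anchored over the medium term.",
]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

# 15% dynamic masking, as listed in the table above.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="centralbank-bert",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,  # effective batch size 32
    fp16=True,                      # mixed precision (Tesla P100)
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset)
trainer.train()
```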
Notebook: Training, Evaluation & Results
The full training pipeline, including data preprocessing, tokenizer setup, model training, evaluation, and result visualizations, is documented in the notebook cb-bert-mlm.ipynb. The notebook includes the actual outputs from the training run, perplexity comparisons, manual masked-sentence evaluations, and top-k accuracy analysis, ensuring full transparency and reproducibility of the model development process.
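A perplexity comparison of the kind reported in the notebook can be sketched as follows. This is an illustrative approximation (masking 15% of tokens and exponentiating the mean loss); the held-out sentences and the bilalzafar/CentralBank-BERT repository ID are assumptions, and results vary with the random masking.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

def mlm_perplexity(model_name, sentences, mlm_probability=0.15, max_length=128):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name).eval()
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=mlm_probability)

    losses = []
    with torch.no_grad():
        for text in sentences:
            enc = tokenizer(text, truncation=True, max_length=max_length, return_tensors="pt")
            batch = collator([{k: v[0] for k, v in enc.items()}])  # apply dynamic masking
            loss = model(**batch).loss.item()
            if not math.isnan(loss):  # skip samples where no token was masked
                losses.append(loss)
    return math.exp(sum(losses) / len(losses))

# Hypothetical held-out sentences; replace with a BIS evaluation split.
sentences = ["The committee decided to keep the policy rate unchanged."]
print("bert-base-uncased:", mlm_perplexity("bert-base-uncased", sentences))
print("CentralBank-BERT:", mlm_perplexity("bilalzafar/CentralBank-BERT", sentences))
```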
Model Files
- model.safetensors: Trained model weights
- config.json: Model architecture and hyperparameters
- tokenizer.json: Serialized tokenizer
- vocab.txt: Vocabulary file
- tokenizer_config.json: Tokenizer configuration
- special_tokens_map.json: Special tokens mapping
- training_args.bin: Training arguments used during pretraining
This repository includes all files required to load, evaluate, or fine-tune the CentralBank-BERT model with Hugging Face's transformers library.
These components ensure full compatibility with the original training environment and support seamless deployment or transfer learning.
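For instance, the checkpoint can be loaded with the standard Auto classes (a minimal sketch, again assuming the bilalzafar/CentralBank-BERT repository ID):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "bilalzafar/CentralBank-BERT"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_name)     # reads tokenizer.json / vocab.txt
model = AutoModelForMaskedLM.from_pretrained(model_name)  # reads config.json + model.safetensors

# For transfer learning, the same checkpoint can initialize a classification head, e.g.:
# from transformers import AutoModelForSequenceClassification
# AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
```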
Downstream Models
In addition to the domain-adapted masked language model (CentralBank-BERT), a suite of fine-tuned downstream classifiers has been released to support CBDC-specific research and policy analysis. These models share the same encoder backbone and cover different classification and information-extraction tasks on central bank communication; a minimal usage sketch follows the table below.
Model | Purpose | Intended Use | Link |
---|---|---|---|
bilalzafar/CBDC-BERT | Binary classifier: CBDC vs. Non-CBDC. | Flagging CBDC-related discourse in large corpora. | CBDC-BERT |
bilalzafar/CBDC-Stance | 3-class stance model (Pro, Wait-and-See, Anti). | Research on policy stances and discourse monitoring. | CBDC-Stance |
bilalzafar/CBDC-Sentiment | 3-class sentiment model (Positive, Neutral, Negative). | Tone analysis in central bank communications. | CBDC-Sentiment |
bilalzafar/CBDC-Type | Classifies Retail, Wholesale, General CBDC mentions. | Distinguishing policy focus (retail vs wholesale). | CBDC-Type |
bilalzafar/CBDC-Discourse | 3-class discourse classifier (Feature, Process, Risk-Benefit). | Structured categorization of CBDC communications. | CBDC-Discourse |
bilalzafar/CentralBank-NER | Named Entity Recognition (NER) model for central banking discourse. | Identifying institutions, persons, and policy entities in speeches. | CentralBank-NER |
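As a usage sketch, any of these classifiers can be applied through the text-classification pipeline. The example below uses the CBDC-BERT repository ID from the table with an illustrative sentence; label names and thresholds follow whatever each model card defines.

```python
from transformers import pipeline

# Binary CBDC vs. Non-CBDC classifier (repository ID from the table above).
classifier = pipeline("text-classification", model="bilalzafar/CBDC-BERT")

sentence = "The central bank is piloting a retail digital currency for everyday payments."
print(classifier(sentence))
# -> [{'label': ..., 'score': ...}] with the predicted class and its confidence
```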
Repository and Replication Package
All training pipelines, preprocessing scripts, evaluation notebooks, and result outputs are available in the companion GitHub repository:
🔗 https://github.com/bilalezafar/CentralBank-BERT
The repository includes:
- End-to-end notebooks for CentralBank-BERT pretraining and all downstream classifiers (CBDC-BERT, Stance, Sentiment, Type, Discourse, NER).
- Preprocessed BIS speech dataset subsets (CBDC-related sentences, annotated splits).
- Reproducible code to generate figures, tables, and evaluation metrics reported in the manuscript.
- Deployment-ready scripts for applying models to new corpora.
This ensures full transparency, reproducibility, and extension of the CentralBank-BERT family of models.
Citation
If you use this model, please cite as:
Zafar, M. B. (2025). CentralBank-BERT: Machine Learning Evidence on Central Bank Digital Currency Discourse. SSRN. https://papers.ssrn.com/abstract=5404456
@article{zafar2025centralbankbert,
title={CentralBank-BERT: Machine Learning Evidence on Central Bank Digital Currency Discourse},
author={Zafar, Muhammad Bilal},
year={2025},
journal={SSRN Electronic Journal},
url={https://papers.ssrn.com/abstract=5404456}
}