CentralBank-BERT: A Domain-Adapted Masked Language Model for Central Bank Communication
CentralBank-BERT is a domain-adapted masked language model based on bert-base-uncased, pretrained on more than 66 million tokens across over 2 million sentences extracted from central bank speeches published by the Bank for International Settlements (1996–2024).
The model is optimized for masked-token prediction in the specialized domains of monetary policy, financial regulation, and macroeconomic communication, supporting deeper contextual understanding of central banking discourse and financial narratives.
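For example, the model can be queried through the transformers fill-mask pipeline. The snippet below is a minimal sketch; the bilalzafar/CentralBank-BERT repository ID is an assumption inferred from the naming of the downstream models listed later in this card.

```python
from transformers import pipeline

# Load the domain-adapted MLM as a fill-mask pipeline
# (repository ID assumed, not stated explicitly in this card).
fill_mask = pipeline("fill-mask", model="bilalzafar/CentralBank-BERT")

# Predict the masked token in a monetary-policy style sentence.
predictions = fill_mask("The central bank raised the policy [MASK] by 25 basis points.")
for p in predictions:
    print(f"{p['token_str']:>12}  score={p['score']:.3f}")
```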
Dataset Summary
- Source: BIS Central Bank Speeches (1996–2024)
- Total Speeches: 19,609
- MLM Sentences: 2,087,615 (~2.09M)
- Total Tokens: 66,359,113 (~66.36M)
- Avg. Tokens per Sentence: 31.79
Model & Training Details
Category | Details |
---|---|
Tokenizer | BertTokenizerFast (base: bert-base-uncased); vocab size: 30,522; max sequence length: 128 |
Model | BertForMaskedLM (initialized from bert-base-uncased); total params: 109,514,298 (~109.5M); trainable params: 109,514,298 |
Training Setup | Epochs: 1; batch size (per device): 16; gradient accumulation: 2; effective batch size: 32; MLM probability: 15% |
Hardware | NVIDIA Tesla P100 (Kaggle); mixed precision: fp16 |
Training Duration | ~8 hrs 18 min (start: 2025-07-19 17:17, end: 2025-07-20 01:35) |
Evaluation Results | Perplexity: bert-base-uncased 13.06 vs. CentralBank-BERT 4.66 |
The lower perplexity indicates a substantially better fit to domain-specific central bank language.
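For reference, the configuration summarized in the table above corresponds roughly to the following Hugging Face Trainer setup. This is a minimal sketch, not the exact training script; the placeholder sentences stand in for the full BIS corpus.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForMaskedLM,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Placeholder sentences; the actual run used ~2.09M BIS speech sentences.
texts = [
    "The central bank raised the policy rate by 25 basis points.",
    "Inflation expectations remain well anchored over the medium term.",
]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

# 15% dynamic masking, as listed in the table above.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="centralbank-bert",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,  # effective batch size 32
    fp16=True,                      # mixed precision (Tesla P100)
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset)
trainer.train()
```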
Notebook: Training, Evaluation & Results
The full training pipeline, including data preprocessing, tokenizer setup, model training, evaluation, and result visualizations, is documented in the notebook cb-bert-mlm.ipynb. The notebook includes the actual outputs from the training run, perplexity comparisons, manual masked-sentence evaluations, and top-k accuracy analysis, ensuring full transparency and reproducibility of the model development process.
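A perplexity comparison of the kind reported in the notebook can be sketched as follows. This is an illustrative approximation (masking 15% of tokens and exponentiating the mean loss); the held-out sentences and the bilalzafar/CentralBank-BERT repository ID are assumptions, and results vary with the random masking.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

def mlm_perplexity(model_name, sentences, mlm_probability=0.15, max_length=128):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name).eval()
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=mlm_probability)

    losses = []
    with torch.no_grad():
        for text in sentences:
            enc = tokenizer(text, truncation=True, max_length=max_length, return_tensors="pt")
            batch = collator([{k: v[0] for k, v in enc.items()}])  # apply dynamic masking
            loss = model(**batch).loss.item()
            if not math.isnan(loss):  # skip samples where no token was masked
                losses.append(loss)
    return math.exp(sum(losses) / len(losses))

# Hypothetical held-out sentences; replace with a BIS evaluation split.
sentences = ["The committee decided to keep the policy rate unchanged."]
print("bert-base-uncased:", mlm_perplexity("bert-base-uncased", sentences))
print("CentralBank-BERT:", mlm_perplexity("bilalzafar/CentralBank-BERT", sentences))
```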
Model Files
- model.safetensors: Trained model weights
- config.json: Model architecture and hyperparameters
- tokenizer.json: Serialized tokenizer
- vocab.txt: Vocabulary file
- tokenizer_config.json: Tokenizer configuration
- special_tokens_map.json: Special tokens mapping
- training_args.bin: Training arguments used during pretraining
This repository includes all files required to load, evaluate, or fine-tune the CentralBank-BERT model with Hugging Face's transformers library.
These components ensure full compatibility with the original training environment and support seamless deployment or transfer learning.
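For instance, the checkpoint can be loaded with the standard Auto classes (a minimal sketch, again assuming the bilalzafar/CentralBank-BERT repository ID):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "bilalzafar/CentralBank-BERT"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_name)     # reads tokenizer.json / vocab.txt
model = AutoModelForMaskedLM.from_pretrained(model_name)  # reads config.json + model.safetensors

# For transfer learning, the same checkpoint can initialize a classification head, e.g.:
# from transformers import AutoModelForSequenceClassification
# AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
```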
Downstream Models
In addition to the domain-adapted masked language model (CentralBank-BERT), a suite of fine-tuned downstream classifiers has been released to support CBDC-specific research and policy analysis. These models share the same encoder backbone and cover different classification and information-extraction tasks on central bank communication; a minimal usage sketch follows the table below.
Model | Purpose | Intended Use | Link |
---|---|---|---|
bilalzafar/CBDC-BERT | Binary classifier: CBDC vs. Non-CBDC. | Flagging CBDC-related discourse in large corpora. | CBDC-BERT |
bilalzafar/CBDC-Stance | 3-class stance model (Pro, Wait-and-See, Anti). | Research on policy stances and discourse monitoring. | CBDC-Stance |
bilalzafar/CBDC-Sentiment | 3-class sentiment model (Positive, Neutral, Negative). | Tone analysis in central bank communications. | CBDC-Sentiment |
bilalzafar/CBDC-Type | Classifies Retail, Wholesale, General CBDC mentions. | Distinguishing policy focus (retail vs wholesale). | CBDC-Type |
bilalzafar/CBDC-Discourse | 3-class discourse classifier (Feature, Process, Risk-Benefit). | Structured categorization of CBDC communications. | CBDC-Discourse |
bilalzafar/CentralBank-NER | Named Entity Recognition (NER) model for central banking discourse. | Identifying institutions, persons, and policy entities in speeches. | CentralBank-NER |
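As a usage sketch, any of these classifiers can be applied through the text-classification pipeline. The example below uses the CBDC-BERT repository ID from the table with an illustrative sentence; label names and thresholds follow whatever each model card defines.

```python
from transformers import pipeline

# Binary CBDC vs. Non-CBDC classifier (repository ID from the table above).
classifier = pipeline("text-classification", model="bilalzafar/CBDC-BERT")

sentence = "The central bank is piloting a retail digital currency for everyday payments."
print(classifier(sentence))
# -> [{'label': ..., 'score': ...}] with the predicted class and its confidence
```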
Repository and Replication Package
All training pipelines, preprocessing scripts, evaluation notebooks, and result outputs are available in the companion GitHub repository:
🔗 https://github.com/bilalezafar/CentralBank-BERT
The repository includes:
- End-to-end notebooks for CentralBank-BERT pretraining and all downstream classifiers (CBDC-BERT, Stance, Sentiment, Type, Discourse, NER).
- Preprocessed BIS speech dataset subsets (CBDC-related sentences, annotated splits).
- Reproducible code to generate figures, tables, and evaluation metrics reported in the manuscript.
- Deployment-ready scripts for applying models to new corpora.
This ensures full transparency, reproducibility, and extension of the CentralBank-BERT family of models.
Citation
If you use this model, please cite as:
Zafar, M. B. (2025). CentralBank-BERT: Machine Learning Evidence on Central Bank Digital Currency Discourse. SSRN. https://papers.ssrn.com/abstract=5404456
@article{zafar2025centralbankbert,
title={CentralBank-BERT: Machine Learning Evidence on Central Bank Digital Currency Discourse},
author={Zafar, Muhammad Bilal},
year={2025},
journal={SSRN Electronic Journal},
url={https://papers.ssrn.com/abstract=5404456}
}