footballbert-base
Model Description
FootballBERT is a transformer-based foundation model for football (soccer) player embeddings. Inspired by BERT in NLP, FootballBERT treats players as tokens and matches as sentences, learning contextualized player embeddings through a self-supervised Masked Player Prediction (MPP) task.
In an era where everyone claims to do AI, football analytics models still encode player identity using one-hot vectors — or worse, nothing at all. And player positions (GK, CM, FW, etc.) are often represented in overly simplistic ways — or as rigid x,y coordinates — which fail to capture the full semantic richness of roles in modern football. That means: no notion of collective context, tactical role, or interaction patterns between players.
This mirrors the old TF-IDF era in NLP: encodings that captured how frequent or rare a word was within a document, but said nothing about its meaning. Then came BERT, which changed everything by embedding words in context — and today, we have LLMs.
FootballBERT is a first major step in that same direction for football. It is the first publicly available model that leverages transformers to learn dense, foundational player embeddings that capture both player identity and tactical roles within the collective context of the game.
It is built as part of the ongoing RisingBALLER project.
Key Innovation
- 🏃 Players = Tokens: Each player is represented as a token in the model's vocabulary
- ⚽ Matches = Sentences: A match is encoded as a sequence of players in the lineups (both teams)
- 🎭 Context-Aware: Player embeddings are shaped by the specific context (teammates, opponents, roles, football season, home/away) of each match.
- 🎯 Foundation Model: Pre-trained representations ready for downstream tasks
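To make the "players = tokens, matches = sentences" idea concrete, here is a minimal sketch of how a lineup could be mapped to integer token indices. The `build_vocab` and `encode` helpers and the index assignments are hypothetical and only for illustration; the released tokenizer ships its own vocabulary and is the one to use in practice.

```python
# Hypothetical sketch: the real tokenizer has its own vocabulary and indices.
SPECIAL_TOKENS = ["[PAD]", "[MASK]"]

def build_vocab(matches):
    """Map every special token and every player id to an integer index."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for lineup in matches:
        for player_id in lineup:
            if player_id not in vocab:
                vocab[player_id] = len(vocab)
    return vocab

def encode(lineup, vocab):
    """Turn one match (a list of player ids) into a list of token indices."""
    return [vocab[player_id] for player_id in lineup]

# Truncated lineups, for illustration only.
matches = [
    ["ba078905", "ba038720", "ba060829"],
    ["ba078905", "ba081757", "ba041397"],
]
vocab = build_vocab(matches)
encoded = [encode(m, vocab) for m in matches]
print(encoded)
```

A player id that is absent from the vocabulary raises a `KeyError` in this sketch, which mirrors the fact that the player vocabulary is fixed at training time.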
Model Architecture
- Base Model: Transformer encoder ("Attention Is All You Need")
- Hidden Size: 256
- Layers: 6
- Attention Heads: 1
- Vocabulary Size: 99944 players
- Max Sequence Length: 32 (2 teams, 11 starters + max 5 subs per team)
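The maximum sequence length follows directly from the lineup size: 2 teams × (11 starters + 5 substitutes) = 32 tokens. The sketch below shows how a shorter match could be padded to that length with an attention mask. The `pad_match` helper is hypothetical, and the actual tokenizer may pad differently.

```python
MAX_PLAYERS_PER_TEAM = 11 + 5            # starters + substitutes
MAX_SEQ_LEN = 2 * MAX_PLAYERS_PER_TEAM   # two teams per match -> 32

def pad_match(player_ids, max_len=MAX_SEQ_LEN, pad_token="[PAD]"):
    """Right-pad a lineup to max_len and return it with an attention mask."""
    if len(player_ids) > max_len:
        raise ValueError(f"a match can hold at most {max_len} players")
    n_pad = max_len - len(player_ids)
    padded = list(player_ids) + [pad_token] * n_pad
    # 1 marks a real player, 0 marks padding.
    attention_mask = [1] * len(player_ids) + [0] * n_pad
    return padded, attention_mask

# 13 players per team, with placeholder ids.
lineup = [f"player_{i}" for i in range(26)]
padded, mask = pad_match(lineup)
print(len(padded), sum(mask))
```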
Pre-training
The model is pre-trained using Masked Player Prediction (MPP), analogous to BERT's masked language modeling:
- Random players in a match are masked
- The model learns to predict the masked players from context
- This teaches the model about player identity, roles, team dynamics, and even tactical preferences, as reflected in how coaches actually use players in their lineups.
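The masking step described above can be sketched as follows. This is an illustrative approximation, not the training code: it assumes every selected player is replaced by `[MASK]` and uses the 25% masking ratio reported under Training Details. The `mask_players` helper and its parameters are hypothetical.

```python
import random

def mask_players(player_ids, mask_ratio=0.25, mask_token="[MASK]", seed=None):
    """Replace a random subset of players with the mask token.

    Returns the masked lineup and a dict {position: original player id}
    that serves as the prediction targets.
    """
    rng = random.Random(seed)
    n_mask = min(len(player_ids), max(1, round(len(player_ids) * mask_ratio)))
    positions = sorted(rng.sample(range(len(player_ids)), n_mask))
    masked = list(player_ids)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, labels

# 24 placeholder ids standing in for a two-team lineup.
lineup = [f"player_{i}" for i in range(24)]
masked, labels = mask_players(lineup, seed=0)
print(masked.count("[MASK]"), sorted(labels))
```

The model would then be trained to recover `labels` from the unmasked players and their context.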
Usage
Installation
pip install transformers torch pandas
Quick Start - Get Player Contextual Embeddings
from transformers import AutoModel, AutoTokenizer
import pandas as pd
# Load model and tokenizer
model = AutoModel.from_pretrained("achadj/footballbert-base", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("achadj/footballbert-base", trust_remote_code=True)
# example of input data for a match
barcelona = {
'ba078905': {'Lionel Messi': 'RW'},
'ba038720': {'Neymar Jr da Silva': 'LW'}, # 1. [MASK]
'ba060829': {'Luis Suárez': 'FW'},
'ba076261': {'Andrés Iniesta': 'LM'},
'ba033440': {'Sergio Busquets': 'CM'},
'ba050584': {'Ivan Rakitić': 'RM'}, # 5. [MASK]
'ba074037': {'Dani Alves': 'RB'},
'ba025784': {'Jordi Alba': 'LB'},
'ba029625': {'Javier Mascherano': 'CB'},
'ba063701': {'Gerard Piqué': 'CB'}, # 9. [MASK]
'ba040864': {'Marc-André ter Stegen': 'GK'},
'ba010095': {'Frenkie de Jong': 'CM'},
'ba049342': {'Munir El Haddadi': 'FW'},
}
real_madrid = {
'ba081757': {'Cristiano Ronaldo': 'LW'}, # 0. [MASK]
'ba041397': {'Karim Benzema': 'FW'},
'ba060618': {'Gareth Bale': 'RW'},
'ba040003': {'Toni Kroos': 'LM'},
'ba035435': {'Luka Modrić': 'RM'}, # 4. [MASK]
'ba028381': {'Casemiro': 'CM'},
'ba058430': {'Raphaël Varane': 'CB'},
'ba003088': {'Sergio Ramos': 'CB'},
'ba026976': {'Dani Carvajal': 'RB'},
'ba035453': {'Marcelo': 'LB'},
'ba086869': {'Keylor Navas': 'GK'}, # 10. [MASK]
'ba058850': {'Isco': 'AM'},
'ba006753': {'Álvaro Morata': 'FW'}, # 12. [MASK]
}
player_ids_team_1 = list(barcelona.keys())
player_ids_team_2 = list(real_madrid.keys())
# positions
positions_team_1 = [list(barcelona[p].values())[0] for p in barcelona.keys()]
positions_team_2 = [list(real_madrid[p].values())[0] for p in real_madrid.keys()]
match_data = pd.DataFrame({
'player_id': player_ids_team_1 + player_ids_team_2,
'positions': positions_team_1 + positions_team_2,
'season_name': ['2016-2017'] * len(player_ids_team_1 + player_ids_team_2),
'team_name': ['Home Team'] * len(player_ids_team_1) + ['Away Team'] * len(player_ids_team_2)
})
# Tokenize and get embeddings
inputs = tokenizer(match_data, return_tensors="pt", padding=False)
outputs = model(**inputs)
# Extract contextualized player embeddings
player_embeddings = outputs.last_hidden_state
print(player_embeddings.shape) # Shape: (1, num_players, hidden_size)
MPP With Batch Processing For Multiple Matches
import pickle
import pandas as pd
from transformers import AutoModelForMaskedLM, AutoTokenizer
model = AutoModelForMaskedLM.from_pretrained("achadj/footballbert-base", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("achadj/footballbert-base", trust_remote_code=True)
# Download the id-to-name mapping from https://huggingface.co/achadj/footballbert-base/blob/main/baller_id2player_names_99316p.pkl
# (you might need to change the local path below)
with open("footballbert-hf/baller_id2player_names_99316p.pkl", "rb") as f:
    player_ids2name = pickle.load(f)
# You can also download the position_vocab file at https://huggingface.co/achadj/footballbert-base/blob/main/position_vocab.json
# example of input data for a match
barcelona = {
'ba078905': {'Lionel Messi': 'RW'},
'ba038720': {'Neymar Jr da Silva': 'LW'}, # 1. [MASK]
'ba060829': {'Luis Suárez': 'FW'},
'ba076261': {'Andrés Iniesta': 'LM'},
'ba033440': {'Sergio Busquets': 'CM'},
'ba050584': {'Ivan Rakitić': 'RM'}, # 5. [MASK]
'ba074037': {'Dani Alves': 'RB'},
'ba025784': {'Jordi Alba': 'LB'},
'ba029625': {'Javier Mascherano': 'CB'},
'ba063701': {'Gerard Piqué': 'CB'}, # 9. [MASK]
'ba040864': {'Marc-André ter Stegen': 'GK'},
'ba010095': {'Frenkie de Jong': 'CM'},
'ba049342': {'Munir El Haddadi': 'FW'},
}
real_madrid = {
'ba081757': {'Cristiano Ronaldo': 'LW'}, # 0. [MASK]
'ba041397': {'Karim Benzema': 'FW'},
'ba060618': {'Gareth Bale': 'RW'},
'ba040003': {'Toni Kroos': 'LM'},
'ba035435': {'Luka Modrić': 'RM'}, # 4. [MASK]
'ba028381': {'Casemiro': 'CM'},
'ba058430': {'Raphaël Varane': 'CB'},
'ba003088': {'Sergio Ramos': 'CB'},
'ba026976': {'Dani Carvajal': 'RB'},
'ba035453': {'Marcelo': 'LB'},
'ba086869': {'Keylor Navas': 'GK'}, # 10. [MASK]
'ba058850': {'Isco': 'AM'},
'ba006753': {'Álvaro Morata': 'FW'}, # 12. [MASK]
}
player_ids_team_1 = list(barcelona.keys())
player_ids_team_2 = list(real_madrid.keys())
masked_ids_team_1 = player_ids_team_1[:]
masked_ids_team_2 = player_ids_team_2[:]
# masking some players
masked_ids_team_1[1] = '[MASK]'
masked_ids_team_1[5] = '[MASK]'
masked_ids_team_1[9] = '[MASK]'
masked_ids_team_2[0] = '[MASK]'
masked_ids_team_2[4] = '[MASK]'
masked_ids_team_2[10] = '[MASK]'
masked_ids_team_2[12] = '[MASK]'
# positions
positions_team_1 = [list(barcelona[p].values())[0] for p in barcelona.keys()]
positions_team_2 = [list(real_madrid[p].values())[0] for p in real_madrid.keys()]
# Prepare the matches (the same match is used twice here to illustrate batching)
match_data_1 = pd.DataFrame({
'player_id': masked_ids_team_1 + masked_ids_team_2,
'positions': positions_team_1 + positions_team_2,
'season_name': ['2016-2017'] * len(masked_ids_team_1 + masked_ids_team_2),
'team_name': ['Home Team'] * len(masked_ids_team_1) + ['Away Team'] * len(masked_ids_team_2)
})
match_data_2 = pd.DataFrame({
'player_id': masked_ids_team_1 + masked_ids_team_2,
'positions': positions_team_1 + positions_team_2,
'season_name': ['2016-2017'] * len(masked_ids_team_1 + masked_ids_team_2),
'team_name': ['Home Team'] * len(masked_ids_team_1) + ['Away Team'] * len(masked_ids_team_2)
})
match_data = [match_data_1, match_data_2]
inputs = tokenizer.batch_encode_matches(match_data, return_tensors="pt", padding=True, max_length=32)
outputs = model(**inputs)
logits = outputs.logits
print(logits.shape) # (batch_size, max_length, vocab_size)
predictions = logits.argmax(dim=-1)
print(predictions.shape) # (batch_size, max_length)
# Get predictions for masked players
match_1_predicted_labels = predictions[0].tolist()
player_ids_teams = player_ids_team_1 + player_ids_team_2
print("Match 1 predictions:")
for i, token in enumerate(match_data_1['player_id']):
    if token == '[MASK]':
        predicted_player_token = tokenizer.convert_ids_to_tokens(match_1_predicted_labels[i])
        print(f"Predicted Player: {player_ids2name.get(predicted_player_token, 'Unknown')}, Real Player: {player_ids2name.get(player_ids_teams[i], 'Unknown')}")
"""
outputs:
Predicted Player: Arda Turan, Real Player: Neymar Jr da Silva
Predicted Player: Ivan Rakitić, Real Player: Ivan Rakitić
Predicted Player: Samuel Umtiti, Real Player: Gerard Piqué
Predicted Player: Marco Asensio, Real Player: Cristiano Ronaldo
Predicted Player: Luka Modrić, Real Player: Luka Modrić
Predicted Player: Keylor Navas, Real Player: Keylor Navas
Predicted Player: Álvaro Morata, Real Player: Álvaro Morata
"""
Downstream Applications
The learned embeddings can be used for various football analytics tasks:
- 🔍 Similar/Context Aware Player Retrieval: Find players with similar playing styles or similar playing context
- 🎯 Positional Analysis: Understand player roles beyond rigid positions by analyzing the position embeddings
- 📊 Performance Forecasting: Predict player/team/match level statistics via fine-tuning
- 🤝 Identity-Aware Models: Use dense player embeddings to build models, such as xG models, that natively account for individual player identities and styles.
For more details on downstream applications, see the RisingBALLER project.
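As an illustration of player retrieval, the sketch below ranks players by cosine similarity between embedding vectors. It uses toy 3-dimensional vectors and placeholder ids; in practice the vectors would be the 256-dimensional embeddings produced by the model. The `most_similar` helper is hypothetical and not part of the released code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_id, embeddings, top_k=3):
    """Return the top_k (player_id, similarity) pairs closest to query_id."""
    query = embeddings[query_id]
    scores = [
        (pid, cosine_similarity(query, vec))
        for pid, vec in embeddings.items()
        if pid != query_id
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]

# Toy vectors, for illustration only.
embeddings = {
    "player_a": [1.0, 0.0, 0.0],
    "player_b": [0.9, 0.1, 0.0],
    "player_c": [0.0, 1.0, 0.0],
}
result = most_similar("player_a", embeddings, top_k=2)
print(result)
```

Because the embeddings are contextual, similarity computed this way reflects shared playing context (teammates, opponents, roles) rather than statistical output.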
Input Format
The tokenizer expects a pandas DataFrame (or a list of pandas DataFrames) with the following columns:
- player_id (str): Unique identifier for each player. Check the dict player_ids2names.json in the model files to find the ids of the players you are interested in and prepare your input data.
- positions (str): Player position (e.g., 'FW', 'MF', 'DF', 'GK'). Check the position_vocab.json file to prepare your input data.
- season_name (str): Season identifier (e.g., '2023-2024'). Check the season_vocab.json file to prepare your input data.
- team_name (str): Team name (each match must contain exactly 2 unique teams).
Special input tokens (str): masking token "[MASK]", padding token "[PAD]".
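Before building the DataFrame, it can help to check the column data against the requirements above. The following is a minimal sketch operating on a plain dict of columns (the same structure passed to `pd.DataFrame`). The `validate_match_columns` helper is hypothetical, and the tokenizer may or may not perform similar checks itself.

```python
REQUIRED_COLUMNS = ("player_id", "positions", "season_name", "team_name")

def validate_match_columns(columns, max_len=32):
    """Check a dict of column lists for one match and return its row count."""
    missing = [c for c in REQUIRED_COLUMNS if c not in columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    lengths = {len(columns[c]) for c in REQUIRED_COLUMNS}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same length")
    n_rows = lengths.pop()
    if n_rows > max_len:
        raise ValueError(f"a match can hold at most {max_len} players")
    n_teams = len(set(columns["team_name"]))
    if n_teams != 2:
        raise ValueError(f"expected exactly 2 unique teams, found {n_teams}")
    return n_rows

# Truncated match with two players per team, for illustration only.
columns = {
    "player_id": ["ba078905", "[MASK]", "ba081757", "ba041397"],
    "positions": ["RW", "LW", "LW", "FW"],
    "season_name": ["2016-2017"] * 4,
    "team_name": ["Home Team"] * 2 + ["Away Team"] * 2,
}
print(validate_match_columns(columns))
```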
Model Outputs
Embeddings model
- last_hidden_state: Contextualized player embeddings (batch_size, sequence_length, hidden_size)
- hidden_states: All layer hidden states
- attentions: All layer attention weights
MPP model
- logits: (batch_size, sequence_length, vocab_size)
- hidden_states: All layer hidden states
- attentions: All layer attention weights
Training Details
- Dataset: Football match lineup data scraped from across the web
- Pre-training Task: Masked Player Prediction (MPP)
- Masking Ratio: 25% of players per match
- Total training samples (unique match lineups): 170K
- Total football seasons: 11, from 2014-2015 to 2024-2025
- Total Competitions: 78 worldwide, including European leagues, cups and UEFA competitions, African leagues, South and North American leagues, Asian leagues and the Australian league.
- Validation Accuracy (Top-1): 57%
- Validation Accuracy (Top-3): 85%
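For reference, top-k accuracy counts a prediction as correct when the true player is among the k highest-scoring candidates. The sketch below assumes the metric is computed over masked positions only, which the card does not state explicitly. The `top_k_accuracy` helper and the toy scores are illustrative.

```python
def top_k_accuracy(scores_per_position, true_indices, k):
    """Fraction of positions whose true index is among the k best scores."""
    hits = 0
    for scores, true_idx in zip(scores_per_position, true_indices):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        hits += true_idx in ranked[:k]
    return hits / len(true_indices)

# Toy scores over a 4-player vocabulary for 4 masked positions.
scores = [
    [0.10, 0.70, 0.20, 0.00],
    [0.50, 0.05, 0.30, 0.15],
    [0.40, 0.30, 0.20, 0.10],
    [0.05, 0.15, 0.20, 0.60],
]
true_indices = [1, 2, 3, 3]
top1 = top_k_accuracy(scores, true_indices, k=1)
top3 = top_k_accuracy(scores, true_indices, k=3)
print(top1, top3)
```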
Citation
If you use FootballBERT in your research, please cite:
@article{risingballer2024,
title={RisingBALLER: A player is a token, a match is a sentence, A path towards a foundational model for football players data analytics},
author={Adjileye, Akedjou Achraff},
journal={arXiv preprint arXiv:2410.00943},
year={2024}
}
Paper
📄 RisingBALLER: A path towards a foundational model for football players data analytics
License
Apache 2.0
Contact
For questions or collaboration opportunities, feel free to reach out or open an issue in the model repository.
Limitations
- Player vocabulary is fixed to the training set.
- Requires structured match data with consistent player IDs.
- While the model demonstrates some ability to infer tactical roles and patterns through Masked Player Prediction (MPP), its embeddings primarily capture a player’s collective identity — shaped by the frequency of matchups with teammates and opponents, as well as by how coaches utilize the player across different positions. Therefore, these embeddings should not be used as a statistical comparison tool in the way player vectors often are in football data-driven analyses.
Future Work
- Integration of additional data modalities beyond lineups (e.g., heatmaps, passing networks, per-90 metrics, etc.) to further enrich player embeddings.
- Fine-tuning for specific downstream tasks.
Built with ❤️ for the football analytics community