SoccerChat-qwen2-vl-7b ⚽📊

A Multimodal Vision-Language Model for Soccer Game Understanding

Paper · GitHub · Dataset · Web UI · Demo (Colab)


Model Details

Model Description

SoccerChat-qwen2-vl-7b is a LoRA-finetuned version of Qwen2-VL-7B-Instruct designed for soccer video understanding and dialogue.
It is trained on the SoccerChat dataset, introduced in the paper SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding.

The model integrates video frames, event annotations, and commentary text to support question answering, commentary generation, and event-based reasoning in soccer.

  • Developed by: SimulaMet (Simula Metropolitan Center for Digital Engineering, Norway)
  • Model type: Vision-Language Model (VLM) finetuned with PEFT/LoRA
  • Primary language: English (soccer-domain specific)
  • License: Apache 2.0
  • Base model: Qwen/Qwen2-VL-7B-Instruct

How to Get Started with the Model

Use the code below to get started with the model; it accepts combined video and text queries. The same example is available as an Open In Colab notebook.

import base64  # needed only for the commented-out local-file example below
import os

import torch
from swift.llm import InferRequest, PtEngine, RequestConfig
from transformers import BitsAndBytesConfig

# 4-bit quantization so the model fits on a free Colab T4;
# the paper reports results for the unquantized model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NF4 gives the best 4-bit accuracy
    bnb_4bit_use_double_quant=True,     # double quantization for better compression
    bnb_4bit_compute_dtype=torch.float16,
)

# Sample a fixed 24 frames per video and cap the pixels per frame.
os.environ["FPS_MIN_FRAMES"] = "24"
os.environ["FPS_MAX_FRAMES"] = "24"
os.environ["VIDEO_MAX_PIXELS"] = "100352"

# Load the base model and attach the SoccerChat LoRA adapter.
engine = PtEngine(
    model_id_or_path="Qwen/Qwen2-VL-7B-Instruct",
    adapters=["SimulaMet/SoccerChat-qwen2-vl-7b"],
    quantization_config=bnb_config,
    attn_impl="sdpa",
    max_batch_size=1,
    use_hf=True,
)
req_cfg = RequestConfig(
    max_tokens=512, temperature=0.3, top_k=20, top_p=0.7, repetition_penalty=1.05
)

infer_requests = [
    InferRequest(messages=[{
        "role": "user",
        "content": [
            {"type": "video", "video": "https://huggingface.co/datasets/SimulaMet/SoccerChat/resolve/main/videos/MultipleEvents/100037_Shotsontarget--Balloutofplay.mp4"},
            # For a local file, pass it as a base64 data URI instead:
            # {"type": "video", "video": "data:video/mp4;base64," + base64.b64encode(open("/localpath/video.mp4", "rb").read()).decode("utf-8")},
            {"type": "text", "text": "What is shown in the video?"},
        ],
    }])
]

resp = engine.infer(infer_requests, req_cfg)
print(resp[0].choices[0].message.content)
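
The same engine and request configuration can be reused for additional queries, for example event-centric questions or commentary-style prompts. The two prompts below are illustrative examples, not items from the dataset:

follow_ups = [
    InferRequest(messages=[{
        "role": "user",
        "content": [
            {"type": "video", "video": "https://huggingface.co/datasets/SimulaMet/SoccerChat/resolve/main/videos/MultipleEvents/100037_Shotsontarget--Balloutofplay.mp4"},
            {"type": "text", "text": "Which key events occur in this clip, and in what order?"},
        ],
    }]),
    InferRequest(messages=[{
        "role": "user",
        "content": [
            {"type": "video", "video": "https://huggingface.co/datasets/SimulaMet/SoccerChat/resolve/main/videos/MultipleEvents/100037_Shotsontarget--Balloutofplay.mp4"},
            {"type": "text", "text": "Generate a short piece of live commentary for this clip."},
        ],
    }]),
]
for r in engine.infer(follow_ups, req_cfg):
    print(r.choices[0].message.content)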

Sources

  • Paper: SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding (arXiv:2505.16630)
  • Dataset: SimulaMet/SoccerChat on the Hugging Face Hub


Uses

Direct Use

  • Answering questions about soccer matches based on video frames and commentary.
  • Explaining events such as goals, fouls, substitutions, and passes.
  • Generating contextual match commentary aligned with multimodal inputs.

Downstream Use

  • Sports analytics platforms for researchers and practitioners.
  • Interactive soccer assistants for fans, broadcasters, and educational tools.
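
For the interactive-assistant use case, the model can be wrapped in a simple web UI. The sketch below uses Gradio and reuses the engine, req_cfg, and InferRequest objects from the getting-started example above; the widget labels and the base64 data-URI handling are illustrative assumptions, and uploaded clips should be short enough to encode in memory:

import base64
import gradio as gr

def ask_soccerchat(video_path, question):
    # Encode the uploaded clip as a data URI so it can be passed to the engine.
    with open(video_path, "rb") as f:
        video_uri = "data:video/mp4;base64," + base64.b64encode(f.read()).decode("utf-8")
    request = InferRequest(messages=[{
        "role": "user",
        "content": [
            {"type": "video", "video": video_uri},
            {"type": "text", "text": question},
        ],
    }])
    resp = engine.infer([request], req_cfg)
    return resp[0].choices[0].message.content

demo = gr.Interface(
    fn=ask_soccerchat,
    inputs=[gr.Video(label="Soccer clip"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="SoccerChat answer"),
    title="SoccerChat demo",
)
demo.launch()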

Out-of-Scope Use

  • General-purpose reasoning beyond soccer.
  • Sensitive domains (medical, legal, safety-critical applications).
  • Gambling or betting predictions.

Bias, Risks, and Limitations

  • The model is trained on soccer-specific multimodal data, so it generalizes poorly outside this domain.
  • It may generate hallucinated commentary when the video frames are ambiguous.
  • It is currently optimized for English; other languages are not supported.

Training Details

Training Data

  • Dataset: SoccerChat
  • Contains synchronized video frames, event labels, and commentary text for soccer matches.
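
For reference, the full dataset repository (video clips plus annotation files) can be downloaded from the Hugging Face Hub. A minimal sketch using huggingface_hub; note that this fetches the entire repository, including all video files:

from huggingface_hub import snapshot_download

# Download (or reuse a cached copy of) the SoccerChat dataset repository.
local_dir = snapshot_download(repo_id="SimulaMet/SoccerChat", repo_type="dataset")
print(local_dir)  # local path containing the videos/ folder and annotation files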

Training Procedure

  • Method: LoRA finetuning with PEFT.
  • Base model: Qwen2-VL-7B-Instruct.
  • Precision: fp16 mixed.
  • Implementation: training scripts are available in the GitHub repository linked above.

(For full hyperparameters and training details, see the paper; an illustrative PEFT/LoRA sketch follows below.)
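
The snippet below is only an illustrative sketch of attaching a LoRA adapter to the Qwen2-VL base model with PEFT. The rank, alpha, dropout, and target-module names are assumptions for illustration, not the values used for SoccerChat:

from peft import LoraConfig, get_peft_model
from transformers import Qwen2VLForConditionalGeneration

# Requires enough memory to hold the 7B base model.
base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto"
)

# Illustrative LoRA configuration; see the paper and training scripts for the real values.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()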


Evaluation

Testing Data

  • Held-out splits from the SoccerChat dataset.

Metrics

  • Automatic metrics: BLEU, ROUGE, METEOR (for generated text); a generic computation sketch follows after this list.
  • Event-based metrics: accuracy/recall for detecting key match events.
  • Human evaluation: commentary fluency and correctness (as reported in the paper).
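
As a reference point for the automatic metrics, the Hugging Face evaluate library ships implementations of BLEU, ROUGE, and METEOR. The snippet below is a generic sketch, not the paper's evaluation harness; the prediction and reference strings are placeholders:

import evaluate

predictions = ["The striker shoots on target and the ball runs out of play."]
references = ["A shot on target is followed by the ball going out of play."]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

# Single-reference setup: one reference string per prediction.
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))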

Results

  • The paper reports improved performance over baseline models in multimodal soccer understanding tasks.
  • See the results tables in the paper for details.

Environmental Impact

  • Training used GPU-based compute; exact hardware and CO2 estimates are not reported in the paper.
  • Users replicating training are encouraged to estimate emissions with the MLCO2 Impact Calculator.

Citation

If you use this model, please cite:

@article{Gautam2025May,
    author = {Gautam, Sushant and Midoglu, Cise and Thambawita, Vajira and others},
    title = {{SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding}},
    journal = {ArXiv e-prints},
    year = {2025},
    month = may,
    eprint = {2505.16630},
    doi = {10.48550/arXiv.2505.16630}
}

Contact

For questions about the model, please refer to the paper authors at SimulaMet (Simula Metropolitan Center for Digital Engineering, Norway).
