Educational Story Outcome Predictor

Version: 1.0 | Status: Stable | Release Date: September 16, 2025

GitHub repo with modelling code

Model Description

A fine-tuned DistilBERT model that predicts educational intervention outcomes from dual-sequence inputs: situation context + solution approach → success/failure prediction.

This model analyzes real educational scenarios and predicts the long-term effectiveness of interventions based on actual classroom outcomes, not theoretical or moral judgments about intervention approaches. The model learned from real teacher experiences to identify which intervention patterns tend to lead to successful vs. unsuccessful outcomes.

Model Details

  • Base Model: distilbert-base-uncased
  • Model Type: Text Classification (Binary)
  • Language: English
  • License: Apache 2.0
  • Parameters: ~67M
  • Dataset: MU-NLPC/Edustories-en (1,492 educational stories)
  • Input Format: Two text sequences (situation + solution)
  • Output: Binary classification with confidence scores

Performance

Metric      Score
----------  -------
Accuracy    74.18%
F1 Score    82.18%
Precision   71.80%
Recall      96.05%

Baseline Performance: 61.96% accuracy (most frequent class)
Improvement: +12.22 percentage points over baseline

Comparison with RoBERTa

DistilBERT slightly outperforms RoBERTa (73.91% accuracy, 80.08% F1) while training 2.4x faster, making it the more practical choice for production deployment.

Intended Use

Primary Applications

  • Educational Research: Analyze intervention effectiveness patterns
  • Decision Support: Inform evidence-based educational choices
  • Content Analysis: Automatically categorize educational narratives
  • Bias Detection: Identify patterns in educational expectations

Out-of-Scope Uses

  • High-stakes educational decisions without human oversight
  • Medical or clinical decision making
  • General text classification outside educational domain
  • Real-time assessment of individual students

How to Use

Quick Start

from transformers import pipeline

# Load the model
classifier = pipeline(
    "text-classification",
    model="polkas/educational-story-outcome-predictor"
)

# Example prediction (combine situation and solution)
situation = "Student struggling with reading comprehension in grade 3"
solution = "Teacher implements guided reading sessions with peer support"
combined_text = f"{situation} {solution}"

result = classifier(combined_text)
print(f"Prediction: {result[0]['label']} (confidence: {result[0]['score']:.2f})")

Real Examples Demo

This repository includes example_usage.py and real_examples.json with 10 real examples from the MU-NLPC/Edustories-en dataset:

# Run the example script to see predictions on real data
python example_usage.py

The examples cover both successful and failed interventions; the model predicts roughly 80% of these real cases correctly.
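
If you want to score the bundled examples yourself, a minimal sketch follows; the field names (situation, solution, label) are assumptions about the layout of real_examples.json, so adjust them to the actual file:

import json
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="polkas/educational-story-outcome-predictor",
)

# Field names below are assumptions about real_examples.json
with open("real_examples.json") as f:
    examples = json.load(f)

correct = 0
for ex in examples:
    pred = classifier(f"{ex['situation']} {ex['solution']}")[0]
    correct += int(pred["label"] == ex["label"])

print(f"Accuracy on bundled examples: {correct / len(examples):.0%}")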

Advanced Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "polkas/educational-story-outcome-predictor"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Prepare dual-sequence input
situation = "Your educational situation description..."
solution = "Your intervention solution description..."

# Tokenize
inputs = tokenizer(
    situation, solution,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=512
)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()
    confidence = torch.max(predictions, dim=-1)[0].item()

# Map to labels
labels = ['Failure', 'Success']
result = labels[predicted_class]
print(f"Prediction: {result} (confidence: {confidence:.3f})")

Training Details

Dataset

  • Source: MU-NLPC/Edustories-en
  • Total Examples: 1,471 (after cleaning)
  • Training Set: 882 examples (60%)
  • Validation Set: 221 examples (15%)
  • Test Set: 368 examples (25%)
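
The splits can be recreated along these lines with the datasets library; the seed and the assumption that the raw data arrives as a single train split are illustrative, and the cleaning step is not reproduced, so counts will differ slightly:

from datasets import load_dataset

# Load the source dataset from the Hub
dataset = load_dataset("MU-NLPC/Edustories-en")

# 60/15/25: carve off 25% for test, then 20% of the remaining 75%
# (i.e. 15% of the total) for validation. seed=42 is an illustrative
# choice, not the value used for the released model.
split = dataset["train"].train_test_split(test_size=0.25, seed=42)
train_val = split["train"].train_test_split(test_size=0.2, seed=42)

train_ds = train_val["train"]
val_ds = train_val["test"]
test_ds = split["test"]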

Training Configuration

  • Epochs: 4
  • Batch Size: 8
  • Learning Rate: 2e-5
  • Optimizer: AdamW with weight decay 0.01
  • Hardware: Apple Silicon (MPS acceleration)
  • Training Time: ~5 minutes
  • Framework: Transformers 4.30+, PyTorch 2.0+
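
A minimal Trainer sketch with these hyperparameters (the tokenized splits and the tokenizer are assumed to exist from the surrounding sections):

from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Hyperparameters mirror the configuration listed above;
# AdamW is the Trainer's default optimizer.
args = TrainingArguments(
    output_dir="outcome-predictor",
    num_train_epochs=4,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # tokenized splits from above
    eval_dataset=val_ds,
    tokenizer=tokenizer,     # enables dynamic padding
)
trainer.train()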

Data Preprocessing

  • Input: Combined situation (description + anamnesis) + solution text
  • Labels: Binary mapping from original multi-class annotations
  • Label Distribution: 62% Success, 38% Failure
  • Max Sequence Length: 512 tokens
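
A sketch of this preprocessing, reusing the tokenizer from above; the column names (description, anamnesis, solution, outcome) and the success-category set are assumptions about the dataset schema, not verified against it:

SUCCESS_CATEGORIES = {"improved", "resolved"}  # hypothetical category names

def preprocess(example):
    # Combine description and anamnesis into the situation sequence,
    # keep the solution as the second sequence, and binarize the label
    situation = f"{example['description']} {example['anamnesis']}".strip()
    encoded = tokenizer(
        situation,
        example["solution"],
        truncation=True,
        max_length=512,
    )
    encoded["label"] = int(example["outcome"] in SUCCESS_CATEGORIES)
    return encoded

train_ds = train_ds.map(preprocess)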

Evaluation

The model was evaluated on a stratified 25% holdout test set with the following results:

  • Test Set Size: 368 examples
  • Evaluation Metrics: Accuracy, F1, Precision, Recall
  • Baseline Comparison: Most frequent class predictor
  • Cross-validation: Not applied (single train/test split)
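
These metrics can be computed with a standard compute_metrics function (assuming scikit-learn is available), which can also be passed to the Trainer above via compute_metrics=...:

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    # Turn logits into hard predictions, then score against gold labels
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }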

Limitations and Bias

Known Limitations

  1. Language: English only, performance on other languages not evaluated
  2. Domain Specificity: Trained only on educational narratives
  3. Binary Classification: Only predicts Success/Failure (no nuanced outcomes)
  4. Sequence Length: Limited to 512 tokens (longer texts are truncated)
  5. Temporal Context: Static training data may not capture evolving practices

Potential Biases

  • Representation Bias: Training data may not represent all educational contexts
  • Annotation Bias: Human-labeled outcomes may reflect annotator perspectives
  • Historical Bias: May perpetuate existing inequities in educational systems
  • Cultural Bias: Model trained primarily on specific cultural/linguistic contexts

Recommendations

  • Use as research tool, not for high-stakes decisions
  • Validate predictions with domain experts
  • Monitor for discriminatory patterns across different groups
  • Consider cultural and contextual factors in deployment
  • Implement human oversight for sensitive applications

Environmental Impact

  • Training Emissions: Minimal (efficient Apple Silicon hardware)
  • Model Size: 67M parameters (~250MB)
  • Inference Efficiency: Optimized for deployment on consumer hardware
  • Energy Usage: Low inference energy requirements

Technical Requirements

For Inference

  • Python: 3.8+
  • PyTorch: 2.0+
  • Transformers: 4.30+
  • Memory: 4GB+ RAM recommended
  • Hardware: CPU sufficient, GPU optional

For Training

  • Memory: 8GB+ RAM recommended
  • Hardware: Apple Silicon (MPS) or CUDA GPU for efficient training
  • Time: ~5 minutes for full training

Citation

If you use this model in your research, please cite:

@misc{educational-story-outcome-predictor-2025,
  title={Educational Story Outcome Predictor: A DistilBERT Model for Educational Intervention Analysis},
  author={Maciej Nasinski},
  year={2025},
  url={https://huggingface.co/polkas/educational-story-outcome-predictor},
  note={Fine-tuned DistilBERT model for binary classification of educational intervention outcomes}
}

Dataset Citation

@misc{edustories-en-2024,
  title={MU-NLPC/Edustories-en},
  author={MU-NLPC},
  year={2024},
  url={https://huggingface.co/datasets/MU-NLPC/Edustories-en}
}

Model Card Authors

  • Model Development: Maciej Nasinski
  • Model Card: Maciej Nasinski
  • Contact: [Insert your contact information]

Acknowledgments

  • Base model: Hugging Face DistilBERT team
  • Dataset: MU-NLPC research group
  • Framework: Hugging Face Transformers library

This model card follows the guidelines from Mitchell et al. (2019) and Hugging Face Model Card Guidelines.
