Image Quality Fusion Model

A multi-modal image quality assessment system that combines BRISQUE, Aesthetic Predictor, and CLIP features to predict human-like quality judgments on a 1-10 scale.

🎯 Model Description

This model fuses three complementary approaches to image quality assessment:

  • 🔧 BRISQUE (OpenCV): Technical quality assessment detecting blur, noise, compression artifacts, and distortions
  • 🎨 Aesthetic Predictor (LAION): Visual appeal assessment using CLIP ViT-B-32 features trained on human aesthetic ratings
  • 🧠 CLIP (OpenAI): Semantic understanding and high-level feature extraction for content awareness

The fusion network learns weights for combining these diverse quality signals, producing human-like quality judgments that correlate with subjective assessments (Pearson r = 0.52 on SPAQ; see Performance Metrics below).

🚀 Quick Start

Installation

pip install torch torchvision huggingface_hub opencv-python pillow open-clip-torch

Basic Usage

# Define a minimal loader class that matches the uploaded head (512 -> 256 -> 1)
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin

class IQFModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, in_dim=512, hidden=256, **kwargs):
        # Accept either in_dim/hidden or clip_embed_dim/hidden_dim from config.json
        in_dim = kwargs.pop("clip_embed_dim", in_dim)
        hidden = kwargs.pop("hidden_dim", hidden)
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, x):
        return self.mlp(x)

# Load weights from the Hub (defaults to model.safetensors)
model = IQFModel.from_pretrained("matthewyuan/image-quality-fusion", map_location="cpu")
model.eval()

# Smoke test on a dummy 512-d vector
with torch.no_grad():
    y = model(torch.randn(1, 512)).item()
print(f"score: {y}")

Advanced Usage

import torch
import torch.nn as nn
from PIL import Image
import open_clip
from huggingface_hub import PyTorchModelHubMixin

# Minimal loader class (same as above)
class IQFModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, in_dim=512, hidden=256, **kwargs):
        in_dim = kwargs.pop("clip_embed_dim", in_dim)
        hidden = kwargs.pop("hidden_dim", hidden)
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, x):
        return self.mlp(x)

# 1) Load CLIP ViT-B/32 image encoder (512-d output)
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
clip_model.eval()

# 2) Load the fusion head from the Hub
fusion = IQFModel.from_pretrained("matthewyuan/image-quality-fusion", map_location="cpu")
fusion.eval()

def image_to_clip_embedding(img: Image.Image) -> torch.Tensor:
    x = clip_preprocess(img).unsqueeze(0)  # [1, 3, H, W]
    with torch.no_grad():
        feat = clip_model.encode_image(x)   # [1, 512]
        feat = feat / feat.norm(dim=-1, keepdim=True)
    return feat

def predict_quality(image_path: str) -> float:
    img = Image.open(image_path).convert("RGB")
    emb = image_to_clip_embedding(img)      # [1, 512]
    with torch.no_grad():
        score = fusion(emb).item()          # scalar
    return float(score)

print("score:", predict_quality("test.jpg"))

📊 Performance Metrics

Evaluated on the SPAQ dataset (11,125 smartphone images with human quality ratings):

| Metric | Value | Description |
|--------|-------|-------------|
| Pearson Correlation | 0.520 | Correlation with human judgments |
| R² Score | 0.250 | Coefficient of determination |
| Mean Absolute Error | 1.41 | Average prediction error (1-10 scale) |
| Root Mean Square Error | 1.69 | RMS prediction error |
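
For reference, these metrics are straightforward to recompute on your own validation set; a minimal NumPy sketch, assuming `preds` and `human` are equal-length 1-D arrays of predicted and ground-truth ratings on the 1-10 scale:

import numpy as np

def evaluate(preds: np.ndarray, human: np.ndarray) -> dict:
    # Residuals between model predictions and human ratings
    err = preds - human
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((human - human.mean()) ** 2))
    return {
        "pearson": float(np.corrcoef(preds, human)[0, 1]),  # linear correlation
        "r2": 1.0 - ss_res / ss_tot,                        # coefficient of determination
        "mae": float(np.abs(err).mean()),                   # mean absolute error
        "rmse": float(np.sqrt((err ** 2).mean())),          # root mean square error
    }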

Comparison with Individual Components

| Method | Correlation | R² Score | MAE |
|--------|-------------|----------|-----|
| Fusion Model | 0.520 | 0.250 | 1.41 |
| BRISQUE Only | 0.31 | 0.12 | 2.1 |
| Aesthetic Only | 0.41 | 0.18 | 1.8 |
| CLIP Only | 0.28 | 0.09 | 2.3 |

The fusion approach significantly outperforms individual components.

πŸ—οΈ Model Architecture

Input Image (RGB)
    ├── OpenCV BRISQUE → Technical Quality Score (0-100, normalized)
    ├── LAION Aesthetic → Aesthetic Score (0-10, normalized)
    └── OpenAI CLIP-B32 → Semantic Features (512-dimensional)
                ↓
        Feature Fusion Network
        ┌─────────────────────────┐
        │ BRISQUE: 1D → 64 → 128  │
        │ Aesthetic: 1D → 64 → 128│
        │ CLIP: 512D → 256 → 128  │
        └─────────────────────────┘
                ↓ (concat)
        Deep Fusion Layers (384D → 256D → 128D → 1D)
        Dropout (0.3) + ReLU activations
                ↓
        Human-like Quality Score (1.0 - 10.0)
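
A minimal PyTorch sketch of this fusion architecture follows. Layer sizes come from the diagram above; the class name, residual wiring, and the mapping of the raw output onto the 1-10 range are assumptions, and note that the head published on the Hub exposes only the final 512 → 256 → 1 MLP shown in Quick Start, not the full three-branch network.

import torch
import torch.nn as nn

class FusionNetworkSketch(nn.Module):
    """Illustrative only: layer names and output scaling are assumptions."""

    def __init__(self, clip_dim: int = 512, dropout: float = 0.3):
        super().__init__()
        # Per-modality encoders, sizes taken from the diagram above
        self.brisque_enc = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 128), nn.ReLU())
        self.aesthetic_enc = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 128), nn.ReLU())
        self.clip_enc = nn.Sequential(nn.Linear(clip_dim, 256), nn.ReLU(), nn.Linear(256, 128), nn.ReLU())
        # Deep fusion layers: 384 -> 256 -> 128 -> 1 with dropout 0.3
        self.fusion = nn.Sequential(
            nn.Linear(3 * 128, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(128, 1),
        )

    def forward(self, brisque, aesthetic, clip_feat):
        # brisque, aesthetic: [B, 1] normalized scores; clip_feat: [B, 512]
        z = torch.cat(
            [self.brisque_enc(brisque), self.aesthetic_enc(aesthetic), self.clip_enc(clip_feat)],
            dim=-1,
        )
        raw = self.fusion(z)
        # Assumed squashing of the raw output onto the 1-10 human rating scale
        return 1.0 + 9.0 * torch.sigmoid(raw)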

Technical Details

  • Input Resolution: Any size (resized to 224×224 for CLIP)
  • Architecture: Feed-forward neural network with residual connections
  • Activation Functions: ReLU for hidden layers, Linear for output
  • Regularization: Dropout (0.3), Early stopping
  • Output Range: 1.0 - 10.0 (human rating scale)
  • Parameters: ~2.1M total parameters
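
The parameter count of whatever checkpoint you load can be checked directly; keep in mind that the head published on the Hub is only the final 512 → 256 → 1 MLP from Quick Start, so its count is far smaller than the ~2.1M quoted for the full fusion system.

# With `model` loaded as in the Quick Start above
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e6:.2f}M parameters")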

🔬 Training Details

Dataset

  • Name: SPAQ (Smartphone Photography Attribute and Quality)
  • Size: 11,125 high-resolution smartphone images
  • Annotations: Human quality ratings (1-10 scale, 5+ annotators per image)
  • Split: 80% train, 10% validation, 10% test (sketched after this list)
  • Domain: Consumer smartphone photography
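
A minimal sketch of the 80/10/10 split referenced above; the actual split indices, random seed, and any stratification used in training are not published, so this only illustrates the proportions.

import torch
from torch.utils.data import random_split

# Assumed: `dataset` is a torch Dataset of (features, human_score) pairs
n = len(dataset)
n_train, n_val = int(0.8 * n), int(0.1 * n)
splits = [n_train, n_val, n - n_train - n_val]  # 80% / 10% / 10%
train_set, val_set, test_set = random_split(
    dataset, splits, generator=torch.Generator().manual_seed(42)
)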

Training Configuration

  • Framework: PyTorch 2.0+ with MPS acceleration (M1 optimized)
  • Optimizer: AdamW (lr=1e-3, weight_decay=1e-4)
  • Batch Size: 128 (optimized for 32GB unified memory)
  • Epochs: 50 with early stopping (patience=10)
  • Loss Function: Mean Squared Error (MSE)
  • Learning Rate Schedule: ReduceLROnPlateau (factor=0.5, patience=5)
  • Hardware: M1 MacBook Pro (32GB RAM)
  • Training Time: ~1 hour (with feature caching)
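
The configuration above translates roughly into the loop below. This is a compressed sketch only (the maintained script is `src/image_quality_fusion/training/train_fusion.py`, see Development below); it assumes `model`, `train_loader`, and `val_loader` are already defined and that each batch yields (features, human_score) pairs.

import torch
import torch.nn as nn

device = "mps" if torch.backends.mps.is_available() else "cpu"
model = model.to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)

best_val, patience_left = float("inf"), 10  # early stopping (patience=10)
for epoch in range(50):
    model.train()
    for features, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(features.to(device)).squeeze(-1), targets.to(device))
        loss.backward()
        optimizer.step()

    # Validation pass drives the LR schedule and early stopping
    model.eval()
    with torch.no_grad():
        val_loss = sum(
            criterion(model(f.to(device)).squeeze(-1), t.to(device)).item()
            for f, t in val_loader
        ) / len(val_loader)
    scheduler.step(val_loss)

    if val_loss < best_val:
        best_val, patience_left = val_loss, 10
    else:
        patience_left -= 1
        if patience_left == 0:
            break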

Optimization Techniques

  • Mixed Precision Training: MPS autocast for M1 acceleration
  • Feature Caching: Pre-computed embeddings for 20-30x speedup (see the sketch after this list)
  • Data Loading: Optimized DataLoader (6-8 workers, memory pinning)
  • Memory Management: Garbage collection every 10 batches
  • Preprocessing Pipeline: Parallel BRISQUE computation
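
Feature caching is the largest single speedup listed above. A hedged sketch of pre-computing CLIP embeddings once and reusing them across runs; the cache file name and layout are assumptions, and `image_to_clip_embedding` is the helper from Advanced Usage.

import os
import torch
from PIL import Image

CACHE_PATH = "clip_embeddings.pt"  # hypothetical cache file

def build_or_load_embedding_cache(image_paths):
    if os.path.exists(CACHE_PATH):
        return torch.load(CACHE_PATH)            # reuse pre-computed features
    cache = {}
    for path in image_paths:
        img = Image.open(path).convert("RGB")
        cache[path] = image_to_clip_embedding(img).squeeze(0).cpu()
    torch.save(cache, CACHE_PATH)                # write once, reuse on later runs
    return cache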

📱 Use Cases

Professional Applications

  • Content Management: Automatic quality filtering for large image databases
  • Social Media: Real-time quality assessment for user uploads
  • E-commerce: Product image quality validation
  • Digital Asset Management: Automated quality scoring for photo libraries

Research Applications

  • Image Quality Research: Benchmark for perceptual quality metrics
  • Dataset Curation: Quality-based dataset filtering and ranking
  • Human Perception Studies: Computational model of aesthetic judgment
  • Multi-modal Learning: Example of successful feature fusion

Creative Applications

  • Photography Tools: Automated photo rating and selection
  • Mobile Apps: Real-time quality feedback during capture
  • Photo Editing: Quality-guided automatic enhancement
  • Portfolio Management: Intelligent photo organization

⚠️ Limitations and Biases

Model Limitations

  • Domain Specificity: Trained primarily on smartphone photography
  • Resolution Dependency: Performance may vary with very low/high resolution images
  • Cultural Bias: Aesthetic preferences may reflect training data demographics
  • Temporal Bias: Training data from specific time period may not reflect evolving preferences

Technical Limitations

  • BRISQUE Scope: May not capture all types of technical degradation
  • CLIP Bias: Inherits biases from CLIP's training data
  • Aesthetic Subjectivity: Individual preferences vary significantly
  • Computational Requirements: Requires GPU for optimal inference speed

Recommended Usage

  • Validation: Always validate on your specific domain before production use
  • Human Oversight: Use as a tool to assist, not replace, human judgment
  • Bias Mitigation: Consider diverse evaluation datasets
  • Performance Monitoring: Monitor performance on your specific use case

📚 Citation

If you use this model in your research, please cite:

@misc{image-quality-fusion-2024,
  title={Image Quality Fusion: Multi-Modal Assessment with BRISQUE, Aesthetic, and CLIP Features},
  author={Matthew Yuan},
  year={2024},
  howpublished={\url{https://huggingface.co/matthewyuan/image-quality-fusion}},
  note={Trained on SPAQ dataset, deployed via GitHub Actions CI/CD}
}

🔗 Related Work

Datasets

Models

πŸ› οΈ Development

Local Development

# Clone repository
git clone https://github.com/mattkyuan/image-quality-fusion.git
cd image-quality-fusion

# Install dependencies  
pip install -r requirements.txt

# Run training
python src/image_quality_fusion/training/train_fusion.py \
    --image_dir data/images \
    --annotations data/annotations.csv \
    --prepare_data \
    --epochs 50

CI/CD Pipeline

This model is automatically deployed via GitHub Actions:

  • Training Pipeline: Automated model training on code changes
  • Deployment Pipeline: Automatic HF Hub deployment on model updates
  • Testing Pipeline: Comprehensive model validation and testing

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • SPAQ Dataset: H4nwei et al. for the comprehensive smartphone photography dataset
  • LAION: For the aesthetic predictor model and training methodology
  • OpenAI: For CLIP model architecture and pre-trained weights
  • OpenCV: For BRISQUE implementation and computer vision tools
  • Hugging Face: For model hosting and deployment infrastructure
  • PyTorch Team: For the deep learning framework and MPS acceleration

📞 Contact


This model was trained and deployed using automated CI/CD pipelines for reproducible ML workflows.
