---
language:
- en
license: mit
library_name: xgboost
pipeline_tag: text-classification
tags:
- xgboost
- multiclass
- cuisine
- region-classification
- kaggle
metrics:
- accuracy
- f1
model-index:
- name: CuisineClassifier
  results:
  - task:
      type: text-classification
      name: Cuisine (20 classes)
    dataset:
      name: What's Cooking? (Kaggle)
      type: whats-
      url: https://www.kaggle.com/datasets/kaggle/recipe-ingredients-dataset
      split: test
    metrics:
    - type: accuracy
      value: 0.77
    - type: f1
      value: 0.69
  - task:
      type: text-classification
      name: Region (5 classes)
    dataset:
      name: What's Cooking? (Kaggle) - aggregated to regions
      type: whats-cooking
      url: https://www.kaggle.com/datasets/kaggle/recipe-ingredients-dataset
      split: test
    metrics:
    - type: accuracy
      value: 0.89
---

# 🍽 Cuisine Classifier (XGBoost)

This model classifies dishes based on their ingredients and assigns each dish to either a **Cuisine (20 classes)** or a **Region (5 classes)**. It uses an **XGBoost classifier** trained on normalized ingredient data.

---

## 📊 Model Overview

- **Task**: Multiclass Classification (Cuisines & Regions)
- **Input**: List of ingredients (`["salt", "flour", "sugar", ...]`)
- **Output**: Cuisine class (e.g. `"italian"`) or Region (e.g. `"Central Europe"`)
- **Algorithm**: [XGBoost](https://xgboost.ai/)
- **Training Data**: Kaggle [*What's Cooking?*](https://www.kaggle.com/datasets/kaggle/recipe-ingredients-dataset) dataset, with ingredients normalized using the AllRecipes dataset
- **Train/Test Split**: 80 / 20, stratified
- **Cross Validation**: 5-fold CV with `random_state=42`

### 🌍 Region Mapping

| Region         | Cuisines                                                      |
|----------------|---------------------------------------------------------------|
| Central Europe | british, french, greek, irish, italian, russian, spanish      |
| North America  | cajun_creole, southern_us                                     |
| Asia           | chinese, filipino, indian, japanese, korean, thai, vietnamese |
| Middle East    | moroccan                                                      |
| Latin America  | mexican, jamaican, brazilian                                  |

---

## 🧪 Performance

### Model Comparison

| Metric                      | Stratified Baseline | Logistic Regression | XGBoost  |
|-----------------------------|---------------------|---------------------|----------|
| **Precision (20 cuisines)** | 0.05                | 0.65                | **0.75** |
| **Recall (20 cuisines)**    | 0.05                | **0.69**            | 0.66     |
| **Macro F1 (20 cuisines)**  | 0.05                | 0.67                | **0.69** |
| **Accuracy (20 cuisines)**  | 0.10                | 0.75                | **0.77** |
| **Accuracy (5 regions)**    | 0.27                | **0.89**            | **0.89** |

✅ **Conclusion:** XGBoost achieves the best precision, macro F1, and accuracy for the 20-class cuisine classification (Logistic Regression has slightly higher recall) and clearly outperforms the baseline. For the 5-region setting, Logistic Regression and XGBoost reach the same accuracy (0.89); however, XGBoost provides more consistent results across classes.

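
The region labels are an aggregation of the 20 cuisine labels, as listed in the Region Mapping table. The following is a minimal sketch of that aggregation as a plain lookup, assuming the mapping is exactly the table in this card. The names `REGIONS`, `REGION_BY_CUISINE`, `to_region`, and `region_accuracy` are hypothetical and are not part of the published model or its training code.

```python
# Hypothetical helper: mirrors the Region Mapping table in this card.
REGIONS = {
    "Central Europe": ["british", "french", "greek", "irish",
                       "italian", "russian", "spanish"],
    "North America": ["cajun_creole", "southern_us"],
    "Asia": ["chinese", "filipino", "indian", "japanese",
             "korean", "thai", "vietnamese"],
    "Middle East": ["moroccan"],
    "Latin America": ["mexican", "jamaican", "brazilian"],
}

# Invert the table into a cuisine -> region dictionary for direct lookup.
REGION_BY_CUISINE = {
    cuisine: region
    for region, cuisines in REGIONS.items()
    for cuisine in cuisines
}


def to_region(cuisine):
    """Return the region for a cuisine label; raises KeyError if unknown."""
    return REGION_BY_CUISINE[cuisine]


def region_accuracy(true_cuisines, predicted_cuisines):
    """Accuracy after collapsing both cuisine label lists to regions."""
    pairs = list(zip(true_cuisines, predicted_cuisines))
    hits = sum(to_region(t) == to_region(p) for t, p in pairs)
    return hits / len(pairs)
```

For a fixed set of predictions, collapsing labels this way can only keep or raise accuracy, because confusions within a region (for example `italian` vs `french`) no longer count as errors. The region classifier in this repository is a separate model, so this sketch only illustrates the label aggregation and is not how the 0.89 figure was computed.
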

---

### Per-Region Metrics (5 Classes)

| Region         | Precision (XGB) | Recall (XGB) | F1 (XGB) |
|----------------|-----------------|--------------|----------|
| Asia           | 0.94            | 0.92         | 0.93     |
| Central Europe | 0.85            | **0.93**     | 0.89     |
| Latin America  | 0.92            | 0.88         | 0.90     |
| Middle East    | **0.88**        | 0.74         | 0.81     |
| North America  | **0.87**        | 0.76         | 0.81     |

---

## 🚀 How to Use

```python
from huggingface_hub import hf_hub_download
import joblib


class CuisineClassifier:
    def __init__(self, classifier="region"):
        print("Initializing CuisineClassifier...")
        components = ["cuisine_pipeline", "label_encoder"]
        # Choose the repo subfolder; any value other than "cuisine"
        # falls back to the region model (the default).
        subfolder = (
            "cuisine_classifier" if classifier == "cuisine" else "region_classifier"
        )
        paths = {}

        print("Downloading files from Hugging Face Hub...")
        for name in components:
            print(f"Downloading {name}.joblib ...")
            try:
                paths[name] = hf_hub_download(
                    repo_id="NoahMeissner/CuisineClassifier",
                    filename=f"{subfolder}/{name}.joblib",
                )
                print(f"{name} downloaded.")
            except Exception as e:
                print(f"Failed to download {name}: {e}")
                raise

        print("Loading model components with joblib...")
        try:
            self.model = joblib.load(paths["cuisine_pipeline"])
            print("Model loaded.")
            self.label_encoder = joblib.load(paths["label_encoder"])
            print("Label encoder loaded.")
        except Exception as e:
            print(f"Failed to load components: {e}")
            raise
        print("All components loaded successfully.")

    def classify(self, text_input):
        # The pipeline expects one space-separated string per dish.
        data = " ".join(text_input)
        predicted_class = self.model.predict([data])
        # Map the encoded class index back to its label string.
        predicted_label = self.label_encoder.inverse_transform(predicted_class)
        return predicted_label