Improve model card for MambaVision-L-21K (#1)

- Improve model card for MambaVision-L-21K (f35a964dc7dae696e1460bdd944b23ed9a46f917)

Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)

README.md +9 -6

README.md CHANGED

@@ -1,18 +1,21 @@
 ---
 license: other
 license_name: nvclv1
 license_link: LICENSE
-datasets:
-- ILSVRC/imagenet-21k
-pipeline_tag: image-feature-extraction
 ---
-[**MambaVision: A Hybrid Mamba-Transformer Vision Backbone**](https://arxiv.org/abs/2407.08083).
-## Model Overview
-We have developed the first hybrid model for computer vision which leverages the strengths of Mamba and Transformers. Specifically, our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. In addition, we conducted a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results demonstrate that equipping the Mamba architecture with several self-attention blocks at the final layers greatly improves the modeling capacity to capture long-range spatial dependencies. Based on our findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria.
 ## Model Performance

 ---
+datasets:
+- ILSVRC/imagenet-21k
 license: other
 license_name: nvclv1
 license_link: LICENSE
+pipeline_tag: image-classification
+library_name: transformers
 ---
+# MambaVision: A Hybrid Mamba-Transformer Vision Backbone
+[**MambaVision: A Hybrid Mamba-Transformer Vision Backbone**](https://arxiv.org/abs/2407.08083)
+## Model Description
+We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. In addition, we conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results demonstrate that equipping the Mamba architecture with several self-attention blocks at the final layers greatly improves the modeling capacity to capture long-range spatial dependencies. Based on our findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For image classification on the ImageNet-1K dataset, MambaVision model variants achieve new state-of-the-art (SOTA) performance in terms of Top-1 accuracy and image throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones. Code: https://github.com/NVlabs/MambaVision.
 ## Model Performance