Update pipeline tag and add library name, results and installation instructions (#2)
- Update pipeline tag and add library name, results and installation instructions (cdced992acf416477c9e40b19f3b51ee742d245d)
Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>
README.md CHANGED
@@ -1,15 +1,17 @@
 ---
+datasets:
+- ILSVRC/imagenet-21k
 license: other
 license_name: nvclv1
 license_link: LICENSE
-
-
-pipeline_tag: image-feature-extraction
+pipeline_tag: image-classification
+library_name: transformers
 ---
 
-
 [**MambaVision: A Hybrid Mamba-Transformer Vision Backbone**](https://arxiv.org/abs/2407.08083).
 
+[Project page](https://github.com/NVlabs/MambaVision)
+
 ## Model Overview
 
 We have developed the first hybrid model for computer vision which leverages the strengths of Mamba and Transformers. Specifically, our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. In addition, we conducted a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results demonstrate that equipping the Mamba architecture with several self-attention blocks at the final layers greatly improves the modeling capacity to capture long-range spatial dependencies. Based on our findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria.
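The updated front matter above declares `library_name: transformers` together with an `image-classification` pipeline tag, which implies the checkpoint is meant to be loaded through the standard Auto classes with remote code enabled. A minimal loading sketch consistent with that metadata; the exact entry point is an assumption, since the card's own usage snippet is not shown in this hunk:

```Python
# Minimal loading sketch; AutoModelForImageClassification and
# trust_remote_code=True are assumptions based on the card metadata,
# not lines taken from this diff.
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    "nvidia/MambaVision-L3-512-21K",
    trust_remote_code=True,
)
model.eval()
```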
@@ -37,7 +39,6 @@ MambaVision-L3-512-21K is pretrained on ImageNet-21K dataset and finetuned on Im
 <td>512x512</td>
 </tr>
 
-
 </table>
 
 In addition, the MambaVision models demonstrate a strong performance by achieving a new SOTA Pareto-front in
@@ -48,12 +49,10 @@ terms of Top-1 accuracy and throughput.
 class="center">
 </p>
 
-
 ## Model Usage
 
 It is highly recommended to install the requirements for MambaVision by running the following:
 
-
 ```Bash
 pip install mambavision
 ```
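The card's own classification snippet is not reproduced in this diff, so the following is only a rough sketch of how inference might look after the install step above; the 512x512 resize, the ImageNet normalization constants, and the `logits` output key are assumptions rather than documented behavior:

```Python
# Hedged sketch of image classification with this checkpoint. The preprocessing
# (512x512 resize, ImageNet mean/std) and the "logits" output key are assumed;
# refer to the card's full snippet for the exact, supported pipeline.
import requests
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    "nvidia/MambaVision-L3-512-21K", trust_remote_code=True
).eval()

preprocess = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Any RGB image works; a COCO val image is used here, as in the card.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
inputs = preprocess(image).unsqueeze(0)  # shape: (1, 3, 512, 512)

with torch.no_grad():
    outputs = model(inputs)

logits = outputs["logits"]  # output key assumed
print("Predicted class index:", logits.argmax(-1).item())
```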
@@ -66,13 +65,11 @@ In the following example, we demonstrate how MambaVision can be used for image c
 
 Given the following image from [COCO dataset](https://cocodataset.org/#home) val set as an input:
 
-
 <p align="center">
 <img src="https://cdn-uploads.huggingface.co/production/uploads/64414b62603214724ebd2636/4duSnqLf4lrNiAHczSmAN.jpeg" width=70% height=70%
 class="center">
 </p>
 
-
 The following snippet can be used for image classification:
 
 ```Python
@@ -147,7 +144,176 @@ print("Size of extracted features in stage 1:", features[0].size()) # torch.Size
 print("Size of extracted features in stage 4:", features[3].size()) # torch.Size([1, 1568, 16, 16])
 ```
 
-
 ### License:
 
-[NVIDIA Source Code License-NC](https://huggingface.co/nvidia/MambaVision-L3-512-21K/blob/main/LICENSE)
+[NVIDIA Source Code License-NC](https://huggingface.co/nvidia/MambaVision-L3-512-21K/blob/main/LICENSE)
+
+## Results + Pretrained Models
+
+### ImageNet-21K
+
+<table>
+<tr>
+<th>Name</th>
+<th>Acc@1(%)</th>
+<th>Acc@5(%)</th>
+<th>#Params(M)</th>
+<th>FLOPs(G)</th>
+<th>Resolution</th>
+<th>HF</th>
+<th>Download</th>
+</tr>
+
+<tr>
+<td>MambaVision-B-21K</td>
+<td>84.9</td>
+<td>97.5</td>
+<td>97.7</td>
+<td>15.0</td>
+<td>224x224</td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-B-21K">link</a></td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-B-21K/resolve/main/mambavision_base_21k.pth.tar">model</a></td>
+</tr>
+
+<tr>
+<td>MambaVision-L-21K</td>
+<td>86.1</td>
+<td>97.9</td>
+<td>227.9</td>
+<td>34.9</td>
+<td>224x224</td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-L-21K">link</a></td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-L-21K/resolve/main/mambavision_large_21k.pth.tar">model</a></td>
+</tr>
+
+<tr>
+<td>MambaVision-L2-512-21K</td>
+<td>87.3</td>
+<td>98.4</td>
+<td>241.5</td>
+<td>196.3</td>
+<td>512x512</td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-L2-512-21K">link</a></td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-L2-512-21K/resolve/main/mambavision_L2_21k_240m_512.pth.tar">model</a></td>
+</tr>
+
+<tr>
+<td>MambaVision-L3-256-21K</td>
+<td>87.3</td>
+<td>98.3</td>
+<td>739.6</td>
+<td>122.3</td>
+<td>256x256</td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-L3-256-21K">link</a></td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-L3-256-21K/resolve/main/mambavision_L3_21k_740m_256.pth.tar">model</a></td>
+</tr>
+
+<tr>
+<td>MambaVision-L3-512-21K</td>
+<td>88.1</td>
+<td>98.6</td>
+<td>739.6</td>
+<td>489.1</td>
+<td>512x512</td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-L3-512-21K">link</a></td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-L3-512-21K/resolve/main/mambavision_L3_21k_740m_512.pth.tar">model</a></td>
+</tr>
+
+</table>
+
+### ImageNet-1K
+
+<table>
+<tr>
+<th>Name</th>
+<th>Acc@1(%)</th>
+<th>Acc@5(%)</th>
+<th>Throughput(Img/Sec)</th>
+<th>Resolution</th>
+<th>#Params(M)</th>
+<th>FLOPs(G)</th>
+<th>HF</th>
+<th>Download</th>
+</tr>
+
+<tr>
+<td>MambaVision-T</td>
+<td>82.3</td>
+<td>96.2</td>
+<td>6298</td>
+<td>224x224</td>
+<td>31.8</td>
+<td>4.4</td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-T-1K">link</a></td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-T-1K/resolve/main/mambavision_tiny_1k.pth.tar">model</a></td>
+</tr>
+
+<tr>
+<td>MambaVision-T2</td>
+<td>82.7</td>
+<td>96.3</td>
+<td>5990</td>
+<td>224x224</td>
+<td>35.1</td>
+<td>5.1</td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-T2-1K">link</a></td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-T2-1K/resolve/main/mambavision_tiny2_1k.pth.tar">model</a></td>
+</tr>
+
+<tr>
+<td>MambaVision-S</td>
+<td>83.3</td>
+<td>96.5</td>
+<td>4700</td>
+<td>224x224</td>
+<td>50.1</td>
+<td>7.5</td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-S-1K">link</a></td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-S-1K/resolve/main/mambavision_small_1k.pth.tar">model</a></td>
+</tr>
+
+<tr>
+<td>MambaVision-B</td>
+<td>84.2</td>
+<td>96.9</td>
+<td>3670</td>
+<td>224x224</td>
+<td>97.7</td>
+<td>15.0</td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-B-1K">link</a></td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-B-1K/resolve/main/mambavision_base_1k.pth.tar">model</a></td>
+</tr>
+
+<tr>
+<td>MambaVision-L</td>
+<td>85.0</td>
+<td>97.1</td>
+<td>2190</td>
+<td>224x224</td>
+<td>227.9</td>
+<td>34.9</td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-L-1K">link</a></td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-L-1K/resolve/main/mambavision_large_1k.pth.tar">model</a></td>
+</tr>
+
+<tr>
+<td>MambaVision-L2</td>
+<td>85.3</td>
+<td>97.2</td>
+<td>1021</td>
+<td>224x224</td>
+<td>241.5</td>
+<td>37.5</td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-L2-1K">link</a></td>
+<td><a href="https://huggingface.co/nvidia/MambaVision-L2-1K/resolve/main/mambavision_large2_1k.pth.tar">model</a></td>
+</tr>
+
+</table>
+
+## Installation
+
+We provide a [docker file](./Dockerfile). In addition, assuming that a recent [PyTorch](https://pytorch.org/get-started/locally/) package is installed, the dependencies can be installed by running:
+
+```bash
+pip install -r requirements.txt
+```
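Beyond classification, the existing card also demonstrates multi-scale feature extraction (the `features[0]` and `features[3]` prints visible in the last hunk's context). A sketch of that usage, with the tuple return signature and the `AutoModel` entry point assumed rather than copied from the card:

```Python
# Hedged feature-extraction sketch. The (avg_pool, per-stage features) return
# signature and the AutoModel entry point are assumptions; the stage-4 shape
# comment comes from the diff context above (16x16 spatial at 512x512 input).
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "nvidia/MambaVision-L3-512-21K", trust_remote_code=True
).eval()

inputs = torch.rand(1, 3, 512, 512)  # dummy batch at the model's native resolution

with torch.no_grad():
    out_avg_pool, features = model(inputs)  # return signature assumed

print("Pooled features:", out_avg_pool.size())
print("Stage 4 features:", features[3].size())  # torch.Size([1, 1568, 16, 16]) per the card
```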
|