---
license: bsd-3-clause
pipeline_tag: audio-classification
library_name: transformers
tags:
  - PyTorch
  - State-space
  - Mamba
---

# DASS: Distilled Audio State-space Models
This version is distilled from SSLAM (88M): DASS small (30M) achieves 50.1 mAP and DASS medium (49M) achieves 50.2 mAP on AudioSet-2M.

It was introduced in the paper [DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners](https://arxiv.org/pdf/2407.04082) and 
first released in [this repository](https://github.com/Saurabhbhati/DASS).


## How to Get Started with the Model
```python
import torch
import librosa
from transformers import AutoConfig, AutoModelForAudioClassification, AutoFeatureExtractor

config = AutoConfig.from_pretrained('saurabhati/DASS_small_AudioSet_50.1', trust_remote_code=True)
audio_model = AutoModelForAudioClassification.from_pretrained('saurabhati/DASS_small_AudioSet_50.1', trust_remote_code=True)
feature_extractor = AutoFeatureExtractor.from_pretrained('saurabhati/DASS_small_AudioSet_50.1', trust_remote_code=True)

waveform, sr = librosa.load("audio/eval/_/_/--4gqARaEJE_0.000.flac", sr=16000)
inputs = feature_extractor(waveform, sr, return_tensors='pt')

with torch.no_grad():
    logits = torch.sigmoid(audio_model(**inputs).logits)

predicted_class_ids = torch.where(logits[0] > 0.5)[0]
predicted_label = [audio_model.config.id2label[i.item()] for i in predicted_class_ids]
print(predicted_label)
# ['Animal', 'Domestic animals, pets', 'Dog']

```
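
If you prefer to look at the top-scoring classes instead of thresholding at 0.5, here is a minimal follow-up sketch that reuses `logits` and `audio_model` from the snippet above (the choice of `k=5` is arbitrary):

```python
# Top-k classes by sigmoid score (multi-label, so scores do not sum to 1)
top = torch.topk(logits[0], k=5)
for score, idx in zip(top.values, top.indices):
    print(f"{audio_model.config.id2label[idx.item()]}: {score.item():.3f}")
```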

## Model Details

<!-- Provide a quick summary of what the model is/does. -->
DASS: Distilled Audio State-space Models is an audio classification model fine-tuned on AudioSet-2M. 
DASS is the first state-space model to outperform transformer-based audio classifiers such as AST (Audio Spectrogram Transformer), HTS-AT, and Audio-MAE.
DASS achieves state-of-the-art performance on audio classification on AudioSet while significantly reducing model size: compared to AST, which contains approximately 87M parameters, DASS-small has about one-third as many (30M) and still outperforms it (AudioSet-2M mAP: 45.9 for AST vs. 47.2 for DASS small). 

It is available in two sizes: DASS small (30M) with 47.2 mAP and DASS medium (49M) with 47.6 mAP.
The DASSv2 models are distilled from an ensemble of AST and HTS-AT, which significantly boosts audio classification performance: DASS small (30M) reaches 48.6 mAP and DASS medium (49M) reaches 48.9 mAP.

DASS is also significantly more duration-robust than AST: it can be trained on short audio and evaluated on much longer audio without fine-tuning on the longer inputs. 
For example, with both models trained on 10-second clips, AST drops below 5 mAP on 50-second inputs (less than 12% of its 10-second performance), while DASS retains 45.5 mAP (96%) in the same setting. 
On a single A6000 GPU, DASS can take up to 2.5 hours of audio as input and still maintain 62% of its 10-second performance.


The DASS model is based on [VMamba: Visual State Space Model](https://arxiv.org/pdf/2401.10166) applied to audio. 
It is trained with a binary cross-entropy loss w.r.t. the ground-truth labels and a KL-divergence loss w.r.t. the teacher AST model. 
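
As a rough illustration of that objective, here is a minimal sketch of one common way to combine the two terms. The temperature `T`, the weight `alpha`, and the softmax-based KL term are illustrative assumptions rather than choices taken from the paper; the actual training code is in the DASS repository.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=1.0, alpha=0.5):
    """Hypothetical DASS-style objective: BCE against ground-truth labels
    plus a temperature-scaled KL term against the teacher's predictions."""
    # Multi-label classification loss against the ground-truth label vector
    bce = F.binary_cross_entropy_with_logits(student_logits, labels.float())
    # KL divergence between teacher and student class distributions
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - alpha) * bce + alpha * kl
```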


### Results

Below are the results for DASS models fine-tuned and evaluated on AudioSet-2M. In the Pretrain column, IN SL denotes ImageNet supervised pretraining, AS SSL denotes AudioSet self-supervised pretraining, and SSL denotes self-supervised pretraining.

|                                           | Params | Pretrain |  mAP |
|-------------------------------------------|:------:|:--------:|:----:|
| Transformer-based models                                 |        |          |      |
| [AST](https://arxiv.org/pdf/2104.01778)                 |   87M  |   IN SL  | 45.9 |
| [HTS-AT](https://arxiv.org/pdf/2202.00874)              |   31M  |   IN SL  | 47.1 |
| [PaSST](https://arxiv.org/pdf/2110.05069)               |        |   IN SL  | 47.1 |
| [Audio-MAE](https://arxiv.org/pdf/2207.06405)           |   86M  |    SSL   | 47.3 |
| [BEATS_iter3](https://arxiv.org/pdf/2212.09058)         |   90M  |  AS SSL  | 48.6 |
| [EAT](https://arxiv.org/pdf/2401.03497v1)               |   88M  |  AS SSL  | 48.6 |
| [SSLAM](https://openreview.net/pdf?id=odU59TxdiB)       |   88M  |  AS SSL  | 50.2 |
| Concurrent SSM models                                   |        |          |      |
| [AuM](https://arxiv.org/pdf/2406.03344)                 |   26M  |   IN SL  | 39.7 |
| [Audio Mamba](https://arxiv.org/pdf/2405.13636)         |   40M  |   IN SL  | 44.0 |
| DASS-Small                                              |   30M  |   IN SL  | 47.2 |
| DASS-Medium                                             |   49M  |   IN SL  | 47.6 |
| DASS-Small (teach: AST + HTS-AT)                        |   30M  |   IN SL  | 48.6 |
| DASS-Medium (teach: AST + HTS-AT)                       |   49M  |   IN SL  | 48.9 |
| DASS-Small (teach: SSLAM)                               |   30M  |   IN SL  | 50.1 |
| DASS-Medium (teach: SSLAM)                              |   49M  |   IN SL  | 50.2 |

## Citation

```bibtex
@article{bhati2024dass,
  title={DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners},
  author={Bhati, Saurabhchand and Gong, Yuan and Karlinsky, Leonid and Kuehne, Hilde and Feris, Rogerio and Glass, James},
  journal={arXiv preprint arXiv:2407.04082},
  year={2024}
}
```

## Acknowledgements 

This project is based on AST ([paper](https://arxiv.org/pdf/2104.01778), [code](https://github.com/YuanGongND/ast/tree/master)) and 
VMamba ([paper](https://arxiv.org/pdf/2401.10166), [code](https://github.com/MzeroMiko/VMamba/tree/main)); thanks to the authors for their excellent work.
Please make sure to check them out.