{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "yS2xVGrLrphl" }, "outputs": [], "source": [ "\"\"\"\n", "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", "\n", "Instructions for setting up Colab are as follows:\n", "1. Open a new Python 3 notebook.\n", "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", "4. Run this cell to set up dependencies.\n", "\"\"\"\n", "# If you're using Google Colab and not running locally, run this cell.\n", "\n", "## Install dependencies\n", "!apt-get install sox libsndfile1 ffmpeg\n", "!pip install wget\n", "!pip install text-unidecode\n", "\n", "# ## Install NeMo\n", "BRANCH = 'r1.17.0'\n", "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "## Grab the config we'll use in this example\n", "!mkdir configs" ] }, { "cell_type": "markdown", "metadata": { "id": "ivKMObsjy9Om" }, "source": [ "# Adapters Support in NeMo Models\n", "\n", "In NeMo, we often train models and fine-tune them for a specific task. This is a reasonable approach when the models are just a few million parameters. However, this approach quickly becomes infeasible when approaching hundreds of millions or even billions of parameters. \n", "\n", "As a potential solution to such a scenario, where fine-tuning a massive model is no longer feasible, we look to specialized [Adapters](https://arxiv.org/abs/1902.00751) to specialize our model on a specific domain or task. Adapters require a fraction of the total number of parameters as the original model and are much more efficient to fine-tune.\n", "\n", "In this tutorial, we will discuss how to update any torch.nn.Module to support Adapters, and going further, how to enable NeMo models with Adapter support for their components.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "lZ7mPouNEXjB" }, "source": [ "## What are Adapters?\n", "\n", "Adapters are a straightforward concept - one formulation can be shown by the diagram below. At their simplest, they are residual Feedforward layers that compress the input dimension ($D$) to a small bottleneck dimension ($H$), such that $R^D \\text{->} R^H$, compute an activation (such as ReLU), finally mapping $R^H \\text{->} R^D$ with another Feedforward layer. This output is then added to the input via a simple residual connection.\n", "\n", "