Commit 6015b31 · Initial commit (0 parents)

Changed files:
- .gitattributes +35 -0
- README.md +117 -0
- added_tokens.json +3 -0
- config.json +191 -0
- configuration_hymba.py +116 -0
- generation_config.json +8 -0
- images/macro_arch.png +0 -0
- images/module.png +0 -0
- images/performance1.png +0 -0
- images/performance2.png +0 -0
- modeling_hymba.py +0 -0
- setup.sh +44 -0
- special_tokens_map.json +30 -0
- tokenizer.json +0 -0
- tokenizer.model +3 -0
- tokenizer_config.json +52 -0
.gitattributes
ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,117 @@
---
license: other
license_name: nvidia-open-model-license
license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
---

# Hymba-1.5B-Base

## Model Overview

Hymba-1.5B-Base is a base text-to-text model that can be adopted for a variety of natural language generation tasks.

The model has a hybrid architecture with Mamba and attention heads running in parallel. Meta tokens, a set of learnable tokens prepended to every prompt, help improve the efficacy of the model. The model shares KV cache between two layers and between heads within a single layer. 90% of attention layers use sliding window attention.

This model is ready for commercial use.

**[Model Weights Coming Soon]**

**[Caution] During generation, the batch size needs to be 1. Our current implementation does not fully support padding of Meta tokens + SWA; this is a work in progress. Training and pre-filling support any batch size.**

**Model Developer:** NVIDIA

**Model Dates:** Hymba-1.5B-Base was trained between September 1, 2024 and November 10, 2024.

**License:**
This model is released under the [NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).

## Model Architecture

Hymba-1.5B-Base has a model embedding size of 1600, 25 attention heads, an MLP intermediate dimension of 5504, 32 layers in total, and 16 SSM states. It has 3 full attention layers; the rest use sliding window attention. Unlike a standard Transformer, each attention layer in Hymba combines standard attention heads and Mamba heads in parallel. Additionally, it uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).

Features of this architecture:

- Fuse attention heads and SSM heads within the same layer, offering parallel and complementary processing of the same inputs.

<div align="center">
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/images/module.png" alt="Hymba Module" width="600">
</div>

- Introduce meta tokens that are prepended to the input sequences and interact with all subsequent tokens, thus storing important information and alleviating the burden of "forced-to-attend" in attention.

- Integrate cross-layer KV sharing and global-local attention to further boost memory and computation efficiency.

<div align="center">
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/images/macro_arch.png" alt="Hymba Model" width="600">
</div>
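The architecture choices above are exposed directly in this repo's `config.json` (see `global_attn_idx`, `sliding_window`, `kv_reuse_group`, and `num_memory_tokens`). A minimal sketch for inspecting them, assuming the environment from the usage section below:

```
# Sketch: inspect the architecture settings described above via the model config
from transformers import AutoConfig

config = AutoConfig.from_pretrained("nvidia/Hymba-1.5B-Base", trust_remote_code=True)
print(config.num_hidden_layers)   # 32 layers in total
print(config.global_attn_idx)     # [0, 15, 31] -> 3 full (global) attention layers
print(config.sliding_window)      # 1024-token window for the remaining attention layers
print(config.kv_reuse_group)      # groups of adjacent layers that share one KV cache
print(config.num_memory_tokens)   # 128 meta (memory) tokens prepended to the prompt
```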
## Performance Highlights
- Hymba-1.5B-Base outperforms all sub-2B public models.

<div align="center">
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/images/performance1.png" alt="Compare with SoTA Small LMs" width="800">
</div>

<div align="center">
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/images/performance2.png" alt="Compare with SoTA Small LMs" width="800">
</div>

## Model Usage

### Step 1: Environment Setup

Since Hymba-1.5B-Base employs [FlexAttention](https://pytorch.org/blog/flexattention/), which relies on PyTorch 2.5 and other related dependencies, please use the provided `setup.sh` (supports CUDA 12.1/12.4) to install the required packages:

```
wget --header="Authorization: Bearer YOUR_HF_TOKEN" https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/setup.sh
bash setup.sh
```

### Step 2: Chat with Hymba-1.5B-Base
After setting up the environment, you can use the following script to chat with our model:

```
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the tokenizer and model
repo_name = "nvidia/Hymba-1.5B-Base"

tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

# Chat with Hymba
prompt = input()
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
outputs = model.generate(**inputs, max_length=64, do_sample=True, temperature=0.7, use_cache=True)
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)

print(f"Model response: {response}")
```
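Per the caution in the Model Overview, generation currently requires batch size 1, so multiple prompts should be processed one at a time. A minimal sketch reusing the `tokenizer` and `model` objects from the script above:

```
# Sketch: sequential generation over several prompts
# (generation requires batch size 1; training and pre-filling support any batch size)
prompts = ["The capital of France is", "Hybrid attention-SSM models are"]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False, use_cache=True)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```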
## Limitations
The model was trained on data that contains toxic language, unsafe content, and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts. The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, and it may produce socially unacceptable or undesirable text even if the prompt itself does not include anything explicitly offensive.

## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Citation
```
@article{hymba2024,
  title={A Hybrid-head Architecture for Small Language Models},
  author={Xin Dong and Yonggan Fu and Shizhe Diao and Wonmin Byeon and Zijia Chen and Ameya Sunil Mahabaleshwarkar and Shih-Yang Liu and Matthijs Van Keirsbilck and Min-Hung Chen and Yoshi Suhara and Yingyan Celine Lin and Jan Kautz and Pavlo Molchanov},
  journal={arXiv preprint arXiv:xxxx},
  year={2024},
  url={https://arxiv.org/abs/xxxx},
}
```
added_tokens.json
ADDED
@@ -0,0 +1,3 @@
{
  "[PAD]": 32000
}
config.json
ADDED
@@ -0,0 +1,191 @@
{
  "architectures": ["HymbaForCausalLM"],
  "attention_dropout": 0.0,
  "attn_hidden_size": -1,
  "attn_implementation": "flex",
  "attn_implementation_new": "flex",
  "auto_map": {
    "AutoConfig": "configuration_hymba.HymbaConfig",
    "AutoModelForCausalLM": "modeling_hymba.HymbaForCausalLM"
  },
  "bos_token_id": 1,
  "calc_logits_for_entire_prompt": false,
  "conv_dim": {
    "0": 3200, "1": 3200, "2": 3200, "3": 3200, "4": 3200, "5": 3200, "6": 3200, "7": 3200,
    "8": 3200, "9": 3200, "10": 3200, "11": 3200, "12": 3200, "13": 3200, "14": 3200, "15": 3200,
    "16": 3200, "17": 3200, "18": 3200, "19": 3200, "20": 3200, "21": 3200, "22": 3200, "23": 3200,
    "24": 3200, "25": 3200, "26": 3200, "27": 3200, "28": 3200, "29": 3200, "30": 3200, "31": 3200
  },
  "eos_token_id": 2,
  "global_attn_idx": [0, 15, 31],
  "hidden_act": "silu",
  "hidden_size": 1600,
  "initializer_range": 0.02,
  "intermediate_size": 5504,
  "kq_head_dim": -1,
  "kq_norm": "none",
  "kv_reuse_every_i_layer": -1,
  "kv_reuse_group": [
    [1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14],
    [16, 17, 18], [19, 20], [21, 22], [23, 24], [25, 26], [27, 28], [29, 30]
  ],
  "kv_weight_reuse": false,
  "layer_type": [
    "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h",
    "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h"
  ],
  "mamba_conv_bias": true,
  "mamba_d_conv": 4,
  "mamba_d_state": 16,
  "mamba_dt_rank": 100,
  "mamba_expand": 2,
  "mamba_inner_layernorms": true,
  "mamba_proj_bias": false,
  "max_position_embeddings": 8192,
  "memory_tokens_interspersed_every": 0,
  "mlp_hidden_act": "silu",
  "model_type": "hymba",
  "num_attention_heads": 25,
  "num_experts": 1,
  "num_experts_per_tok": 1,
  "num_hidden_layers": 32,
  "num_key_value_heads": 5,
  "num_mamba": 1,
  "num_memory_tokens": 128,
  "orig_max_position_embeddings": 2048,
  "output_router_logits": false,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope": true,
  "rope_theta": 10000.0,
  "rope_type": "ntk",
  "router_aux_loss_coef": 0.001,
  "seq_length": 8192,
  "sliding_window": 1024,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.44.0",
  "use_cache": false,
  "use_mamba_kernels": true,
  "v_head_dim": 128,
  "vocab_size": 32001
}
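As a cross-check against the README's architecture summary, the key claims can be derived from the fields above; a small sketch with values copied from this file:

```
# Sketch: deriving the README's architecture claims from config.json values
num_hidden_layers = 32
global_attn_idx = [0, 15, 31]            # the 3 full (global) attention layers
swa_layers = num_hidden_layers - len(global_attn_idx)
print(swa_layers, swa_layers / num_hidden_layers)  # 29 layers, ~0.91 -> "90% sliding window attention"
print(25 / 5)                            # num_attention_heads / num_key_value_heads = 5 query heads per KV head (GQA)
```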
configuration_hymba.py
ADDED
@@ -0,0 +1,116 @@
import math
from transformers.configuration_utils import PretrainedConfig


class HymbaConfig(PretrainedConfig):

    model_type = "hymba"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=65536,
        tie_word_embeddings=False,
        hidden_size=4096,
        intermediate_size=14336,
        num_hidden_layers=32,
        num_attention_heads=32,
        num_key_value_heads=8,
        hidden_act="silu",
        initializer_range=0.02,
        rms_norm_eps=1e-6,
        use_cache=True,
        calc_logits_for_entire_prompt=False,
        output_router_logits=False,
        router_aux_loss_coef=0.001,
        pad_token_id=0,
        bos_token_id=1,
        eos_token_id=2,
        sliding_window=None,
        max_position_embeddings=262144,
        orig_max_position_embeddings=None,
        attention_dropout=0.0,
        num_experts_per_tok=2,
        num_experts=16,
        use_mamba_kernels=True,
        mamba_d_state=16,
        mamba_d_conv=4,
        mamba_expand=2,
        mamba_dt_rank="auto",
        mamba_conv_bias=True,
        mamba_proj_bias=False,
        mamba_inner_layernorms=True,
        kv_reuse_every_i_layer=-1,
        kv_reuse_group=None,
        kv_weight_reuse=False,
        global_attn_idx=None,
        num_mamba=1,
        attn_implementation_new='sdpa',
        rope_type=None,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.tie_word_embeddings = tie_word_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.sliding_window = sliding_window
        self.max_position_embeddings = max_position_embeddings
        self.orig_max_position_embeddings = orig_max_position_embeddings
        self.attention_dropout = attention_dropout

        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps

        self.use_cache = use_cache
        self.calc_logits_for_entire_prompt = calc_logits_for_entire_prompt
        self.output_router_logits = output_router_logits
        self.router_aux_loss_coef = router_aux_loss_coef

        self.num_experts_per_tok = num_experts_per_tok
        self.num_experts = num_experts

        self.use_mamba_kernels = use_mamba_kernels
        self.mamba_d_state = mamba_d_state
        self.mamba_d_conv = mamba_d_conv
        self.mamba_expand = mamba_expand
        self.mamba_dt_rank = math.ceil(self.hidden_size / 16) if mamba_dt_rank == "auto" else mamba_dt_rank
        self.mamba_conv_bias = mamba_conv_bias
        self.mamba_proj_bias = mamba_proj_bias
        self.mamba_inner_layernorms = mamba_inner_layernorms

        self.attn_hidden_size = kwargs.pop("attn_hidden_size", -1)
        self.kq_head_dim = kwargs.pop("kq_head_dim", -1)
        self.v_head_dim = kwargs.pop("v_head_dim", -1)
        self.kq_norm = kwargs.pop("kq_norm", None)
        self.rope = kwargs.pop("rope", False)
        self.rope_theta = kwargs.pop("rope_theta", 10000.0)
        self.num_memory_tokens = kwargs.pop("num_memory_tokens", 0)
        self.memory_tokens_interspersed_every = kwargs.pop("memory_tokens_interspersed_every", 0)

        self.kv_reuse_every_i_layer = kv_reuse_every_i_layer
        self.kv_reuse_group = kv_reuse_group
        self.kv_weight_reuse = kv_weight_reuse

        self.global_attn_idx = global_attn_idx

        self.num_mamba = num_mamba

        self.attn_implementation_new = attn_implementation_new

        self.rope_type = rope_type

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )
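The defaults above are generic; the checkpoint's `config.json` overrides them. A minimal sketch, assuming `configuration_hymba.py` is importable locally, showing how the `"auto"` rule for `mamba_dt_rank` resolves at Hymba-1.5B's hidden size:

```
# Sketch: HymbaConfig with the Hymba-1.5B-Base values (see config.json in this commit)
from configuration_hymba import HymbaConfig

cfg = HymbaConfig(
    vocab_size=32001,
    hidden_size=1600,
    intermediate_size=5504,
    num_hidden_layers=32,
    num_attention_heads=25,
    num_key_value_heads=5,
    sliding_window=1024,
    max_position_embeddings=8192,
    global_attn_idx=[0, 15, 31],
    num_memory_tokens=128,   # consumed via **kwargs, as in __init__ above
)
print(cfg.mamba_dt_rank)     # "auto" -> math.ceil(1600 / 16) = 100, matching config.json
```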
generation_config.json
ADDED
@@ -0,0 +1,8 @@
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "transformers_version": "4.44.0",
  "use_cache": false
}
images/macro_arch.png
ADDED
images/module.png
ADDED
images/performance1.png
ADDED
images/performance2.png
ADDED
modeling_hymba.py
ADDED
The diff for this file is too large to render.
setup.sh
ADDED
@@ -0,0 +1,44 @@
#!/bin/bash

# Prompt user to specify CUDA version
read -p "Enter CUDA version (12.1 or 12.4): " cuda_version

# Verify CUDA version input
if [[ "$cuda_version" != "12.1" && "$cuda_version" != "12.4" ]]; then
    echo "Invalid CUDA version specified. Please choose either 12.1 or 12.4."
    exit 1
fi

# Install PyTorch with the specified CUDA version
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=$cuda_version -c pytorch -c nvidia

# Install other packages
pip install --upgrade transformers
pip install tiktoken
pip install sentencepiece
pip install protobuf
pip install ninja einops triton packaging

# Clone and install Mamba
git clone https://github.com/state-spaces/mamba.git
cd mamba
pip install -e .
cd ..

# Clone and install causal-conv1d with specified CUDA version
git clone https://github.com/Dao-AILab/causal-conv1d.git
cd causal-conv1d
export CUDA_HOME=/usr/local/cuda-$cuda_version
TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6;8.9;9.0" python setup.py install
cd ..

# Clone and install attention-gym
git clone https://github.com/pytorch-labs/attention-gym.git
cd attention-gym
pip install .
cd ..

# Install Flash Attention
pip install flash_attn

echo "Installation completed with CUDA $cuda_version."
special_tokens_map.json
ADDED
@@ -0,0 +1,30 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
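Together with `added_tokens.json`, this defines a Llama-style special-token setup with one extra `[PAD]` token at id 32000. A quick sketch for verifying it, assuming the tokenizer loads as in the README usage section:

```
# Sketch: check the special tokens defined in this commit
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("nvidia/Hymba-1.5B-Base", trust_remote_code=True)
print(tok.bos_token, tok.eos_token, tok.unk_token)   # <s> </s> <unk>
print(tok.pad_token, tok.pad_token_id)               # [PAD] 32000 (from added_tokens.json)
print(len(tok))                                      # expected 32001, matching vocab_size in config.json
```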
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
size 499723
tokenizer_config.json
ADDED
@@ -0,0 +1,52 @@
{
  "add_bos_token": true,
  "add_eos_token": false,
  "add_prefix_space": true,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32000": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "chat_template": "{{'<extra_id_0>System'}}{% for message in messages %}{% if message['role'] == 'system' %}{{'\n' + message['content'].strip()}}{% if tools or contexts %}{{'\n'}}{% endif %}{% endif %}{% endfor %}{% if tools %}{% for tool in tools %}{{ '\n<tool> ' + tool|tojson + ' </tool>' }}{% endfor %}{% endif %}{% if contexts %}{% if tools %}{{'\n'}}{% endif %}{% for context in contexts %}{{ '\n<context> ' + context.strip() + ' </context>' }}{% endfor %}{% endif %}{{'\n\n'}}{% for message in messages %}{% if message['role'] == 'user' %}{{ '<extra_id_1>User\n' + message['content'].strip() + '\n' }}{% elif message['role'] == 'assistant' %}{{ '<extra_id_1>Assistant\n' + message['content'].strip() + '\n' }}{% elif message['role'] == 'tool' %}{{ '<extra_id_1>Tool\n' + message['content'].strip() + '\n' }}{% endif %}{% endfor %}{%- if add_generation_prompt %}{{'<extra_id_1>Assistant\n'}}{%- endif %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "</s>",
  "legacy": true,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "[PAD]",
  "padding_side": "left",
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false
}
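The `chat_template` above uses NeMo-style `<extra_id_0>System` / `<extra_id_1>User` / `<extra_id_1>Assistant` turn markers. A sketch of rendering it with `apply_chat_template`; the README's base-model example does not use the template, so this is illustrative only:

```
# Sketch: render the bundled chat template without tokenizing
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("nvidia/Hymba-1.5B-Base", trust_remote_code=True)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Hymba?"},
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)  # expected to start with "<extra_id_0>System" and end with "<extra_id_1>Assistant\n"
```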