robgreenberg3 committed on
Commit 596c161 · verified · 1 Parent(s): 855e3b8

Update README.md (#3)


- Update README.md (a187a0b6b753d930e535ba29690605f9035b9f2e)

Files changed (1)
  1. README.md +161 -2
README.md CHANGED
@@ -16,7 +16,14 @@ license: llama3.1
16
  base_model: nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
17
  ---
18
 
19
- # Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic
20
 
21
  ## Model Overview
22
  - **Model Architecture:** Llama-3.1-Nemotron
@@ -53,7 +60,7 @@ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/
53
  from vllm import LLM, SamplingParams
54
  from transformers import AutoTokenizer
55
 
56
- model_id = "neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic"
57
  number_gpus = 2
58
 
59
  sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
@@ -77,6 +84,158 @@ print(generated_text)
77
 
78
  vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
79
 
80
  ## Creation
81
 
82
  This model was created by applying [LLM-Compressor](https://github.com/vllm-project/llm-compressor), as presented in the code snippet below.
 
16
  base_model: nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
17
  ---
18
 
19
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
20
+ Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic
21
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
22
+ </h1>
23
+
24
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
25
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
26
+ </a>
27
 
28
  ## Model Overview
29
  - **Model Architecture:** Llama-3.1-Nemotron
 
60
  from vllm import LLM, SamplingParams
61
  from transformers import AutoTokenizer
62
 
63
+ model_id = "RedHatAI/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic"
64
  number_gpus = 2
65
 
66
  sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
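The diff shows only part of the usage snippet, so here is a minimal, self-contained sketch of the same offline-inference flow for orientation; the prompt text and the exact generation call are illustrative assumptions, not lines from the model card:

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic"
number_gpus = 2

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

# Build a chat-formatted prompt using the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Who are you?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Load the FP8-dynamic checkpoint across the requested GPUs and generate.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)
```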
 
84
 
85
  vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
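As a hedged illustration of that OpenAI-compatible mode (the server address, port, and the `vllm serve` invocation in the comment below are assumptions, not part of the model card):

```python
from openai import OpenAI

# Assumes a local server was started with something like:
#   vllm serve RedHatAI/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic --tensor-parallel-size 2
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",
    messages=[{"role": "user", "content": "Who are you?"}],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)
```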
86
 
87
+ <details>
88
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
89
+
90
+ ```bash
91
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
92
+ --ipc=host \
93
+ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
94
+ --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
95
+ --name=vllm \
96
+ registry.access.redhat.com/rhaiis/rh-vllm-cuda \
97
+ vllm serve \
98
+ --tensor-parallel-size 8 \
99
+ --max-model-len 32768 \
100
+ --enforce-eager --model RedHatAI/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic
101
+ ```
102
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
103
+ </details>
104
+
105
+ <details>
106
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
107
+
108
+ ```bash
109
+ # Download model from Red Hat Registry via docker
110
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
111
+ ilab model download --repository docker://registry.redhat.io/rhelai1/llama-3-1-nemotron-70b-instruct-hf-fp8-dynamic:1.5
112
+ ```
113
+
114
+ ```bash
115
+ # Serve model via ilab
116
+ ilab model serve --model-path ~/.cache/instructlab/models/llama-3-1-nemotron-70b-instruct-hf-fp8-dynamic
117
+
118
+ # Chat with model
119
+ ilab model chat --model ~/.cache/instructlab/models/llama-3-1-nemotron-70b-instruct-hf-fp8-dynamic
120
+ ```
121
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
122
+ </details>
123
+
124
+ <details>
125
+ <summary>Deploy on <strong>Red Hat Openshift AI</strong></summary>
126
+
127
+ ```yaml
128
+ # Setting up vllm server with ServingRuntime
129
+ # Save as: vllm-servingruntime.yaml
130
+ apiVersion: serving.kserve.io/v1alpha1
131
+ kind: ServingRuntime
132
+ metadata:
133
+ name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
134
+ annotations:
135
+ openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
136
+ opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
137
+ labels:
138
+ opendatahub.io/dashboard: 'true'
139
+ spec:
140
+ annotations:
141
+ prometheus.io/port: '8080'
142
+ prometheus.io/path: '/metrics'
143
+ multiModel: false
144
+ supportedModelFormats:
145
+ - autoSelect: true
146
+ name: vLLM
147
+ containers:
148
+ - name: kserve-container
149
+ image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
150
+ command:
151
+ - python
152
+ - -m
153
+ - vllm.entrypoints.openai.api_server
154
+ args:
155
+ - "--port=8080"
156
+ - "--model=/mnt/models"
157
+ - "--served-model-name={{.Name}}"
158
+ env:
159
+ - name: HF_HOME
160
+ value: /tmp/hf_home
161
+ ports:
162
+ - containerPort: 8080
163
+ protocol: TCP
164
+ ```
165
+
166
+ ```yaml
167
+ # Attach model to vllm server. This is an NVIDIA template
168
+ # Save as: inferenceservice.yaml
169
+ apiVersion: serving.kserve.io/v1beta1
170
+ kind: InferenceService
171
+ metadata:
172
+ annotations:
173
+ openshift.io/display-name: Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic # OPTIONAL CHANGE
174
+ serving.kserve.io/deploymentMode: RawDeployment
175
+ name: Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic # specify model name. This value will be used to invoke the model in the payload
176
+ labels:
177
+ opendatahub.io/dashboard: 'true'
178
+ spec:
179
+ predictor:
180
+ maxReplicas: 1
181
+ minReplicas: 1
182
+ model:
183
+ modelFormat:
184
+ name: vLLM
185
+ name: ''
186
+ resources:
187
+ limits:
188
+ cpu: '2' # this is model specific
189
+ memory: 8Gi # this is model specific
190
+ nvidia.com/gpu: '1' # this is accelerator specific
191
+ requests: # same comment for this block
192
+ cpu: '1'
193
+ memory: 4Gi
194
+ nvidia.com/gpu: '1'
195
+ runtime: vllm-cuda-runtime # must match the ServingRuntime name above
196
+ storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-nemotron-70b-instruct-hf-fp8-dynamic:1.5
197
+ tolerations:
198
+ - effect: NoSchedule
199
+ key: nvidia.com/gpu
200
+ operator: Exists
201
+ ```
202
+
203
+ ```bash
204
+ # make sure first to be in the project where you want to deploy the model
205
+ # oc project <project-name>
206
+ # apply both resources to run model
207
+ # Apply the ServingRuntime
208
+ oc apply -f vllm-servingruntime.yaml
209
+ # Apply the InferenceService
210
+ oc apply -f inferenceservice.yaml
211
+ ```
212
+
213
+ ```bash
214
+ # Replace <inference-service-name> and <domain> below:
215
+ # - Run `oc get inferenceservice` to find your URL if unsure.
216
+ # Call the server using curl:
217
+ curl https://<inference-service-name>-predictor-default.<domain>/v1/chat/completions \
218
+ -H "Content-Type: application/json" \
219
+ -d '{
220
+ "model": "Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",
221
+ "stream": true,
222
+ "stream_options": {
223
+ "include_usage": true
224
+ },
225
+ "max_tokens": 1,
226
+ "messages": [
227
+ {
228
+ "role": "user",
229
+ "content": "How can a bee fly when its wings are so small?"
230
+ }
231
+ ]
232
+ }'
233
+ ```
234
+
235
+ See [Red Hat Openshift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
236
+ </details>
237
+
238
+
239
  ## Creation
240
 
241
  This model was created by applying [LLM-Compressor](https://github.com/vllm-project/llm-compressor), as presented in the code snippet below.
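The full recipe follows in the model card; as a rough sketch of what an FP8-dynamic run with llm-compressor typically looks like (the exact scheme, ignore list, and module paths below are assumptions, not the published recipe):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
save_dir = "Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP8 weights with dynamic per-token FP8 activations; lm_head kept in higher precision (assumption).
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Data-free one-shot quantization; the dynamic scheme needs no calibration set.
oneshot(model=model, recipe=recipe)

model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```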