doc-update (#23)
- doc: update doc for vLLM 256K support, align Chinese doc with EN doc. (96a5535fe4fd7f047e939dc803e1bb8e549daecd)
- add vLLM latest image doc. (f3ef13f76465e2643eae45139b15620e67f7aa66)
Co-authored-by: asher <asherszhang@users.noreply.huggingface.co>
- README.md +36 -9
- README_CN.md +95 -225
README.md
CHANGED
@@ -227,9 +227,7 @@ We provide a pre-built Docker image containing vLLM 0.8.5 with full support for
 - To get started:

 ```
-docker pull
-or
-docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm
+docker pull hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1
 ```

 - Download Model file:
@@ -247,8 +245,7 @@ docker run --rm --ipc=host \
 --net=host \
 --gpus=all \
 -it \
-
---entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
+--entrypoint python3 hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1 \
 -m vllm.entrypoints.openai.api_server \
 --host 0.0.0.0 \
 --tensor-parallel-size 4 \
@@ -265,8 +262,7 @@ docker run --rm --ipc=host \
 --net=host \
 --gpus=all \
 -it \
-
---entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
+--entrypoint python3 hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1 \
 -m vllm.entrypoints.openai.api_server \
 --host 0.0.0.0 \
 --tensor-parallel-size 4 \
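Once either of the containers above is running, the server exposes an OpenAI-compatible API. The following is a minimal illustrative sketch, not part of the diff; it assumes the server was started with `--port 8000` and `--model tencent/Hunyuan-A13B-Instruct` as in the full launch commands, and that no API key is configured:

```python
# Minimal sketch: query the OpenAI-compatible endpoint served by the container above.
# Assumes http://localhost:8000/v1 is reachable and the served model name matches.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short summary of the benefits of regular exercise."},
    ],
    temperature=0.7,
    max_tokens=512,
    extra_body={"top_p": 0.8, "top_k": 20},
)
print(response.choices[0].message.content)
```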
@@ -281,6 +277,38 @@ Support for this model has been added via this [PR 20114](https://github.com/vl
 You can build and run vLLM from source after merging this pull request into your local repository.


+### Model Context Length Support
+
+The Hunyuan A13B model supports a maximum context length of **256K tokens (262,144 token positions)**. However, due to GPU memory constraints on most hardware setups, the default configuration in `config.json` limits the context length to **32K tokens** to prevent out-of-memory (OOM) errors.
+
+#### Extending Context Length to 256K
+
+To enable full 256K context support, you can manually modify the `max_position_embeddings` field in the model's `config.json` file as follows:
+
+```json
+{
+  ...
+  "max_position_embeddings": 262144,
+  ...
+}
+```
+
+When serving the model using **vLLM**, you can also explicitly set the maximum model length by adding the following flag to your server launch command:
+
+```bash
+--max-model-len 262144
+```
+
+#### Recommended Configuration for 256K Context Length
+
+The following configuration is recommended for deploying the model with 256K context length support on systems equipped with **NVIDIA H20 GPUs (96GB VRAM)**:
+
+| Model DType | KV-Cache Dtype | Number of Devices | Model Length |
+|-------------|----------------|-------------------|--------------|
+| `bfloat16`  | `bfloat16`     | 4                 | 262,144      |
+
+> ⚠️ **Note:** Using FP8 quantization for KV-cache may impact generation quality. The above settings are suggested configurations for stable 256K-length service deployment.
+

 #### Tool Calling with vLLM

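The same 256K limit can be exercised without the HTTP server through vLLM's offline Python API. A minimal sketch, not part of the diff; the model path, 4-GPU assumption, and prompt are illustrative and mirror the recommended H20 configuration above:

```python
# Minimal sketch: offline generation with the 256K window described above.
# Assumes 4 GPUs and a checkpoint whose config.json permits 262144 positions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="tencent/Hunyuan-A13B-Instruct",
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=4,   # matches the recommended 4-device setup
    max_model_len=262144,     # same effect as --max-model-len 262144 on the server
)

params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=256)
outputs = llm.generate(["Write a short summary of the benefits of regular exercise."], params)
print(outputs[0].outputs[0].text)
```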
@@ -331,7 +359,6 @@ docker run --gpus all \
 -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
 ```

-
 ## Contact Us

-If you would like to leave a message for our R&D and product teams, Welcome to contact our open-source team . You can also contact us via email (hunyuan_opensource@tencent.com).
+If you would like to leave a message for our R&D and product teams, Welcome to contact our open-source team . You can also contact us via email (hunyuan_opensource@tencent.com).
README_CN.md
CHANGED
@@ -176,281 +176,151 @@ print(response)
 FP8 and INT4 quantized models for TensorRT-LLM are still being worked on; stay tuned.


-##
-### Docker:
-
-```shell
-# pull the image
-docker pull hunyuaninfer/hunyuan-large:hunyuan-moe-A13B-vllm
-# start the container
-docker run --name hunyuanLLM_infer -itd --privileged --user root --net=host --ipc=host --gpus=8 hunyuaninfer/hunyuan-large:hunyuan-moe-A13B-vllm
-```
-
-Note on Docker container privileges: starting the container in privileged mode (--privileged), as above, grants it elevated permissions and increases the risk of data leakage and cluster compromise. Avoid privileged mode unless it is strictly necessary; where it is required, perform a thorough security assessment and apply appropriate monitoring and hardening measures.
-
-### BF16 Deployment
-
-BF16 can be deployed on 2 GPUs with more than 80 GB of memory each; TP4 is recommended for long-context workloads. Proceed as follows.
-
-Set the following environment variable before running the commands:
-
-```shell
-export MODEL_PATH=PATH_TO_MODEL
-```
-
-#### Step 1: Run inference
-
-#### Option 1: Command-line inference
-
-The snippet below uses `vLLM` to quickly query the chat model.
-
-Note on remote code execution protection for vLLM: if the trust-remote-code option is enabled in the code below, vLLM may load and execute code from a remote model repository, which can lead to malicious code execution. Keep this option disabled unless your use case explicitly requires it.
-
-```python
-import os
-from typing import List, Optional
-from vllm import LLM, SamplingParams
-from vllm.inputs import PromptType
-from transformers import AutoTokenizer
-
-model_path=os.environ.get('MODEL_PATH')
-tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
-
-llm = LLM(model=model_path,
-        tokenizer=model_path,
-        trust_remote_code=True,
-        dtype='bfloat16',
-        tensor_parallel_size=4,
-        gpu_memory_utilization=0.9)
-
-sampling_params = SamplingParams(
-    temperature=0.7, top_p=0.8, max_tokens=4096, top_k=20, repetition_penalty=1.05)
-
-messages = [
-    {
-        "role": "system",
-        "content": "You are a helpful assistant.",
-    },
-    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
-]
-
-tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
-
-dummy_inputs: List[PromptType] = [{
-    "prompt_token_ids": batch
-} for batch in tokenized_chat.numpy().tolist()]
-
-outputs = llm.generate(dummy_inputs, sampling_params)
-
-# Print the outputs.
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
-    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-```
-
-Remember to set `${LOCAL_IP}` and `${MODEL_PATH}` in `openapi.sh` to the values used by your service.
-
-Image: the deployment image is the same as for BF16.
-
-To deploy the Int8 weight-only version of the HunYuan-A13B model, set the environment variable in `run_server_int8.sh`:
-```SHELL
-export MODEL_PATH=PATH_TO_BF16_MODEL
-```
-
-```shell
-sh run_server_int8.sh
-```
-
-```shell
-sh openapi.sh
-```
-
-```shell
-sh run_server_int4.sh
-```
-
-sh openapi.sh
-```
-
-####
-To deploy the W8A8C8 version of the HunYuan-A13B model, set the environment variable in `run_server_int8.sh`:
-```shell
-export MODEL_PATH=PATH_TO_FP8_MODEL
-```
-
-```shell
-sh run_server_fp8.sh
-```
-
-This section reports throughput results for serving the models (original and quantized) with vLLM, including inference speed (tokens/s) at different batch sizes. Test environment: Tencent Cloud, H80 (96G) GPU x number of cards:
-
-```python
-python3 benchmark_throughput.py --backend vllm \
-    --input-len 2048 \
-    --output-len 14336 \
-    --model $MODEL_PATH \
-    --tensor-parallel-size $TP \
-    --use-v2-block-manager \
-    --async-engine \
-    --trust-remote-code \
-    --num_prompts $BATCH_SIZE \
-    --max-num-seqs $BATCH_SIZE
-```
-
-|------|-----------------------------|-----------|-------------------------|---------------------|----------------------|----------------------|
-| vLLM | Hunyuan-A13B-Instruct | 8 | 2048 | 190.84 | 1246.54 | 1981.99 |
-| vLLM | Hunyuan-A13B-Instruct | 4 | 2048 | 158.90 | 779.10 | 1301.75 |
-| vLLM | Hunyuan-A13B-Instruct | 2 | 2048 | 111.72 | 327.31 | 346.54 |
-| vLLM | Hunyuan-A13B-Instruct(int8 weight only) | 2 | 2048 | 109.10 | 444.17 | 721.93 |
-| vLLM | Hunyuan-A13B-Instruct(W8A8C8-FP8) | 2 | 2048 | 91.83 | 372.01 | 617.70 |
-| vLLM | Hunyuan-A13B-Instruct(W8A8C8-FP8) | 1 | 2048 | 60.07 | 148.80 | 160.41 |
-
-#### Option 1: Command-line inference
-
-```python
-import sglang as sgl
-from transformers import AutoTokenizer
-
-model_path=os.environ.get('MODEL_PATH')
-
-tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
-
-messages = [
-    {
-        "role": "system",
-        "content": "You are a helpful assistant.",
-    },
-    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
-]
-prompts = []
-prompts.append(tokenizer.apply_chat_template(
-    messages,
-    tokenize=False,
-    add_generation_prompt=True
-))
-print(prompts)
-
-llm = sgl.Engine(
-    model_path=model_path,
-    tp_size=4,
-    trust_remote_code=True,
-    mem_fraction_static=0.7,
-)
-
-outputs = llm.generate(prompts, sampling_params)
-for prompt, output in zip(prompts, outputs):
-    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
-```
-
-```python
-import openai
-client = openai.Client(
-    base_url="http://localhost:30000/v1", api_key="EMPTY")
-
-    extra_body={"top_p": 0.8, "top_k": 20}
-)
-print(response)
-```
-
-#### FP8/Int4 quantized model deployment:
-FP8 and INT4 quantized models for SGLang are still being worked on; stay tuned.

+## vLLM Deployment

+### Docker Image

+We provide a Docker image based on the official vLLM 0.8.5 release for quick deployment and testing. **Note: this image requires CUDA 12.4.**

+- To get started:

 ```
+docker pull hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1
 ```

+- Download the model files:
+  - Huggingface: vLLM downloads it automatically.
+  - ModelScope: `modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct`

+- Start the API server (model downloaded from Huggingface):

+```bash
+docker run --rm --ipc=host \
+-v ~/.cache:/root/.cache/ \
+--security-opt seccomp=unconfined \
+--net=host \
+--gpus=all \
+-it \
+--entrypoint python3 hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1 \
+-m vllm.entrypoints.openai.api_server \
+--host 0.0.0.0 \
+--tensor-parallel-size 4 \
+--port 8000 \
+--model tencent/Hunyuan-A13B-Instruct \
+--trust_remote_code
 ```

+- Start the API server (model downloaded from ModelScope):

+```bash
+docker run --rm --ipc=host \
+-v ~/.cache/modelscope:/root/.cache/modelscope \
+--security-opt seccomp=unconfined \
+--net=host \
+--gpus=all \
+-it \
+--entrypoint python3 hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1 \
+-m vllm.entrypoints.openai.api_server \
+--host 0.0.0.0 \
+--tensor-parallel-size 4 \
+--port 8000 \
+--model /root/.cache/modelscope/hub/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct/ \
+--trust_remote_code
 ```

+### Source Code Deployment

+Support for this model has been submitted to the vLLM project via [PR 20114](https://github.com/vllm-project/vllm/pull/20114).

+You can build and run vLLM from source after merging this PR into your local repository.

+### Model Context Length Support

+The Hunyuan A13B model supports a maximum context length of **256K tokens (262,144 positions)**. However, due to the GPU memory limits of most hardware setups, the default `config.json` caps the context length at **32K tokens** to avoid out-of-memory (OOM) errors.

+#### Extending the Context Length to 256K

+To enable full 256K context support, manually modify the `max_position_embeddings` field in the model's `config.json` as follows:

+```json
+{
+  ...
+  "max_position_embeddings": 262144,
+  ...
+}
 ```

+When serving with **vLLM**, you can also set the maximum model length explicitly by adding the following flag to the launch command:

+```bash
+--max-model-len 262144
 ```

+#### Recommended Configuration for 256K Context Length

+The following configuration is recommended for deploying a 256K-context service on systems equipped with **NVIDIA H20 GPUs (96 GB VRAM)**:

+| Model DType | KV-Cache DType | Number of Devices | Model Length |
+|-------------|----------------|-------------------|--------------|
+| `bfloat16`  | `bfloat16`     | 4                 | 262,144      |

+> ⚠️ **Note:** Quantizing the KV-cache with FP8 may affect generation quality. The settings above are the suggested configuration for a stable 256K-length service.
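As a convenience, the `config.json` change described above can also be applied with a short script instead of hand-editing. A minimal sketch, not part of the diff; `MODEL_PATH` is assumed to point at a locally downloaded checkpoint directory:

```python
# Minimal sketch: raise max_position_embeddings in a local checkpoint's config.json
# to the 256K value documented above. MODEL_PATH is an assumed local directory.
import json
import os
from pathlib import Path

config_path = Path(os.environ.get("MODEL_PATH", ".")) / "config.json"

config = json.loads(config_path.read_text())
config["max_position_embeddings"] = 262144  # 256K token positions
config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False))

print(f"Updated {config_path}: max_position_embeddings = {config['max_position_embeddings']}")
```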
+### Tool Calling with vLLM

+To support agent-based workflows and function-calling scenarios, the model includes dedicated parsing mechanisms for handling tool calls and internal reasoning steps.

+For a complete working example of how to implement and use these capabilities in an agent setting, see our GitHub sample code:
+🔗 [Hunyuan A13B agent example](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/)

+When deploying the model with **vLLM**, the following parameters configure the tool-parsing behavior:

+| Parameter | Value |
+|-------------------------|--------------------------------------------------------------------|
+| `--tool-parser-plugin` | [local Hunyuan A13B tool parser file](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/hunyuan_tool_parser.py) |
+| `--tool-call-parser` | `hunyuan` |

+These settings let vLLM correctly parse and route the tool calls generated by the model in the expected format.

+### Reasoning Parser

+Reasoning parser support for the Hunyuan A13B model in vLLM is still under development.

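Once a server is launched with the two tool-parsing flags listed above, tool calls can be exercised through the standard OpenAI tools interface. A minimal illustrative sketch, not part of the diff; the endpoint, port, served model name, and the example weather function are assumptions:

```python
# Minimal sketch: send a function-calling request to a vLLM server started with
# --tool-parser-plugin hunyuan_tool_parser.py and --tool-call-parser hunyuan.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, used only for illustration
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",
    messages=[{"role": "user", "content": "What is the weather in Shenzhen?"}],
    tools=tools,
    tool_choice="auto",
)
# If the parser recognized a tool call, it is returned in structured form.
print(response.choices[0].message.tool_calls)
```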
+## SGLang

+### Docker Image

+We also provide a Docker image based on the latest version of SGLang.

+To get started:

+- Pull the Docker image:

+```
+docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang
+or
+docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-sglang
 ```

+- Start the API server:

+```bash
+docker run --gpus all \
+--shm-size 32g \
+-p 30000:30000 \
+--ipc=host \
+docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang \
+-m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
 ```

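After the SGLang container above is up, it exposes the same OpenAI-compatible interface on port 30000, matching the client snippet that this commit removes from the old Chinese doc. A minimal illustrative sketch, not part of the diff; the served model name and the absence of an API key are assumptions:

```python
# Minimal sketch: query the SGLang server launched by the docker run command above.
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # assumed served-model name; adjust to match your launch settings
    messages=[{"role": "user", "content": "Write a short summary of the benefits of regular exercise."}],
    temperature=0.7,
    max_tokens=512,
    extra_body={"top_p": 0.8, "top_k": 20},
)
print(response.choices[0].message.content)
```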
 ## Interactive Web Demo
 A web demo of hunyuan-A13B is now available. Visit https://hunyuan.tencent.com/?model=hunyuan-a13b to try out the model.

-<br>
-
-## Citation
-If you find our work helpful, feel free to cite our <a href="report/Hunyuan_A13B_Technical_Report.pdf">technical report</a>!
-
-<br>

 ## Contact Us
-If you would like to leave a message for our R&D and product teams, feel free to contact our Tencent Hunyuan LLM team. You can also reach us by email (hunyuan_opensource@tencent.com).
+If you would like to leave a message for our R&D and product teams, feel free to contact our Tencent Hunyuan LLM team. You can also reach us by email (hunyuan_opensource@tencent.com).