doc-update (#23)
- doc: update doc for vLLM 256K support, align Chinese doc with EN doc. (96a5535fe4fd7f047e939dc803e1bb8e549daecd)
- add vLLM latest image doc. (f3ef13f76465e2643eae45139b15620e67f7aa66)
Co-authored-by: asher <asherszhang@users.noreply.huggingface.co>
- README.md +36 -9
- README_CN.md +95 -225
README.md
CHANGED
@@ -227,9 +227,7 @@ We provide a pre-built Docker image containing vLLM 0.8.5 with full support for
 - To get started:

 ```
-docker pull
-or
-docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm
+docker pull hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1
 ```

 - Download Model file:
@@ -247,8 +245,7 @@ docker run --rm --ipc=host \
 --net=host \
 --gpus=all \
 -it \
-
---entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
+--entrypoint python3 hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1 \
 -m vllm.entrypoints.openai.api_server \
 --host 0.0.0.0 \
 --tensor-parallel-size 4 \
@@ -265,8 +262,7 @@ docker run --rm --ipc=host \
 --net=host \
 --gpus=all \
 -it \
-
---entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
+--entrypoint python3 hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1 \
 -m vllm.entrypoints.openai.api_server \
 --host 0.0.0.0 \
 --tensor-parallel-size 4 \
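Once either of the containers above is running, the server exposes an OpenAI-compatible API. The following is a minimal illustrative sketch, not part of the diff; it assumes the server was started with `--port 8000` and `--model tencent/Hunyuan-A13B-Instruct` as in the full launch commands, and that no API key is configured:

```python
# Minimal sketch: query the OpenAI-compatible endpoint served by the container above.
# Assumes http://localhost:8000/v1 is reachable and the served model name matches.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short summary of the benefits of regular exercise."},
    ],
    temperature=0.7,
    max_tokens=512,
    extra_body={"top_p": 0.8, "top_k": 20},
)
print(response.choices[0].message.content)
```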
@@ -281,6 +277,38 @@ Support for this model has been added via this [PR 20114](https://github.com/vl
 You can build and run vLLM from source after merging this pull request into your local repository.


+### Model Context Length Support
+
+The Hunyuan A13B model supports a maximum context length of **256K tokens (262,144 token positions)**. However, due to GPU memory constraints on most hardware setups, the default configuration in `config.json` limits the context length to **32K tokens** to prevent out-of-memory (OOM) errors.
+
+#### Extending Context Length to 256K
+
+To enable full 256K context support, you can manually modify the `max_position_embeddings` field in the model's `config.json` file as follows:
+
+```json
+{
+  ...
+  "max_position_embeddings": 262144,
+  ...
+}
+```
+
+When serving the model using **vLLM**, you can also explicitly set the maximum model length by adding the following flag to your server launch command:
+
+```bash
+--max-model-len 262144
+```
+
+#### Recommended Configuration for 256K Context Length
+
+The following configuration is recommended for deploying the model with 256K context length support on systems equipped with **NVIDIA H20 GPUs (96GB VRAM)**:
+
+| Model DType | KV-Cache Dtype | Number of Devices | Model Length |
+|-------------|----------------|-------------------|--------------|
+| `bfloat16`  | `bfloat16`     | 4                 | 262,144      |
+
+> ⚠️ **Note:** Using FP8 quantization for KV-cache may impact generation quality. The above settings are suggested configurations for stable 256K-length service deployment.
+

 #### Tool Calling with vLLM

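The same 256K limit can be exercised without the HTTP server through vLLM's offline Python API. A minimal sketch, not part of the diff; the model path, 4-GPU assumption, and prompt are illustrative and mirror the recommended H20 configuration above:

```python
# Minimal sketch: offline generation with the 256K window described above.
# Assumes 4 GPUs and a checkpoint whose config.json permits 262144 positions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="tencent/Hunyuan-A13B-Instruct",
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=4,   # matches the recommended 4-device setup
    max_model_len=262144,     # same effect as --max-model-len 262144 on the server
)

params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=256)
outputs = llm.generate(["Write a short summary of the benefits of regular exercise."], params)
print(outputs[0].outputs[0].text)
```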
@@ -331,7 +359,6 @@ docker run --gpus all \
 -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
 ```

-
 ## Contact Us

-If you would like to leave a message for our R&D and product teams, Welcome to contact our open-source team . You can also contact us via email (hunyuan_opensource@tencent.com).
+If you would like to leave a message for our R&D and product teams, Welcome to contact our open-source team . You can also contact us via email (hunyuan_opensource@tencent.com).
README_CN.md
CHANGED
@@ -176,281 +176,151 @@ print(response)
 FP8 and INT4 quantized models for TensorRT-LLM are still being worked on; stay tuned.


-##
-### Docker:
-
-```shell
-# pull the image
-docker pull hunyuaninfer/hunyuan-large:hunyuan-moe-A13B-vllm
-# start the container
-docker run --name hunyuanLLM_infer -itd --privileged --user root --net=host --ipc=host --gpus=8 hunyuaninfer/hunyuan-large:hunyuan-moe-A13B-vllm
-```
-
-Note on Docker container privileges: starting the container in privileged mode (--privileged), as above, grants it elevated permissions and increases the risk of data leakage and cluster compromise. Avoid privileged mode unless it is strictly necessary; where it is required, perform a thorough security assessment and apply appropriate monitoring and hardening measures.
-
-### BF16 Deployment
-
-BF16 can be deployed on 2 GPUs with more than 80 GB of memory each; TP4 is recommended for long-context workloads. Proceed as follows.
-
-Set the following environment variable before running the commands:
-
-```shell
-export MODEL_PATH=PATH_TO_MODEL
-```
-
-#### Step 1: Run inference
-
-#### Option 1: Command-line inference
-
-The snippet below uses `vLLM` to quickly query the chat model.
-
-Note on remote code execution protection for vLLM: if the trust-remote-code option is enabled in the code below, vLLM may load and execute code from a remote model repository, which can lead to malicious code execution. Keep this option disabled unless your use case explicitly requires it.
-
-```python
-import os
-from typing import List, Optional
-from vllm import LLM, SamplingParams
-from vllm.inputs import PromptType
-from transformers import AutoTokenizer
-
-model_path=os.environ.get('MODEL_PATH')
-tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
-
-llm = LLM(model=model_path,
-        tokenizer=model_path,
-        trust_remote_code=True,
-        dtype='bfloat16',
-        tensor_parallel_size=4,
-        gpu_memory_utilization=0.9)
-
-sampling_params = SamplingParams(
-    temperature=0.7, top_p=0.8, max_tokens=4096, top_k=20, repetition_penalty=1.05)
-
-messages = [
-    {
-        "role": "system",
-        "content": "You are a helpful assistant.",
-    },
-    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
-]
-
-tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
-
-dummy_inputs: List[PromptType] = [{
-    "prompt_token_ids": batch
-} for batch in tokenized_chat.numpy().tolist()]
-
-outputs = llm.generate(dummy_inputs, sampling_params)
-
-# Print the outputs.
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
-    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-```
-
-Remember to set `${LOCAL_IP}` and `${MODEL_PATH}` in `openapi.sh` to the values used by your service.
-
-Image: the deployment image is the same as for BF16.
-
-To deploy the Int8 weight-only version of the HunYuan-A13B model, set the environment variable in `run_server_int8.sh`:
-```SHELL
-export MODEL_PATH=PATH_TO_BF16_MODEL
-```
-
-```shell
-sh run_server_int8.sh
-```
-
-```shell
-sh openapi.sh
-```
-
-```shell
-sh run_server_int4.sh
-```
-
-sh openapi.sh
-```
-
-####
-To deploy the W8A8C8 version of the HunYuan-A13B model, set the environment variable in `run_server_int8.sh`:
-```shell
-export MODEL_PATH=PATH_TO_FP8_MODEL
-```
-
-```shell
-sh run_server_fp8.sh
-```
-
-This section reports throughput results for serving the models (original and quantized) with vLLM, including inference speed (tokens/s) at different batch sizes. Test environment: Tencent Cloud, H80 (96G) GPU x number of cards:
-
-```python
-python3 benchmark_throughput.py --backend vllm \
-    --input-len 2048 \
-    --output-len 14336 \
-    --model $MODEL_PATH \
-    --tensor-parallel-size $TP \
-    --use-v2-block-manager \
-    --async-engine \
-    --trust-remote-code \
-    --num_prompts $BATCH_SIZE \
-    --max-num-seqs $BATCH_SIZE
-```
-
-|------|-----------------------------|-----------|-------------------------|---------------------|----------------------|----------------------|
-| vLLM | Hunyuan-A13B-Instruct | 8 | 2048 | 190.84 | 1246.54 | 1981.99 |
-| vLLM | Hunyuan-A13B-Instruct | 4 | 2048 | 158.90 | 779.10 | 1301.75 |
-| vLLM | Hunyuan-A13B-Instruct | 2 | 2048 | 111.72 | 327.31 | 346.54 |
-| vLLM | Hunyuan-A13B-Instruct(int8 weight only) | 2 | 2048 | 109.10 | 444.17 | 721.93 |
-| vLLM | Hunyuan-A13B-Instruct(W8A8C8-FP8) | 2 | 2048 | 91.83 | 372.01 | 617.70 |
-| vLLM | Hunyuan-A13B-Instruct(W8A8C8-FP8) | 1 | 2048 | 60.07 | 148.80 | 160.41 |
-
-#### Option 1: Command-line inference
-
-```python
-import sglang as sgl
-from transformers import AutoTokenizer
-
-model_path=os.environ.get('MODEL_PATH')
-
-tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
-
-messages = [
-    {
-        "role": "system",
-        "content": "You are a helpful assistant.",
-    },
-    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
-]
-prompts = []
-prompts.append(tokenizer.apply_chat_template(
-    messages,
-    tokenize=False,
-    add_generation_prompt=True
-))
-print(prompts)
-
-llm = sgl.Engine(
-    model_path=model_path,
-    tp_size=4,
-    trust_remote_code=True,
-    mem_fraction_static=0.7,
-)
-
-outputs = llm.generate(prompts, sampling_params)
-for prompt, output in zip(prompts, outputs):
-    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
-```
-
-```python
-import openai
-client = openai.Client(
-    base_url="http://localhost:30000/v1", api_key="EMPTY")
-
-    extra_body={"top_p": 0.8, "top_k": 20}
-)
-print(response)
-```
-
-#### FP8/Int4 quantized model deployment:
-FP8 and INT4 quantized models for SGLang are still being worked on; stay tuned.

+## vLLM Deployment

+### Docker Image

+We provide a Docker image based on the official vLLM 0.8.5 release for quick deployment and testing. **Note: this image requires CUDA 12.4.**

+- To get started:

 ```
+docker pull hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1
 ```

+- Download the model files:
+  - Huggingface: vLLM downloads it automatically.
+  - ModelScope: `modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct`

+- Start the API server (model downloaded from Huggingface):

+```bash
+docker run --rm --ipc=host \
+-v ~/.cache:/root/.cache/ \
+--security-opt seccomp=unconfined \
+--net=host \
+--gpus=all \
+-it \
+--entrypoint python3 hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1 \
+-m vllm.entrypoints.openai.api_server \
+--host 0.0.0.0 \
+--tensor-parallel-size 4 \
+--port 8000 \
+--model tencent/Hunyuan-A13B-Instruct \
+--trust_remote_code
 ```

+- Start the API server (model downloaded from ModelScope):

+```bash
+docker run --rm --ipc=host \
+-v ~/.cache/modelscope:/root/.cache/modelscope \
+--security-opt seccomp=unconfined \
+--net=host \
+--gpus=all \
+-it \
+--entrypoint python3 hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1 \
+-m vllm.entrypoints.openai.api_server \
+--host 0.0.0.0 \
+--tensor-parallel-size 4 \
+--port 8000 \
+--model /root/.cache/modelscope/hub/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct/ \
+--trust_remote_code
 ```

+### Source Code Deployment

+Support for this model has been submitted to the vLLM project via [PR 20114](https://github.com/vllm-project/vllm/pull/20114).

+You can build and run vLLM from source after merging this PR into your local repository.

+### Model Context Length Support

+The Hunyuan A13B model supports a maximum context length of **256K tokens (262,144 positions)**. However, due to the GPU memory limits of most hardware setups, the default `config.json` caps the context length at **32K tokens** to avoid out-of-memory (OOM) errors.

+#### Extending the Context Length to 256K

+To enable full 256K context support, manually modify the `max_position_embeddings` field in the model's `config.json` as follows:

+```json
+{
+  ...
+  "max_position_embeddings": 262144,
+  ...
+}
 ```

+When serving with **vLLM**, you can also set the maximum model length explicitly by adding the following flag to the launch command:

+```bash
+--max-model-len 262144
 ```

+#### Recommended Configuration for 256K Context Length

+The following configuration is recommended for deploying a 256K-context service on systems equipped with **NVIDIA H20 GPUs (96 GB VRAM)**:

+| Model DType | KV-Cache DType | Number of Devices | Model Length |
+|-------------|----------------|-------------------|--------------|
+| `bfloat16`  | `bfloat16`     | 4                 | 262,144      |

+> ⚠️ **Note:** Quantizing the KV-cache with FP8 may affect generation quality. The settings above are the suggested configuration for a stable 256K-length service.
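As a convenience, the `config.json` change described above can also be applied with a short script instead of hand-editing. A minimal sketch, not part of the diff; `MODEL_PATH` is assumed to point at a locally downloaded checkpoint directory:

```python
# Minimal sketch: raise max_position_embeddings in a local checkpoint's config.json
# to the 256K value documented above. MODEL_PATH is an assumed local directory.
import json
import os
from pathlib import Path

config_path = Path(os.environ.get("MODEL_PATH", ".")) / "config.json"

config = json.loads(config_path.read_text())
config["max_position_embeddings"] = 262144  # 256K token positions
config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False))

print(f"Updated {config_path}: max_position_embeddings = {config['max_position_embeddings']}")
```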
+### Tool Calling with vLLM

+To support agent-based workflows and function-calling scenarios, the model includes dedicated parsing mechanisms for handling tool calls and internal reasoning steps.

+For a complete working example of how to implement and use these capabilities in an agent setting, see our GitHub sample code:
+🔗 [Hunyuan A13B agent example](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/)

+When deploying the model with **vLLM**, the following parameters configure the tool-parsing behavior:

+| Parameter | Value |
+|-------------------------|--------------------------------------------------------------------|
+| `--tool-parser-plugin` | [local Hunyuan A13B tool parser file](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/hunyuan_tool_parser.py) |
+| `--tool-call-parser` | `hunyuan` |

+These settings let vLLM correctly parse and route the tool calls generated by the model in the expected format.

+### Reasoning Parser

+Reasoning parser support for the Hunyuan A13B model in vLLM is still under development.

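Once a server is launched with the two tool-parsing flags listed above, tool calls can be exercised through the standard OpenAI tools interface. A minimal illustrative sketch, not part of the diff; the endpoint, port, served model name, and the example weather function are assumptions:

```python
# Minimal sketch: send a function-calling request to a vLLM server started with
# --tool-parser-plugin hunyuan_tool_parser.py and --tool-call-parser hunyuan.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, used only for illustration
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",
    messages=[{"role": "user", "content": "What is the weather in Shenzhen?"}],
    tools=tools,
    tool_choice="auto",
)
# If the parser recognized a tool call, it is returned in structured form.
print(response.choices[0].message.tool_calls)
```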
+## SGLang

+### Docker Image

+We also provide a Docker image based on the latest version of SGLang.

+To get started:

+- Pull the Docker image:

+```
+docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang
+or
+docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-sglang
 ```

+- Start the API server:

+```bash
+docker run --gpus all \
+--shm-size 32g \
+-p 30000:30000 \
+--ipc=host \
+docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang \
+-m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
 ```

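After the SGLang container above is up, it exposes the same OpenAI-compatible interface on port 30000, matching the client snippet that this commit removes from the old Chinese doc. A minimal illustrative sketch, not part of the diff; the served model name and the absence of an API key are assumptions:

```python
# Minimal sketch: query the SGLang server launched by the docker run command above.
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # assumed served-model name; adjust to match your launch settings
    messages=[{"role": "user", "content": "Write a short summary of the benefits of regular exercise."}],
    temperature=0.7,
    max_tokens=512,
    extra_body={"top_p": 0.8, "top_k": 20},
)
print(response.choices[0].message.content)
```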
 ## Interactive Web Demo
 A web demo of hunyuan-A13B is now available. Visit https://hunyuan.tencent.com/?model=hunyuan-a13b to try out the model.

-<br>
-
-## Citation
-If you find our work helpful, feel free to cite our <a href="report/Hunyuan_A13B_Technical_Report.pdf">technical report</a>!
-
-<br>

 ## Contact Us
-If you would like to leave a message for our R&D and product teams, feel free to contact our Tencent Hunyuan LLM team. You can also reach us by email (hunyuan_opensource@tencent.com).
+If you would like to leave a message for our R&D and product teams, feel free to contact our Tencent Hunyuan LLM team. You can also reach us by email (hunyuan_opensource@tencent.com).