zRzRzRzRzRzRzR committed on
Commit e2259f1 · 1 Parent(s): 32978ec
Files changed (2):
  1. README.md +289 -21
  2. configuration.json +0 -1
README.md CHANGED
@@ -7,39 +7,307 @@ pipeline_tag: text-generation
 library_name: transformers
 ---
 
- # GLM-4-9B-Chat-0414
-
 ## Introduction
 
- Based on our latest technological advancements, we have trained a `GLM-4-0414` series model. During pretraining, we incorporated more code-related and reasoning-related data. In the alignment phase, we optimized the model specifically for agent capabilities. As a result, the model's performance in agent tasks such as tool use, web search, and coding has been significantly improved.
 
- ## Inference Code
 
- Make sure to use `transformers>=4.51.3`.
 
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
- MODEL_PATH = "THUDM/GLM-4-9B-Chat-0414"
 
 tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
 model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")
 
- message = [{"role": "user", "content": "hello!"}]
 
- inputs = tokenizer.apply_chat_template(
-     message,
-     return_tensors="pt",
-     add_generation_prompt=True,
-     return_dict=True,
- ).to(model.device)
 
- generate_kwargs = {
-     "input_ids": inputs["input_ids"],
-     "attention_mask": inputs["attention_mask"],
-     "max_new_tokens": 128,
-     "do_sample": False,
- }
- out = model.generate(**generate_kwargs)
- print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
 ```
+
+ The GLM family welcomes new members: the **GLM-4-32B-0414** series of models, featuring 32 billion parameters. Its performance is comparable to OpenAI's GPT series and DeepSeek's V3/R1 series, and it supports very user-friendly local deployment. GLM-4-32B-Base-0414 was pre-trained on 15T of high-quality data, including a substantial amount of reasoning-oriented synthetic data, laying the foundation for subsequent reinforcement learning extensions. In the post-training stage, in addition to human preference alignment for dialogue scenarios, we enhanced the model's performance in instruction following, engineering code, and function calling using techniques such as rejection sampling and reinforcement learning, strengthening the atomic capabilities required for agent tasks. GLM-4-32B-0414 achieves good results in engineering code, artifact generation, function calling, search-based Q&A, and report generation. In particular, on several benchmarks, such as code generation and specific Q&A tasks, GLM-4-32B-Base-0414 achieves performance comparable to larger models like GPT-4o and DeepSeek-V3-0324 (671B).
+
+ **GLM-Z1-32B-0414** is a reasoning model with deep thinking capabilities, developed from GLM-4-32B-0414 through cold start, extended reinforcement learning, and further training on tasks including mathematics, code, and logic. Compared to the base model, GLM-Z1-32B-0414 significantly improves mathematical abilities and the capability to solve complex tasks. During training, we also introduced general reinforcement learning based on pairwise ranking feedback, which further enhances the model's general capabilities.
+
+ **GLM-Z1-Rumination-32B-0414** is a deep reasoning model with rumination capabilities (benchmarked against OpenAI's Deep Research). Unlike typical deep-thinking models, the rumination model thinks longer and more deeply to solve more open-ended and complex problems (e.g., writing a comparative analysis of AI development in two cities and their future development plans). It is trained by scaling end-to-end reinforcement learning, with responses graded against ground-truth answers or rubrics, and it can use search tools during its deep thinking process to handle complex tasks. The model shows significant improvements in research-style writing and complex tasks.
+
+ Finally, **GLM-Z1-9B-0414** is a surprise: we applied all of the techniques above to train a small 9B model that exhibits excellent capabilities in mathematical reasoning and general tasks. Its overall performance is top-ranked among open-source models of the same size. Especially in resource-constrained scenarios, it achieves an excellent balance between efficiency and effectiveness, providing a powerful option for users seeking lightweight deployment.
+
+ ## Showcase
+
+ ### Animation Generation
+
+ <table>
+   <tr>
+     <td style="text-align: center; font-size: 16px; font-weight: bold; padding: 10px; width: 420px;">
+       GLM-Z1-32B-0414
+     </td>
+     <td style="text-align: center; font-size: 16px; font-weight: bold; padding: 10px; width: 420px;">
+       GLM-4-32B-0414
+     </td>
+   </tr>
+   <tr>
+     <td style="vertical-align: top; padding: 10px; width: 420px;">
+       <video src="https://github.com/user-attachments/assets/849ff9fd-b54d-4c74-9ee5-3412e1a09e32"
+         style="width: 400px; height: 300px; object-fit: contain;" autoplay loop muted playsinline></video>
+       <div style="margin-top: 10px; font-size: 14px; color: #333; width: 400px;">
+         Write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically.
+       </div>
+     </td>
+     <td style="vertical-align: top; padding: 10px; width: 420px;">
+       <video src="https://github.com/user-attachments/assets/8dccdb9d-cc44-4732-b438-74a4e3cb9dfb"
+         style="width: 400px; height: 300px; object-fit: contain;" autoplay loop muted playsinline></video>
+       <div style="margin-top: 10px; font-size: 14px; color: #333; width: 400px;">
+         Use HTML to simulate the scenario of a small ball released from the center of a rotating hexagon. Consider the collision between the ball and the hexagon's edges, the gravity acting on the ball, and assume all collisions are perfectly elastic. (Prompt translated from Chinese)
+       </div>
+     </td>
+   </tr>
+ </table>
+
+ ### Web Design
+
+ <table>
+   <tr>
+     <td style="text-align: center; font-size: 16px; font-weight: bold; padding: 10px; width: 420px;">
+       GLM-4-32B-0414
+     </td>
+     <td style="text-align: center; font-size: 16px; font-weight: bold; padding: 10px; width: 420px;">
+       GLM-4-32B-0414
+     </td>
+   </tr>
+   <tr>
+     <td style="vertical-align: top; padding: 10px; width: 420px;">
+       <img src="https://github.com/user-attachments/assets/bd9c1fc1-c784-4e8f-9c76-5f7389a715f1"/>
+       <div style="margin-top: 10px; font-size: 14px; color: #333; width: 400px;">
+         Design a drawing board that supports custom function plotting, allowing adding and deleting custom functions, and assigning colors to functions. (Prompt translated from Chinese)
+       </div>
+     </td>
+     <td style="vertical-align: top; padding: 10px; width: 420px;">
+       <img src="https://github.com/user-attachments/assets/7ad12d52-9229-4278-8d1b-ffbf43e99070"/>
+       <div style="margin-top: 10px; font-size: 14px; color: #333; width: 400px;">
+         Design a UI for a mobile machine learning platform, which should include interfaces for training tasks, storage management, and personal statistics. The personal statistics interface should use charts to display the user's resource usage over a period. Use Tailwind CSS to style the page, and display these 3 mobile interfaces tiled on a single HTML page. (Prompt translated from Chinese)
+       </div>
+     </td>
+   </tr>
+ </table>
+
+ ### SVG Generation
+
+ <table>
+   <tr>
+     <td style="text-align: center; font-size: 16px; font-weight: bold; padding: 10px; width: 420px;">
+       GLM-4-32B-0414
+     </td>
+     <td style="text-align: center; font-size: 16px; font-weight: bold; padding: 10px; width: 420px;">
+       GLM-4-32B-0414
+     </td>
+   </tr>
+   <tr>
+     <td style="vertical-align: top; padding: 10px; width: 420px;">
+       <img src="https://github.com/user-attachments/assets/9407e4c1-1876-4ab5-838c-839836fb418a"/>
+       <div style="margin-top: 10px; font-size: 14px; color: #333; width: 400px;">
+         Create a misty Jiangnan scene using SVG. (Prompt translated from Chinese)
+       </div>
+     </td>
+     <td style="vertical-align: top; padding: 10px; width: 420px;">
+       <img src="https://github.com/user-attachments/assets/bcce8c5a-cedf-45c8-b666-ddb023d5b49c"/>
+       <div style="margin-top: 10px; font-size: 14px; color: #333; width: 400px;">
+         Use SVG to illustrate the training process of an LLM. (Prompt translated from Chinese)
+       </div>
+     </td>
+   </tr>
+ </table>
+
+ ### Search-Based Writing
+
+ For search-based writing tasks, we use the following system prompt (shown here translated from Chinese) to have the model respond based on search results:
+
+ ```
+ Please answer the user's question based on the search results provided.
+
+ ## Notes
+ 1. Make full use of and organize the collected information rather than simply copy-pasting it, and produce a professional, in-depth answer that meets the user's requirements.
+ 2. When the provided information is sufficient, make your answer as long as possible; starting from the user's intent, provide a response that is informative and covers multiple angles.
+ 3. Not all search results are closely related to the user's question; identify, filter, and use them carefully.
+ 4. Answers to objective questions are usually very short; you may add one or two related sentences to enrich the content.
+ 5. Make sure your response is well-formatted and readable. For comparisons or enumerations of multiple entities, make good use of list formatting to help the user understand the information.
+ 6. Unless the user requests otherwise, answer in the same language as the user's question.
+ 7. Where appropriate, cite search results at the end of sentences using a format such as 【0†source】.
+ ```
+
+ When using the model, you can obtain search results through methods such as `RAG` or `WebSearch`, and wrap them in an `observation` message, for example:
+
+ ```json
+ [
+     {
+         "role": "user",
+         "content": "Explore the common characteristics of children's literature, with a focus on its narrative techniques and thematic tendencies. This includes narrative techniques: common approaches in children's literature such as first-person, third-person, omniscient narrator, and interactive narration, and their influence on young readers. It also includes thematic tendencies: recurring themes in children's literature such as growth, adventure, friendship, and family, with an analysis of how these themes impact children's cognitive and emotional development. Additionally, other universal features such as the use of personification, repetitive language, symbolism and metaphor, and educational value should be considered. Please provide a detailed analytical report based on academic research, classic examples of children's literature, and expert opinions."
+     },
+     {
+         "role": "observation",
+         "content": "【{id}†{title}†{url}】\n{content}"
+     },
+     ...
+ ]
+ ```
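+
+ As a minimal sketch of how these pieces fit together (assumptions: `SYSTEM_PROMPT` stands in for the system prompt above, and `search_results` is hypothetical retrieval output; the GLM-4 chat template accepts `observation`-role messages, as in the function-call example below):
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ MODEL_PATH = "THUDM/GLM-4-32B-0414"
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
+ model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")
+
+ SYSTEM_PROMPT = "..."  # the search-based writing system prompt shown above
+
+ # Hypothetical retrieval output; replace with real RAG / web-search results.
+ search_results = [
+     {"id": 0, "title": "Example source", "url": "https://example.com", "content": "..."},
+ ]
+
+ # Wrap each result in the 【{id}†{title}†{url}】 format from the example above.
+ observation = "\n".join(
+     f"【{r['id']}†{r['title']}†{r['url']}】\n{r['content']}" for r in search_results
+ )
+
+ message = [
+     {"role": "system", "content": SYSTEM_PROMPT},
+     {"role": "user", "content": "Explore the common characteristics of children's literature ..."},
+     {"role": "observation", "content": observation},
+ ]
+
+ inputs = tokenizer.apply_chat_template(
+     message,
+     return_tensors="pt",
+     add_generation_prompt=True,
+     return_dict=True,
+ ).to(model.device)
+ out = model.generate(
+     input_ids=inputs["input_ids"],
+     attention_mask=inputs["attention_mask"],
+     max_new_tokens=4096,
+ )
+ print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+ ```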
+ For the above prompt, we use an internal or external search model to obtain the search results; with the messages in the format shown above, the model then generates a detailed analysis report.
+
+ ### Function Call
+
+ GLM-4-32B-0414 supports calling external tools in JSON format. This can be done via HuggingFace Transformers, vLLM, or SGLang.
+ The message format for a tool call is as follows (`function_name` and `call_arguments` are placeholders, shown here as Python expressions):
+
+ ```json
+ {
+     "role": "assistant",
+     "metadata": function_name,
+     "content": json.dumps(call_arguments, ensure_ascii=False)
+ }
+ ```
+
+ The message format for tool execution results is as follows:
+
+ ```json
+ {
+     "role": "observation",
+     "content": json.dumps(tool_response, ensure_ascii=False) if not isinstance(tool_response, str) else tool_response
+ }
+ ```
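+
+ For illustration, here is a minimal Python sketch of constructing these two messages (the tool name and values are hypothetical):
+
+ ```python
+ import json
+
+ # Hypothetical tool call emitted by the model.
+ function_name = "realtime_aqi"
+ call_arguments = {"city": "北京"}  # Beijing
+
+ # Assistant message carrying the tool call, in the format above.
+ assistant_message = {
+     "role": "assistant",
+     "metadata": function_name,
+     "content": json.dumps(call_arguments, ensure_ascii=False),
+ }
+
+ # The tool result may be a dict (serialized to JSON) or a plain string.
+ tool_response = {"city": "北京", "aqi": "10"}
+ observation_message = {
+     "role": "observation",
+     "content": json.dumps(tool_response, ensure_ascii=False)
+     if not isinstance(tool_response, str)
+     else tool_response,
+ }
+ ```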
+
+ The following example demonstrates the full process of GLM-4-9B-0414 calling a tool and generating a final response, using HuggingFace Transformers.
 
 ```python
+ import json
+ import re
+ import ast
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
+ MODEL_PATH = "THUDM/GLM-4-9B-0414"
 
 tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
 model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")
 
+ def is_function_call(single_message):
+     """Determine whether the current assistant message contains a function call."""
+     # Match a function-name line followed by a JSON object of arguments.
+     pattern = re.compile(r'([^\n`]*?)\n({.*?})(?=\w*\n|$)', re.DOTALL)
+     matches = pattern.findall(single_message)
+     if not matches:
+         return False
+
+     func_name, args_str = matches[0]
+     func_name = func_name.strip()
+     try:
+         parsed_args = json.loads(args_str)
+     except json.JSONDecodeError:
+         try:
+             # Fall back to Python-literal parsing (e.g., single-quoted keys).
+             parsed_args = ast.literal_eval(args_str)
+         except (ValueError, SyntaxError):
+             return False
+
+     return {"name": func_name, "arguments": parsed_args}
+
+ def realtime_aqi(city):
+     """Mock weather/air-quality query tool."""
+     if '北京' in city.lower():  # Beijing
+         return json.dumps({'city': '北京', 'aqi': '10', 'unit': 'celsius'}, ensure_ascii=False)
+     elif '上海' in city.lower():  # Shanghai
+         return json.dumps({'city': '上海', 'aqi': '72', 'unit': 'fahrenheit'}, ensure_ascii=False)
+     else:
+         return json.dumps({'city': city, 'aqi': 'unknown'}, ensure_ascii=False)
+
+ def build_system_prompt(tools):
+     """Construct the system prompt from the list of available tools."""
+     if tools is None:
+         tools = []
+     value = "# 可用工具"  # "# Available Tools"
+     contents = []
+     for tool in tools:
+         content = f"\n\n## {tool['function']['name']}\n\n{json.dumps(tool['function'], ensure_ascii=False, indent=4)}"
+         # "When calling the function above, use JSON format for the arguments."
+         content += "\n在调用上述函数时,请使用 Json 格式表示调用的参数。"
+         contents.append(content)
+     value += "".join(contents)
+     return value
+
+ tools = [
+     {
+         "type": "function",
+         "function": {
+             "name": "realtime_aqi",
+             "description": "天气预报。获取实时空气质量。当前空气质量,PM2.5,PM10信息",  # "Weather forecast: real-time air quality (AQI, PM2.5, PM10)."
+             "parameters": {
+                 "type": "object",
+                 "properties": {
+                     "city": {
+                         "description": "城市名"  # "City name"
+                     }
+                 },
+                 "required": [
+                     "city"
+                 ]
+             }
+         }
+     }
+ ]
+
+ system_prompt = build_system_prompt(tools)
+
+ message = [
+     {"role": "system", "content": system_prompt},
+     {"role": "user", "content": "北京和上海今天的天气情况"}  # "Today's weather in Beijing and Shanghai"
+ ]
+ print(f"User Message: {message[-1]['content']}")
+
+ while True:
+     inputs = tokenizer.apply_chat_template(
+         message,
+         return_tensors="pt",
+         add_generation_prompt=True,
+         return_dict=True,
+     ).to(model.device)
+
+     generate_kwargs = {
+         "input_ids": inputs["input_ids"],
+         "attention_mask": inputs["attention_mask"],
+         "max_new_tokens": 1024,
+         "do_sample": True,
+     }
+     out = model.generate(**generate_kwargs)
+     # Decode the new tokens, keeping special tokens but dropping the final stop token.
+     generate_resp = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:-1], skip_special_tokens=False)
+     stop_sequence = tokenizer.decode(out[0][-1:], skip_special_tokens=False)
+     if stop_sequence == "<|user|>":
+         # The model handed the turn back to the user: this is the final answer.
+         print(f"Assistant Response: {generate_resp.strip()}")
+         break
+
+     function_calls = []
+     for m in generate_resp.split("<|assistant|>"):
+         fc_decode = is_function_call(m.strip())
+         if fc_decode:
+             message.append({"role": "assistant", "metadata": fc_decode['name'], "content": json.dumps(fc_decode['arguments'], ensure_ascii=False)})
+             print(f"Function Call: {fc_decode}")
+             function_calls.append(fc_decode)
+         else:
+             message.append({"role": "assistant", "content": m})
+             print(f"Assistant Response: {m.strip()}")
+
+     for fc in function_calls:
+         function_response = realtime_aqi(
+             city=fc["arguments"]["city"],
+         )
+         print(f"Function Response: {function_response}")
+         message.append({"role": "observation", "content": function_response})
 ```
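+
+ Beyond Transformers, the same model can be served with vLLM, as noted above. A minimal sketch (assuming a local server started with `vllm serve THUDM/GLM-4-32B-0414` on the default port 8000, which exposes an OpenAI-compatible API):
+
+ ```python
+ from openai import OpenAI
+
+ # Point the OpenAI client at the local vLLM server.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="THUDM/GLM-4-32B-0414",
+     messages=[{"role": "user", "content": "hello!"}],
+     max_tokens=128,
+ )
+ print(response.choices[0].message.content)
+ ```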
+
+ ## Evaluation Results
+
+ <div style="text-align: center;">
+ <img src="https://raw.githubusercontent.com/THUDM/GLM-4/refs/heads/main/resources/Bench-32B.png" style="width: 80%;" />
+ </div>
+
+ ### GLM-4-0414 Series
+
+ | Model | IFEval | BFCL-v3 (Overall) | BFCL-v3 (MultiTurn) | TAU-Bench (Retail) | TAU-Bench (Airline) | SimpleQA | HotpotQA |
+ | ---------------- | ------ | ----------------- | ------------------- | ------------------ | ------------------- | -------- | -------- |
+ | Qwen2.5-Max | 85.6 | 50.9 | 30.5 | 58.3 | 22.0 | 79.0 | 52.8 |
+ | GPT-4o-1120 | 81.9 | 69.6 | 41.0 | 62.8 | 46.0 | 82.8 | 63.9 |
+ | DeepSeek-V3-0324 | 83.4 | 66.2 | 35.8 | 60.7 | 32.4 | 82.6 | 54.6 |
+ | DeepSeek-R1 | 84.3 | 57.5 | 12.4 | 33.0 | 37.3 | 83.9 | 63.1 |
+ | GLM-4-32B-0414 | 87.6 | 69.6 | 41.5 | 68.7 | 51.2 | 88.1 | 63.8 |
+
+ > For `SimpleQA` and `HotpotQA`, we sampled nearly 500 test cases from each test set, provided all models with basic `search` and `click` tools, kept the other settings consistent, and averaged the results over 3 runs.
+
+ | Model | Framework | [SWE-bench Verified](https://openai.com/index/introducing-swe-bench-verified/) | [SWE-bench Verified mini](https://github.com/mariushobbhahn/SWEBench-verified-mini) |
+ |---|---|---|---|
+ | GLM-4-32B-0414 | Moatless<sup>[1]</sup> | 33.8 | 38.0 |
+ | GLM-4-32B-0414 | Agentless<sup>[2]</sup> | 30.7 | 34.0 |
+ | GLM-4-32B-0414 | OpenHands<sup>[3]</sup> | 27.2 | 28.0 |
+
+ [1] [Moatless v0.0.3](https://github.com/aorwall/moatless-tools) used the following parameters: `response_format="react", thoughts_in_action=False, max_iterations=30`. Failed trajectories were not retried; other settings were left at their defaults.
+
+ [2] [Agentless v1.5.0](https://github.com/OpenAutoCoder/Agentless) used [BGE](https://github.com/FlagOpen/FlagEmbedding/blob/master/README.md) as the embedding model and [FAISS](https://github.com/facebookresearch/faiss) for similarity search. To speed up patch verification while maintaining performance, the per-instance run timeout was reduced from the default 300s to 180s.
+
+ [3] [OpenHands v0.29.1](https://github.com/All-Hands-AI/OpenHands/tree/main) did not use YaRN context extension; instead, runs were limited to a maximum of 60 iterations and the history was summarized to stay within the 32K context limit. Summarization was configured as `llm_config="condenser", keep_first=1, max_size=32`. Failed trajectories were not retried.
configuration.json DELETED
@@ -1 +0,0 @@
- {"framework":"Pytorch","task":"text-generation"}