---
license: apache-2.0
library_name: exllamav2
base_model:
- deepcogito/cogito-v1-preview-qwen-14B
pipeline_tag: text-generation
---
# cogito-v1-preview-qwen-14B-exl2
Original model: [cogito-v1-preview-qwen-14B](https://huggingface.co/deepcogito/cogito-v1-preview-qwen-14B) by [Deep Cogito](https://huggingface.co/deepcogito)

Based on: [Qwen2.5-14B](https://huggingface.co/Qwen/Qwen2.5-14B) by [Qwen](https://huggingface.co/Qwen)

## Quants
- [4bpw h6 (main)](https://huggingface.co/cgus/cogito-v1-preview-qwen-14B-exl2/tree/main)
- [4.5bpw h6](https://huggingface.co/cgus/cogito-v1-preview-qwen-14B-exl2/tree/4.5bpw-h6)
- [5bpw h6](https://huggingface.co/cgus/cogito-v1-preview-qwen-14B-exl2/tree/5bpw-h6)
- [6bpw h6](https://huggingface.co/cgus/cogito-v1-preview-qwen-14B-exl2/tree/6bpw-h6)
- [8bpw h8](https://huggingface.co/cgus/cogito-v1-preview-qwen-14B-exl2/tree/8bpw-h8)
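
To fetch a specific quant, download the matching branch. A minimal sketch using `huggingface_hub` (the `local_dir` path is just an example):

```python
# Download one exl2 quant branch from this repo; local_dir is an example path.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="cgus/cogito-v1-preview-qwen-14B-exl2",
    revision="6bpw-h6",  # any branch listed above, e.g. "4.5bpw-h6" or "main"
    local_dir="cogito-v1-preview-qwen-14B-exl2-6bpw-h6",
)
```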

## Quantization notes
Made with Exllamav2 0.2.8 using the default calibration dataset.

These quants can be used with TabbyAPI or Text-Generation-WebUI on an RTX GPU (Windows) or an RTX/ROCm GPU (Linux).

The chosen quant has to fit entirely into VRAM; if you need RAM offloading, choose GGUF quants instead.
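
As an alternative to those frontends, here is a minimal sketch of loading an exl2 quant directly with the exllamav2 Python API; the model directory is assumed to be wherever you downloaded the quant.

```python
# Minimal exllamav2 generation sketch; the model directory path is an example.
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("cogito-v1-preview-qwen-14B-exl2-6bpw-h6")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # cache is allocated while the model loads
model.load_autosplit(cache)               # split layers across available GPUs
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Give me a short introduction to LLMs.", max_new_tokens=200))
```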

# Original model card
<p align="center">
  <img src="images/deep-cogito-logo.png" alt="Logo" width="40%">
</p>


# Cogito v1 preview - 14B

[Blog Post](https://www.deepcogito.com/research/cogito-v1-preview)

The Cogito LLMs are instruction-tuned generative models (text in/text out). All models are released under an open license for commercial use.

- Cogito models are hybrid reasoning models. Each model can answer directly (standard LLM), or self-reflect before answering (like reasoning models).
- The LLMs are trained using **Iterated Distillation and Amplification (IDA)** - a scalable and efficient alignment strategy for superintelligence using iterative self-improvement.
- The models have been optimized for coding, STEM, instruction following and general helpfulness, and have significantly higher multilingual, coding and tool calling capabilities than size-equivalent counterparts.
- In both standard and reasoning modes, Cogito v1-preview models outperform their size-equivalent counterparts on common industry benchmarks.
- Each model is trained in over 30 languages and supports a context length of 128k.

# Evaluations
We compare our models against state-of-the-art size-equivalent models in direct mode as well as reasoning mode. For direct mode, we compare against the Llama / Qwen instruct counterparts. For reasoning, we use DeepSeek's R1 distilled counterparts / Qwen's QwQ model.

<p align="left">
  <img src="images/14b_benchmarks.png" alt="14B benchmarks" width="90%">
</p>

**Livebench Global Average:**
<p align="left">
  <img src="images/livebench_global_average.png" alt="Livebench global average" width="80%">
</p>

For detailed evaluations, please refer to the [Blog Post](https://www.deepcogito.com/research/cogito-v1-preview).

# Usage
Here is a snippet for usage with Transformers:

```python
import transformers
import torch

model_id = "deepcogito/cogito-v1-preview-qwen-14B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Give me a short introduction to LLMs."},
]

outputs = pipeline(
    messages,
    max_new_tokens=512,
)

print(outputs[0]["generated_text"][-1])
```

## Implementing extended thinking
- By default, the model answers in the standard mode.
- To enable thinking, use either of these two methods:
  - Add a specific system prompt, or
  - Set `enable_thinking=True` while applying the chat template.

### Method 1 - Add a specific system prompt.
To enable thinking, simply use this in the system prompt: `system_instruction = 'Enable deep thinking subroutine.'`

If you already have a system_instruction, then use `system_instruction = 'Enable deep thinking subroutine.' + '\n\n' + system_instruction`.

Here is an example -

```python
import transformers
import torch

model_id = "deepcogito/cogito-v1-preview-qwen-14B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

messages = [
    {"role": "system", "content": DEEP_THINKING_INSTRUCTION},
    {"role": "user", "content": "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format."},
]

outputs = pipeline(
    messages,
    max_new_tokens=512,
)

print(outputs[0]["generated_text"][-1])
```

Similarly, if you have a system prompt, you can prepend `DEEP_THINKING_INSTRUCTION` to it in this way -

```python
DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

system_prompt = "Reply to each prompt with only the actual code - no explanations."
prompt = "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format."

messages = [
    {"role": "system", "content": DEEP_THINKING_INSTRUCTION + '\n\n' + system_prompt},
    {"role": "user", "content": prompt}
]
```

### Method 2 - Set enable_thinking=True in the tokenizer
If you are using Huggingface tokenizers, you can simply add the argument `enable_thinking=True` to the tokenization (this option is added to the chat template).

Here is an example -
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepcogito/cogito-v1-preview-qwen-14B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to LLMs."
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

# Tool Calling
Cogito models support tool calling (single, parallel, multiple and parallel_multiple) both in standard and extended thinking mode.

Here is a snippet, reusing the `model` and `tokenizer` from the previous example -

```python
# First, define a tool
def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country"
    Returns:
        The current temperature at the specified location, as a float.
    """
    return 22.  # A real function should probably actually get the temperature!

# Next, create a chat and apply the chat template with the tool definition included
messages = [
    {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
]

text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
print(output_text)
```

This will result in the output -
```
<tool_call>
{"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
</tool_call><|im_end|>
```

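For illustration (this helper is not part of the original card), a minimal sketch of pulling that tool call out of `output_text` so it can be handled programmatically:

```python
import json
import re

# Extract the JSON payload between the <tool_call> tags shown above.
match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", output_text, re.DOTALL)
if match:
    tool_call = json.loads(match.group(1))  # {"name": ..., "arguments": {...}}
```
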
If the model generates a tool call, you should add it to the chat like so:

```python
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
```

and then call the tool and append the result, with the `tool` role, like so:

```python
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
```

After that, you can `generate()` again to let the model use the tool result in the chat:

```python
text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
```

This should result in the string -
```
'The current temperature in Paris is 22.0 degrees.<|im_end|>'
```

## License
This repository and the model weights are licensed under the Apache 2.0 License Agreement.

## Contact
If you would like to reach out to our team, send an email to [contact@deepcogito.com](mailto:contact@deepcogito.com).