|
---
license: apache-2.0
library_name: exllamav2
base_model:
- deepcogito/cogito-v1-preview-qwen-14B
pipeline_tag: text-generation
---
|
# cogito-v1-preview-qwen-14B-exl2 |
|
Original model: [cogito-v1-preview-qwen-14B](https://huggingface.co/deepcogito/cogito-v1-preview-qwen-14B) by [Deep Cogito](https://huggingface.co/deepcogito) |
|
Based on: [Qwen2.5-14B](https://huggingface.co/Qwen/Qwen2.5-14B) by [Qwen](https://huggingface.co/Qwen) |
|
## Quants |
|
- [4bpw h6 (main)](https://huggingface.co/cgus/cogito-v1-preview-qwen-14B-exl2/tree/main)
- [4.5bpw h6](https://huggingface.co/cgus/cogito-v1-preview-qwen-14B-exl2/tree/4.5bpw-h6)
- [5bpw h6](https://huggingface.co/cgus/cogito-v1-preview-qwen-14B-exl2/tree/5bpw-h6)
- [6bpw h6](https://huggingface.co/cgus/cogito-v1-preview-qwen-14B-exl2/tree/6bpw-h6)
- [8bpw h8](https://huggingface.co/cgus/cogito-v1-preview-qwen-14B-exl2/tree/8bpw-h8)
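Each quant is stored on its own branch; to grab one locally you can, for example, use `huggingface_hub` (the `local_dir` below is just a placeholder):

```python
# Download a specific quant branch (here 6bpw-h6) into a local folder.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="cgus/cogito-v1-preview-qwen-14B-exl2",
    revision="6bpw-h6",  # pick the branch for the quant you want
    local_dir="cogito-v1-preview-qwen-14B-exl2-6bpw",  # placeholder path
)
```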
|
## Quantization notes |
|
Made with Exllamav2 0.2.8 using the default calibration dataset.

These quants can be used with TabbyAPI or Text-Generation-WebUI on RTX GPUs (Windows/Linux) or ROCm GPUs (Linux).

Exllamav2 quants must fit entirely in VRAM; if you need to offload to system RAM, choose GGUF quants instead.
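If you prefer to drive exllamav2 directly from Python instead of using a frontend, loading a downloaded quant might look roughly like this (a minimal sketch: the model path is a placeholder and the generator API may vary slightly between exllamav2 releases):

```python
# Minimal sketch: load an exl2 quant and run a short generation.
# The model path is a placeholder; point it at the folder you downloaded the quant to.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("cogito-v1-preview-qwen-14B-exl2-6bpw")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # load the weights, splitting across available GPUs
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Give me a short introduction to LLMs.", max_new_tokens=256))
```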
|
# Original model card
|
<p align="center"> |
|
<img src="images/deep-cogito-logo.png" alt="Logo" width="40%"> |
|
</p> |
|
|
|
|
|
# Cogito v1 preview - 14B |
|
|
|
[Blog Post](https://www.deepcogito.com/research/cogito-v1-preview) |
|
|
|
The Cogito LLMs are instruction tuned generative models (text in/text out). All models are released under an open license for commercial use. |
|
|
|
- Cogito models are hybrid reasoning models. Each model can answer directly (standard LLM), or self-reflect before answering (like reasoning models). |
|
- The LLMs are trained using **Iterated Distillation and Amplification (IDA)** - a scalable and efficient alignment strategy for superintelligence using iterative self-improvement.
|
- The models have been optimized for coding, STEM, instruction following and general helpfulness, and have significantly higher multilingual, coding and tool calling capabilities than size equivalent counterparts. |
|
- In both standard and reasoning modes, Cogito v1-preview models outperform their size equivalent counterparts on common industry benchmarks. |
|
- Each model is trained in over 30 languages and supports a context length of 128k. |
|
|
|
# Evaluations |
|
We compare our models against state-of-the-art size-equivalent models in both direct mode and reasoning mode. For direct mode, we compare against the Llama / Qwen instruct counterparts. For reasoning mode, we use DeepSeek's R1 distilled counterparts / Qwen's QwQ model.
|
|
|
<p align="left"> |
|
<img src="images/14b_benchmarks.png" alt="Logo" width="90%"> |
|
</p> |
|
|
|
**Livebench Global Average:** |
|
<p align="left"> |
|
<img src="images/livebench_global_average.png" alt="Logo" width="80%"> |
|
</p> |
|
|
|
For detailed evaluations, please refer to the [Blog Post](https://www.deepcogito.com/research/cogito-v1-preview). |
|
|
|
|
|
# Usage |
|
Here is a snippet for usage with Transformers:
|
|
|
```python
import transformers
import torch

model_id = "deepcogito/cogito-v1-preview-qwen-14B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Give me a short introduction to LLMs."},
]

outputs = pipeline(
    messages,
    max_new_tokens=512,
)

print(outputs[0]["generated_text"][-1])
```
|
|
|
|
|
|
|
## Implementing extended thinking |
|
- By default, the model will answer in the standard mode. |
|
- To enable thinking, you can use either of the following two methods:
|
- Add a specific system prompt, or |
|
- Set `enable_thinking=True` while applying the chat template. |
|
|
|
|
|
### Method 1 - Add a specific system prompt. |
|
To enable thinking, simply set the system prompt to `system_instruction = 'Enable deep thinking subroutine.'`
|
|
|
If you already have a system_instruction, then use `system_instruction = 'Enable deep thinking subroutine.' + '\n\n' + system_instruction`. |
|
|
|
Here is an example - |
|
|
|
```python
import transformers
import torch

model_id = "deepcogito/cogito-v1-preview-qwen-14B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

messages = [
    {"role": "system", "content": DEEP_THINKING_INSTRUCTION},
    {"role": "user", "content": "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format."},
]

outputs = pipeline(
    messages,
    max_new_tokens=512,
)

print(outputs[0]["generated_text"][-1])
```
|
|
|
|
|
Similarly, if you already have a system prompt, you can prepend `DEEP_THINKING_INSTRUCTION` to it in this way -
|
|
|
```python
DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

system_prompt = "Reply to each prompt with only the actual code - no explanations."
prompt = "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format."

messages = [
    {"role": "system", "content": DEEP_THINKING_INSTRUCTION + '\n\n' + system_prompt},
    {"role": "user", "content": prompt}
]
```
|
|
|
### Method 2 - Set enable_thinking=True in the tokenizer |
|
If you are using Hugging Face tokenizers, you can simply add the argument `enable_thinking=True` when applying the chat template (this option is part of the chat template).
|
|
|
Here is an example - |
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepcogito/cogito-v1-preview-qwen-14B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to LLMs."
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
|
|
|
# Tool Calling |
|
Cogito models support tool calling (single, parallel, multiple and parallel_multiple) in both standard and extended thinking modes.
|
|
|
Here is a snippet - |
|
|
|
```python
# First, define a tool
def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country"
    Returns:
        The current temperature at the specified location, as a float.
    """
    return 22.  # A real function should probably actually get the temperature!

# Next, create a chat, apply the chat template and generate
messages = [
    {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
]

text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
print(output_text)
```
|
|
|
This will result in the output - |
|
```
<tool_call>
{"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
</tool_call><|im_end|>
```
|
|
|
If the model generates a tool call like this, you should add it to the chat like so:
|
|
|
```python
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
```
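In a real application you would parse the tool call out of the generated text rather than hard-coding it; here is a minimal sketch, assuming the `<tool_call>...</tool_call>` format shown above:

```python
import json
import re

# Pull the JSON payload out of the <tool_call> block in the generated text.
match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", output_text, re.DOTALL)
if match:
    tool_call = json.loads(match.group(1))
    messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
```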
|
|
|
and then call the tool and append the result, with the `tool` role, like so: |
|
|
|
```python
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
```
|
|
|
After that, you can `generate()` again to let the model use the tool result in the chat: |
|
|
|
```python
text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
```
|
|
|
This should result in the string - |
|
```
'The current temperature in Paris is 22.0 degrees.<|im_end|>'
```
|
|
|
## License |
|
This repository and the model weights are licensed under the Apache 2.0 License Agreement. |
|
|
|
## Contact |
|
If you would like to reach out to our team, send an email to [contact@deepcogito.com](mailto:contact@deepcogito.com).
|
|