---
license: mit
base_model: Qwen/Qwen3-4B-Instruct-2507
datasets:
- Salesforce/xlam-function-calling-60k
language:
- en
pipeline_tag: text-generation
quantized_by: Manojb
tags:
- function-calling
- tool-calling
- codex
- local-llm
- gguf
- 4gb-vram
- llama-cpp
- code-assistant
- api-tools
- openai-alternative
- qwen3
- qwen
- instruct
---

# Qwen3-4B Tool Calling with llama-cpp-python

A specialized 4B-parameter model fine-tuned for function calling and tool usage, optimized for local deployment with llama-cpp-python.

## Features

- **4B Parameters** - Sweet spot for local deployment
- **Function Calling** - Fine-tuned on 60K function-calling examples
- **GGUF Format** - Optimized for CPU/GPU inference
- **3.99GB Download** - Fits on any modern system
- **262K Context** - Large context window for complex tasks
- **Low VRAM** - Full context fits within 6GB of VRAM

## Model Details

- **Base Model**: Qwen3-4B-Instruct-2507
- **Fine-tuning**: LoRA on the Salesforce xlam-function-calling-60k dataset
- **Quantization**: Q8_0 (8-bit) for an optimal performance/size ratio
- **Architecture**: Qwen3 with specialized tool-calling tokens
- **License**: MIT

## Installation

### Quick Install

```bash
# Clone the repository
git clone https://huggingface.co/Manojb/qwen3-4b-toolcall-gguf-llamacpp-codex
cd qwen3-4b-toolcall-gguf-llamacpp-codex

# Run the installation script
./install.sh
```

### Manual Installation

#### Prerequisites

- Python 3.8+
- 6GB+ RAM (8GB+ recommended)
- 5GB+ free disk space

#### Install Dependencies

```bash
pip install -r requirements.txt
```

#### Download Model

```bash
# Download the model file
huggingface-cli download Manojb/qwen3-4b-toolcall-gguf-llamacpp-codex Qwen3-4B-Function-Calling-Pro.gguf
```

### Alternative: Install with a specific llama-cpp-python build

For better performance, you can install llama-cpp-python with specific optimizations:

```bash
# For CPU only (default)
pip install llama-cpp-python

# For CUDA support (if you have an NVIDIA GPU)
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

# For OpenBLAS support
CMAKE_ARGS="-DLLAMA_BLAS=on -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
```

## Quick Start

### Option 1: Using the Run Script

```bash
# Interactive mode (default)
./run_model.sh           # or: source ./run_model.sh

# Start the Codex server
./run_model.sh server    # or: source ./run_model.sh server

# Show help
./run_model.sh help      # or: source ./run_model.sh help
```

### Option 2: Direct Python Usage

```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="Qwen3-4B-Function-Calling-Pro.gguf",
    n_ctx=2048,
    n_threads=8
)

# Simple chat
response = llm("What's the weather like in London?", max_tokens=200, temperature=0.7)
print(response['choices'][0]['text'])
```

### Option 3: Quick Start Demo

```bash
python3 quick_start.py
```

### Tool Calling Example

```python
import json
import re

from llama_cpp import Llama


def extract_tool_calls(text):
    """Extract tool calls from a model response."""
    tool_calls = []
    json_pattern = r'\[.*?\]'
    matches = re.findall(json_pattern, text)

    for match in matches:
        try:
            parsed = json.loads(match)
            if isinstance(parsed, list):
                for item in parsed:
                    if isinstance(item, dict) and 'name' in item:
                        tool_calls.append(item)
        except json.JSONDecodeError:
            continue

    return tool_calls


# Initialize the model
llm = Llama(
    model_path="Qwen3-4B-Function-Calling-Pro.gguf",
    n_ctx=2048
)

# Chat with tool calling
prompt = "Get the weather for New York"
formatted_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

response = llm(
    formatted_prompt,
    max_tokens=200,
    temperature=0.7,
    stop=["<|im_end|>", "<|im_start|>"]
)
response_text = response['choices'][0]['text']

# Extract tool calls
tool_calls = extract_tool_calls(response_text)
print(f"Tool calls: {tool_calls}")
```
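The extracted tool calls still have to be executed by your own code. Below is a minimal dispatch sketch; the `get_weather` handler and its return value are placeholders (not shipped with this repository) that stand in for your real API calls.

```python
# Minimal dispatch sketch. `get_weather` is a placeholder handler,
# not part of this repository -- swap in your real API calls.

def get_weather(q):
    # Replace with a call to a real weather API.
    return {"location": q, "temperature_c": 18, "condition": "cloudy"}


TOOL_REGISTRY = {
    "get_weather": get_weather,
}


def run_tool_calls(tool_calls):
    """Execute each extracted tool call against the local registry."""
    results = []
    for call in tool_calls:
        handler = TOOL_REGISTRY.get(call.get("name"))
        if handler is None:
            results.append({"error": f"unknown tool: {call.get('name')}"})
            continue
        arguments = call.get("arguments") or {}
        results.append(handler(**arguments))
    return results


# `tool_calls` comes from extract_tool_calls() in the example above.
print(run_tool_calls(tool_calls))
```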
## Examples

### 1. Weather Tool Calling

```python
# The model will generate:
# [{"name": "get_weather", "arguments": {"q": "London"}}]
```

### 2. Hotel Search

```python
# The model will generate:
# [{"name": "search_stays", "arguments": {"check_in": "2023-04-01", "check_out": "2023-04-08", "city": "Paris"}}]
```

### 3. Flight Booking

```python
# The model will generate:
# [{"name": "flights_search", "arguments": {"q": "New York to Tokyo"}}]
```

### 4. News Search

```python
# The model will generate:
# [{"name": "search_news", "arguments": {"q": "AI", "gl": "us"}}]
```

## Codex Integration

### Setting up the Codex Server

To use this model with Codex, you need to run a local server that Codex can connect to:

#### 1. Install llama-cpp-python with server support

```bash
pip install 'llama-cpp-python[server]'
```

#### 2. Start the Codex-compatible server

```bash
python -m llama_cpp.server \
  --model Qwen3-4B-Function-Calling-Pro.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --n_ctx 2048 \
  --n_threads 8
```

#### 3. Configure Codex to use the local server

In your Codex configuration, set:

- **Server URL**: `http://localhost:8000`
- **API Key**: not required for the local server
- **Model**: `Qwen3-4B-Function-Calling-Pro`
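Before pointing Codex at the server, it is worth confirming that the OpenAI-compatible endpoints respond. A quick sanity check, assuming the server command above and using the `requests` package already used elsewhere in this README:

```python
import requests

BASE_URL = "http://localhost:8000"  # matches the server command above

# List the model(s) the server has loaded.
print(requests.get(f"{BASE_URL}/v1/models", timeout=10).json())

# Send a minimal chat completion through the OpenAI-compatible route.
resp = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": "Qwen3-4B-Function-Calling-Pro",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```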
### Codex Integration Example

```python
# codex_integration.py
import json

import requests


class CodexClient:
    def __init__(self, base_url="http://localhost:8000"):
        self.base_url = base_url
        self.session = requests.Session()

    def chat_completion(self, messages, tools=None, temperature=0.7):
        """Send a chat completion request to the local server."""
        payload = {
            "model": "Qwen3-4B-Function-Calling-Pro",
            "messages": messages,
            "temperature": temperature,
            "max_tokens": 512,
            "stop": ["<|im_end|>", "<|im_start|>"]
        }

        if tools:
            payload["tools"] = tools

        response = self.session.post(
            f"{self.base_url}/v1/chat/completions",
            json=payload,
            headers={"Content-Type": "application/json"}
        )
        return response.json()

    def extract_tool_calls(self, response):
        """Extract tool calls from a chat completion response."""
        tool_calls = []
        if "choices" in response and len(response["choices"]) > 0:
            message = response["choices"][0]["message"]
            if "tool_calls" in message:
                tool_calls = message["tool_calls"]
        return tool_calls


# Usage with Codex
codex = CodexClient()

# Define tools for Codex
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Send a request
messages = [{"role": "user", "content": "What's the weather in London?"}]
response = codex.chat_completion(messages, tools=tools)
tool_calls = codex.extract_tool_calls(response)

print(f"Response: {response}")
print(f"Tool calls: {tool_calls}")
```

### Docker Setup for Codex

Create a `Dockerfile` for easy deployment:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Install llama-cpp-python with server support
RUN pip install 'llama-cpp-python[server]'

# Copy model and scripts
COPY . .

# Expose port
EXPOSE 8000

# Start server
CMD ["python", "-m", "llama_cpp.server", \
     "--model", "Qwen3-4B-Function-Calling-Pro.gguf", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--n_ctx", "2048"]
```

Build and run:

```bash
docker build -t qwen3-codex-server .
docker run -p 8000:8000 qwen3-codex-server
```

## Advanced Usage

### Custom Tool Calling Class

```python
import json
import re

from llama_cpp import Llama


class Qwen3ToolCalling:
    def __init__(self, model_path):
        self.llm = Llama(
            model_path=model_path,
            n_ctx=2048,
            n_threads=8,
            verbose=False
        )

    def extract_tool_calls(self, text):
        """Parse JSON tool-call arrays out of the raw model output."""
        tool_calls = []
        for match in re.findall(r'\[.*?\]', text):
            try:
                parsed = json.loads(match)
            except json.JSONDecodeError:
                continue
            if isinstance(parsed, list):
                tool_calls.extend(
                    item for item in parsed
                    if isinstance(item, dict) and 'name' in item
                )
        return tool_calls

    def chat(self, message, system_message=None):
        # Build the prompt using the chat template
        prompt_parts = []
        if system_message:
            prompt_parts.append(f"<|im_start|>system\n{system_message}<|im_end|>")
        prompt_parts.append(f"<|im_start|>user\n{message}<|im_end|>")
        prompt_parts.append("<|im_start|>assistant\n")
        formatted_prompt = "\n".join(prompt_parts)

        # Generate the response
        response = self.llm(
            formatted_prompt,
            max_tokens=512,
            stop=["<|im_end|>", "<|im_start|>"],
            temperature=0.7
        )

        response_text = response['choices'][0]['text']
        tool_calls = self.extract_tool_calls(response_text)

        return {
            'response': response_text,
            'tool_calls': tool_calls
        }
```

## Performance

### System Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| RAM | 6GB | 8GB+ |
| Storage | 5GB | 10GB+ |
| CPU | 4 cores | 8+ cores |
| GPU | Optional | NVIDIA RTX 3060+ |

### Benchmarks

- **Inference Speed**: ~75-100 tokens/second (CPU)
- **Memory Usage**: ~4GB RAM
- **Model Size**: 3.99GB (Q8_0 quantized)
- **Context Length**: 262K tokens
- **Function Call Accuracy**: 94%+ on the test set

## Use Cases

- **AI Agents** - Building intelligent agents that can use tools
- **Local Coding Assistants** - Function calling without cloud dependencies
- **API Integration** - Seamless tool orchestration
- **Privacy-Sensitive Development** - 100% local processing
- **Learning Function Calling** - Educational purposes

## Model Architecture

### Special Tokens

The model includes specialized tokens for tool calling:

- `<tool_call>` - Start of a tool call
- `</tool_call>` - End of a tool call
- `<tool_response>` - Start of a tool response
- `</tool_response>` - End of a tool response

### Chat Template

The model uses a custom chat template optimized for tool calling:

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```
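To see the template in action end to end, the sketch below builds a prompt by hand. The system-message wording and the `get_weather` tool description are illustrative assumptions, not a fixed format shipped with the model; the essential parts are the `<|im_start|>`/`<|im_end|>` framing, stopping on those markers, and parsing the JSON array the model emits (as in the Examples section above).

```python
import json

from llama_cpp import Llama

# Illustrative tool description embedded in the system message; adjust to taste.
tools_spec = json.dumps([
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {"q": {"type": "string", "description": "City name"}},
    }
])

system_message = (
    "You are a helpful assistant with access to the following tools:\n"
    f"{tools_spec}\n"
    "When a tool is needed, respond with a JSON array of "
    '{"name": ..., "arguments": ...} objects.'
)
user_message = "What's the weather in Berlin?"

# Fill in the chat template shown above.
prompt = (
    f"<|im_start|>system\n{system_message}<|im_end|>\n"
    f"<|im_start|>user\n{user_message}<|im_end|>\n"
    f"<|im_start|>assistant\n"
)

llm = Llama(model_path="Qwen3-4B-Function-Calling-Pro.gguf", n_ctx=2048)
response = llm(prompt, max_tokens=200, stop=["<|im_end|>", "<|im_start|>"])
print(response["choices"][0]["text"])
```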
## Repository Structure

```
qwen3-4b-toolcall-gguf-llamacpp-codex/
├── Qwen3-4B-Function-Calling-Pro.gguf   # Main model file
├── qwen3_toolcalling_example.py         # Complete example
├── quick_start.py                       # Quick start demo
├── codex_integration.py                 # Codex integration example
├── run_model.sh                         # Run script for llama-cpp
├── install.sh                           # Installation script
├── requirements.txt                     # Python dependencies
├── README.md                            # This file
├── config.json                          # Model configuration
├── tokenizer_config.json                # Tokenizer configuration
├── special_tokens_map.json              # Special tokens mapping
├── added_tokens.json                    # Added tokens
├── chat_template.jinja                  # Chat template
├── Dockerfile                           # Docker configuration
├── docker-compose.yml                   # Docker Compose setup
└── .gitignore                           # Git ignore file
```

## Citation

```bibtex
@misc{manojb2025qwen3toolcall,
  title={Qwen3-4B-toolcalling-gguf-codex: Local Function Calling},
  author={Manojb},
  year={2025},
  url={https://huggingface.co/Manojb/Qwen3-4b-toolcall-gguf-llamacpp-codex}
}
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Related Projects

- [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) - Python bindings for llama.cpp
- [Qwen3](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) - Base model
- [xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) - Training dataset

---

**Built with ❤️ for the developer community**