Upload README.md with huggingface_hub
README.md
CHANGED
@@ -10,6 +10,189 @@ tags:
library_name: transformers
---
# <span style="color: #7FFF7F;">Qwen2.5-1.5B-Instruct GGUF Models</span>

## **Choosing the Right Model Format**

Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.

### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
- A 16-bit floating-point format designed for **faster computation** while retaining good precision.
- Provides a **similar dynamic range** to FP32 but with **lower memory usage**.
- Recommended if your hardware supports **BF16 acceleration** (check your device's specs).
- Ideal for **high-performance inference** with a **reduced memory footprint** compared to FP32.

**Use BF16 if:**
✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
✔ You want **higher precision** while saving memory.
✔ You plan to **requantize** the model into another format.

**Avoid BF16 if:**
❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.
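
Before committing to the BF16 file, it can help to confirm support programmatically. The snippet below is a minimal sketch, assuming PyTorch is installed and a CUDA device is the target; `torch.cuda.is_bf16_supported()` only reports on the active CUDA device, so CPU-only or non-CUDA setups should simply fall back to the guidance above.

```python
# Quick check for BF16 support before picking the bf16 GGUF.
# Assumes PyTorch is installed; on CPU-only machines this just
# reports that no CUDA device is present.
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    if torch.cuda.is_bf16_supported():
        print(f"{name}: native BF16 support -> consider the bf16 GGUF")
    else:
        print(f"{name}: no BF16 support -> prefer f16 or a quantized file")
else:
    print("No CUDA device found -> a quantized file (Q4_K/Q6_K/Q8_0) is the safer choice")
```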
---

### **F16 (Float 16) – More widely supported than BF16**
- A 16-bit floating-point format offering **high precision**, but with a smaller range of values than BF16.
- Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16, but generally sufficient for inference.

**Use F16 if:**
✔ Your hardware supports **FP16** but **not BF16**.
✔ You need a **balance between speed, memory usage, and accuracy**.
✔ You are running on a **GPU** or another device optimized for FP16 computations.

**Avoid F16 if:**
❌ Your device lacks **native FP16 support** (it may run slower than expected).
❌ You have memory limitations.

---

### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
- **Lower-bit models (Q4_K)** – **Best for minimal memory usage**, but may have lower precision.
- **Higher-bit models (Q6_K, Q8_0)** – **Better accuracy**, but require more memory.

**Use Quantized Models if:**
✔ You are running inference on a **CPU** and need an optimized model.
✔ Your device has **low VRAM** and cannot load full-precision models.
✔ You want to reduce **memory footprint** while keeping reasonable accuracy.

**Avoid Quantized Models if:**
❌ You need **maximum accuracy** (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).
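
As a concrete example of CPU inference with one of the quantized files, here is a minimal sketch using the `llama-cpp-python` bindings (a separate install, not part of this repo). The file path, context size, and thread count are illustrative assumptions; adjust them to your machine.

```python
# Minimal CPU inference sketch with llama-cpp-python (pip install llama-cpp-python).
# The local path and generation settings are illustrative, not prescriptive.
from llama_cpp import Llama

llm = Llama(
    model_path="./Qwen2.5-1.5B-Instruct-q4_k.gguf",  # any quantized file from this repo
    n_ctx=4096,     # context window; lower it to save RAM
    n_threads=6,    # match your physical CPU cores
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```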
---

### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**
These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint. Rough size estimates for each tier are sketched after this list.

- **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
  - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
  - **Trade-off**: Lower accuracy compared to higher-bit quantizations.

- **IQ3_S**: Small block size for **maximum memory efficiency**.
  - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.

- **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
  - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.

- **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
  - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.

- **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
  - **Use case**: Best for **ARM-based devices** or **low-memory environments**.
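
To get a feel for how these tiers differ in practice, the sketch below estimates file size as parameter count times bits per weight. The bits-per-weight figures are rough approximations (actual GGUF files also keep some tensors at higher precision and include metadata), so treat the output as ballpark numbers only.

```python
# Back-of-the-envelope size estimate: parameters * bits-per-weight / 8.
# The bits-per-weight figures below are rough approximations, not exact
# values; real GGUF sizes also include higher-precision embedding/output
# tensors plus metadata.
PARAMS = 1.5e9  # approximate parameter count of Qwen2.5-1.5B-Instruct

APPROX_BPW = {
    "IQ3_XS": 3.3,
    "IQ3_M": 3.7,
    "Q4_0": 4.5,
    "Q4_K": 4.8,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "F16/BF16": 16.0,
}

for fmt, bpw in APPROX_BPW.items():
    size_gb = PARAMS * bpw / 8 / 1024**3
    print(f"{fmt:>9}: ~{size_gb:.2f} GB")
```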
---

### **Summary Table: Model Format Selection**

| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|-----------|--------------|---------------------|---------------|
| **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
| **Q4_K** | Medium-Low | Low | CPU or low-VRAM devices | Best for memory-constrained environments |
| **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
| **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
| **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency, lower accuracy |
| **Q4_0** | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
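
One way to act on this table is to pick a file based on the memory actually available at runtime. The sketch below is an assumption-laden example: the `repo_id` is a placeholder for wherever these GGUF files are hosted, the RAM thresholds are arbitrary cut-offs, and it requires the `huggingface_hub` and `psutil` packages.

```python
# Pick a quant tier from the table above based on currently free RAM,
# then fetch only that file. repo_id is a placeholder; point it at the
# repository actually hosting these GGUF files.
import psutil
from huggingface_hub import hf_hub_download

free_gb = psutil.virtual_memory().available / 1024**3

if free_gb > 8:
    filename = "Qwen2.5-1.5B-Instruct-q8_0.gguf"
elif free_gb > 4:
    filename = "Qwen2.5-1.5B-Instruct-q6_k.gguf"
else:
    filename = "Qwen2.5-1.5B-Instruct-q4_k.gguf"

path = hf_hub_download(repo_id="<user>/Qwen2.5-1.5B-Instruct-GGUF", filename=filename)
print(f"{free_gb:.1f} GB free -> downloaded {filename} to {path}")
```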
---

## **Included Files & Details**

### `Qwen2.5-1.5B-Instruct-bf16.gguf`
- Model weights preserved in **BF16**.
- Use this if you want to **requantize** the model into a different format (see the sketch below).
- Best if your device supports **BF16 acceleration**.
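
If you do want to requantize from the BF16 file, the usual route is llama.cpp's quantize tool. The sketch below just wraps that CLI call from Python; the binary name and path are assumptions that depend on your llama.cpp build (newer builds ship `llama-quantize`, older ones `quantize`), and `Q4_K_M` is only one of the available target types.

```python
# Requantize the BF16 GGUF into a smaller format using llama.cpp's
# quantize tool. Adjust LLAMA_QUANTIZE to match your build; the target
# type can be any of the types listed by the tool's help output.
import subprocess

LLAMA_QUANTIZE = "./llama.cpp/llama-quantize"   # assumption: path depends on your build
SRC = "Qwen2.5-1.5B-Instruct-bf16.gguf"
DST = "Qwen2.5-1.5B-Instruct-q4_k_m.gguf"
TARGET = "Q4_K_M"

subprocess.run([LLAMA_QUANTIZE, SRC, DST, TARGET], check=True)
```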
### `Qwen2.5-1.5B-Instruct-f16.gguf`
- Model weights stored in **F16**.
- Use if your device supports **FP16**, especially if BF16 is not available.

### `Qwen2.5-1.5B-Instruct-bf16-q8_0.gguf`
- **Output & embeddings** remain in **BF16**.
- All other layers quantized to **Q8_0**.
- Use if your device supports **BF16** and you want a quantized version.

### `Qwen2.5-1.5B-Instruct-f16-q8_0.gguf`
- **Output & embeddings** remain in **F16**.
- All other layers quantized to **Q8_0**.

### `Qwen2.5-1.5B-Instruct-q4_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q4_K**.
- Good for **CPU inference** with limited memory.

### `Qwen2.5-1.5B-Instruct-q4_k_s.gguf`
- Smallest **Q4_K** variant, using less memory at the cost of accuracy.
- Best for **very low-memory setups**.

### `Qwen2.5-1.5B-Instruct-q6_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q6_K**.

### `Qwen2.5-1.5B-Instruct-q8_0.gguf`
- Fully **Q8_0** quantized model for better accuracy.
- Requires **more memory** but offers higher precision.

### `Qwen2.5-1.5B-Instruct-iq3_xs.gguf`
- **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
- Best for **ultra-low-memory devices**.

### `Qwen2.5-1.5B-Instruct-iq3_m.gguf`
- **IQ3_M** quantization, offering a **medium block size** for better accuracy.
- Suitable for **low-memory devices**.

### `Qwen2.5-1.5B-Instruct-q4_0.gguf`
- Pure **Q4_0** quantization, optimized for **ARM devices**.
- Best for **low-memory environments**.
- Prefer IQ4_NL for better accuracy.
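
To verify which quantization a downloaded file actually applies per tensor (for example, that the output and embedding tensors stay at Q8_0 in the q4_k file), the `gguf` Python package can read the file's metadata. This is a sketch under the assumption that the package's `GGUFReader` interface is available in your installed version; the local path is a placeholder.

```python
# Count how many tensors use each quantization type in a GGUF file,
# using the `gguf` package (pip install gguf).
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("Qwen2.5-1.5B-Instruct-q4_k.gguf")  # placeholder path

counts = Counter(t.tensor_type.name for t in reader.tensors)
for qtype, n in counts.most_common():
    print(f"{qtype:>8}: {n} tensors")
```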
# <span id="testllm" style="color: #7F7FFF;">If you find these models useful</span>
**Please click "Like" if you find this useful!**
Help me test my **AI-Powered Network Monitor Assistant** with **quantum-ready security checks**:
[Free Network Monitor](https://freenetworkmonitor.click/dashboard)

**How to test**:
1. Click the **chat icon** (bottom right on any page)
2. Choose an **AI assistant type**:
   - `TurboLLM` (GPT-4-mini)
   - `FreeLLM` (Open-source)
   - `TestLLM` (Experimental CPU-only)

### **What I'm Testing**
I'm pushing the limits of **small open-source models for AI network monitoring**, specifically:
- **Function calling** against live network services
- **How small can a model go** while still handling:
  - Automated **Nmap scans**
  - **Quantum-readiness checks**
  - **Metasploit integration**

**TestLLM** – Current experimental model (llama.cpp on 6 CPU threads):
- **Zero-configuration setup**
- 30s load time (slow inference but **no API costs**)
- **Help wanted!** If you're into **edge-device AI**, let's collaborate!

### **Other Assistants**
**TurboLLM** – Uses **gpt-4-mini** for:
- **Real-time network diagnostics**
- **Automated penetration testing** (Nmap/Metasploit)
- Get more tokens by [downloading our Free Network Monitor Agent](https://freenetworkmonitor.click/download)

**HugLLM** – Open-source models (≈8B params):
- **2x more tokens** than TurboLLM
- **AI-powered log analysis**
- Runs on the Hugging Face Inference API

### **Example AI Commands to Test**:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a quick Nmap vulnerability test"`

# Qwen2.5-1.5B-Instruct

## Introduction