Mungert committed on
Commit
aa8ae70
·
verified ·
1 Parent(s): 2d7b76d

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +183 -0
README.md CHANGED
@@ -10,6 +10,189 @@ tags:
10
  library_name: transformers
11
  ---
12
 
13
+ # <span style="color: #7FFF7F;">Qwen2.5-1.5B-Instruct GGUF Models</span>
14
+
15
+ ## **Choosing the Right Model Format**
16
+
17
+ Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.
18
+
19
+ ### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
20
+ - A 16-bit floating-point format designed for **faster computation** while retaining good precision.
21
+ - Provides a **similar dynamic range** to FP32 but with **lower memory usage**.
22
+ - Recommended if your hardware supports **BF16 acceleration** (check your device's specs).
23
+ - Ideal for **high-performance inference** with a **reduced memory footprint** compared to FP32.
24
+
25
+ 📌 **Use BF16 if:**
26
+ ✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
27
+ ✔ You want **higher precision** while saving memory.
28
+ ✔ You plan to **requantize** the model into another format (see the sketch below).
29
+
30
+ 📌 **Avoid BF16 if:**
31
+ ❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
32
+ ❌ You need compatibility with older devices that lack BF16 optimization.
33
+
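+ As referenced above, if you plan to requantize the BF16 weights yourself, a minimal sketch is shown below. It assumes a local llama.cpp build; the `llama-quantize` binary location and the file names are assumptions that depend on your setup (older builds ship the tool as `quantize`).
+
+ ```python
+ # Sketch: requantize the BF16 GGUF into Q4_K_M with llama.cpp's quantize tool.
+ # Binary location and file names are assumptions - adjust for your build.
+ import subprocess
+ from pathlib import Path
+
+ LLAMA_QUANTIZE = Path("./llama.cpp/build/bin/llama-quantize")  # assumed path
+ SRC = Path("Qwen2.5-1.5B-Instruct-bf16.gguf")                  # BF16 source file
+ DST = Path("Qwen2.5-1.5B-Instruct-q4_k_m.gguf")                # output file
+
+ subprocess.run([str(LLAMA_QUANTIZE), str(SRC), str(DST), "Q4_K_M"], check=True)
+ print(f"Wrote {DST} ({DST.stat().st_size / 1e6:.0f} MB)")
+ ```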
34
+ ---
35
+
36
+ ### **F16 (Float 16) – More widely supported than BF16**
37
+ - A 16-bit floating-point format with **high precision** but a smaller range of values than BF16.
38
+ - Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
39
+ - Slightly lower numerical precision than BF16 but generally sufficient for inference.
40
+
41
+ 📌 **Use F16 if:**
42
+ ✔ Your hardware supports **FP16** but **not BF16** (see the capability check below).
43
+ ✔ You need a **balance between speed, memory usage, and accuracy**.
44
+ ✔ You are running on a **GPU** or another device optimized for FP16 computations.
45
+
46
+ 📌 **Avoid F16 if:**
47
+ ❌ Your device lacks **native FP16 support** (it may run slower than expected).
48
+ ❌ You are tightly memory-constrained (quantized formats use far less memory).
49
+
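+ If you are unsure what your GPU supports, the quick check below is one convenient option. It assumes a CUDA build of PyTorch is installed; it is not required for running GGUF files, just a way to inspect the hardware.
+
+ ```python
+ # Sketch: check whether the local GPU has BF16/FP16 support (assumes PyTorch + CUDA).
+ import torch
+
+ if torch.cuda.is_available():
+     major, minor = torch.cuda.get_device_capability(0)
+     print(f"GPU: {torch.cuda.get_device_name(0)} (compute capability {major}.{minor})")
+     # On NVIDIA hardware, BF16 needs Ampere (8.x) or newer.
+     print("BF16 supported:", torch.cuda.is_bf16_supported())
+     # FP16 works on much older GPUs, so if BF16 is False, the f16 or quantized files are safer.
+ else:
+     print("No CUDA GPU detected - prefer the quantized (Q4_K/Q6_K/Q8_0) files for CPU inference.")
+ ```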
50
+ ---
51
+
52
+ ### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
53
+ Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
54
+ - **Lower-bit models (Q4_K)** → **Best for minimal memory usage**, but lower precision.
55
+ - **Higher-bit models (Q6_K, Q8_0)** → **Better accuracy**, but require more memory.
56
+
57
+ 📌 **Use Quantized Models if:**
58
+ ✔ You are running inference on a **CPU** and need an optimized model (see the example below).
59
+ ✔ Your device has **low VRAM** and cannot load full-precision models.
60
+ ✔ You want to reduce **memory footprint** while keeping reasonable accuracy.
61
+
62
+ 📌 **Avoid Quantized Models if:**
63
+ ❌ You need **maximum accuracy** (full-precision models are better for this).
64
+ ❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).
65
+
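+ For the CPU case referenced above, a minimal inference sketch with the llama-cpp-python bindings (`pip install llama-cpp-python`) looks like this; the file name is one of the quantized files listed further down, and the thread/context settings are assumptions to tune for your machine.
+
+ ```python
+ # Sketch: run a quantized GGUF on CPU via llama-cpp-python.
+ from llama_cpp import Llama
+
+ llm = Llama(
+     model_path="Qwen2.5-1.5B-Instruct-q4_k.gguf",  # any quantized file from this repo
+     n_ctx=4096,     # context window
+     n_threads=8,    # roughly match your physical CPU cores
+ )
+
+ out = llm.create_chat_completion(
+     messages=[
+         {"role": "system", "content": "You are a helpful assistant."},
+         {"role": "user", "content": "Give me a short introduction to large language models."},
+     ],
+     max_tokens=256,
+ )
+ print(out["choices"][0]["message"]["content"])
+ ```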
66
+ ---
67
+
68
+ ### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**
69
+ These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.
70
+
71
+ - **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
72
+ - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
73
+ - **Trade-off**: Lower accuracy compared to higher-bit quantizations.
74
+
75
+ - **IQ3_S**: Small block size for **maximum memory efficiency**.
76
+ - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.
77
+
78
+ - **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
79
+ - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.
80
+
81
+ - **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
82
+ - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.
83
+
84
+ - **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
85
+ - **Use case**: Best for **ARM-based devices** or **low-memory environments**.
86
+
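+ A rough way to sanity-check whether one of the very-low-bit files above fits on a device: the weights take roughly the GGUF file size in RAM, plus KV-cache and runtime overhead that grows with context length. The sketch below uses an illustrative 20% overhead figure, which is an assumption rather than a measured value.
+
+ ```python
+ # Sketch: approximate RAM needed for a GGUF file (file size + assumed overhead).
+ import os
+
+ def rough_ram_needed_gib(gguf_path: str, overhead: float = 0.20) -> float:
+     size_gib = os.path.getsize(gguf_path) / 2**30
+     return size_gib * (1 + overhead)
+
+ print(f"~{rough_ram_needed_gib('Qwen2.5-1.5B-Instruct-iq3_xs.gguf'):.2f} GiB")
+ ```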
87
+ ---
88
+
89
+ ### **Summary Table: Model Format Selection**
90
+
91
+ | Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
92
+ |--------------|------------|---------------|----------------------|---------------|
93
+ | **BF16** | Highest | High | BF16-supported GPUs/CPUs | High-speed inference with reduced memory |
94
+ | **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
95
+ | **Q4_K** | Medium Low | Low | CPU or Low-VRAM devices | Best for memory-constrained environments |
96
+ | **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
97
+ | **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
98
+ | **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency at the cost of accuracy |
99
+ | **Q4_0** | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
100
+
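+ The table above can also be expressed as a small helper. The thresholds below are illustrative assumptions for this 1.5B model, not benchmarks; adjust them for your own hardware.
+
+ ```python
+ # Sketch: pick a format from available memory and hardware support (illustrative thresholds).
+ def suggest_format(ram_gib: float, bf16_ok: bool, fp16_ok: bool) -> str:
+     if ram_gib >= 4 and bf16_ok:
+         return "bf16"
+     if ram_gib >= 4 and fp16_ok:
+         return "f16"
+     if ram_gib >= 2.5:
+         return "q8_0"
+     if ram_gib >= 1.5:
+         return "q4_k"
+     return "iq3_xs"
+
+ print(suggest_format(ram_gib=2.0, bf16_ok=False, fp16_ok=False))  # -> q4_k
+ ```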
101
+ ---
102
+
103
+ ## **Included Files & Details**
104
+
105
+ ### `Qwen2.5-1.5B-Instruct-bf16.gguf`
106
+ - Model weights preserved in **BF16**.
107
+ - Use this if you want to **requantize** the model into a different format.
108
+ - Best if your device supports **BF16 acceleration**.
109
+
110
+ ### `Qwen2.5-1.5B-Instruct-f16.gguf`
111
+ - Model weights stored in **F16**.
112
+ - Use if your device supports **FP16**, especially if BF16 is not available.
113
+
114
+ ### `Qwen2.5-1.5B-Instruct-bf16-q8_0.gguf`
115
+ - **Output & embeddings** remain in **BF16**.
116
+ - All other layers quantized to **Q8_0**.
117
+ - Use if your device supports **BF16** and you want a quantized version.
118
+
119
+ ### `Qwen2.5-1.5B-Instruct-f16-q8_0.gguf`
120
+ - **Output & embeddings** remain in **F16**.
121
+ - All other layers quantized to **Q8_0**.
122
+
123
+ ### `Qwen2.5-1.5B-Instruct-q4_k.gguf`
124
+ - **Output & embeddings** quantized to **Q8_0**.
125
+ - All other layers quantized to **Q4_K**.
126
+ - Good for **CPU inference** with limited memory.
127
+
128
+ ### `Qwen2.5-1.5B-Instruct-q4_k_s.gguf`
129
+ - Smallest **Q4_K** variant, using less memory at the cost of accuracy.
130
+ - Best for **very low-memory setups**.
131
+
132
+ ### `Qwen2.5-1.5B-Instruct-q6_k.gguf`
133
+ - **Output & embeddings** quantized to **Q8_0**.
134
+ - All other layers quantized to **Q6_K**.
135
+
136
+ ### `Qwen2.5-1.5B-Instruct-q8_0.gguf`
137
+ - Fully **Q8** quantized model for better accuracy.
138
+ - Requires **more memory** but offers higher precision.
139
+
140
+ ### `Qwen2.5-1.5B-Instruct-iq3_xs.gguf`
141
+ - **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
142
+ - Best for **ultra-low-memory devices**.
143
+
144
+ ### `Qwen2.5-1.5B-Instruct-iq3_m.gguf`
145
+ - **IQ3_M** quantization, offering a **medium block size** for better accuracy.
146
+ - Suitable for **low-memory devices**.
147
+
148
+ ### `Qwen2.5-1.5B-Instruct-q4_0.gguf`
149
+ - Pure **Q4_0** quantization, optimized for **ARM devices**.
150
+ - Best for **low-memory environments**.
151
+ - Prefer IQ4_NL for better accuracy.
152
+
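+ To grab just one of the files above without cloning the whole repository, `huggingface_hub` works well. The `repo_id` below is an assumption - replace it with the repository name shown at the top of this model page.
+
+ ```python
+ # Sketch: download a single GGUF file from the Hub (repo_id is an assumed placeholder).
+ from huggingface_hub import hf_hub_download
+
+ path = hf_hub_download(
+     repo_id="Mungert/Qwen2.5-1.5B-Instruct-GGUF",   # assumed repo id
+     filename="Qwen2.5-1.5B-Instruct-q4_k.gguf",     # pick any file listed above
+ )
+ print("Downloaded to:", path)
+ ```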
153
+ # <span id="testllm" style="color: #7F7FFF;">🚀 If you find these models useful</span>
154
+ ❤ **Please click "Like" if you find this useful!**
155
+ Help me test my **AI-Powered Network Monitor Assistant** with **quantum-ready security checks**:
156
+ 👉 [Free Network Monitor](https://freenetworkmonitor.click/dashboard)
157
+
158
+ 💬 **How to test**:
159
+ 1. Click the **chat icon** (bottom right on any page)
160
+ 2. Choose an **AI assistant type**:
161
+ - `TurboLLM` (GPT-4-mini)
162
+ - `FreeLLM` (Open-source)
163
+ - `TestLLM` (Experimental CPU-only)
164
+
165
+ ### **What I'm Testing**
166
+ I'm pushing the limits of **small open-source models for AI network monitoring**, specifically:
167
+ - **Function calling** against live network services
168
+ - **How small can a model go** while still handling:
169
+ - Automated **Nmap scans**
170
+ - **Quantum-readiness checks**
171
+ - **Metasploit integration**
172
+
173
+ 🟡 **TestLLM** – Current experimental model (llama.cpp on 6 CPU threads):
174
+ - ✅ **Zero-configuration setup**
175
+ - ⏳ 30s load time (slow inference but **no API costs**)
176
+ - 🔧 **Help wanted!** If you're into **edge-device AI**, let's collaborate!
177
+
178
+ ### **Other Assistants**
179
+ 🟢 **TurboLLM** – Uses **gpt-4-mini** for:
180
+ - **Real-time network diagnostics**
181
+ - **Automated penetration testing** (Nmap/Metasploit)
182
+ - 🔑 Get more tokens by [downloading our Free Network Monitor Agent](https://freenetworkmonitor.click/download)
183
+
184
+ 🔵 **HugLLM** – Open-source models (≈8B params):
185
+ - **2x more tokens** than TurboLLM
186
+ - **AI-powered log analysis**
187
+ - 🌐 Runs on Hugging Face Inference API
188
+
189
+ ### 💡 **Example AI Commands to Test**:
190
+ 1. `"Give me info on my website's SSL certificate"`
191
+ 2. `"Check if my server is using quantum-safe encryption for communication"`
192
+ 3. `"Run a quick Nmap vulnerability test"`
193
+
194
+
195
+
196
  # Qwen2.5-1.5B-Instruct
197
 
198
  ## Introduction