---
language:
- en
- zh
- ru
- ja
- de
- fr
- es
- pt
- vi
- th
- ar
- ko
- it
- pl
- nl
- sv
- tr
- he
- cs
- uk
- ro
- bg
- hu
- el
- da
- fi
- nb
- sk
- sl
- hr
- lt
- lv
- et
- mt
pipeline_tag: sentence-similarity
tags:
- qwen
- embedding
- onnx
- int8
- quantized
- text-embeddings-inference
license: apache-2.0
---
# Qwen3-Embedding-0.6B ONNX INT8 for Text Embeddings Inference
This is an INT8-quantized ONNX version of [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B), optimized for CPU inference with [Text Embeddings Inference (TEI)](https://github.com/huggingface/text-embeddings-inference).
## Key Features
- **INT8 Quantization**: ~8x smaller model size (0.56GB vs 4.7GB)
- **CPU Optimized**: 2-4x faster inference on CPU compared to float32
- **TEI Compatible**: Properly formatted for Text Embeddings Inference
- **Multilingual**: Supports 30+ languages, including English, Chinese, Russian, and Japanese (see the `language` list above)
- **Mean Pooling**: Configured for mean pooling (handled by TEI)
## Performance
- **Model size**: 0.56 GB (vs 4.7 GB float32)
- **Expected speedup**: 2-4x on CPU
- **Accuracy**: Minimal loss (1-3%) compared to float32
- **Best for**: CPU deployments, edge devices, high-throughput scenarios
## Usage with Text Embeddings Inference
### Docker Deployment (CPU)
```bash
docker run -p 8080:80 \
  -e OMP_NUM_THREADS=$(nproc) \
  -e KMP_AFFINITY=granularity=fine,compact,1,0 \
  -e ORT_THREAD_POOL_SIZE=$(nproc) \
  ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id YOUR_USERNAME/qwen3-embedding-0.6b-int8-tei-onnx
```
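Once the container is running, you can sanity-check it with a plain HTTP request. The sketch below assumes TEI's standard `/embed` route and the port mapping from the command above:

```python
import requests

# POST to the TEI /embed route exposed by the container above
resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": "What is Deep Learning?"},
    timeout=30,
)
resp.raise_for_status()

embedding = resp.json()[0]  # /embed returns a list of embedding vectors
print(len(embedding))       # embedding dimensionality
```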
### Python Client
```python
import json

from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Single embedding (InferenceClient.post returns raw bytes, so decode with json.loads)
response = client.post(json={"inputs": "What is Deep Learning?"})
embedding = json.loads(response)

# Batch embeddings
response = client.post(json={"inputs": ["What is Deep Learning?", "深度学习是什么?"]})
embeddings = json.loads(response)
```
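Since the model targets sentence similarity, a typical next step is to score the returned vectors with cosine similarity. A minimal numpy sketch, assuming `embeddings` holds the two batch vectors from above:

```python
import numpy as np

# Cosine similarity between the two embeddings returned by the batch call
a, b = np.asarray(embeddings[0]), np.asarray(embeddings[1])
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {similarity:.4f}")
```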
## CPU Optimization
For optimal CPU performance, set these environment variables:
```bash
export OMP_NUM_THREADS=$(nproc)  # Use all available cores
export KMP_AFFINITY=granularity=fine,compact,1,0
export ORT_THREAD_POOL_SIZE=$(nproc)
```
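The same thread settings can also be applied programmatically if you load the INT8 ONNX file directly with ONNX Runtime instead of going through TEI. The sketch below is illustrative only: the file name `model.onnx`, the graph's input names, and the mean-pooling step are assumptions based on the description above, so adjust them to the actual export.

```python
import os

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

repo_id = "YOUR_USERNAME/qwen3-embedding-0.6b-int8-tei-onnx"  # placeholder repo id
model_path = hf_hub_download(repo_id, "model.onnx")           # assumed file name

# Mirror the environment variables above in the session options
opts = ort.SessionOptions()
opts.intra_op_num_threads = os.cpu_count()

session = ort.InferenceSession(
    model_path, sess_options=opts, providers=["CPUExecutionProvider"]
)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

enc = tokenizer(["What is Deep Learning?"], padding=True, return_tensors="np")

# Feed only the inputs the exported graph actually declares
wanted = {i.name for i in session.get_inputs()}
feeds = {name: value for name, value in enc.items() if name in wanted}

token_embeddings = session.run(None, feeds)[0]  # assumed shape: (batch, seq_len, hidden)

# Mean pooling over non-padding tokens, matching what TEI is configured to do
mask = enc["attention_mask"][..., None].astype(token_embeddings.dtype)
embeddings = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)
print(embeddings.shape)
```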
## License
Apache 2.0