---
language:
- en
- zh
- ru
- ja
- de
- fr
- es
- pt
- vi
- th
- ar
- ko
- it
- pl
- nl
- sv
- tr
- he
- cs
- uk
- ro
- bg
- hu
- el
- da
- fi
- nb
- sk
- sl
- hr
- lt
- lv
- et
- mt
pipeline_tag: sentence-similarity
tags:
- qwen
- embedding
- onnx
- int8
- quantized
- text-embeddings-inference
license: apache-2.0
---

# Qwen3-Embedding-0.6B ONNX INT8 for Text Embeddings Inference

This is an INT8 quantized ONNX version of [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) optimized specifically for [Text Embeddings Inference (TEI)](https://github.com/huggingface/text-embeddings-inference) with CPU acceleration.

## Key Features

- **INT8 Quantization**: ~8x smaller model size (0.56GB vs 4.7GB); a quantization sketch follows this list
- **CPU Optimized**: 2-4x faster inference on CPU compared to float32
- **TEI Compatible**: Properly formatted for Text Embeddings Inference
- **Multilingual**: Supports the 34 languages declared in the model metadata, including English, Chinese, Russian, Japanese, and more
- **Mean Pooling**: Configured for mean pooling (handled by TEI)
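
INT8 weights for a model like this are typically produced from a float32 ONNX export with ONNX Runtime's dynamic quantization. A minimal sketch (file names are illustrative; the exact export and quantization settings used for this repository may differ):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Convert a float32 ONNX export to INT8 weights (dynamic quantization).
quantize_dynamic(
    model_input="model.onnx",        # float32 ONNX export of Qwen3-Embedding-0.6B (assumed path)
    model_output="model_int8.onnx",  # INT8-quantized model
    weight_type=QuantType.QInt8,     # signed 8-bit integer weights
)
```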

## Performance

- **Model size**: 0.56 GB (vs 4.7 GB float32)
- **Expected speedup**: 2-4x on CPU
- **Accuracy**: Minimal loss (typically 1-3%) compared to float32; a spot-check sketch follows this list
- **Best for**: CPU deployments, edge devices, high-throughput scenarios
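
The accuracy figure can be spot-checked by serving this model and the original float32 model side by side (for example on ports 8080 and 8081) and comparing the embeddings they return for the same inputs. A minimal sketch with `requests` and `numpy`; the second TEI instance serving the float32 model is an assumption of this example:

```python
import numpy as np
import requests

def embed(base_url: str, texts: list[str]) -> np.ndarray:
    # TEI's /embed route returns one embedding per input text
    resp = requests.post(f"{base_url}/embed", json={"inputs": texts})
    resp.raise_for_status()
    return np.array(resp.json())

texts = ["What is Deep Learning?", "Paris is the capital of France."]
int8 = embed("http://localhost:8080", texts)  # this INT8 model
fp32 = embed("http://localhost:8081", texts)  # original float32 model (assumed second instance)

# Per-text cosine similarity; values close to 1.0 mean the INT8 embeddings
# closely track the float32 ones.
cos = (int8 * fp32).sum(axis=1) / (
    np.linalg.norm(int8, axis=1) * np.linalg.norm(fp32, axis=1)
)
print(cos)
```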

## Usage with Text Embeddings Inference

### Docker Deployment (CPU)

```bash
docker run -p 8080:80 \
  -e OMP_NUM_THREADS=$(nproc) \
  -e KMP_AFFINITY=granularity=fine,compact,1,0 \
  -e ORT_THREAD_POOL_SIZE=$(nproc) \
  ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id YOUR_USERNAME/qwen3-embedding-0.6b-int8-tei-onnx
```
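
Once the container is up, you can sanity-check the endpoint with TEI's `/embed` route:

```bash
curl 127.0.0.1:8080/embed \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is Deep Learning?"}'
```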

### Python Client

The example below calls TEI's `/embed` route directly with `requests`:

```python
import requests

# TEI serves embeddings under the /embed route
API_URL = "http://localhost:8080/embed"

# Single embedding (the response is a list containing one embedding)
response = requests.post(API_URL, json={"inputs": "What is Deep Learning?"})
embedding = response.json()[0]

# Batch embeddings (one embedding per input, in order)
response = requests.post(
    API_URL,
    json={"inputs": ["What is Deep Learning?", "深度学习是什么?"]},
)
embeddings = response.json()
```
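
TEI L2-normalizes the returned embeddings by default; if you need the raw pooled vectors, pass `"normalize": false` in the request body.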

## CPU Optimization

For optimal CPU performance, set these environment variables:

```bash
export OMP_NUM_THREADS=$(nproc)          # Use all available cores (nproc reports logical cores)
export KMP_AFFINITY=granularity=fine,compact,1,0
export ORT_THREAD_POOL_SIZE=$(nproc)
```

## License

Apache 2.0