---
language:
- en
- zh
- ru
- ja
- de
- fr
- es
- pt
- vi
- th
- ar
- ko
- it
- pl
- nl
- sv
- tr
- he
- cs
- uk
- ro
- bg
- hu
- el
- da
- fi
- nb
- sk
- sl
- hr
- lt
- lv
- et
- mt
pipeline_tag: sentence-similarity
tags:
- qwen
- embedding
- onnx
- int8
- quantized
- text-embeddings-inference
license: apache-2.0
---
# Qwen3-Embedding-0.6B ONNX INT8 for Text Embeddings Inference
This is an INT8-quantized ONNX version of [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B), optimized for CPU inference with [Text Embeddings Inference (TEI)](https://github.com/huggingface/text-embeddings-inference).
## Key Features
- **INT8 Quantization**: ~8x smaller model size (0.56GB vs 4.7GB)
- **CPU Optimized**: 2-4x faster inference on CPU compared to float32
- **TEI Compatible**: Properly formatted for Text Embeddings Inference
- **Multilingual**: Supports 30+ languages, including English, Chinese, Russian, and Japanese (see the `language` list above)
- **Mean Pooling**: Configured for mean pooling (handled by TEI)
## Performance
- **Model size**: 0.56 GB (vs 4.7 GB float32)
- **Expected speedup**: 2-4x on CPU
- **Accuracy**: Minimal loss (1-3%) compared to float32
- **Best for**: CPU deployments, edge devices, high-throughput scenarios
## Usage with Text Embeddings Inference
### Docker Deployment (CPU)
```bash
docker run -p 8080:80 \
  -e OMP_NUM_THREADS=$(nproc) \
  -e KMP_AFFINITY=granularity=fine,compact,1,0 \
  -e ORT_THREAD_POOL_SIZE=$(nproc) \
  ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id YOUR_USERNAME/qwen3-embedding-0.6b-int8-tei-onnx
```
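Once the container is running, you can sanity-check it with a plain HTTP request. The sketch below assumes TEI's standard `/embed` route and the port mapping from the command above:

```python
import requests

# POST to the TEI /embed route exposed by the container above
resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": "What is Deep Learning?"},
    timeout=30,
)
resp.raise_for_status()

embedding = resp.json()[0]  # /embed returns a list of embedding vectors
print(len(embedding))       # embedding dimensionality
```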
### Python Client
```python
import json

from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Single embedding (InferenceClient.post returns raw bytes, so decode with json.loads)
response = client.post(json={"inputs": "What is Deep Learning?"})
embedding = json.loads(response)

# Batch embeddings
response = client.post(json={"inputs": ["What is Deep Learning?", "深度学习是什么?"]})
embeddings = json.loads(response)
```
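Since the model targets sentence similarity, a typical next step is to score the returned vectors with cosine similarity. A minimal numpy sketch, assuming `embeddings` holds the two batch vectors from above:

```python
import numpy as np

# Cosine similarity between the two embeddings returned by the batch call
a, b = np.asarray(embeddings[0]), np.asarray(embeddings[1])
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {similarity:.4f}")
```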
## CPU Optimization
For optimal CPU performance, set these environment variables:
```bash
export OMP_NUM_THREADS=$(nproc)  # Use all available cores
export KMP_AFFINITY=granularity=fine,compact,1,0
export ORT_THREAD_POOL_SIZE=$(nproc)
```
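The same thread settings can also be applied programmatically if you load the INT8 ONNX file directly with ONNX Runtime instead of going through TEI. The sketch below is illustrative only: the file name `model.onnx`, the graph's input names, and the mean-pooling step are assumptions based on the description above, so adjust them to the actual export.

```python
import os

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

repo_id = "YOUR_USERNAME/qwen3-embedding-0.6b-int8-tei-onnx"  # placeholder repo id
model_path = hf_hub_download(repo_id, "model.onnx")           # assumed file name

# Mirror the environment variables above in the session options
opts = ort.SessionOptions()
opts.intra_op_num_threads = os.cpu_count()

session = ort.InferenceSession(
    model_path, sess_options=opts, providers=["CPUExecutionProvider"]
)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

enc = tokenizer(["What is Deep Learning?"], padding=True, return_tensors="np")

# Feed only the inputs the exported graph actually declares
wanted = {i.name for i in session.get_inputs()}
feeds = {name: value for name, value in enc.items() if name in wanted}

token_embeddings = session.run(None, feeds)[0]  # assumed shape: (batch, seq_len, hidden)

# Mean pooling over non-padding tokens, matching what TEI is configured to do
mask = enc["attention_mask"][..., None].astype(token_embeddings.dtype)
embeddings = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)
print(embeddings.shape)
```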
## License
Apache 2.0