---
language:
- en
- zh
- ru
- ja
- de
- fr
- es
- pt
- vi
- th
- ar
- ko
- it
- pl
- nl
- sv
- tr
- he
- cs
- uk
- ro
- bg
- hu
- el
- da
- fi
- nb
- sk
- sl
- hr
- lt
- lv
- et
- mt
pipeline_tag: sentence-similarity
tags:
- qwen
- embedding
- onnx
- int8
- quantized
- text-embeddings-inference
license: apache-2.0
---

# Qwen3-Embedding-0.6B ONNX INT8 for Text Embeddings Inference
This is an INT8-quantized ONNX version of [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B), optimized specifically for [Text Embeddings Inference](https://github.com/huggingface/text-embeddings-inference) (TEI) with CPU acceleration.
## Key Features
- INT8 Quantization: ~8x smaller model size (0.56 GB vs 4.7 GB float32)
- CPU Optimized: 2-4x faster inference on CPU compared to float32
- TEI Compatible: Properly formatted for Text Embeddings Inference
- Multilingual: Supports the 34 languages listed in the metadata, including English, Chinese, Russian, and Japanese
- Mean Pooling: Configured for mean pooling (handled by TEI)
## Performance
- Model size: 0.56 GB (vs 4.7 GB float32)
- Expected speedup: 2-4x on CPU
- Accuracy: Minimal loss (1-3%) compared to float32
- Best for: CPU deployments, edge devices, high-throughput scenarios
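
For reference, an INT8 export like this one is typically produced with ONNX Runtime's dynamic quantization. The sketch below is illustrative only; it assumes a float32 ONNX export of the base model saved locally as `model_fp32.onnx` (a placeholder path, not necessarily the file layout of this repository).

```python
# Illustrative sketch: dynamic INT8 quantization with ONNX Runtime.
# Paths are placeholders; adjust to your local float32 ONNX export.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model_fp32.onnx",   # float32 ONNX export of Qwen3-Embedding-0.6B
    model_output="model_int8.onnx",  # INT8 model consumed by TEI
    weight_type=QuantType.QInt8,     # quantize weights to signed 8-bit integers
)
```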
## Usage with Text Embeddings Inference
### Docker Deployment (CPU)
```bash
docker run -p 8080:80 \
  -e OMP_NUM_THREADS=$(nproc) \
  -e KMP_AFFINITY=granularity=fine,compact,1,0 \
  -e ORT_THREAD_POOL_SIZE=$(nproc) \
  ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id YOUR_USERNAME/qwen3-embedding-0.6b-int8-tei-onnx
```
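
Once the container is up, you can sanity-check it from Python. TEI exposes an `/embed` route that accepts an `{"inputs": ...}` payload; the example text and port below match the Docker command above.

```python
# Quick smoke test against the locally running TEI container.
import requests

resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": "What is Deep Learning?"},
    timeout=30,
)
resp.raise_for_status()
vector = resp.json()[0]  # /embed returns one embedding per input
print(len(vector))       # embedding dimension of the model
```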
### Python Client
```python
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Single embedding (returns a numpy array)
embedding = client.feature_extraction("What is Deep Learning?")

# Batch embeddings
embeddings = [
    client.feature_extraction(text)
    for text in ["What is Deep Learning?", "深度学习是什么?"]
]
```
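
Since this model is used for sentence similarity, a typical next step is to score text pairs by cosine similarity of their embeddings. The snippet below reuses the `embeddings` list from the batch example above.

```python
# Cosine similarity between the two batch embeddings above.
import numpy as np

a, b = np.asarray(embeddings[0]), np.asarray(embeddings[1])
score = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {score:.4f}")
```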
## CPU Optimization
For optimal CPU performance, set these environment variables (the Docker command above already passes them with `-e`):
```bash
export OMP_NUM_THREADS=$(nproc)                   # use all available cores (nproc reports logical cores)
export KMP_AFFINITY=granularity=fine,compact,1,0  # pin threads for Intel OpenMP
export ORT_THREAD_POOL_SIZE=$(nproc)              # size the ONNX Runtime thread pool to match
```
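
To confirm the settings are having an effect, a rough throughput check can be run against the server. The snippet below is a simple illustration with arbitrary batch and repeat counts, not a rigorous benchmark.

```python
# Rough throughput check against the TEI /embed endpoint (not a rigorous benchmark).
import time
import requests

batch = ["What is Deep Learning?"] * 32
runs = 10

start = time.perf_counter()
for _ in range(runs):
    resp = requests.post("http://localhost:8080/embed", json={"inputs": batch}, timeout=60)
    resp.raise_for_status()
elapsed = time.perf_counter() - start

print(f"{runs * len(batch) / elapsed:.1f} embeddings/sec")
```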
## License
Apache 2.0