---
language:
  - en
  - zh
  - ru
  - ja
  - de
  - fr
  - es
  - pt
  - vi
  - th
  - ar
  - ko
  - it
  - pl
  - nl
  - sv
  - tr
  - he
  - cs
  - uk
  - ro
  - bg
  - hu
  - el
  - da
  - fi
  - nb
  - sk
  - sl
  - hr
  - lt
  - lv
  - et
  - mt
pipeline_tag: sentence-similarity
tags:
  - qwen
  - embedding
  - onnx
  - int8
  - quantized
  - text-embeddings-inference
license: apache-2.0
---

# Qwen3-Embedding-0.6B ONNX INT8 for Text Embeddings Inference

This is an INT8-quantized ONNX export of [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B), optimized for CPU inference with Text Embeddings Inference (TEI).

## Key Features

- INT8 Quantization: ~8x smaller model size (0.56 GB vs 4.7 GB)
- CPU Optimized: 2-4x faster inference on CPU compared to float32
- TEI Compatible: packaged in the layout Text Embeddings Inference expects
- Multilingual: supports 30+ languages (see the metadata above), including English, Chinese, Russian, and Japanese
- Mean Pooling: configured for mean pooling, which TEI applies automatically (see the standalone onnxruntime sketch below)
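
Outside of TEI, the quantized graph can also be loaded directly with onnxruntime. The sketch below is illustrative and makes a few assumptions: the ONNX file is named `model.onnx`, the tokenizer is taken from the upstream Qwen repo, and the first graph output holds the token-level hidden states; mean pooling and L2 normalization are applied by hand because TEI is not in the loop here.

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B")
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

texts = ["What is Deep Learning?", "深度学习是什么?"]
enc = tokenizer(texts, padding=True, return_tensors="np")

# Feed only the inputs the exported graph actually declares.
input_names = {i.name for i in session.get_inputs()}
feed = {k: v for k, v in enc.items() if k in input_names}

# Assumption: the first output is the token-level hidden states (batch, seq, hidden).
token_embeddings = session.run(None, feed)[0]

# Mean pooling over non-padding tokens, then L2 normalization.
mask = enc["attention_mask"][..., None].astype(np.float32)
embeddings = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

print(embeddings.shape)  # (2, hidden_size)
```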

## Performance

- Model size: 0.56 GB (vs 4.7 GB float32)
- Expected speedup: 2-4x on CPU
- Accuracy: minimal loss (1-3%) compared to float32
- Best for: CPU deployments, edge devices, high-throughput scenarios

## Usage with Text Embeddings Inference

### Docker Deployment (CPU)

```bash
docker run -p 8080:80 \
  -e OMP_NUM_THREADS=$(nproc) \
  -e KMP_AFFINITY=granularity=fine,compact,1,0 \
  -e ORT_THREAD_POOL_SIZE=$(nproc) \
  ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id YOUR_USERNAME/qwen3-embedding-0.6b-int8-tei-onnx
```
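
Once the container is up, a quick way to confirm it loaded the model is TEI's `/info` route. A small sketch (field names may vary by TEI version, hence the defensive `.get` calls):

```python
import requests

# TEI exposes an /info route describing the currently loaded model.
info = requests.get("http://localhost:8080/info").json()
print(info.get("model_id"), info.get("model_dtype"))
```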

### Python Client

```python
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Single embedding
embedding = client.feature_extraction("What is Deep Learning?")

# Batch embeddings (one request per text)
texts = ["What is Deep Learning?", "深度学习是什么?"]
embeddings = [client.feature_extraction(text) for text in texts]
```
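
Since the pipeline tag is sentence similarity, a typical follow-up is to compare the two embeddings returned above. A minimal sketch, reusing the `embeddings` list from the previous block:

```python
import numpy as np

# Cosine similarity between the English and Chinese sentences embedded above.
a, b = np.asarray(embeddings[0]), np.asarray(embeddings[1])
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {similarity:.4f}")
```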

## CPU Optimization

For optimal CPU performance, set these environment variables:

```bash
export OMP_NUM_THREADS=$(nproc)          # use all available cores (nproc counts logical cores)
export KMP_AFFINITY=granularity=fine,compact,1,0
export ORT_THREAD_POOL_SIZE=$(nproc)
```
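
These variables apply when the model runs inside TEI. If you load the ONNX file directly with onnxruntime (as in the sketch under Key Features), the equivalent knobs live on `SessionOptions`; a minimal sketch:

```python
import os
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = os.cpu_count() or 1  # roughly the OMP_NUM_THREADS analogue
opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

session = ort.InferenceSession(
    "model.onnx", sess_options=opts, providers=["CPUExecutionProvider"]
)
```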

## License

Apache 2.0