Cactus: High-Performance AI Inference on Any Smartphone

Community Article · Published October 3, 2025

TLDR

  • We built an AI inference engine from the ground up to run LLMs, VLMs, and Speech models on any smartphone, including low-range devices.
  • On-device mobile AI is fast (<100ms latency), private, works offline, and saves on API costs.
  • Run Qwen3-600m at up to 75 tok/sec on modern phones and 20 tok/sec on older models.
  • Check out our GitHub repo to start building. Cactus is source-available and free for personal, hobby, and SMB projects.

Hey Hugging Face community, Henry and Roman here, the builders of Cactus. We're excited to share the AI inference engine we've been working on. Designed from the ground up in C++, Cactus lets you deploy AI models in-app through our cross-platform SDKs.

The push for on-device AI is undeniable. The benefits are clear: near-zero latency, privacy by default, offline operation, and massive cost savings.

We are also seeing a fascinating trend in the model space. The community is proving that smaller, optimized models can be very powerful, especially in combination with the right tooling. We've been particularly impressed with DeepMind's Gemma3 270m and have found Qwen3 600m to excel at tool calling.

However! Bringing these models to mobile devices in production is a massive challenge. Existing frameworks often fall short because they are:

  • Optimized for flagship devices, leaving >70% of lower-end smartphone users behind.
  • Chunky, leading to bloated bundle sizes and battery drain.
  • Platform-specific, forcing developers to maintain multiple workflows for different operating systems.


Built from the Ground Up for Mobile

At Cactus, we've tackled these challenges by writing our own kernels and inference engine. Every design choice, from accelerator support and energy efficiency to quantization levels and context management, has been made with the constraints of mobile devices in mind.

We're launching:

  • Cactus Kernels: Low-level computation routines optimized for CPU and NPU.
  • Cactus Graph: Efficient graph representation for neural networks (see the sketch after this list).
  • Cactus Engine: Runtime that executes the graph on-device.
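
To give a feel for what a graph representation means here, the toy sketch below builds a tiny compute graph whose nodes wrap simple kernels and evaluates it lazily. It is purely conceptual and not the Cactus Graph API, which is implemented in C++.

```kotlin
// Toy compute graph, for illustration only (not the Cactus Graph API).
// Each node wraps a "kernel" (a low-level routine) plus its input nodes;
// evaluating the output node walks the graph and caches intermediate results.
class Node(
    private val inputs: List<Node>,
    private val kernel: (List<FloatArray>) -> FloatArray
) {
    private var cached: FloatArray? = null

    // Evaluate lazily: compute inputs first, run this node's kernel, cache the result.
    fun eval(): FloatArray =
        cached ?: kernel(inputs.map { it.eval() }).also { cached = it }
}

// A constant leaf node simply returns its data.
fun constant(data: FloatArray) = Node(emptyList()) { data }

// Element-wise add and ReLU, each built from a tiny "kernel".
fun add(a: Node, b: Node) = Node(listOf(a, b)) { (x, y) ->
    FloatArray(x.size) { i -> x[i] + y[i] }
}

fun relu(a: Node) = Node(listOf(a)) { (x) ->
    FloatArray(x.size) { i -> maxOf(0f, x[i]) }
}

fun main() {
    val x = constant(floatArrayOf(-1f, 2f, -3f))
    val b = constant(floatArrayOf(0.5f, 0.5f, 0.5f))
    val out = relu(add(x, b))          // graph: relu(add(x, b))
    println(out.eval().toList())       // [0.0, 2.5, 0.0]
}
```

In a real engine the kernels would be fused, scheduled across CPU and NPU, and operate on quantized tensors, but the node-and-edge structure is the same basic idea.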

We also provide streamlined SDKs for Flutter, Kotlin, and React Native that let developers build complex workflows with agentic tool use, RAG, and more.
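
As a rough sketch of what in-app usage looks like, here is a minimal Kotlin example. The names CactusLM, download, and complete are illustrative placeholders rather than the actual SDK surface; check the GitHub repo for the real API.

```kotlin
// Hypothetical sketch: these identifiers are placeholders standing in for
// whatever the real Cactus Kotlin SDK exposes, not its actual API.
class CactusLM(private val modelSlug: String) {
    fun download() = println("fetching $modelSlug weights to local storage")  // one-time fetch
    fun complete(prompt: String, maxTokens: Int = 128): String =
        "<reply generated on-device for: $prompt>"   // stand-in for local token generation
}

fun main() {
    val lm = CactusLM(modelSlug = "qwen3-0.6b")   // small model that fits low-end phones
    lm.download()                                  // after this, inference is fully offline
    println(lm.complete("Draft a one-line reply to: 'Are we still on for 6pm?'"))
}
```

The overall flow, download once and then run everything locally, is what makes the offline and privacy benefits possible.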

Performance (that Speaks for Itself)

We've been relentlessly focusing on performance. Here are our benchmarks for Qwen3-600m-int8 on CPU:

  • 16-20 tok/sec on devices like Pixel 6a or iPhone 11
  • 70+ tok/sec on the latest iPhone 17 and Galaxy S25 Ultra
  • Time-to-first-token as low as 50ms
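
For readers curious what the int8 in Qwen3-600m-int8 refers to, below is a minimal sketch of symmetric per-tensor int8 quantization. It is illustrative only; the scheme Cactus actually uses (per-channel scales, group sizes, rounding mode) may differ.

```kotlin
import kotlin.math.abs

// Minimal sketch of symmetric per-tensor int8 quantization (illustrative only).
fun quantizeInt8(weights: FloatArray): Pair<ByteArray, Float> {
    val scale = weights.maxOf { abs(it) } / 127f                 // largest magnitude maps to 127
    val q = ByteArray(weights.size) { i ->
        (weights[i] / scale).toInt().coerceIn(-127, 127).toByte() // truncation for brevity; real quantizers round
    }
    return q to scale
}

fun dequantizeInt8(q: ByteArray, scale: Float): FloatArray =
    FloatArray(q.size) { i -> q[i] * scale }                     // approximate reconstruction

fun main() {
    val w = floatArrayOf(0.12f, -0.98f, 0.40f, 0.003f)
    val (q, scale) = quantizeInt8(w)
    println(q.toList())                          // [15, -127, 51, 0]
    println(dequantizeInt8(q, scale).toList())   // close to w, within one quantization step
}
```

Storing each weight as one byte instead of four is a big part of why a 600M-parameter model fits comfortably within a phone's memory budget.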

NPU optimization yields even better performance.


Source Available

Our work is available at https://github.com/cactus-compute/cactus. Cactus is free for hobbyists, students, non-profits, and small businesses; see the license for exact terms.

You can try our demo app, Cactus Chat, on the App Store and Google Play Store. We power a number of production deployments, collectively serving 500,000+ weekly inference tasks.

While Cactus is primarily focused on mobile, we're huge fans of the amazing work being done by the broader community. For desktops and servers, we recommend the tools you already know and love: Ollama, llama.cpp, vLLM, and MLX.

We're very excited to see what the community builds with Cactus. Learn more about us, and share any feedback in our Discord!
