Quantization

Pocket TTS supports dynamic int8 quantization, which reduces runtime memory usage and improves inference speed on x86 CPUs and, with the torchao backend, on ARM CPUs.
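
As a rough illustration of what int8 quantization does to a weight tensor, the sketch below applies symmetric per-tensor quantization in pure Python. This is a generic textbook scheme, not Pocket TTS's actual implementation, which delegates to torchao or torch.ao. The names quantize_int8 and dequantize are hypothetical. In dynamic quantization, weights are converted ahead of time like this, while activation scales are computed on the fly at inference.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale would work.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from int8 codes and the shared scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# The round-trip error per value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each stored weight shrinks from 4 bytes to 1 byte, at the cost of a small rounding error bounded by half of the scale.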

Quick Start

CLI

pocket-tts generate --quantize --text "Hello world"
pocket-tts serve --quantize

Python API

from pocket_tts import TTSModel

model = TTSModel.load_model(quantize=True)
voice_state = model.get_state_for_audio_prompt("alba")
audio = model.generate_audio(voice_state, "Hello world!")

Installation

Quantization works out of the box on any supported PyTorch version (2.5+) through PyTorch's built-in torch.ao module.

For optimized performance, install torchao (requires torch 2.10+):

pip install pocket-tts[quantize]

The quantization module automatically selects the best available backend:

torchao (torch 2.10+): optimized C++ kernels, faster on both ARM and x86
torch.ao (torch 2.5-2.9): deprecated but functional fallback
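
The selection rule above can be sketched as a small pure-Python function. This is a minimal sketch under stated assumptions, not the module's actual code: select_backend and parse_major_minor are hypothetical names, and falling back to torch.ao on torch 2.10+ when torchao is not installed is an assumption based on the statement that quantization works out of the box on 2.5+.

```python
import re


def parse_major_minor(version):
    """Extract (major, minor) from strings such as '2.10.0+cpu'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def select_backend(torch_version, torchao_installed):
    """Return 'torchao' or 'torch.ao' (hypothetical helper, see assumptions above)."""
    # Compare as integer tuples: as plain strings, "2.10" would sort before "2.9".
    version = parse_major_minor(torch_version)
    if torchao_installed and version >= (2, 10):
        return "torchao"
    if version >= (2, 5):
        return "torch.ao"
    raise RuntimeError("quantization requires torch 2.5 or newer")


print(select_backend("2.10.0", True))   # optimized backend
print(select_backend("2.9.1", False))   # fallback backend
```

In a real environment the two inputs could come from torch.__version__ and importlib.util.find_spec("torchao") is not None.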

Performance

Benchmarks were run on the full eval paragraph across 8 voices, with 5 isolated runs per config.

x86 (FBGEMM, ubuntu-latest GitHub Actions runner)

Config	Runtime Memory	RTS	Speedup vs Baseline
baseline	450 MB	3.17x	--
attention_ffn (default)	234 MB	4.04x	1.27x
all	206 MB	4.01x	1.26x

ARM (QNNPACK, Apple M4 MacBook Air, torchao backend)

Config	Runtime Memory	RTS	Speedup vs Baseline
baseline	450 MB	6.33x	--
attention_ffn (default)	234 MB	7.76x	1.23x

With the torch.ao fallback (torch 2.5-2.9), quantized inference on ARM is ~16% slower than the unquantized baseline rather than faster. ARM users should upgrade to torch 2.10+ with torchao.
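
The derived columns in the tables above can be reproduced from the raw numbers. The short sketch below recomputes the speedup (quantized RTS divided by baseline RTS) and adds the relative memory saving, which the tables do not list. The helper names are hypothetical; all input figures are taken from the tables.

```python
BASELINE_MEMORY_MB = 450  # baseline runtime memory from the tables


def speedup(rts, baseline_rts):
    """Ratio of quantized RTS to baseline RTS, rounded as in the tables."""
    return round(rts / baseline_rts, 2)


def memory_saving_percent(memory_mb, baseline_mb=BASELINE_MEMORY_MB):
    """Runtime memory reduction relative to the baseline, in percent."""
    return round(100 * (1 - memory_mb / baseline_mb), 1)


# (platform, config, runtime memory in MB, RTS, baseline RTS)
rows = [
    ("x86", "attention_ffn", 234, 4.04, 3.17),
    ("x86", "all", 206, 4.01, 3.17),
    ("ARM", "attention_ffn", 234, 7.76, 6.33),
]

for platform, config, memory_mb, rts, baseline_rts in rows:
    print(
        f"{platform} {config}: {speedup(rts, baseline_rts)}x speedup, "
        f"{memory_saving_percent(memory_mb)}% less memory"
    )
```

On these figures the default config roughly halves runtime memory, while the all config saves a little more memory without a further speed gain.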

Quality

Quantization has no measurable impact on speech quality:

WER (Word Error Rate): the WER delta for the default attention_ffn config is −0.022 ± 0.032. The ± range crosses zero, so the delta is indistinguishable from measurement noise.
Subjective listening: no audible difference across all 8 voices.
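
The "crosses zero" argument for the WER delta can be checked with a few lines of arithmetic. The sketch below assumes the ± value is a symmetric uncertainty margin around the delta; the source does not say whether it is a standard deviation or a confidence interval, and the function names are hypothetical.

```python
def uncertainty_interval(delta, margin):
    """Lower and upper bounds of delta ± margin."""
    return delta - margin, delta + margin


def crosses_zero(delta, margin):
    """True when zero lies inside the interval, i.e. no detectable change."""
    low, high = uncertainty_interval(delta, margin)
    return low <= 0.0 <= high


# Values reported for the default attention_ffn config.
low, high = uncertainty_interval(-0.022, 0.032)
print(round(low, 3), round(high, 3), crosses_zero(-0.022, 0.032))
```

The interval runs from about −0.054 to about +0.010, so a true delta of zero is consistent with the measurement.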

What gets quantized

When quantize=True, the following layer groups in the FlowLM transformer are quantized to int8:

Group	Params	Description
attention	~25M	Q/K/V/output projections in transformer layers
ffn	~50M	Feed-forward linear layers in transformer layers

The flow matching network (flow_net, ~7M params) and the Mimi VAE decoder (convolutional) remain in float32.
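
A back-of-the-envelope estimate shows how these parameter counts relate to the measured memory reduction. The sketch below assumes 4 bytes per float32 weight, 1 byte per int8 weight, and 1 MB = 10^6 bytes. It ignores quantization scales and other runtime buffers, and the parameter counts are the approximate figures from the table, so the result is only indicative.

```python
BYTES_FLOAT32 = 4
BYTES_INT8 = 1

# Approximate parameter counts of the quantized groups (from the table above).
quantized_params = {"attention": 25_000_000, "ffn": 50_000_000}


def weight_megabytes(param_count, bytes_per_param):
    """Storage for the weights alone, in MB (10^6 bytes)."""
    return param_count * bytes_per_param / 1_000_000


total_params = sum(quantized_params.values())
float32_mb = weight_megabytes(total_params, BYTES_FLOAT32)
int8_mb = weight_megabytes(total_params, BYTES_INT8)
estimated_saving_mb = float32_mb - int8_mb

print(f"float32: {float32_mb} MB, int8: {int8_mb} MB, saved: {estimated_saving_mb} MB")
```

The estimated saving of about 225 MB is close to the measured drop from 450 MB to 234 MB (216 MB) reported for the default attention_ffn config.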