voxcpm

We host voxcpm so that it can be run on our online workstations, either directly or with Wine.

Quick description of voxcpm:

VoxCPM is a tokenizer-free text-to-speech system that models speech in a continuous space, aiming for extremely realistic, context-aware synthesis and true-to-life zero-shot voice cloning. Instead of converting speech into discrete tokens, it uses an end-to-end diffusion-autoregressive architecture built on the MiniCPM-4 backbone, combining hierarchical language modeling, finite scalar quantization (FSQ), and local Diffusion Transformers. This design helps decouple semantic and acoustic information while preserving fine-grained prosody, leading to more stable and expressive generation than many discrete-token systems. Trained on a large 1.8-million-hour bilingual corpus, VoxCPM can infer appropriate speaking style from context, dynamically adjusting intonation, rhythm, and emotional tone. It supports zero-shot voice cloning from a short reference audio clip, capturing timbre, accent, and pacing to closely mimic a target speaker without per-speaker fine-tuning.

Features:

Tokenizer-free diffusion-autoregressive TTS that operates in continuous speech space
Context-aware expressive generation that adapts prosody, style, and emotion from input text
True zero-shot voice cloning from short reference audio clips without speaker-specific training
Streaming synthesis support with low real-time factor suitable for interactive applications
Python API and CLI for easy use, including options for guidance strength, timesteps, normalization, and denoising
Pretrained VoxCPM-0.5B weights released with a Gradio playground and integration hooks for enhancement and ASR tools
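
The "real-time factor" mentioned in the streaming feature is conventionally defined as synthesis time divided by the duration of the audio produced, so a value below 1.0 means speech is generated faster than it plays back. The sketch below shows one way to measure it. It does not call VoxCPM itself: `synthesize`, `fake_tts`, and the 16000 Hz sample rate are hypothetical stand-ins, and any function that maps text to a sequence of audio samples could be passed in instead.

```python
import time
from typing import Callable, Sequence


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int,
) -> float:
    """Time one call to `synthesize` and return its real-time factor.

    `synthesize` is a hypothetical callable (not part of VoxCPM) that
    takes text and returns mono audio samples at `sample_rate` Hz.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return real_time_factor(elapsed, audio_seconds)


def fake_tts(text: str) -> list:
    """Hypothetical stand-in for a TTS model: returns 1 second of silence."""
    return [0.0] * 16000  # assumed sample rate of 16000 Hz


rtf = measure_rtf(fake_tts, "Hello from a placeholder synthesizer.", 16000)
```

With a real model, `fake_tts` would be replaced by a thin wrapper around the model's own generation call, and the measurement is best averaged over several inputs, since the first call often includes one-off warm-up cost.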

Programming Language: Python.
Categories:

Text to Speech

By OD Group OU – Registry code: 1609791 – VAT number: EE102345621.