darios committed
Commit 316ca1b · verified · 1 Parent(s): 2ff4e70

Update README.md

Files changed (1)
  1. README.md +94 -56
README.md CHANGED
@@ -11,11 +11,19 @@ license: apache-2.0
  object-position: center top;">
  </div>

- Zonos-v0.1 is a leading open-weight text-to-speech model, delivering expressiveness and quality on par with—or even surpassing—top TTS providers.

- It enables highly naturalistic speech generation from text prompts when given a speaker embedding or audio prefix. With just 5 to 30 seconds of speech, Zonos can achieve high-fidelity voice cloning. It also allows conditioning based on speaking rate, pitch variation, audio quality, and emotions such as sadness, fear, anger, happiness, and joy. The model outputs speech natively at 44kHz.

- Trained on approximately 200,000 hours of primarily English speech data, Zonos follows a straightforward architecture: text normalization and phonemization via eSpeak, followed by DAC token prediction through a transformer or hybrid backbone. An architecture overview can be seen below.

  <div align="center">
  <img src="https://github.com/Zyphra/Zonos/blob/main/assets/ArchitectureDiagram.png?raw=true"
@@ -25,84 +33,114 @@ Trained on approximately 200,000 hours of primarily English speech data, Zonos f
  object-position: center top;">
  </div>

- Read more about our models [here](https://www.zyphra.com/post/beta-release-of-zonos-v0-1).

- ## Features
- * Zero-shot TTS with voice cloning: Input desired text and a 10-30s speaker sample to generate high quality TTS output
- * Audio prefix inputs: Add text plus an audio prefix for even richer speaker matching. Audio prefixes can be used to elicit behaviours such as whispering which are challenging to obtain from pure voice cloning
- * Multilingual support: Zonos-v0.1 supports English, Japanese, Chinese, French, and German
- * Audio quality and emotion control: Zonos offers fine-grained control of many aspects of the generated audio. These include speaking rate, pitch, maximum frequency, audio quality, and various emotions such as happiness, anger, sadness, and fear.
- * Fast: our model runs with a real-time factor of ~2x on an RTX 4090
- * WebUI gradio interface: Zonos comes packaged with an easy to use gradio interface to generate speech
- * Simple installation and deployment: Zonos can be installed and deployed simply using the docker file packaged with our repository.

- ## Docker Installation

- ```bash
- git clone git@github.com:Zyphra/Zonos.git
- cd Zonos

- # For gradio
- docker compose up

- # Or for development you can do
- docker build -t Zonos .
- docker run -it --gpus=all --net=host -v /path/to/Zonos:/Zonos -t Zonos
- cd /Zonos
- python3 sample.py # this will generate a sample.wav in /Zonos
  ```

- ## DIY Installation
- ### eSpeak

  ```bash
- apt install espeak-ng
  ```

- ### Python dependencies

- Make sure you have a recent version of [uv](https://docs.astral.sh/uv/#installation), then run the following commands in sequence:

  ```bash
- uv venv
- uv sync --no-group main
- uv sync
  ```

- ## Usage example

  ```bash
- Python3 sample.py
  ```
- This will produce `sample.wav` in the `Zonos` directory.

- ## Getting started with Zonos in python
- Once you have Zonos installed try generating audio programmatically in python
- ```python3
- import torch
- import torchaudio
- from zonos.model import Zonos
- from zonos.conditioning import make_cond_dict

- # Use the hybrid with "Zyphra/Zonos-v0.1-hybrid"
- model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")
- model.bfloat16()

- wav, sampling_rate = torchaudio.load("./exampleaudio.mp3")
- spk_embedding = model.embed_spk_audio(wav, sampling_rate)

- torch.manual_seed(421)

- cond_dict = make_cond_dict(
-     text="Hello, world!",
-     speaker=spk_embedding.to(torch.bfloat16),
-     language="en-us",
- )
- conditioning = model.prepare_conditioning(cond_dict)

- codes = model.generate(conditioning)

- wavs = model.autoencoder.decode(codes).cpu()
- torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)
  ```
  object-position: center top;">
  </div>

+ ---
+
+ Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS providers.
+
+ Our model enables highly natural speech generation from text prompts when given a speaker embedding or audio prefix, and can accurately perform speech cloning when given a reference clip spanning just a few seconds. The conditioning setup also allows for fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44kHz.

+ ##### For more details and speech samples, check out our blog [here](https://www.zyphra.com/post/beta-release-of-zonos-v0-1)

+ ##### We also have a hosted version available at [maia.zyphra.com/audio](https://maia.zyphra.com/audio)
+
+ ---
+
+ Zonos follows a straightforward architecture: text normalization and phonemization via eSpeak, followed by DAC token prediction through a transformer or hybrid backbone. An overview of the architecture can be seen below.
  <div align="center">
  <img src="https://github.com/Zyphra/Zonos/blob/main/assets/ArchitectureDiagram.png?raw=true"
  object-position: center top;">
  </div>

+ ---

+ ## Usage

+ ### Python

+ ```python
+ import torch
+ import torchaudio
+ from zonos.model import Zonos
+ from zonos.conditioning import make_cond_dict

+ # model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-hybrid", device="cuda")
+ model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

+ wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
+ speaker = model.make_speaker_embedding(wav, sampling_rate)

+ cond_dict = make_cond_dict(text="Hello, world!", speaker=speaker, language="en-us")
+ conditioning = model.prepare_conditioning(cond_dict)
+
+ codes = model.generate(conditioning)
+
+ wavs = model.autoencoder.decode(codes).cpu()
+ torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)
  ```
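Note that `model.generate` samples tokens, so two runs of the example above will produce different audio. The earlier revision of this README (removed in this diff) fixed a seed with `torch.manual_seed(421)` before generating; the same trick still makes runs repeatable. A minimal self-contained sketch, with an arbitrary seed value:

```python
import torch

# Fix the global RNG seed before calling model.generate(...) so that
# repeated runs sample the same token sequence (seed value is arbitrary).
torch.manual_seed(421)
a = torch.rand(3)

# Reseeding reproduces the identical random draw, confirming determinism.
torch.manual_seed(421)
b = torch.rand(3)
print(bool(torch.equal(a, b)))  # True
```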

+ ### Gradio interface (recommended)

  ```bash
+ uv run gradio_interface.py
+ # python gradio_interface.py
  ```

+ This should produce a `sample.wav` file in your project root directory.
+
+ _For repeated sampling we highly recommend using the gradio interface instead, as the minimal example needs to load the model every time it is run._
+
+ ## Features
+
+ - Zero-shot TTS with voice cloning: Input desired text and a 10-30s speaker sample to generate high quality TTS output
+ - Audio prefix inputs: Add text plus an audio prefix for even richer speaker matching. Audio prefixes can be used to elicit behaviours such as whispering which can otherwise be challenging to replicate when cloning from speaker embeddings
+ - Multilingual support: Zonos-v0.1 supports English, Japanese, Chinese, French, and German
+ - Audio quality and emotion control: Zonos offers fine-grained control of many aspects of the generated audio. These include speaking rate, pitch, maximum frequency, audio quality, and various emotions such as happiness, anger, sadness, and fear.
+ - Fast: our model runs with a real-time factor of ~2x on an RTX 4090
+ - Gradio WebUI: Zonos comes packaged with an easy to use gradio interface to generate speech
+ - Simple installation and deployment: Zonos can be installed and deployed simply using the docker file packaged with our repository.
+
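To put the speed bullet in perspective: assuming "real-time factor of ~2x" means audio is produced twice as fast as it plays back, expected synthesis time is simply audio duration divided by the RTF. A small illustrative helper (not part of the Zonos API):

```python
def estimated_synthesis_seconds(audio_seconds: float, rtf: float = 2.0) -> float:
    """Estimated wall-clock time to synthesize `audio_seconds` of speech,
    assuming RTF = (audio duration) / (compute time)."""
    return audio_seconds / rtf

# A 30-second clip at RTF ~2x would take roughly 15 seconds on an RTX 4090.
print(estimated_synthesis_seconds(30.0))  # 15.0
```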
+ ## Installation
+
+ **At the moment this repository only supports Linux systems (preferably Ubuntu 22.04/24.04) with recent NVIDIA GPUs (3000-series or newer, 6GB+ VRAM).**

+ See also [Docker Installation](#docker-installation)
+
+ #### System dependencies
+
+ Zonos depends on the eSpeak library for phonemization. You can install it on Ubuntu with the following command:

  ```bash
+ apt install -y espeak-ng
  ```

+ #### Python dependencies
+
+ We highly recommend using a recent version of [uv](https://docs.astral.sh/uv/#installation) for installation. If you don't have uv installed, you can install it via pip: `pip install -U uv`.
+
+ ##### Installing into a new uv virtual environment (recommended)

  ```bash
+ uv sync
+ uv sync --extra compile
  ```

+ ##### Installing into the system/activated environment using uv

+ ```bash
+ uv pip install -e .
+ uv pip install -e .[compile]
+ ```

+ ##### Installing into the system/activated environment using pip

+ ```bash
+ pip install -e .
+ pip install --no-build-isolation -e .[compile]
+ ```

+ ##### Confirm that it's working

+ For convenience we provide a minimal example to check that the installation works:

+ ```bash
+ uv run sample.py
+ # python sample.py
  ```
+
+ ## Docker installation
+
+ ```bash
+ git clone https://github.com/Zyphra/Zonos.git
+ cd Zonos
+
+ # For gradio
+ docker compose up
+
+ # Or for development you can do
+ docker build -t Zonos .
+ docker run -it --gpus=all --net=host -v /path/to/Zonos:/Zonos -t Zonos
+ cd /Zonos
+ python sample.py # this will generate a sample.wav in /Zonos
+ ```