darios committed
Commit 316ca1b · verified · 1 Parent(s): 2ff4e70

Update README.md

Files changed (1)
  1. README.md +94 -56
README.md CHANGED
@@ -11,11 +11,19 @@ license: apache-2.0
  object-position: center top;">
  </div>

- Zonos-v0.1 is a leading open-weight text-to-speech model, delivering expressiveness and quality on par with—or even surpassing—top TTS providers.

- It enables highly naturalistic speech generation from text prompts when given a speaker embedding or audio prefix. With just 5 to 30 seconds of speech, Zonos can achieve high-fidelity voice cloning. It also allows conditioning based on speaking rate, pitch variation, audio quality, and emotions such as sadness, fear, anger, happiness, and joy. The model outputs speech natively at 44kHz.

- Trained on approximately 200,000 hours of primarily English speech data, Zonos follows a straightforward architecture: text normalization and phonemization via eSpeak, followed by DAC token prediction through a transformer or hybrid backbone. An architecture overview can be seen below.

  <div align="center">
  <img src="https://github.com/Zyphra/Zonos/blob/main/assets/ArchitectureDiagram.png?raw=true"
@@ -25,84 +33,114 @@ Trained on approximately 200,000 hours of primarily English speech data, Zonos f
  object-position: center top;">
  </div>

- Read more about our models [here](https://www.zyphra.com/post/beta-release-of-zonos-v0-1).

- ## Features
- * Zero-shot TTS with voice cloning: Input desired text and a 10-30s speaker sample to generate high quality TTS output
- * Audio prefix inputs: Add text plus an audio prefix for even richer speaker matching. Audio prefixes can be used to elicit behaviours such as whispering which are challenging to obtain from pure voice cloning
- * Multilingual support: Zonos-v0.1 supports English, Japanese, Chinese, French, and German
- * Audio quality and emotion control: Zonos offers fine-grained control of many aspects of the generated audio. These include speaking rate, pitch, maximum frequency, audio quality, and various emotions such as happiness, anger, sadness, and fear.
- * Fast: our model runs with a real-time factor of ~2x on an RTX 4090
- * WebUI gradio interface: Zonos comes packaged with an easy to use gradio interface to generate speech
- * Simple installation and deployment: Zonos can be installed and deployed simply using the docker file packaged with our repository.

- ## Docker Installation

- ```bash
- git clone git@github.com:Zyphra/Zonos.git
- cd Zonos

- # For gradio
- docker compose up

- # Or for development you can do
- docker build -t Zonos .
- docker run -it --gpus=all --net=host -v /path/to/Zonos:/Zonos -t Zonos
- cd /Zonos
- python3 sample.py # this will generate a sample.wav in /Zonos
  ```

- ## DIY Installation
- ### eSpeak

  ```bash
- apt install espeak-ng
  ```

- ### Python dependencies

- Make sure you have a recent version of [uv](https://docs.astral.sh/uv/#installation), then run the following commands in sequence:

  ```bash
- uv venv
- uv sync --no-group main
- uv sync
  ```

- ## Usage example

  ```bash
- Python3 sample.py
  ```
- This will produce `sample.wav` in the `Zonos` directory.

- ## Getting started with Zonos in python
- Once you have Zonos installed try generating audio programmatically in python
- ```python3
- import torch
- import torchaudio
- from zonos.model import Zonos
- from zonos.conditioning import make_cond_dict

- # Use the hybrid with "Zyphra/Zonos-v0.1-hybrid"
- model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")
- model.bfloat16()

- wav, sampling_rate = torchaudio.load("./exampleaudio.mp3")
- spk_embedding = model.embed_spk_audio(wav, sampling_rate)

- torch.manual_seed(421)

- cond_dict = make_cond_dict(
-     text="Hello, world!",
-     speaker=spk_embedding.to(torch.bfloat16),
-     language="en-us",
- )
- conditioning = model.prepare_conditioning(cond_dict)

- codes = model.generate(conditioning)

- wavs = model.autoencoder.decode(codes).cpu()
- torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)
  ```
  object-position: center top;">
  </div>

+ ---
+
+ Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS providers.
+
+ Our model enables highly natural speech generation from text prompts when given a speaker embedding or audio prefix, and can accurately perform speech cloning when given a reference clip spanning just a few seconds. The conditioning setup also allows for fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44kHz.

+ ##### For more details and speech samples, check out our blog [here](https://www.zyphra.com/post/beta-release-of-zonos-v0-1)

+ ##### We also have a hosted version available at [maia.zyphra.com/audio](https://maia.zyphra.com/audio)
+
+ ---
+
+ Zonos follows a straightforward architecture: text normalization and phonemization via eSpeak, followed by DAC token prediction through a transformer or hybrid backbone. An overview of the architecture can be seen below.
  <div align="center">
  <img src="https://github.com/Zyphra/Zonos/blob/main/assets/ArchitectureDiagram.png?raw=true"
  object-position: center top;">
  </div>

+ ---

+ ## Usage

+ ### Python

+ ```python
+ import torch
+ import torchaudio
+ from zonos.model import Zonos
+ from zonos.conditioning import make_cond_dict

+ # model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-hybrid", device="cuda")
+ model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

+ wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
+ speaker = model.make_speaker_embedding(wav, sampling_rate)

+ cond_dict = make_cond_dict(text="Hello, world!", speaker=speaker, language="en-us")
+ conditioning = model.prepare_conditioning(cond_dict)
+
+ codes = model.generate(conditioning)
+
+ wavs = model.autoencoder.decode(codes).cpu()
+ torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)
  ```
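Note that `model.generate` samples tokens, so two runs of the example above will produce different audio. The earlier revision of this README (removed in this diff) fixed a seed with `torch.manual_seed(421)` before generating; the same trick still makes runs repeatable. A minimal self-contained sketch, with an arbitrary seed value:

```python
import torch

# Fix the global RNG seed before calling model.generate(...) so that
# repeated runs sample the same token sequence (seed value is arbitrary).
torch.manual_seed(421)
a = torch.rand(3)

# Reseeding reproduces the identical random draw, confirming determinism.
torch.manual_seed(421)
b = torch.rand(3)
print(bool(torch.equal(a, b)))  # True
```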

+ ### Gradio interface (recommended)

  ```bash
+ uv run gradio_interface.py
+ # python gradio_interface.py
  ```

+ This should produce a `sample.wav` file in your project root directory.
+
+ _For repeated sampling we highly recommend using the gradio interface instead, as the minimal example needs to load the model every time it is run._
+
+ ## Features
+
+ - Zero-shot TTS with voice cloning: Input desired text and a 10-30s speaker sample to generate high quality TTS output
+ - Audio prefix inputs: Add text plus an audio prefix for even richer speaker matching. Audio prefixes can be used to elicit behaviours such as whispering which can otherwise be challenging to replicate when cloning from speaker embeddings
+ - Multilingual support: Zonos-v0.1 supports English, Japanese, Chinese, French, and German
+ - Audio quality and emotion control: Zonos offers fine-grained control of many aspects of the generated audio. These include speaking rate, pitch, maximum frequency, audio quality, and various emotions such as happiness, anger, sadness, and fear.
+ - Fast: our model runs with a real-time factor of ~2x on an RTX 4090
+ - Gradio WebUI: Zonos comes packaged with an easy to use gradio interface to generate speech
+ - Simple installation and deployment: Zonos can be installed and deployed simply using the docker file packaged with our repository.
+
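To put the speed bullet in perspective: assuming "real-time factor of ~2x" means audio is produced twice as fast as it plays back, expected synthesis time is simply audio duration divided by the RTF. A small illustrative helper (not part of the Zonos API):

```python
def estimated_synthesis_seconds(audio_seconds: float, rtf: float = 2.0) -> float:
    """Estimated wall-clock time to synthesize `audio_seconds` of speech,
    assuming RTF = (audio duration) / (compute time)."""
    return audio_seconds / rtf

# A 30-second clip at RTF ~2x would take roughly 15 seconds on an RTX 4090.
print(estimated_synthesis_seconds(30.0))  # 15.0
```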
+ ## Installation
+
+ **At the moment this repository only supports Linux systems (preferably Ubuntu 22.04/24.04) with recent NVIDIA GPUs (3000-series or newer, 6GB+ VRAM).**

+ See also [Docker Installation](#docker-installation)
+
+ #### System dependencies
+
+ Zonos depends on the eSpeak library for phonemization. You can install it on Ubuntu with the following command:

  ```bash
+ apt install -y espeak-ng
  ```

+ #### Python dependencies
+
+ We highly recommend using a recent version of [uv](https://docs.astral.sh/uv/#installation) for installation. If you don't have uv installed, you can install it via pip: `pip install -U uv`.
+
+ ##### Installing into a new uv virtual environment (recommended)

  ```bash
+ uv sync
+ uv sync --extra compile
  ```

+ ##### Installing into the system/activated environment using uv

+ ```bash
+ uv pip install -e .
+ uv pip install -e .[compile]
+ ```

+ ##### Installing into the system/activated environment using pip

+ ```bash
+ pip install -e .
+ pip install --no-build-isolation -e .[compile]
+ ```

+ ##### Confirm that it's working

+ For convenience we provide a minimal example to check that the installation works:

+ ```bash
+ uv run sample.py
+ # python sample.py
  ```
+
+ ## Docker installation
+
+ ```bash
+ git clone https://github.com/Zyphra/Zonos.git
+ cd Zonos
+
+ # For gradio
+ docker compose up
+
+ # Or for development you can do
+ docker build -t Zonos .
+ docker run -it --gpus=all --net=host -v /path/to/Zonos:/Zonos -t Zonos
+ cd /Zonos
+ python sample.py # this will generate a sample.wav in /Zonos
+ ```