Commit 3341ef1 by WhaleDolphin · 1 Parent(s): 4c5c5fc

Update model

Files changed (27)
  1. README.md +505 -0
  2. dump/raw/org/tr_no_dev/spk2sid +105 -0
  3. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/203epoch.pth +3 -0
  4. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/config.yaml +422 -0
  5. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_backward_time.png +0 -0
  6. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_fake_loss.png +0 -0
  7. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_forward_time.png +0 -0
  8. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_loss.png +0 -0
  9. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_optim_step_time.png +0 -0
  10. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_real_loss.png +0 -0
  11. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_train_time.png +0 -0
  12. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_adv_loss.png +0 -0
  13. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_backward_time.png +0 -0
  14. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_dur_loss.png +0 -0
  15. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_feat_match_loss.png +0 -0
  16. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_forward_time.png +0 -0
  17. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_kl_loss.png +0 -0
  18. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_loss.png +0 -0
  19. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_mel_loss.png +0 -0
  20. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_optim_step_time.png +0 -0
  21. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_train_time.png +0 -0
  22. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/gpu_max_cached_mem_GB.png +0 -0
  23. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/iter_time.png +0 -0
  24. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/optim0_lr0.png +0 -0
  25. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/optim1_lr0.png +0 -0
  26. exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/train_time.png +0 -0
  27. meta.yaml +8 -0
README.md ADDED
@@ -0,0 +1,505 @@
+ ---
+ tags:
+ - espnet
+ - audio
+ - text-to-speech
+ language: en
+ datasets:
+ - genshin
+ license: cc-by-4.0
+ ---
+
+ ## ESPnet2 TTS model
+
+ ### `WhaleDolphin/Genshin-vits-espnet2`
+
+ This model was trained by Whale-Dolphin using genshin recipe in [espnet](https://github.com/espnet/espnet/).
+
+ ### Demo: How to use in ESPnet2
+
+ Follow the [ESPnet installation instructions](https://espnet.github.io/espnet/installation.html)
+ if you haven't done that already.
+
+ ```bash
+ cd espnet
+ git checkout 0fa63ed0a4dae8ac19fd489ff1a14a9b8a98dd64
+ pip install -e .
+ cd egs2/genshin/tts1
+ ./run.sh --skip_data_prep false --skip_train true --download_model WhaleDolphin/Genshin-vits-espnet2
+ ```
+
+
+
+ ## TTS config
+
+ <details><summary>expand</summary>
+
+ ```
+ config: ./conf/tuning/train_full_band_multi_spk_vits.yaml
+ print_config: false
+ log_level: INFO
+ drop_last_iter: false
+ dry_run: false
+ iterator_type: sequence
+ valid_iterator_type: null
+ output_dir: exp//tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space
+ ngpu: 1
+ seed: 777
+ num_workers: 6
+ num_att_plot: 3
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: 8
+ dist_rank: 0
+ local_rank: 0
+ dist_master_addr: localhost
+ dist_master_port: 59239
+ dist_launcher: null
+ multiprocessing_distributed: true
+ unused_parameters: true
+ sharded_ddp: false
+ use_deepspeed: false
+ deepspeed_config: null
+ cudnn_enabled: true
+ cudnn_benchmark: true
+ cudnn_deterministic: false
+ use_tf32: false
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 1000
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ -   - train
+     - total_count
+     - max
+ keep_nbest_models: 10
+ nbest_averaging_interval: 0
+ grad_clip: -1
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 1
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: false
+ log_interval: 50
+ use_matplotlib: true
+ use_tensorboard: true
+ create_graph_in_tensorboard: false
+ use_wandb: false
+ wandb_project: null
+ wandb_id: null
+ wandb_entity: null
+ wandb_name: null
+ wandb_model_log_interval: -1
+ detect_anomaly: false
+ use_adapter: false
+ adapter: lora
+ save_strategy: all
+ adapter_conf: {}
+ pretrain_path: null
+ init_param: []
+ ignore_init_mismatch: false
+ freeze_param: []
+ num_iters_per_epoch: 1000
+ batch_size: 20
+ valid_batch_size: null
+ batch_bins: 50000000
+ valid_batch_bins: null
+ category_sample_size: 10
+ train_shape_file:
+ - exp//tts_stats_raw_linear_spectrogram_phn_tacotron_g2p_en_no_space/train/text_shape.phn
+ - exp//tts_stats_raw_linear_spectrogram_phn_tacotron_g2p_en_no_space/train/speech_shape
+ valid_shape_file:
+ - exp//tts_stats_raw_linear_spectrogram_phn_tacotron_g2p_en_no_space/valid/text_shape.phn
+ - exp//tts_stats_raw_linear_spectrogram_phn_tacotron_g2p_en_no_space/valid/speech_shape
+ batch_type: numel
+ valid_batch_type: null
+ fold_length:
+ - 150
+ - 409600
+ sort_in_batch: descending
+ shuffle_within_batch: false
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 500
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 1024
+ chunk_excluded_key_prefixes: []
+ chunk_default_fs: null
+ chunk_max_abs_length: null
+ chunk_discard_short_samples: true
+ train_data_path_and_name_and_type:
+ -   - dump//raw/tr_no_dev/text
+     - text
+     - text
+ -   - dump//raw/tr_no_dev/wav.scp
+     - speech
+     - sound
+ -   - dump//raw/tr_no_dev/utt2sid
+     - sids
+     - text_int
+ valid_data_path_and_name_and_type:
+ -   - dump//raw/dev/text
+     - text
+     - text
+ -   - dump//raw/dev/wav.scp
+     - speech
+     - sound
+ -   - dump//raw/dev/utt2sid
+     - sids
+     - text_int
+ multi_task_dataset: false
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ max_cache_fd: 32
+ allow_multi_rates: false
+ valid_max_cache_size: null
+ exclude_weight_decay: false
+ exclude_weight_decay_conf: {}
+ optim: adamw
+ optim_conf:
+     lr: 0.0002
+     betas:
+     - 0.8
+     - 0.99
+     eps: 1.0e-09
+     weight_decay: 0.0
+ scheduler: exponentiallr
+ scheduler_conf:
+     gamma: 0.999875
+ optim2: adamw
+ optim2_conf:
+     lr: 0.0002
+     betas:
+     - 0.8
+     - 0.99
+     eps: 1.0e-09
+     weight_decay: 0.0
+ scheduler2: exponentiallr
+ scheduler2_conf:
+     gamma: 0.999875
+ generator_first: false
+ skip_discriminator_prob: 0.0
+ token_list:
+ - <blank>
+ - <unk>
+ - T
+ - N
+ - AH0
+ - S
+ - R
+ - L
+ - D
+ - M
+ - IH1
+ - K
+ - DH
+ - EH1
+ - AH1
+ - Z
+ - UW1
+ - AE1
+ - W
+ - AY1
+ - IH0
+ - IY1
+ - P
+ - V
+ - F
+ - ','
+ - B
+ - ER0
+ - HH
+ - AA1
+ - EY1
+ - .
+ - IY0
+ - AO1
+ - OW1
+ - Y
+ - NG
+ - G
+ - SH
+ - AW1
+ - '...'
+ - UH1
+ - '?'
+ - JH
+ - TH
+ - '!'
+ - CH
+ - ER1
+ - EY2
+ - IH2
+ - OW0
+ - EH2
+ - AH2
+ - UW0
+ - AY2
+ - AA2
+ - AE2
+ - OY1
+ - OW2
+ - EH0
+ - ZH
+ - AO2
+ - AA0
+ - AO0
+ - UW2
+ - AE0
+ - AY0
+ - AW2
+ - IY2
+ - ''''
+ - ER2
+ - UH0
+ - EY0
+ - UH2
+ - AW0
+ - . ...
+ - OY0
+ - OY2
+ - ..
+ - '... ...'
+ - '... ... ...'
+ - <sos/eos>
+ odim: null
+ model_conf: {}
+ use_preprocessor: true
+ token_type: phn
+ bpemodel: null
+ non_linguistic_symbols: null
+ cleaner: tacotron
+ g2p: g2p_en_no_space
+ feats_extract: linear_spectrogram
+ feats_extract_conf:
+     n_fft: 2048
+     hop_length: 512
+     win_length: null
+ normalize: null
+ normalize_conf: {}
+ tts: vits
+ tts_conf:
+     generator_type: vits_generator
+     generator_params:
+         hidden_channels: 192
+         spks: 4096
+         global_channels: 256
+         segment_size: 32
+         text_encoder_attention_heads: 2
+         text_encoder_ffn_expand: 4
+         text_encoder_blocks: 6
+         text_encoder_positionwise_layer_type: conv1d
+         text_encoder_positionwise_conv_kernel_size: 3
+         text_encoder_positional_encoding_layer_type: rel_pos
+         text_encoder_self_attention_layer_type: rel_selfattn
+         text_encoder_activation_type: swish
+         text_encoder_normalize_before: true
+         text_encoder_dropout_rate: 0.1
+         text_encoder_positional_dropout_rate: 0.0
+         text_encoder_attention_dropout_rate: 0.1
+         use_macaron_style_in_text_encoder: true
+         use_conformer_conv_in_text_encoder: false
+         text_encoder_conformer_kernel_size: -1
+         decoder_kernel_size: 7
+         decoder_channels: 512
+         decoder_upsample_scales:
+         - 8
+         - 8
+         - 2
+         - 2
+         - 2
+         decoder_upsample_kernel_sizes:
+         - 16
+         - 16
+         - 4
+         - 4
+         - 4
+         decoder_resblock_kernel_sizes:
+         - 3
+         - 7
+         - 11
+         decoder_resblock_dilations:
+         -   - 1
+             - 3
+             - 5
+         -   - 1
+             - 3
+             - 5
+         -   - 1
+             - 3
+             - 5
+         use_weight_norm_in_decoder: true
+         posterior_encoder_kernel_size: 5
+         posterior_encoder_layers: 16
+         posterior_encoder_stacks: 1
+         posterior_encoder_base_dilation: 1
+         posterior_encoder_dropout_rate: 0.0
+         use_weight_norm_in_posterior_encoder: true
+         flow_flows: 4
+         flow_kernel_size: 5
+         flow_base_dilation: 1
+         flow_layers: 4
+         flow_dropout_rate: 0.0
+         use_weight_norm_in_flow: true
+         use_only_mean_in_flow: true
+         stochastic_duration_predictor_kernel_size: 3
+         stochastic_duration_predictor_dropout_rate: 0.5
+         stochastic_duration_predictor_flows: 4
+         stochastic_duration_predictor_dds_conv_layers: 3
+         vocabs: 82
+         aux_channels: 1025
+     discriminator_type: hifigan_multi_scale_multi_period_discriminator
+     discriminator_params:
+         scales: 1
+         scale_downsample_pooling: AvgPool1d
+         scale_downsample_pooling_params:
+             kernel_size: 4
+             stride: 2
+             padding: 2
+         scale_discriminator_params:
+             in_channels: 1
+             out_channels: 1
+             kernel_sizes:
+             - 15
+             - 41
+             - 5
+             - 3
+             channels: 128
+             max_downsample_channels: 1024
+             max_groups: 16
+             bias: true
+             downsample_scales:
+             - 2
+             - 2
+             - 4
+             - 4
+             - 1
+             nonlinear_activation: LeakyReLU
+             nonlinear_activation_params:
+                 negative_slope: 0.1
+             use_weight_norm: true
+             use_spectral_norm: false
+         follow_official_norm: false
+         periods:
+         - 2
+         - 3
+         - 5
+         - 7
+         - 11
+         period_discriminator_params:
+             in_channels: 1
+             out_channels: 1
+             kernel_sizes:
+             - 5
+             - 3
+             channels: 32
+             downsample_scales:
+             - 3
+             - 3
+             - 3
+             - 3
+             - 1
+             max_downsample_channels: 1024
+             bias: true
+             nonlinear_activation: LeakyReLU
+             nonlinear_activation_params:
+                 negative_slope: 0.1
+             use_weight_norm: true
+             use_spectral_norm: false
+     generator_adv_loss_params:
+         average_by_discriminators: false
+         loss_type: mse
+     discriminator_adv_loss_params:
+         average_by_discriminators: false
+         loss_type: mse
+     feat_match_loss_params:
+         average_by_discriminators: false
+         average_by_layers: false
+         include_final_outputs: true
+     mel_loss_params:
+         fs: 44100
+         n_fft: 2048
+         hop_length: 512
+         win_length: null
+         window: hann
+         n_mels: 80
+         fmin: 0
+         fmax: null
+         log_base: null
+     lambda_adv: 1.0
+     lambda_mel: 45.0
+     lambda_feat_match: 2.0
+     lambda_dur: 1.0
+     lambda_kl: 1.0
+     sampling_rate: 44100
+     cache_generator_outputs: true
+     plot_pred_mos: false
+     mos_pred_tool: utmos
+ pitch_extract: null
+ pitch_extract_conf: {}
+ pitch_normalize: null
+ pitch_normalize_conf: {}
+ energy_extract: null
+ energy_extract_conf: {}
+ energy_normalize: null
+ energy_normalize_conf: {}
+ required:
+ - output_dir
+ - token_list
+ version: '202412'
+ distributed: true
+ ```
+
+ </details>
+
+
+
+ ### Citing ESPnet
+
+ ```BibTex
+ @inproceedings{watanabe2018espnet,
+ author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+ title={{ESPnet}: End-to-End Speech Processing Toolkit},
+ year={2018},
+ booktitle={Proceedings of Interspeech},
+ pages={2207--2211},
+ doi={10.21437/Interspeech.2018-1456},
+ url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
+ }
+
+
+
+
+ @inproceedings{hayashi2020espnet,
+ title={{Espnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},
+ author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},
+ booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+ pages={7654--7658},
+ year={2020},
+ organization={IEEE}
+ }
+
+
+ ```
+
+ or arXiv:
+
+ ```bibtex
+ @misc{watanabe2018espnet,
+ title={ESPnet: End-to-End Speech Processing Toolkit},
+ author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+ year={2018},
+ eprint={1804.00015},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ ```
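The TTS config embedded in the README fixes several signal-processing sizes that should agree with each other. The sketch below is a sanity check written for this note, not part of the commit; it copies the values from the config and assumes the usual VITS convention that the decoder upsamples one latent frame to `hop_length` waveform samples.

```python
import math

# Values copied from the TTS config in this commit.
sampling_rate = 44100
n_fft = 2048
hop_length = 512
aux_channels = 1025
segment_size = 32
decoder_upsample_scales = [8, 8, 2, 2, 2]

# Assumption: the decoder turns one frame into hop_length samples,
# so the product of the upsample scales should equal hop_length.
total_upsampling = math.prod(decoder_upsample_scales)

# A linear spectrogram from an n_fft-point FFT has n_fft // 2 + 1 bins.
linear_bins = n_fft // 2 + 1

# Duration of one frame and of one training segment.
frame_shift_ms = 1000.0 * hop_length / sampling_rate
segment_samples = segment_size * hop_length
segment_seconds = segment_samples / sampling_rate

print(f"total upsampling:  {total_upsampling} (hop_length = {hop_length})")
print(f"linear bins:       {linear_bins} (aux_channels = {aux_channels})")
print(f"frame shift:       {frame_shift_ms:.2f} ms")
print(f"training segment:  {segment_samples} samples, {segment_seconds:.3f} s")
```

If either equality failed, the generator output length would not line up with the spectrogram frames, which is why these two values are worth checking whenever the sampling rate or FFT size is changed.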
dump/raw/org/tr_no_dev/spk2sid ADDED
@@ -0,0 +1,105 @@
+ <unk> 0
+ ABeiDuo 1
+ ALeiQiNuo 2
+ AiBeiEr 3
+ AiDe 4
+ AiErHaiSen 5
+ AiMeiLiAi 6
+ AnBai 7
+ AoZi 8
+ BaBaLa 9
+ BaZhongShenZi 10
+ BaiZhu 11
+ BanNiTe 12
+ BeiDou 13
+ DaDaLiYa 14
+ DaRouWan 15
+ DaiYinSiLeiBu 16
+ DiAoNa 17
+ DiLuKe 18
+ DiNaZeDai 19
+ DiXiYa 20
+ DuoLi 21
+ FaLuShan 22
+ FeiMiNi 23
+ FeiXieEr 24
+ FengYuanWanYe 25
+ FuNingNa 26
+ GanYu 27
+ HuTao 28
+ HuangLongYiDou 29
+ JiNiQi 30
+ JiaMing 31
+ JiuQiRen 32
+ JiuTiaoShaLuo 33
+ KaQiNa 34
+ KaWei 35
+ KaiSeLin 36
+ KaiYa 37
+ KanDiSi 38
+ KeLai 39
+ KeLi 40
+ KeLuoLinDe 41
+ KeQing 42
+ Kong 43
+ LaiOuSiLi 44
+ LaiYiLa 45
+ LeiDianJiangJun 46
+ LeiZe 47
+ LiSha 48
+ LinNi 49
+ LinNiTe 50
+ LiuLangZhe 51
+ LiuYunJieFengZhenJun 52
+ LuYeYuanPingCang 53
+ LuoShaLiYa 54
+ MaLaNi 55
+ MaWeiKa 56
+ MiKa 57
+ MoNa 58
+ NaWeiLaiTe 59
+ NaWeiYa 60
+ NaXiDa 61
+ NiLu 62
+ NingGuang 63
+ NuoAiEr 64
+ OuFeiNi 65
+ PaiMeng 66
+ PingLaoLao 67
+ QiLiangLiang 68
+ QianTeLaLi 69
+ QianZhi 70
+ Qin 71
+ SaiNuo 72
+ SaiSuoSi 73
+ ShaTang 74
+ ShanHuGongXinHai 75
+ ShenHe 76
+ ShenLiLingHua 77
+ ShenLiLingRen 78
+ ShiDaJiang 79
+ TeLaZuoLi 80
+ TiNaLi 81
+ TuoMa 82
+ WenDi 83
+ WuLang 84
+ XiGeWen 85
+ XiNuoNing 86
+ XiaLuoDi 87
+ XiaWoLei 88
+ XianYun 89
+ XiangLing 90
+ Xiao 91
+ XiaoGong 92
+ XinYan 93
+ XingQiu 94
+ YanFei 95
+ YaoYao 96
+ YeLan 97
+ YiDiYa 98
+ Ying 99
+ YouLa 100
+ YunJin 101
+ ZaoYou 102
+ ZhongLi 103
+ ZhongYun 104
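The `spk2sid` file maps each speaker name to the integer id that the multi-speaker model receives as `sids`. A minimal sketch of reading it follows, assuming each line is `<name> <id>` separated by whitespace; the function name `parse_spk2sid` is hypothetical, and the sample holds only a few of the 105 entries listed above.

```python
def parse_spk2sid(text: str) -> dict[str, int]:
    """Parse 'name id' lines into a name -> speaker-id mapping."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last whitespace run so the id is always the final field.
        name, sid = line.rsplit(maxsplit=1)
        mapping[name] = int(sid)
    return mapping


# A handful of entries copied from the file in this commit.
sample = """\
<unk> 0
ABeiDuo 1
HuTao 28
PaiMeng 66
ZhongYun 104
"""

spk2sid = parse_spk2sid(sample)
print(spk2sid["HuTao"])
```

In practice the text would come from `dump/raw/org/tr_no_dev/spk2sid` in the downloaded repository, and the looked-up id would be passed to inference as the speaker id.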
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/203epoch.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d53f3fe2649859a19382e3dc5dcdb6ef8e4dfb46ae9c1b08e5c7fb7b2e96dbac
+ size 390876366
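The checkpoint is stored as a Git LFS pointer: three `key value` lines giving the spec version, the SHA-256 object id, and the size in bytes. The sketch below parses such a pointer and reports the size in MiB; `parse_lfs_pointer` is a hypothetical helper written for this note, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its fields."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line is "<key> <value>"; split on the first space only.
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer content copied from this commit.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:d53f3fe2649859a19382e3dc5dcdb6ef8e4dfb46ae9c1b08e5c7fb7b2e96dbac
size 390876366
"""

info = parse_lfs_pointer(pointer)
size_mib = info["size_bytes"] / 2**20
print(f"{info['hash_algorithm']} digest, {size_mib:.1f} MiB")
```

After downloading `203epoch.pth`, the digest can be compared against `hashlib.sha256` of the file contents to confirm the download is intact.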
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/config.yaml ADDED
@@ -0,0 +1,422 @@
+ config: ./conf/tuning/train_full_band_multi_spk_vits.yaml
+ print_config: false
+ log_level: INFO
+ drop_last_iter: false
+ dry_run: false
+ iterator_type: sequence
+ valid_iterator_type: null
+ output_dir: exp//tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space
+ ngpu: 1
+ seed: 777
+ num_workers: 6
+ num_att_plot: 3
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: 8
+ dist_rank: 0
+ local_rank: 0
+ dist_master_addr: localhost
+ dist_master_port: 59239
+ dist_launcher: null
+ multiprocessing_distributed: true
+ unused_parameters: true
+ sharded_ddp: false
+ use_deepspeed: false
+ deepspeed_config: null
+ cudnn_enabled: true
+ cudnn_benchmark: true
+ cudnn_deterministic: false
+ use_tf32: false
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 1000
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ -   - train
+     - total_count
+     - max
+ keep_nbest_models: 10
+ nbest_averaging_interval: 0
+ grad_clip: -1
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 1
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: false
+ log_interval: 50
+ use_matplotlib: true
+ use_tensorboard: true
+ create_graph_in_tensorboard: false
+ use_wandb: false
+ wandb_project: null
+ wandb_id: null
+ wandb_entity: null
+ wandb_name: null
+ wandb_model_log_interval: -1
+ detect_anomaly: false
+ use_adapter: false
+ adapter: lora
+ save_strategy: all
+ adapter_conf: {}
+ pretrain_path: null
+ init_param: []
+ ignore_init_mismatch: false
+ freeze_param: []
+ num_iters_per_epoch: 1000
+ batch_size: 20
+ valid_batch_size: null
+ batch_bins: 50000000
+ valid_batch_bins: null
+ category_sample_size: 10
+ train_shape_file:
+ - exp//tts_stats_raw_linear_spectrogram_phn_tacotron_g2p_en_no_space/train/text_shape.phn
+ - exp//tts_stats_raw_linear_spectrogram_phn_tacotron_g2p_en_no_space/train/speech_shape
+ valid_shape_file:
+ - exp//tts_stats_raw_linear_spectrogram_phn_tacotron_g2p_en_no_space/valid/text_shape.phn
+ - exp//tts_stats_raw_linear_spectrogram_phn_tacotron_g2p_en_no_space/valid/speech_shape
+ batch_type: numel
+ valid_batch_type: null
+ fold_length:
+ - 150
+ - 409600
+ sort_in_batch: descending
+ shuffle_within_batch: false
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 500
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 1024
+ chunk_excluded_key_prefixes: []
+ chunk_default_fs: null
+ chunk_max_abs_length: null
+ chunk_discard_short_samples: true
+ train_data_path_and_name_and_type:
+ -   - dump//raw/tr_no_dev/text
+     - text
+     - text
+ -   - dump//raw/tr_no_dev/wav.scp
+     - speech
+     - sound
+ -   - dump//raw/tr_no_dev/utt2sid
+     - sids
+     - text_int
+ valid_data_path_and_name_and_type:
+ -   - dump//raw/dev/text
+     - text
+     - text
+ -   - dump//raw/dev/wav.scp
+     - speech
+     - sound
+ -   - dump//raw/dev/utt2sid
+     - sids
+     - text_int
+ multi_task_dataset: false
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ max_cache_fd: 32
+ allow_multi_rates: false
+ valid_max_cache_size: null
+ exclude_weight_decay: false
+ exclude_weight_decay_conf: {}
+ optim: adamw
+ optim_conf:
+     lr: 0.0002
+     betas:
+     - 0.8
+     - 0.99
+     eps: 1.0e-09
+     weight_decay: 0.0
+ scheduler: exponentiallr
+ scheduler_conf:
+     gamma: 0.999875
+ optim2: adamw
+ optim2_conf:
+     lr: 0.0002
+     betas:
+     - 0.8
+     - 0.99
+     eps: 1.0e-09
+     weight_decay: 0.0
+ scheduler2: exponentiallr
+ scheduler2_conf:
+     gamma: 0.999875
+ generator_first: false
+ skip_discriminator_prob: 0.0
+ token_list:
+ - <blank>
+ - <unk>
+ - T
+ - N
+ - AH0
+ - S
+ - R
+ - L
+ - D
+ - M
+ - IH1
+ - K
+ - DH
+ - EH1
+ - AH1
+ - Z
+ - UW1
+ - AE1
+ - W
+ - AY1
+ - IH0
+ - IY1
+ - P
+ - V
+ - F
+ - ','
+ - B
+ - ER0
+ - HH
+ - AA1
+ - EY1
+ - .
+ - IY0
+ - AO1
+ - OW1
+ - Y
+ - NG
+ - G
+ - SH
+ - AW1
+ - '...'
+ - UH1
+ - '?'
+ - JH
+ - TH
+ - '!'
+ - CH
+ - ER1
+ - EY2
+ - IH2
+ - OW0
+ - EH2
+ - AH2
+ - UW0
+ - AY2
+ - AA2
+ - AE2
+ - OY1
+ - OW2
+ - EH0
+ - ZH
+ - AO2
+ - AA0
+ - AO0
+ - UW2
+ - AE0
+ - AY0
+ - AW2
+ - IY2
+ - ''''
+ - ER2
+ - UH0
+ - EY0
+ - UH2
+ - AW0
+ - . ...
+ - OY0
+ - OY2
+ - ..
+ - '... ...'
+ - '... ... ...'
+ - <sos/eos>
+ odim: null
+ model_conf: {}
+ use_preprocessor: true
+ token_type: phn
+ bpemodel: null
+ non_linguistic_symbols: null
+ cleaner: tacotron
+ g2p: g2p_en_no_space
+ feats_extract: linear_spectrogram
+ feats_extract_conf:
+     n_fft: 2048
+     hop_length: 512
+     win_length: null
+ normalize: null
+ normalize_conf: {}
+ tts: vits
+ tts_conf:
+     generator_type: vits_generator
+     generator_params:
+         hidden_channels: 192
+         spks: 4096
+         global_channels: 256
+         segment_size: 32
+         text_encoder_attention_heads: 2
+         text_encoder_ffn_expand: 4
+         text_encoder_blocks: 6
+         text_encoder_positionwise_layer_type: conv1d
+         text_encoder_positionwise_conv_kernel_size: 3
+         text_encoder_positional_encoding_layer_type: rel_pos
+         text_encoder_self_attention_layer_type: rel_selfattn
+         text_encoder_activation_type: swish
+         text_encoder_normalize_before: true
+         text_encoder_dropout_rate: 0.1
+         text_encoder_positional_dropout_rate: 0.0
+         text_encoder_attention_dropout_rate: 0.1
+         use_macaron_style_in_text_encoder: true
+         use_conformer_conv_in_text_encoder: false
+         text_encoder_conformer_kernel_size: -1
+         decoder_kernel_size: 7
+         decoder_channels: 512
+         decoder_upsample_scales:
+         - 8
+         - 8
+         - 2
+         - 2
+         - 2
+         decoder_upsample_kernel_sizes:
+         - 16
+         - 16
+         - 4
+         - 4
+         - 4
+         decoder_resblock_kernel_sizes:
+         - 3
+         - 7
+         - 11
+         decoder_resblock_dilations:
+         -   - 1
+             - 3
+             - 5
+         -   - 1
+             - 3
+             - 5
+         -   - 1
+             - 3
+             - 5
+         use_weight_norm_in_decoder: true
+         posterior_encoder_kernel_size: 5
+         posterior_encoder_layers: 16
+         posterior_encoder_stacks: 1
+         posterior_encoder_base_dilation: 1
+         posterior_encoder_dropout_rate: 0.0
+         use_weight_norm_in_posterior_encoder: true
+         flow_flows: 4
+         flow_kernel_size: 5
+         flow_base_dilation: 1
+         flow_layers: 4
+         flow_dropout_rate: 0.0
+         use_weight_norm_in_flow: true
+         use_only_mean_in_flow: true
+         stochastic_duration_predictor_kernel_size: 3
+         stochastic_duration_predictor_dropout_rate: 0.5
+         stochastic_duration_predictor_flows: 4
+         stochastic_duration_predictor_dds_conv_layers: 3
+         vocabs: 82
+         aux_channels: 1025
+     discriminator_type: hifigan_multi_scale_multi_period_discriminator
+     discriminator_params:
+         scales: 1
+         scale_downsample_pooling: AvgPool1d
+         scale_downsample_pooling_params:
+             kernel_size: 4
+             stride: 2
+             padding: 2
+         scale_discriminator_params:
+             in_channels: 1
+             out_channels: 1
+             kernel_sizes:
+             - 15
+             - 41
+             - 5
+             - 3
+             channels: 128
+             max_downsample_channels: 1024
+             max_groups: 16
+             bias: true
+             downsample_scales:
+             - 2
+             - 2
+             - 4
+             - 4
+             - 1
+             nonlinear_activation: LeakyReLU
+             nonlinear_activation_params:
+                 negative_slope: 0.1
+             use_weight_norm: true
+             use_spectral_norm: false
+         follow_official_norm: false
+         periods:
+         - 2
+         - 3
+         - 5
+         - 7
+         - 11
+         period_discriminator_params:
+             in_channels: 1
+             out_channels: 1
+             kernel_sizes:
+             - 5
+             - 3
+             channels: 32
+             downsample_scales:
+             - 3
+             - 3
+             - 3
+             - 3
+             - 1
+             max_downsample_channels: 1024
+             bias: true
+             nonlinear_activation: LeakyReLU
+             nonlinear_activation_params:
+                 negative_slope: 0.1
+             use_weight_norm: true
+             use_spectral_norm: false
+     generator_adv_loss_params:
+         average_by_discriminators: false
+         loss_type: mse
+     discriminator_adv_loss_params:
+         average_by_discriminators: false
+         loss_type: mse
+     feat_match_loss_params:
+         average_by_discriminators: false
+         average_by_layers: false
+         include_final_outputs: true
+     mel_loss_params:
+         fs: 44100
+         n_fft: 2048
+         hop_length: 512
+         win_length: null
+         window: hann
+         n_mels: 80
+         fmin: 0
+         fmax: null
+         log_base: null
+     lambda_adv: 1.0
+     lambda_mel: 45.0
+     lambda_feat_match: 2.0
+     lambda_dur: 1.0
+     lambda_kl: 1.0
+     sampling_rate: 44100
+     cache_generator_outputs: true
+     plot_pred_mos: false
+     mos_pred_tool: utmos
+ pitch_extract: null
+ pitch_extract_conf: {}
+ pitch_normalize: null
+ pitch_normalize_conf: {}
+ energy_extract: null
+ energy_extract_conf: {}
+ energy_normalize: null
+ energy_normalize_conf: {}
+ required:
+ - output_dir
+ - token_list
+ version: '202412'
+ distributed: true
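Both optimizers start at `lr: 0.0002` and decay with `exponentiallr` at `gamma: 0.999875`. The sketch below estimates the learning rate at the released checkpoint (`203epoch.pth`) and at `max_epoch`. It assumes the scheduler is stepped once per epoch and that the checkpoint name reflects 203 completed epochs; neither is stated in the commit, and the helper `exponential_lr` is hypothetical.

```python
def exponential_lr(initial_lr: float, gamma: float, epochs_completed: int) -> float:
    """Learning rate after a number of exponential decay steps."""
    return initial_lr * gamma ** epochs_completed


# Values copied from optim_conf / scheduler_conf in config.yaml.
initial_lr = 0.0002
gamma = 0.999875

# Assumption: one scheduler step per epoch.
lr_at_checkpoint = exponential_lr(initial_lr, gamma, 203)
lr_at_max_epoch = exponential_lr(initial_lr, gamma, 1000)

print(f"lr after 203 epochs:  {lr_at_checkpoint:.4e}")
print(f"lr after 1000 epochs: {lr_at_max_epoch:.4e}")
```

Under these assumptions the decay is gentle: the rate has fallen by only a few percent at the released checkpoint, so the model was still training at close to its initial step size.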
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_backward_time.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_fake_loss.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_forward_time.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_loss.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_optim_step_time.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_real_loss.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/discriminator_train_time.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_adv_loss.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_backward_time.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_dur_loss.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_feat_match_loss.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_forward_time.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_kl_loss.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_loss.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_mel_loss.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_optim_step_time.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/generator_train_time.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/gpu_max_cached_mem_GB.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/iter_time.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/optim0_lr0.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/optim1_lr0.png ADDED
exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/images/train_time.png ADDED
meta.yaml ADDED
@@ -0,0 +1,8 @@
+ espnet: '202412'
+ files:
+     model_file: exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/203epoch.pth
+ python: 3.11.10 (main, Oct 3 2024, 07:29:13) [GCC 11.2.0]
+ timestamp: 1735199167.423778
+ torch: 2.5.1+cu124
+ yaml_files:
+     train_config: exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/config.yaml
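The `timestamp` field in `meta.yaml` looks like Unix epoch seconds. Assuming that interpretation (the commit does not state the unit), it can be converted to a readable UTC time with the standard library:

```python
from datetime import datetime, timezone

# Value copied from meta.yaml; assumed to be seconds since the Unix epoch.
timestamp = 1735199167.423778

packed_at = datetime.fromtimestamp(timestamp, tz=timezone.utc)
print(packed_at.strftime("%Y-%m-%d %H:%M:%S %Z"))
```

Passing `tz=timezone.utc` keeps the result independent of the local time zone of the machine running the snippet.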