Priorities

#6
by hexgrad - opened

Owner Priorities

Train the v1 model:

  1. Detach completely from espeak-ng, which requires rebuilding g2p / tokenization from first principles
  2. Migrate training set towards MLLM synthetic audio (4o, Gemini 2 Flash), instead of synthetic audio from traditional TTS models

If both of the above objectives are completed and a training run succeeds, the resulting model would likely earn the v1 name. This is not guaranteed, however, and v1 has no scheduled release date yet. If you want to help with the second objective, growing the synthetic dataset, consider joining the Discord server.

Because my attention is focused on the objectives above, I may not have the bandwidth to address the issues below. Contributions are welcome, including those not listed.

Top Priority

Quality of Life

Long Term

  • Crowdsourced data collection (more on this in a separate post)
  • Explore other architectures. StyleTTS-ZS could warrant attention on the author name alone, but the author appears to have abandoned it and I have not had time to unpack the notebooks yet.

Third Party / Arenas

Again, this list is not comprehensive and will likely evolve over time, so feel free to contribute in ways not specified above if you think it would be helpful.

hexgrad changed discussion title from Philosophy to Priorities

Hi, any plans to port the code to C++ and make a Kokoro.CPP via Georgi Gerganov's GGML lib, for more speed, memory efficiency, and no dependence on overbloated Python libs,
and model quants support?

I clarified where my own priorities currently lie under "Owner Priorities".

This would be nice, but I do not have the bandwidth to do this:

Hi, any plans to port the code to C++ and make a Kokoro.CPP via Georgi Gerganov's GGML lib, for more speed, memory efficiency, and no dependence on overbloated Python libs,

Consider opening an issue in llama.cpp if you think it would be appropriate. I saw this was done recently for OuteTTS in https://github.com/ggerganov/llama.cpp/pull/10784 so maybe it can be done for Kokoro models as well. As mentioned elsewhere, the inference code is deliberately thinned relative to the full StyleTTS2 to (hopefully) improve readability.

and model quants support?

You have to walk before you can run. FP32 inference works, but I have not cracked FP16 inference yet, which is the top priority (after v1 model training). I think FP16 inference should work once fixed: as you can see in the linked issue, the generated samples sound fine when we run inference against a half-precision, 160 MB model file.
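For a rough sense of the numbers involved, here is a minimal sketch of the size arithmetic, plus a stdlib-only illustration of what half precision does to individual values. It assumes the 160 MB file holds only FP16 weights and that 1 MB means 10^6 bytes. The helper names are hypothetical and none of this is the actual inference code.

```python
import struct

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

def param_count_from_file(size_mb, dtype):
    """Estimate parameter count from file size (assumes weights only, 1 MB = 10**6 bytes)."""
    return size_mb * 1_000_000 // BYTES_PER_PARAM[dtype]

def file_size_mb(n_params, dtype):
    """Estimate file size in MB for a given parameter count and precision."""
    return n_params * BYTES_PER_PARAM[dtype] / 1_000_000

def roundtrip_fp16(x):
    """Store a Python float as IEEE 754 half precision and read it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

n_params = param_count_from_file(160, "fp16")
print(n_params)                        # roughly 80 million parameters
print(file_size_mb(n_params, "fp32"))  # the same weights stored as FP32

# Half precision keeps only about 3 significant decimal digits...
print(roundtrip_fp16(0.1))
# ...and very small magnitudes underflow to zero.
print(roundtrip_fp16(1e-8))
```

Storing weights in half precision only halves the file. Running the arithmetic itself in FP16 is a separate matter, since intermediate values can lose precision or underflow as in the last line.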

  • pip installable package

Instead of that, you could also implement the inference pipeline using the transformers API.