---
license: mit
language:
- en
- de
- es
- fr
- hi
- it
- ja
- ko
- pl
- pt
- ru
- tr
- zh
pipeline_tag: text-to-speech
library: bark
tags:
- bark
- audio
- text-to-speech
inference: true
---

# Safetensors files for [Bark](https://huggingface.co/suno/bark)
This repository hosts safetensors files for Suno's Bark. Each file was produced by loading the original model and saving it with `safetensors.torch.save_model`.
The safetensors files contain only the model weights; each model's config is stored in a separate file named `model_name.json`.

All original model files have been converted, and their configs have been extracted into JSON format.
For example, for model `text`, the safetensors file is `text.safetensors`, and the config file is `text.json`.
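
The naming scheme above can be sketched as a small loader. This is only an illustration, assuming the files have already been downloaded into a local directory: the helper names `paired_files` and `load_checkpoint` are hypothetical, and the structure of the config JSON is not described in this card, so it is returned as parsed without interpretation. `safetensors.torch.load_file` is imported lazily so the path helper works even without `safetensors` installed.

```python
import json
from pathlib import Path


def paired_files(model_name: str, root: str = ".") -> tuple[Path, Path]:
    """Return (weights_path, config_path) for a model name such as "text"."""
    base = Path(root)
    return base / f"{model_name}.safetensors", base / f"{model_name}.json"


def load_checkpoint(model_name: str, root: str = "."):
    """Load the raw tensors and the parsed config for one model.

    Assumes `safetensors` and `torch` are installed and both files exist.
    """
    from safetensors.torch import load_file  # lazy import (needs torch)

    weights_path, config_path = paired_files(model_name, root)
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)  # config contents are not interpreted here
    state_dict = load_file(str(weights_path))  # maps tensor name -> torch.Tensor
    return state_dict, config
```

Calling `load_checkpoint("text")` would read `text.safetensors` and `text.json` from the current directory. Rebuilding a usable model additionally requires constructing the matching Bark module from the config, which this sketch leaves out.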

# Bark (from the [original README](https://huggingface.co/suno/bark), modified to exclude information not relevant to this repo)

Bark is a transformer-based text-to-audio model created by [Suno](https://www.suno.ai).
Bark can generate highly realistic, multilingual speech as well as other audio, including music,
background noise and simple sound effects. The model can also produce nonverbal
communications like laughing, sighing and crying. To support the research community,
we are providing access to pretrained model checkpoints ready for inference.

The original GitHub repo and model card can be found [here](https://github.com/suno-ai/bark).

This model is meant for research purposes only.
The model output is not censored and the authors do not endorse the opinions in the generated content.
Use at your own risk.

Two checkpoints are released:
- [small](https://huggingface.co/suno/bark-small)
- [large](https://huggingface.co/suno/bark)

## Model Details

The following is additional information about the models released here.

Bark is a series of three transformer models that turn text into audio.

### Text to semantic tokens
- Input: text, tokenized with the [BERT tokenizer from Hugging Face](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer)
- Output: semantic tokens that encode the audio to be generated

### Semantic to coarse tokens
- Input: semantic tokens
- Output: tokens from the first two codebooks of the [EnCodec codec](https://github.com/facebookresearch/encodec) from Facebook

### Coarse to fine tokens
- Input: the first two codebooks from EnCodec
- Output: 8 codebooks from EnCodec
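
The data flow through the three stages can be sketched with stand-in functions that return random tokens of the right shape. This is not Bark's inference code: the function names are hypothetical, the token and frame counts (50 and 75) are arbitrary, and the vocabulary sizes are taken from the architecture table in this card. The point is only that the coarse stage yields two codebook rows and the fine stage adds six more, for eight in total.

```python
import random

SEMANTIC_VOCAB = 10_000  # output vocab of the text-to-semantic model
CODEBOOK_SIZE = 1_024    # entries per EnCodec codebook
N_COARSE = 2             # codebooks produced by the coarse model
N_TOTAL = 8              # codebooks available after the fine model


def text_to_semantic(text: str, n_tokens: int) -> list[int]:
    """Stand-in for stage 1: returns random semantic token ids."""
    return [random.randrange(SEMANTIC_VOCAB) for _ in range(n_tokens)]


def semantic_to_coarse(semantic: list[int], n_frames: int) -> list[list[int]]:
    """Stand-in for stage 2: returns N_COARSE rows of codebook indices."""
    return [[random.randrange(CODEBOOK_SIZE) for _ in range(n_frames)]
            for _ in range(N_COARSE)]


def coarse_to_fine(coarse: list[list[int]]) -> list[list[int]]:
    """Stand-in for stage 3: keeps the coarse rows and fills in the rest."""
    n_frames = len(coarse[0])
    extra = [[random.randrange(CODEBOOK_SIZE) for _ in range(n_frames)]
             for _ in range(N_TOTAL - len(coarse))]
    return coarse + extra


semantic = text_to_semantic("Hello, world!", n_tokens=50)
coarse = semantic_to_coarse(semantic, n_frames=75)
fine = coarse_to_fine(coarse)
print(len(semantic), len(coarse), len(fine), len(fine[0]))  # 50 2 8 75
```

In the real pipeline, the eight codebook rows would then be passed to the EnCodec decoder to produce a waveform; that step is omitted here.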

### Architecture
| Model | Parameters (small / large) | Attention | Output vocab size |
|:-------------------------:|:----------:|------------|:-----------------:|
| Text to semantic tokens | 80 M / 300 M | Causal | 10,000 |
| Semantic to coarse tokens | 80 M / 300 M | Causal | 2 × 1,024 |
| Coarse to fine tokens | 80 M / 300 M | Non-causal | 6 × 1,024 |

### Release date
April 2023

## Broader Implications
We anticipate that this model's text-to-audio capabilities can be used to improve accessibility tools in a variety of languages.

While we hope that this release will enable users to express their creativity and build applications that are a force
for good, we acknowledge that any text-to-audio model has the potential for dual use. While it is not straightforward
to clone the voices of known people with Bark, it can still be used for nefarious purposes. To further reduce the chances of unintended use of Bark,
we also release a simple classifier to detect Bark-generated audio with high accuracy (see the notebooks section of the main repository).