Solving NaN Tensors and Pickling Errors in a ZeroGPU Space

Community Article Published November 13, 2024

Hi! In this post I'm going to talk about a recent difficulty I had involving an XTTS Space and Hugging Face's ZeroGPU.

The problem taught me a lot about Python, Hugging Face and ZeroGPU, and I hope it can help someone who is going through something similar!


The Space

The Space involved is this: https://huggingface.co/spaces/rrg92/xtts
This Space contains a version of XTTS, a model for text-to-speech (TTS) and voice cloning!
Someone who tried to use the Space commented that voice cloning wasn't working. When I tested it, cloning failed for me too, but everything else worked.
So that you understand the main components involved in this problem, I'll summarize the structure:

  • It's a Gradio Space, using version 5.5.0 of Gradio.

  • There are two main files (modules): xtts and app:

    • xtts is where I put all the imports needed to invoke the XTTS model and the functions that interact directly with it.
    • app is where the Gradio app is located, with its respective event listeners, which invoke the functions of the xtts module.
    • This structure is a small adaptation of the xtts-streaming-server project. I put the API and the model in the same app so I could use Gradio on Hugging Face and, the main benefit, use ZeroGPU!
  • Of all the functions, the ones relevant to the problem are these:

    • xtts.predict_speaker
      This is the function that invokes model inference to clone a voice.
      Basically, it receives the binary of the reference audio file and calculates the voice embeddings by calling the model.
      It writes the audio to a temporary file and invokes model.get_conditioning_latents on it. It returns the resulting embeddings, which can later be sent to xtts.predict_speech as the speaker voice.

    • xtts.predict_speech
      This is the function that converts text to speech.
      Of the parameters it accepts, the most relevant for us are: the text to be converted and the speaker.
      The speaker is a set of embeddings that represent the voice.
      XTTS comes with a range of standard, high-quality studio voices, and we can also generate new embeddings using xtts.predict_speaker.
      One way or another, these are the main parameters. The function returns the binary of the generated audio.

    • app.clone_speaker
      This is the function triggered when someone clicks the button to clone a voice.
      Its first parameter is the reference audio provided by the user in the Gradio interface, as a string containing a file path.
      We then open the file with Python's open function (rb mode) and invoke xtts.predict_speaker, passing the file object returned by open.

    • app.tts
      This is the function invoked when the user clicks the TTS button in the Gradio interface.
      The function performs a series of operations, but it all boils down to determining the text and the embeddings of the speaker chosen in the interface, and then invoking xtts.predict_speech.

And to finish, as I wanted to run the TTS using ZeroGPU, I decorated the xtts.predict_speech function with the @spaces.GPU decorator. This is the official procedure documented by Hugging Face for functions that need the GPU.
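As a minimal sketch of what that decoration looks like: the fallback class below is an assumption added only so the snippet also runs on a machine where the spaces package is not installed, and the function body is a placeholder rather than the Space's real inference code.

```python
try:
    import spaces  # provided on Hugging Face Spaces
except ImportError:
    # Hypothetical stand-in (not the real package): a decorator that returns
    # the function unchanged, so the module still imports locally.
    class spaces:
        @staticmethod
        def GPU(func):
            return func


@spaces.GPU
def predict_speech(parsed_input: dict) -> bytes:
    """Placeholder body; the real function calls model.inference(...)."""
    return parsed_input["text"].encode("utf-8")


print(predict_speech({"text": "hello"}))
```

Only the decorated function is meant to run with the GPU attached; everything else in the module runs outside that context.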

Now you know the Space's structure. Let's dive into the two problems I found!

Problem 1: probability tensor contains either inf, nan or element < 0

The first problem I noticed in the cloning process was the error returned when trying to generate speech with a cloned voice:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 256, in thread_wrapper
    res = future.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/user/app/xtts.py", line 185, in predict_speech
    out = model.inference(
  File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/TTS/tts/models/xtts.py", line 548, in inference
    gpt_codes = self.gpt.generate(
  File "/usr/local/lib/python3.10/site-packages/TTS/tts/layers/xtts/gpt.py", line 592, in generate
    gen = self.gpt_inference.generate(
  File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2215, in generate
    result = self._sample(
  File "/usr/local/lib/python3.10/site-packages/transformers/generation/utils.py", line 3249, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/gradio/queueing.py", line 624, in process_events
    response = await route_utils.call_process_api(
  File "/usr/local/lib/python3.10/site-packages/gradio/route_utils.py", line 323, in call_process_api
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 2015, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 1562, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2441, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 943, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/site-packages/gradio/utils.py", line 865, in wrapper
    response = f(*args, **kwargs)
  File "/home/user/app/app.py", line 218, in tts
    generated_audio = xtts.predict_speech(ipts)
  File "/usr/local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 214, in gradio_handler
    raise res.value
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0


This error only occurred when trying to use a cloned voice, not a studio voice.
It also occurred at TTS time, not when cloning the voice. In other words, it occurred in the xtts.predict_speech function.

Also, in my local tests, I had no problems.

If you look at the Space files, you'll see that there's a Dockerfile.
It is there for when I want to test locally.
If you want to try the Space locally, just run git clone and then docker compose up.

On top of that, the last frame of the stack trace references a file from the spaces library.
All this led me to believe that the cause was something related to ZeroGPU, since that was one of the main differences between the Space and my local environment.

Since the message mentioned tensors and the stack trace pointed to the predict_speech function, the first thing I did was print the voice embeddings. Specifically, I added prints at two points in this function:

@spaces.GPU
def predict_speech(parsed_input: TTSInputs):
    print("device", model.device)
    speaker_embedding = torch.tensor(parsed_input.speaker_embedding).unsqueeze(0).unsqueeze(-1)
    gpt_cond_latent = torch.tensor(parsed_input.gpt_cond_latent).reshape((-1, 1024)).unsqueeze(0)
    
    print(speaker_embedding)
    
    print("latent:")
    print(gpt_cond_latent)


My hope was to confirm that at least some of the tensor values were NaN... And bingo:


Not only was one of the tensor values NaN, but ALL of them were.
If you look at the function, it uses two values that represent the speaker: speaker_embedding and gpt_cond_latent. Both are tensors, and both were entirely NaN. Remember that, for a cloned voice, these tensors are generated by the xtts.predict_speaker function.
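To see why NaN embeddings would surface as a sampling error rather than at cloning time, here is a pure-Python illustration (not what torch does internally): a single NaN among the logits turns every probability into NaN, and the traceback message shows that torch.multinomial rejects such a tensor. The softmax and is_valid_distribution helpers are hypothetical names used only for this sketch.

```python
import math


def softmax(logits):
    """Plain softmax over a list of floats (illustration only)."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_valid_distribution(probs):
    """True only if every entry is a finite, non-negative number."""
    return all(math.isfinite(p) and p >= 0 for p in probs)


healthy = softmax([0.5, 1.0, 2.0])
poisoned = softmax([0.5, float("nan"), 2.0])  # one NaN logit

print(is_valid_distribution(healthy))        # True
print(is_valid_distribution(poisoned))       # False
print(all(math.isnan(p) for p in poisoned))  # True: the NaN spread to every entry
```

The NaN enters the sum in the denominator, so every division returns NaN. That is consistent with the failure showing up only when the first token is sampled.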

So, I decided to go a little deeper into the source, and added the prints directly to the output of this function:

def predict_speaker(wav_file):
    """Compute conditioning inputs from reference audio file."""
    temp_audio_name = next(tempfile._get_candidate_names())
    with open(temp_audio_name, "wb") as temp, torch.inference_mode():
        print("device", model.device)
        temp.write(io.BytesIO(wav_file.read()).getbuffer())
        gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
            temp_audio_name
        )

        # Debug: inspect the raw tensors returned by the model
        print(gpt_cond_latent)
        print(speaker_embedding)

    result = {
        "gpt_cond_latent": gpt_cond_latent.cpu().squeeze().half().tolist(),
        "speaker_embedding": speaker_embedding.cpu().squeeze().half().tolist(),
    }

    # Debug: inspect the serializable result
    print(result)

    return result


And, again, I saw that the tensors were already coming out of model.get_conditioning_latents as NaN.
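Since predict_speaker returns plain nested lists, a small guard could make this kind of failure loud at cloning time instead of at TTS time. The sketch below is one possible way to do it; count_non_finite and check_embeddings are hypothetical helpers, not part of the Space or of XTTS.

```python
import math


def count_non_finite(values):
    """Recursively count NaN/inf entries in a (possibly nested) list of floats."""
    if isinstance(values, (list, tuple)):
        return sum(count_non_finite(v) for v in values)
    return 0 if math.isfinite(values) else 1


def check_embeddings(result):
    """Raise early if any embedding in the result dict contains NaN or inf."""
    for name, values in result.items():
        bad = count_non_finite(values)
        if bad:
            raise ValueError(f"{name} contains {bad} non-finite value(s)")
    return result


good = {"gpt_cond_latent": [[0.1, 0.2], [0.3, 0.4]], "speaker_embedding": [0.5]}
print(check_embeddings(good) is good)  # True

broken = {"gpt_cond_latent": [[float("nan"), 0.2]], "speaker_embedding": [0.5]}
try:
    check_embeddings(broken)
except ValueError as exc:
    print(exc)  # gpt_cond_latent contains 1 non-finite value(s)
```

While the values are still tensors, torch.isnan(tensor).any() answers the same question without converting to lists.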

I went deeper into the XTTS source code to understand how this was done:

[Screenshot: part of the forked coqui-tts code]

As the two calculated embeddings were NaN, I went to the speaker_embedding, which is calculated first. What this function does, basically, is convert the sample rate of the audio and invoke a method of the hifigan_decoder object.
I didn't know about this, but there is a paper about a neural network called HiFi-GAN: https://arxiv.org/abs/2010.05646. From a quick read, I saw that it's a network for synthesizing speech... which makes perfect sense for voice cloning!

Despite my limited knowledge at this level, I noticed that, at this point, the to method is invoked a lot to move the tensors to another device. This made me wonder how this code could be working, considering that there is no GPU involved here, only CPU. Then I remembered a simple detail: the predict_speaker function was running on the CPU, and the predict_speech function on the GPU... I imagined there could be some incompatibility between the two...

This became even stranger when I added logs to see on which device the XTTS model was loaded. This is the snippet:

[Screenshot: the model-loading snippet with the added device logs]

And here is the log that was generated:

[Screenshot: the generated log]

And what caught my attention was the following:

  • The device variable starts with the value "cuda". So far so good, since this is the intention.
  • Right below, there is a check: if CUDA is not available in torch, raise an error...
    But no error is raised...
    This means that, even though the code runs in a ZeroGPU Space outside any decorated function, torch reports that CUDA is available.
  • Then the model is loaded and, as expected, it is on the CPU. The "before" message shows the value "cpu".
  • However, the model is then moved to CUDA and, curiously, the move succeeds... even though this code runs outside the decorator...

That is, I hadn't noticed this before, but the model loads on the GPU without complaint in a ZeroGPU Space, even outside the decorator...

This made me believe that, when a function without the decorator runs, these moves to a device can somehow generate NaN in the tensors. I still haven't figured out exactly why, and I'm running tests in this Space to try to reproduce the scenario: https://huggingface.co/spaces/rrg92/zero-test. When I have updates, I'll post them.
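One way to gather evidence for this kind of test is to record what torch reports about CUDA in different places. The helper below is a hypothetical sketch (cuda_report is not part of the Space); it relies only on torch.cuda.is_available() and torch.cuda.device_count(), and degrades gracefully when torch is not installed.

```python
def cuda_report():
    """Collect what torch believes about CUDA in the current process."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


# Called at module level, i.e. outside any decorated function.
print("at import time:", cuda_report())
```

Calling the same helper again inside a function decorated with @spaces.GPU would allow the two reports to be compared. The ZeroGPU documentation's own example also moves the model to "cuda" at module level and reserves the decorator for the function that runs inference, so a successful move outside the decorator appears to be expected; it does not by itself explain the NaN values.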

To solve the problem at this point, I just added the @spaces.GPU decorator to the predict_speaker function.

This generated the tensors correctly... However, when trying to clone, a new error appeared...

Problem 2: cannot pickle '_io.BufferedReader' object

After adding the decorator to the xtts.predict_speaker function, a new error appeared when I tried to clone a voice:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/spaces/utils.py", line 43, in put
    super().put(obj)
  File "/usr/local/lib/python3.10/multiprocessing/queues.py", line 371, in put
    obj = _ForkingPickler.dumps(obj)
  File "/usr/local/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: cannot pickle '_io.BufferedReader' object


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/gradio/queueing.py", line 624, in process_events
    response = await route_utils.call_process_api(
  File "/usr/local/lib/python3.10/site-packages/gradio/route_utils.py", line 323, in call_process_api
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 2015, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 1562, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2441, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 943, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/site-packages/gradio/utils.py", line 865, in wrapper
    response = f(*args, **kwargs)
  File "/home/user/app/app.py", line 127, in clone_speaker
    embeddings =  xtts.predict_speaker(open(upload_file,"rb"))
  File "/usr/local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 202, in gradio_handler
    worker.arg_queue.put(((args, kwargs), GradioPartialContext.get()))
  File "/usr/local/lib/python3.10/site-packages/spaces/utils.py", line 51, in put
    raise PicklingError(message)
_pickle.PicklingError: cannot pickle '_io.BufferedReader' object

This was the error generated when I tried to clone a voice.

Now it was a pickle error... I didn't know what pickling was, and after some research I understood that it is Python's object serialization, a process I know from other languages.
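A small standalone experiment shows the difference between pickling a file path and pickling an open file handle. The throwaway file below only stands in for an uploaded audio file; nothing here comes from the Space's code.

```python
import os
import pickle
import tempfile

# Create a small throwaway file to stand in for the uploaded audio.
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    tmp.write(b"not real audio")
    path = tmp.name

# A path is a plain string, so it survives a pickle round trip.
restored_path = pickle.loads(pickle.dumps(path))

# An open file handle wraps an OS-level resource, and pickle rejects it.
with open(path, "rb") as handle:
    try:
        pickle.dumps(handle)
        error_message = None
    except TypeError as exc:
        error_message = str(exc)

os.remove(path)
print(restored_path == path)  # True
print(error_message)          # mentions BufferedReader; wording may vary by Python version
```

The object returned by open(..., "rb") is an _io.BufferedReader, which is exactly the type named in the traceback.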

Basically, something in the call to my function couldn't be serialized.
And, as the only thing that had changed was the decorator, I went to look at the decorator code again, in the part where the problem occurs:

[Screenshot: the part of the spaces decorator code where the error occurs]

I saw that the problematic part puts something in a queue... And looking at the code of this queue, which wasn't very complex, I noticed that it basically needs to serialize the objects it receives.

[Screenshot: the queue code in the spaces library]
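The traceback shows the arguments going through worker.arg_queue.put(((args, kwargs), ...)) and then _ForkingPickler.dumps(obj) in multiprocessing/queues.py. A pre-flight check along the following lines could point at the offending argument before the call is made. find_unpicklable is a hypothetical helper written for this sketch and is not part of the spaces library.

```python
import os
from multiprocessing.reduction import ForkingPickler


def find_unpicklable(args, kwargs):
    """Return (label, error) pairs for arguments that cannot be pickled."""
    labelled = [(f"args[{i}]", value) for i, value in enumerate(args)]
    labelled += [(f"kwargs[{key!r}]", value) for key, value in kwargs.items()]
    problems = []
    for label, value in labelled:
        try:
            # Same pickler class that multiprocessing queues use.
            ForkingPickler.dumps(value)
        except Exception as exc:  # pickling can fail with several exception types
            problems.append((label, f"{type(exc).__name__}: {exc}"))
    return problems


# os.devnull is opened only to obtain a real binary file handle.
with open(os.devnull, "rb") as handle:
    problems = find_unpicklable((handle,), {"path": "voice.wav"})

print(problems)
```

Only the file handle is reported; the string passed as path pickles without trouble.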

Since the error message mentioned _io.BufferedReader, and I had seen that the arguments are serialized, I immediately turned to the parameter passed to this function: wav_file. This parameter is the file that the user provided in the interface, specifically an open file object. It is passed in this way by app.clone_speaker:

[Screenshot: the call in app.clone_speaker; the traceback above shows it as xtts.predict_speaker(open(upload_file,"rb"))]

That is, we open the file in binary mode and pass it to the function... so xtts.predict_speaker receives an open file object. I imagined that, instead of passing the file object, I could try passing the path, which is a string. So I rewrote it as follows to maintain compatibility:

[Screenshots: the rewritten code]
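The exact rewrite is in the screenshots, so the following is only a sketch of the idea under stated assumptions: accept either a path string or an already-open binary file, and read the bytes inside the function. read_reference_audio is a hypothetical helper name, not the Space's actual code.

```python
import io
import os
import tempfile


def read_reference_audio(source):
    """Return the audio bytes from a file path or an already-open binary file."""
    if isinstance(source, (str, os.PathLike)):
        # New behaviour: the caller passes a picklable path and we open it here.
        with open(source, "rb") as handle:
            return handle.read()
    # Old behaviour kept for compatibility: the caller passed a file object.
    return source.read()


# Both call styles yield the same bytes.
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    tmp.write(b"fake audio")
    sample_path = tmp.name

from_path = read_reference_audio(sample_path)
from_file = read_reference_audio(io.BytesIO(b"fake audio"))
os.remove(sample_path)
print(from_path == from_file)  # True
```

With something like this in place, the Gradio handler can pass the upload path directly to the decorated function, so only a string has to be pickled.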

And voilà! Cloning started to work! So, in summary, there were two problems:

  • The xtts.predict_speaker function was not decorated with the spaces decorator and, for some reason that I still don't know, instead of raising an error or falling back to the CPU, the model generated tensors full of NaN.
    Resolution: Added the @spaces.GPU decorator to the xtts.predict_speaker function.

  • Running the function under ZeroGPU caused an error due to the type of the parameter, because ZeroGPU pickles the arguments.
    Resolution: Pass the string with the file path and open it inside the xtts.predict_speaker function.

Final Thoughts

Curiously, this sparked a new question for me: how does Hugging Face implement ZeroGPU? I keep wondering whether it dynamically attaches the video card, or moves the machine, or uses a custom driver that intercepts the calls and sends only the request to a machine with ZeroGPU... Anyway, many questions...

I created this Space: https://huggingface.co/spaces/rrg92/zero-test

And in it I'm doing tests to help me answer all the questions that are still left.
Anyway, doing this whole process helped me learn a lot more about Python, PyTorch, Hugging Face, ZeroGPU and XTTS. It was already worth it!

When I have more answers, I'll update this post and/or post a new one!

Also, I realize that my debugging approach (adding print statements and pushing changes from my local repo) is not the best practice. I tried using dev space mode, which is a nice feature, but I encountered some difficulties and ended up choosing that archaic method instead. However, I did learn a few things from the experience and hope to use it more effectively next time.

Special thanks to @p3nGu1nZz for reviewing this article and providing invaluable guidance with numerous tips that have deepened my understanding of AI. They’re currently working on an exciting project called Tau. Be sure to follow them on GitHub to stay updated on their work!

Follow me on GitHub

Thank you for reading!