Inquiry on Fine-Tuning Details and Model Similarities for MelodyFlow

#2
by lfurman - opened

Hi there,

Thank you for making MelodyFlow available for the community! I am currently exploring its capabilities and am highly interested in learning more about the model.

I have a couple of questions:
1. Are there any updates or detailed guidelines on how to fine-tune the MelodyFlow model? Fine-tuning information would be greatly beneficial for adapting the model to specific datasets.
2. Could you confirm whether the MelodyFlow fine-tuning process is similar to that of other Audiocraft models or other models in the same family? If so, are there any key architectural or functional overlaps that would be helpful to know?

Any additional information or resources on these topics would be greatly appreciated. Thank you for your time and effort in advancing AI music generation!

Looking forward to your response!

AI at Meta org

Hey,

  1. We do not plan to release the training code, but fine-tuning should be straightforward given the simplicity of Flow's forward method (https://huggingface.co/spaces/facebook/MelodyFlow/blob/9d0d223e9a63bbb8c20b9f57c5afcb4de297e6da/audiocraft/models/flow.py#L273).
  • Fine-tuning only requires mixing a true latent x with a random normal noise sample n of the same size, using linear interpolation that depends on the flow step t.
  • Namely, during training t was sampled as t = torch.nn.functional.sigmoid(torch.randn(...)), and the linear interpolation gives z = x * t + (1 - t) * n + 1e-5 * torch.randn_like(x).
  • The model target is x - n and the loss is MSE.
  • During training x and n were aligned sample-wise using https://github.com/ivan-chai/torch-linear-assignment (permutation_indices = batch_linear_assignment(pairwise_distances)[0] with pairwise_distances being a [1, B, B] matrix of L2 distances between x and n samples).
  2. MelodyFlow is not an autoregressive language model; it is a flow matching model based on a diffusion transformer. I suspect fine-tuning would differ accordingly: instead of being fine-tuned on the target token distribution, the model would technically be fine-tuned on the velocity prediction task.
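
For anyone trying this, the recipe in point 1 could be sketched roughly as below. This is not the released training code, only a minimal illustration under several assumptions: `ToyVelocityModel` is a hypothetical stand-in for the real Flow model (whose forward signature and conditioning inputs are not reproduced here), the latent shape `[B, C, T]` and one flow step `t` per sample are assumed, and SciPy's `linear_sum_assignment` is swapped in as a CPU substitute for `batch_linear_assignment` from torch-linear-assignment.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


class ToyVelocityModel(torch.nn.Module):
    """Hypothetical stand-in for the Flow model: maps (z, t) to a velocity."""

    def __init__(self, channels: int):
        super().__init__()
        # One extra input channel carries the flow step t.
        self.proj = torch.nn.Conv1d(channels + 1, channels, kernel_size=1)

    def forward(self, z: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        t_feat = t.view(-1, 1, 1).expand(-1, 1, z.shape[-1])
        return self.proj(torch.cat([z, t_feat], dim=1))


def align_noise(x: torch.Tensor, n: torch.Tensor) -> torch.Tensor:
    """Permute the noise batch so that the summed L2 distance to x is minimal."""
    # [B, B] matrix of L2 distances between flattened latents and noise samples.
    cost = torch.cdist(x.flatten(1), n.flatten(1))
    # SciPy substitute (assumption) for batch_linear_assignment; runs on CPU.
    _, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return n[torch.as_tensor(cols, device=n.device)]


def flow_matching_loss(velocity_model, x: torch.Tensor) -> torch.Tensor:
    """MSE between the predicted velocity and the target x - n."""
    b = x.shape[0]
    n = align_noise(x, torch.randn_like(x))
    # Assumption: one flow step per sample, broadcast over the other dims.
    t = torch.sigmoid(torch.randn(b, device=x.device))
    t_b = t.view(b, *([1] * (x.dim() - 1)))
    # Linear interpolation between noise (t = 0) and data (t = 1), plus jitter.
    z = x * t_b + (1 - t_b) * n + 1e-5 * torch.randn_like(x)
    target = x - n
    return F.mse_loss(velocity_model(z, t), target)


if __name__ == "__main__":
    latents = torch.randn(4, 8, 16)  # assumed [B, C, T] latent batch
    model = ToyVelocityModel(channels=8)
    loss = flow_matching_loss(model, latents)
    loss.backward()
    print(float(loss))
```

In a real fine-tuning run, `latents` would come from encoding your audio with the model's latent codec, and the model call would also need whatever text conditioning the actual forward method expects; both are left out of this sketch.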
