arxiv:2301.02262

Singing voice synthesis based on frame-level sequence-to-sequence models considering vocal timing deviation

Published on Jan 5, 2023

Authors:

Yukiya Hono ,

Abstract

This paper proposes singing voice synthesis (SVS) based on frame-level sequence-to-sequence models considering vocal timing deviation. In SVS, it is essential to synchronize the timing of singing with temporal structures represented by scores, taking into account that there are differences between actual vocal timing and note start timing. In many SVS systems including our previous work, phoneme-level score features are converted into frame-level ones on the basis of phoneme boundaries obtained by external aligners to take into account vocal timing deviations. Therefore, the sound quality is affected by the aligner accuracy in this system. To alleviate this problem, we introduce an attention mechanism with frame-level features. In the proposed system, the attention mechanism absorbs alignment errors in phoneme boundaries. Additionally, we evaluate the system with pseudo-phoneme-boundaries defined by heuristic rules based on musical scores when there is no aligner. The experimental results show the effectiveness of the proposed system.
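The conversion from phoneme-level score features to frame-level ones, and the idea of pseudo-phoneme-boundaries derived from the score, can be illustrated with a small sketch. The abstract does not state the heuristic rules used in the paper, so the even split of each note among its phonemes below is purely an assumption for illustration. The function names `pseudo_boundaries` and `expand_to_frames` and the 5 ms frame shift are likewise hypothetical.

```python
import numpy as np


def pseudo_boundaries(note_start, note_end, n_phonemes):
    """Return n_phonemes + 1 boundary times (seconds) for one note.

    Assumed rule for illustration only: the note's span is divided
    evenly among its phonemes. The paper's actual heuristics may differ.
    """
    return np.linspace(note_start, note_end, n_phonemes + 1)


def expand_to_frames(phoneme_feats, boundaries, frame_shift=0.005):
    """Repeat each phoneme-level feature vector over the frames it spans.

    phoneme_feats: array of shape (n_phonemes, feat_dim)
    boundaries:    array of shape (n_phonemes + 1,), in seconds
    frame_shift:   frame period in seconds (assumed value)
    """
    phoneme_feats = np.asarray(phoneme_feats)
    # Convert boundary times to integer frame indices.
    frame_idx = np.round(np.asarray(boundaries) / frame_shift).astype(int)
    # Number of frames covered by each phoneme.
    durations = np.diff(frame_idx)
    return np.repeat(phoneme_feats, durations, axis=0)


# One 0.5 s note carrying two phonemes, each with a 2-dimensional feature.
bounds = pseudo_boundaries(0.0, 0.5, 2)
frames = expand_to_frames([[1.0, 0.0], [0.0, 1.0]], bounds)
```

In a pipeline of this kind, any error in `bounds` shifts which frames receive which phoneme's features. That misalignment is what the abstract says the attention mechanism is meant to absorb.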

