
Robust Singing Voice Transcription Serves Synthesis

Abstract

Note-level Automatic Singing Voice Transcription (AST) converts singing recordings into note sequences, facilitating the automatic annotation of singing datasets for Singing Voice Synthesis (SVS) applications. Current AST methods, however, struggle with accuracy and robustness when used for practical annotation. This paper presents ROSVOT, the first robust AST model that serves SVS, incorporating a multi-scale framework that effectively captures coarse-grained note information and ensures fine-grained frame-level segmentation, coupled with an attention-based pitch decoder for reliable pitch prediction. We also establish a comprehensive annotation-and-training pipeline for SVS to test the model in real-world settings. Experimental findings reveal that the proposed model achieves state-of-the-art transcription accuracy with either clean or noisy inputs. Moreover, when trained on enlarged, automatically annotated datasets, the SVS model outperforms its baseline, affirming its capability for practical application.

SVS Results with Different Ratios of Pseudo Annotations

We train ROSVOT on the M4Singer dataset and use it to generate pseudo annotations for dataset $D_1$. Pseudo annotations are mixed into $D_1$ at different ratios to train the SVS model, RMSSinger. Inference is performed on the test set of $D_1$.
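
For concreteness, here is a minimal sketch of how such a mixture could be assembled; the per-utterance annotation dicts and the `mix_annotations` helper are hypothetical illustrations, not the actual RMSSinger data pipeline.

```python
import random

def mix_annotations(manual_items, pseudo_items, pseudo_ratio, seed=42):
    """Build a D1 training list in which a given fraction of utterances
    carries ROSVOT pseudo notes instead of manual annotations.

    manual_items / pseudo_items: parallel lists of per-utterance
    annotation dicts (same utterances, different note labels).
    pseudo_ratio: e.g. 0.5 for the 50%D1 condition below.
    """
    assert len(manual_items) == len(pseudo_items)
    rng = random.Random(seed)
    n_pseudo = round(len(manual_items) * pseudo_ratio)
    pseudo_ids = set(rng.sample(range(len(manual_items)), n_pseudo))
    return [pseudo_items[i] if i in pseudo_ids else manual_items[i]
            for i in range(len(manual_items))]
```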

  1. <breath> 如 果 那 两 个 字 没 有 颤 抖 <breath> 我 不 会 发 现 我 难 受 <breath> 怎 么 说 出 口 <silence>

    Audio samples: GT | 100%D1 | 50%D1 | 10%D1 | 5%D1 | 0%D1
  2. 再 一 次 沾 染 你 <breath> 若 生 命 <breath> 如 过 场 电 影 <breath>

    Audio samples: GT | 100%D1 | 50%D1 | 10%D1 | 5%D1 | 0%D1
  3. 能 够 握 紧 的 就 别 放 了 <breath> 能 够 拥 抱 的 就 别 拉 扯 <breath> 时 间 着 急 地 <breath>

    Audio samples: GT | 100%D1 | 50%D1 | 10%D1 | 5%D1 | 0%D1
  4. 去 寻 找 遗 失 了 的 思 念 <breath> 如 果 你 在 眼 前 <breath> 我 会 让 你 看 见 <silence>

    Audio samples: GT | 100%D1 | 50%D1 | 10%D1 | 5%D1 | 0%D1

SVS Results with Expanded Datasets

We use MFA and the same ROSVOT model trained on M4Singer to re-align and re-annotate OpenSinger, a multi-singer dataset designed for training vocoders that ships without note annotations. We use M4Singer as the base training set and gradually add dataset $D_1$ and OpenSinger to measure the improvement of the SVS model. Inference is performed on the test sets of $D_1$ and M4Singer. Model M4, however, is only tested on M4Singer (samples 5 and 6), since we do not investigate the SVS model's generalization capability here.
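
As a rough illustration, assembling the expanded training sets might look like the following; the file names and `load_manifest` helper are hypothetical, and we interpret k%D1 as the fraction of $D_1$ utterances carrying pseudo notes.

```python
import json

def load_manifest(path):
    # Hypothetical format: one JSON list of per-utterance annotation dicts.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

m4        = load_manifest("m4singer.json")           # base set, manual notes
d1_manual = load_manifest("d1_manual.json")          # D1, manual notes (0%D1)
d1_pseudo = load_manifest("d1_rosvot.json")          # D1, ROSVOT notes (100%D1)
op        = load_manifest("opensinger_rosvot.json")  # OpenSinger, MFA + ROSVOT

# Each training condition below is a plain concatenation of utterance lists.
conditions = {
    "M4":          m4,
    "M4+100%D1":   m4 + d1_pseudo,
    "M4+0%D1":     m4 + d1_manual,
    "M4+0%D1+OP":  m4 + d1_manual + op,
}
```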

  1. <breath> 如 果 那 两 个 字 没 有 颤 抖 <breath> 我 不 会 发 现 我 难 受 <breath> 怎 么 说 出 口 <silence>

    Audio samples: GT | M4+100%D1 | M4+0%D1 | M4+0%D1+OP
  2. 再 一 次 沾 染 你 <breath> 若 生 命 <breath> 如 过 场 电 影 <breath>

    Audio samples: GT | M4+100%D1 | M4+0%D1 | M4+0%D1+OP
  3. 能 够 握 紧 的 就 别 放 了 <breath> 能 够 拥 抱 的 就 别 拉 扯 <breath> 时 间 着 急 地 <breath>

    Audio samples: GT | M4+100%D1 | M4+0%D1 | M4+0%D1+OP
  4. 去 寻 找 遗 失 了 的 思 念 <breath> 如 果 你 在 眼 前 <breath> 我 会 让 你 看 见 <silence>

    Audio samples: GT | M4+100%D1 | M4+0%D1 | M4+0%D1+OP
  5. 我 想 唱 一 首 歌 给

    Audio samples: GT | M4 | M4+100%D1 | M4+0%D1 | M4+0%D1+OP
  6. 却 是 下 落 不 详 <breath> 心 好 <silence> 空 荡 <breath>

    Audio samples: GT | M4 | M4+100%D1 | M4+0%D1 | M4+0%D1+OP

SVS with English Transcriptions

We use MFA and the same ROSVOT model trained on M4Singer to re-align and re-annotate a small English singing dataset, $D_2$. We fine-tune the M4+0%D1+OP instance from the previous section on this English dataset to test the cross-lingual annotation capability of ROSVOT. Inference is performed on English transcriptions.
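
A rough sketch of this fine-tuning step in generic PyTorch follows; the checkpoint layout, the L1 reconstruction loss, and the learning rate are assumptions for illustration, not the released RMSSinger recipe.

```python
import torch

def finetune(model: torch.nn.Module, loader, ckpt_path: str, steps: int = 2000):
    """Load a pretrained SVS checkpoint and fine-tune it on a new language."""
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state["state_dict"])  # assumed checkpoint layout
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR: adapt, not retrain
    model.train()
    for _, (inputs, targets) in zip(range(steps), loader):
        pred = model(inputs)
        loss = torch.nn.functional.l1_loss(pred, targets)  # placeholder loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```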

  1. I wouldn’t change a thing about it

    Audio samples: GT | RMSSinger
  2. They’ve all been said before you know <breath> so why don’t we <breath> just play pretend <breath>

    Audio samples: GT | RMSSinger