Building the training set
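A training set is the collection of data used to train a neural network model. It usually consists of a large number of inputs paired with their expected outputs. The model learns the relationship between inputs and outputs, adjusting its parameters during training to minimize the error.
That's right: under the hood, so-vits is a neural network architecture, and training a voice (timbre) model is essentially a prediction problem. For background on neural network architecture, see the earlier article "An Analysis of the Underlying Principles of AI and Machine Learning: Artificial Neurons You Can Definitely Understand, Turning AI Jargon into Plain Language"; it is not repeated here.
In addition, training data is about quality rather than quantity. Clear samples that strongly carry the singer's characteristic timbre train better than low-quality samples. Songs the singer covered, or songs sung with an unconventional technique, do carry some of the singer's timbre, but for model training they actually work against you. This is worth keeping in mind.
In deep learning, a large amount of data is usually needed to train a high-performance model. In computer vision, for example, training a convolutional neural network requires a large number of images. In some other tasks, such as speech recognition and natural language processing, a comparatively small amount of data can sometimes be enough.
Beyond quantity, the quality of the training set also matters a great deal. The training set should be free of bias and noise, and preprocessing steps such as data cleaning and data augmentation help improve its quality and diversity.
In short, given the author's hardware and the time cost of training, the training set used here is relatively small. Readers can scale it up or down to suit their own situation.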
Cleaning the training data
Spleeter is recommended for separating the vocals from the accompaniment:
pip3 install spleeter --user
Then run the following command to separate each song in the training set:
spleeter separate -o d:/output/ -p spleeter:2stems d:/数据.mp3
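Here -o is the output directory, -p selects the separation model, and the final argument is the file to separate.
The separated vocal tracks are then collected in one directory:
 Directory of D:\歌曲制作\清唱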
2023/05/11 15:38 <DIR> .
2023/05/11 13:45 <DIR> ..
2023/05/11 13:40 39,651,884 1_1_01. wxs.wav
2023/05/11 15:34 46,103,084 1_1_02. qad_(Vocals_(Vocals.wav
2023/05/11 15:35 43,802,924 1_1_03. hs_(Vocals_(Vocals.wav
2023/05/11 15:36 39,054,764 1_1_04. hope_(Vocals_(Vocals.wav
2023/05/11 15:36 32,849,324 1_1_05. kamen_(Vocals_(Vocals.wav
2023/05/11 15:37 50,741,804 1_1_06. ctrl_(Vocals_(Vocals.wav
6 File(s) 252,203,784 bytes
2 Dir(s) 449,446,780,928 bytes free
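With several songs, the same command can be issued from a short script instead of being typed once per file. The following is a minimal sketch that reuses the flags shown above; the helper names `build_spleeter_command` and `separate_all` and the directory layout are hypothetical, and Spleeter must already be installed and on the PATH for the real call to work.

```python
import subprocess
from pathlib import Path


def build_spleeter_command(song, output_dir):
    """Build the same `spleeter separate` call as above for one song."""
    return [
        "spleeter", "separate",
        "-o", str(output_dir),
        "-p", "spleeter:2stems",
        str(song),
    ]


def separate_all(song_dir, output_dir, run=subprocess.run):
    """Run Spleeter once per .mp3 file found in song_dir (hypothetical layout)."""
    commands = [
        build_spleeter_command(song, output_dir)
        for song in sorted(Path(song_dir).glob("*.mp3"))
    ]
    for command in commands:
        # check=True stops the batch as soon as one separation fails.
        run(command, check=True)
    return commands


# Example (placeholder paths):
# separate_all("d:/songs", "d:/output")
```

The `run` parameter exists only so the function can be tried without Spleeter installed, by passing a stand-in that records the commands.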
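For more on Spleeter, see the earlier article "A Hands-on Guide to the AI Library Spleeter for Free Vocal and Background Music Separation (Python3.10)"; it is not repeated here.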
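The separated vocals can then be denoised. First install noisereduce and soundfile:
pip3 install noisereduce soundfile
Then run the noise reduction: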
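import noisereduce as nr
import soundfile as sf

# Read the audio file
data, rate = sf.read("audio_file.wav")
# Take a stretch of the recording that contains only background noise
# (this assumes a mono signal; for stereo, process each channel separately)
noisy_part = data[10000:15000]
# Apply noise reduction, using that stretch as the noise profile
# (noisereduce 2.x keywords; 1.x used audio_clip= and noise_clip= instead)
reduced_noise = nr.reduce_noise(y=data, sr=rate, y_noise=noisy_part)
# Write the result to a new file
sf.write("audio_file_denoised.wav", reduced_noise, rate)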
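The script first reads the song with soundfile, then takes a noise sample and uses it to run the noise reduction algorithm, and finally writes the result to a new file.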
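The fixed range 10000:15000 only works if that part of the recording really is noise and nothing else. One possible way to choose the range automatically is to look for the quietest stretch of the signal. The sketch below is an illustration only, not part of noisereduce; the function name `quietest_window` is hypothetical, and it assumes a mono sequence of float samples.

```python
def quietest_window(samples, window):
    """Return the start index of the `window`-sample stretch with the least energy."""
    if window <= 0 or window > len(samples):
        raise ValueError("window must be between 1 and len(samples)")
    # Energy (sum of squares) of the first window.
    energy = sum(s * s for s in samples[:window])
    best_energy, best_start = energy, 0
    for start in range(1, len(samples) - window + 1):
        # Slide the window by one sample: add the newest, drop the oldest.
        energy += samples[start + window - 1] ** 2 - samples[start - 1] ** 2
        if energy < best_energy:
            best_energy, best_start = energy, start
    return best_start


# Possible use with the script above (mono data assumed):
# start = quietest_window(data, 5000)
# noisy_part = data[start:start + 5000]
```

This plain Python loop visits every sample, so it can take a while on a full-length song; it is meant to show the idea rather than to be fast.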
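Slicing the training data
During deep learning, the training data is loaded into GPU memory. If the samples are too large, the GPU runs out of memory, the problem commonly known as "blowing up the VRAM". The samples therefore need to be cut into shorter clips.
The audio-slicer library (github.com/openvpi/audio-slicer) can be used for this: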
git clone https://github.com/openvpi/audio-slicer.git
Then write the following code:
import librosa  # Optional. Use any library you like to read audio files.
import soundfile  # Optional. Use any library you like to write audio files.

from slicer2 import Slicer

audio, sr = librosa.load('example.wav', sr=None, mono=False)  # Load an audio file with librosa.
slicer = Slicer(
    sr=sr,
    threshold=-40,
    min_length=5000,
    min_interval=300,
    hop_size=10,
    max_sil_kept=500
)
chunks = slicer.slice(audio)
for i, chunk in enumerate(chunks):
    if len(chunk.shape) > 1:
        chunk = chunk.T  # Swap axes if the audio is stereo.
    soundfile.write(f'clips/example_{i}.wav', chunk, sr)  # Save sliced audio files with soundfile.
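Run over each denoised vocal track, this script cuts the samples into small clips that are easier to train on. Readers with lower-spec machines can cut finer still, for example by lowering min_length and min_interval (both in milliseconds), which yields more and shorter clips; max_sil_kept only controls how much silence is kept around each clip. As the saying goes, "chop it fine into mince".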
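To make the threshold and hop_size parameters more concrete: a threshold of -40 dB corresponds to an amplitude ratio of 10^(-40/20) = 0.01, and the signal is examined in hops of hop_size milliseconds. The sketch below only illustrates that idea of marking hops as silent by their RMS level. It is a simplification written for this article, not audio-slicer's actual implementation, and the function names are hypothetical.

```python
import math


def rms_db(frame):
    """RMS level of a frame of float samples in [-1, 1], in dB relative to full scale."""
    if not frame:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


def silent_frames(samples, sr, hop_ms=10, threshold_db=-40.0):
    """Return one boolean per hop: True if that hop is quieter than the threshold."""
    hop = max(1, sr * hop_ms // 1000)  # hop length in samples
    return [
        rms_db(samples[i:i + hop]) < threshold_db
        for i in range(0, len(samples), hop)
    ]
```

A slicer then cuts inside runs of silent hops that last at least min_interval, which is why audio with only short pauses needs a smaller min_interval to be cut at all.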
 Directory of D:\歌曲制作\slicer
2023/05/11 15:45 <DIR> .
2023/05/11 13:45 <DIR> ..
2023/05/11 15:45 873,224 1_1_01. wxs_0.wav
2023/05/11 15:45 934,964 1_1_01. wxs_1.wav
2023/05/11 15:45 1,039,040 1_1_01. wxs_10.wav
2023/05/11 15:45 1,391,840 1_1_01. wxs_11.wav
2023/05/11 15:45 2,272,076 1_1_01. wxs_12.wav
2023/05/11 15:45 2,637,224 1_1_01. wxs_13.wav
2023/05/11 15:45 1,476,512 1_1_01. wxs_14.wav
2023/05/11 15:45 1,044,332 1_1_01. wxs_15.wav
2023/05/11 15:45 1,809,908 1_1_01. wxs_16.wav
2023/05/11 15:45 887,336 1_1_01. wxs_17.wav
2023/05/11 15:45 952,604 1_1_01. wxs_18.wav
2023/05/11 15:45 989,648 1_1_01. wxs_19.wav
2023/05/11 15:45 957,896 1_1_01. wxs_2.wav
2023/05/11 15:45 231,128 1_1_01. wxs_20.wav
2023/05/11 15:45 1,337,156 1_1_01. wxs_3.wav
2023/05/11 15:45 1,308,932 1_1_01. wxs_4.wav
2023/05/11 15:45 1,035,512 1_1_01. wxs_5.wav
2023/05/11 15:45 2,388,500 1_1_01. wxs_6.wav
2023/05/11 15:45 2,952,980 1_1_01. wxs_7.wav
2023/05/11 15:45 929,672 1_1_01. wxs_8.wav
2023/05/11 15:45 878,516 1_1_01. wxs_9.wav
2023/05/11 15:45 963,188 1_1_02. qad_(Vocals_(Vocals_0.wav
2023/05/11 15:45 901,448 1_1_02. qad_(Vocals_(Vocals_1.wav
2023/05/11 15:45 1,411,244 1_1_02. qad_(Vocals_(Vocals_10.wav
2023/05/11 15:45 2,070,980 1_1_02. qad_(Vocals_(Vocals_11.wav
2023/05/11 15:45 2,898,296 1_1_02. qad_(Vocals_(Vocals_12.wav
2023/05/11 15:45 885,572 1_1_02. qad_(Vocals_(Vocals_13.wav
2023/05/11 15:45 841,472 1_1_02. qad_(Vocals_(Vocals_14.wav
2023/05/11 15:45 876,752 1_1_02. qad_(Vocals_(Vocals_15.wav
2023/05/11 15:45 1,091,960 1_1_02. qad_(Vocals_(Vocals_16.wav
2023/05/11 15:45 1,188,980 1_1_02. qad_(Vocals_(Vocals_17.wav
2023/05/11 15:45 1,446,524 1_1_02. qad_(Vocals_(Vocals_18.wav
2023/05/11 15:45 924,380 1_1_02. qad_(Vocals_(Vocals_19.wav
2023/05/11 15:45 255,824 1_1_02. qad_(Vocals_(Vocals_2.wav
2023/05/11 15:45 1,718,180 1_1_02. qad_(Vocals_(Vocals_20.wav
2023/05/11 15:45 2,070,980 1_1_02. qad_(Vocals_(Vocals_21.wav
2023/05/11 15:45 2,827,736 1_1_02. qad_(Vocals_(Vocals_22.wav
2023/05/11 15:45 862,640 1_1_02. qad_(Vocals_(Vocals_23.wav
2023/05/11 15:45 1,628,216 1_1_02. qad_(Vocals_(Vocals_24.wav
2023/05/11 15:45 1,626,452 1_1_02. qad_(Vocals_(Vocals_25.wav
2023/05/11 15:45 1,499,444 1_1_02. qad_(Vocals_(Vocals_26.wav
2023/05/11 15:45 1,303,640 1_1_02. qad_(Vocals_(Vocals_27.wav
2023/05/11 15:45 998,468 1_1_02. qad_(Vocals_(Vocals_28.wav
2023/05/11 15:45 781,496 1_1_02. qad_(Vocals_(Vocals_3.wav
2023/05/11 15:45 1,368,908 1_1_02. qad_(Vocals_(Vocals_4.wav
2023/05/11 15:45 892,628 1_1_02. qad_(Vocals_(Vocals_5.wav
2023/05/11 15:45 1,386,548 1_1_02. qad_(Vocals_(Vocals_6.wav
2023/05/11 15:45 883,808 1_1_02. qad_(Vocals_(Vocals_7.wav
2023/05/11 15:45 952,604 1_1_02. qad_(Vocals_(Vocals_8.wav
2023/05/11 15:45 1,303,640 1_1_02. qad_(Vocals_(Vocals_9.wav
2023/05/11 15:45 1,354,796 1_1_03. hs_(Vocals_(Vocals_0.wav
2023/05/11 15:45 1,344,212 1_1_03. hs_(Vocals_(Vocals_1.wav
2023/05/11 15:45 1,305,404 1_1_03. hs_(Vocals_(Vocals_10.wav
2023/05/11 15:45 1,291,292 1_1_03. hs_(Vocals_(Vocals_11.wav
2023/05/11 15:45 1,338,920 1_1_03. hs_(Vocals_(Vocals_12.wav
2023/05/11 15:45 1,093,724 1_1_03. hs_(Vocals_(Vocals_13.wav
2023/05/11 15:45 1,375,964 1_1_03. hs_(Vocals_(Vocals_14.wav
2023/05/11 15:45 1,409,480 1_1_03. hs_(Vocals_(Vocals_15.wav
2023/05/11 15:45 1,481,804 1_1_03. hs_(Vocals_(Vocals_16.wav
2023/05/11 15:45 2,247,380 1_1_03. hs_(Vocals_(Vocals_17.wav
2023/05/11 15:45 1,312,460 1_1_03. hs_(Vocals_(Vocals_18.wav
2023/05/11 15:45 1,428,884 1_1_03. hs_(Vocals_(Vocals_19.wav
2023/05/11 15:45 1,051,388 1_1_03. hs_(Vocals_(Vocals_2.wav
2023/05/11 15:45 1,377,728 1_1_03. hs_(Vocals_(Vocals_20.wav
2023/05/11 15:45 1,485,332 1_1_03. hs_(Vocals_(Vocals_21.wav
2023/05/11 15:45 897,920 1_1_03. hs_(Vocals_(Vocals_22.wav
2023/05/11 15:45 1,591,172 1_1_03. hs_(Vocals_(Vocals_23.wav
2023/05/11 15:45 920,852 1_1_03. hs_(Vocals_(Vocals_24.wav
2023/05/11 15:45 1,046,096 1_1_03. hs_(Vocals_(Vocals_25.wav
2023/05/11 15:45 730,340 1_1_03. hs_(Vocals_(Vocals_26.wav
2023/05/11 15:45 1,383,020 1_1_03. hs_(Vocals_(Vocals_3.wav
2023/05/11 15:45 1,188,980 1_1_03. hs_(Vocals_(Vocals_4.wav
2023/05/11 15:45 1,003,760 1_1_03. hs_(Vocals_(Vocals_5.wav
2023/05/11 15:45 1,243,664 1_1_03. hs_(Vocals_(Vocals_6.wav
2023/05/11 15:45 845,000 1_1_03. hs_(Vocals_(Vocals_7.wav
2023/05/11 15:45 892,628 1_1_03. hs_(Vocals_(Vocals_8.wav
2023/05/11 15:45 539,828 1_1_03. hs_(Vocals_(Vocals_9.wav
2023/05/11 15:45 725,048 1_1_04. hope_(Vocals_(Vocals_0.wav
2023/05/11 15:45 1,023,164 1_1_04. hope_(Vocals_(Vocals_1.wav
2023/05/11 15:45 202,904 1_1_04. hope_(Vocals_(Vocals_10.wav
2023/05/11 15:45 659,780 1_1_04. hope_(Vocals_(Vocals_11.wav
2023/05/11 15:45 1,017,872 1_1_04. hope_(Vocals_(Vocals_12.wav
2023/05/11 15:45 1,495,916 1_1_04. hope_(Vocals_(Vocals_13.wav
2023/05/11 15:45 1,665,260 1_1_04. hope_(Vocals_(Vocals_14.wav
2023/05/11 15:45 675,656 1_1_04. hope_(Vocals_(Vocals_15.wav
2023/05/11 15:45 1,187,216 1_1_04. hope_(Vocals_(Vocals_16.wav
2023/05/11 15:45 1,201,328 1_1_04. hope_(Vocals_(Vocals_17.wav
2023/05/11 15:45 1,368,908 1_1_04. hope_(Vocals_(Vocals_18.wav
2023/05/11 15:45 1,462,400 1_1_04. hope_(Vocals_(Vocals_19.wav
2023/05/11 15:45 963,188 1_1_04. hope_(Vocals_(Vocals_2.wav
2023/05/11 15:45 1,121,948 1_1_04. hope_(Vocals_(Vocals_20.wav
2023/05/11 15:45 165,860 1_1_04. hope_(Vocals_(Vocals_21.wav
2023/05/11 15:45 1,116,656 1_1_04. hope_(Vocals_(Vocals_3.wav
2023/05/11 15:45 622,736 1_1_04. hope_(Vocals_(Vocals_4.wav
2023/05/11 15:45 1,349,504 1_1_04. hope_(Vocals_(Vocals_5.wav
2023/05/11 15:45 984,356 1_1_04. hope_(Vocals_(Vocals_6.wav
2023/05/11 15:45 2,104,496 1_1_04. hope_(Vocals_(Vocals_7.wav
2023/05/11 15:45 1,762,280 1_1_04. hope_(Vocals_(Vocals_8.wav
2023/05/11 15:45 1,116,656 1_1_04. hope_(Vocals_(Vocals_9.wav
2023/05/11 15:45 1,114,892 1_1_05. kamen_(Vocals_(Vocals_0.wav
2023/05/11 15:45 874,988 1_1_05. kamen_(Vocals_(Vocals_1.wav
2023/05/11 15:45 1,400,660 1_1_05. kamen_(Vocals_(Vocals_10.wav
2023/05/11 15:45 943,784 1_1_05. kamen_(Vocals_(Vocals_11.wav
2023/05/11 15:45 1,351,268 1_1_05. kamen_(Vocals_(Vocals_12.wav
2023/05/11 15:45 1,476,512 1_1_05. kamen_(Vocals_(Vocals_13.wav
2023/05/11 15:45 933,200 1_1_05. kamen_(Vocals_(Vocals_14.wav
2023/05/11 15:45 1,388,312 1_1_05. kamen_(Vocals_(Vocals_15.wav
2023/05/11 15:45 1,012,580 1_1_05. kamen_(Vocals_(Vocals_16.wav
2023/05/11 15:45 1,365,380 1_1_05. kamen_(Vocals_(Vocals_17.wav
2023/05/11 15:45 1,614,104 1_1_05. kamen_(Vocals_(Vocals_18.wav
2023/05/11 15:45 1,582,352 1_1_05. kamen_(Vocals_(Vocals_19.wav
2023/05/11 15:45 949,076 1_1_05. kamen_(Vocals_(Vocals_2.wav
2023/05/11 15:45 1,402,424 1_1_05. kamen_(Vocals_(Vocals_20.wav
2023/05/11 15:45 1,268,360 1_1_05. kamen_(Vocals_(Vocals_21.wav
2023/05/11 15:45 1,016,108 1_1_05. kamen_(Vocals_(Vocals_22.wav
2023/05/11 15:45 1,065,500 1_1_05. kamen_(Vocals_(Vocals_3.wav
2023/05/11 15:45 874,988 1_1_05. kamen_(Vocals_(Vocals_4.wav
2023/05/11 15:45 954,368 1_1_05. kamen_(Vocals_(Vocals_5.wav
2023/05/11 15:45 1,049,624 1_1_05. kamen_(Vocals_(Vocals_6.wav
2023/05/11 15:45 878,516 1_1_05. kamen_(Vocals_(Vocals_7.wav
2023/05/11 15:45 1,019,636 1_1_05. kamen_(Vocals_(Vocals_8.wav
2023/05/11 15:45 1,383,020 1_1_05. kamen_(Vocals_(Vocals_9.wav
2023/05/11 15:45 1,005,524 1_1_06. ctrl_(Vocals_(Vocals_0.wav
2023/05/11 15:45 1,090,196 1_1_06. ctrl_(Vocals_(Vocals_1.wav
2023/05/11 15:45 84,716 1_1_06. ctrl_(Vocals_(Vocals_10.wav
2023/05/11 15:45 857,348 1_1_06. ctrl_(Vocals_(Vocals_11.wav
2023/05/11 15:45 991,412 1_1_06. ctrl_(Vocals_(Vocals_12.wav
2023/05/11 15:45 1,121,948 1_1_06. ctrl_(Vocals_(Vocals_13.wav
2023/05/11 15:45 931,436 1_1_06. ctrl_(Vocals_(Vocals_14.wav
2023/05/11 15:45 3,129,380 1_1_06. ctrl_(Vocals_(Vocals_15.wav
2023/05/11 15:45 6,202,268 1_1_06. ctrl_(Vocals_(Vocals_16.wav
2023/05/11 15:45 1,457,108 1_1_06. ctrl_(Vocals_(Vocals_17.wav
2023/05/11 15:45 1,046,096 1_1_06. ctrl_(Vocals_(Vocals_2.wav
2023/05/11 15:45 956,132 1_1_06. ctrl_(Vocals_(Vocals_3.wav
2023/05/11 15:45 1,286,000 1_1_06. ctrl_(Vocals_(Vocals_4.wav
2023/05/11 15:45 804,428 1_1_06. ctrl_(Vocals_(Vocals_5.wav
2023/05/11 15:45 1,337,156 1_1_06. ctrl_(Vocals_(Vocals_6.wav
2023/05/11 15:45 1,372,436 1_1_06. ctrl_(Vocals_(Vocals_7.wav
2023/05/11 15:45 2,954,744 1_1_06. ctrl_(Vocals_(Vocals_8.wav
2023/05/11 15:45 6,112,304 1_1_06. ctrl_(Vocals_(Vocals_9.wav
140 File(s) 183,026,452 bytes
At this point, the data slicing is complete.
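A few clips in the listing are several times larger than the rest, and long clips are the ones most likely to exhaust GPU memory. A quick check of clip durations can flag them before training. This is a minimal sketch using only the standard library; it assumes the clips are plain PCM WAV files (which the `wave` module can read), and the 15-second limit in the usage comment is an arbitrary example value, not a so-vits-svc requirement.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, computed from its header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def clips_longer_than(clip_dir, max_seconds):
    """Return (file name, duration) for every .wav in clip_dir longer than max_seconds."""
    long_clips = []
    for path in sorted(Path(clip_dir).glob("*.wav")):
        duration = wav_duration_seconds(path)
        if duration > max_seconds:
            long_clips.append((path.name, duration))
    return long_clips


# Example (placeholder directory and limit):
# for name, seconds in clips_longer_than("clips", 15):
#     print(f"{name}: {seconds:.1f} s")
```

Clips flagged this way can be sliced again with a smaller min_interval, or simply left out of the training set.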
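Start training
Next, put the sliced dataset in the dataset_raw/yebei folder under the project root; create the yebei folder if it does not exist.
The training configuration file, configs/config.json, looks like this: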
{
  "train": {
    "log_interval": 200,
    "eval_interval": 800,
    "seed": 1234,
    "epochs": 10000,
    "learning_rate": 0.0001,
    "betas": [0.8, 0.99],
    "eps": 1e-09,
    "batch_size": 6,
    "fp16_run": false,
    "lr_decay": 0.999875,
    "segment_size": 10240,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0,
    "use_sr": true,
    "max_speclen": 512,
    "port": "8001",
    "keep_ckpts": 10,
    "all_in_mem": false
  },
  "data": {
    "training_files": "filelists/train.txt",
    "validation_files": "filelists/val.txt",
    "max_wav_value": 32768.0,
    "sampling_rate": 44100,
    "filter_length": 2048,
    "hop_length": 512,
    "win_length": 2048,
    "n_mel_channels": 80,
    "mel_fmin": 0.0,
    "mel_fmax": 22050
  },
  "model": {
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [3, 7, 11],
    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    "upsample_rates": [8, 8, 2, 2, 2],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [16, 16, 4, 4, 4],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "gin_channels": 768,
    "ssl_dim": 768,
    "n_speakers": 1
  },
  "spk": {
    "yebei": 0
  }
}
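Some values in this configuration depend on each other. In HiFi-GAN-style vocoders the product of upsample_rates has to equal hop_length (here 8 × 8 × 2 × 2 × 2 = 512), and mel_fmax cannot exceed half the sampling rate (44100 / 2 = 22050). The following sketch checks those relations after a manual edit; the function `check_config` is hypothetical and not part of so-vits-svc, and the list of checks is far from exhaustive.

```python
import math


def check_config(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = []
    data, model = config["data"], config["model"]
    # The vocoder upsamples one frame per hop back to hop_length samples.
    if math.prod(model["upsample_rates"]) != data["hop_length"]:
        problems.append("product of upsample_rates should equal hop_length")
    # Each upsampling stage needs its own kernel size.
    if len(model["upsample_rates"]) != len(model["upsample_kernel_sizes"]):
        problems.append("upsample_rates and upsample_kernel_sizes differ in length")
    # Frequencies above half the sampling rate cannot be represented.
    if data["mel_fmax"] > data["sampling_rate"] / 2:
        problems.append("mel_fmax exceeds the Nyquist frequency")
    return problems


# Example:
# import json
# with open("configs/config.json", encoding="utf-8") as f:
#     print(check_config(json.load(f)) or "no problems found")
```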
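In this configuration, epochs is the number of epochs to train for, where one epoch is one complete pass over the whole training set. Each epoch consists of many training steps; each step draws a small batch of batch_size samples from the training set, trains on it, and updates the model's parameters.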
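The relation between epochs and steps is simple arithmetic. As a rough sketch, assuming every clip is used for training and the last, smaller batch is kept (the real count can differ, because some clips are held out for validation or filtered out):

```python
import math


def steps_per_epoch(num_clips, batch_size):
    """Number of training steps needed to see every clip once."""
    return math.ceil(num_clips / batch_size)


def epochs_for_steps(target_steps, num_clips, batch_size):
    """Number of epochs needed to reach at least target_steps steps."""
    return math.ceil(target_steps / steps_per_epoch(num_clips, batch_size))


print(steps_per_epoch(140, 6))         # 140 clips, batch_size 6 -> 24
print(epochs_for_steps(6400, 140, 6))  # 6400 steps -> 267 epochs
```

With the 140 clips produced above and a batch_size of 6, one epoch is about 24 steps, so the 10000 epochs in the configuration are an upper bound rather than a target; training is usually stopped much earlier.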
If training aborts with an error like this:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 8.00 GiB total capacity; 6.86 GiB already allocated; 0 bytes free; 7.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
then GPU memory has run out, and batch_size in the configuration should be lowered.
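Lowering batch_size can be done by hand in an editor, or with a few lines of code when experimenting with several values. A minimal sketch, assuming the config layout shown above; the function name `set_batch_size` is hypothetical:

```python
import json


def set_batch_size(config_path, new_batch_size):
    """Rewrite train.batch_size in a config file and return the previous value."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    old_batch_size = config["train"]["batch_size"]
    config["train"]["batch_size"] = new_batch_size
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return old_batch_size


# Example: halve the batch size after an out-of-memory error.
# set_batch_size("configs/config.json", 3)
```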
With the configuration in place, start training:
python3 train.py -c configs/config.json -m 44k
The terminal then prints the training progress:
D:\work\so-vits-svc\workenv\lib\site-packages\torch\optim\lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
D:\work\so-vits-svc\workenv\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
D:\work\so-vits-svc\workenv\lib\site-packages\torch\autograd\__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [32, 1, 4], strides() = [4, 1, 1]
bucket_view.sizes() = [32, 1, 4], strides() = [4, 4, 1] (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\reducer.cpp:337.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:44k:====> Epoch: 274, cost 39.02 s
INFO:44k:====> Epoch: 275, cost 17.47 s
INFO:44k:====> Epoch: 276, cost 17.74 s
INFO:44k:====> Epoch: 277, cost 17.43 s
INFO:44k:====> Epoch: 278, cost 17.59 s
INFO:44k:====> Epoch: 279, cost 17.82 s
INFO:44k:====> Epoch: 280, cost 17.64 s
INFO:44k:====> Epoch: 281, cost 17.63 s
INFO:44k:Train Epoch: 282 [65%]
INFO:44k:Losses: [1.8697402477264404, 3.029414415359497, 11.415563583374023, 23.37869644165039, 0.2702481746673584], step: 6600, lr: 9.637943809624507e-05, reference_loss: 39.963661193847656
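During training, the log reports the time taken by each epoch and, at regular step intervals, the current loss values. The trained models are stored in the project's logs/44k directory, with the .pth file extension.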
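The "cost" figures in the log make it easy to estimate how long a longer run will take. The sketch below parses epoch lines of the form shown above; the regular expression is written against this sample output only, so treat the format as an assumption that may not hold for other versions.

```python
import re

# Matches lines such as "INFO:44k:====> Epoch: 275, cost 17.47 s".
EPOCH_LINE = re.compile(r"====> Epoch: (\d+), cost ([\d.]+) s")


def epoch_costs(log_lines):
    """Map epoch number to its duration in seconds for every matching log line."""
    costs = {}
    for line in log_lines:
        match = EPOCH_LINE.search(line)
        if match:
            costs[int(match.group(1))] = float(match.group(2))
    return costs


def mean_epoch_seconds(log_lines):
    """Average epoch duration, or None if no epoch lines were found."""
    costs = epoch_costs(log_lines)
    return sum(costs.values()) / len(costs) if costs else None
```

Multiplying the average by the number of remaining epochs gives a rough estimate of the remaining training time. The first epoch after a restart (39.02 s above) tends to be slower than the rest, so it can be worth excluding.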
Conclusion
Finally, here is a voice model of the folk singer Ye Bei (叶蓓), trained for 6,400 steps in total, shared with everyone:
pan.baidu.com/s/1m3VGc7RktaO5snHw6RPLjQ?pwd=pqkb
Extraction code: pqkb