{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "eK3nmYDB6C1a" }, "source": [ "# **[Piper](https://github.com/rhasspy/piper) training notebook.**\n", "## ![Piper logo](https://contribute.rhasspy.org/img/logo.png)\n", "\n", "---\n", "\n", "- Notebook created by [rmcpantoja](http://github.com/rmcpantoja)\n", "- Collaborator and translator: [Xx_Nessu_xX](https://fakeyou.com/profile/Xx_Nessu_xX)\n", "- [Original English notebook.](https://colab.research.google.com/github/rmcpantoja/piper/blob/master/notebooks/piper_multilingual_training_notebook.ipynb#scrollTo=dOyx9Y6JYvRF)\n", "\n", "---\n", "\n", "# Notes:\n", "\n", "- **Text shown in orange is important.**" ] }, { "cell_type": "markdown", "metadata": { "id": "AICh6p5OJybj" }, "source": [ "# 🔧 ***Getting started.*** 🔧" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "qyxSMuzjfQrz" }, "outputs": [], "source": [ "#@markdown ## **Google Colab Anti-Disconnect.** 🔌\n", "#@markdown ---\n", "#@markdown #### Prevents automatic disconnection. Even so, the session will disconnect after **6 to 12 hours**.\n", "\n", "import IPython\n", "js_code = '''\n", "function ClickConnect(){\n", "console.log(\"Working\");\n", "document.querySelector(\"colab-toolbar-button#connect\").click()\n", "}\n", "setInterval(ClickConnect,60000)\n", "'''\n", "display(IPython.display.Javascript(js_code))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "ygxzp-xHTC7T" }, "outputs": [], "source": [ "#@markdown ## **Check the GPU.** 👁️\n", "#@markdown ---\n", "#@markdown #### A more powerful GPU can speed up training. 
By default, you will get a **Tesla T4**.\n", "!nvidia-smi" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "sUNjId07JfAK" }, "outputs": [], "source": [ "#@markdown # **Mount your Google Drive.** 📂\n", "from google.colab import drive\n", "drive.mount('/content/drive', force_remount=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "_XwmTVlcUgCh" }, "outputs": [], "source": [ "#@markdown # **Install software.** 📦\n", "\n", "#@markdown #### This cell installs the synthesizer and the dependencies needed to run training. (This may take a while.)\n", "\n", "#@markdown #### **Do you want to use the patch?**\n", "#@markdown The patch makes it possible to export audio files to the output folder and to save a single model during training.\n", "usepatch = True #@param {type:\"boolean\"}\n", "#@markdown ---\n", "# clone:\n", "!git clone -q https://github.com/rmcpantoja/piper\n", "%cd /content/piper/src/python\n", "!wget -q \"https://raw.githubusercontent.com/coqui-ai/TTS/dev/TTS/bin/resample.py\"\n", "!pip install -q -r requirements.txt\n", "!pip install -q torchtext==0.12.0\n", "!pip install -q torchvision==0.12.0\n", "# fixing recent compatibility issues:\n", "!pip install -q torchaudio==0.11.0 torchmetrics==0.11.4\n", "!bash build_monotonic_align.sh\n", "!apt-get install -q espeak-ng\n", "# download the patch only when the option above is enabled:\n", "if usepatch:\n", "    print(\"Downloading patch...\")\n", "    !gdown -q \"1EWEb7amo1rgFGpBFfRD4BKX3pkjVK1I-\" -O \"/content/piper/src/python/patch.zip\"\n", "    !unzip -o -q \"patch.zip\"\n", "%cd /content" ] }, { "cell_type": "markdown", "metadata": { "id": "A3bMzEE0V5Ma" }, "source": [ "# 🤖 ***Training.*** 🤖" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "SvEGjf0aV8eg" }, "outputs": [], "source": [ "#@markdown # **1. 
Extract the dataset.** 📥\n", "#@markdown #### Important: the audio files must be in **wav format (16000 or 22050 Hz, 16-bit, mono)** and, for convenience, numbered. Example:\n", "\n", "#@markdown * **1.wav**\n", "#@markdown * **2.wav**\n", "#@markdown * **3.wav**\n", "#@markdown * **.....**\n", "\n", "#@markdown ---\n", "\n", "%cd /content\n", "!mkdir -p /content/dataset\n", "%cd /content/dataset\n", "!mkdir -p /content/dataset/wavs\n", "#@markdown ### Path of the dataset archive to extract:\n", "zip_path = \"/content/drive/MyDrive/wavs.zip\" #@param {type:\"string\"}\n", "!unzip \"{zip_path}\" -d /content/dataset/wavs\n", "#@markdown ---" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "E0W0OCvXXvue" }, "outputs": [], "source": [ "#@markdown # **2. Upload the transcription file.** 📝\n", "#@markdown ---\n", "#@markdown #### **Important: the transcription is the text spoken in each audio file, and it must have the following structure:**\n", "\n", "#@markdown ##### For a single-speaker dataset:\n", "#@markdown * wavs/1.wav|This is what the character says in audio 1.\n", "#@markdown * wavs/2.wav|And this is the text the character says in audio 2.\n", "#@markdown * ...\n", "\n", "#@markdown ##### For a multi-speaker dataset:\n", "\n", "#@markdown * wavs/speaker1audio1.wav|speaker1|This is what the first speaker says.\n", "#@markdown * wavs/speaker1audio2.wav|speaker1|This is another audio file from the first speaker.\n", "#@markdown * wavs/speaker2audio1.wav|speaker2|This is what the second speaker says in the first audio file.\n", "#@markdown * wavs/speaker2audio2.wav|speaker2|This is another audio file from the second speaker.\n", "#@markdown * ...\n", "\n", "#@markdown #### And so on. 
In addition, the transcription must be a **.csv file (UTF-8 without BOM)**\n", "#@markdown ---\n", "%cd /content/dataset\n", "from google.colab import files\n", "!rm -f /content/dataset/metadata.csv\n", "listfn, length = files.upload().popitem()\n", "if listfn != \"metadata.csv\":\n", "    !mv \"$listfn\" metadata.csv\n", "%cd .." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "dOyx9Y6JYvRF" }, "outputs": [], "source": [ "#@markdown # **3. Preprocess the dataset.** 🔄\n", "\n", "import os\n", "#@markdown ### First, select the language of your dataset.
(For Spanish, the following are available: Español and Español (latinoamericano).)\n", "language = \"Español\" #@param [\"Català\", \"Dansk\", \"Deutsch\", \"Ελληνικά\", \"English (British)\", \"English (U.S.)\", \"Español\", \"Español (latinoamericano)\", \"Suomi\", \"Français\", \"ქართული\", \"hindy\", \"Icelandic\", \"Italiano\", \"қазақша\", \"नेपाली\", \"Nederlands\", \"Norsk\", \"Polski\", \"Português (Brasil)\", \"Русский\", \"Svenska\", \"украї́нська\", \"Tiếng Việt\", \"简体中文\"]\n", "#@markdown ---\n", "# language definition:\n", "languages = {\n", "    \"Català\": \"ca\",\n", "    \"Dansk\": \"da\",\n", "    \"Deutsch\": \"de\",\n", "    \"Ελληνικά\": \"grc\",\n", "    \"English (British)\": \"en\",\n", "    \"English (U.S.)\": \"en-us\",\n", "    \"Español\": \"es\",\n", "    \"Español (latinoamericano)\": \"es-419\",\n", "    \"Suomi\": \"fi\",\n", "    \"Français\": \"fr\",\n", "    \"hindy\": \"hi\",\n", "    \"Icelandic\": \"is\",\n", "    \"Italiano\": \"it\",\n", "    \"ქართული\": \"ka\",\n", "    \"қазақша\": \"kk\",\n", "    \"नेपाली\": \"ne\",\n", "    \"Nederlands\": \"nl\",\n", "    \"Norsk\": \"nb\",\n", "    \"Polski\": \"pl\",\n", "    \"Português (Brasil)\": \"pt-br\",\n", "    \"Русский\": \"ru\",\n", "    \"Svenska\": \"sv\",\n", "    \"украї́нська\": \"uk\",\n", "    \"Tiếng Việt\": \"vi-vn-x-central\",\n", "    \"简体中文\": \"yue\"\n", "}\n", "\n", "def _get_language(code):\n", "    return languages[code]\n", "\n", "final_language = _get_language(language)\n", "#@markdown ### Choose a name for your model:\n", "model_name = \"Test\" #@param {type:\"string\"}\n", "#@markdown ---\n", "# output:\n", "#@markdown ### Choose the working folder: (saving to Drive is recommended)\n", "\n", "#@markdown The working folder is used during preprocessing and also during model training.\n", "output_path = \"/content/drive/MyDrive/colab/piper\" #@param {type:\"string\"}\n", "output_dir = output_path+\"/\"+model_name\n", "if not os.path.exists(output_dir):\n", "    os.makedirs(output_dir)\n", "#@markdown ---\n", 
"#@markdown ### Elige el formato del dataset:\n", "dataset_format = \"ljspeech\" #@param [\"ljspeech\", \"mycroft\"]\n", "#@markdown ---\n", "#@markdown ### ¿Se trata de un conjunto de datos de un solo hablante? Si no es así, desmarca la casilla:\n", "single_speaker = True #@param {type:\"boolean\"}\n", "if single_speaker:\n", " force_sp = \" --single-speaker\"\n", "else:\n", " force_sp = \"\"\n", "#@markdown ---\n", "#@markdown ###Seleccione la frecuencia de muestreo del dataset:\n", "sample_rate = \"22050\" #@param [\"16000\", \"22050\"]\n", "#@markdown ---\n", "%cd /content/piper/src/python\n", "#@markdown ###¿Quieres entrenar utilizando esta frecuencia de muestreo, pero tus audios no la tienen?\n", "#@markdown ¡El remuestreador te ayuda a hacerlo rápidamente!\n", "resample = False #@param {type:\"boolean\"}\n", "if resample:\n", " !python resample.py --input_dir \"/content/dataset/wavs\" --output_dir \"/content/dataset/wavs_resampled\" --output_sr {sample_rate} --file_ext \"wav\"\n", " !mv /content/dataset/wavs_resampled/* /content/dataset/wavs\n", "#@markdown ---\n", "\n", "!python -m piper_train.preprocess \\\n", " --language {final_language} \\\n", " --input-dir /content/dataset \\\n", " --output-dir \"{output_dir}\" \\\n", " --dataset-format {dataset_format} \\\n", " --sample-rate {sample_rate} \\\n", " {force_sp}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "ickQlOCRjkBL" }, "outputs": [], "source": [ "#@markdown # **4. Ajustes.** 🧰\n", "import json\n", "import ipywidgets as widgets\n", "from IPython.display import display\n", "from google.colab import output\n", "import os\n", "#@markdown ### Seleccione la acción para entrenar este conjunto de datos:\n", "\n", "#@markdown * La opción de continuar un entrenamiento se explica por sí misma. Si has entrenado previamente un modelo con colab gratuito, se te ha acabado el tiempo y estás considerando entrenarlo un poco más, esto es ideal para ti. 
You only need to use the same settings you chose when you first trained this model.\n", "#@markdown * The option to convert a single-speaker model into a multi-speaker model is self-explanatory. For it, you must have preprocessed a dataset containing text and audio from every speaker you want to train in your model.\n", "#@markdown * The finetune option trains on a dataset starting from a pretrained model, that is, it continues training on your data. This option is ideal if you want to train on a very small dataset (more than five minutes is recommended).\n", "#@markdown * The train-from-scratch option builds features such as the dictionary and the speech form from scratch, which can take longer to converge. For this, hours of audio (at least 8) covering a wide range of phonemes are recommended.\n", "action = \"finetune\" #@param [\"Continue training\", \"convert single-speaker to multi-speaker model\", \"finetune\", \"train from scratch\"]\n", "#@markdown ---\n", "if action == \"Continue training\":\n", "    if os.path.exists(f\"{output_dir}/lightning_logs/version_0/checkpoints/last.ckpt\"):\n", "        ft_command = f'--resume_from_checkpoint \"{output_dir}/lightning_logs/version_0/checkpoints/last.ckpt\" '\n", "        print(f\"Continuing {model_name}'s training at: {output_dir}/lightning_logs/version_0/checkpoints/last.ckpt\")\n", "    else:\n", "        raise Exception(\"Training cannot be continued as there is no checkpoint to continue at.\")\n", "elif action == \"finetune\":\n", "    if os.path.exists(f\"{output_dir}/lightning_logs/version_0/checkpoints/last.ckpt\"):\n", "        raise Exception(\"Oh no! You have already trained this model before, you cannot choose this option since your progress will be lost, and then your previous time will not count. 
Please select the option to continue a training.\")\n", "    else:\n", "        ft_command = '--resume_from_checkpoint \"/content/pretrained.ckpt\" '\n", "elif action == \"convert single-speaker to multi-speaker model\":\n", "    if not single_speaker:\n", "        ft_command = '--resume_from_single_speaker_checkpoint \"/content/pretrained.ckpt\" '\n", "    else:\n", "        raise Exception(\"This dataset is not a multi-speaker dataset!\")\n", "else:\n", "    ft_command = \"\"\n", "if action == \"convert single-speaker to multi-speaker model\" or action == \"finetune\":\n", "    try:\n", "        with open('/content/piper/notebooks/pretrained_models.json') as f:\n", "            pretrained_models = json.load(f)\n", "        if final_language in pretrained_models:\n", "            models = pretrained_models[final_language]\n", "            model_options = [(model_name, model_name) for model_name, model_url in models.items()]\n", "            model_dropdown = widgets.Dropdown(description = \"Choose pretrained model\", options=model_options)\n", "            download_button = widgets.Button(description=\"Download\")\n", "            def download_model(btn):\n", "                model_name = model_dropdown.value\n", "                model_url = pretrained_models[final_language][model_name]\n", "                print(\"Downloading pretrained model...\")\n", "                if model_url.startswith(\"1\"):\n", "                    !gdown -q \"{model_url}\" -O \"/content/pretrained.ckpt\"\n", "                elif model_url.startswith(\"https://drive.google.com/file/d/\"):\n", "                    !gdown -q \"{model_url}\" -O \"/content/pretrained.ckpt\" --fuzzy\n", "                else:\n", "                    !wget -q \"{model_url}\" -O \"/content/pretrained.ckpt\"\n", "                model_dropdown.close()\n", "                download_button.close()\n", "                output.clear()\n", "                if os.path.exists(\"/content/pretrained.ckpt\"):\n", "                    print(\"Model downloaded!\")\n", "                else:\n", "                    raise Exception(\"Couldn't download the pretrained model!\")\n", "            download_button.on_click(download_model)\n", "            display(model_dropdown, download_button)\n", "        else:\n", "            raise Exception(f\"There are no pretrained models available for the language {final_language}\")\n", "    
except FileNotFoundError:\n", "        raise Exception(\"The pretrained_models.json file was not found.\")\n", "else:\n", "    print(\"Warning: this model will be trained from scratch. You need at least 8 hours of data for everything to work decently. Good luck!\")\n", "#@markdown ### Choose the batch size based on this dataset:\n", "batch_size = 12 #@param {type:\"integer\"}\n", "#@markdown ---\n", "validation_split = 0.01\n", "#@markdown ### Choose the quality for this model:\n", "\n", "#@markdown * x-low - 16 kHz audio, 5-7M params\n", "#@markdown * medium - 22.05 kHz audio, 15-20M params\n", "#@markdown * high - 22.05 kHz audio, 28-32M params\n", "quality = \"medium\" #@param [\"high\", \"x-low\", \"medium\"]\n", "#@markdown ---\n", "#@markdown ### How many epochs should pass between automatic saves of the training checkpoints?\n", "#@markdown The larger your dataset, the lower you should set this save interval, since each epoch takes longer to complete.\n", "checkpoint_epochs = 5 #@param {type:\"integer\"}\n", "#@markdown ---\n", "#@markdown ### Step interval for generating audio samples from the model:\n", "log_every_n_steps = 1000 #@param {type:\"integer\"}\n", "#@markdown ---\n", "#@markdown ### Number of training epochs:\n", "max_epochs = 10000 #@param {type:\"integer\"}\n", "#@markdown ---" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "colab": { "background_save": true }, "id": "X4zbSjXg2J3N" }, "outputs": [], "source": [ "#@markdown # **5. Train.** 🏋️‍♂️\n", "#@markdown Run this cell to train your model. 
If possible, some audio samples will be saved to the output folder during training.\n", "\n", "get_ipython().system(f'''\n", "python -m piper_train \\\n", "--dataset-dir \"{output_dir}\" \\\n", "--accelerator 'gpu' \\\n", "--devices 1 \\\n", "--batch-size {batch_size} \\\n", "--validation-split {validation_split} \\\n", "--num-test-examples 2 \\\n", "--quality {quality} \\\n", "--checkpoint-epochs {checkpoint_epochs} \\\n", "--log_every_n_steps {log_every_n_steps} \\\n", "--max_epochs {max_epochs} \\\n", "{ft_command}\\\n", "--precision 32\n", "''')" ] }, { "cell_type": "markdown", "metadata": { "id": "6ISG085SYn85" }, "source": [ "# Finished training and want to test the model?\n", "\n", "* If you want to run this model in any software that integrates Piper, or in the Piper app itself, export your model using the [model exporter notebook](https://colab.research.google.com/github/rmcpantoja/piper/blob/master/notebooks/piper_model_exporter.ipynb)!\n", "* If you want to test this model right now, before exporting it to the format supported by Piper, try your generated last.ckpt with [this notebook](https://colab.research.google.com/github/rmcpantoja/piper/blob/master/notebooks/piper_inference_(ckpt).ipynb)!" ] } ], "metadata": { "accelerator": "GPU", "colab": { "provenance": [], "include_colab_link": true }, "gpuClass": "standard", "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }