Speech synthesis for low-resourced languages based on adaptation approach: Application to Muong language

...interaction. Many voice interaction systems have been introduced recently, allowing users to communicate with devices on various platforms, such as smartphones (Apple Siri, Google Cloud, Amazon Alexa, etc.), intelligent cars (BMW, Ford, etc.), and smart homes. One of the essential components of these systems is speech synthesis, or Text-to-Speech (TTS), which converts input text into speech. Developing a TTS system for a language involves not only implementing speech processing techniques but also linguistic studies of its phonetics, phonology, syntax, and grammar. According to the 25th edition of Ethnologue (regarded as the most comprehensive source of linguistic statistics), there are 7,151 living languages in the world, belonging to 141 language families, of which 2,982 are unwritten. Some languages, such as dialects of ethnic minorities, have not been described in the academic literature. Machine learning methods based on big data do not readily apply to low-resourced languages, especially unwritten ones. Low-resourced and unwritten language processing has begun to attract attention only in the past few years and has yet to produce many results. Research in this field is nonetheless essential: it brings voice communication technologies, and the products built on them, to ethnic minority communities, and it contributes to the conservation of endangered languages. In Vietnam, domestic research units have given Vietnamese language and speech processing comprehensive attention, addressing problems ranging from natural language processing (text processing, syntactic component separation, and semantics) to speech processing (synthesis and recognition).
However, language and speech processing in general, including TTS systems, for minority languages without a writing system in Vietnam has not received much attention, owing to the scarcity of data such as bilingual text and speech recordings and to the lack of related linguistic studies. The Muong language has linguistic characteristics, such as tonality and complex phonetic structures, that make developing a TTS system challenging. This thesis therefore aims to fill this gap by developing a TTS system for the Muong language, a minority language spoken in Vietnam that does not have a writing system (only the Muong Hoa Binh dialect has had one, since 2016). This research area is novel not only in Vietnam but also worldwide, and a Muong TTS system can contribute to preserving and promoting this endangered language.

176 pages | Shared by: Tài Chi | Date: 27/11/2023 | Views: 642 | Downloads: 0


MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

Pham Van Dong

SPEECH SYNTHESIS FOR LOW-RESOURCED LANGUAGES BASED ON ADAPTATION APPROACH: APPLICATION TO MUONG LANGUAGE

Major: Computer science
Code: 9480101

DOCTORAL DISSERTATION IN COMPUTER SCIENCE

ADVISORS:
1. Dr. MAC DANG KHOA
2. Assoc. Prof. TRAN DO DAT

Ha Noi - 2023

DECLARATION OF AUTHORSHIP

I, Pham Van Dong, declare that the dissertation titled “Speech Synthesis for Low-Resourced Languages based on Adaptation Approach: Application to Muong Language” has been entirely composed by myself. I confirm the following:
• This work was done wholly or mainly while in candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
• The work has not been submitted for any other degree or qualification at Hanoi University of Science and Technology or any other institution.
• Appropriate acknowledgment has been given within this dissertation where reference has been made to the published work of others.
• The dissertation submitted is my own, except where collaborative work has been included. The collaborative contributions have been indicated.

Hanoi, September 19, 2023
Ph.D. Student: Pham Van Dong
ADVISORS: 1. Dr. Mac Dang Khoa 2. Assoc. Prof. Tran Do Dat

ACKNOWLEDGMENT

Foremost, I would like to express my most sincere and deepest gratitude to my thesis advisors, Dr. Mạc Đăng Khoa (Speech Communication Department, MultiLab at MICA) and Prof. Trần Đỗ Đạt (The Ministry of Science and Technology, Vietnam), for their continuous support and guidance during my Ph.D. program, and for providing me with such a rigorous and inspiring research environment. I am grateful to Dr. Mạc Đăng Khoa for his excellent mentorship, care, patience, and immense knowledge of Text-To-Speech (TTS). His advice helped me throughout the research and writing of this thesis. I am very thankful to Prof. Đạt for shaping my thesis at the beginning and for his enthusiasm and encouragement. Prof. Trần Đỗ Đạt substantially facilitated my Ph.D. research, especially when I was a newcomer to speech processing and TTS, with his valuable comments on Vietnamese and Muong TTS.

I thank all MICA members for their help during my Ph.D. study. My sincere thanks to Dr. Nguyen Viet Son, Assoc. Prof. Dao Trung Kien, and Dr. Do Thi Ngoc Diep for giving me much support and valuable advice. Thanks to Nguyen Van Thinh, Nguyen Tien Thanh, Dang Thanh Mai, and Vu Thi Hai Ha for their help.

I want to thank my Hanoi University of Mining and Geology colleagues for all their support during my Ph.D. study. Special thanks to my family for understanding my hours glued to the computer screen.

Hanoi, September 19, 2023
Ph.D. Student

ABSTRACT

Text-to-speech (TTS) synthesis is the automatic conversion of text into speech. Typically, building a high-quality synthetic voice requires collecting tens of hours of speech from a professional speaker with a high-quality microphone. There are about 7,000 languages spoken worldwide, but only a few, such as English, Spanish, Mandarin, and Japanese, have good TTS systems. So-called "low-resourced languages", and even more so languages that are not yet written, do not have TTS. Applying TTS technology to low-resourced languages therefore requires studying other TTS methods.

In Vietnam, Vietnamese is the mother tongue and the most widely used language. Muong is a group of language varieties spoken by the Muong people of Vietnam. It belongs to the Austroasiatic language family and is closely related to Vietnamese, and the Muong are one of the five ethnic groups with the largest population. However, Muong still lacks an official script, which makes it a typical low-resourced language in Vietnam. Researching TTS technologies to create TTS for the Muong language is therefore challenging.

In the first part of this thesis, we give an overview of TTS and study the phonetics of the Vietnamese and Muong languages; we have also developed and published several tools that support TTS for both languages. In the rest of the thesis, we conduct various experiments in creating TTS for a low-resourced language, specifically Muong. We focus on two main low-resourced language groups:
• Written: We emulate the reading of Muong text using a Vietnamese TTS system, and we apply cross-lingual transfer learning for adaptation.
• Unwritten: We experiment with adaptation in two directions. The first creates Muong speech synthesis directly from Vietnamese text and Muong voice. The second creates Muong speech synthesis from translation through an intermediate representation.

We hope our findings can serve as an impetus to develop speech synthesis for low-resourced languages worldwide and contribute to the basis for speech synthesis development for the 53 ethnic minority languages in Viet Nam.

Hanoi, September 19, 2023
Ph.D. Student

CONTENT

DECLARATION OF AUTHORSHIP
ACKNOWLEDGMENT
ABSTRACT
CONTENT
ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION

PART 1: BACKGROUND AND RELATED WORKS

CHAPTER 1. OVERVIEW OF SPEECH SYNTHESIS AND SPEECH SYNTHESIS FOR LOW-RESOURCED LANGUAGE
  1.1. Overview of speech synthesis
    1.1.1. Overview
    1.1.2. TTS architecture
    1.1.3. Evolution of TTS methods over time
      1.1.3.1. TTS using unit-selection method
      1.1.3.2. Statistical parameter speech synthesis
      1.1.3.3. Speech synthesis using deep neural networks
      1.1.3.4. Neural speech synthesis
  1.2. Speech synthesis for low-resourced languages
    1.2.1. TTS using emulating input approach
    1.2.2. TTS using the polyglot approach
    1.2.3. Speech synthesis for low-resourced language using the adaptation approach
  1.3. Machine translation
    1.3.1. Neural translation model
    1.3.2. Attention in neural machine translation
    1.3.3. Statistical machine translation based on phrase
      1.3.3.1. Statistical machine translation problem based on phrase
      1.3.3.2. Translation model and language model
      1.3.3.3. Decode the input sentence in the translation system
      1.3.3.4. Model for building a statistical translation system
    1.3.4. Machine translation through intermediate representation
    1.3.5. Speech translation for unwritten low-resourced languages
  1.4. Speech synthesis evaluation metrics
    1.4.1. Mean Opinion Score (MOS)
      1.4.1.1. Definition
      1.4.1.2. Formula
      1.4.1.3. Significance
      1.4.1.4. Confidence Interval (CI)
    1.4.2. Mel Cepstral Distortion (MCD)
      1.4.2.1. Concept
      1.4.2.2. Formula
      1.4.2.3. Significance
      1.4.2.4. MCD with Dynamic Time Warping (MCD-DTW)
    1.4.3. Analysis of variance (ANOVA)
    1.4.4. Intelligibility
  1.5. Conclusion

CHAPTER 2. VIETNAMESE AND MUONG LANGUAGE
  2.1. Vietnamese language
    2.1.1. History of Vietnamese
    2.1.2. Vietnamese phonetic system
      2.1.2.1. Vietnamese syllable structure
      2.1.2.2. Vietnamese phonetic system
      2.1.2.3. Vietnamese tone system
  2.2. Muong language
    2.2.1. Overview of Muong people and Muong language
      2.2.1.1. Muong history
      2.2.1.2. Viet Muong group
      2.2.1.3. Muong dialects
      2.2.1.4. Muong written script
    2.2.2. Muong phonetics system
      2.2.2.1. Muong syllable structure
      2.2.2.2. Muong phoneme system
      2.2.2.3. Muong tone system
  2.3. Comparison between Vietnamese and Muong
  2.4. Discussion and proposed approach

PART 2: SPEECH SYNTHESIS FOR MUONG AS A WRITTEN LANGUAGE

CHAPTER 3. EMULATING OF THE MUONG TTS BASED ON INPUT TRANSFORMATION OF THE VIETNAMESE TTS
  3.1. Proposed method
    3.1.1. Muong G2P module
    3.1.2. Muong emulating IPA module
  3.2. Experiment
    3.2.1. Testing materials
    3.2.2. Experiment protocol
    3.2.3. Results
    3.2.4. Analysis by ANOVA method
      3.2.4.1. MOS analysis by ANOVA
      3.2.4.2. Intelligibility analysis by ANOVA
  3.3. Conclusion

CHAPTER 4. CROSS-LINGUAL TRANSFER LEARNING FOR MUONG SPEECH SYNTHESIS
  4.1. Proposed method
  4.2. Experiment
    4.2.1. Dataset
      4.2.1.1. Vietnamese data
      4.2.1.2. Muong Project's data
      4.2.1.3. Muong fine-tuning data
    4.2.2. Graphemes to phonemes
    4.2.3. Training the pretrained model using Vietnamese dataset
    4.2.4. Finetuned TTS model on Muong datasets
  4.3. Evaluation
  4.4. MOS analysis by ANOVA
  4.5. Conclusion

PART 3: SPEECH SYNTHESIS FOR MUONG AS AN UNWRITTEN LANGUAGE

CHAPTER 5. GENERATE UNWRITTEN LOW-RESOURCED LANGUAGE’S SPEECH DIRECTLY FROM RICH-RESOURCE LANGUAGE’S TEXT
  5.1. Introduction
  5.2. Proposed method
    5.2.1. Model architecture
    5.2.2. Database
    5.2.3. Training the speech synthesis system
    5.2.4. Evaluation
    5.2.5. MOS analysis by ANOVA
      5.2.5.1. ANOVA analysis in Muong Bi speech synthesis
      5.2.5.2. ANOVA analysis in Muong Tan Son speech synthesis
  5.3. Conclusion

CHAPTER 6. SPEECH SYNTHESIS FOR UNWRITTEN LOW-RESOURCED LANGUAGE USING INTERMEDIATE REPRESENTATION
  6.1. Proposed method
  6.2. Experiment
    6.2.1. Database building
    6.2.2. System development
      6.2.2.1. Text to phone translation
      6.2.2.2. Phone to sound conversion
  6.3. Evaluation
    6.3.1. Evaluation in Muong Bi and Muong Tan Son
    6.3.2. MOS analysis by ANOVA
      6.3.2.1. ANOVA analysis in Muong Bi speech synthesis
      6.3.2.2. ANOVA analysis in Muong Tan Son speech synthesis
  6.4. Conclusion and comparison

CONCLUSION AND FUTURE WORKS

Files attached to this document:

speech_synthesis_for_low_resourced_languages_based_on_adapta.pdf
3.-Trich-yeu-cua-Luan-an - DM_DongPV edit 1.7.2023.pdf
12. Tom tat diem moi thesis dongpv english.pdf
12. Tom tat diem moi thesis dongpv tieng Viet.pdf
TiengAnh.Tom tat luan an DongPV V9.6.2 24 pages.pdf
TiengViet.Tom tat luan an DongPV V9.6.2 tieng viet 24 pages.pdf