teraction. Many recent voice interaction systems have been introduced, allowing users to
communicate with devices on various platforms, such as smartphones (Apple Siri, Google
Cloud, Amazon Alexa, etc.), intelligent cars (BMW, Ford, etc.), and smart homes. In these
systems, one of the essential components is speech synthesis or Text-to-Speech (TTS), which
can convert input text into speech. Developing a TTS system for a language requires not only
implementing speech processing techniques but also linguistic studies such as
phonetics, phonology, syntax, and grammar.
According to statistics in the 25th edition of Ethnologue (regarded as the most
comprehensive source of information on linguistic statistics), there are 7,151 living languages
in the world, belonging to 141 language families, of which 2,982 languages are not written.
Some languages have not been described in academic literature, such as dialects of ethnic
minorities. Machine learning methods that rely on big data do not readily apply to
low-resourced languages, especially unwritten ones. Low-resourced and unwritten language
processing has only begun to attract attention in the past few years and has produced few
results so far. Nevertheless, research in this field is essential: it brings voice
communication technologies, and products built on them, to ethnic minority communities, and
it contributes to the conservation of endangered languages.
Regarding the Vietnamese language and speech processing field, domestic research units
have given it comprehensive attention and addressed various aspects, ranging from natural
language processing problems such as text processing, syntactic component separation, and
semantics to speech processing problems such as synthesis and recognition. However,
language and speech processing in general, including TTS systems, for minority
languages without a writing system in Vietnam has not received much attention due to the
scarcity of data sources such as bilingual text data and speech data, as well as a lack of related
linguistic studies.
The Muong language presents unique linguistic characteristics, such as tonality and complex
phonetic structures, that make it challenging to develop a TTS system. This thesis
aims to fill this gap by developing a TTS system for the Muong language, a
minority language spoken in Vietnam that does not have a writing system (only the Muong
Hoa Binh dialect gained a writing system, in 2016). This research area is novel not only in Vietnam
but also worldwide, and the development of a Muong TTS system can contribute to preserving
and promoting this endangered language.
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Pham Van Dong
SPEECH SYNTHESIS FOR LOW-RESOURCED LANGUAGES
BASED ON ADAPTATION APPROACH: APPLICATION TO
MUONG LANGUAGE
DOCTORAL DISSERTATION IN
COMPUTER SCIENCE
Major: Computer science
Code: 9480101
ADVISORS:
1. Dr. MAC DANG KHOA
2. Assoc. Prof. TRAN DO DAT
Ha Noi – 2023
DECLARATION OF AUTHORSHIP
I, Pham Van Dong, declare that the dissertation titled “Speech Synthesis for Low-
Resourced Languages based on Adaptation Approach: Application to Muong Language” has
been entirely composed by myself. I confirm the following points:
• This work was done wholly or mainly while in candidature for a Ph.D.
research degree at Hanoi University of Science and Technology.
• The work has not been submitted for any other degree or qualification at
Hanoi University of Science and Technology or any other institution.
• Appropriate acknowledgment has been given within this dissertation, where
reference has been made to the published work of others.
• The dissertation submitted is my own, except where collaborative work
has been included. The collaborative contributions have been
indicated.
Hanoi, September 19, 2023
Ph.D. Student
Pham Van Dong
ADVISORS
1. Dr. Mac Dang Khoa
2. Assoc. Prof. Tran Do Dat
ACKNOWLEDGMENT
Foremost, I would like to express my most sincere and deepest gratitude to my thesis
advisors, Dr. Mạc Đăng Khoa (Speech Communication Department, MultiLab at MICA) and Prof.
Trần Đỗ Đạt (The Ministry of Science and Technology, Vietnam), for their continuous
support and guidance during my Ph.D. program, and for providing me with such a rigorous and
inspiring research environment. I am grateful to Dr. Mạc Đăng Khoa for his excellent
mentorship, care, patience, and immense Text-To-Speech (TTS) knowledge. His advice
helped me throughout the research and writing of this thesis. I am very thankful to Prof. Đạt for
shaping my thesis at the beginning and for his enthusiasm and encouragement. Prof. Trần Đỗ
Đạt substantially facilitated my Ph.D. research, especially when I was a newcomer to speech
processing and TTS, with his valuable comments on Vietnamese and Muong TTS.
I thank all MICA members for their help during my Ph.D. study. My sincere thanks go to Dr.
Nguyen Viet Son, Assoc. Prof. Dao Trung Kien, and Dr. Do Thi Ngoc Diep for giving me much
support and valuable advice. I also thank Nguyen Van Thinh, Nguyen Tien Thanh, Dang Thanh
Mai, and Vu Thi Hai Ha for their help. I want to thank my colleagues at Hanoi University of
Mining and Geology for all their support during my Ph.D. study. Special thanks to my family
for understanding my hours glued to the computer screen.
Hanoi, September 19, 2023
Ph.D. Student
ABSTRACT
Text-to-speech (TTS) synthesis is the automatic conversion of text into speech. Typically,
building a high-quality voice requires collecting tens of hours of speech from a professional
speaker with a high-quality microphone. There are about 7,000 languages spoken worldwide,
but only a few, such as English, Spanish, Mandarin, and Japanese, have good TTS systems.
So-called "low-resourced languages", and languages that are not yet written, do not have
TTS. Thus, applying TTS technology to low-resourced languages requires studying other TTS
methods.
In Vietnam, Vietnamese is the most widely used mother tongue. Muong is a group
of dialects spoken by the Muong people of Vietnam. These dialects belong to the Austroasiatic
language family and are closely related to Vietnamese. The Muong are also one of the five ethnic
groups with the largest populations. However, Muong still lacks an official script, which makes
it a typical representative of low-resourced languages in Vietnam. Creating a TTS system
for the Muong language is therefore challenging.
In the first part of this thesis, we give an overview of TTS and study the phonetics of the
Vietnamese and Muong languages; we also develop and publish some tools that
support TTS technology for both languages. In the rest of the thesis, we
conduct various experiments in creating TTS for low-resourced languages, specifically
the Muong language. We focus on two main low-resourced language groups:
• Written: We emulate the reading of Muong text with a Vietnamese TTS
system and apply cross-lingual transfer learning for adaptation.
• Unwritten: We experiment with adaptation in two directions. The first is to
create Muong speech synthesis directly from Vietnamese text and Muong
voice. The second is to create Muong speech synthesis from translation
through an intermediate representation.
We hope our findings can serve as an impetus to develop speech synthesis for low-resourced
languages worldwide and provide a basis for developing speech synthesis for the 53 ethnic
minority languages in Vietnam.
Hanoi, September 19, 2023
Ph.D. Student
CONTENTS
DECLARATION OF AUTHORSHIP
ACKNOWLEDGMENT
ABSTRACT
CONTENTS
ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
PART 1: BACKGROUND AND RELATED WORKS
CHAPTER 1. OVERVIEW OF SPEECH SYNTHESIS AND SPEECH SYNTHESIS FOR LOW-RESOURCED LANGUAGE
1.1. Overview of speech synthesis
1.1.1. Overview
1.1.2. TTS architecture
1.1.3. Evolution of TTS methods over time
1.1.3.1. TTS using unit-selection method
1.1.3.2. Statistical parametric speech synthesis
1.1.3.3. Speech synthesis using deep neural networks
1.1.3.4. Neural speech synthesis
1.2. Speech synthesis for low-resourced languages
1.2.1. TTS using emulating input approach
1.2.2. TTS using the polyglot approach
1.2.3. Speech synthesis for low-resourced language using the adaptation approach
1.3. Machine translation
1.3.1. Neural translation model
1.3.2. Attention in neural machine translation
1.3.3. Statistical machine translation based on phrase
1.3.3.1. Statistical machine translation problem based on phrase
1.3.3.2. Translation model and language model
1.3.3.3. Decode the input sentence in the translation system
1.3.3.4. Model for building a statistical translation system
1.3.4. Machine translation through intermediate representation
1.3.5. Speech translation for unwritten low-resourced languages
1.4. Speech synthesis evaluation metrics
1.4.1. Mean Opinion Score (MOS)
1.4.1.1. Definition
1.4.1.2. Formula
1.4.1.3. Significance
1.4.1.4. Confidence Interval (CI)
1.4.2. Mel Cepstral Distortion (MCD)
1.4.2.1. Concept
1.4.2.2. Formula
1.4.2.3. Significance
1.4.2.4. MCD with Dynamic Time Warping (MCD-DTW)
1.4.3. Analysis of variance (ANOVA)
1.4.4. Intelligibility
1.5. Conclusion
CHAPTER 2. VIETNAMESE AND MUONG LANGUAGE
2.1. Vietnamese language
2.1.1. History of Vietnamese
2.1.2. Vietnamese phonetic system
2.1.2.1. Vietnamese syllable structure
2.1.2.2. Vietnamese phonetic system
2.1.2.3. Vietnamese tone system
2.2. Muong language
2.2.1. Overview of Muong people and Muong language
2.2.1.1. Muong history
2.2.1.2. Viet Muong group
2.2.1.3. Muong dialects
2.2.1.4. Muong written script
2.2.2. Muong phonetics system
2.2.2.1. Muong syllable structure
2.2.2.2. Muong phoneme system
2.2.2.3. Muong tone system
2.3. Comparison between Vietnamese and Muong
2.4. Discussion and proposed approach
PART 2: SPEECH SYNTHESIS FOR MUONG AS A WRITTEN LANGUAGE
CHAPTER 3. EMULATING OF THE MUONG TTS BASED ON INPUT TRANSFORMATION OF THE VIETNAMESE TTS
3.1. Proposed method
3.1.1. Muong G2P module
3.1.2. Muong emulating IPA module
3.2. Experiment
3.2.1. Testing materials
3.2.2. Experiment protocol
3.2.3. Results
3.2.4. Analysis by ANOVA method
3.2.4.1. MOS analysis by ANOVA
3.2.4.2. Intelligibility analysis by ANOVA
3.3. Conclusion
CHAPTER 4. CROSS-LINGUAL TRANSFER LEARNING FOR MUONG SPEECH SYNTHESIS
4.1. Proposed method
4.2. Experiment
4.2.1. Dataset
4.2.1.1. Vietnamese data
4.2.1.2. Muong Project's data
4.2.1.3. Muong fine-tuning data
4.2.2. Graphemes to phonemes
4.2.3. Training the pretrained model using the Vietnamese dataset
4.2.4. Fine-tuning the TTS model on Muong datasets
4.3. Evaluation
4.4. MOS analysis by ANOVA
4.5. Conclusion
PART 3: SPEECH SYNTHESIS FOR MUONG AS AN UNWRITTEN LANGUAGE
CHAPTER 5. GENERATE UNWRITTEN LOW-RESOURCED LANGUAGE’S SPEECH DIRECTLY FROM RICH-RESOURCE LANGUAGE’S TEXT
5.1. Introduction
5.2. Proposed method
5.2.1. Model architecture
5.2.2. Database
5.2.3. Training the speech synthesis system
5.2.4. Evaluation
5.2.5. MOS analysis by ANOVA
5.2.5.1. ANOVA analysis in Muong Bi speech synthesis
5.2.5.2. ANOVA analysis in Muong Tan Son speech synthesis
5.3. Conclusion
CHAPTER 6. SPEECH SYNTHESIS FOR UNWRITTEN LOW-RESOURCED LANGUAGE USING INTERMEDIATE REPRESENTATION
6.1. Proposed method
6.2. Experiment
6.2.1. Database building
6.2.2. System development
6.2.2.1. Text to phone translation
6.2.2.2. Phone to sound conversion
6.3. Evaluation
6.3.1. Evaluation in Muong Bi and Muong Tan Son
6.3.2. MOS analysis by ANOVA
6.3.2.1. ANOVA analysis in Muong Bi speech synthesis
6.3.2.2. ANOVA analysis in Muong Tan Son speech synthesis
6.4. Conclusion and comparison
CONCLUSION AND FUTURE WORKS