Monday, April 1, 2019
TTS Systems for Android
There are different kinds of TTS (text-to-speech) systems already available for personal computers and web applications. On smartphones, only a few TTS systems are available for the Bangla language. Android is now a popular smartphone platform, and the few Bangla TTS systems that exist for it use different mechanisms, techniques and tools. Here we bring these mechanisms together and give a compact overview of the existing systems.

Introduction

More than 250 million people across 4 states of 2 countries speak Bengali. We are looking for a device that can read any Bangla text aloud, and no device is better suited to this than the mobile phone. There are more than 14 million mobile users in Bangladesh, and 30% of them use smartphones. Smartphone use is increasing day by day because the devices are reliable, rich in features, capable of fast internet access and able to run open-source applications. These features make communication much easier, and most communication now happens over text messaging. Many TTS engines are available for English and a few other languages. For Bangla, a few TTS systems are also available on smartphones.

Text and speech are both powerful media of communication. Converting text to speech, or speech to text, would be a great achievement in the communication life cycle and would make communication easier than before.
People would be able to speak in their own language just by texting on a mobile phone.

Speech is the most natural form of communication and interaction. Speech synthesis is a major part of a TTS engine, and for Bangla it has been done in many different ways by different authors. From their work we can get the basic idea of speech synthesis techniques.

TTS engines still rely on pre-recorded voices. The system ultimately renders a symbolic linguistic representation. We therefore discuss the existing systems and the possibilities for making the voice more realistic. The concatenation of the final speech tokens should be modeled on real communication.

Recorded voices are stored in a database. Systems differ in the size of the stored units. Because the speech or words are recorded by humans, the clarity may vary.

Most authors put most of their effort into code optimization and database compression. They have also tried to find new methods of speech synthesis.

Android is a popular smartphone operating system because it allows open-source applications to be installed and used, so anyone can try to build better applications for personal or business use. It is therefore important to build a Bangla TTS for Android.

The purpose of our research is to introduce the best mobile TTS systems for Bangla on the Android platform, to assess the quality of the research and its findings, and to point out possible future work. We discuss the key points of each group of authors and, at the end, show a comparison among them.

Education and research on Bangla TTS engines have improved greatly in the last few years. Many publications are available for Android, and we discuss a few of them here.
Case Study 1

After studying the paper "A Bengali Speech Synthesizer on Android OS" by Sankar Mukherjee and Shyamal Kumar Das Mandal, we found that the authors were trying to develop a Bengali speech synthesizer on a mobile device. They used the Epoch Synchronous Non Overlap Add (ESNOLA) concatenative speech synthesis technique for speech generation. They worked hard on database compression because storage space was very limited. Earlier systems used a small diphone database, which reduced the quality of the synthesized speech. On the other hand, Pucher, M. and Frohlich (2005) introduced a large unit selection database and used a server to synthesize the output speech, so the waveform had to be transferred to the mobile device over a network. They aimed for quality output in almost real time on the mobile device.

Speech synthesis is the conversion of input text data to speech waveforms. The synthesis method is determined by the vocabulary size, because the utterances to be spoken need to be modeled. There are many speech synthesis techniques, such as rule-based synthesis, articulatory modeling and concatenative synthesis. Here the authors built their synthesizer on the ESNOLA concatenative method. ESNOLA provides the processing needed for proper matching between segments during concatenation, and it supports an unlimited vocabulary without decreasing the quality. It can therefore be regarded as a good speech synthesis technique.

They designed their full operational method as shown in their diagram. They divided the system into 4 sections, including the input text and output speech states. Between these sit two important stages: the text analysis module and the synthesizer module.
These two modules are where the major operations are performed.

Good speech requires many things, such as intonation, prosody and phonological rules. Exception handling in particular is mandatory when converting text to speech. In this paper the authors tried to address all of these parts. Their system model introduces a text analysis module with two sections: a phonological analysis module, and an analysis of the text for prosody and intonation. Exceptional words are handled in the first, phonological rule part. They developed and implemented the phonological rule analysis of the text for prosody and intonation as in (Basu, J et al., 2009). They also worked with an exception dictionary, as language analysis requires. All text-related processing ends in the phonological analysis module, and synthesis is done by the next module.

The synthesizer module generates realistic, good-quality speech. After receiving the finalized text from the text analysis module, it generates tokens, combines splices of pre-recorded speech and generates the synthesized voice output using the ESNOLA approach, as in Shyamal Kr Das Mandal, et al. (2007). In the ESNOLA approach, the synthesized output speech is generated by concatenating the basic signal segments from the signal dictionary at epoch positions.

They give an example of how a word is synthesized from its segments: bh + bha + a + aL + o.

They implemented their application on a stated system specification.

Memory management is a major issue on the Android platform. In this paper they mention that "This context will live as long as this application is alive and does not depend on the activities life cycle. It is obtained by calling Activity.getApplication()." They kept the partneme database on the external storage card.
Best of all, the final speech file is deleted once the output has been produced.

This TTS system stores a total of 596 sound files in the partneme database. The total size of the database is 1.0 MB and the application size is 2.26 MB. The best part of this TTS system is that it can read Bengali messages from the phone's inbox, and it can also generate speech from Bengali words written in the English alphabet.

Performance and quality evaluation is a major part of any application. Here the total processing time is counted from the start time (when the button is pressed to speak) to the moment the first speech sound is pronounced. They tested the application in many ways, and the results are given below.

They also had listeners judge their application. To measure the output speech quality, 5 subjects were selected, 3 male (L1, L2, L3) and 2 female (L4, L5), with ages ranging from 24 to 50. 10 original sentences (as uttered by a speaker) and 10 modified sentences (as uttered by the Android version) were presented in random order for listening, and the subjects rated them on a 5-point scale (1 = less natural, 5 = most natural). The result is given below.

The total average score is 4.72 for the original sentences and 2.88 for the modified sentences.

In their paper, they describe the implementation of a Bengali speech synthesizer on a mobile device. Their goal was to develop a text-to-speech (TTS) application that can produce speech in real time. They modified several components of ESNOLA to make it run on an Android device.

Case Study 2

The objective of a TTS engine is to convert natural language text into its spoken equivalent through a series of modules. For a better TTS engine, language modeling and speech synthesis are the major units. After studying the paper "Text to speech for Bangla Language using Festival" by Firoj Alam, Promila Kanti Nath and Dr. Mumit Khan, we found that they used the open-source third-party Festival TTS engine.
Festival provides a framework for building speech synthesis systems. The Festival system is written in C++, uses the Edinburgh Speech Tools Library for its low-level architecture and has a Scheme (SIOD) based command interpreter for control. Festival also provides API documentation. In their TTS engine the authors used two kinds of concatenative methods supported by Festival: unit selection and multisyn unit selection.

In their research they discuss text analysis, phonetic analysis, grapheme-to-phoneme conversion, prosodic analysis, the speech database or waveform synthesis, speech output and analysis of the output results.

The input text may arrive in a non-standard form. To deal with this, they use the text analysis part to convert all non-standard words to standard words. Their grapheme-to-phoneme module produces strings of phonemic symbols based on information in the written text. Final speech synthesis is accomplished by the concatenative unit selection and multisyn unit selection techniques.

In their proposed system the first step is text analysis. The job of a TTS engine is to convert the input text to equivalent speech, so the input text must first be converted to a standard format. There is always a chance that the input text contains NSW (Non-Standard Word) type words. The authors list the NSW types as numbers (year, time, ordinal, cardinal, floating point), abbreviations, acronyms, currency, dates and URLs. They use text normalization to convert NSW to SW (Standard Word), and they disambiguate ambiguous tokens using rules.

They did not work with Unicode directly, because Festival does not support Unicode, so they converted Unicode text to ASCII. In the text analysis part they split the text into tokens based on whitespace and punctuation. Whitespace is treated as a separator, and punctuation can further separate the raw tokens.
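The whitespace-and-punctuation splitting described above can be sketched in a few lines of Python. This is only an illustration, not Festival's tokenizer: the function name `tokenize` and the punctuation set `PUNCT` are assumptions made for the example. Only leading and trailing punctuation is peeled off, so that a token such as `3.5` stays whole for a later NSW classifier.

```python
# Punctuation characters treated as separate tokens. This set is an
# assumption for illustration; it includes the Bangla danda (U+0964).
PUNCT = set(".,;:!?()\"'\u0964")


def tokenize(text):
    """Split on whitespace, then peel leading/trailing punctuation."""
    tokens = []
    for raw in text.split():          # whitespace is the separator
        leading, trailing = [], []
        while raw and raw[0] in PUNCT:
            leading.append(raw[0])
            raw = raw[1:]
        while raw and raw[-1] in PUNCT:
            trailing.insert(0, raw[-1])
            raw = raw[:-1]
        tokens.extend(leading)
        if raw:                       # the word or NSW itself, if any remains
            tokens.append(raw)
        tokens.extend(trailing)
    return tokens
```

For instance, `tokenize("at 3.5 pm, ok?")` yields `['at', '3.5', 'pm', ',', 'ok', '?']`, leaving the floating-point token intact for classification.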
Festival produces an ordered list of tokens, each with whitespace and punctuation features. Whitespace is the most commonly used separator for tokenization. The authors identified more than 10 types of NSW in Bangla, so each NSW can be identified as a separate token by token identifier rules. They used Scheme regular expressions in Festival to identify the tokens. After identifying all NSW, they convert them to standard words using a pronunciation lexicon or letter-to-sound (LTS) rules. The pronunciation of a word sometimes does not match its written form. They solved this problem by using a lexicon together with LTS rules, and they included 900 lexicon entries with their pronunciations in the lexicon dictionary.

The steps of phonetic analysis within Festival:
1. Building a large lexicon.
2. Building letter-to-sound rules.

They used three techniques for concatenative synthesis: diphone, unit selection and multisyn unit selection.

They identified 45 phones, excluding 31 diphthongs, with their features based on articulatory analysis. The diphthongs are included when building the diphone database, but were excluded from their implementation. The durations they added are taken from a Kiswahili TTS system, so they are not the exact durations for the phone set of the Bangla language.

They recorded approximately 500-900 utterances to cover the most frequent words of the language. The quality of the system was tested in two ways: in terms of acceptability/naturalness and in terms of intelligibility. Synthesized speech was evaluated on three levels: sentence, phrase and word. At sentence level the intelligibility rate was close to 85%, at phrase level 83.33% and at word level 56.66%. In their second experiment the degree of naturalness of the synthesized speech was assessed, again at sentence (90%), phrase (85%) and word (65%) level. The results obtained are shown in the figure below.
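Returning to the pronunciation step of this case study, the lookup order (lexicon first, letter-to-sound rules as a fallback) can be sketched as follows. Everything here is hypothetical: the lexicon entry, the digraph table and the function names are invented for illustration and are not taken from the paper or from Festival, whose lexicons and LTS rules are written in Scheme. Romanized ASCII input is assumed, in line with the Unicode-to-ASCII conversion the authors describe.

```python
# Hypothetical exception lexicon: word -> phone list. The entry is invented
# to show a case where spelling and pronunciation differ.
LEXICON = {"bangla": ["b", "a", "ng", "l", "a"]}

# Toy letter-to-sound rules (an assumption): two-letter graphemes map to a
# single phone, and every other letter maps to itself.
DIGRAPHS = {"bh": "bh", "kh": "kh", "sh": "sh"}


def letter_to_sound(word):
    """Apply the toy LTS rules left to right, longest match first."""
    phones = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phones.append(DIGRAPHS[pair])
            i += 2
        else:
            phones.append(word[i])
            i += 1
    return phones


def pronounce(word):
    """Look the word up in the lexicon; fall back to LTS rules if absent."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    return letter_to_sound(word)
```

The design point is the ordering: listed exceptions always win, so the rule set can stay small and general while irregular words are handled by adding lexicon entries.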
Case Study 3

Their model consists of three parts. The first is the linguistic module, which generates a linguistic representation from text. The second is the acoustic module, which generates speech from the linguistic representation. The third and final one is the visual module, which drives a talking head based on the linguistic representation.

They created a relational lexical database from three source lexica: The Carnegie Mellon Pronouncing Dictionary, Moby Pronunciator II and the COMLEX English pronouncing lexicon. It contains almost 200,000 word entries, of which over 1500 are non-homophonous homographs. The interesting part of their project is that they use an animated image that moves along with the speech. The linguistic module tokenizes the textual input and looks up word pronunciations and tags in the lexical database. For words that are not present in the lexical database they use a dynamic programming alignment algorithm, an algorithm described for aligning sequences over the same alphabet. In the letter-to-sound neural network they define the features of a letter to be the union of the features of the phones that the letter might represent. Having obtained competitive results, they expected that further improvement would come from simplifying the phonological representations found in the dictionary. In this way they build a preliminary linguistic representation of the utterance. The linguistic representation is then submitted to a postlexical module, where lexical pronunciations derived from the lexicon are converted to postlexical pronunciations typical of the speaker. The distance to word, phrase, clause and sentence boundaries is also taken into account.

After converting the linguistic representation they pass it to the acoustic module, which has three stages: 1. the duration neural network, 2. the phonetic neural network and 3. the waveform synthesizer.
The acoustic module establishes the timing of the speech signal by associating a segment duration with each phone in the linguistic representation. An acoustic representation, consisting of input parameters for the synthesis portion of a vocoder, is generated for each ten-millisecond frame of speech. Finally, the synthesis portion of the vocoder is used to generate speech from these acoustic descriptions. The most interesting part of their model is that it provides video for the speech, so that the result looks natural. For that reason they take the animated image from nature. The video subsystem takes the output of the linguistic module and the output of the duration neural network and generates an animated figure using an additional neural network.

Case Study 4

Sanghamitra Mohanty has developed a very intelligent tool that provides speech output in four Indian languages: Hindi, Odiya, Bengali and Telugu. For all the languages she uses a common system, which she named Priyambada. She found that Indian languages are phonetic in nature and that the grapheme-to-phoneme mapping is linear, so the vowels and consonants of the languages are almost the same, with some exceptions. She took those exceptions into consideration and applied an algorithm for them. We found three stages in this TTS system. The first is speech corpora creation. Here she identified native speakers of the four languages and recorded them in a laboratory environment using a noise-cancelling microphone. The recordings are 16-bit, single channel (mono), sampled at 16000 Hz. In this way she collected the voices of the speakers. Secondly, she created a database of the different syllables from the text, storing the individual syllables for the different languages in .wav file format. Finally, she plays the .wav files for the represented data. She does not give a solution for new words that are not present in her database.
She developed this tool in C++, and it plays a very important role.

Case Study 5

The authors focus on normalizing the text. Their workflow is broadly the same as in the other systems: the processes are tokenization, token classification, token sense disambiguation and word representation. They found some ambiguous tokens in the Bangla language. For example, Bangla text mixes in many languages (English, Arabic, Hindi etc.). The most challenging tokens are numbers, dates, years, times, multi-text genre and so on. To solve this problem they found two ways: one is to tokenize normal Bangla text, and the other is to handle the ambiguous words.

They use three stages to tokenize a word:
i) A tokenizer, which tokenizes English and other South Asian scripts within Bangla text.
ii) A splitter, which handles punctuation and delimiters.
iii) A classifier, which tokenizes phone numbers, years, times and floating-point numbers. It also checks the contextual rules. The different forms of delimiters are removed in this stage, and the regular expressions written in .jflex format for each type of token are all checked in this stage.

A further part makes the ambiguous tokens sound natural. Ambiguous words such as non-natural numbers, cardinals, ordinals, acronyms and abbreviations are expanded so that they sound natural. For numbers they use the following stages:
(i) Traverse from right to left.
(ii) Map the first two digits to the lexicon to get the expanded form (e.g. 10 becomes ten).
(iii) After the expanded form of the third digit, insert the token hundred.
(iv) Get the expanded form of each pair of digits after the third digit from the lexicon.
(v) Insert the token thousand after the expanded form of the fourth and fifth digits, and lakh after the expanded form of the sixth and seventh digits.
They continue these stages for longer numbers.
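The right-to-left grouping in stages (i) to (v) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: English number words stand in for the Bangla lexicon entries, the helper names `lookup_pair` and `expand_number` are hypothetical, and the koti token, which the source introduces right after these stages, is applied to whatever digits remain beyond the lakh group.

```python
# English number words stand in for the Bangla lexicon entries (an assumption
# made for readability; the described system looks each pair up in a lexicon).
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def lookup_pair(n):
    """Hypothetical lexicon lookup for a value from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]


def expand_number(digits):
    """Expand a digit string with hundred / thousand / lakh / koti tokens."""
    n = int(digits)
    if n == 0:
        return ONES[0]
    # Peel groups off from the right: 2 digits, 1 digit, 2 digits, 2 digits.
    n, last_two = divmod(n, 100)
    n, hundreds = divmod(n, 10)
    n, thousands = divmod(n, 100)
    koti, lakhs = divmod(n, 100)
    parts = []
    if koti:                      # anything left over is counted in koti
        parts.append(expand_number(str(koti)) + " koti")
    if lakhs:
        parts.append(lookup_pair(lakhs) + " lakh")
    if thousands:
        parts.append(lookup_pair(thousands) + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if last_two:
        parts.append(lookup_pair(last_two))
    return " ".join(parts)
```

For example, `expand_number("1234567")` gives `twelve lakh thirty four thousand five hundred sixty seven`, which follows the 2-1-2-2 digit grouping rather than the Western groups of three.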
For the digits beyond the lakh block they insert the token koti, so that the expansion sounds natural.

In this way they believe they can handle 99% of the ambiguous words correctly.

Summary of the case studies

Topic | Case study 1 | Case study 2 | Case study 3 | Case study 4 | Case study 5
Tools | ESNOLA | Festival | NA | Priyambada | JFlex
Processing text type | side | ASCII, Unicode | English | Not defined | English
Input text type | Bangla | English | English | English | English
Voice source | Pre-recorded | Pre-recorded | Pre-recorded | Pre-recorded | Pre-recorded
Total modules | 2 | 3 | | |
Audio format | Not defined | Not defined | Not defined | .wav | Not defined
Intonation | Yes | Yes | Yes | Yes | Yes
Utterance | Yes | Yes | Yes | Yes | Yes
Prosody | Yes | Yes | Yes | Yes | Yes
Phonological words | Yes | Yes | Not defined | Not defined | Yes
Exception handling | Yes | Yes | No | No | Yes
Database length | 596 files | Not defined | 200,000 | Not defined | Not defined
Database size | 1.0 MB | Not defined | Not defined | Not defined | Not defined
Speech quality evaluation | 2.88 out of 5.00 | | | |
Intelligibility rate | No | 85% | No | No | Yes
Word processing speed | 0.45 sec / 2 words (6 syllables) | Not defined | Not defined | Not defined | Not defined
Accuracy | 57.8% | 85% | 87% | Not defined | 99% for ambiguous words

References

1. Francesc Alias, Xavier Sevillano, Joan Claudi Socoro and Xavier Gonzalvo. Towards High-Quality Next-Generation Text-to-Speech Synthesis: A Multidomain Approach by Automatic Domain Classification. IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 7, September 2008.
2. Qing Guo, Jie Zhang, Nobuyuki Katae and Hao Yu. High-Quality Prosody Generation in Mandarin Text-to-Speech System. Fujitsu Sci. Tech. J., vol. 46, no. 1, pp. 40-46, 2010.
3. Gopalakrishna Anumanchipalli, Rahul Chitturi, Sachin Joshi, Rohit Kumar, Satinder Pal Singh, R.N.V. Sitaram and D.P. Kishore. Development of Indian Language Speech Databases for Large Vocabulary Speech Recognition System.
4. A. Black, H. Zen and K. Tokuda. Statistical parametric speech synthesis. In Proc. ICASSP, Honolulu, HI, 2007, vol. IV, pp. 1229-1232.
5. G. Bailly, N. Campbell and B. Mobius. ISCA special session: hot topics in speech synthesis. In Proc. Eurospeech, Geneva, Switzerland, 2003, pp. 37-40.
6. M. Ostendorf and I. Bulyko. The impact of speech recognition on speech synthesis. In Proc. IEEE Workshop on Speech Synthesis, Santa Monica, 2002, pp. 99-106.
7. Jaibatrik Dutta. Text To Speech Synthesis. A knol.
8. Silvio Ferreira, Celine Thillou and Bernard Gosselin. From Picture to Speech: an Innovative Application for Embedded Environment.
9. M. Nageshwara Rao, Samuel Thomas, T. Nagarajan and Hema A. Murthy. Text-to-Speech Synthesis using syllable-like units.
10. Jindrich Matousek, Josef Psutka and Jiri Kruta. Design of Speech Corpus for Text-to-Speech Synthesis. Beckman M. and Elam G. Guidelines for ToBI Labeling. Manuscript, version 3, 1997.
11. Corrigan G., Massey N. and Karaali O. Generating Segment Durations in a Text-to-Speech System: A Hybrid Rule-Based/Neural Network Approach. Proc. Eurospeech 97, Rhodes, September 1997.
12. Gerson I., Karaali O., Corrigan G. and Massey N. Neural Network Speech Synthesis. Speech Science and Technology (SST-96), Australia, 1996.
13. Karaali O., Corrigan G. and Gerson I. Speech Synthesis with Neural Networks. Invited paper, World Congress on Neural Networks (WCNN-96), San Diego, September 1996.
14. Karaali O., Corrigan G., Gerson I. and Massey N. Text-to-Speech Conversion with Neural Networks: A Recurrent TDNN Approach. Proc. Eurospeech 97, September 1997.
15. Kiparsky P. Lexical phonology and morphology. Linguistics in the Morning Calm, ed. by I.S. Yang. Seoul: Hanshin, 1982.
16. Kruskal J. An overview of sequence comparison. Time Warps, String Edits, and Macromolecules, edited by Joseph Kruskal and David Sankoff. Reading, MA: Addison-Wesley, 1983.
17. Linguistic Data Consortium. COMLEX English pronouncing lexicon. Trustees of the University of Pennsylvania, version 0.2, 1995.
18. Miller C., Karaali O. and Massey N. Variation and Synthetic Speech. NWAVE 26, Quebec, October 1997.
19. Nusbaum H., Francis A. and Luks T. Comparative evaluation of the quality of synthetic speech produced at Motorola. Research report, Spoken Language Research Laboratory, University of Chicago, 1995.
20. O'Shaughnessy D. Modeling fundamental frequency, and its relationship to syntax, semantics, and phonetics. Ph.D. thesis, M.I.T., 1976.
21. Sejnowski T. and Rosenberg C. NETtalk: a parallel network that learns to pronounce English text. Complex Systems 1.145-168, 1987.
22. Seneff S. and Zue V. Transcription and alignment of the TIMIT database. M.I.T., 1988.
23. Tuerk C. and Robinson T. Speech Synthesis using Artificial Neural Networks Trained on Cepstral Coefficients. Proc. Eurospeech 93, Berlin, September 1993.
24. Ward G. Moby Pronunciator II, 1996.
25. Weide R. The Carnegie Mellon Pronouncing Dictionary. cmudict.0.4, 1995.