
Humanoid or Vocaloid?

Aug 1, 2003 12:00 PM, By Scott Wilkinson




Many acoustic instruments can be simulated convincingly with various synthesis techniques, such as sampling and physical modeling. But one instrument has resisted most simulation attempts: the singing voice. That is because singing exhibits an unusually wide range of timbres, articulations, and transitions between sounds. In addition, singing usually communicates lyrics as well as melody, which results in a double layer of meaning not found in other instruments. Finally, the human ear is so attuned to the voice that the subtlest tonal shifts, errors, or anomalies are immediately apparent.

At the 2003 Musikmesse in Germany and the Audio Engineering Society convention in the Netherlands this past March, Yamaha demonstrated a new vocal-synthesis technology called Vocaloid (www.global.yamaha.com/news/20030304b.html), which achieves a new level of sophistication in this area. Using Visual C++ on a Windows computer, a team at the Yamaha Advanced System Development Center in Japan has written software that mimics the singing voice with surprising accuracy.

The team starts with recordings of professional male and female vocalists singing specially constructed phrases of nonsense words that contain all possible transitions between syllables. The transitions differ slightly depending on which speech sounds, called phonemes, are being combined. Those differences are a big part of how we understand words and why a vocal track sounds natural or artificial. For example, the phoneme p sounds slightly different at the beginning of a word than it does at the end, and it affects the vowels next to it differently than, say, the phoneme t.

The recorded phrases are converted to the frequency domain using a Fast Fourier Transform (FFT) and divided into separate phonetic transitions. Those elements are then stored in a phonetic database for use with the synthesis engine. Expressive elements such as vibrato, pitch bend, and attack are also extracted and stored in a separate database.
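To make the frequency-domain step concrete, here is a minimal sketch of a radix-2 FFT in plain Python. It is not Yamaha's code (the team worked in Visual C++, and the article gives no implementation details); the function names fft and peak_bin, the 64-sample frame, and the test tone are all assumptions chosen for illustration. The sketch simply shows that a pure tone placed on the fifth frequency bin shows up as a peak in that bin after the transform.

```python
import cmath
import math

def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return list(samples)
    even = fft(samples[0::2])   # transform of even-indexed samples
    odd = fft(samples[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def peak_bin(samples):
    """Index of the strongest bin in the lower half of the spectrum."""
    spectrum = fft(samples)
    half = spectrum[: len(samples) // 2]
    return max(range(len(half)), key=lambda k: abs(half[k]))

# Hypothetical test frame: 64 samples holding exactly five cycles of a sine.
N = 64
tone = [math.sin(2 * math.pi * 5 * i / N) for i in range(N)]
print(peak_bin(tone))
```

A real analysis stage would window overlapping frames of the recording before transforming them; that detail is omitted here.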

To create a vocal track, you enter music and lyrics into the score editor (see Fig. 1). The music can be entered manually or imported from a Standard MIDI File; the lyrics must be entered manually. Expressive elements can be imported from a MIDI File as Control Change messages or entered from a graphic palette.

The data from the score file is sent to the synthesis engine, which draws on the phonetic and expression databases to create the track. To sing the word "part," for example, the software combines four elements from the phonetic database: p (as it sounds at the beginning of a word), p-ar (the transition from p to ar), ar-t (the transition from ar to t), and t (as it sounds at the end of a word). The two ar elements are blended together, and the resulting vowel a is lengthened to accommodate the melodic line.
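The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name units_for is hypothetical, and the rule it applies (the word-initial phoneme, then every adjacent transition, then the word-final phoneme) is a generalization from the single four-element example in the text, not a documented Vocaloid algorithm.

```python
def units_for(phonemes):
    """Return the database element names needed to sing one word.

    Assumed rule: opening phoneme, each phoneme-to-phoneme transition,
    then closing phoneme.
    """
    if not phonemes:
        return []
    if len(phonemes) == 1:
        return [phonemes[0]]
    units = [phonemes[0]]                      # as it sounds at the start
    for left, right in zip(phonemes, phonemes[1:]):
        units.append(f"{left}-{right}")        # transition element
    units.append(phonemes[-1])                 # as it sounds at the end
    return units

print(units_for(["p", "ar", "t"]))
```

For the three phonemes p, ar, and t, the sketch yields the same four elements listed in the text. Blending the two ar halves and stretching the vowel would be separate steps performed on the audio data itself.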

Different pitches are derived by shifting the fundamental and its overtones while leaving the vowel formants relatively untouched. The database elements were originally sung at different pitches, which limits the amount of shifting the engine must do. On a 2 GHz Pentium 4 computer, rendering the track and converting it back into the time domain takes less than one-third of real time; for example, a 1-minute track can be rendered in less than 20 seconds.
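The idea of moving the harmonics while holding the formants still can be illustrated with a toy source-filter model in Python. Everything here is an assumption made for the sketch: the formant centers and bandwidths are rough, illustrative figures for an "ah"-like vowel, the Gaussian envelope is a simplification, and the function names are hypothetical. The article does not describe how Vocaloid actually performs the shift.

```python
import math

# (center Hz, bandwidth Hz) pairs; rough illustrative values only.
FORMANTS_HZ = [(700.0, 100.0), (1200.0, 120.0)]

def envelope(freq_hz):
    """Fixed spectral envelope: one Gaussian bump per formant."""
    return sum(math.exp(-((freq_hz - c) / bw) ** 2) for c, bw in FORMANTS_HZ)

def harmonics(f0_hz, count=12):
    """(frequency, amplitude) for each harmonic of the fundamental f0_hz."""
    return [(k * f0_hz, envelope(k * f0_hz)) for k in range(1, count + 1)]

def strongest(f0_hz):
    """Frequency of the loudest harmonic at this fundamental."""
    return max(harmonics(f0_hz), key=lambda h: h[1])[0]

# Shift the fundamental from 100 Hz to 140 Hz: every harmonic moves,
# but amplitudes are still read from the same, unshifted envelope.
for f0 in (100.0, 140.0):
    print(f0, strongest(f0))
```

In this toy model the loudest harmonic stays at the first formant for both fundamentals, even though all of the harmonic frequencies have moved. That is the property that lets the vowel remain recognizable when the pitch changes.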

Yamaha intends to commercialize Vocaloid by licensing it to producers of vocal libraries and software marketing firms. The obvious applications include background vocals and rough sketches of arrangements. However, the potential for this technology is virtually unlimited.



© 2009 Penton Media, Inc.
