Talking robots - what's the latest?

honestrosewater · Sep 7, 2005

Can you guys produce natural-sounding speech artificially? How do the systems work? Do you create the sounds using an artificial vocal tract (a robot with lips, teeth, tongue, pharynx, etc.) or just manipulate the sounds using some kind of software? Sorry, I don't really know how to ask the question. I want to make an audio recording of Hamlet without using human actors to play the parts; I would just type up the text with all of the phonetic information, give it to some kind of machine, and get human-like speech in return. Doable?

Oh, and what field(s) would this draw from? Acoustical phonetics, phonetical acoustics, bioacoustics, speech synthesis?

Okay, I found the basic info that I was looking for. It is speech synthesis. But I still would like to know how good they really can be; how natural, flexible, what kind of range, etc. My text would be very detailed. Also, as this is a real project, I will appreciate any suggestions for how I could make it happen. How do I get my hands on one of those speech synthesizers?

hitssquad · Sep 7, 2005

The wear-hard guys are all over this. You should ask on the wear-hard mailing list.
http://wearables.blu.org

AT&T's voice synthesis research is supposed to be pretty advanced.
http://www.research.att.com/projects/tts/demo.html

Nice list of programs, here.
http://aldostools.mysite4now.com/msagent.html

The very best programs do give you control over how the words are pronounced. One thing you might consider is, if you could synthesize speech to match the best voice actors, why would animation filmmakers persist in hiring voice actors?

honestrosewater · Sep 7, 2005

Cool, thanks. The synthesizers I listened to this morning were nowhere near what I need, but they weren't designed for what I need, so I'm still hopeful.

hitssquad · Sep 7, 2005

I think you might want to take a closer look at the AT&T system. The demo doesn't show off what you can do with inflection characters:
http://www.research.att.com/news/2001/October/NaturalVoices.html

A companion offering, AT&T Labs Natural Voices Fonts, provides a fast and effective way to offer customers choice in voices. The fonts let companies select voice sounds, tones, accents and inflections that best meet their needs.

hitssquad · Sep 7, 2005

I have found the AT&T TTS documentation. It looks like it might be able to do what you want:
http://www.naturalvoices.com/support/documentation.html

The TTS engines support the Java Speech Markup Language, the Speech Synthesis Markup Language component of the Voice XML standard, the Microsoft SAPI 4.0 Markup language, and the Microsoft SAPI 5.1 markup languages. These markup languages allow client applications to include special instructions within the input text that may change the default behavior of the text synthesizer.

By the way, did you listen to the demos of the UK voices? I thought those sounded like they might make a pretty good basis for Shakespeare. Since there is only one UK male and one UK female voice, I think I would try multiplying those by changing the default pitches.
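To make the pitch idea concrete, here is a rough Python sketch that wraps each line of dialogue in SSML `voice` and `prosody` elements, so two base voices can be stretched across a cast. The `speak`, `voice`, and `prosody` elements come from the W3C SSML 1.0 specification; the cast table, the pitch offsets, and the function names are made up for illustration, and whether a given engine (AT&T's included) honors these attributes is an assumption to check against its documentation.

```python
from xml.sax.saxutils import escape

# Hypothetical cast: each character reuses one of two base voices
# (male/female) with a relative pitch offset. Values are illustrative only.
CAST = {
    "HAMLET": ("male", "+0%"),
    "HORATIO": ("male", "-10%"),
    "GERTRUDE": ("female", "+0%"),
    "OPHELIA": ("female", "+15%"),
}


def line_to_ssml(character, text):
    """Wrap one spoken line in SSML voice and prosody elements."""
    gender, pitch = CAST[character]
    return (
        f'<voice gender="{gender}">'
        f'<prosody pitch="{pitch}">{escape(text)}</prosody>'
        f"</voice>"
    )


def scene_to_ssml(lines):
    """Build one SSML document from a list of (character, text) pairs."""
    body = "\n".join("  " + line_to_ssml(c, t) for c, t in lines)
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-GB">\n'
        f"{body}\n</speak>"
    )


if __name__ == "__main__":
    print(scene_to_ssml([
        ("HAMLET", "To be, or not to be, that is the question."),
        ("OPHELIA", "Good my lord, how does your honour for this many a day?"),
    ]))
```

The resulting text would then be handed to whatever synthesizer accepts SSML input; that hand-off step is engine-specific and not shown here.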

honestrosewater · Sep 7, 2005

Yes, I'll take a look at them. I'm just not sure the concatenative technologies, the ones manipulating recordings of human speech, are worth it. For instance, I might want to make the voice tremble at times - how do you add that to a recording? I would need fine control over the loudness, duration, and pitch, from whole utterances, down to each phone (that's relative loudness, duration, and pitch too). I would need fine control over the pronunciation of each phone too. A single word could end up having dozens of variations.
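As a rough illustration of what "trembling" means in signal terms, a tremble can be approximated as a slow wobble applied to both pitch and loudness. The Python sketch below does this to a plain sine tone rather than to speech; the function names, the 6 Hz tremor rate, and the depth values are all assumptions chosen for illustration, not measurements of real voices.

```python
import math
import struct
import wave


def trembling_tone(f0=120.0, seconds=1.0, rate=16000,
                   tremor_hz=6.0, pitch_depth=0.03, loud_depth=0.3):
    """Return samples of a sine tone whose pitch and loudness wobble slowly.

    pitch_depth is the fractional pitch swing (0.03 = +/-3 percent);
    loud_depth is how far the amplitude dips below full scale.
    """
    samples = []
    phase = 0.0
    for n in range(round(seconds * rate)):
        t = n / rate
        lfo = math.sin(2 * math.pi * tremor_hz * t)    # slow oscillator, -1..1
        freq = f0 * (1.0 + pitch_depth * lfo)          # wobbling pitch
        phase += 2 * math.pi * freq / rate             # keep phase continuous
        amp = 1.0 - loud_depth * (0.5 + 0.5 * lfo)     # wobbling loudness
        samples.append(amp * math.sin(phase))
    return samples


def write_wav(path, samples, rate=16000):
    """Write mono 16-bit PCM so the result can be auditioned."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        w.writeframes(frames)


if __name__ == "__main__":
    write_wav("tremble.wav", trembling_tone())
```

Applying the same kind of modulation to a recording is harder, because the pitch of recorded speech has to be estimated and resynthesized before it can be bent.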

Untouched recordings are most natural, but you need a lot of them. The more you can manipulate the recordings, the fewer you need, but you lose out on the naturalness, which was the whole point of starting with human recordings. You see what I mean? It seems that making the recordings needed for manipulation would be more work than just having human actors read the text. And if I'd have to do so much manipulation, I may as well just produce the sounds artificially to begin with.

Formant synthesis, creating the speech in an acoustic model, may be able to give me the variety that I need, but I would probably be sacrificing naturalness, i.e., it would have the range but sound like a machine. This isn't necessarily a bad thing. It would depend on how strange the speech sounded.
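For a feel of how little is needed to get a machine-like vowel out of formant synthesis, here is a minimal source-filter sketch in Python: an impulse train (standing in for the glottal pulses) is passed through a cascade of second-order resonators, one per formant. The resonator coefficients follow the usual two-pole digital resonator form; the formant frequencies and bandwidths for "ah" are approximate, illustrative values, and all function names are hypothetical.

```python
import math


def resonator(signal, freq, bandwidth, rate):
    """Two-pole digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    T = 1.0 / rate
    c = -math.exp(-2 * math.pi * bandwidth * T)
    b = 2 * math.exp(-math.pi * bandwidth * T) * math.cos(2 * math.pi * freq * T)
    a = 1.0 - b - c                      # unity gain at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0, seconds, rate):
    """Crude voicing source: one unit impulse per pitch period."""
    period = int(round(rate / f0))
    return [1.0 if n % period == 0 else 0.0
            for n in range(round(seconds * rate))]


def synth_vowel(formants, f0=110.0, seconds=0.5, rate=16000):
    """Filter the source through one resonator per (frequency, bandwidth)."""
    signal = impulse_train(f0, seconds, rate)
    for freq, bw in formants:
        signal = resonator(signal, freq, bw, rate)
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]    # normalize to -1..1


# Approximate, illustrative formants (Hz) and bandwidths (Hz) for "ah".
AH = [(730, 80), (1090, 90), (2440, 120)]

if __name__ == "__main__":
    samples = synth_vowel(AH)
    print(len(samples), "samples")
```

Every parameter here (pitch, formant positions, bandwidths, duration) is under direct control, which is the appeal; the buzzy, robotic quality comes from the oversimplified source and the fixed formants.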

A mechanical synthesizer sounds like the best option, but there doesn't seem to be nearly as much work going on in this area. And it might not be affordable either.
Anyway, I'll keep looking and learning.

honestrosewater · Sep 8, 2005

Oh, and cost isn't the motivation here. I love the idea of 'robots' performing Shakespeare; artificially producing speech of such quality. And I'm not really concerned about naturalness. In fact, given the opportunity, I'd like to do some things that humans cannot. Sounding exactly like a human isn't the goal, though I wouldn't want the speech to sound so alien that it detracts from the play. Having the range, precision, and flexibility of a human is what I'm mostly after.

hitssquad · Sep 8, 2005

The possible irrelevance of differences between hard- and soft- simulation

If mechanical speech were that easy, I would think that it would be possible to teach dogs and chimps -- since they can demonstrably understand human speech and are similar to us in mouth/tongue/face morphology -- to speak to us in whispers. Our mouth/tongue/face muscles must be quite intricate and difficult to coordinate.

Melanie McGee from the wear-hard list has been using a Skeletor skeleton to produce pretend-mechanical speaking (Skeletor moves his mouth and head in rhythm with synthesized speech):
http://www.melmcgee.com/about.php

Apparently her site is being refurbished and old content is not yet back up. I'm sure she'll have Skeletor back up pretty soon.

I have been pondering legitimate mechanical speech myself for a few years now. It seems to me to be a difficult nut to crack, but still I ponder it now and then.

Back to pure digital, I would look at simulating the entire upper body with the possible exception of the circulation system. At the speed of today's computers you might have to wait days or weeks or months for renders of short sections of speech -- just as we have to wait that long for renders of complex povray and motion-picture CGI -- but you might end up with something intricate and indistinguishable-from-meatworld-organic (because it would be organic). According to AI pioneer John McCarthy, everything out here in the world can be simulated in a computer at the very least in non-real-time -- and this is true no matter how slow the computer runs and no matter the computing architecture.
www-formal.stanford.edu/jmc/whatisai/node1.html

Q. Are computers the right kind of machine to be made intelligent?

A. Computers can be programmed to simulate any kind of machine.

Many researchers invented non-computer machines, hoping that they would be intelligent in different ways than the computer programs could be. However, they usually simulate their invented machines on a computer and come to doubt that the new machine is worth building. Because many billions of dollars that have been spent in making computers faster and faster, another kind of machine would have to be very fast to perform better than a program on a computer simulating the machine.

Q. Are computers fast enough to be intelligent?

A. Some people think much faster computers are required as well as new ideas. My own opinion is that the computers of 30 years ago were fast enough if only we knew how to program them. Of course, quite apart from the ambitions of AI researchers, computers will keep getting faster.

Q. What about parallel machines?

A. Machines with many processors are much faster than single processors can be. Parallelism itself presents no advantages, and parallel machines are somewhat awkward to program. When extreme speed is required, it is necessary to face this awkwardness.

honestrosewater · Sep 8, 2005

Well, I have a lot to learn, but maybe text-to-speech is not where I should be looking. I think I could learn how to create the speech; I wouldn't need to feed it text. Maybe I should start learning more about speech waveforms and spectrograms. Maybe I could just create spectrograms and convert those into speech.
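One simple way to picture the spectrogram-to-speech idea is additive synthesis: treat each column of a hand-drawn spectrogram as a set of sine-wave amplitudes and sum the sines. The Python sketch below does that under some loud assumptions: the spectrogram is a plain list of magnitude lists, bin k sits at k times `bin_hz`, each frame lasts a fixed 10 ms, and all names are hypothetical. A magnitude spectrogram carries no phase information, so this sketch just lets each sine run with continuous phase; getting natural results from real spectrograms generally needs a proper phase reconstruction step.

```python
import math


def spectrogram_to_samples(frames, bin_hz, frame_seconds=0.01, rate=16000):
    """Turn a magnitude spectrogram into audio by summing sine waves.

    frames[i][k] is the magnitude of the sine at frequency k * bin_hz
    during frame i. Phases run continuously across frames to avoid clicks
    at frame boundaries (amplitude steps are still abrupt in this sketch).
    """
    n_bins = len(frames[0])
    phases = [0.0] * n_bins
    per_frame = round(frame_seconds * rate)
    out = []
    for mags in frames:
        for _ in range(per_frame):
            sample = 0.0
            for k, m in enumerate(mags):
                phases[k] += 2 * math.pi * k * bin_hz / rate
                if m:
                    sample += m * math.sin(phases[k])
            out.append(sample)
    return out


if __name__ == "__main__":
    # Two frames: a 200 Hz component, then a 400 Hz component.
    drawn = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    audio = spectrogram_to_samples(drawn, bin_hz=200)
    print(len(audio), "samples")
```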

hitssquad · Sep 9, 2005

Peter Jackson comments on Rose's idea

Rose,

Director Peter Jackson happened to comment on this very subject in this week's installment of the King Kong production diary:

http://www.kongisking.net/index.shtml

He says, "Of course, digital doubles will never replace actors. I mean, that's the big fear and I think it's a lot of old nonsense really when people say, 'Oh well, you know, we won't need actors any more -- we've got digital people.' But, you know, digital people don't have hearts and souls and they can't provide everything that an actor can provide in a performance."

honestrosewater · Sep 10, 2005

The end result is just sound waves or images on a screen. I can't imagine how the source of the sound or image would matter. The 'heart and soul' would just go into making the sounds and images. If I decide to do this, I expect to do a lot of acting and directing in creating the speech.

hitssquad · Sep 25, 2005

mediamatic.net/artefact-200.5575.html

http://www.mediamatic.net/image/200.8230-309-475-1.jpg
