
Talking robots - what's the latest?

  1. Sep 7, 2005 #1



    Can you guys produce natural-sounding speech artificially? How do the systems work? Do you create the sounds using an artificial vocal tract (a robot with lips, teeth, tongue, pharynx, etc.) or just manipulate the sounds using some kind of software? Sorry, I don't really know how to ask the question. I want to make an audio recording of Hamlet without using human actors to play the parts; I would just type up the text with all of the phonetic information, give it to some kind of machine, and get human-like speech in return. Doable?

    Oh, and what field(s) would this draw from? Acoustical phonetics, phonetical acoustics, bioacoustics, speech synthesis?

    Okay, I found the basic info that I was looking for. It is speech synthesis. But I still would like to know how good they really can be; how natural, flexible, what kind of range, etc. My text would be very detailed. Also, as this is a real project, I will appreciate any suggestions for how I could make it happen. How do I get my hands on one of those speech synthesizers? :biggrin:
    Last edited: Sep 7, 2005
  3. Sep 7, 2005 #2
    The wear-hard guys are all over this. You should ask on the wear-hard mailing list.

    AT&T's voice synthesis research is supposed to be pretty advanced.
    http://www.research.att.com/projects/tts/demo.html [Broken]

    Nice list of programs, here.
    http://aldostools.mysite4now.com/msagent.html [Broken]

    The very best programs do give you control over how the words are pronounced. One thing you might consider is, if you could synthesize speech to match the best voice actors, why would animation filmmakers persist in hiring voice actors?
    Last edited by a moderator: May 2, 2017
  4. Sep 7, 2005 #3



    Cool, thanks. The synthesizers I listened to this morning were nowhere near what I need, but they weren't designed for what I need, so I'm still hopeful.
  5. Sep 7, 2005 #4
    I think you might want to take a closer look at the AT&T system. The demo doesn't show off what you can do with inflection characters:
    http://www.research.att.com/news/2001/October/NaturalVoices.html [Broken]

    Last edited by a moderator: May 2, 2017
  6. Sep 7, 2005 #5
    I have found the AT&T TTS documentation. It looks like it might be able to do what you want:
    http://www.naturalvoices.com/support/documentation.html [Broken]

    By the way, did you listen to the demos of the UK voices? I thought those sounded like they might make a pretty good basis for Shakespeare. Since there is only one UK male and one UK female voice, I think I would try multiplying those by changing the default pitches.
    Last edited by a moderator: May 2, 2017
  7. Sep 7, 2005 #6



    Yes, I'll take a look at them. I'm just not sure the concatenative technologies, the ones manipulating recordings of human speech, are worth it. For instance, I might want to make the voice tremble at times - how do you add that to a recording? I would need fine control over the loudness, duration, and pitch, from whole utterances down to each phone (that's relative loudness, duration, and pitch too). I would need fine control over the pronunciation of each phone too. A single word could end up having dozens of variations.
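
    To make that kind of control concrete, here is a minimal Python sketch. It is an editorial illustration only, not the method of any system mentioned in this thread. It renders a sequence of segments, each with its own duration, pitch, and loudness, and adds a "tremble" as a slow wobble in pitch. The names `render_segment` and `render`, the 16 kHz sample rate, and the 6 Hz tremble rate are assumptions chosen for illustration, and a plain sine wave stands in for a real phone.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)


def render_segment(duration_s, pitch_hz, loudness,
                   tremble_depth=0.0, tremble_hz=6.0, phase=0.0):
    """Render one segment as a sine tone and return (samples, end_phase).

    tremble_depth is the fractional pitch wobble (0.05 means +/- 5 %).
    The running phase is returned so consecutive segments join smoothly.
    """
    n = round(duration_s * SAMPLE_RATE)
    samples = []
    for i in range(n):
        t = i / SAMPLE_RATE
        wobble = math.sin(2 * math.pi * tremble_hz * t)
        freq = pitch_hz * (1.0 + tremble_depth * wobble)
        samples.append(loudness * math.sin(phase))
        phase += 2 * math.pi * freq / SAMPLE_RATE
    return samples, phase


def render(segments):
    """Concatenate segments, carrying the phase across boundaries."""
    out = []
    phase = 0.0
    for seg in segments:
        samples, phase = render_segment(phase=phase, **seg)
        out.extend(samples)
    return out


# Hypothetical two-segment "utterance": a steady tone, then a quieter,
# higher, trembling one.
utterance = [
    {"duration_s": 0.1, "pitch_hz": 120.0, "loudness": 0.8},
    {"duration_s": 0.2, "pitch_hz": 150.0, "loudness": 0.5,
     "tremble_depth": 0.05},
]
samples = render(utterance)
```

    The point of the sketch is only that duration, pitch, and loudness can each be set per segment when the sound is generated rather than recorded; applying the same changes to recorded speech is a harder problem.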

    Untouched recordings are most natural, but you need a lot of them. The more you can manipulate the recordings, the fewer you need, but you lose out on the naturalness, which was the whole point of starting with human recordings. You see what I mean? It seems that making the recordings needed for manipulation would be more work than just having human actors read the text. And if I'd have to do so much manipulation, I may as well just produce the sounds artificially to begin with.

    Formant synthesis, creating the speech in an acoustic model, may be able to give me the variety that I need, but I would probably be sacrificing naturalness, i.e., it would have the range but sound like a machine. This isn't necessarily a bad thing. It would depend on how strange the speech sounded.
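
    For readers unfamiliar with the idea, a formant synthesizer can be sketched as a source feeding a filter: a buzz at the voice pitch is passed through a few resonators tuned to the formant frequencies. The Python below is a minimal sketch under that assumption, using a cascade of second-order digital resonators. The function names, the 16 kHz sample rate, and the formant values (rough figures for an adult male "ah") are illustrative assumptions, not taken from any product discussed here.

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)


def impulse_train(duration_s, f0_hz):
    """Crude voice source: one unit impulse per pitch period."""
    n = round(duration_s * SAMPLE_RATE)
    period = round(SAMPLE_RATE / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def resonator(signal, freq_hz, bandwidth_hz):
    """Second-order resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / SAMPLE_RATE
    c = -math.exp(-2 * math.pi * bandwidth_hz * t)
    b = (2 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # gives unity gain at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(formants, duration_s=0.3, f0_hz=120.0):
    """Pass the source through one resonator per (freq, bandwidth) pair."""
    signal = impulse_train(duration_s, f0_hz)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz)
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]  # normalize to [-1, 1]


def write_wav(path, samples):
    """Save float samples in [-1, 1] as 16-bit mono WAV."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack("<%dh" % len(ints), *ints))


# Illustrative (assumed) formants for an "ah"-like vowel: (Hz, bandwidth Hz)
AH_FORMANTS = [(730.0, 80.0), (1090.0, 90.0), (2440.0, 120.0)]
vowel = synthesize_vowel(AH_FORMANTS)
# write_wav("ah.wav", vowel)  # uncomment to listen to the result
```

    Every parameter here (pitch, formant frequencies, bandwidths) can be changed freely, which is the flexibility in question; the buzzy, machine-like quality of the output is the trade-off.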

    A mechanical synthesizer sounds like the best option, but there doesn't seem to be nearly as much work going on in this area. And it might not be affordable either.
    Anyway, I'll keep looking and learning. :smile:
    Last edited: Sep 8, 2005
  8. Sep 8, 2005 #7



    Oh, and cost isn't the motivation here. I love the idea of 'robots' performing Shakespeare; artificially producing speech of such quality. And I'm not really concerned about naturalness. In fact, given the opportunity, I'd like to do some things that humans cannot. Sounding exactly like a human isn't the goal, though I wouldn't want the speech to sound so alien that it detracts from the play. Having the range, precision, and flexibility of a human is what I'm mostly after.
  9. Sep 8, 2005 #8
    The possible irrelevance of differences between hard- and soft- simulation

    If mechanical speech were that easy, I would think that it would be possible to teach dogs and chimps -- since they can demonstrably understand human speech and are similar to us in mouth/tongue/face morphology -- to speak to us in whispers. Our mouth/tongue/face muscles must be quite intricate and difficult to coordinate.

    Melanie McGee from the wear-hard list has been using a Skeletor skeleton to produce pretend-mechanical speaking (Skeletor moves his mouth and head in rhythm with synthesized speech):
    http://www.melmcgee.com/about.php [Broken]

    Apparently her site is being refurbished and old content is not yet back up. I'm sure she'll have Skeletor back up pretty soon.

    I have been pondering legitimate mechanical speech myself for a few years now. It seems to me to be a difficult nut to crack, but still I ponder it now and then.

    Back to pure digital, I would look at simulating the entire upper body with the possible exception of the circulatory system. At the speed of today's computers you might have to wait days or weeks or months for renders of short sections of speech -- just as we have to wait that long for renders of complex POV-Ray and motion-picture CGI -- but you might end up with something intricate and indistinguishable-from-meatworld-organic (because it would be organic). According to AI pioneer John McCarthy, everything out here in the world can be simulated in a computer at the very least in non-real-time -- and this is true no matter how slow the computer runs and no matter the computing architecture.

    Last edited by a moderator: May 2, 2017
  10. Sep 8, 2005 #9



    Well, I have a lot to learn, but maybe text-to-speech is not where I should be looking. I think I could learn how to create the speech; I wouldn't need to feed it text. Maybe I should start learning more about speech waveforms and spectrograms. Maybe I could just create spectrograms and convert those into speech.
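
    One simple way to turn a drawn spectrogram into sound is additive resynthesis: treat each row of the spectrogram as a sine wave at that frequency and let the drawn intensity set its amplitude over time. The Python below is a minimal sketch of that idea, not a description of how any existing tool works. The function name, the 10 ms frame length, and the 16 kHz sample rate are assumptions. Note also that a spectrogram of this kind records only magnitudes, so the sketch simply lets each sine's phase run continuously.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)


def spectrogram_to_samples(frames, bin_freqs_hz, frame_s=0.01):
    """Resynthesize audio from a magnitude spectrogram.

    frames       -- list of time frames; each is a list of amplitudes,
                    one per entry in bin_freqs_hz
    bin_freqs_hz -- frequency in Hz of each spectrogram row
    frame_s      -- duration of one frame in seconds
    """
    hop = round(frame_s * SAMPLE_RATE)
    steps = [2 * math.pi * f / SAMPLE_RATE for f in bin_freqs_hz]
    phases = [0.0] * len(bin_freqs_hz)
    out = []
    for k, frame in enumerate(frames):
        # Interpolate toward the next frame to avoid clicks at boundaries.
        nxt = frames[k + 1] if k + 1 < len(frames) else frame
        for i in range(hop):
            w = i / hop
            total = 0.0
            for b, step in enumerate(steps):
                amp = (1.0 - w) * frame[b] + w * nxt[b]
                total += amp * math.sin(phases[b])
                phases[b] += step
            out.append(total)
    return out


# Hypothetical two-row spectrogram: energy glides from 200 Hz to 400 Hz,
# then fades out.
rows_hz = [200.0, 400.0]
drawn = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
audio = spectrogram_to_samples(drawn, rows_hz)
```

    A real spectrogram of speech would need many more rows and frames than this toy example, so drawing one by hand for a whole play would be a great deal of work.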
  11. Sep 9, 2005 #10
    Peter Jackson comments on Rose's idea


    Director Peter Jackson happened to comment on this very subject in this week's installment of the King Kong production diary:


    He says, "Of course, digital doubles will never replace actors. I mean, that's the big fear and I think it's a lot of old nonsense really when people say, 'Oh well, you know, we won't need actors any more -- we've got digital people.' But, you know, digital people don't have hearts and souls and they can't provide everything that an actor can provide in a performance."
    Last edited: Sep 9, 2005
  12. Sep 10, 2005 #11



    The end result is just sound waves or images on a screen. I can't imagine how the source of the sound or image would matter. The 'heart and soul' would just go into making the sounds and images. If I decide to do this, I expect to do a lot of acting and directing in creating the speech.
  13. Sep 25, 2005 #12
    Last edited by a moderator: May 2, 2017