Talking robots - what's the latest?

  • #1
honestrosewater
Can you guys produce natural-sounding speech artificially? How do the systems work? Do you create the sounds using an artificial vocal tract (a robot with lips, teeth, tongue, pharynx, etc.) or just manipulate the sounds using some kind of software? Sorry, I don't really know how to ask the question. I want to make an audio recording of Hamlet without using human actors to play the parts; I would just type up the text with all of the phonetic information, give it to some kind of machine, and get human-like speech in return. Doable?

Oh, and what field(s) would this draw from? Acoustical phonetics, phonetical acoustics, bioacoustics, speech synthesis?

Okay, I found the basic info that I was looking for. It is speech synthesis. But I still would like to know how good they really can be; how natural, flexible, what kind of range, etc. My text would be very detailed. Also, as this is a real project, I will appreciate any suggestions for how I could make it happen. How do I get my hands on one of those speech synthesizers? :biggrin:
 
  • #2
The wear-hard guys are all over this. You should ask on the wear-hard mailing list.
http://wearables.blu.org

AT&T's voice synthesis research is supposed to be pretty advanced.
http://www.research.att.com/projects/tts/demo.html

Nice list of programs, here.
http://aldostools.mysite4now.com/msagent.html

The very best programs do give you control over how the words are pronounced. One thing you might consider: if speech could be synthesized to match the best voice actors, why would animation filmmakers persist in hiring them?
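
For a quick taste of what the entry-level programs do, here is a minimal "type text in, get speech out" sketch. It uses the pyttsx3 Python package purely as a stand-in for whichever engine you end up picking; it drives whatever speech engine the operating system already has installed.

```python
# Minimal text-to-speech proof of concept: hand the engine a string,
# get spoken audio back. pyttsx3 wraps the OS's installed TTS engine,
# so voice quality depends entirely on what is installed locally.
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 150)  # speaking rate, roughly words per minute

engine.say("O, that this too too solid flesh would melt.")
engine.runAndWait()              # block until the utterance finishes
```

Nowhere near an actor, but the plumbing does exist.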
 
  • #3
Cool, thanks. The synthesizers I listened to this morning were nowhere near what I need, but they weren't designed for what I need, so I'm still hopeful.
 
  • #4
I think you might want to take a closer look at the AT&T system. The demo doesn't show off what you can do with inflection characters:
http://www.research.att.com/news/2001/October/NaturalVoices.html

A companion offering, AT&T Labs Natural Voices Fonts, provides a fast and effective way to offer customers choice in voices. The fonts let companies select voice sounds, tones, accents and inflections that best meet their needs.
 
  • #5
I have found the AT&T TTS documentation. It looks like it might be able to do what you want:
http://www.naturalvoices.com/support/documentation.html

The TTS engines support the Java Speech Markup Language, the Speech Synthesis Markup Language component of the VoiceXML standard, the Microsoft SAPI 4.0 markup language, and the Microsoft SAPI 5.1 markup language. These markup languages allow client applications to include special instructions within the input text that may change the default behavior of the text synthesizer.

By the way, did you listen to the demos of the UK voices? I thought those sounded like they might make a pretty good basis for Shakespeare. Since there is only one UK male and one UK female voice, I think I would try multiplying those by changing the default pitches.
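
To make the markup idea concrete, here is a sketch of what that could look like. This is generic SSML built up in Python, not AT&T's actual API, and the voice name "uk_male_1" is a made-up placeholder; the point is that one voice plus different baseline prosody settings gives you several distinguishable characters.

```python
# Sketch: derive two "characters" from one synthetic voice by wrapping
# the same line in SSML <prosody> tags with different baseline pitch
# and rate. How faithfully an engine honors <prosody> varies by engine.
LINE = "To be, or not to be: that is the question."

def as_character(text: str, pitch: str, rate: str) -> str:
    """Return SSML that shifts baseline pitch and speaking rate."""
    return (
        '<speak><voice name="uk_male_1">'          # placeholder voice name
        f'<prosody pitch="{pitch}" rate="{rate}">{text}</prosody>'
        "</voice></speak>"
    )

hamlet = as_character(LINE, pitch="-15%", rate="95%")    # darker, slower
horatio = as_character(LINE, pitch="+10%", rate="105%")  # lighter, quicker

print(hamlet)
print(horatio)
```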
 
  • #6
Yes, I'll take a look at them. I'm just not sure the concatenative technologies, the ones that manipulate recordings of human speech, are worth it. For instance, I might want to make the voice tremble at times - how do you add that to a recording? I would need fine control over loudness, duration, and pitch, from whole utterances down to each phone (relative loudness, duration, and pitch too), and the same fine control over the pronunciation of each phone. A single word could end up having dozens of variations.

Untouched recordings are most natural, but you need a lot of them. The more you can manipulate the recordings, the fewer you need, but you lose out on the naturalness, which was the whole point of starting with human recordings. You see what I mean? It seems that making the recordings needed for manipulation would be more work than just having human actors read the text. And if I'd have to do so much manipulation, I may as well just produce the sounds artificially to begin with.
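
To put that worry in concrete terms, here is roughly what manipulating a recording looks like with off-the-shelf signal tools (the librosa Python package here; "line_001.wav" is a placeholder for one recorded line). Global shifts are easy, but something like a tremble means chopping the utterance up and bending the pitch piecewise, and every seam costs naturalness.

```python
# Sketch: global and piecewise pitch/duration manipulation of one
# recording. The piecewise "tremble" is deliberately crude; the
# discontinuities at chunk boundaries are exactly the artifacts
# that erode the naturalness you started with.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("line_001.wav", sr=None)  # placeholder filename

# Easy, global transforms: up two semitones, 10% slower.
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2.0)
slowed = librosa.effects.time_stretch(shifted, rate=0.9)

# Crude tremble: bend the pitch a little, chunk by chunk.
chunks = np.array_split(slowed, 20)
trembled = np.concatenate([
    librosa.effects.pitch_shift(c, sr=sr, n_steps=0.5 * np.sin(i))
    for i, c in enumerate(chunks)
])

sf.write("line_001_trembled.wav", trembled, sr)
```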

Formant synthesis, creating the speech in an acoustic model, may be able to give me the variety that I need, but I would probably be sacrificing naturalness, i.e., it would have the range but sound like a machine. This isn't necessarily a bad thing. It would depend on how strange the speech sounded.
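
For a sense of what "creating the speech in an acoustic model" means, here is a bare-bones source-filter sketch in Python: a pulse train stands in for the glottis, and a cascade of second-order resonators tuned to textbook formant values for the vowel /a/ stands in for the vocal tract. A real formant synthesizer layers dozens of time-varying parameters on top of this, which is where the flexibility comes from.

```python
# Bare-bones formant synthesis: glottal-like impulse train (source)
# filtered through resonators at vowel formant frequencies (filter).
import numpy as np
from scipy.signal import lfilter
import soundfile as sf

fs = 16000                     # sample rate, Hz
f0 = 110                       # fundamental frequency (pitch), Hz
n = fs * 1                     # one second of audio

# Source: impulse train at f0, a crude stand-in for glottal pulses.
source = np.zeros(n)
source[:: fs // f0] = 1.0

def resonator(x, freq, bandwidth, fs):
    """Two-pole resonator centered on `freq` with the given bandwidth."""
    r = np.exp(-np.pi * bandwidth / fs)
    theta = 2 * np.pi * freq / fs
    a = [1.0, -2.0 * r * np.cos(theta), r * r]  # poles
    b = [1.0 - r]                               # rough gain normalization
    return lfilter(b, a, x)

# Filter: cascade the first three formants of /a/ (about 730, 1090, 2440 Hz).
y = source
for freq, bw in [(730, 90), (1090, 110), (2440, 170)]:
    y = resonator(y, freq, bw, fs)

sf.write("vowel_a.wav", y / np.abs(y).max(), fs)  # normalize and save
```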

A mechanical synthesizer sounds like the best option, but there doesn't seem to be nearly as much work going on in this area. And it might not be affordable either.
Anyway, I'll keep looking and learning. :smile:
 
  • #7
Oh, and cost isn't the motivation here. I love the idea of 'robots' performing Shakespeare; artificially producing speech of such quality. And I'm not really concerned about naturalness. In fact, given the opportunity, I'd like to do some things that humans cannot. Sounding exactly like a human isn't the goal, though I wouldn't want the speech to sound so alien that it detracts from the play. Having the range, precision, and flexibility of a human is what I'm mostly after.
 
  • #8
The possible irrelevance of differences between hard- and soft- simulation

If mechanical speech were that easy, I would think that it would be possible to teach dogs and chimps -- since they can demonstrably understand human speech and are similar to us in mouth/tongue/face morphology -- to speak to us in whispers. Our mouth/tongue/face muscles must be quite intricate and difficult to coordinate.

Melanie McGee from the wear-hard list has been using a Skeletor skeleton to produce pretend-mechanical speaking (Skeletor moves his mouth and head in rhythm with synthesized speech):
http://www.melmcgee.com/about.php

Apparently her site is being refurbished and old content is not yet back up. I'm sure she'll have Skeletor back up pretty soon.

I have been pondering legitimate mechanical speech myself for a few years now. It seems to me to be a difficult nut to crack, but still I ponder it now and then.

Back to pure digital, I would look at simulating the entire upper body, with the possible exception of the circulatory system. At the speed of today's computers you might have to wait days or weeks or months for renders of short sections of speech -- just as we have to wait that long for renders of complex POV-Ray and motion-picture CGI -- but you might end up with something intricate and indistinguishable-from-meatworld-organic (because it would be organic). According to AI pioneer John McCarthy, everything out here in the world can be simulated in a computer, at the very least in non-real-time -- and this is true no matter how slow the computer runs and no matter the computing architecture.
www-formal.stanford.edu/jmc/whatisai/node1.html

Q. Are computers the right kind of machine to be made intelligent?

A. Computers can be programmed to simulate any kind of machine.

Many researchers invented non-computer machines, hoping that they would be intelligent in different ways than computer programs could be. However, they usually simulate their invented machines on a computer and come to doubt that the new machine is worth building. Because of the many billions of dollars that have been spent in making computers faster and faster, another kind of machine would have to be very fast to perform better than a program on a computer simulating the machine.

Q. Are computers fast enough to be intelligent?

A. Some people think much faster computers are required as well as new ideas. My own opinion is that the computers of 30 years ago were fast enough if only we knew how to program them. Of course, quite apart from the ambitions of AI researchers, computers will keep getting faster.

Q. What about parallel machines?

A. Machines with many processors are much faster than single processors can be. Parallelism itself presents no advantages, and parallel machines are somewhat awkward to program. When extreme speed is required, it is necessary to face this awkwardness.
 
  • #9
Well, I have a lot to learn, but maybe text-to-speech is not where I should be looking. I think I could learn how to create the speech; I wouldn't need to feed it text. Maybe I should start learning more about speech waveforms and spectrograms. Maybe I could just create spectrograms and convert those into speech.
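
That inversion step is tractable: the Griffin-Lim algorithm iteratively estimates the phase that a magnitude spectrogram is missing, and the librosa Python package ships an implementation. A toy sketch, with a hand-built spectrogram (one slowly rising band) standing in for whatever I might actually draw:

```python
# Sketch: "paint" a magnitude spectrogram by hand, then recover a
# waveform from it with Griffin-Lim phase estimation.
import numpy as np
import librosa
import soundfile as sf

sr = 16000
n_fft = 1024
n_frames = 200
n_bins = n_fft // 2 + 1

# Paint a spectrogram: a quiet floor plus one slowly rising band.
S = np.full((n_bins, n_frames), 1e-3)
for t in range(n_frames):
    b = int(20 + 60 * t / n_frames)   # center bin drifts upward over time
    S[b - 2 : b + 3, t] = 1.0

# Recover a waveform whose magnitude spectrogram approximates S.
y = librosa.griffinlim(S, n_iter=60, hop_length=n_fft // 4, n_fft=n_fft)

sf.write("painted.wav", y / np.abs(y).max(), sr)
```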
 
  • #10
Peter Jackson comments on Rose's idea

Rose,

Director Peter Jackson happened to comment on this very subject in this week's installment of the King Kong production diary:

http://www.kongisking.net/index.shtml

He says, "Of course, digital doubles will never replace actors. I mean, that's the big fear and I think it's a lot of old nonsense really when people say, 'Oh well, you know, we won't need actors any more -- we've got digital people.' But, you know, digital people don't have hearts and souls and they can't provide everything that an actor can provide in a performance."
 
  • #11
The end result is just sound waves or images on a screen. I can't imagine how the source of the sound or image would matter. The 'heart and soul' would just go into making the sounds and images. If I decide to do this, I expect to do a lot of acting and directing in creating the speech.
 

1. What exactly are talking robots?

Talking robots are robots or machines designed and programmed to communicate with humans using spoken words. They have the ability to understand and generate human speech, making them appear more human-like and capable of interacting with humans in a more natural way.

2. How do talking robots work?

Talking robots work by using a combination of hardware and software components. They have a microphone to capture human speech, a speech recognition system to convert speech into text, and a text-to-speech system to convert the text into spoken words. They also have a database of words and phrases, along with algorithms, that allow them to understand and respond to human speech.
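
As a concrete illustration of that pipeline, here is a toy version in Python. It uses two commonly available packages, SpeechRecognition for speech-to-text and pyttsx3 for text-to-speech; the canned-reply dictionary stands in for the understanding and dialogue components a real talking robot would need.

```python
# Toy talking-robot loop: listen, transcribe, pick a reply, speak.
import speech_recognition as sr
import pyttsx3

REPLIES = {
    "hello": "Hello there.",
    "what time is it": "I do not have a clock, sadly.",
}

recognizer = sr.Recognizer()
engine = pyttsx3.init()

with sr.Microphone() as mic:
    recognizer.adjust_for_ambient_noise(mic)  # calibrate to room noise
    print("Listening...")
    audio = recognizer.listen(mic)

# Speech -> text (Google's free web recognizer; requires a network
# connection and may raise sr.UnknownValueError on unclear audio).
text = recognizer.recognize_google(audio).lower()

# "Understanding": the simplest possible stand-in, a dictionary lookup.
reply = REPLIES.get(text, f"You said: {text}")

# Text -> speech.
engine.say(reply)
engine.runAndWait()
```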

3. What is the latest technology used in talking robots?

The latest technology used in talking robots is artificial intelligence (AI). With AI, talking robots can learn and improve their ability to understand and respond to human speech. They can also adapt to different situations and contexts, making their conversations more natural and human-like.

4. How are talking robots being used in society?

Talking robots are being used in a variety of ways in society. Some are used as personal assistants, helping people with tasks such as scheduling and reminders. Others are used in customer service, providing assistance and answering questions. They are also used in education, healthcare, and even entertainment.

5. Are there any ethical concerns surrounding talking robots?

There are some ethical concerns surrounding talking robots, particularly in the areas of privacy and job displacement. As talking robots become more advanced and human-like, there are concerns about the potential for them to collect and share personal information without consent. There are also concerns about the potential for talking robots to replace human jobs, leading to unemployment. It is important for developers to consider these ethical issues and address them appropriately in the design and use of talking robots.
