The Google App's New Voice

NAT: One more time.

LO: One more time.

NAT: Hi, I'm Nat.

LO: And I'm Lo.

NAT: And this is Nat and Lo's 20% Project,

where we go around Google learning about all

the stuff we're curious about.

LO: And today, we're going behind the scenes

of the new Google voice.

NAT: Which is not Lo.

LO: Mm-mm.

So a few years ago, Google added Voice Search as one

of the ways you can search for something.

And if you've ever tried it out--

NAT: OK, Google.

How tall is an elephant?

GOOGLE VOICE: African elephant has the height of 11 feet.

NAT: You've probably noticed that computer

voice that talks back to you.

LO: We got word that there is a team at Google creating

a new voice for the Google App.

And we got really excited to learn, how do you actually

create a computer voice?

NAT: So we got into the recording studio

and met James, the voice coach, and Erica, a linguist,

and the new voice talent.

LO: So this is silly.

VOICE TALENT: Yeah, this is totally silly.

So you guys are on camera.

And I don't have to be seen.

NAT: We're invading your home.

VOICE TALENT: I love it.

NAT: When was the first text-to-speech made?

ERICA: The main thing that we think of as speech synthesis

mostly started in the 80s.

SPEAK & SPELL: Press Spell to begin.

NAT: Back then, they were using an approach that was completely

computer generated.

But now, how text-to-speech voices are typically made

is that a person records thousands and thousands

of sentences.

And then out of all those sentences

and all the sounds that they contain,

magically, a voice is made.

ERICA: When we do speech synthesis, what we're doing

is stringing together units of sound

to combine them in new ways to make new speech.

LO: Units of sound can be phonemes.

ERICA: The smallest unit of sound

that distinguishes meaning.

So from stack to slack, suddenly,

it's a completely different word.

You've changed the meaning just by swapping

one sound for another.

LO: Phones.

JAMES: I can think of eight kinds of T, eight flavors of T

in English.

Those are phones.



So you see the paper move when I said the T.

But what happens if I put an S before top?

Stop, stop.

NAT: And diphones.

ERICA: It's essentially the transition

from one sound to another.

JAMES: What is the sound of an S before W?

What's the sound of an S before an I?

What's the sound of a T at the beginning of a word?

LO: Erica and James told us that the bulk of these recordings

are for getting good diphone coverage

because these transitionary sounds are

really fundamental to how speech works.
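
[Editor's note: The idea of diphone coverage can be sketched in a few lines of Python. This is only an illustration, not the team's actual tooling. The phoneme labels, the "sil" silence marker, and the function names `diphones` and `coverage` are all assumptions made for the example.]

```python
def diphones(phonemes):
    """Return the sound-to-sound transitions in a phoneme sequence.

    The sequence is padded with a hypothetical "sil" (silence) marker so
    that word-initial and word-final sounds count as transitions too.
    """
    padded = ["sil"] + list(phonemes) + ["sil"]
    return list(zip(padded, padded[1:]))


def coverage(script):
    """Collect the distinct diphones covered by a recording script.

    `script` maps each prompt to its (assumed) phoneme sequence.
    """
    covered = set()
    for phonemes in script.values():
        covered.update(diphones(phonemes))
    return covered


# Toy recording script using the "top" / "stop" example from the video.
script = {
    "top": ["t", "aa", "p"],
    "stop": ["s", "t", "aa", "p"],
}

print(diphones(script["stop"]))
print(len(coverage(script)), "distinct diphones covered")
```

[In this toy script, "stop" adds the S-to-T transition and the silence-to-S transition that "top" alone does not cover, which is why a recording script needs many different sentences.]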

So once you're done and happy with what you have, then

what happens?

ERICA: Then the recordings are stored in a database.

JAMES: Any content that hasn't been pre-recorded would

have to be stitched together.

ERICA: It's essentially a search problem.

It'll be like, OK, I need to put this sound with this sound

with this sound.

Let me look through the recordings,

and then stitches them all together into smooth speech.
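
[Editor's note: The "search problem" Erica describes can be sketched as choosing one recorded unit per target sound so that neighboring units join smoothly. The sketch below is a simplified assumption, not Google's implementation. The database layout, the `start_pitch` and `end_pitch` fields, and the pitch-mismatch join cost are invented for illustration. Production systems typically weigh additional costs as well.]

```python
def select_units(targets, database):
    """Pick one recorded unit per target, minimizing total join cost.

    Uses dynamic programming: best[i][k] holds the cheapest cost of any
    path ending at candidate k for target i, plus a back-pointer.
    The join cost here is just the pitch mismatch at the seam (an assumption).
    """
    best = [[(0.0, None) for _ in database[targets[0]]]]
    for i in range(1, len(targets)):
        prev_units = database[targets[i - 1]]
        row = []
        for unit in database[targets[i]]:
            costs = [
                best[i - 1][j][0] + abs(prev["end_pitch"] - unit["start_pitch"])
                for j, prev in enumerate(prev_units)
            ]
            j_best = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[j_best], j_best))
        best.append(row)

    # Start from the cheapest final candidate and follow back-pointers.
    k = min(range(len(best[-1])), key=lambda idx: best[-1][idx][0])
    total = best[-1][k][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(database[targets[i]][k]["id"])
        k = best[i][k][1]
    return list(reversed(path)), total


# Hypothetical database: several recordings of each diphone in "stop".
targets = ["s-t", "t-aa", "aa-p"]
database = {
    "s-t": [
        {"id": "st_1", "start_pitch": 200, "end_pitch": 210},
        {"id": "st_2", "start_pitch": 180, "end_pitch": 185},
    ],
    "t-aa": [
        {"id": "taa_1", "start_pitch": 186, "end_pitch": 190},
        {"id": "taa_2", "start_pitch": 230, "end_pitch": 235},
    ],
    "aa-p": [
        {"id": "aap_1", "start_pitch": 192, "end_pitch": 170},
    ],
}

path, total = select_units(targets, database)
print(path, total)
```

[With this toy data, the search picks the recordings whose pitch lines up best at each seam, which is one way to think about stitching units into smooth speech.]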

LO: Synthesized speech can be great at taking

text and turning it into speech on the fly.

But it can end up sounding a bit robotic.

NAT: So probably one of the biggest reasons

that the team is creating a new voice

is to do more detailed recordings that will add more

nuance to how the voice sounds.

VOICE TALENT: A typical grilled cheese sandwich

made with two slices of white bread,

margarine, and 26 grams of processed American or cheddar

cheese contains 291 calories.

JAMES: I think my favorite part is sandwich.

ERICA: Really? Yeah, the way it says--

JAMES: Play "sandwich" again.

VOICE TALENT: A typical grilled cheese sandwich made with--

JAMES: Sandwich.

ERICA: Sandwich.

JAMES: Yeah, sandwich.

I love that.

ERICA: So a big problem in speech synthesis

right now is prosody and intonation.

NAT: Which means the way the speech sounds,

its rise and fall, its rhythm and melody.

JAMES: Let me give an example from just regular language.

What's your name?

LO: Lorraine.

JAMES: What's your name?

LO: Lorraine.

JAMES: So the first time, my melody went down.

When it was a request for repetition

or for clarification, my voice went up.

We've never been able to do this before.

NAT: To try to add all these different layers of melody

and stress and intonation and whatever other nuances

that we have when we speak, the team has been doing

these recordings that are almost mini-plays, where

James, the voice coach, acts as a user.

And then the voice talent can respond back in different ways,

depending on different situations.

JAMES: OK Google.

I need to send a text.

VOICE TALENT: Who do you want to text?

JAMES: Madison.

VOICE TALENT: Madison, sure.

Mobile or work?

JAMES: Mobile.

VOICE TALENT: What's the message?

JAMES: So the user just dictated the message.


VOICE TALENT: Here's your message.

Do you want to send it?

JAMES: Actually, no.

VOICE TALENT: Do you want to change the message?

Do you want to change the message?

JAMES: That was good.

That was lower.

That was the right pitch.

NAT: So the new voice will actually

be a combination of these whole chunk

recordings along with splicing together units of speech.

LO: We wanted to understand how it feels to actually

be the person who's expected to be the new computer voice.

So we asked the voice talent if she had any tips for us.

VOICE TALENT: It takes a lot more energy and air

than you think to keep the voice going

and to keep it energetic and clear.

LO: So I have to learn in the next few minutes

to speak from my diaphragm.

VOICE TALENT: Just speak from your diaphragm.

It takes a lot of energy.

JAMES: OK Google.

I need to put something on my calendar.

LO: When's the event?

JAMES: Tuesday.

LO: At what time?

JAMES: It sounds a little bit tense.

And you're starting a little bit high.

Think of it as a smooth figure skater on ice.

LO: At what time?

JAMES: OK, a little lower.

LO: At what time?

NAT: Sorry, what's the day and time?

JAMES: Now, let's try one like this.

Sorry, what's the day and time?

NAT: Sorry, what's the day and time?

JAMES: Oh, that was really good.

LO: Happy with it, guys?

JAMES: I liked it.

It's a wrap.

LO: That's a wrap?

NAT: That's a wrap.

LO: So it was great jumping in the room,

even to just see this little snippet of what

goes into recording a voice because it really

made us realize not just, of course,

how complicated language is, but how much more

goes into language than just the words.

I mean, hearing the different ways

you could deliver something and realizing

what a difference that actually makes in terms of somebody's

understanding in the context.

It's very humbling.

I mean, it's not something we walk around

in our day thinking about because it's not something

we have to think about.

And yet, if you're going to make a computer voice,

it makes a big difference.

VOICE TALENT: Thank you for watching

Nat and Lo's 20% Project.

No, wait.

Let's try that again.


Thank you for watching Nat and Lo's 20% Project.

NAT: Bravo.

I liked that.

Thank you.

ERICA: 600 feet, turn right.

LO: What time is it?


LO: That's actually really good!


JAMES: Step away from the mike.

Step away from the mike.