NAT: One more time.
LO: One more time.
NAT: Hi, I'm Nat.
LO: And I'm Lo.
NAT: And this is Nat and Lo's 20% Project,
where we go around Google learning about all
the stuff we're curious about.
LO: And today, we're going behind the scenes
of the new Google voice.
NAT: Which is not Lo.
LO: Mm-mm.
So a few years ago, Google added Voice Search as one
of the ways you can search for something.
And if you've ever tried it out--
NAT: OK, Google.
How tall is an elephant?
GOOGLE VOICE: African elephant has the height of 11 feet.
NAT: You've probably noticed that computer
voice that talks back to you.
LO: We got word that there is a team at Google creating
a new voice for the Google App.
And we got really excited to learn, how do you actually
create a computer voice?
NAT: So we got into the recording studio
and met James, the voice coach, and Erica, a linguist,
and the new voice talent.
LO: So this is silly.
VOICE TALENT: Yeah, this is totally silly.
So you guys are on camera.
And I don't have to be seen.
NAT: We're invading your home.
VOICE TALENT: I love it.
NAT: When was the first text-to-speech made?
ERICA: The main thing that we think of as speech synthesis
mostly started in the 80s.
SPEAK & SPELL: Press Spell to begin.
NAT: Back then, they were using an approach that was completely
computer generated.
But now, how text-to-speech voices are typically made
is that a person records thousands and thousands
of sentences.
And then out of all those sentences
and all the sounds that they contain,
magically, a voice is made.
ERICA: When we do speech synthesis, what we're doing
is stringing together units of sound
to combine them in new ways to make new speech.
LO: Units of sound can be phonemes.
ERICA: The smallest unit of sound
that distinguishes meaning.
So from stack to slack, suddenly,
it's a completely different word.
You've changed the meaning just by swapping
one sound for another.
LO: Phones.
JAMES: I can think of eight kinds of T, eight flavors of T
in English.
Those are phones.
Top.
Top.
So you see the paper move when I said the T.
But what happens if I put an S before top?
Stop, stop.
NAT: And diphones.
ERICA: It's essentially the transition
from one sound to another.
JAMES: What is the sound of an S before W?
What's the sound of an S before an I?
What's the sound of a T at the beginning of a word?
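As a rough illustration of the diphone idea Erica and James describe, the sketch below splits a phoneme sequence into its sound-to-sound transitions. The phoneme symbols, the `SIL` silence marker for word edges, and the function name `to_diphones` are assumptions made for this example, not anything shown in the video.

```python
def to_diphones(phonemes, boundary="SIL"):
    """Return the transitions between adjacent sounds in a phoneme sequence.

    `boundary` is a hypothetical silence marker, so that a sound at the
    start or end of a word also gets its own transition.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    # Pair each sound with the one that follows it.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# "stop" written with illustrative phoneme labels
print(to_diphones(["S", "T", "AA", "P"]))
# ['SIL-S', 'S-T', 'T-AA', 'AA-P', 'P-SIL']
```

Under this labeling, the T at the start of "top" (`SIL-T`) and the T after an S in "stop" (`S-T`) come out as different units, which matches James's point that the same letter can call for different recorded sounds.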
LO: Erica and James told us that the bulk of these recordings
are for getting good diphone coverage
because these transitionary sounds are
really fundamental to how speech works.
So once you're done and happy with what you have, then
what happens?
ERICA: Then the recordings are stored in a database.
JAMES: Any content that hasn't been pre-recorded would
have to be stitched together.
ERICA: It's essentially a search problem.
It'll be like, OK, I need to put this sound with this sound
with this sound.
Let me look through the recordings,
and then it stitches them all together into smooth speech.
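The "search problem" Erica mentions can be sketched roughly as follows: for each sound transition that is needed, several recorded candidates may exist, and the search picks the sequence that joins most smoothly. This is a minimal sketch, not the team's actual system. The database layout, the clip names, and the use of pitch difference as the only joining cost are all assumptions for illustration.

```python
def select_units(targets, database):
    """Pick one recorded candidate per target unit, minimising pitch jumps at the joins.

    targets:  list of unit names, e.g. ["SIL-T", "T-AA", "AA-P"]
    database: dict mapping unit name -> list of (clip_id, pitch_hz) candidates
    (Both structures are hypothetical.)
    """
    if not targets:
        return []
    # best[i][j] = (cheapest total join cost ending at candidate j of target i,
    #               index of the candidate chosen for target i - 1)
    best = [[(0.0, None) for _ in database[targets[0]]]]
    for i in range(1, len(targets)):
        prev_cands = database[targets[i - 1]]
        row = []
        for _, pitch in database[targets[i]]:
            costs = [best[i - 1][k][0] + abs(pitch - prev_cands[k][1])
                     for k in range(len(prev_cands))]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_min], k_min))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chosen = []
    for i in range(len(targets) - 1, -1, -1):
        chosen.append(database[targets[i]][j][0])
        j = best[i][j][1]
    return chosen[::-1]


database = {
    "SIL-T": [("clip_01", 120.0), ("clip_02", 180.0)],
    "T-AA": [("clip_03", 175.0), ("clip_04", 110.0)],
    "AA-P": [("clip_05", 170.0)],
}
print(select_units(["SIL-T", "T-AA", "AA-P"], database))
# ['clip_02', 'clip_03', 'clip_05']
```

In this toy example the search prefers the clips whose pitches sit close together, so the joins are less audible. A production system would presumably weigh many more properties than pitch alone.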
LO: Synthesized speech can be great at taking
text and turning it into speech on the fly.
But it can end up sounding a bit robotic.
NAT: So probably one of the biggest reasons
that the team is creating a new voice
is to do more detailed recordings that will add more
nuance to how the voice sounds.
VOICE TALENT: A typical grilled cheese sandwich
made with two slices of white bread,
margarine, and 26 grams of processed American or cheddar
cheese contains 291 calories.
JAMES: I think my favorite part is sandwich.
ERICA: Really? Yeah, the way it says--
JAMES: Play "sandwich" again.
VOICE TALENT: A typical grilled cheese sandwich made with--
JAMES: Sandwich.
ERICA: Sandwich.
JAMES: Yeah, sandwich.
I love that.
ERICA: So a big problem in speech synthesis
right now is prosody and intonation.
NAT: Which means the way the speech sounds,
its rise and fall, its rhythm and melody.
JAMES: Let me give an example from just regular language.
What's your name?
LO: Lorraine.
JAMES: What's your name?
LO: Lorraine.
JAMES: So the first time, my melody went down.
When it was a request for repetition
or for clarification, my voice went up.
We've never been able to do this before.
NAT: To try to add all these different layers of melody
and stress and intonation and whatever other nuances
that we have when we speak, the team has been doing
these recordings that are almost mini-plays, where
James, the voice coach, acts as a user.
And then the voice talent can respond back in different ways,
depending on different situations.
JAMES: OK Google.
I need to send a text.
VOICE TALENT: Who do you want to text?
JAMES: Madison.
VOICE TALENT: Madison, sure.
Mobile or work?
JAMES: Mobile.
VOICE TALENT: What's the message?
JAMES: So the user just dictated the message.
VOICE TALENT: Got it.
Here's your message.
Do you want to send it?
JAMES: Actually, no.
VOICE TALENT: Do you want to change the message?
Do you want to change the message?
JAMES: That was good.
That was lower.
That was the right pitch.
NAT: So the new voice will actually
be a combination of these whole-chunk
recordings along with splicing together units of speech.
LO: We wanted to understand how it feels to actually
be the person who's expected to be the new computer voice.
So we asked the voice talent if she had any tips for us.
VOICE TALENT: It takes a lot more energy and air
than you think to keep the voice going
and to keep it energetic and clear.
LO: So I have to learn in the next few minutes
to speak from my diaphragm.
VOICE TALENT: Just speak from your diaphragm.
It takes a lot of energy.
JAMES: OK Google.
I need to put something on my calendar.
LO: When's the event?
JAMES: Tuesday.
LO: At what time?
JAMES: It sounds a little bit tense.
And you're starting a little bit high.
Think of it as a smooth figure skater on ice.
LO: At what time?
JAMES: OK, a little lower.
LO: At what time?
NAT: Sorry, what's the day and time?
JAMES: Now, let's try one like this.
Sorry, what's the day and time?
NAT: Sorry, what's the day and time?
JAMES: Oh, that was really good.
LO: Happy with it, guys?
JAMES: I liked it.
It's a wrap.
LO: That's a rap?
NAT: That's a rap.
LO: So it was great jumping in the room,
even to just see this little snippet of what
goes into recording a voice because it really
made us realize not just, of course,
how complicated language is, but how much more
goes into language than just the words.
I mean, hearing the different ways
you could deliver something and realizing
what a difference that actually makes in terms of somebody's
understanding in the context.
It's very humbling.
I mean, it's not something we walk around
in our day thinking about because it's not something
we have to think about.
And yet, if you're going to make a computer voice,
it makes a big difference.
VOICE TALENT: Thank you for watching
Nat and Lo's 20% Project.
No, wait.
Let's try that again.
OK.
Thank you for watching Nat and Lo's 20% Project.
NAT: Bravo.
I liked that.
Thank you.
ERICA: 600 feet, turn right.
LO: What time is it?
[BEATBOXING AND SINGING]
LO: That's actually really good!
[BEATBOXING AND SINGING CONTINUES]
JAMES: Step away from the mike.
Step away from the mike.
LO: [GROWLING]