You may not think of it this way, but you probably hear AI voices all the time. When you’re talking to Alexa or Siri, that’s a model trained on human speech to be able to say almost anything. Did you ever have a celebrity give you directions on Waze? AI. And every time you watch TikTok and you hear that slightly too chipper voice speaking the captions aloud, that’s AI all the way down. Heck, Apple’s AI will even read you a romance novel before you go to bed.
AI systems are getting good at turning text into believable speech in almost any language and almost any voice. And on this episode of The Vergecast, the first in our three-part miniseries on AI, that voice is mine. We trained a bunch of different AI bots on the sound of my voice (sometimes by reading scripts full of nonsense sentences, sometimes by uploading hours of existing audio from old Vergecast episodes, sometimes a bit of each) to see how well, and how quickly, we could make a passable AI copy of me.
It was... pretty wild. Here’s the episode:
And if you want a quick comparison of the different tools, first, here’s the reference speech we used from the great Dwight Schrute:
We transcribed that text and fed it into every AI generator we tested. Here’s how Podcastle interpreted it in the voice of AI David Pierce:
Here’s what Descript did with the same thing:
And the new Personal Voice feature in iOS 17:
And finally, ElevenLabs, easily the most realistic and impressive of the tools we tested:
Ultimately, I don’t think any of the AI voices are going to replace me. But they’re getting better really fast, and they raise both huge possibilities and huge questions. What does it mean that I can create a replica this good and that replicas like it are only going to get better and easier to make over time? What responsibilities do I have as the person who made it? What responsibilities do other people have?
We’re having a lot of debates over AI music right now, obviously, as artists’ voices are being used to train models that can make pretty convincing songs in just about anyone’s voice. That’s going to spawn a decade of interesting court cases and ethical debates, but those same questions are coming for regular people like you and me, too. How do we use these tools? How do we talk about them? Is it even possible to get the good, helpful, democratizing things from them without the deepfakes and the other problems that come along with them? We’ve got a lot to figure out and no time to lose. Because the tech is really good right now, and it’s getting better really fast.