New ‘Voice Engine’ from OpenAI Needs Only 15 Seconds to Clone Speech

OpenAI, the AI company behind dominant generative AI tool ChatGPT, has unveiled a new voice cloning technology it calls “Voice Engine.” This audio model can replicate a person’s voice, intonation, and other distinctly human speech patterns based on a relatively small sample of original audio.

“It is notable that a small model with a single 15-second sample can create emotive and realistic voices,” the company says in its Friday blog post.

For comparison, AI voice platform ElevenLabs features an instant voice cloning tool that requires samples of at least one minute. For best results, nearly 10 minutes of continuous speech is needed for its professional service level.

The company showed different examples of what this technology is capable of doing. In one example, the voice of a young patient who lost much of her ability to speak due to a vascular brain tumor was cloned using an older recording she made for a school project. This is how she sounds today, according to OpenAI.

OpenAI worked with Lifespan, a nonprofit affiliated with the medical school at Brown University, and with the creators of a tool called Livox, an “alternative communication app” built for people with disabilities. The team was able to work with a recording that the woman made for a school presentation.

OpenAI’s Voice Engine was then able to provide instant text-to-speech capability that would allow the patient to effectively speak with her own voice.

OpenAI also showcased how HeyGen is using its technology to translate speech uploaded in one language into natural-sounding speech in another.

The company says Voice Engine was first developed in late 2022 and is already being used to power the preset voices available in OpenAI’s text-to-speech API, as well as ChatGPT’s Voice and Read Aloud features. Despite the latest advancements, the company says it’s being cautious before a broader release.

“We hope to start a dialogue on the responsible deployment of synthetic voices and how society can adapt to these new capabilities,” OpenAI wrote, acknowledging the widely condemned practice of “deepfakes.” The voices of celebrities, government officials, and increasingly private citizens are being impersonated for nefarious purposes, from political campaigns and fake ads to outright criminal activities. U.S. President Joe Biden has been pushing for more safeguards against the malicious use of AI voice impersonations.

In fact, Meta disclosed last summer that its AI voice tool was being held back specifically because of the “potential risks of misuse.”

“In line with our approach to AI safety and our voluntary commitments, we are choosing to preview but not widely release this technology at this time,” OpenAI explained.

Even before public release, OpenAI is placing restrictions on Voice Engine—including a list of prominent people that it will not emulate.

“We believe that any broad deployment of synthetic voice technology should be accompanied by voice authentication experiences that verify that the original speaker is knowingly adding their voice to the service and a no-go voice list that detects and prevents the creation of voices that are too similar to prominent figures,” OpenAI wrote.

The partners testing Voice Engine today have agreed to OpenAI’s usage policies, which prohibit the impersonation of another individual or organization without consent. In addition, the company requires explicit and informed consent from the original speaker, and it does not allow developers to build ways for individual users to clone their own voices.

“Based on these conversations and the results of these small scale tests, we will make a more informed decision about whether and how to deploy this technology at scale,” the blog post reads.

In addition to Voice Engine, OpenAI is working on multiple projects in parallel. CEO Sam Altman revealed that the company is working on releasing GPT-5 this year. The company also showed off its generative video tool Sora. The company claims that Sora will be the most advanced video generator on the market, surpassing models like Pika, Stable Video Diffusion, and Runway ML.

Sora is currently only available to “red teamers” enlisted by OpenAI to make sure it cannot be abused.

Voice Engine could outperform other voice cloning tools, including offerings from Meta, ElevenLabs, WellSaid Labs, and open-source models like RVC.

OpenAI is also working on a secret project named Q*, of which only the name has leaked. Sam Altman has refused to give any details, but said the research team was heavily focused on finding techniques and approaches that make AI reason better.

Edited by Ryan Ozawa.