This open text-to-speech model needs just seconds of audio to clone your voice

#### [AI + ML](/software/ai_ml/)

El Reg shows you how to run Zyphra’s speech-replicating AI on your own box
--------------------------------------------------------------------------

[Tobias Mann](/Author/Tobias-Mann 'Read more by this author') Sun 16 Feb 2025 // 18:58 UTC

**Hands on** Palo Alto-based AI startup Zyphra unveiled a pair of open text-to-speech (TTS) models this week that are said to be capable of cloning your voice with as little as five seconds of sample audio.
In our testing, we generated realistic results with less than half a minute of recorded speech.

Founded in 2021 by Danny Martinelli and Krithik Puthalath, the startup aims to build a multimodal agent system called MaiaOS. To date, these efforts have produced its Zamba family of small language models, optimizations such as tree attention, and now its Zonos TTS models.

At 1.6 billion parameters each, the models were trained on more than 200,000 hours of speech data, which includes both neutral-toned speech, such as audiobook narration, and ‘highly expressive’ speech. According to the upstart’s [release notes](https://www.zyphra.com/post/beta-release-of-zonos-v0-1) for Zonos, the majority of its data was in English, but there were ‘substantial’ quantities of Chinese, Japanese, French, Spanish, and German. Zyphra tells *El Reg* this data was acquired from the web and was not obtained from data brokers.

The result is actually two Zonos models: one that uses a fully transformer-based architecture, and another, a hybrid that combines transformer and [Mamba](https://github.com/state-spaces/mamba) state space model (SSM) architectures. The latter, Zyphra claims, is the first TTS model to use this architecture. While transformer-based models are without a doubt the most commonly used in generative AI today, alternative architectures like Mamba are gaining traction.
From a practical standpoint, both models behave similarly to other text-to-speech models. But unlike ElevenLabs and others, Zyphra has elected to release its model weights on [Hugging Face](https://huggingface.co/Zyphra/Zonos-v0.1-hybrid) under a permissive Apache 2.0 license.

### Testing it out

Zyphra offers a demo environment where you can play with its Zonos models, along with paid API access and subscription plans on its website. But if you’re hesitant to upload your voice to a random startup’s servers, getting the model running locally is relatively easy.

We’ll go into more detail on how to set that up in a bit, but first, let’s take a look at how well it actually works in the wild.

To test it out, we spun up Zyphra’s Zonos demo locally on an Nvidia RTX 6000 Ada Generation graphics card. We then uploaded 20- to 30-second clips of ourselves reading a random passage of text, and fed them into the Zonos-v0.1 transformer and hybrid models along with a text prompt of 50 or so words, leaving all hyperparameters at their defaults. The goal is to have the trained model predict your voice from the provided sample recording and prompt, and output the result as an audio file.

Using a 24-second sample clip, we were able to achieve a voice clone good enough to fool close friends and family, at least at first blush.
After we revealed that the clip was AI generated, they did note that the pacing and speed of the speech felt a little off, and that they believed they would have caught on to the fact the audio wasn’t authentic given a longer clip.

Listen for yourself: here are two clips. The first sample is a recording of a real-life human, your humble vulture, reading from H.G. Wells’ *The Time Machine*, while the second is an AI-generated clone reading from Jules Verne’s *20,000 Leagues Under the Sea*.

**Human sample:** [audio clip]

**AI generated audio using the non-hybrid model:** [audio clip]

Both pacing and speed are parameters that can be controlled, and Zonos supports audio prefixing, which allows for more dynamic ranges such as whispering.

In its documentation, Zyphra claims its hybrid transformer-Mamba model performed about 20 percent faster than the pure transformer model. This speed-up wasn’t as noticeable for shorter prompts, but we can say there was a notable difference in how the two models sounded.

At least to our ears, the hybrid model generated slightly more polished-sounding audio, which ironically took away somewhat from the authenticity of the cloned voice. Listening to yourself talk is always kind of a strange experience, however, so we’ll let you be the judge.

**AI generated audio using the hybrid model:** [audio clip]

The model’s performance was also in line with Zyphra’s claim that it produces about two seconds of audio for every second of runtime when running on an RTX 4090. The RTX 6000 Ada, which isn’t too far off from an RTX 4090 in terms of compute, required 9 to 10 seconds to convert roughly 50 words into an 18- to 20-second audio clip.
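To see how those timings line up with the two-seconds-per-second claim, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope check using the figures quoted above; the `realtime_factor` helper is our own name for illustration and is not part of Zonos.

```python
def realtime_factor(audio_seconds: float, runtime_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock generation time."""
    if runtime_seconds <= 0:
        raise ValueError("runtime_seconds must be positive")
    return audio_seconds / runtime_seconds


# Figures from the RTX 6000 Ada run described above:
# an 18- to 20-second clip generated in 9 to 10 seconds.
slowest = realtime_factor(audio_seconds=18, runtime_seconds=10)
fastest = realtime_factor(audio_seconds=20, runtime_seconds=9)
print(f"real-time factor: {slowest:.1f}x to {fastest:.1f}x")
# prints "real-time factor: 1.8x to 2.2x"
```

That range brackets the roughly 2x figure Zyphra quotes for the RTX 4090.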
We will note that on the first run, we observed a warm-up period lasting about a minute while the model was loaded into GPU memory, so it won’t start outputting right off the bat.

### Try it for yourself

If you’d like to use Zonos to clone your own voice, deploying the model is relatively easy, assuming you’ve got a compatible GPU and some familiarity with Linux and containerization.

### What you’ll need:

* A Linux box with a reasonably modern Nvidia graphics card with at least 8 GB of vRAM. You may be able to get this running on as little as 6 GB, but your mileage may vary. For the operating system, we’re using Ubuntu 24.04 LTS.
* This guide also assumes you’ve installed the latest version of Docker Engine and the latest release of Nvidia’s Container Runtime. For more information on getting this set up, check out our guide on GPU-accelerated Docker containers [here](https://www.theregister.com/2024/07/07/containerize_ai_apps/). We also assume you’re comfortable with the Linux command line.

To get started, we’ll use `git` to pull down the Zonos repo:

```
git clone https://github.com/Zyphra/Zonos.git
```

From there, we’ll navigate into the folder and spin up the container using Docker Compose:

```
cd Zonos
docker compose up
```

Note: Depending on your system, you’ll probably need to run this `docker` command with elevated privileges using `sudo` or, in some cases, `doas`.

After a few seconds, you should be able to access the Gradio web GUI by navigating to `http://localhost:7860`. If you’re running this remotely, you’ll need to swap localhost for the machine’s IP address or hostname. We highly recommend you don’t leave this particular service facing the public internet.
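If the page doesn’t load, one quick way to tell whether the container is listening yet is to probe the port. The sketch below uses only Python’s standard library; the `port_is_open` helper is our own illustrative name, and 7860 is simply the port from the walkthrough above. It checks TCP reachability only and says nothing about whether the model itself has finished loading.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unresolvable hostnames.
        return False


if __name__ == "__main__":
    # 7860 is the Gradio port used in this guide; swap localhost for the
    # machine's IP address or hostname if the container runs remotely.
    host, port = "localhost", 7860
    state = "reachable" if port_is_open(host, port) else "not reachable (yet)"
    print(f"http://{host}:{port} is {state}")
```

Running the same check from a machine outside your network is also a handy way to confirm the service isn’t exposed to the public internet.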
![Zyphra’s Zonos demo comes packaged with an easy-to-use Gradio dashboard](https://regmedia.co.uk/2025/02/12/zonos_demo_dashboard.png?x=648&y=361&infer_y=1)

Zyphra’s Zonos demo comes packaged with an easy-to-use Gradio dashboard

From there, you’ll be greeted with a Gradio dashboard. Here you’ll want to select which version of the Zonos model you’d like to use, upload or record your sample audio, and input the text you’d like to convert.

Below this, you’ll find a variety of hyperparameters that allow you to tweak aspects of the generation, including things like pitch and speaking rate. We won’t pretend to fully understand all of these parameters, and in our testing, we largely left these settings at their defaults.

Once you’ve got everything dialed in, click on Generate Audio. Depending on your hardware and the length of your input text, this could take anywhere from a few seconds to minutes. Once complete, the clip should begin playing automatically.

* [AI summaries turn real news into nonsense, BBC finds](https://www.theregister.com/2025/02/12/bbc_ai_news_accuracy/)
* [DeepSeek or DeepFake? Our vultures circle China’s hottest AI](https://www.theregister.com/2025/02/01/deepseek_kettle_ai/)
* [AI agents? Yes, let’s automate all sorts of things that don’t actually need it](https://www.theregister.com/2025/01/27/ai_agents_automate_argument/)
* [Mental toll: Scale AI, Outlier sued by humans paid to steer AI away from our darkest depths](https://www.theregister.com/2025/01/24/scale_ai_outlier_sued_over/)

### Broader implications

As we’ve previously seen with image generation and other AI tech, the voice cloning capabilities presented by Zonos are inherently controversial, from where the training data was mined to how they’re actually used in practice.

Considering just how little sample audio is required to achieve a passable result, it’s easy to see how this technology could be abused.
Companies like Audible are [exploring](https://www.theverge.com/2024/9/9/24239903/amazon-audible-audiobook-narrators-ai-generated-voice-clones) text-to-speech AI to expand audiobook production, allowing narrators to create AI-generated voice clones of themselves. Meanwhile, [legal challenges](https://www.cbsnews.com/news/two-voice-actors-sue-ai-company-lovo/) surrounding AI voice cloning are already hitting similar businesses.

We can also see this technology being used to scam unsuspecting victims into believing that a loved one is in trouble, and that they just need a few hundred dollars’ worth of gift cards to get them out of a bind. Or to ruin someone’s career by using it to make an abusive call to their boss in their voice. Or to generate fake political messages, or… the examples are endless.

Having said that, there are also benevolent uses for these kinds of models. From an accessibility standpoint, voice cloning and text-to-speech could help someone who has suffered trauma to their vocal cords, or has a condition affecting speech, get their voice back. In fact, this is one of the reasons that Apple gave to [justify](https://machinelearning.apple.com/research/personal-voice) the inclusion of voice cloning tech in iOS in late 2023.

The fact that this technology is already widely available, whether on iDevices, through paid services, or as open source models, is why we’re even comfortable demonstrating how to deploy and run Zonos locally in the first place.

With that said, if you do choose to embrace AI text-to-voice capabilities, we encourage you to do so in the most respectful and responsible way possible. ®

***Editor’s Note:** *The Register* was provided an RTX 6000 Ada Generation graphics card by Nvidia, an Arc A770 GPU by Intel, and a Radeon Pro W7900 DS by AMD to support stories like this.
None of these vendors had any input as to the content of this or other articles.*
