
This 6-Million-Dollar AI Changes Accents as You Speak

Three international Stanford undergrads start company to “help the world understand”

In 2020, Stanford students Shawn Zhang, Maxim Serebryakov, and Andres Perez Soderi [left to right] founded the AI-powered accent-translation company Sanas.

Sanas

Stanford University prides itself on its international diversity, touting that today's undergraduates hail from 70 countries. So a friend-group that included a computer science major from China, an AI-focused management science and engineering (MSE) major from Russia, and a business-oriented MSE major from Venezuela isn't an anomaly. The friends did the normal things Stanford students do with their free time, like fountain hopping, cheering at football games, and hiking the trail around the Stanford Dish radio telescope.

And then came the pandemic.

"Stanford went virtual," Andres Perez Soderi recalls. (He's the member of the trio from Venezuela.) "And we scattered around the Bay Area, to San Francisco, Pleasanton, as well as Palo Alto, and we were keeping in touch online. School just isn't fulfilling when you aren't physically there, and we had a lot of time on our hands."

They also had an idea, sparked by a conversation with another friend, a computer science major who had gone back home to Guatemala, where he had taken a job doing tech support at a call center to help support his family.

“We knew from our own experience that forcing a different accent on yourself is uncomfortable. … We thought if we could allow software to translate the accent [instead], we could let people speak naturally.”
—Andres Perez Soderi, Sanas

"When he got the job," Soderi said, "we told him that he'd be the best tech support person they'd ever had; he's the smartest guy we've met and always had a smile on his face."

But the job didn't last—his customer satisfaction numbers were too low, because callers struggled to understand his accent and would lash out in frustration.

Given that the three spoke English with vastly different accents, the problem hit home.

"We decided to help the world understand and be understood," Soderi said.

They dedicated their empty pandemic hours to building a solution.

"We did a lot of research around what people have done in the past. People have done voice conversion for deep fakes, and that technology is pretty advanced. But there's been little done in accent translation. So, say, if I used an existing system to make me sound like Batman, I would sound like a Chinese-accented Batman," says Shawn Zhang, the trio's member from China.

"We knew about accent-reduction therapy and being taught to emulate the way someone else speaks in order to connect with them. And we knew from our own experience that forcing a different accent on yourself is uncomfortable. I went to a British high school and tried to force a British accent; it was an experience that was hard to digest. We thought if we could allow software to translate the accent [instead], we could let people speak naturally," says Soderi.

"Our first approach was naïve," Zhang says. "We built a system that converted speech to text and then text to speech." That wasn't going to be particularly useful for real-time conversation, their ultimate goal. So they began thinking about how to structure data to use in training a neural network to convert accents directly, speech to speech. They reached out to professors at Stanford and experts in industry to advise them.
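That first cascade is easy to sketch, and the sketch shows why it falls short. This is a toy illustration, not Sanas's code: `recognize` and `synthesize` are stand-ins for real speech-to-text and text-to-speech engines, and the data format is invented for the example.

```python
# Toy sketch of the "naive" cascade Zhang describes: speech -> text -> speech.
# Audio is faked as a list of (word, samples) frames so the example is runnable.

def recognize(audio_frames):
    """Stand-in for a speech-to-text engine: returns a transcript."""
    # A real engine would decode the waveform; here we just read the labels.
    return " ".join(word for word, _samples in audio_frames)

def synthesize(text, accent="american"):
    """Stand-in for a text-to-speech engine rendering a target accent."""
    return [(word, f"<{accent} audio>") for word in text.split()]

def naive_accent_conversion(audio_frames, target_accent):
    # Only the transcript crosses between the two stages, so prosody,
    # timing, and speaker identity are discarded, and each engine's
    # latency adds up -- a poor fit for real-time conversation.
    transcript = recognize(audio_frames)
    return synthesize(transcript, accent=target_accent)
```

Because text is the only thing passed between the stages, everything that isn't words is lost, which is what pushed the team toward converting accents directly, speech to speech.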

And they filed the paperwork to incorporate as a company—Sanas. (Incorporating is not an unusual step when Stanford undergrads start tinkering with anything.)

The name came from a hunt through random syllables, looking for something that sounded good and was available to use. Sanas jumped out because it is a palindrome—and it turned out to refer to whispers or sounds in some forms of ancient Latin. They assigned the CTO title to Zhang, CFO to Soderi, and CEO to Maxim Serebryakov.

That all happened in the first half of 2020, and things have continued to move quickly. Sanas now has a full-time engineering staff of 14, including the founders, and three more part-time developers, plus two employees working on the business side. All now work remotely, spread out internationally. The company completed a seed funding round of US $5.5 million in late May, a few months shy of Zhang's twenty-first birthday, bringing total investment to about $6 million.

Baris Akis, the president and co-founder of Human Capital, who led the seed round, stated at the time: "As an immigrant from Turkey, I've always felt that getting rid of the accent barrier was a critical next step for a more fair and prosperous world."

Today, Sanas has an algorithm that can shift English to and from American, Australian, British, Filipino, and Spanish accents. They developed it using a neural network, trained with recordings made, for the most part, by professional voice actors.

Says Zhang, "You aren't just doing audio signal processing, changing the pitch and tone. You have to change the phonetics. So we really needed parallel data sets, created by readers using the same source material, so the neural network could learn to map from one to the other, examining both to learn how to transform the pronunciation."
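A minimal sketch of how such parallel recordings might be matched up, assuming each utterance is a file named by a shared line ID from the common script; the naming scheme is an assumption for illustration, not Sanas's actual data format.

```python
# Pair recordings of the same script line read by two voice actors,
# one per accent, so a network can learn the source -> target mapping.
# The shared-line-ID file naming is a hypothetical convention.
from pathlib import PurePath

def parallel_pairs(source_files, target_files):
    """Return (source, target) paths whose file stems (line IDs) match."""
    sources = {PurePath(f).stem: f for f in source_files}
    targets = {PurePath(f).stem: f for f in target_files}
    # Only lines recorded by BOTH actors form a usable training pair.
    return [(sources[k], targets[k])
            for k in sorted(sources.keys() & targets.keys())]
```

Lines read by only one actor are dropped, since the network needs both renditions of the same words to learn how pronunciation transforms between accents.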

The algorithm runs locally on a CPU (not in the cloud), with 150 milliseconds of delay, at the speech quality of telephone audio, working alongside communications apps like Zoom, Skype, and WhatsApp. A typical Zoom delay is about 50 milliseconds, bringing the total delay to about 200 milliseconds. Soderi indicated that generally anything below 300 to 350 milliseconds is imperceptible in audio communications, so users don't notice a lag. And the algorithm is efficient in terms of CPU usage.
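The latency budget in that paragraph is simple to check, taking the conservative 300 ms end of the perceptibility range Soderi cites:

```python
# Latency budget from the figures in the article (all in milliseconds).
ALGORITHM_MS = 150                # Sanas's local accent-conversion delay
ZOOM_MS = 50                      # typical Zoom transport delay
PERCEPTION_THRESHOLD_MS = 300     # low end of the 300-350 ms range

total_ms = ALGORITHM_MS + ZOOM_MS
headroom_ms = PERCEPTION_THRESHOLD_MS - total_ms
print(total_ms, headroom_ms)      # 200 ms total, 100 ms of headroom
```

Even against the stricter 300 ms threshold, the combined pipeline leaves roughly 100 ms of headroom, which is why users don't perceive a lag.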

But, Zhang admits, there's plenty of room for improvement. "We are trying to make [it] more clear, natural, and pleasant to hear; it's an ongoing process."

The team plans to add more accents within English and also to work with accents of other languages, including Spanish and French.

Their first customers will be among outsourcing companies, the kind hired to provide customer service and other telephone support functions. Seven such firms are currently piloting the system.

"But that's just our first use case," says Zhang, "because it is a measurable and controlled environment. We don't see ourselves as a call center company, we want to go into healthcare, entertainment, education, and other spaces. We want to develop this as a tool that helps people with human-to-human interaction, without hurting their cultural identities."
