Deepfakes, or media that takes a person in an existing image, audio recording, or video and replaces them with someone else’s likeness using AI, are multiplying quickly. In recognition of the threat that deepfakes pose to privacy, civility, and democratic processes, the U.S. Federal Trade Commission (FTC) hosted a workshop yesterday examining a subcategory of deepfakes known as voice cloning, or techniques that generate near-perfect reproductions of a person’s voice. Throughout a series of panel discussions and lectures, guest speakers including FTC commissioner Rohit Chopra, U.S. Department of Justice attorney Mona Sedky, Microsoft Defending Democracy tech and operations director Ashish Jaiman, and Defense Advanced Research Projects Agency (DARPA) science and engineering tech advisor Neil Johnson weighed in on various deepfake audio use cases and methods that might be used to combat them.
The mood was in equal parts grim and optimistic. Presenters anticipated that deepfakes would be used to perpetrate fraud and harassment, but that they might also be used to synthesize voices for those suffering from ALS and other health ailments. That said, all agreed that regulation, methods of detection, and public awareness will be fundamental in a world where AI produces voices indistinguishable from those of real people.
Crime and consent
Chopra noted that according to recent surveys, Americans are rapidly losing trust in technology and tech companies, and that the proliferation of deepfakes might only serve to deepen that mistrust. “Today, technology and data are weaponized by those who wish to do our country and our society harm. Privacy is clearly now a national security issue and a personal security issue,” he said. “Losing control of our own biometrics poses another level of peril … We’ll determine how to control this technology and keep it out of the wrong hands, and to protect our safety and security from the dangers of biometrics theft.”
Several panelists pointed out that “deepfaked” voices will be and already have been used without consent. Some of the most impressive examples to date, like those from Toronto-based research firm Dessa, Imperial College London, and Facebook, were created in the course of research or awareness campaigns. But cybersecurity firm Symantec said in September 2019 it had come across at least three cases of deepfake voice fraud, and Experian predicts that criminals this year will use AI to disrupt commercial enterprises’ operations and create “confusion” among nations.
Sedky pointed out that historically, nefarious actors have avoided communication-focused defrauding schemes for three reasons: They leave fingerprints, they’re costly, and they’re difficult. (A foreign national with an accent might struggle to convince a person they’re someone else, for instance.) Deepfake technologies have flipped this on its head because synthesized voices are relatively easy to produce and scale.
“There will be fraud-based … and harassment-focused criminals who will love this technology. Deepfake audio could be used in tandem with deepfake video to create a very fake and realistic, sexually explicit video that could then be used [for extortion and blackmail],” said Sedky. “[A criminal could] follow up a fake … phishing email with a phone call from somebody who sounds exactly like [a trusted] contact. [Or] somebody posing as the CEO of a company [could] make a bogus earnings call to manipulate the stock price [or] sabotage a competitor.”
Rebecca Damon, executive vice president at SAG-AFTRA, the U.S.-based union representing film and television actors, journalists, singers, and other media professionals, noted that the unauthorized use of a person’s likeness could harm livelihoods or irreparably damage reputations.
Damon offered this scenario: A broadcaster is on the eve of breaking a huge story when they discover audio of their voice saying something they didn’t say, ruining their credibility. “People [should] get to make an informed decision about how their voices are going to be used. It’s not acceptable for performers’ voices to be used in a way that would be inconsistent with their beliefs [or that] might be inconsistent with other agreements,” she said. “A lot of times, people get excited and they rush in with new technology, [but] they don’t necessarily think through all the applications. As we look through the implications for this kind of technology, we … believe it has to be done in a way with as many safeguards as possible.”
Tuesday’s seminar wasn’t a strictly doom-and-gloom affair. John Costello, the director of Boston Children’s Hospital’s augmentative communication program, which works with nonverbal patients and those whose speech is severely impaired, highlighted the potential of voice cloning technologies to give those with conditions like amyotrophic lateral sclerosis (ALS), Huntington’s disease, and autism the ability to speak naturally.
“Our voice is our acoustical fingerprint,” he said before proceeding to play footage from Alphabet’s DeepMind documenting the recreation of ALS advocate and former NFL player Tim Shaw’s voice. “Voice … [is a] marker of our personality — it’s important not only for the speaker, but for the communication partners [both human and animal]. And considering the home automation technologies that we’re increasingly relying on in our own homes, where we’re using voice to control, having a high-quality voice for a person who’s unable to speak is really essential in giving a level of independence that otherwise isn’t available to them.”
Rupal Patel, a professor at Northeastern University and the founder and CEO of VocalID, a startup developing a voice synthesis platform, noted that synthesis tech might even be used to create voices for those who lack them at birth. By crowdsourcing speech and capturing vocalization samples from people who aren’t able to speak normally, researchers at companies like VocalID can match voices with nonverbal people likeliest to sound similar.
“[VocalID is] now on generation four of our voice synthesis engine, which allows us to make the most highly replicable voice of [an] individual,” Patel explained. “[While we focus] mostly on individuals who are nonspeaking, [we also help] individuals who were about to lose their voice … Prior to them losing their voice, they bank their voice [with us] and we recreate it.”
On the flip side, physicians and social workers might use cloned voices to “touch more lives,” said Patel, perhaps with digital avatars embedded on the web and within apps. “Voice is … a trusted aspect of how [health workers] interact with their patients,” she added. “There’s various different [ways] this technology could be used in order to continue the relationship you have with a known individual.”
Patel pointed out that voice cloning technologies have commercial implications. Brand voices, such as Progressive’s Flo, who’s played by actress and comedian Stephanie Courtney, are often tasked with recording phone trees for interactive voice response (IVR) systems or e-learning scripts for corporate training videos. Synthesis could boost actors’ productivity by cutting down on ancillary recordings and pick-ups (recording sessions to address mistakes, changes, or additions in voiceover scripts) while freeing them up to pursue creative work and enabling them to collect residuals.
“There will always be with every new technology nefarious uses,” said Patel. “[One of the biggest safeguards] is awareness. People don’t know that this technology exists, and they don’t know how good it can be. I think we need to start educating people about that and understanding this technology and its mechanisms and how it can spread.”
Safeguards and mitigation
Preventing the malicious use of voices synthesized in the style of a particular person will require technological safeguards, according to Costello. He proposes an audio fingerprint layered atop generated voices to serve as a form of authentication.
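Costello did not describe how such a fingerprint would work. As a rough illustration of the authentication idea only, the Python sketch below swaps in a simpler, well-known mechanism: a keyed tag (HMAC-SHA256) that a synthesis vendor could compute over a generated clip and publish alongside it. This is an assumption for illustration, not the scheme proposed at the workshop. The function names, the key, and the stand-in audio bytes are all hypothetical, and unlike a watermark embedded in the audio itself, a tag like this would not survive re-encoding or re-recording.

```python
import hashlib
import hmac


def sign_audio(audio_bytes: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag computed over the raw audio bytes."""
    return hmac.new(key, audio_bytes, hashlib.sha256).hexdigest()


def verify_audio(audio_bytes: bytes, tag: str, key: bytes) -> bool:
    """Recompute the tag and compare in constant time.

    Any change to the audio bytes, or a wrong key, makes this return False.
    """
    expected = sign_audio(audio_bytes, key)
    return hmac.compare_digest(expected, tag)


# Hypothetical usage: a vendor tags a clip it generated.
key = b"vendor-secret-key"          # placeholder only, not a real key
clip = b"\x00\x01\x02\x03" * 1000   # stand-in for generated PCM samples
tag = sign_audio(clip, key)

print(verify_audio(clip, tag, key))            # True: clip is untouched
print(verify_audio(clip + b"\x00", tag, key))  # False: clip was altered
```

A tag like this only shows that a clip came unmodified from whoever holds the key. It says nothing about clips produced by tools that do not take part, which is why the detection approaches panelists described remain relevant.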
Detection solutions for cloned voices in the wild are beginning to emerge. Several months ago, startup Resemble released a tool that uses AI to detect deepfakes by deriving high-level representations of voice samples and predicting whether they’re real or generated. Separately, in January 2019, Google published a corpus of speech containing “thousands” of phrases spoken by the company’s text-to-speech models. And companies like ID R&D, whose CEO was in attendance at Tuesday’s event, employ algorithms that look at different features in a voice (like prosody, phase, frequency, and tone) to determine whether the voice is coming from a “reproduction device” (such as a loudspeaker) versus a human vocal tract.
Johnson says that researchers from the University at Albany, University of California Berkeley, SRI International, and others within DARPA’s SemaFor, an incubator for manipulated media detection technologies, are working on programs that can differentiate between real and synthetic speech even if speaker ID systems fail. Some take into account not only audio features but also signals like caller ID and compression levels.
“With the SemaFor program, we’re looking at cross-modalities. The intent is that we’re expecting disinformation and fake news to be around us forever. We’re going to have to live with it, but what we can do is ensure that we have trust in what we see here. What we want to do at DARPA is … [figure out how to] counter fake news and disinformation and … have an indicator to place some kind of trust value in the media that’s being produced and the stories that are being told,” Johnson said.
Siwei Lyu, a professor of computer science at the University at Albany, believes it’s possible that consumers will learn to catch deepfakes on their own after repeated exposure. There’s research in support of this: A study last year by MIT Media Lab and Max Planck Institute researchers found that people improve in their ability to detect fake photos when they get feedback.
“I think in five years, we will reach some kind of equivalent, where consumers will become more sophisticated after hearing a number of deepfake audio samples and develop some kind of immune system to this kind of thing,” said Lyu.
Perhaps hedging their bets, several social media networks have taken steps to prevent the spread of deepfakes — audio or otherwise. Twitter announced plans to implement a policy around media that’s been altered to mislead the public, saying it would remove that which “threaten[s] someone’s physical safety or lead[s] to other serious harm.” For its part, Facebook recently said it would delete any media from its platform that’s been modified “beyond adjustments for clarity or quality” in ways that “aren’t apparent to the average person.”
“Fighting … fake media and detecting fake voices will be a community effort,” said Patrick Traynor, preeminence chair in engineering at the University of Florida. “It [will require] the momentum of commercial companies and typical users. [If] everyone take[s] the due diligence, we can have this under control.”
Regulation and legislation are equally critical pieces of the voice-cloning puzzle, the seminar presenters agreed. Several of them noted that in September, members of Congress sent a letter to National Intelligence director Dan Coats asking for a report from intelligence agencies about the potential impact of deepfakes on democracy and national security, and that in October, California became the first U.S. state to criminalize the use of deepfakes in political campaign promotion and advertising.
Sedky said that in scenarios like fraud, deepfake usage is clearly prosecutable under statutes 18 USC 1029 and 1028 of the Credit Card Fraud Act of 1984, which govern the use of an access device like a voice (or cloned voice), particularly in the biometric context (i.e., accessing an online account). However, she said that it’s an open question whether it’s a crime to acquire a cloned voice without authorization in the first place.
China’s recently implemented deepfake rules might provide a model for future regulation. They explicitly ban the publishing and distribution of “fake news” created with technologies such as AI and virtual reality, and require that any use of AI or VR be prominently marked.
But in the absence of a legal framework, Jaiman advocates a risk model akin to what Microsoft uses internally. It tries to balance the benefits of any technology that might be brought to market against the potential harms that technology can bring.
“[You] essentially model [risk] and come up with a governance model where you say, ‘OK, what can we do to bring the risk down?’” he said. “[At Microsoft,] we came up with a ‘harms’ framework [essentially] saying, hey, these are the potential misuses of this technology, and … if we see any of these, we can cut access to [a] service.” Jaiman asserted that a Microsoft customer could create a voice clone only if they could show that the target speaker gave their consent and the application complied with Microsoft’s terms of service. He said, “We [also] have to make sure that we’re keeping pace with society and with technologies as they evolve. [Any model has to have] accountability, privacy, security, reliability, and safety built-in … Trust is a high-value currency.”
In any case, added Sedky, transparency and an abundance of caution will be the keys to grappling with voice synthesis technologies in the years to come.
“Obviously this technology is fantastic,” said Sedky. “Just like the internet can be weaponized against people, it doesn’t mean we shouldn’t have the internet. It just means that these are things that we need to be thinking about, and … that will make it harder to weaponize against people. We need to be upfront about how we’re going to protect consumers who are definitely going to be victimized in ways by criminals. The criminals are usually ahead of us.”