Created by DALL·E, corresponding prompt: Here is the high-resolution illustration of an audiowave with a flowing, ethereal design and thin, delicate lines in blue.
Verification, AI & Automization

Audio deepfakes – Much noise about one thing

It’s an election year, and audio deepfakes are making headlines. Journalists and politicians are increasingly concerned about AI-generated content influencing democratic processes. We have explored how AI has been used in previous elections, and what you can do to prepare yourself.

Audio deepfakes in a nutshell

Audio deepfakes are a form of synthetically generated content. AI is leveraged to imitate or clone a person’s voice so that it can’t be distinguished from the original. With this voice model, new content can be created or existing content can be manipulated. In other words, audio deepfakes can be used to make anyone say anything.

Audio deepfakes are relatively easy to produce: they require less training data and computing power than video deepfakes. They bring a series of benefits for disinformation agents and specific challenges for fact-checkers and investigators.

Learning by observing

Our exploration of synthetic audio-visual content reveals the kinds of audio deepfakes that circulated around recent elections and that we might also expect in upcoming ones, as well as the challenges they pose for both journalists and voters.

Use case #1: Election countdown

The release of audio deepfakes shortly before or on election day has been observed in recent votes, for example during Slovakia’s parliamentary election in September 2023 and Pakistan’s general election in February 2024.

In Slovakia, a deepfake appeared less than 48 hours before the ballot boxes opened, during the country’s pre-election moratorium, which requires parties and media to remain silent in the two days before the vote. It defamed pro-Western candidate Michal Šimečka by having him discuss election fraud with journalist Monika Todova. After leading the polls beforehand, Šimečka lost the election. (Sources: GIJN; WIRED; ZDF heute)

In Pakistan, former prime minister Imran Khan, currently imprisoned, was featured in audio deepfakes shared on social media. The day before voters headed to the polls, he supposedly called for an election boycott. (Source: Logically Facts)

While Khan and his PTI party deny responsibility for this, it highlights a challenge for journalists that comes with AI-generated content: Who is behind the deepfake? And how would you verify this information?

Content analysis

Zooming out from the specifics of detecting synthetic audio, an essential part of verifying digital content is answering the W questions: What, Who, Where, When? Verifying an audio file includes:

  1. What happens in the audio and what exactly has been said?
  2. Who is talking and who has created the audio file?
  3. Where did the recorded event take place?
  4. When did the recorded event take place?

Content analysis means taking apart the information presented in the audio recording. Is it the original speech? What are the claim and the narrative in the speech? Would the person say that in the given circumstances? Investigate the mentioned content, numbers and sources for authenticity and truthfulness. Compare the content with known facts, context, or previous statements by the person to identify discrepancies or contradictions. A transcript of the conversation will be useful for making notes on specific parts. And where was this person at the given moment?
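A transcript can be generated automatically. Below is a minimal sketch, assuming the open-source Whisper speech-to-text model (openai-whisper) is installed and the recording is saved locally; the filename is a placeholder:

```python
# Minimal sketch: transcribe an audio file with the open-source Whisper model
# (pip install openai-whisper). The filename is a placeholder.
import whisper

model = whisper.load_model("base")  # small, fast model; larger models are more accurate
result = model.transcribe("audio_in_question.mp3")

print(result["text"])  # full transcript for note-taking

# Timestamped segments make it easier to reference specific passages
for segment in result["segments"]:
    print(f"[{segment['start']:6.1f}s - {segment['end']:6.1f}s] {segment['text']}")
```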

Background noises can raise red flags, too. If it’s an indoor recording, background noise should be minimal. If it was recorded outdoors, consider whether there was a train track or another identifiable sound source nearby. What can you hear in the recording - people, specific urban or rural sounds?
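To inspect background noise more closely, a spectrogram can make faint or recurring sounds visible. A minimal sketch, assuming the librosa and matplotlib libraries are installed; the filename is again a placeholder:

```python
# Minimal sketch: visualise an audio file as a spectrogram to inspect background noise
# (pip install librosa matplotlib). The filename is a placeholder.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("audio_in_question.mp3", sr=None)  # keep the original sample rate
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram of the audio in question")
plt.tight_layout()
plt.show()
```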

Always check the person’s or their party’s website, as well as trustworthy local news outlets or journalists, to see whether they have already commented on the shared audio file.

Use case #2: Return of the dead

Getting advice on whom to vote for from a person who has been dead for years but whose advice you once valued? Perhaps a little irritating, but would it also affect you in some way?

Both the Indian and the Indonesian elections in 2024 are good examples of this use case. AI resurrected people, such as formerly popular politicians, to let them act as supporters of current candidates. Parties such as Golkar in Indonesia hope to reach voters on an emotional level and thereby influence their decisions at the polls. (Sources: CNN; Rest of World)

There are multiple technologies for generating audio deepfakes. The most prevalent ones are based on Text-to-Speech (TTS) and Voice Conversion (VC) algorithms.

TTS takes written text and a set of reference recordings of a target speaker (e.g., a politician) as input. Based on these, it creates audio that reproduces the written text with the voice of the target speaker, including their timbre and melody.

In contrast, VC requires a speech recording of any voice (without any written text needed) along with a set of reference recordings of a target speaker. The resulting audio file's content resembles the input utterance, while its timbre and melody reflect the characteristics of the target speaker's voice.
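The difference can be summarised as two input-output contracts. The following sketch is purely illustrative; the function names and types are hypothetical and do not refer to any real library:

```python
# Conceptual sketch contrasting TTS and VC; names and types are hypothetical.
from typing import List

Audio = bytes  # stand-in type for a waveform


def text_to_speech(text: str, target_references: List[Audio]) -> Audio:
    """TTS: written text plus reference recordings of the target speaker
    produce new speech in the target's voice (timbre and melody)."""
    raise NotImplementedError  # real systems use trained neural models


def voice_conversion(source_speech: Audio, target_references: List[Audio]) -> Audio:
    """VC: an existing recording of any voice plus reference recordings of the
    target produce the same utterance re-rendered in the target's voice."""
    raise NotImplementedError  # real systems use trained neural models
```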

Use case #3: Personalised campaigning

The possibility of personalising messages and targeting individuals on a large scale best illustrates the fine line between the potential and the risks of synthetic audio.

In Indian election campaigning, over 50 million AI-generated calls to voters were made. Deepfakes were used as a tool for voter outreach, to translate in real time and to replace time-consuming door-to-door campaigning. (Sources: DER STANDARD; WIRED; Rest of World)

Not only do these deepfakes mislead individual voters; they also create a huge additional challenge for journalists, who have to monitor, debunk and address them and inform the affected communities.

On the other hand, personalising messages with AI can also improve inclusion and accessibility. Take the example of India, a country with over 120 languages and almost 20,000 dialects. It is impossible for a candidate to cover all these languages naturally. Using AI to translate their message, in their own voice, into 120 languages can help them reach people who might otherwise be left behind.

A video deepfake also offers visual clues that help to verify it; with just an audio deepfake, verification is more challenging but not impossible.

Listen to the rhythm of the beat

Unfortunately, there is no one-button solution that can detect every kind of manipulation in audio. However, even audio deepfakes contain information traces that leave detectable hints of manipulation. Put your headphones on, turn up the volume and close your eyes to:

  1. Compare voices. Compare the audio in question with samples of the person's known real voice to identify similarities or discrepancies, for example in pronunciation. This task is easier when dealing with public figures due to the availability of databases, interviews, and content like YouTube videos. (A small automated aid is sketched after this list.)
  2. Analyse the speech behaviour. Each person has a unique speech pattern and behaviour, including intonation, breathing patterns, rhythm, speed of speaking and pauses. Here too, this assumes the voice is known and can be compared with other examples.
  3. Check lip synchronisation. If it’s audio-visual content, compare the lip movements with your own. What would your lips look like when speaking an ‘O’, an ‘A’ or a word like ‘elections’ or ‘voting’? Pay attention to the language.
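For the voice comparison in point 1, a speaker-embedding similarity score can complement careful listening. A minimal sketch, assuming the open-source Resemblyzer library is installed; the two WAV filenames are placeholders, and a high score alone proves nothing on its own:

```python
# Minimal sketch: compare the questioned audio with a known real sample using speaker
# embeddings (pip install resemblyzer). Filenames are placeholders.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()
known = encoder.embed_utterance(preprocess_wav("known_real_speech.wav"))
questioned = encoder.embed_utterance(preprocess_wav("audio_in_question.wav"))

# Embeddings are L2-normalised, so the dot product is the cosine similarity.
# High similarity means the voices sound alike - it does not rule out a cloned voice.
similarity = float(np.dot(known, questioned))
print(f"Voice similarity: {similarity:.2f}")
```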

AI technology is improving. It can already imitate the tone of voice, intonation and pronunciation quite well, but accurately cloning a voice remains challenging: there are many nuanced differences between voices. Of course, a fake of an unfamiliar voice might be more difficult to detect. You could also ask people who know the person well. Collaboration is key.

What now?

The outlined use cases don’t answer whether audio deepfakes influence election outcomes, but they show one thing: AI-generated audio around elections comes in all kinds of shapes and serves all kinds of motivations. Everyone, but especially journalists and fact-checkers, should be extra vigilant for audio deepfakes, particularly in the days before an election.

While there is no one-button solution to detect audio deepfakes, we will look at what is out there in the follow-up to this article. Don’t wait for us; you can start practicing today:

Put on good headphones, listen to your favorite song and take it apart piece by piece. Concentrate on the drums or the piano, on the singer’s breathing and the beat. You will be surprised how much you will discover in the song. The same goes for any audio content.

Authors
Anna Schild
Julia Bayer