According to the document, the goal of MIT Science and Research Program researchers Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, and Michael Rubinstein is not to reconstruct people’s faces exactly, but to create an image with physical characteristics that correlate with the analyzed audio.

To accomplish this, they designed and trained a deep neural network on millions of YouTube videos of people talking. During training, the model learned to associate voices with faces, enabling it to generate images that share physical characteristics with the speakers, such as age, gender, and ethnicity.

The training was self-supervised: it relied on the natural co-occurrence of faces and voices in Internet videos, without the need to explicitly model detailed physical characteristics of the face.

“The correlations between faces and voices are revealed by our reconstructions, which were obtained directly from the audio. We numerically assess and quantify how closely our Speech2Face reconstructions from audio resemble real images of speakers’ faces.”
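To give a sense of what "numerically assessing" resemblance can mean, here is a minimal sketch in Python. It assumes each face has already been turned into a list of numbers (a feature vector) and compares two such vectors with cosine similarity. The article does not say which metric the researchers used, so the choice of cosine similarity, the function name, and the sample vectors below are all assumptions made purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length feature vectors.

    Values near 1 mean the vectors point the same way (similar faces);
    values near 0 mean they have little in common.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical, made-up feature vectors (not data from the study).
true_face = [0.9, 0.1, 0.4]
reconstruction = [0.8, 0.2, 0.5]
unrelated_face = [0.1, 0.9, 0.0]

print(cosine_similarity(true_face, reconstruction))
print(cosine_similarity(true_face, unrelated_face))
```

In this toy example the reconstruction scores much higher against the true face than the unrelated face does, which is the kind of gap a numerical evaluation would look for.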

They explain that because the study touches on sensitive issues such as ethnicity and privacy, no specific physical features have been added to the recreated faces, and that, like any other machine learning system, it will improve over time as each use adds to its library of knowledge.

While the results of the displayed tests show that Speech2Face produces a high number of face-to-voice matches, it also has some flaws: in some cases the reconstruction fails to match the ethnicity, age, or gender of the speaker in the voice sample.

The model is intended to present statistical correlations between facial features and voice. It should be noted that the AI was trained using YouTube videos, which are not a representative sample of the world’s population; for example, for some languages its results show discrepancies with the training data.

In this regard, the study itself recommends at the end of its findings that those who decide to investigate and modernize the system take into account a larger sample of people and voices, so that the machine learning system has a broader repertoire for matching and recreating expressions.

The program was also able to render the reconstructed faces as cartoons, which bear a striking resemblance to the speakers in the analyzed audio.

Because this technology could be used for malicious purposes, the recreation only approximates the person’s appearance and does not provide a full, exact face, as that could be a privacy issue.

Nonetheless, I’ve been astounded by what technology can do with audio samples.

Impressive Artificial Intelligence program that recreates faces from audio

Speech2Face is a study that showed that it is possible to know what a person’s face looks like with just a small fragment of their voice
