Technology continues to advance at a breakneck pace, drawing on diverse fields to explore new capabilities. One of them is the ability to “reconstruct” a person’s face from a fragment of their voice.
The Speech2Face study, presented in 2019 at the Conference on Computer Vision and Pattern Recognition (CVPR), demonstrated that Artificial Intelligence (AI) can infer what a person looks like from short audio segments.
According to the paper, the goal of MIT researchers Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, and Michael Rubinstein is to create an image with physical characteristics consistent with the analyzed audio, rather than to reconstruct a person’s face identically.
To accomplish this, they designed and trained a deep neural network on millions of YouTube videos of people talking. During training, the model learned to associate voices with faces, enabling it to generate images whose physical characteristics, such as age, gender, and ethnicity, resemble the speaker’s.
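One simplified way to read the pipeline described above: a voice encoder maps a speech spectrogram to a face-feature vector, which a separately pretrained face decoder (not shown here) can render as a frontal face image. The sketch below, in PyTorch, is illustrative only; the layer sizes, the 4096-dimensional output, and all names are my assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    """Toy stand-in for a voice encoder: spectrogram in, face-feature out.

    The layer sizes and the 4096-d embedding (chosen to echo a
    VGG-Face-style feature) are illustrative assumptions.
    """
    def __init__(self, embed_dim: int = 4096):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool over frequency and time
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, freq_bins, time_frames)
        x = self.conv(spectrogram).flatten(1)
        return self.fc(x)

# A few seconds of speech as a spectrogram (shape is illustrative).
spec = torch.randn(1, 1, 257, 600)
face_feature = VoiceEncoder()(spec)  # (1, 4096) face-feature vector
# A pretrained face decoder would turn this vector into a frontal face.
```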
The training required no explicit modeling of detailed facial characteristics: it was carried out in a self-supervised manner, exploiting the natural co-occurrence of faces and voices in Internet videos.
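Concretely, “self-supervised” here means a video frame of the speaker’s own face, passed through a frozen, pretrained face-recognition network, supplies the training target for the voice heard on the soundtrack, so no facial attributes need to be labeled. A minimal sketch of one training step follows; the simple L1 feature loss and the function names are simplifying assumptions, not the paper’s exact objective.

```python
import torch
import torch.nn.functional as F

def training_step(voice_encoder, face_encoder, spectrogram, face_frame, optimizer):
    """One self-supervised step: the face seen in the video frame
    provides the target for the voice heard on the soundtrack.

    face_encoder is assumed to be a frozen, pretrained face-recognition
    network; voice_encoder is the trainable model from the sketch above.
    """
    with torch.no_grad():
        target = face_encoder(face_frame)   # feature of the true face
    pred = voice_encoder(spectrogram)       # feature predicted from voice
    loss = F.l1_loss(pred, target)          # pull the two features together
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```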
“The correlations between faces and voices are revealed by our reconstructions, which were obtained directly from the audio. We numerically assess and quantify how closely our Speech2Face reconstructions from audio resemble real images of speakers’ faces.”
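One natural way to read “numerically assess” is to compare the reconstructed face and a real photo of the speaker in the feature space of a face-recognition network. The snippet below is a hedged sketch of that idea; cosine similarity as the metric, and the helper name, are my assumptions about how such an evaluation could be scored.

```python
import torch
import torch.nn.functional as F

def resemblance_score(recon_feature: torch.Tensor,
                      real_feature: torch.Tensor) -> float:
    """Cosine similarity between the face feature of the reconstruction
    and that of a real photo (1.0 = same direction in feature space).
    The metric choice is an illustrative assumption.
    """
    return F.cosine_similarity(recon_feature, real_feature, dim=-1).item()

# e.g. recon = face_encoder(speech2face_image); real = face_encoder(photo)
score = resemblance_score(torch.randn(4096), torch.randn(4096))
```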
They explain that, because this study may touch on sensitive matters of ethnicity or privacy, no identifying physical details are added to the recreated faces, and that, like any other machine learning system, it will improve over time as its library of training data grows.
While the published tests show that Speech2Face matches faces to voices in a high proportion of cases, it also has flaws, sometimes failing to match the ethnicity, age, or gender of the voice sample used.
The model is intended to capture statistical correlations between facial features and voice. It should be noted that the AI was trained on YouTube videos, which are not a representative sample of the world’s population; for speakers of some languages, for example, the reconstructions diverge because of imbalances in the training data.
In this regard, the study itself recommends at the end of its findings that those who decide to investigate and extend the system draw on a larger sample of people and voices, so that the machine learning has a broader repertoire for matching and recreating expressions.
The program was also able to recreate faces from the voices of cartoon characters, and the resulting faces bear a striking resemblance to the real voice actors behind the analyzed audio.
Because this technology could be used for malicious purposes, the recreated face captures only a general likeness of the person rather than a complete, identifiable face, as the latter could pose a privacy problem.
Nonetheless, I’ve been astounded by what technology can do with audio samples.