Humans naturally learn by making connections between sight and sound. For instance, we can watch someone playing the cello and recognize that the cellist's movements are generating the music we hear.
A new approach developed by researchers from MIT and elsewhere improves an AI model's ability to learn in this same fashion. This could be useful in applications such as journalism and film production, where the model could help curate multimodal content through automatic video and audio retrieval.
In the longer term, this work could be used to improve robots' ability to understand real-world environments, where auditory and visual information are often closely connected.
Building on prior work from their group, the researchers created a method that helps machine-learning models align corresponding audio and visual data from video clips without the need for human labels.
They adjusted how their original model is trained so it learns a finer-grained correspondence between a particular video frame and the audio that occurs in that moment. The researchers also made some architectural tweaks that help the system balance two distinct learning objectives, which improves performance.
Taken together, these relatively simple improvements boost the accuracy of their approach in video retrieval tasks and in classifying the action in audiovisual scenes. For instance, the new method could automatically and precisely match the sound of a door slamming with the visual of it closing in a video clip.
"We are building AI systems that can process the world like humans do, in terms of having both audio and visual information coming in and being able to seamlessly process both modalities. Looking forward, if we can integrate this audio-visual technology into some of the tools we use on a daily basis, it could open up a lot of new applications," says Andrew Rouditchenko, an MIT graduate student and co-author of a paper on this research.
He is joined on the paper by lead author Edson Araujo, a graduate student at Goethe University in Germany; Yuan Gong, a former MIT postdoc; Saurabhchand Bhati, a current MIT postdoc; Samuel Thomas, Brian Kingsbury, and Leonid Karlinsky of IBM Research; Rogerio Feris, principal scientist and manager of the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Hilde Kuehne, professor of computer science at Goethe University and an affiliated professor at the MIT-IBM Watson AI Lab. The work will be presented at the Conference on Computer Vision and Pattern Recognition.
Syncing up
This work builds on a machine-learning method the researchers developed a few years ago, which provided an efficient way to train a multimodal model to process audio and visual data simultaneously without the need for human labels.
The researchers feed this model, called CAV-MAE, unlabeled video clips, and it encodes the visual and audio data separately into representations called tokens. Using the natural audio from the recording, the model automatically learns to map corresponding pairs of audio and visual tokens close together within its internal representation space.
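To make the idea of "tokens" concrete, the sketch below shows one standard way a model can cut video frames and audio spectrograms into patches and project each patch into an embedding of shared size. This is an illustrative example, not the released CAV-MAE code, and the patch sizes, dimensions, and class names are assumptions.

```python
# Illustrative patch tokenization (assumed setup, not the authors' implementation):
# each modality is split into patches and projected into same-dimension tokens,
# so corresponding audio and visual tokens can later be pulled close together.
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    def __init__(self, in_channels: int, patch_size: int, dim: int = 768):
        super().__init__()
        # A strided convolution is a common way to carve an input into
        # non-overlapping patches and embed each patch as one token.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.proj(x)                     # (batch, dim, H', W')
        return tokens.flatten(2).transpose(1, 2)  # (batch, num_tokens, dim)

visual_tokenizer = PatchTokenizer(in_channels=3, patch_size=16)  # RGB video frames
audio_tokenizer = PatchTokenizer(in_channels=1, patch_size=16)   # log-mel spectrograms
```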
They found that using two learning objectives balances the model's learning process, which enables CAV-MAE to understand the corresponding audio and visual data while improving its ability to retrieve video clips that match user queries.
But CAV-MAE treats audio and visual samples as one unit, so a 10-second video clip and the sound of a door slamming are mapped together, even if that audio event happens in just one second of the video.
In their improved model, called CAV-MAE Sync, the researchers split the audio into smaller windows before the model computes its representations of the data, so it generates separate representations that correspond to each smaller window of audio.
During training, the model learns to associate one video frame with the audio that occurs during just that frame.
"By doing that, the model learns a finer-grained correspondence, which helps with performance later when we aggregate this information," Araujo says.
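The pairing idea can be sketched in a few lines: split a clip's waveform into as many windows as there are sampled frames, and treat each frame and its time-aligned window as a positive pair. The helper below is a hypothetical illustration under assumed clip lengths and sample rates, not the paper's code.

```python
# Hypothetical sketch of frame-level audio pairing (assumed preprocessing,
# not CAV-MAE Sync itself): each sampled video frame is matched with the
# audio window that covers the same span of time.
import torch

def frame_audio_pairs(audio: torch.Tensor, frames: torch.Tensor,
                      clip_seconds: float, sample_rate: int = 16000):
    """audio: (num_samples,) waveform; frames: (num_frames, C, H, W)."""
    num_frames = frames.size(0)
    window_len = int(clip_seconds * sample_rate / num_frames)
    pairs = []
    for i in range(num_frames):
        window = audio[i * window_len:(i + 1) * window_len]
        pairs.append((frames[i], window))  # positive pair for frame i only
    return pairs
```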
They also incorporated architectural improvements that help the model balance its two learning objectives.
Adding "wiggle room"
The model incorporates a contrastive objective, where it learns to associate similar audio and visual data, and a reconstruction objective, which aims to recover specific audio and visual data based on user queries.
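A common way to balance two such objectives is a weighted sum of their losses. The sketch below pairs an InfoNCE-style contrastive term with a masked-reconstruction error; the weight, temperature, and function names are illustrative assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of combining a contrastive and a reconstruction objective
# (assumed weighting, not the authors' exact loss).
import torch
import torch.nn.functional as F

def combined_objective(audio_emb, video_emb, reconstructed, masked_targets,
                       lambda_contrast: float = 0.01, temperature: float = 0.07):
    # Reconstruction term: error on the masked patches, as in masked-autoencoder training.
    recon = F.mse_loss(reconstructed, masked_targets)
    # Contrastive term: pull matching audio/visual embeddings together,
    # push mismatched pairs in the batch apart (InfoNCE-style).
    a = F.normalize(audio_emb, dim=-1)   # (batch, dim)
    v = F.normalize(video_emb, dim=-1)   # (batch, dim)
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    contrast = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
    return recon + lambda_contrast * contrast
```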
In CAV-MAE Sync, the researchers introduced two new types of data representations, or tokens, to improve the model's learning ability.
These include dedicated "global tokens" that help with the contrastive learning objective and dedicated "register tokens" that help the model focus on important details for the reconstruction objective.
"Essentially, we add a bit more wiggle room to the model so it can perform each of these two tasks, contrastive and reconstructive, a bit more independently. That benefited overall performance," Araujo adds.
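Mechanically, adding such tokens usually means prepending extra learnable vectors to the transformer's input sequence. The sketch below shows that general pattern; the token counts, layer sizes, and the class name are assumptions for illustration, not details taken from the released model.

```python
# Minimal sketch of prepending learnable "global" and "register" tokens to a
# transformer encoder's input (illustrative configuration, not the paper's).
import torch
import torch.nn as nn

class TokenAugmentedEncoder(nn.Module):
    def __init__(self, dim: int = 768, num_register: int = 8):
        super().__init__()
        self.global_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.register_tokens = nn.Parameter(torch.randn(1, num_register, dim) * 0.02)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
            num_layers=2)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        b = patch_tokens.size(0)
        extras = torch.cat([self.global_token.expand(b, -1, -1),
                            self.register_tokens.expand(b, -1, -1)], dim=1)
        x = torch.cat([extras, patch_tokens], dim=1)
        # The global token can feed the contrastive head, while register tokens
        # give the reconstruction pathway extra capacity for fine detail.
        return self.encoder(x)
```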
While the researchers had some intuition that these enhancements would improve the performance of CAV-MAE Sync, it took a careful combination of strategies to shift the model in the direction they wanted it to go.
"Because we have multiple modalities, we need a good model for both modalities by themselves, but we also need to get them to fuse together and collaborate," Rouditchenko says.
In the end, their enhancements improved the model's ability to retrieve videos based on an audio query and to predict the class of an audio-visual scene, like a dog barking or an instrument playing.
Its results were more accurate than their prior work, and it also performed better than more complex, state-of-the-art methods that require larger amounts of training data.
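Once audio and visual embeddings share a space, retrieval of the kind described above can be as simple as ranking stored video embeddings by similarity to an audio query. The snippet below is an assumed post-hoc use of such a learned space, not code from the paper.

```python
# Sketch of audio-to-video retrieval over a precomputed embedding library
# (hypothetical usage of the aligned representation space).
import torch
import torch.nn.functional as F

def retrieve_videos(audio_query: torch.Tensor, video_library: torch.Tensor, top_k: int = 5):
    """audio_query: (dim,); video_library: (num_videos, dim)."""
    sims = F.normalize(video_library, dim=-1) @ F.normalize(audio_query, dim=0)
    return torch.topk(sims, k=top_k).indices  # indices of the best-matching clips
```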
"Sometimes, very simple ideas or little patterns you see in the data have big value when applied on top of a model you are working on," Araujo says.
In the future, the researchers want to incorporate new models that generate better data representations into CAV-MAE Sync, which could improve performance. They also want to enable their system to handle text data, which would be an important step toward generating an audiovisual large language model.
This work is funded, in part, by the German Federal Ministry of Education and Research and the MIT-IBM Watson AI Lab.