
Artificial intelligence learns to link sight and sound like humans

Artificial intelligence systems are getting closer to mimicking the way humans naturally link what they see to what they hear.

MIT researchers have developed a new machine-learning approach that helps AI models automatically match corresponding audio and visual information in video clips, without the need for human labels to guide the process. The breakthrough could ultimately help robots better understand real-world environments, where sound and vision often work together.

The study is inspired by the way humans connect different senses. When you watch someone play a cello, you naturally associate the musician’s bowing movements with the music you hear. The MIT team hopes to recreate this seamless integration in artificial systems.

Fine-tuning audio-visual connections

The researchers improved on their earlier work by creating a method called CAV-MAE Sync, which learns a more precise connection between a specific video frame and the audio that occurs at that moment. Previously, their model matched audio to a randomly chosen video frame, which is a bit like trying to sync an entire song to a single photo.

“We are building AI systems that can process the world like humans do, with audio and visual information arriving at the same time and both being processed seamlessly,” explains Andrew Rouditchenko, an MIT graduate student and co-author of the study.

The new method splits the audio into smaller windows before processing it, creating a separate representation for each short time period. During training, the model learns to associate each video frame only with the audio that occurs during that frame, a more granular and realistic approach.
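To make the idea concrete, here is a minimal PyTorch sketch of frame-level audio-visual alignment. It is not the authors’ implementation: the layer sizes, the simple linear projections standing in for the real transformer encoders, and the helper names (`split_audio_into_windows`, `frame_level_alignment_loss`) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

NUM_FRAMES = 8     # video frames sampled from a clip
WINDOW_DIM = 256   # flattened audio window: 8 time steps x 32 mel bins
FRAME_DIM = 512    # flattened per-frame visual features
EMBED_DIM = 128    # shared embedding space

audio_proj = torch.nn.Linear(WINDOW_DIM, EMBED_DIM)
video_proj = torch.nn.Linear(FRAME_DIM, EMBED_DIM)

def split_audio_into_windows(spectrogram: torch.Tensor, num_windows: int) -> torch.Tensor:
    """Chop a (time, freq) spectrogram into equal temporal windows, one per
    sampled video frame, and flatten each window into a single vector."""
    usable = spectrogram.shape[0] - spectrogram.shape[0] % num_windows
    return spectrogram[:usable].reshape(num_windows, -1)

def frame_level_alignment_loss(frames: torch.Tensor, spectrogram: torch.Tensor) -> torch.Tensor:
    """Contrastive loss pairing each video frame with the audio window that
    covers the same moment, instead of one embedding for the whole clip."""
    windows = split_audio_into_windows(spectrogram, NUM_FRAMES)
    a = F.normalize(audio_proj(windows), dim=-1)   # (num_frames, embed)
    v = F.normalize(video_proj(frames), dim=-1)    # (num_frames, embed)
    logits = v @ a.T / 0.07                        # similarity of every frame/window pair
    targets = torch.arange(NUM_FRAMES)             # frame i should match window i
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for a real clip.
frames = torch.randn(NUM_FRAMES, FRAME_DIM)   # 8 sampled frames
spectrogram = torch.randn(64, 32)             # 64 time steps of a 32-bin spectrogram
loss = frame_level_alignment_loss(frames, spectrogram)
```

The key point is the diagonal target: frame i is trained to match only the audio window covering the same moment, rather than one embedding for the entire clip.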

Solving competing objectives

The team also tackled a fundamental challenge in AI training: balancing two learning objectives that can conflict with each other. The model must both reconstruct missing audio and visual information (like filling in blanks) and learn to associate matching sounds with matching images.

These goals compete because they ask the same internal representation to do double duty. The researchers solved this by introducing specialized “tokens” that handle different aspects of learning without interfering with each other.

Key improvements to the new system include:

  • Fine-grained time alignment between audio windows and video frames
  • Separate “global tokens” that learn the cross-modal associations
  • “Register tokens” that help the model focus on important details
  • A better balance between the reconstruction and contrastive learning objectives
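The interplay between the two objectives is easier to see in code. The sketch below is illustrative only, not the published CAV-MAE Sync architecture: the tiny encoder, the token names, and the loss weighting `alpha` are assumptions made for the example. The global token output feeds the contrastive loss, the patch outputs feed reconstruction, and the register tokens give the model spare capacity so the two objectives do not pull on the same vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenizedEncoder(nn.Module):
    """Toy encoder: patch tokens plus dedicated 'global' and 'register' tokens."""
    def __init__(self, dim: int = 128, num_registers: int = 4):
        super().__init__()
        self.global_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.register_tokens = nn.Parameter(torch.zeros(1, num_registers, dim))
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.Linear(dim, dim)  # stand-in for the masked-autoencoder decoder
        self.num_registers = num_registers

    def forward(self, patches: torch.Tensor):
        b = patches.shape[0]
        extra = torch.cat([self.global_token, self.register_tokens], dim=1).expand(b, -1, -1)
        out = self.encoder(torch.cat([extra, patches], dim=1))
        global_out = out[:, 0]                       # used for cross-modal contrastive loss
        patch_out = out[:, 1 + self.num_registers:]  # used for reconstruction loss
        return global_out, self.decoder(patch_out)

def combined_loss(audio_patches, video_patches, audio_enc, video_enc, alpha=0.01):
    a_global, a_recon = audio_enc(audio_patches)
    v_global, v_recon = video_enc(video_patches)
    # Contrastive objective: matching clips in the batch should have similar global tokens.
    logits = F.normalize(a_global, dim=-1) @ F.normalize(v_global, dim=-1).T / 0.07
    targets = torch.arange(logits.shape[0])
    contrastive = F.cross_entropy(logits, targets)
    # Reconstruction objective: predict the (here, unmasked) input patches back.
    reconstruction = F.mse_loss(a_recon, audio_patches) + F.mse_loss(v_recon, video_patches)
    return reconstruction + alpha * contrastive

# Toy usage: a batch of 2 clips, 16 patch tokens each, 128-dim.
audio_enc, video_enc = TokenizedEncoder(), TokenizedEncoder()
loss = combined_loss(torch.randn(2, 16, 128), torch.randn(2, 16, 128), audio_enc, video_enc)
```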

These architectural tweaks sound technical, but they address a core problem in AI: how to learn multiple skills at once without one interfering with another.

Real-world applications

The enhanced model shows significant improvements on practical tasks. When asked to retrieve videos based on audio queries, it performs more accurately than previous versions. It is also better at predicting the type of scene from combined audio-visual cues, such as telling the sound of a barking dog apart from an instrument being played.
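In practice, retrieval with a shared audio-visual embedding space typically reduces to nearest-neighbour search. The snippet below is a generic sketch, not the paper’s evaluation code; the embedding dimensions and the `retrieve_videos` helper are hypothetical.

```python
import torch
import torch.nn.functional as F

def retrieve_videos(audio_query_emb: torch.Tensor, video_embs: torch.Tensor, top_k: int = 5):
    """Rank a library of video embeddings by cosine similarity to an audio query.
    Both inputs are assumed to come from a shared audio-visual embedding space."""
    sims = F.normalize(video_embs, dim=-1) @ F.normalize(audio_query_emb, dim=-1)
    return torch.topk(sims, k=min(top_k, video_embs.shape[0])).indices

# Toy usage with random embeddings standing in for encoder outputs.
library = torch.randn(1000, 128)   # 1,000 videos, 128-dim embeddings
query = torch.randn(128)           # one audio clip's embedding
best_matches = retrieve_videos(query, library)
```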

“Sometimes, very simple ideas or small patterns you see in the data have great value when applied on top of a model you are working on,” noted Edson Araujo, a graduate student at Goethe University in Germany.

Immediate applications include journalism and film production, where the model could automatically curate content by matching audio and video elements. Content creators could also use it to find specific types of scenes or sounds in large video libraries.

Looking ahead

But the long-term vision is more ambitious. The researchers hope to eventually integrate this audio-visual technology into large language models, the AI systems that power chatbots and virtual assistants.

“Looking forward, if we could integrate this audio-visual technology into some of the tools we use every day, such as large language models, it could open up many new applications,” Rouditchenko said.

The team also hopes to enable their system to handle text data, which would be an important step toward a comprehensive multimodal AI that processes language, sound, and vision together.

This kind of comprehensive perception may be crucial for robots operating in real-world environments. Just as humans rely on vision and sound to navigate the world, future robotic systems may need similar capabilities to interact naturally with their surroundings.

This work represents another step toward AI systems that perceive the world the way humans do: not as separate data streams, but as interconnected experiences understood through multiple senses at once.

