In my last article I wrote about Small Language Models - an AI technology gaining traction as a solution to the cost, security and accuracy challenges associated with Large Language Models. Language models, in general, are great at producing human-like text after being trained on large amounts of language-based communication. However, they use a unimodal learning mechanism, meaning only one type of input - text, in the case of language models - is used. Other unimodal models use images, audio or video as both input and output.
A multimodal model allows AI to view the world from different angles by using multiple types of input - modalities - at the same time to train and gain knowledge. Such models can also produce virtually any type of output. Beyond the ability to handle multiple types of input/output, the big benefit of multimodal learning is the more complete and nuanced context the system gains by working with different modalities simultaneously. This, of course, translates into more accurate results for the end user.
Multimodal Learning
Before moving forward, it is worth taking a step back to look at multimodal learning. This is a field of learning theory that pre-dates multimodal AI by decades. It promotes the idea that when multiple senses are engaged in learning, we absorb, understand and retain more information and concepts.
The central belief in this type of learning is that different people have different learning styles. Some people are visual learners, some aural, some hands-on, etc. Neil Fleming, an educator from New Zealand, proposed a model called VARK in 1987 to formalize how people prefer to learn. VARK stands for:
- Visual
- Aural
- Read/Write
- Kinesthetic
Even though most people have a strong preference for one style, they learn using a combination of methods, and their preference can vary depending on the subject matter being learned. If you are curious about how the VARK model would characterize your learning style, you can take a VARK questionnaire. Treat the results as an indication of where your learning strengths lie rather than a prescription for how you must learn - one can always learn to learn differently.
Whether or not current multimodal AI efforts were inspired by the multimodal learning field, studying the field further would provide useful perspective. It is also clear that Multimodal AI, or Multimodal Machine Learning (MML), is a critical evolution in the field of AI. It will help create much more natural systems to interact with, and the field will grow significantly as systems mature. In fact, the worldwide market size was about $1.35B USD in 2023 and is expected to reach almost $11.5B USD by 2030.
MML Mechanics
Let’s break down the MML process to understand how it works. The high-level flow starts with data: relevant data is fed into the MML system in various forms. Three MML modules take over from there:
- Input module
- Fusion module
- Output module
Once the output module completes its work, the user is presented with the results.
Figure 1: Multimodal Machine Learning Steps
Input Module
The input module accepts different types, or modes, of data. It could be text, video, images, audio clips or other sensor data. This data is ingested, processed and encoded into a form the learning algorithms can work with, e.g. embeddings. Embeddings are mathematical representations (e.g. vectors of numbers) that capture the essence of real-world objects/data as well as the relationships between them.
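As an illustration of the idea (not how production encoders work), here is a toy Python sketch: it maps text to a crude bag-of-words "embedding" and compares inputs with cosine similarity. All names here are made up for the example; real systems learn dense vectors, but the principle - turn raw input into numbers that capture relatedness - is the same.

```python
import math

def embed(text, vocab):
    # Toy bag-of-words "embedding": count each vocabulary word in the text.
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    # Similarity between two embeddings: related inputs score higher.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

vocab = ["cat", "dog", "car", "sat"]
e1 = embed("the cat sat", vocab)
e2 = embed("a cat sat down", vocab)
e3 = embed("the car", vocab)
# Sentences about the same thing end up closer in the embedding space.
similar = cosine(e1, e2)
different = cosine(e1, e3)
```

In a real MML input module, each modality would have its own learned encoder (e.g. a transformer for text, a CNN or vision transformer for images), but all of them serve this same role of producing comparable vectors.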
Fusion Module
This module combines the information from multiple modalities. Different strategies are available for fusion. The key concepts differentiating these strategies are when and how information is combined. As one would expect, each strategy has its pros and cons.
Early Fusion
In early fusion, data is combined, e.g. through concatenation, before any machine learning takes place. This is usually done when the data is of the same type - for example, when two data streams are both text. A downside of early fusion is that raw input data may not contain rich semantic information, so the model may not be able to capture complex interactions between modalities. Also, if the modalities are very different, e.g. text and image, it is hard to combine them in a meaningful and efficient way. On the positive side, the strategy is efficient because only one learning model needs to be used/trained.
Figure 2: Early Fusion Example
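A minimal sketch of early fusion, using made-up feature vectors and a stand-in "model": the raw features are concatenated before any learning, and a single model consumes the combined vector.

```python
def early_fuse(features_a, features_b):
    # Early fusion: combine raw features (here by concatenation)
    # before any learning model sees them.
    return features_a + features_b

def single_model(fused):
    # Stand-in for the one learning model early fusion requires;
    # here just a fixed linear scoring of the fused vector.
    weights = [0.5] * len(fused)
    return sum(w * x for w, x in zip(weights, fused))

text_stream_1 = [0.2, 0.9]  # hypothetical features from one text stream
text_stream_2 = [0.4, 0.1]  # hypothetical features from a second text stream
fused = early_fuse(text_stream_1, text_stream_2)
score = single_model(fused)
```

Note how only one model is trained and applied - the efficiency upside mentioned above - but the model must make sense of raw, unencoded features from every source.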
Intermediate Fusion
With this strategy, information from different modalities is turned into a latent representation, or embedding, and then fused. I have also heard this type of fusion referred to as sketch fusion. The fused representation is then run through a learning model. This is a widely used type of fusion that allows a variety of data modes to be fed into a single model.
Figure 3: Intermediate Fusion Example
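The flow can be sketched in Python as follows. The per-modality "encoders" below are hypothetical stand-ins (real ones would be learned networks); what matters is the shape of the pipeline - encode each modality into a latent vector, fuse the latents, then run one model on the fused representation.

```python
def encode_text(text):
    # Hypothetical text encoder: a 2-dim "latent" from simple statistics.
    words = text.split()
    return [len(words) / 10.0, sum(len(w) for w in words) / 100.0]

def encode_image(pixels):
    # Hypothetical image encoder: a 2-dim "latent" from brightness stats.
    mean = sum(pixels) / len(pixels)
    return [mean / 255.0, max(pixels) / 255.0]

def fuse(latents):
    # Fuse latent representations into one vector (concatenation here).
    fused = []
    for lat in latents:
        fused.extend(lat)
    return fused

def model(fused):
    # Stand-in for the single downstream learning model.
    return sum(fused) / len(fused)

text_latent = encode_text("a cat on a mat")
image_latent = encode_image([12, 200, 130, 90])
prediction = model(fuse([text_latent, image_latent]))
```

Because fusion happens on latent vectors rather than raw data, very different modalities (text and pixels here) can be combined in a common numeric space.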
Late Fusion
In late fusion, each modality is independently run through a learning model appropriate for that modality to extract features. The results are then fused through element-wise addition, a weighted average, an attention mechanism or even another learning model.
Figure 4: Late Fusion Example
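Here is a small sketch of late fusion with a weighted average. The two per-modality "models" are toy stand-ins that just return fixed confidences; in practice each would be a trained model for its modality.

```python
def audio_model(audio_features):
    # Hypothetical audio-only model returning a confidence score.
    return 0.9

def vision_model(image_features):
    # Hypothetical vision-only model returning a confidence score.
    return 0.6

def late_fuse(predictions, weights):
    # Fuse independent per-modality predictions via a weighted average.
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total

preds = [audio_model(None), vision_model(None)]
final = late_fuse(preds, weights=[0.3, 0.7])  # trust vision more here
```

The weights are one place where domain knowledge enters: a modality known to be more reliable for the task can be given a larger say in the fused result.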
Hybrid Fusion
Hybrid fusion essentially may use all three of the aforementioned strategies depending on the modes being fused.
Figure 5: Hybrid Fusion Example
To date we can’t say that one fusion strategy is universally better than the others; the best choice depends on the data, situation and goals.
Modality Dropout
These strategies assume that all modalities are available at inference time - that is, when the trained model is used to make predictions on new data. Sometimes, however, this cannot be guaranteed. Modality dropout is a technique that randomly drops (or hides) modalities during each training iteration. This makes the model resilient in situations where an input mode is unavailable for some reason at inference time.
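A minimal sketch of the idea: during each training iteration, every modality's features are replaced with zeros with some probability, so the model cannot come to depend on any single input mode. The modality names and feature values below are made up.

```python
import random

def modality_dropout(modalities, p=0.3, rng=random):
    # Randomly hide each modality with probability p by zeroing its features.
    dropped = {}
    for name, features in modalities.items():
        if rng.random() < p:
            dropped[name] = [0.0] * len(features)  # modality hidden this step
        else:
            dropped[name] = features
    return dropped

batch = {"camera": [0.2, 0.7], "lidar": [0.5, 0.1], "radar": [0.9, 0.3]}
random.seed(0)  # seeded only so the sketch is repeatable
noisy = modality_dropout(batch, p=0.5)
```

Each training step then feeds `noisy` into the fusion pipeline instead of `batch`, teaching the model to produce sensible outputs even when one sensor's contribution is missing.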
Output Module
The output module takes the processed information and articulates conclusions to the end user. It presents insights, predictions or decisions based on synthesized data that was provided to the AI system. In essence, it generates the final output to be provided to the user.
Use Cases
The use cases for Multimodal AI are broad and exciting. I’ll outline a small sample, but there are many impactful use cases in almost any sector.
Self-Driving Cars
Self-driving car technology relies on multiple input sensors - cameras, radar, LiDAR, inertial measurement units, GPS, sonar, etc. Besides enabling the system to assess the environment from different angles, multiple sensors provide redundancy that boosts safety and accuracy when it comes to making decisions. As humans, we don’t just rely on sight: we hear, see, feel and rely on experience to drive safely and in cooperation with others on the road. Similarly, all of the sensor data can be fed into an MML system. Because of the variety of input modes and the importance of maintaining intermodal relationships, a system that incorporates intermediate fusion makes the most sense in these applications. Furthermore, an ML system that considers modality dropout is important. For example, in rainy weather LiDAR may not function accurately, so its data may be ignored and radar data relied on more heavily. Many of the usual suspects are advancing this use case - Tesla, Waymo, Nvidia, Zoox.
Healthcare
The vast amounts of data sources, knowledge and care styles in healthcare present a huge opportunity for multimodal AI. Key challenges in widely deploying AI in the healthcare industry include data privacy, trust in AI results, adoption, regulation and the availability of patient data. However, it’s just a matter of time and careful thought before these concerns are addressed and the value of AI-assisted healthcare can be fully realized. A variety of areas in healthcare have the potential to make use of multimodal AI (or are already doing so).
Diagnostic Accuracy
Imaging, patient records, genetic information, virtual office visits, public trend data, wearable device data, etc. can be used in concert to gain a holistic view of the concern and increase diagnostic accuracy. Current assumptions are that the humans involved in our diagnostic care have the time, capacity and expertise to:
- absorb all the latest research
- consider the full patient history
- process all diagnostic information
AI can be used to significantly boost a healthcare professional's capacity to process all the information and relate it to the person being treated.
Virtual Health Assistants
It is not uncommon for a person to experience health symptoms and hop on the internet to search for answers in an attempt to self-diagnose. If not satisfied, the person will usually try to make an appointment (or virtual appointment) with a healthcare professional. Imagine you could instead immediately talk to a Virtual Health Assistant (VHA) at any time of day in a natural way, instead of typing your question and hoping you used the right words. Imagine further that the VHA could proactively ask you clarifying questions, knowing your medical history. Still further, how wonderful would it be if the VHA benefited from the complete body of available medical research and knowledge, as opposed to what one person studied or experienced as part of their practice? What if the VHA could tailor its presentation to how you like to receive information, rather than the way a particular individual conveys it? Imagine the VHA could then suggest a personalized treatment plan, or even make an appointment with a specialist if an in-person diagnosis is needed. This can all happen as MML in healthcare matures.
Accessibility
Another wonderful opportunity for multimodal AI is accessibility. Consider a user with hearing difficulties whose first language is German, attempting to follow a speech delivered in English. With multimodal AI the audio could be translated into German and captions added. Of course, the speaker’s lips and facial expressions would still move as though they were speaking English. Lip movement and other facial cues reinforce what is being said, so the listener loses the benefit of these cues. With multimodal AI, the mouth and body language of the speaker could be modified in real time as if they were speaking German. This would be a huge benefit to the person with hearing difficulties and, in general, would make the video seem more natural to anyone watching it in another language.
Education
As discussed earlier, the field of multimodal learning outlines the benefit of receiving information in multiple ways - visual, auditory, written, kinesthetic. With multimodal AI we could create a conversational user interface that provides a personalized learning experience. One that is aware of what the learner knows and provides relevant information to fill the gaps. Perhaps it knows how a person learns best and can provide information optimized for the learner’s preferences. Maybe it knows how to speak to a person (language, style, intonation, etc.) and delivers information so that they are receptive and excited to learn. It could even integrate personalized practice mechanisms on the fly to help reinforce what is being taught.
Multimodal AI Services
Most companies don’t have the resources or expertise to develop multimodal AI from the ground up. Fortunately, a number of companies are working on the nuts and bolts so that products can be built on top of them to satisfy these use cases. Some examples of multimodal services and companies are:
- GPT-4o from OpenAI
- Jina AI
- Reka AI
- Uniphore
- Gemini from Google
- Twelve Labs
- Inworld
- Runway
- ImageBind from Meta
Although relatively new, multimodal AI made its way into many AI services in 2024. It’s still too early to say it has cracked the code of human-like interaction, but given how fast this technology is evolving, I predict human-like interaction will be possible in 2-3 years.