Multimodal machine learning (MML) has significantly transformed the development of artificial intelligence (AI) systems. Instead of working with data from a single source, it integrates and analyses information from multiple modalities, such as images, audio, text, and sensor readings. The volume of labelled and unlabelled multimodal data has grown rapidly, but using it effectively, especially managing the unlabelled portion, poses significant challenges. Existing approaches usually depend on supervised learning and struggle to handle the heterogeneity and complexity of such data. These limitations hinder the creation of robust, scalable, and generalisable MML systems that can exploit the full potential of this diverse data.
This thesis addresses this challenge through multimodal reasoning. To this end, a scheme is introduced for advancing multimodal reasoning by effectively exploiting unlabelled multimodal data. The scheme is built on a sequence of inferential steps that uncover the latent knowledge and patterns hidden within these vast unlabelled datasets. These steps mitigate a key limitation of supervised methods, which depend on large amounts of labelled data that are difficult to obtain in real-world scenarios. Each inferential step was selected for its specific strengths in addressing the challenges of unlabelled multimodal data. The scheme starts with an unsupervised approach to extract features, which are then used as input to a clustering approach that groups similar data points by their hidden characteristics. The resulting clusters set the stage for a semi-supervised approach that intelligently assigns labels to the clustered data, efficiently converting unlabelled data into a useful and structured resource, as sketched below.
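To make the three inferential steps concrete, the following minimal sketch chains an unsupervised feature extractor, a clustering stage, and a semi-supervised labeller. The specific algorithms (PCA, k-means, label spreading), the synthetic data, and the seed-label strategy are illustrative assumptions, not the exact methods developed in this thesis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.semi_supervised import LabelSpreading

# Synthetic stand-in for fused multimodal feature vectors (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))

# Step 1: unsupervised feature extraction (PCA here as a placeholder method).
features = PCA(n_components=8).fit_transform(X)

# Step 2: cluster the extracted features to group similar data points.
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)

# Step 3: semi-supervised labelling. Mark most samples as unknown (-1), keep a
# small set of seed labels, and propagate them across the feature space.
y = np.full(len(X), -1)
seed_idx = rng.choice(len(X), size=25, replace=False)
y[seed_idx] = clusters[seed_idx]  # hypothetical seeds; in practice, a few human-verified labels

propagated = LabelSpreading(kernel="knn", n_neighbors=7).fit(features, y).transduction_
print(propagated[:10])  # inferred labels for the first ten samples
```

In a real deployment, the seed labels would come from a small manually verified subset rather than from the clusters themselves; the clusters serve here only to simulate that subset.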
The validity of the proposed approach is carefully evaluated on unlabelled vehicular datasets collected in real time, where it achieves more than 90% accuracy using the newly labelled dataset. Furthermore, this research explores transfer learning and its potential to enhance multimodal reasoning by applying knowledge gained from one dataset to improve performance on another. A novel model based on the transformer architecture is designed specifically to handle the continuous features found in multimodal data. The model's results were satisfactory, showing that this state-of-the-art architecture outperformed traditional machine learning (ML) algorithms.
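A minimal sketch of how a transformer can be adapted to continuous multimodal features is shown below: each scalar feature is projected into a learned token (in the style of feature tokenisation) before passing through a standard encoder. The layer sizes, head count, and pooling choice are illustrative assumptions rather than the architecture developed in this thesis.

```python
import torch
import torch.nn as nn

class ContinuousFeatureTransformer(nn.Module):
    """Hypothetical sketch: each continuous feature becomes a learned token,
    which a standard Transformer encoder then processes for classification."""

    def __init__(self, n_features: int, d_model: int = 64, n_classes: int = 5):
        super().__init__()
        # One learned (weight, bias) pair per feature turns each scalar into a token.
        self.weight = nn.Parameter(torch.randn(n_features, d_model))
        self.bias = nn.Parameter(torch.zeros(n_features, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features) of continuous values -> (batch, n_features, d_model) tokens
        tokens = x.unsqueeze(-1) * self.weight + self.bias
        encoded = self.encoder(tokens)
        return self.head(encoded.mean(dim=1))  # mean-pool the tokens, then classify

model = ContinuousFeatureTransformer(n_features=12)
logits = model(torch.randn(8, 12))  # batch of 8 samples, each with 12 continuous features
```

For transfer learning under this design, the encoder weights would be pretrained on the source dataset and fine-tuned (or frozen, with only the classification head retrained) on the target dataset.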
This thesis makes significant and multifaceted contributions to research on MML. It provides an extensive analysis of MML and its challenges, including existing approaches to alignment and fusion, focusing on their limitations and identifying gaps in current research. Moreover, it introduces an effective approach for labelling unlabelled datasets through a series of carefully designed inferential steps, charting a path towards more efficient and scalable multimodal learning. Finally, it demonstrates the considerable potential of transfer learning, particularly with a transformer-based model, to advance multimodal reasoning. The insights, techniques, and results presented in this thesis hold the potential to open new directions in MML research and to support the development of more effective, scalable, and data-efficient models for tackling real-world challenges across a wide range of applications.