Multimodal Data Retrieval

The brain is among the most complicated systems in nature: it learns remarkably well while consuming very little energy. If we regard the brain as a computer, neurons and their connections are its hardware, while the mind is its operating system. The mind maps concepts from one application to another, makes decisions based on all of the information it receives, and ultimately forms the basis of computational thinking in human beings. Image processing, speech processing, and computer vision are all applications of this complicated system. Machine learning has so far succeeded in simulating many of these applications, but not the brain's operating system, the mind. One reason is that such a system must handle different data modalities, such as image, audio, and video; it must also support abstraction and association, and above all it must interpret, that is, exhibit computational thinking. Recent advances in data recording have produced many such modalities: text, image, audio, and video. Different modalities describe different aspects of a single abstraction; images are annotated with text, and audio accompanies video.

In this research, we aim to learn a shared abstraction model by exploiting the relationships among different modalities, guided by the intuition of the two brain hemispheres. The goal is twofold: first, to learn a model that outperforms existing methods in both intra-modality and inter-modality retrieval; and second, to be able to generate one modality from the other.

To this end, we extract a high-level representation for each modality with a modality-specific generalized denoising stacked autoencoder. Instead of merging these high-level representations, we keep them separate, and each level of one modality is reconstructed from the previous level of the other modality through cross edges. The proposed method also attaches a shadow network to each modality, which learns the concepts missing from that modality. Furthermore, we propose a novel representation-binding fine-tuning for multimodal deep networks, which drives two related data points toward the same representation in the higher layers; this fine-tuning can exploit any amount of supervision information. In our experiments, the proposed method outperforms state-of-the-art retrieval methods on the PASCAL-Sentence dataset.
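As a rough illustration of these components, the sketch below (in PyTorch) implements a single level of a per-modality denoising autoencoder, cross edges that reconstruct each modality's input from the other modality's code, and a representation-binding loss that pulls paired codes together. All dimensions, names, the noise level, and the loss weights are illustrative assumptions; the shadow networks and the full stacking and fine-tuning procedure are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenoisingAE(nn.Module):
    """One modality-specific denoising autoencoder (a single level of the stack)."""

    def __init__(self, in_dim, code_dim, noise_std=0.1):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, in_dim)

    def forward(self, x):
        # Corrupt the input, encode it, then reconstruct the clean input.
        noisy = x + self.noise_std * torch.randn_like(x)
        code = self.encoder(noisy)
        return code, self.decoder(code)


class CrossModalModel(nn.Module):
    """Two autoencoders kept separate, connected by cross edges."""

    def __init__(self, img_dim, txt_dim, code_dim):
        super().__init__()
        self.img_ae = DenoisingAE(img_dim, code_dim)
        self.txt_ae = DenoisingAE(txt_dim, code_dim)
        # Cross edges: reconstruct each modality from the other's code.
        self.txt_to_img = nn.Linear(code_dim, img_dim)
        self.img_to_txt = nn.Linear(code_dim, txt_dim)

    def forward(self, img, txt):
        img_code, img_recon = self.img_ae(img)
        txt_code, txt_recon = self.txt_ae(txt)
        img_from_txt = self.txt_to_img(txt_code)
        txt_from_img = self.img_to_txt(img_code)
        return img_code, txt_code, img_recon, txt_recon, img_from_txt, txt_from_img


def loss_fn(model, img, txt, bind_weight=1.0):
    img_code, txt_code, img_recon, txt_recon, img_x, txt_x = model(img, txt)
    recon = F.mse_loss(img_recon, img) + F.mse_loss(txt_recon, txt)
    cross = F.mse_loss(img_x, img) + F.mse_loss(txt_x, txt)
    # Representation binding: paired data should share one high-level code.
    bind = F.mse_loss(img_code, txt_code)
    return recon + cross + bind_weight * bind


# Toy usage on random features (sizes are arbitrary assumptions).
model = CrossModalModel(img_dim=512, txt_dim=300, code_dim=128)
img, txt = torch.randn(8, 512), torch.randn(8, 300)
loss_fn(model, img, txt).backward()

# Inter-modality retrieval then reduces to nearest-neighbor search in the
# shared code space (no input corruption at test time).
with torch.no_grad():
    q = F.normalize(model.img_ae.encoder(img), dim=1)    # image queries
    d = F.normalize(model.txt_ae.encoder(txt), dim=1)    # text database
    ranks = (q @ d.t()).argsort(dim=1, descending=True)  # per-query ranking
```

Because the binding loss drives paired codes toward one another, both intra- and inter-modality retrieval can use the same similarity search in this shared code space.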

People involved: Sarah Rastegar, Hamid R. Rabiee

Sharif University of Technology