Semi-Supervised Multimodal Representation Learning Through a Global Workspace.

Abstract

Recent deep learning models can efficiently combine inputs from different modalities (e.g., images and text) and learn to align their latent representations or to translate signals from one domain to another (as in image captioning or text-to-image generation). However, current approaches mainly rely on brute-force supervised training over large multimodal datasets. In contrast, humans (and other animals) can learn useful multimodal representations from only sparse experience with matched cross-modal data. Here, we evaluate the capabilities of a neural network architecture inspired by the cognitive notion of a “global workspace” (GW): a shared representation for two (or more) input modalities. Each modality is processed by a specialized system (pretrained on unimodal data and subsequently frozen). The corresponding latent representations are then encoded to and decoded from a single shared workspace. Importantly, this architecture is amenable to self-supervised training via cycle-consistency: encoding-decoding sequences should approximate the identity function. For various pairings of vision-language modalities and across two datasets of varying complexity, we show that such an architecture can be trained to align and translate between two modalities with very little need for matched data (from four to seven times less than a fully supervised approach). The GW representation can be used advantageously for downstream classification and cross-modal retrieval tasks and for robust transfer learning. Ablation studies reveal that both the shared workspace and the self-supervised cycle-consistency training are critical to the system’s performance.

Show Full Text

Correlation-Preserving Photo Collage.

Advancements in Learning-Based Navigation Systems for Robotic Applications in MRO Hangar: Review.

A novel method-based reinforcement learning with deep temporal difference network for flexible double shop scheduling problem.

Real-Time Dense Reconstruction with Binocular Endoscopy Based on StereoNet and ORB-SLAM.

Autoencoders reloaded.

Road Surface Classification Using a Deep Ensemble Network with Sensor Feature Selection.

Leave a Reply Cancel reply

Researchers

Journal

Modalities

Models

Abstract

Similar Posts

Leave a Reply Cancel reply