> Abstract

In this work, we propose a new unsupervised image segmentation approach based on maximizing the mutual information between different constructed views of the inputs. Taking inspiration from autoregressive generative models, which predict the current pixel from past pixels in a raster-scan ordering created with masked convolutions, we propose to apply different orderings over the inputs, implemented with various forms of masked convolutions, to construct different views of the data. For a given input, the model produces a pair of predictions under two valid orderings and is then trained to maximize the mutual information between the two outputs. These outputs can either be low-dimensional features for representation learning or output clusters corresponding to semantic labels for clustering. While masked convolutions are used during training, no masking is applied at inference time and we fall back to the standard convolution, where the model has access to the full input. The proposed method outperforms the current state-of-the-art on unsupervised image segmentation. It is simple and easy to implement, can be extended to other visual tasks, and can be integrated seamlessly into existing unsupervised learning methods that require different views of the data.

> Highlights

  • We introduce a novel unsupervised method for image segmentation based on maximizing the mutual information between different views of the input, where the views themselves arise from the output of autoregressive models operating on different orderings.
  • We propose various forms of masked convolutions to generate all 8 possible raster-scan type orderings.
  • We extend the masked convolutional layers with attention blocks for a larger receptive field and a larger set of possible orderings.
  • We show improved performance over the previous state-of-the-art on unsupervised image segmentation.
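To make the "8 possible raster-scan-type orderings" concrete, they can be enumerated as the standard left-to-right, top-to-bottom scan applied after every combination of horizontal flip, vertical flip, and transpose. The sketch below (our illustrative reading, using NumPy; the method itself realizes these orderings with differently masked convolutions rather than explicit permutations) lists them as permutations of pixel indices:

```python
import itertools
import numpy as np

def raster_orderings(h, w):
    """Enumerate the 8 raster-scan-type orderings of an h x w grid,
    each returned as a permutation of the pixel indices 0..h*w-1.

    Every ordering is the standard left-to-right, top-to-bottom scan
    applied after some combination of horizontal flip, vertical flip,
    and transpose of the grid."""
    base = np.arange(h * w).reshape(h, w)
    orderings = []
    for flip_h, flip_v, transpose in itertools.product([False, True], repeat=3):
        grid = base
        if flip_h:
            grid = grid[:, ::-1]   # mirror columns: right-to-left scan
        if flip_v:
            grid = grid[::-1, :]   # mirror rows: bottom-to-top scan
        if transpose:
            grid = grid.T          # column-major instead of row-major
        orderings.append(grid.flatten())
    return orderings
```

For a grid with at least two rows and two columns, all 8 resulting permutations are distinct, which is what allows sampling genuinely different views of the same input.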

> Overview

In generative autoregressive models such as PixelCNN, an image is factorized as a product of conditionals over its pixels. The generative model is trained to predict the current pixel from past pixel values in a raster-scan fashion (i.e., from left to right, top to bottom) using masked convolutions. In this work, we propose an autoregressive model based on masked convolutions that takes a given unlabeled image as input and produces two predictions, where each prediction depends on a particular view of the input image obtained by applying a given ordering during the forward pass. Instead of using a single left-to-right, top-to-bottom ordering, we use several orderings obtained with different forms of masked convolutions and attention mechanisms. The various orderings over the input pixels, or over the intermediate representations, are then treated as different views of the input image, and the model is trained to maximize the mutual information between the outputs over these different views.
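A minimal PixelCNN-style masked convolution can be sketched as follows (in PyTorch; the class name `MaskedConv2d` and the type-A/type-B convention are standard PixelCNN terminology, not this paper's exact implementation). The mask zeroes out all kernel weights that would read "future" pixels under the left-to-right, top-to-bottom ordering:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """2D convolution whose kernel is masked so that each output position
    only sees 'past' pixels in a raster-scan ordering (left to right,
    top to bottom), as in PixelCNN.

    mask_type 'A' also hides the current pixel (used in the first layer);
    mask_type 'B' allows it (used in subsequent layers)."""

    def __init__(self, *args, mask_type="A", **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        _, _, kh, kw = self.weight.shape
        mask = torch.ones(kh, kw)
        # Zero out the center pixel (type A) or keep it (type B),
        # everything to its right on the center row, and all rows below.
        mask[kh // 2, kw // 2 + (mask_type == "B"):] = 0
        mask[kh // 2 + 1:, :] = 0
        self.register_buffer("mask", mask[None, None])  # broadcast over channels

    def forward(self, x):
        self.weight.data *= self.mask  # re-apply mask before each convolution
        return super().forward(x)
```

The remaining raster-scan-type orderings follow by flipping or transposing this mask (or, equivalently, the input), which is how a pair of differently ordered views of the same image can be produced by the same network.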

Our approach is generic and can be applied to both clustering and representation learning. For a clustering task (figure below, left), we apply a pair of distinct orderings to a given input image, producing two pixel-level predictions in the form of probability distributions over the semantic classes. We then maximize the mutual information between the two outputs at each corresponding spatial location and its immediate neighbors. Maximizing the mutual information helps avoid degeneracy (e.g., uniform output distributions) and trivial solutions (e.g., assigning all of the pixels to the same cluster). For representation learning (figure below, right), we maximize a lower bound on the mutual information between the two output feature maps over the different views.
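The clustering objective can be sketched as an IIC-style mutual information loss between the paired soft cluster assignments (a simplified sketch in PyTorch; the function name is ours, and it omits the neighboring-location term described above). Because I(z1; z2) = H(z) - H(z1 | z2), maximizing it simultaneously pushes toward confident, consistent assignments and toward balanced cluster usage, which is what rules out the degenerate solutions:

```python
import torch

def mutual_info_loss(p1, p2, eps=1e-8):
    """Negative mutual information between two soft cluster assignments.

    p1, p2: (N, C) tensors of per-pixel class probabilities produced
    from two differently ordered views of the same pixels.
    Returns -I(z1; z2), to be minimized."""
    n = p1.shape[0]
    joint = p1.t() @ p2 / n                    # (C, C) joint over cluster pairs
    joint = ((joint + joint.t()) / 2).clamp(min=eps)  # symmetrize, avoid log(0)
    marg1 = joint.sum(dim=1, keepdim=True)     # (C, 1) marginal of view 1
    marg2 = joint.sum(dim=0, keepdim=True)     # (1, C) marginal of view 2
    mi = (joint * (joint.log() - marg1.log() - marg2.log())).sum()
    return -mi
```

Identical, confident assignments spread across clusters yield high mutual information (low loss), while a uniform output distribution yields zero mutual information, so minimizing this loss discourages both trivial solutions.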

Figure: Overview. Given an encoder-decoder type network \(\mathcal{F}\) and two valid orderings (\(o_1\), \(o_2\)) as illustrated in (c), the goal is to maximize the Mutual Information (MI) between the two outputs over the different views, i.e., the different orderings. (a) For Autoregressive Clustering (AC), we output the cluster assignments in the form of a probability distribution over pixels, and the goal is to have similar assignments regardless of the applied ordering. (b) For Autoregressive Representation Learning (ARL), the objective is to have similar representations at each corresponding spatial location and its neighbors over a window of small displacements \(\Omega\).

> Qualitative Results