<p>The 2021 CVPR conference, one of the main computer vision and machine learning conferences, concluded its second 100% virtual version
last week with a record number of papers presented at the main conference. Of about 7500 submissions, 5900 made it to the decision-making process
and 1660 papers (vs 1467 papers last year) were accepted, for an acceptance rate of 23.7% (vs 22.1% last year).
Such a huge (and growing) number of papers can be a bit overwhelming, so to get a feel for the general trends at the conference this year, I will present in this blog post a quick look at the conference, summarizing some papers (and listing others) that seemed interesting to me.</p>
<p>First, let’s start with some useful links:</p>
<ul>
<li>Papers: <a href="https://openaccess.thecvf.com/CVPR2021?day=all">CVPR2021 open access</a></li>
<li>Workshops: <a href="http://cvpr2021.thecvf.com/workshops-schedule">CVPR2021 workshops</a></li>
<li>Tutorials: <a href="http://cvpr2021.thecvf.com/program">CVPR2021 tutorials</a></li>
<li>Presentations: <a href="https://crossminds.ai/search/?keyword=CVPR%202021&filter=">Crossminds</a></li>
<li>Papers search interface: <a href="https://blog.kitware.com/demos/cvpr-2021-papers/?filter=authors&search=">blog.kitware.com</a> & <a href="https://public.tableau.com/views/CVPR2021/Dashboard1?:language=en-US&:display_count=n&:origin=viz_share_link:showVizHome=no">public.tableau.com</a></li>
<li>Awards: <a href="http://cvpr2021.thecvf.com/node/329">CVPR2021 paper awards</a></li>
<li>Papers digest: <a href="https://www.paperdigest.org/2021/06/cvpr-2021-highlights/">CVPR2021 Paper Digest</a></li>
<li>Papers & code: <a href="https://github.com/amusi/CVPR2021-Papers-with-Code">CVPR2021 paper & code</a></li>
</ul>
<p><em>Note: This post is not an objective representation of the papers and subjects presented at CVPR 2021; it is just a personal overview of what I found interesting. Any feedback is welcome!</em></p>
<h2 id="table-of-contents">Table of Contents</h2>
<ul>
<li><a href="#cvpr-2021-in-numbers">CVPR 2021 in numbers</a></li>
<li><a href="#recognition-detection--tracking">Recognition, Detection & Tracking</a></li>
<li><a href="#model-architectures--learning-methods">Model Architectures & Learning Methods</a></li>
<li><a href="#3d-computer-vision">3D Computer Vision</a></li>
<li><a href="#image-and-video-synthesis">Image and Video Synthesis</a></li>
<li><a href="#scene-analysis--understanding">Scene Analysis & Understanding</a></li>
<li><a href="#representation--adversarial-learning">Representation & Adversarial Learning</a></li>
<li><a href="#transfer-low-shot-semi--unsupervised-learning">Transfer, Low-shot, Semi & Unsupervised Learning</a></li>
<li><a href="#computational-photography">Computational Photography</a></li>
<li><a href="#other">Other Subjects</a></li>
</ul>
<h1 id="cvpr-2021-in-numbers">CVPR 2021 in numbers</h1>
<p>A portion of the statistics presented in this section is taken from <a href="https://github.com/hoya012/CVPR-2021-Paper-Statistics">this</a> github repo & this <a href="https://public.tableau.com/views/CVPR2021/Dashboard1?:language=en-US&:display_count=n&:origin=viz_share_link:showVizHome=no">public tableau gallery</a>.</p>
<figure style="width: 60%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/acceptance_rate.png" />
<figcaption>
Image source: <a href="https://github.com/hoya012/CVPR-2021-Paper-Statistics">github</a>
</figcaption>
</figure>
<p>The trends of earlier years continued, with a notable increase in the number of authors and submitted papers, joined by a rise in the number of reviewers and area chairs to accommodate this expansion.</p>
<figure style="width: 60%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/authors_by_country.png" />
</figure>
<p>Similar to the last two years, China is the leading contributor to CVPR in terms of accepted papers, followed by the USA, Korea, the UK, and Germany.</p>
<figure style="width: 90%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/subjects.png" />
<figcaption>
Image source: <a href="https://public.tableau.com/views/CVPR2021/Dashboard1?:language=en-US&:display_count=n&:origin=viz_share_link:showVizHome=nos">public tableau gallery</a>
</figcaption>
</figure>
<p>As expected, the majority of the accepted papers focus on topics related to learning, recognition, detection, and understanding. However, the topic of the year is 3D computer vision with more than 200 papers focusing on this subject alone, followed by deep & representation
learning, image synthesis, and computational photography. There is also a notable increase in papers related to explainable AI and medical & biological imaging.</p>
<h1 id="recognition-detection--tracking">Recognition, Detection & Tracking</h1>
<h4 id="task-programming-learning-data-efficient-behavior-representations-paper">Task Programming: Learning Data Efficient Behavior Representations (<a href="https://arxiv.org/abs/2011.13917">paper</a>)</h4>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/1.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2011.13917">Sun et al.</a></figcaption>
</figure>
<p>In behavioral analysis, the location and pose of agents are first extracted from each frame of a behavior video, and then labels are generated for a given set of behaviors of interest on a frame-by-frame basis, based on the pose and movements of the agents, as depicted in the figure above.
However, to predict the behaviors frame-by-frame in the second step, we need to train behavior detection models, which are data-intensive and require specialized domain knowledge and high-frequency temporal annotations. This paper studies two alternative ways to better use domain experts instead of simply increasing the number of annotations: (1) self-supervised learning and (2) task programming, where domain experts create engineered decoder tasks for representation learning.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/2.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2011.13917">Sun et al.</a></figcaption>
</figure>
<p>Trajectory Embedding for Behavior Analysis (TREBA) uses these two ideas to learn task-relevant low-dimensional representations of pose trajectories. This is done
by training a network to jointly reconstruct the input trajectory and predict the expert-programmed decoder tasks (see figure above). For the first task, given a set
of unlabelled trajectories, where each trajectory is a sequence of states (eg, the location or pose of the agents), the history of the agent states is
encoded using an RNN encoder, and the RNN decoder (ie, a <a href="https://arxiv.org/abs/1806.02813">trajectory variational autoencoder or TVAE</a>) then predicts the next states.
As for the second task, domain experts first need to create decoder tasks for trajectory self-supervised learning (a process the authors call Task Programming).
First, the experts identify the attributes of the trajectory data that are useful for detecting the agents’ behaviors of interest, and then write a program
to compute these attributes (eg, the distance or angle between two interacting mice) from the trajectory data using systems like <a href="https://www.biorxiv.org/content/10.1101/2020.07.26.222299v1">MARS</a> or <a href="https://www.biorxiv.org/content/10.1101/2020.04.19.049452v2">SimBA</a>. These programs are finally used to generate training
data for self-supervised multi-task learning.</p>
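<p>To make the idea of a task program more concrete, here is a minimal numpy sketch of what such an expert-written program could look like; the keypoint layout and the function name are made up for illustration and are not taken from MARS, SimBA, or the paper.</p>
<pre><code class="language-python">
import numpy as np

def program_nose_to_nose_distance(trajectory):
    """trajectory: array of shape (num_frames, 2 agents, num_keypoints, 2).

    Returns one scalar attribute per frame: the distance between the two
    agents' nose keypoints (keypoint index 0 here, by assumption).
    """
    nose_a = trajectory[:, 0, 0, :]   # (num_frames, 2)
    nose_b = trajectory[:, 1, 0, :]   # (num_frames, 2)
    return np.linalg.norm(nose_a - nose_b, axis=-1)

# Each such program yields a per-frame target; TREBA adds one decoder head per
# program and trains it jointly with the trajectory reconstruction objective.
traj = np.random.rand(1000, 2, 7, 2)           # fake trajectory for illustration
targets = program_nose_to_nose_distance(traj)  # (1000,) per-frame decoding targets
</code></pre>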
<h4 id="permute-quantize-and-fine-tune-efficient-compression-of-neural-networks-paper">Permute, Quantize, and Fine-tune: Efficient Compression of Neural Networks (<a href="https://arxiv.org/abs/2010.15703">paper</a>)</h4>
<p>One of the main challenges of deploying deep nets on mobile and low-power computational platforms for large scale usage is their large memory and computational requirements.
Luckily, such deep nets are often overparameterized, leaving some room for compression to reduce their memory and computational demands with a minimal accuracy hit.
One way to compress deep nets is scalar quantization, which compresses each parameter individually, but the compression rates are still limited by the number of parameters.
Another approach is vector quantization, which compresses multiple parameters into a single code, thus exploiting redundancies among groups of network parameters, but
finding the parameters to group can be challenging (eg, in fully connected layers, there is no notion of spatial dimensions to group the parameters by).</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/3.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2010.15703">Martinez et al.</a></figcaption>
</figure>
<p>The paper proposes Permute, Quantize, and Fine-tune (PQF), a method that (1) searches for permutations of the network weights that yield a functionally equivalent, yet easier-to-quantize network, (2) quantizes the permuted weights, and finally, (3) fine-tunes the permuted-and-quantized network to recover the accuracy of the original uncompressed network. Since the second step consists of splitting each weight matrix \(W\) into subvectors (say \(d\) elements per subvector) which are then compressed individually, where each subvector is approximated by a smaller code or centroid, the objective of the first step is to find a permutation \(P\) of the weight matrix such that the constructed subvectors are easier to quantize into codes (see figure above). The optimization of \(P\) is done by minimizing the determinant of the covariance of the resulting subvectors (see the paper for why this is a good criterion). Finally, the compressed network is fine-tuned to remove the accumulated errors in the activations after quantization, which degrade performance.</p>
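<p>As a rough illustration of the quantization side of the method (not the authors’ implementation), the sketch below splits a weight matrix into \(d\)-dimensional subvectors, scores a candidate permutation by the log-determinant of the subvector covariance (the criterion mentioned above), and quantizes the subvectors with a plain k-means; all sizes and the naive random permutation search are arbitrary choices for illustration.</p>
<pre><code class="language-python">
import numpy as np

def subvectors(W, d):
    # Group every d consecutive entries of each row into one subvector.
    return W.reshape(-1, d)

def permutation_score(W, perm, d):
    # Permute columns, then measure how "spread out" the subvectors are.
    X = subvectors(W[:, perm], d)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(d)
    return np.linalg.slogdet(cov)[1]

def kmeans_quantize(X, k, iters=20):
    # Plain k-means: each subvector is replaced by its nearest centroid.
    centroids = X[np.random.choice(len(X), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = X[assign == c].mean(0)
    return centroids, assign

W = np.random.randn(256, 256)   # toy weight matrix
d, k = 8, 64
# Naive permutation search: keep the best of a few random column permutations.
best_perm = min((np.random.permutation(W.shape[1]) for _ in range(50)),
                key=lambda p: permutation_score(W, p, d))
codebook, codes = kmeans_quantize(subvectors(W[:, best_perm], d), k)
W_compressed = codebook[codes].reshape(W.shape)  # still permuted; fine-tuning follows
</code></pre>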
<h4 id="towards-open-world-object-detection-paper">Towards Open World Object Detection (<a href="https://arxiv.org/abs/2103.02603">paper</a>)</h4>
<p>In this work, the authors propose a novel computer vision problem called <em>Open World Object Detection</em>, where a model is trained to: 1) identify objects that have not been introduced to it as <em>unknown</em> without explicit supervision to do so (ie, open set learning), and 2) incrementally learn these identified unknown categories without forgetting previously learned classes when the corresponding labels are progressively received (ie, incremental and continual learning). In <a href="https://ieeexplore.ieee.org/document/6365193">open set</a> recognition, the objective is to identify the new instances not seen during training as unknowns, while <a href="https://arxiv.org/abs/1412.5687">open world</a> recognition extends this framework by requiring the classifier to recognize the newly identified unknown classes.
However, adapting the open set and open world methods from recognition to detection is not trivial, since the object detector is explicitly trained to treat unknown objects as background, making the task of detecting unknown classes harder.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/4.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2103.02603">Joseph et al.</a></figcaption>
</figure>
<p>To solve this problem, the paper proposes ORE, an Open World Object Detector that learns a clear discrimination between classes in the latent space. This way, (1) the task of detecting an unknown instance as a novelty can be reduced to comparing its representation with the representations of the known instances (ie, open set detection), and (2) it facilitates learning feature representations for the new class instances without overlapping with the previous classes (ie, incremental and continual learning, thus extending open set to open world detection). To obtain such a learning behavior, a contrastive clustering objective is introduced in order to force instances of the same class to remain close-by, while instances of dissimilar classes are pushed far apart. This is done by minimizing the distances between the class prototypes and the class representations, where unknown instances are proposals with a high objectness score that do not overlap with any ground-truth object.
The classification head of the trained detector is then transformed into an <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf">energy function</a> to label an instance as unknown or not.</p>
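<p>The sketch below shows one plausible form of such a prototype-based contrastive clustering loss, purely for illustration; the exact formulation, margin value, and prototype-update schedule used in the paper may differ.</p>
<pre><code class="language-python">
import torch
import torch.nn.functional as F

def contrastive_clustering_loss(features, labels, prototypes, margin=10.0):
    """features: (N, D) RoI embeddings, labels: (N,) class ids (unknown included),
    prototypes: (num_classes, D) running class means."""
    dists = torch.cdist(features, prototypes)              # (N, num_classes)
    pos = dists.gather(1, labels.view(-1, 1)).squeeze(1)   # distance to own prototype
    neg = F.relu(margin - dists)                            # hinge on the other prototypes
    neg.scatter_(1, labels.view(-1, 1), 0.0)                # ignore the positive column
    return (pos + neg.sum(dim=1)).mean()
</code></pre>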
<h4 id="learning-calibrated-medical-image-segmentation-via-multi-rater-agreement-modeling-paper">Learning Calibrated Medical Image Segmentation via Multi-Rater Agreement Modeling (<a href="https://openaccess.thecvf.com/content/CVPR2021/html/Ji_Learning_Calibrated_Medical_Image_Segmentation_via_Multi-Rater_Agreement_Modeling_CVPR_2021_paper.html">paper</a>)</h4>
<p>For a standard vision task, it is common practice to adopt ground-truth labels obtained either via majority vote or simply from one annotation by a preferred rater
as the single source of training data. However, for medical images, the typical practice consists of collecting multiple annotations, each from a different clinical expert or rater, in the expectation that possible diagnostic errors can be mitigated. In this case, using the standard training procedure of other vision tasks overlooks the rich information of agreement or disagreement ingrained in the raw multi-rater annotations available in medical image analysis.
To take this into consideration, the paper proposes MRNet to explicitly model the multi-rater agreement or disagreement.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/5.png" />
<figcaption>Image source: <a href="https://openaccess.thecvf.com/content/CVPR2021/html/Ji_Learning_Calibrated_Medical_Image_Segmentation_via_Multi-Rater_Agreement_Modeling_CVPR_2021_paper.html">Ji et al.</a></figcaption>
</figure>
<p>MRNet contains a coarse-to-fine two-stage processing pipeline (figure above, right). The first stage consists of a U-Net encoder with a ResNet34 backbone pretrained on ImageNet, together with an Expertise-aware Inferring Module (EIM) inserted at the bottleneck layer to embed the expertise information of individual raters, named the expertness vector, into the extracted high-level semantic features of the network. The outputs are then passed to the U-Net decoder to give a coarse prediction. The second stage then refines the coarse predictions using two modules. The first is a Multi-rater Reconstruction Module (MRM) that reconstructs the raw multi-rater gradings; these reconstructions are then used to estimate a pixel-wise uncertainty map that represents the inter-observer variability across different regions. Finally, a Multi-rater Perception Module (MPM) with a soft attention mechanism utilizes the produced uncertainty map to refine the coarse predictions of the first stage and predict the final fine segmentation maps (see section 3 of the paper for details about each component).</p>
<h4 id="re-labeling-imagenet-from-single-to-multi-labels-from-global-to-localized-paper">Re-Labeling ImageNet: From Single to Multi-Labels, From Global to Localized (<a href="https://arxiv.org/abs/2101.05022">paper</a>)</h4>
<p>One of the flaws of ImageNet is the presence of a significant level of label noise, where many instances contain multiple classes while having a single-label ground-truth. This mismatch between the single label annotations of ImageNet and the multi-label nature of its images becomes even more problematic with random-crop training, where the input may contain an entirely different class than the ground-truth.
To alleviate this problem, the paper proposes to generate the multi-labels using a strong image classifier trained on an extra source of data,
in addition to leveraging pixel-wise multi-label predictions before the final pooling layer as a complementary location-specific supervision signal.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/6.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2101.05022">Yun et al.</a></figcaption>
</figure>
<p>Re-labeling ImageNet consists of using a machine annotator, which is a classifier trained on super-ImageNet scale datasets (like <a href="https://arxiv.org/abs/1707.02968">JFT-300M</a> and <a href="https://arxiv.org/abs/1805.00932">InstagramNet-1B</a>) and fine-tuned on ImageNet. While trained on single-label classification, such classifiers are still
capable of multi-label predictions for images with multiple categories, making them suitable for relabeling ImageNet. In addition to generating the multi-label classes, Re-labeling also leverages location-specific labels, which are the spatial features before the global pooling layer weighted by the weights of the last fully-connected layer. The model is then trained on both the original ImageNet labels and the generated location-specific labels. For the second part, the paper proposes LabelPooling, which conducts a regional pooling (<a href="https://arxiv.org/abs/1703.06870">RoI Align</a>) operation on the label map produced by the machine annotator (pretrained then fine-tuned classifier), at the coordinates of the random crop
(see figure above, right), to generate the multi-label targets used for the multi-label loss.</p>
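<p>Below is a small sketch of the LabelPooling idea using torchvision’s RoIAlign: a dense label map (assumed to be precomputed by the machine annotator) is pooled at the random-crop coordinates to produce crop-specific soft multi-label targets. The shapes and the final softmax are illustrative assumptions, not the paper’s exact recipe.</p>
<pre><code class="language-python">
import torch
from torchvision.ops import roi_align

label_map = torch.rand(1, 1000, 15, 15)   # dense label map from the machine annotator
# Random-crop box in label-map coordinates: (batch_index, x1, y1, x2, y2).
crop_box = torch.tensor([[0.0, 2.0, 3.0, 10.0, 12.0]])
pooled = roi_align(label_map, crop_box, output_size=(1, 1))   # (1, 1000, 1, 1)
multi_label_target = pooled.flatten(1).softmax(dim=1)          # (1, 1000) soft labels

# The model trained on the crop is then supervised against this soft multi-label
# target instead of the single original ImageNet label.
</code></pre>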
<h4 id="other-papers-to-check-out">Other papers to check out</h4>
<ul>
<li><a href="https://arxiv.org/abs/2103.09460">You Only Look One-Level Feature</a>.</li>
<li><a href="https://arxiv.org/abs/2103.16483">Benchmarking Representation Learning for Natural World Image Collections</a>.</li>
<li><a href="https://arxiv.org/abs/2101.01909">Line Segment Detection Using Transformers Without Edges</a>.</li>
<li><a href="https://arxiv.org/abs/2103.11511">MoViNets: Mobile Video Networks for Efficient Video Recognition</a>.</li>
<li><a href="https://arxiv.org/abs/2103.11624">Multimodal Motion Prediction With Stacked Transformers</a>.</li>
<li><a href="https://arxiv.org/abs/2105.11595">SiamMOT: Siamese Multi-Object Tracking</a>.</li>
</ul>
<h1 id="model-architectures--learning-methods">Model Architectures & Learning Methods</h1>
<h4 id="pre-trained-image-processing-transformer-paper">Pre-Trained Image Processing Transformer (<a href="https://arxiv.org/abs/2012.00364">paper</a>)</h4>
<p>This paper presents Image Processing Transformer (IPT), a transformer model for low-level computer vision tasks (mainly denoising, super-resolution and deraining).
IPT has four main components: multiple heads for extracting features from the inputs, which are the corrupted images such as noisy or low-resolution images, where each head is a small convolutional net (3 layers) that outputs \(C\)-dimensional feature maps with the same spatial dimensions as the input; standard encoder-decoder transformer blocks for feature refinement and information recovery, with task embeddings introduced as input seeds into the decoder; and finally, multiple tails that map the transformer’s output features back into the image space.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/32.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2012.00364">Chen et al.</a></figcaption>
</figure>
<p>For pre-training, IPT is trained on corrupted ImageNet inputs. For a given image, a corruption function is first applied (such as bicubic degradation to generate low-resolution images for super-resolution, or adding Gaussian noise for denoising), and the model is then trained with an L1 loss between the image reconstructed from the output of IPT and the clean image. Additionally, in order to make IPT applicable to low-level tasks outside of the introduced corruptions, they also train with a contrastive loss where the model is trained to maximize the similarity between the features of patches coming from the same input image. After pre-training, the model can then be fine-tuned on the low-level task of choice.</p>
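<p>A toy sketch of this pre-training step, assuming a generic model standing in for the heads + transformer + tails pipeline and showing only the denoising corruption and the L1 reconstruction loss (the contrastive term is omitted):</p>
<pre><code class="language-python">
import torch
import torch.nn.functional as F

def corrupt_for_denoising(clean, sigma=0.1):
    # Gaussian-noise corruption; super-resolution would instead use bicubic
    # down-sampling, deraining would add synthetic rain streaks, etc.
    return clean + sigma * torch.randn_like(clean)

def pretrain_step(model, clean_batch, optimizer):
    corrupted = corrupt_for_denoising(clean_batch)
    restored = model(corrupted)                 # heads, transformer, then tails
    loss = F.l1_loss(restored, clean_batch)     # reconstruction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</code></pre>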
<h4 id="repvgg-making-vgg-style-convnets-great-again-paper">RepVGG: Making VGG-Style ConvNets Great Again (<a href="https://arxiv.org/abs/2101.03697">paper</a>)</h4>
<p>The popular CNN architectures, while delivering good results, still have some drawbacks:
1) The complicated multi-branch designs, such as residual connections in ResNet and branch concatenation in Inception, make the models more difficult to implement/customize, slow down inference, and reduce memory utilization. 2) Some components, such as depthwise convs in MobileNets and channel shuffle in ShuffleNets, increase the memory access cost and lack support on various devices.
This paper takes a step back in time, and proposes RepVGG, a network with VGG-like design, where at inference-time, the network is composed of nothing but a stack of 3x3 convolutions and ReLUs.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/33.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2101.03697">Ding et al.</a></figcaption>
</figure>
<p>While plain CNNs have many strengths, they possess one fatal weakness: poor performance :). So in the design of RepVGG, a multi-branch architecture is introduced at training time, and at inference time, RepVGG falls back to the plain single-branch architecture for efficiency. In order to do this, the design of RepVGG is based on what the paper calls <em>Structural Re-param</em>, which describes how to convert a trained multi-branch block into a single 3x3 conv layer for inference.
Each training-time block has three branches: a 3x3 conv, a 1x1 conv, and an identity mapping (which can be viewed as a 1x1 conv with an identity kernel), each followed by a batch norm, and the output of the block is the sum of the outputs of the three branches. At inference time, RepVGG first converts each conv & its following batch norm into a single conv with a bias vector. Then, the remaining 3 convs are combined into a single 3x3 conv layer by adding up the biases and adding up the kernels (where the center of the 3x3 kernel receives the single value of the 1x1 kernel).</p>
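<p>A minimal sketch of this structural re-parameterization, assuming a training-time block exposing conv3x3/bn3x3, conv1x1/bn1x1 and bn_id attributes (hypothetical names), with matching input/output channels, stride 1 and no groups:</p>
<pre><code class="language-python">
import torch
import torch.nn.functional as F

def fuse_conv_bn(kernel, bn):
    # Fold batch-norm statistics into the preceding conv's kernel and bias.
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std                               # (out_channels,)
    fused_kernel = kernel * scale.reshape(-1, 1, 1, 1)
    fused_bias = bn.bias - bn.running_mean * scale
    return fused_kernel, fused_bias

def reparameterize(block):
    # Attribute names (conv3x3, bn3x3, conv1x1, bn1x1, bn_id) are assumed here.
    k3, b3 = fuse_conv_bn(block.conv3x3.weight, block.bn3x3)
    # 1x1 branch: zero-pad its kernel so its value sits at the center of a 3x3 kernel
    k1, b1 = fuse_conv_bn(block.conv1x1.weight, block.bn1x1)
    k1 = F.pad(k1, [1, 1, 1, 1])
    # identity branch, expressed as a 3x3 kernel with 1 at the center
    c = block.bn_id.num_features
    kid = torch.zeros(c, c, 3, 3)
    kid[torch.arange(c), torch.arange(c), 1, 1] = 1.0
    kid, bid = fuse_conv_bn(kid, block.bn_id)
    # the inference-time block is a single 3x3 conv with the summed kernel and bias
    return k3 + k1 + kid, b3 + b1 + bid
</code></pre>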
<figure style="width: 50%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/34.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2101.03697">Ding et al.</a></figcaption>
</figure>
<h4 id="bottleneck-transformers-for-visual-recognition-paper">Bottleneck Transformers for Visual Recognition (<a href="https://arxiv.org/abs/2101.11605">paper</a>)</h4>
<figure style="width: 50%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/36.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2101.11605">Srinivas et al.</a></figcaption>
</figure>
<p>BoTNet, or Bottleneck Transformer, consists of a simple adjustment of the ResNet architecture to incorporate self-attention.
This is done by just replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet, with no other changes. With such an adjustment, the resulting BoTNet improves upon the ResNet baselines significantly on instance segmentation and object detection while also reducing the parameters, with minimal overhead in latency.</p>
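<p>A toy sketch of the modification: inside a bottleneck block, the 3x3 conv is replaced with multi-head self-attention over all spatial positions. The relative position encodings used by BoTNet are omitted here for brevity, so this is only an illustration of the general idea.</p>
<pre><code class="language-python">
import torch
import torch.nn as nn

class ToyMHSA2d(nn.Module):
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C): one token per pixel
        out, _ = self.attn(tokens, tokens, tokens)
        return out.transpose(1, 2).reshape(b, c, h, w)

# In a bottleneck block: 1x1 conv, then ToyMHSA2d, then 1x1 conv, keeping the
# residual connection as in the original ResNet.
y = ToyMHSA2d(64)(torch.randn(2, 64, 14, 14))
</code></pre>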
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/35.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2101.11605">Srinivas et al.</a></figcaption>
</figure>
<h4 id="scaling-local-self-attention-for-parameter-efficient-visual-backbones-paper">Scaling Local Self-Attention for Parameter Efficient Visual Backbones (<a href="https://arxiv.org/abs/2103.12731">paper</a>)</h4>
<p>While self-attention models have recently been shown to provide notable improvements on accuracy-parameter trade-offs compared to baseline convolutional models such as ResNet-50, they are still not on par with the high-performing convolutional models
such as <a href="https://arxiv.org/abs/1905.11946">EfficientNet</a> (<a href="https://arxiv.org/pdf/2104.00298.pdf">V2</a>).
In order to close this gap, this paper proposes a new self-attention model family called HaloNets, based
on a more efficient implementation of self-attention that improves the speed, memory usage, and accuracy of these models.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/37.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2103.12731">Vaswani et al.</a></figcaption>
</figure>
<p>Global self-attention, in which all locations attend to each other, is too expensive for most image scales due to its computational cost, which is quadratic in the number of pixels. Thus, self-attention based models like <a href="https://arxiv.org/abs/1906.05909">SASA</a>
use a local form of self-attention that aggregates the information in a \(k \times k\) neighborhood around each pixel, similar to convolutions. However, to do this,
we need to extract local 2D grids around each pixel, and such an operation can be quite expensive both computationally and memory-wise, since for each pixel we need to fetch \(k^2\) pixels, and this operation contains a lot of duplicates since two neighboring pixels share
most of their neighbors (\(k \times (k-1)\) out of \(k^2\)). To solve this, HaloNets use blocked local self-attention (see figure above), where the local neighborhood for a block of pixels is extracted once together, instead of extracting separate neighborhoods per pixel. This operation consists of first dividing the input tensor into non-overlapping blocks, where each block behaves as a group of query pixels, and then constructing a shared neighborhood around each block, which is used to compute the keys and values and finally the outputs. This way, we only compute one neighborhood per block instead of one neighborhood per pixel (see section 2.2 for more details). After defining such an operation, HaloNets are designed based on a similar architecture to ResNets, with blocked local self-attention instead of convolutions.</p>
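<p>The sketch below illustrates blocked local self-attention with a halo under simplifying assumptions (single head, no learned projections, values equal to keys, and spatial dimensions divisible by the block size); it is meant to show the per-block neighborhood extraction, not to reproduce the paper’s implementation.</p>
<pre><code class="language-python">
import torch
import torch.nn.functional as F

def halo_attention(x, block=4, halo=1):
    b_, c, h, w = x.shape
    # queries: one token per pixel, grouped into non-overlapping blocks
    q = x.reshape(b_, c, h // block, block, w // block, block)
    q = q.permute(0, 2, 4, 3, 5, 1).reshape(b_, -1, block * block, c)
    # keys/values: the haloed neighborhood, extracted once per block with unfold
    k = F.unfold(x, kernel_size=block + 2 * halo, stride=block, padding=halo)
    k = k.reshape(b_, c, (block + 2 * halo) ** 2, -1).permute(0, 3, 2, 1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / c ** 0.5, dim=-1)
    out = attn @ k                       # values == keys in this toy version
    # fold the per-block outputs back into a (B, C, H, W) feature map
    out = out.reshape(b_, h // block, w // block, block, block, c)
    return out.permute(0, 5, 1, 3, 2, 4).reshape(b_, c, h, w)

y = halo_attention(torch.randn(2, 8, 16, 16))
</code></pre>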
<h4 id="involution-inverting-the-inherence-of-convolution-for-visual-recognition-paper">Involution: Inverting the Inherence of Convolution for Visual Recognition (<a href="https://arxiv.org/abs/2103.06255">paper</a>)</h4>
<p>Convolution kernels have two main properties: spatial-agnostic, where the same kernel is applied to all of the spatial locations of the input volume, and channel-specific, where many kernels are applied over the same input, resulting in an output with multiple channels in order to collect diverse information. These two properties result in enhanced efficiency and make the convolution operation translation-equivariant. However, such a design comes with some limitations, such as a constrained receptive field, where a single conv operation can’t capture long-range spatial interactions, and a possible inter-channel redundancy inside the convolution filters.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/38.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2103.06255">Li et al.</a></figcaption>
</figure>
<p>This paper takes a contrarian approach and proposes an operation called <em>involution</em> that switches the two properties of convolutions,
resulting in a spatial-specific and channel-agnostic operation. To have a spatial-specific operation,
the involution kernel belonging to a specific spatial location is generated solely conditioned on the incoming feature vector at that location. This is implemented in two steps: first, a 1x1 conv whose output channels correspond to the size of the kernel (\(C = k^2\)) is applied, generating spatial-location-specific kernel weights (each spatial location has its own conv kernel). The output at each location is then reshaped from a vector into a \(k \times k\) spatial kernel, which is applied over the \(k \times k\) neighborhood of the input at that location, followed by an aggregation (a weighted sum) over the neighborhood.
As for channel-agnostic, it is obtained by simply sharing the generated kernel over all channels (or groups of channels). This operation is then used to design RedNet, a ResNet-style model with involutions.</p>
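<p>A simplified, single-group sketch of the involution operation described above; the bottleneck in the kernel-generation branch and the grouped channel sharing used in RedNet are omitted here, so this is only an illustration of the core idea.</p>
<pre><code class="language-python">
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyInvolution2d(nn.Module):
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.kernel_gen = nn.Conv2d(channels, k * k, kernel_size=1)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        kernel = self.kernel_gen(x)             # (B, k*k, H, W), one kernel per location
        kernel = kernel.reshape(b, 1, self.k * self.k, h, w)
        patches = F.unfold(x, self.k, padding=self.k // 2)   # (B, C*k*k, H*W)
        patches = patches.reshape(b, c, self.k * self.k, h, w)
        # weighted sum over the k*k neighborhood, kernel shared across channels
        return (kernel * patches).sum(dim=2)    # (B, C, H, W)

y = ToyInvolution2d(16)(torch.randn(2, 16, 32, 32))
</code></pre>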
<h4 id="on-feature-normalization-and-data-augmentation-paper">On Feature Normalization and Data Augmentation (<a href="https://arxiv.org/abs/2002.11102">paper</a>)</h4>
<p>The usage of normalization techniques such as batch norm in recognition models has become standard practice, where the moments (ie, mean and standard deviation) of latent features are often removed when training image recognition models, which helps increase their stability and reduce the training time. However, such moments can play a much more central role in some vision tasks like image generation, where they capture style and shape information of an image and can be instrumental in the generation process. In this context, this paper proposes <em>Moment Exchange</em>, or MoEx, an implicit data augmentation method that encourages the recognition model to utilize the moment information for better performance and better robustness.</p>
<figure style="width: 80%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/39.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2002.11102">Li et al.</a></figcaption>
</figure>
<p>MoEx is an operation applied in the feature space in order to systematically regulate how much attention a network pays to the signal in the feature moments. As illustrated above, given an input, the mean and variance across channels are first extracted after the first layer. Then, instead of removing them, they are swapped with the moments of another image, extracted in the same manner. This results in a set of features that contain information about both images, and the model is then trained to predict an interpolation of the labels of the two inputs. This way, the model is pushed to focus on two different signals for classification:
the normalized features of the first image and the moments of the second.</p>
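<p>A hedged sketch of the moment-exchange step and the interpolated loss; the choice of layer, the interpolation weight, and other details are hyper-parameters in the paper, and the values below are only illustrative.</p>
<pre><code class="language-python">
import torch
import torch.nn.functional as F

def moment_exchange(feat_a, feat_b, eps=1e-5):
    # feat_*: (C, H, W) features of two images after the first layer;
    # moments are computed across channels, per spatial position.
    mean_a, std_a = feat_a.mean(0, keepdim=True), feat_a.std(0, keepdim=True) + eps
    mean_b, std_b = feat_b.mean(0, keepdim=True), feat_b.std(0, keepdim=True) + eps
    # normalize A, then re-inject B's moments
    return (feat_a - mean_a) / std_a * std_b + mean_b

def moex_loss(logits, label_a, label_b, lam=0.9):
    # the network must predict both sources of signal (A's normalized content
    # and B's moments), hence the interpolated classification loss
    return lam * F.cross_entropy(logits, label_a) + (1 - lam) * F.cross_entropy(logits, label_b)
</code></pre>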
<h4 id="other-papers-to-check-out-model-architectures">Other papers to check out</h4>
<ul>
<li><a href="https://arxiv.org/abs/1811.10725">MIST: Multiple Instance Spatial Transformer</a></li>
<li><a href="https://arxiv.org/abs/2012.07177">Simple Copy-Paste Is a Strong Data Augmentation Method for Instance Segmentation</a></li>
<li><a href="https://arxiv.org/abs/2104.14107">Decoupled Dynamic Filter Networks</a></li>
<li><a href="https://arxiv.org/abs/2104.11487">Skip-Convolutions for Efficient Video Processing</a></li>
<li><a href="https://arxiv.org/abs/2104.09052">Metadata Normalization</a></li>
</ul>
<h1 id="3d-computer-vision">3D Computer Vision</h1>
<h4 id="mp3-a-unified-model-to-map-perceive-predict-and-plan-paper">MP3: A Unified Model to Map, Perceive, Predict and Plan (<a href="https://arxiv.org/abs/2101.06806">paper</a>)</h4>
<p>Most modern self-driving stacks require up-to-date high-definition maps that contain rich semantic information necessary for driving, such as the topology and location of the lanes, crosswalks, traffic lights, intersections as well as the traffic rules for each lane. While such maps greatly facilitate the perception and motion forecasting tasks, as the online inference process has to mainly focus on dynamic objects (eg, vehicles, pedestrians, cyclists), scaling them is hard given their complexity and cost, and given that even very small errors in the mapping might result in fatal mistakes. This motivates the development of mapless technology, which can serve as a fail-safe alternative in the case of localization failures or outdated maps, and potentially unlock self-driving at scale at a much lower cost.</p>
<figure style="width: 60%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/10.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2101.06806">Casas et al.</a></figcaption>
</figure>
<p>However, with a mapless approach comes a number of challenges: (1) The sole source of training signal is the controls of an expert driver (such as steering and acceleration), without providing intermediate interpretable representations that can help explain the self-driving vehicle’s decisions. (2) Without any mechanism to inject structure and prior knowledge, such an approach can be very brittle to distributional shift, such as missing a lane. To address these issues, the paper presents MP3, an end-to-end approach to mapless driving that is interpretable, does not incur any information loss, and reasons about uncertainty in the intermediate representations.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/11.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2101.06806">Casas et al.</a></figcaption>
</figure>
<p>The MP3 model takes as input a high-level goal, a history of LiDAR point clouds to extract rich geometric and semantic features from the scene over time (see <a href="https://arxiv.org/abs/2012.12395">this</a> for more details), and odometry data to compensate for the vehicle’s motion.
The inputs are then processed using a backbone network and fed into a set of probabilistic spatial layers to model the static and dynamic parts of the environment.
The static environment is represented by a planning-centric online map which captures information about which areas are drivable and which ones are reachable given the traffic rules. The dynamic actors are captured in a novel occupancy flow that provides occupancy and velocity estimates over time. The motion planning module then leverages these representations
to retrieve dynamically feasible trajectories, predicts a spatial mask over the map to estimate the route given an abstract goal, and leverages the online map and occupancy flow directly as cost functions for explainable, safe plans.</p>
<h4 id="multi-modal-fusion-transformer-for-end-to-end-autonomous-driving-paper">Multi-Modal Fusion Transformer for End-to-End Autonomous Driving (<a href="https://arxiv.org/abs/2104.09224">paper</a>)</h4>
<p>In a standard driving situation such as the one depicted in the left image below, the vehicle must capture the global context of the scene involving the interaction between the traffic light (yellow) and the vehicles (red) for safe navigation. To do this, the different modalities (ie, the LiDAR point cloud and the camera view) must be fused together in order to obtain such a global view. This raises the following questions: how to fuse such multi-modal representations? to what extent should each modality be processed independently before fusion? and how can such fusion be conducted? The paper proposes the Multi-Modal Fusion Transformer (TransFuser), a transformer-based model designed to integrate both LiDAR and camera views with global attention, thus capturing the necessary 3D context for safe navigation.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/9.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2104.09224">Prakash et al.</a></figcaption>
</figure>
<p>The paper considers the task of point-to-point navigation in an urban setting, where the goal is to complete a given route while safely reacting to other dynamic agents and following traffic rules. The TransFuser is trained using an L1 loss between the predicted trajectories and the correct trajectories using a dataset consisting of high-dimensional observations of the environment, and the corresponding expert trajectory.
TransFuser takes as input RGB images and <a href="https://arxiv.org/abs/1905.01296">LiDAR BEV</a> representations and uses several transformer modules to fuse the intermediate feature maps between both modalities. The fusion is applied at multiple resolutions throughout the feature extractor, resulting in a 512-dimensional feature vector output from both the image and LiDAR BEV streams, which is the desired compact representation of the environment that encodes the global context of the 3D scene. This compact representation can then be used as input to an auto-regressive (GRU) prediction network that outputs the trajectories in the form of waypoints in the vehicle’s coordinate frame.</p>
<h4 id="neural-lumigraph-rendering-paper">Neural Lumigraph Rendering (<a href="https://arxiv.org/abs/2103.11571">paper</a>)</h4>
<p>The recent <a href="https://arxiv.org/abs/2004.03805">neural rendering</a> techniques are capable of generating photorealistic image quality for novel view synthesis and 3D shape estimation from 2D images. But they are either slow to train and/or require considerable rendering time for high image resolutions. <a href="https://arxiv.org/abs/2003.08934">NeRF</a>, for instance, does not offer real-time rendering due to the use of a neural scene representation and <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.128.3394">volume rendering</a>.
To overcome this, the paper proposes using an SDF-based sinusoidal representation network (<a href="https://arxiv.org/abs/2006.09661">SIREN</a>) to implicitly model the surface of objects (ie, implicitly defining an object or a scene using a neural network and training directly with 3D data), which can be extracted using the <a href="https://dl.acm.org/doi/10.1145/37401.37422">marching cubes algorithm</a> and exported into traditional mesh-based representations for real-time rendering.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/7.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2103.11571">Kellnhofer et al.</a></figcaption>
</figure>
<p>However, given the high capacity of SIRENs, they are prone to overfitting, making them incapable of rendering new views that are interpolations of views encountered during training. To solve this, the authors propose a novel smoothness loss function that maintains SIREN’s high-capacity encoding in the image domain while constraining it in the angular domain to prevent overfitting on these views. The SIREN based network can then be trained using a sparse set of multi-view 2D images while providing a high-quality 3D surface that can be directly exported for real-time rendering algorithms at test time.</p>
<h4 id="nerv-neural-reflectance-and-visibility-fields-for-relighting-and-view-synthesis-paper">NeRV: Neural Reflectance and Visibility Fields for Relighting and View Synthesis (<a href="https://arxiv.org/abs/2012.03927">paper</a>)</h4>
<p>NeRV extends <a href="https://arxiv.org/abs/2003.08934">NeRF</a> to arbitrary lighting conditions, taking as input a set of images of a scene illuminated by unconstrained lighting, and producing a 3D representation that can be rendered from novel viewpoints under novel and unobserved lighting conditions.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/8.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2012.03927">Srinivasan et al.</a></figcaption>
</figure>
<p>Given that NeRF models just the amount of outgoing light from a location, the fact that this outgoing light is itself the result of interactions between incoming light and the material properties of an underlying surface is ignored, so rendering viewpoints
under novel lighting conditions is not possible. A naive way (figure above, right) to solve this is to query NeRF’s MLP for the volume density at samples along the camera ray to determine the amount of light reflected at each location that reaches the camera, and then, for each location,
query the MLP for the volume density at points between the location and every light source. This procedure is clearly computationally infeasible. The paper solves this by proposing NeRV, a method to train a NeRF-like model that can simulate realistic environment lighting and global illumination by using an additional MLP as a lookup table into a visibility field during rendering. As a result, the training consists of jointly optimizing the visibility MLP, which estimates the light source visibility at a given 3D position, alongside the shape and reflectance MLPs used for the volumetric representation. At test time, rendering is then conducted along the ray by querying the shape and reflectance MLPs for the volume densities, surface normals, and <a href="https://en.wikipedia.org/wiki/Bidirectional_reflectance_distribution_function">BRDF</a> parameters at each point.</p>
<h4 id="neural-body-implicit-neural-representations-with-structured-latent-codes-for-novel-view-synthesis-of-dynamic-humans-paper">Neural Body: Implicit Neural Representations With Structured Latent Codes for Novel View Synthesis of Dynamic Humans (<a href="https://arxiv.org/abs/2012.15838">paper</a>)</h4>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/12.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2012.15838">Peng et al.</a></figcaption>
</figure>
<p>Given a very sparse set of camera views (say 3 or 4 views as depicted above), learning implicit neural representations of 3D scenes becomes infeasible. To solve this, the paper proposes Neural Body, a method that leverages observations over video frames in order to learn
a new human body representation that is consistent over the different frames and shares the same set of latent codes anchored to a deformable mesh, so that the observations across frames can be naturally integrated.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/13.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2012.15838">Peng et al.</a></figcaption>
</figure>
<p>First, a set of latent codes is defined and anchored to the vertices of the <a href="https://smpl.is.tue.mpg.de/">SMPL</a> deformable human model, so that the spatial location of each vertex varies with the human pose. Then, for a given frame, the 3D human pose is estimated from the sparse camera views and used to position the anchored latent codes. Finally, the network is trained to regress the density and color of any 3D point based on these latent codes. Both the latent codes and the network are jointly learned from images of all video frames during the reconstruction process.</p>
<h4 id="pixelnerf-neural-radiance-fields-from-one-or-few-images-paper">pixelNeRF: Neural Radiance Fields From One or Few Images (<a href="https://arxiv.org/abs/2012.02190">paper</a>)</h4>
<p>The <a href="https://arxiv.org/abs/2003.08934">NeRF</a> framework consists of optimizing the representation of every scene independently, requiring many views per scene and significant compute time. pixelNeRF adapts NeRF so that it can be trained across multiple scenes jointly to learn a scene prior, enabling it to perform novel view synthesis in a feed-forward manner from a sparse set of views (as few as one). This is done by
using spatial features of a given image produced by a CNN, which are aligned to each pixel and passed as input to the NeRF model during training. This simple image conditioning allows the model to be trained on a set of multi-view images, where it can learn scene priors to perform view synthesis.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/14.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2012.02190">Yu et al.</a></figcaption>
</figure>
<p>pixelNeRF consists of two components: a fully-convolutional image encoder, which encodes the input image into a pixel-aligned feature grid, and a NeRF network which outputs color and density given a spatial location and its corresponding encoded feature.
When multiple input views are available, each view is encoded into its own feature grid, and the multiple features are processed in parallel and then aggregated into the final color and opacity.</p>
<h4 id="other-papers-to-check-out-1">Other papers to check out</h4>
<ul>
<li><a href="https://arxiv.org/abs/2012.03065">Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction</a></li>
<li><a href="https://arxiv.org/abs/2008.02268">NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections</a></li>
<li><a href="https://arxiv.org/abs/2011.13961">D-NeRF: Neural Radiance Fields for Dynamic Scenes</a></li>
<li><a href="https://arxiv.org/abs/2012.09365">Learning to recover 3D Scene Shape from a single image</a></li>
<li><a href="https://arxiv.org/abs/2104.00681">NeuralRecon: Real-Time Coherent 3D Reconstruction From Monocular Video</a></li>
<li><a href="https://arxiv.org/abs/2012.00230">Point2Skeleton: Learning Skeletal Representations from Point Clouds</a></li>
<li><a href="https://arxiv.org/abs/2103.03319">Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos</a></li>
<li><a href="https://arxiv.org/abs/2103.05606">NeX: Real-Time View Synthesis With Neural Basis Expansion</a></li>
<li><a href="https://arxiv.org/abs/2103.06422">Holistic 3D Scene Understanding From a Single Image With Implicit Representation</a></li>
<li><a href="https://arxiv.org/abs/2012.02206">Scan2Cap: Context-aware Dense Captioning in RGB-D Scans</a></li>
<li><a href="https://arxiv.org/abs/2012.01451">Neural Deformation Graphs for Globally-consistent Non-rigid Reconstruction</a></li>
<li><a href="https://arxiv.org/abs/2012.09165">Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts</a></li>
<li><a href="https://arxiv.org/abs/2012.01203">Learning Delaunay Surface Elements for Mesh Reconstruction</a></li>
<li><a href="https://arxiv.org/abs/2103.01261">A Deep Emulator for Secondary Motion of 3D Characters</a></li>
<li><a href="https://arxiv.org/abs/2103.00762">NeuTex: Neural Texture Mapping for Volumetric Neural Rendering</a></li>
<li><a href="https://arxiv.org/abs/2011.12490">DeRF: Decomposed Radiance Fields</a></li>
<li><a href="https://arxiv.org/abs/2102.13090">IBRNet: Learning Multi-View Image-Based Rendering</a></li>
<li><a href="https://arxiv.org/abs/2012.02189">Learned Initializations for Optimizing Coordinate-Based Neural Representations</a></li>
<li><a href="https://arxiv.org/abs/2012.01714">AutoInt: Automatic Integration for Fast Neural Volume Rendering</a></li>
<li><a href="https://arxiv.org/abs/2103.03319">Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos</a></li>
<li><a href="https://arxiv.org/abs/2103.09213">Back to the Feature: Learning Robust Camera Localization From Pixels To Pose</a></li>
<li><a href="https://arxiv.org/abs/2106.03336">Wide-Baseline Relative Camera Pose Estimation With Directional Learning</a></li>
<li><a href="https://arxiv.org/abs/2004.07484">Pulsar: Efficient Sphere-Based Neural Rendering</a></li>
<li><a href="https://arxiv.org/abs/2101.10994">Neural Geometric Level of Detail: Real-Time Rendering With Implicit 3D Shapes</a></li>
<li><a href="https://arxiv.org/abs/2101.06720">Deep Multi-Task Learning for Joint Localization, Perception, and Prediction</a></li>
<li><a href="https://arxiv.org/abs/2104.11224">KeypointDeformer: Unsupervised 3D Keypoint Discovery for Shape Control</a></li>
<li><a href="https://arxiv.org/abs/2102.06195">Shelf-Supervised Mesh Prediction in the Wild</a></li>
<li><a href="https://arxiv.org/abs/2104.00680">LoFTR: Detector-Free Local Feature Matching With Transformers</a></li>
<li><a href="https://arxiv.org/abs/2104.06397">Shape and Material Capture at Home</a></li>
<li><a href="https://arxiv.org/abs/2006.14660">SPSG: Self-Supervised Photometric Scene Generation From RGB-D Scans</a></li>
<li><a href="https://arxiv.org/abs/2104.08278">Fusing the Old with the New: Learning Relative Camera Pose with Geometry-Guided Uncertainty</a></li>
<li><a href="https://arxiv.org/abs/2105.02047">Cuboids Revisited: Learning Robust 3D Shape Fitting to Single RGB Images</a></li>
<li><a href="https://arxiv.org/abs/2011.11814">MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments From a Single Moving Camera</a></li>
<li><a href="https://arxiv.org/abs/2012.09159">DECOR-GAN: 3D Shape Detailization by Conditional Refinement</a></li>
</ul>
<h1 id="image-and-video-synthesis">Image and Video Synthesis</h1>
<h4 id="giraffe-representing-scenes-as-compositional-generative-neural-feature-fields-paper">GIRAFFE: Representing Scenes As Compositional Generative Neural Feature Fields (<a href="https://arxiv.org/abs/2011.12100">paper</a>)</h4>
<p>While GANs are capable of generating photorealistic and diverse high-resolution images, fine-grained control over the factors of variation in the data and the compositionality of the generated scenes is still limited, since they operate only in 2D and ignore the three-dimensional nature of the underlying scenes.</p>
<figure style="width: 80%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/15.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2011.12100">Niemeyer et al.</a></figcaption>
</figure>
<p>GIRAFFE proposes to represent scenes as compositional generative neural feature fields. During training, instead of having a single latent code as in the standard GAN setting, GIRAFFE randomly generates a set of <em>Shape</em> and <em>Appearance</em> codes for each object (& background) in the scene, which are used to generate feature fields. Then a second set of latent codes is generated, this time representing the pose transformations, which are applied to the generated feature fields to obtain the posed feature fields. Finally, the posed feature fields are aggregated into a single scene representation and, given a camera pose, are used as input to the neural rendering network to generate a 2D image. The generated and real 2D images are then passed to the discriminator to compute the adversarial loss. The whole model is trained end-to-end, and at test time, the composition of the generated images can be controlled using the Shape & Appearance and Pose latent codes.</p>
<h4 id="geosim-realistic-video-simulation-via-geometry-aware-composition-for-self-driving-paper">GeoSim: Realistic Video Simulation via Geometry-Aware Composition for Self-Driving (<a href="https://arxiv.org/abs/2101.06543">paper</a>)</h4>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/16.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2101.06543">Chen et al.</a></figcaption>
</figure>
<p>The ability to simulate/enhance various real-world scenarios is an important yet challenging open problem, especially for safety-critical domains such as self-driving, where producing visually appealing and realistic results requires physics-based rendering, which is very costly. As an alternative, GeoSim exploits the recent advances in image synthesis by combining data-driven approaches (such as generative modeling) and computer graphics to insert dynamic objects into existing videos, while maintaining high visual quality through physically grounded simulation (see the example above).</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/17.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2101.06543">Chen et al.</a></figcaption>
</figure>
<p>The first step of GeoSim is to create a large set of 3D assets (vehicles of different types and shapes) with accurate pose, shape and texture. Instead of relying on artists to create these assets, GeoSim leverages publicly available datasets to construct 3D assets of the objects. This is done using a learning-based, multi-view, multi-sensor reconstruction approach that leverages the 3D bounding boxes and is trained in a self-supervised manner so that there is agreement between the predicted 3D shape and the camera and LiDAR observations.
GeoSim then exploits the 3D scene layout from high-definition maps and LiDAR data to add these learned 3D assets in plausible locations and make them behave realistically by considering the full scene. Finally, using this new 3D scene, GeoSim performs image-based rendering to properly handle occlusions, and neural network-based image in-painting to ensure the inserted object seamlessly blends in by filling holes, adjusting color inconsistencies due to lighting changes, and removing sharp boundaries.
Check out <a href="https://tmux.top/publication/geosim/">the paper’s website</a> for some results.</p>
<h4 id="taming-transformers-for-high-resolution-image-synthesis-paper">Taming Transformers for High-Resolution Image Synthesis (<a href="https://arxiv.org/abs/2012.09841">paper</a>)</h4>
<p>The recently introduced vision transformers (such as <a href="https://arxiv.org/abs/2010.11929">ViT</a>) demonstrated that they can perform on par with
CNNs, and that given enough training data, they tend to learn convolutional structures. This raises an obvious question: do we have to relearn such an inductive bias
from scratch each time we train a vision model? This paper proposes to merge both CNNs and transformers into a single framework for image synthesis, thus leveraging the
efficiency of CNNs and their inductive image biases while still retaining the flexibility of transformers.</p>
<figure style="width: 90%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/18.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2012.09841">Esser et al.</a></figcaption>
</figure>
<p>As depicted above, the proposed framework consists of a CNN encoder-decoder network trained adversarially for <a href="https://arxiv.org/abs/1711.00937">Neural Discrete Representation Learning</a>, and a transformer that operates over the discrete representations in an autoregressive manner. More specifically, the training consists of two stages. First, the encoder, the decoder and the discriminator, all CNN-based, are trained using a reconstruction loss, <a href="https://arxiv.org/abs/1801.03924">a perceptual loss</a>, a commitment loss (ie, used to refine the codebook, see <a href="https://arxiv.org/abs/1711.00937">VQVAE</a> for more details), and an adversarial loss. At the end of training, we end up with a learned codebook where each spatial location in the input image can be represented by an index in the codebook. With such a formulation, the input image can be viewed as a sequence of codebook indices, and this sequence can be used to train the auto-regressive transformer in the second stage of the training process. Starting from the top-left corner, at each time step, the transformer is tasked with predicting the next codebook index, and in order to reduce the computation, the input to the transformer is restricted to a sliding window without a significant loss in performance. Finally, at test time, we can use the trained transformer to generate large sequences without any restrictions (and with any type of conditioning, see section 3.2 of the paper), which correspond to very large images. The predicted indices are then used to fetch the discrete representations from the codebook, which are passed to the decoder to synthesize an image.</p>
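<p>As a small illustration of the first stage’s output, the sketch below maps encoder features to their nearest codebook entries, turning an image into the grid of discrete indices that the transformer later models autoregressively; the codebook size and feature dimensions are arbitrary and not the paper’s values.</p>
<pre><code class="language-python">
import torch

codebook = torch.randn(1024, 256)      # (num_codes, code_dim), learned in stage 1
z = torch.randn(1, 256, 16, 16)        # encoder output for one image

flat = z.flatten(2).transpose(1, 2).reshape(-1, 256)    # (16*16, code_dim)
indices = torch.cdist(flat, codebook).argmin(dim=1)     # nearest code per position
z_quantized = codebook[indices].reshape(1, 16, 16, 256).permute(0, 3, 1, 2)

# `indices` (length 256 here) is exactly the kind of sequence the autoregressive
# transformer is trained on; at test time, generated indices are looked up in the
# codebook and passed to the decoder to synthesize an image.
sequence = indices.reshape(1, -1)      # (1, 256) token sequence
</code></pre>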
<h4 id="rethinking-and-improving-the-robustness-of-image-style-transferpaper">Rethinking and Improving the Robustness of Image Style Transfer (<a href="https://arxiv.org/abs/2104.05623">Paper</a>)</h4>
<p>The objective of image style transfer is to map the content of a given image into the style of a different one, and in such a task, the VGG network has demonstrated
a remarkable ability to capture the visual style of an image. However, when such a network is replaced with a more modern and better-performing network such as
ResNet, the stylization performance degrades significantly, as shown in the figure below.</p>
<figure style="width: 50%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/19.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2104.05623">Wang et al.</a></figcaption>
</figure>
<p>In this paper, the authors investigate the root cause of this behavior, and find that residual connections, which represent the main architectural difference between VGG and ResNet, produce peaky feature maps of small entropy, which are not suitable for style transfer. To improve the stylization of the ResNet architecture, the authors then propose a simple yet effective solution based on a softmax transformation of the feature activations that enhances their entropy. This method, dubbed Stylization With Activation smoothinG (SWAG), consists of adding a softmax-based smoothing transformation to all of the activations in order to push the model to produce smoother activations, thus reducing large peaks and increasing small values, creating a more uniform distribution.</p>
<h4 id="learning-continuous-image-representation-with-local-implicit-image-function-paper">Learning Continuous Image Representation With Local Implicit Image Function (<a href="https://arxiv.org/abs/2012.09161">Paper</a>)</h4>
<p>This paper proposes the Local Implicit Image Function (LIIF) for representing natural and complex images in a continuous manner. With LIIF, an image is represented as a set of latent codes distributed across the spatial dimensions. Given a coordinate, the decoding function queries the local latent codes around it and predicts the RGB value at that coordinate as output. Since the LIIF representation is continuous, we can query arbitrarily high target resolutions, up to 30x higher than the resolution encountered during training.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/20.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2012.09161">Chen et al.</a></figcaption>
</figure>
<p>The proposed framework consists of an encoder that produces 2D feature maps given an input image, where the feature maps are evenly distributed in the 2D space of the continuous image domain, and each feature at a given spatial location is called a latent code. The decoder then takes as input a 2D coordinate in the image domain, in addition to a weighted average of the 4 nearest latent codes around that coordinate, and outputs the RGB values. To train both the encoder and the decoder jointly with self-supervision, a training image is randomly down-sampled, the encoder encodes the down-sampled image, and the decoder is queried to produce the RGB values of the original image, which serves as the ground truth.</p>
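<p>The query step can be sketched as follows; this simplified version fetches only the single nearest latent code with <code>grid_sample</code>, whereas the paper averages the four nearest codes and also feeds cell-size information, so treat the shapes and the tiny decoder below as placeholders.</p>
<pre><code class="language-python">
import torch
import torch.nn.functional as F

# toy MLP decoder mapping a latent code plus a 2D coordinate to an RGB value
decoder = torch.nn.Sequential(
    torch.nn.Linear(64 + 2, 256), torch.nn.ReLU(), torch.nn.Linear(256, 3)
)

def query_rgb(latent_codes, coords):
    # latent_codes: (B, 64, H, W) encoder output; coords: (B, N, 2) in [-1, 1]
    sampled = F.grid_sample(latent_codes, coords.unsqueeze(1), mode='nearest',
                            align_corners=False)              # (B, 64, 1, N)
    sampled = sampled.squeeze(2).permute(0, 2, 1)              # (B, N, 64)
    return decoder(torch.cat([sampled, coords], dim=-1))       # (B, N, 3) RGB values
</code></pre>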
<h4 id="other-papers-to-check-out-2">Other papers to check out</h4>
<ul>
<li><a href="https://arxiv.org/abs/2104.14551">Ensembling with Deep Generative Views</a></li>
<li><a href="https://arxiv.org/abs/2007.08211">SSN: Soft Shadow Network for Image Compositing</a></li>
<li><a href="https://arxiv.org/abs/2012.02992">Spatially-Adaptive Pixelwise Networks for Fast Image Translation</a></li>
<li><a href="https://arxiv.org/abs/2104.15060">DriveGAN: Towards a Controllable High-Quality Neural Simulation</a></li>
<li><a href="https://arxiv.org/abs/2104.11280">Motion Representations for Articulated Animation</a></li>
<li><a href="https://arxiv.org/abs/2101.12195">Playable Video Generation</a></li>
<li><a href="https://arxiv.org/abs/2103.16183">Repopulating Street Scenes</a></li>
<li><a href="https://arxiv.org/abs/2011.15128">Animating Pictures With Eulerian Motion Fields</a></li>
<li><a href="https://arxiv.org/abs/2007.06600">Closed-Form Factorization of Latent Semantics in GANs</a></li>
<li><a href="https://arxiv.org/abs/2011.08114">Stylized Neural Painting</a></li>
<li><a href="https://arxiv.org/abs/2011.13775">Image Generators With Conditionally-Independent Pixel Synthesis</a></li>
<li><a href="https://arxiv.org/abs/2101.04702">Cross-Modal Contrastive Learning for Text-to-Image Generation </a></li>
<li><a href="https://arxiv.org/abs/2011.10063">Dual Contradistinctive Generative Autoencoder</a></li>
<li><a href="https://arxiv.org/abs/2011.12950">Space-Time Neural Irradiance Fields for Free-Viewpoint Video</a></li>
<li><a href="https://arxiv.org/abs/2012.05217">Positional Encoding As Spatial Inductive Bias in GANs</a></li>
<li><a href="https://arxiv.org/abs/2104.03310">Regularizing Generative Adversarial Networks Under Limited Data</a></li>
<li><a href="https://arxiv.org/abs/2104.02416">Variational Transformer Networks for Layout Generation</a></li>
<li><a href="https://arxiv.org/abs/2104.02495">Deep Animation Video Interpolation in the Wild</a></li>
<li><a href="https://arxiv.org/abs/2011.07233">Stable View Synthesis</a></li>
<li><a href="https://arxiv.org/abs/2011.13786">Navigating the GAN Parameter Space for Semantic Image Editing</a></li>
<li><a href="https://arxiv.org/abs/2105.04551">Stochastic Image-to-Video Synthesis Using cINNs</a></li>
<li><a href="https://arxiv.org/abs/2104.14754">Exploiting Spatial Dimensions of Latent in GAN for Real-Time Image Editing</a></li>
<li><a href="https://arxiv.org/abs/2106.05375">Plan2Scene: Converting Floorplans to 3D Scenes</a></li>
<li><a href="https://arxiv.org/abs/2101.06541">SceneGen: Learning to Generate Realistic Traffic Scenes</a></li>
<li><a href="https://openaccess.thecvf.com/content/CVPR2021/html/Bowen_OCONet_Image_Extrapolation_by_Object_Completion_CVPR_2021_paper.html">OCONet: Image Extrapolation by Object Completion</a></li>
<li><a href="https://arxiv.org/abs/2103.03243">Anycost GANs for Interactive Image Synthesis and Editing</a></li>
<li><a href="https://arxiv.org/abs/2011.12799">StyleSpace Analysis: Disentangled Controls for StyleGAN Image Generation</a></li>
<li><a href="https://arxiv.org/abs/2008.00951">Encoding in Style: A StyleGAN Encoder for Image-to-Image Translation</a></li>
<li><a href="https://arxiv.org/abs/2103.17185">Rethinking Style Transfer: From Pixels to Parameterized Brushstrokes</a></li>
<li><a href="https://arxiv.org/abs/2011.15126">One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing</a></li>
<li><a href="https://arxiv.org/abs/2103.00430">Training Generative Adversarial Networks in One Stage</a></li>
<li><a href="https://arxiv.org/abs/2007.10379">Generative Hierarchical Features From Synthesizing Images</a></li>
</ul>
<h1 id="scene-analysis--understanding">Scene Analysis & Understanding</h1>
<h4 id="rethinking-semantic-segmentation-from-a-sequence-to-sequence-perspective-with-transformers-paper">Rethinking Semantic Segmentation From a Sequence-to-Sequence Perspective With Transformers (<a href="https://arxiv.org/abs/2012.15840">Paper</a>)</h4>
<p>The proposed SEgmentation TRansformer (SETR) is based on an alternative formulation of semantic segmentation as a sequence-to-sequence task.
Instead of using the standard encoder-decoder architecture, such a formulation makes it possible to employ a pure transformer, without convolutions or resolution reduction, for pixel-level classification, since it is in line with the way transformers operate over
their inputs and produce predictions. It also leverages the capability of transformer layers to model global context, which is important in semantic segmentation for obtaining coherent masks.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/21.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2012.15840">Zheng et al.</a></figcaption>
</figure>
<p>SETR (figure above, a) treats an input image as a sequence of image patches, where each image is first decomposed into fixed-size patches. Each patch is then flattened into a vector of pixel values and passed through a linear layer, yielding the patch embedding. These patch embeddings are passed as a sequence to the transformer encoder (i.e., 24 transformer layers) with global self-attention in order to learn discriminative features tailored for the segmentation task. The produced representations are then reshaped from the 2D shape (number of patches x embedding dimensionality) back into the standard 3D feature map shape (H x W x embedding dimensionality). The reshaped features are then passed to the decoder to predict the final per-pixel classification at the original input size. Here, SETR proposes 3 types of decoders: (1) Naive upsampling: a 2-layer network followed by bilinear upsampling, (2) Progressive UPsampling: alternating conv layers and single 2x bilinear upsampling operations (figure above, b), and (3) Multi-Level feature Aggregation: applying several streams of conv layers and 4x bilinear upsampling over the encoder’s outputs, merging them, and applying a final upsampling (figure above, c).</p>
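<p>The reshaping between the sequence view and the spatial view is simple but central to the design; the toy snippet below (with made-up sizes, not the paper’s configuration) shows how the encoder’s output sequence of patch features is folded back into a 2D feature map before the convolutional decoders.</p>
<pre><code class="language-python">
import torch

patches_per_side, embed_dim = 32, 256           # e.g. a 512x512 image with 16x16 patches
encoded = torch.randn(2, patches_per_side ** 2, embed_dim)   # (B, num_patches, C) from the encoder

feature_map = encoded.transpose(1, 2).reshape(2, embed_dim,
                                              patches_per_side, patches_per_side)
print(feature_map.shape)   # torch.Size([2, 256, 32, 32]), ready for the upsampling decoder
</code></pre>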
<h4 id="max-deeplab-end-to-end-panoptic-segmentation-with-mask-transformers-paper">MaX-DeepLab: End-to-End Panoptic Segmentation With Mask Transformers (<a href="https://arxiv.org/abs/2012.00759">Paper</a>)</h4>
<p>MaX-DeepLab is an end-to-end model for panoptic segmentation without any hand-designed components such as box detection, non-maximum suppression, or thing-stuff merging. MaX-DeepLab directly predicts class-labeled masks with a mask transformer, and is trained with a panoptic quality inspired loss via bipartite matching.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/22.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2012.00759">Wang et al.</a></figcaption>
</figure>
<p>Building upon recent transformer-based end-to-end object detectors such as <a href="https://arxiv.org/abs/2005.12872">DETR</a> or <a href="https://arxiv.org/abs/2010.04159">Deformable DETR</a>, MaX-DeepLab directly outputs non-overlapping masks and their corresponding semantic labels with a mask transformer. The model is trained with a novel panoptic quality style loss (see section 3.2). This loss measures the similarity between ground-truth and predicted class-labeled masks as the product of their mask similarity and their class similarity, and since MaX-DeepLab outputs a larger number of masks than there are ground truths, a one-to-one matching is applied (as in DETR) before computing the loss.</p>
<p>As for the architecture, MaX-DeepLab integrates transformer blocks (called dual-path transformers) along a given CNN backbone in a dual-path fashion, with bidirectional communication blocks between the two paths. The 2D pixel-based CNN features are augmented with a 1D global memory (of the same size as the number of predictions) through different types of attention. Specifically, a dual-path transformer takes as input the 2D CNN features and the 1D memory and applies four types of attention: (1) pixel-to-pixel attention over the CNN features, implemented with <a href="https://arxiv.org/abs/2003.07853">axial-attention</a> since attention over the spatial dimensions is expensive; (2) memory-to-memory attention, updating the memory features with global context; and then the cross-attentions, (3) pixel-to-memory and (4) memory-to-pixel, where each time the queries of one path are applied to the keys and values of the other, updating either the pixel or the memory features conditioned on the other.</p>
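<p>As a hedged sketch of the matching cost, the similarity of a (prediction, ground truth) pair can be written as the product of a class term and a mask term; the Dice-style mask similarity below is an illustration of the idea rather than the paper’s exact loss.</p>
<pre><code class="language-python">
import torch
import torch.nn.functional as F

def pq_style_similarity(pred_mask_logits, pred_class_logits, gt_mask, gt_class):
    # pred_mask_logits: (H, W); pred_class_logits: (num_classes,)
    # gt_mask: (H, W) binary; gt_class: integer label
    class_sim = F.softmax(pred_class_logits, dim=0)[gt_class]
    pred_mask = pred_mask_logits.sigmoid()
    intersection = (pred_mask * gt_mask).sum()
    mask_sim = 2 * intersection / (pred_mask.sum() + gt_mask.sum() + 1e-6)
    return class_sim * mask_sim    # one entry of the matrix used for the one-to-one matching
</code></pre>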
<h4 id="binary-ttc-a-temporal-geofence-for-autonomous-navigation-paper">Binary TTC: A Temporal Geofence for Autonomous Navigation (<a href="https://arxiv.org/abs/2101.04777">Paper</a>)</h4>
<p>Path planning, whether for robotics or automotive applications, requires accurate perception, and one of the main cues used to acquire such perception is depth information. To infer depth, the most popular strategy is to use LiDAR, which estimates depth only at sparse locations and can be quite expensive. An alternative is to rely only on monocular cameras (this is the Tesla approach)
and construct the optical flow between consecutive frames, which carries information about the scene’s depth while greatly reducing acquisition and maintenance costs. But this approach also has its drawbacks, since optical flow can only be estimated reliably in constrained and simple scenes. In this context, and since the objective behind perception is to inform decisions, Binary TTC proposes to replace learning to infer depth with a new and simpler task that can be directly used to inform planning decisions.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/23.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2101.04777">Badki et al.</a></figcaption>
</figure>
<p>The proposed Binary TTC task is based on the concept of <a href="https://pubmed.ncbi.nlm.nih.gov/1834797/">time-to-contact (TTC)</a>, which is the time for an object to collide with the camera plane under the current velocity conditions, and consists of predicting a binary classification map, where objects that will collide with the camera plane within a given time are assigned labels of 1.
More specifically, a binary classification network is trained to detect the objects that will collide with the camera plane within a given time interval. To train the network, the labels are generated using two images of a given dynamic scene: the sizes of each moving object in the two images are compared (after a scaling is applied to take into account the chosen time of collision), and if the size grows from the first to the second image, the object is getting closer to the camera plane and can be labeled as 1.</p>
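<p>Under the usual approximation that time-to-contact is \(\Delta t / (s - 1)\), where \(s\) is the ratio of an object’s apparent size between two frames \(\Delta t\) seconds apart, the label generation can be sketched as below; the variable names and the thresholding are our own simplification of the procedure described above.</p>
<pre><code class="language-python">
def binary_ttc_label(size_t0, size_t1, dt, horizon):
    # size_t0, size_t1: apparent size of the object in two frames dt seconds apart
    # horizon: the chosen time interval for the binary "will it collide?" question
    scale_ratio = size_t1 / size_t0            # above 1 means the object is approaching
    threshold = 1.0 + dt / horizon             # size growth implied by a TTC equal to the horizon
    return 1 if scale_ratio >= threshold else 0

print(binary_ttc_label(size_t0=40.0, size_t1=44.0, dt=0.1, horizon=1.0))  # 1: collides within 1s
</code></pre>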
<h4 id="boosting-monocular-depth-estimation-models-to-high-resolution-via-content-adaptive-multi-resolution-merging-paper">Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging (<a href="https://arxiv.org/abs/2105.14021">Paper</a>)</h4>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/24.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2105.14021">Miangoleh et al.</a></figcaption>
</figure>
<p>The paper starts with an analysis of the behavior of standard monocular depth estimation models (they use <a href="https://arxiv.org/abs/1907.01341">MiDaS</a>) when fed images at different resolutions. With small resolutions, the estimations lack many high-frequency details while generating a consistent overall structure of the scene. As the input resolution gets higher, more details are generated in the result, but with inconsistencies in the scene structure characterized by gradual shifts in depth between image regions.</p>
<p>Based on the above observation, the paper proposes to produce depth estimates at different resolutions and merge them into a final prediction with an image-to-image translation network.
To determine the input resolutions to be merged, the authors base this selection on the number of pixels that do not have any contextual cues nearby: the regions with the lowest contextual cue density dictate the maximum resolution
that can be used for an image. Additionally, to also benefit from higher-resolution estimations that generate more high-frequency details, a patch-based selection method is used to find regions with higher contextual cue density that require more high-frequency details, which are then merged together for the final result.</p>
<p>Check out the authors’ <a href="https://www.youtube.com/watch?v=lDeI17pHlqo">video</a> for a brief but great explanation of the work.</p>
<h4 id="polygonal-building-extraction-by-frame-field-learning-paper">Polygonal Building Extraction by Frame Field Learning (<a href="https://arxiv.org/abs/2004.14875">Paper</a>)</h4>
<p>For the task of building segmentation, where the objective is to output a polygon for each building in a given aerial photo, existing approaches are either based on vector representations, directly predicting vector polygons, or on a two-step approach that first produces a probability map, followed by polygon simplification. However, both approaches are either hard to train or involve many steps before the final output.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/25.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2004.14875">Girard et al.</a></figcaption>
</figure>
<p>For an end-to-end method that is easy to optimize, the authors propose to build on semantic segmentation methods: in addition to predicting a segmentation map corresponding to the buildings in the image, the model is also tasked with predicting a frame field as a geometric prior that constrains the segmentation map to have sharp corners.</p>
<h4 id="other-papers-to-check-out-3">Other papers to check out</h4>
<ul>
<li><a href="https://arxiv.org/abs/2012.05258">VIP-DeepLab: Learning Visual Perception With Depth-Aware Video Panoptic Segmentation </a></li>
<li><a href="https://arxiv.org/abs/2105.08336">Exemplar-Based Open-Set Panoptic Segmentation Network</a></li>
<li><a href="https://arxiv.org/abs/2104.14540">The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth</a></li>
<li><a href="https://arxiv.org/abs/2012.05901">Robust Consistent Video Depth Estimation</a></li>
<li><a href="https://openaccess.thecvf.com/content/CVPR2021/html/Qiu_Scene_Essence_CVPR_2021_paper.html">Scene Essence </a></li>
<li><a href="https://arxiv.org/abs/2104.05833">Semantic Segmentation With Generative Models: Semi-Supervised Learning and Strong Out-of-Domain Generalization</a></li>
<li><a href="https://arxiv.org/abs/2103.04379">Repurposing GANs for One-Shot Semantic Part Segmentation</a></li>
<li><a href="https://arxiv.org/abs/2103.12340">Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers</a></li>
<li><a href="https://arxiv.org/abs/2012.07287">Information-Theoretic Segmentation by Inpainting Error Maximization</a></li>
<li><a href="https://arxiv.org/abs/2106.02022">Single Image Depth Prediction With Wavelet Decomposition</a></li>
<li><a href="https://arxiv.org/abs/2012.09365">Learning to Recover 3D Scene Shape from a Single Image</a></li>
</ul>
<h1 id="representation--adversarial-learning">Representation & Adversarial Learning</h1>
<h4 id="exploring-simple-siamese-representation-learning-paper">Exploring Simple Siamese Representation Learning (<a href="https://arxiv.org/abs/2011.10566">Paper</a>)</h4>
<p>In the recent contrastive learning methods used for learning useful visual representations in an unsupervised manner, a model
is trained to map similar input images close to one another in the embedding space. Such methods, e.g., <a href="https://arxiv.org/abs/1911.05722">MoCo</a>, <a href="https://arxiv.org/abs/2002.05709">SimCLR</a>, or <a href="https://arxiv.org/abs/2006.07733">BYOL</a>, add additional conditions to the similarity maximization objective to avoid collapsing solutions, such as negative sample pairs, large batches, or momentum encoders.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/26.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2011.10566">Chen et al.</a></figcaption>
</figure>
<p>This paper investigates SimSiam, a simple Siamese setup where such conditions are not necessary. The architecture consists of an encoder and a projector/predictor head, trained to maximize the similarity between two features (one passed through the predictor and one not) corresponding to two augmented versions of an input image, with a stop-gradient operator applied to the non-predicted output. The obtained results show that this simple approach performs similarly to other, more elaborate approaches, indicating that the Siamese architecture may be an essential reason for the common success of the related contrastive methods; see section 5 for more details.</p>
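<p>The objective is compact enough to sketch directly, closely following the pseudocode given in the paper: each view’s prediction is pulled towards the other view’s projection, with the gradient blocked on the projection branch. The <code>encoder</code> and <code>predictor</code> modules are assumed to be defined elsewhere.</p>
<pre><code class="language-python">
import torch.nn.functional as F

def simsiam_loss(encoder, predictor, view1, view2):
    z1, z2 = encoder(view1), encoder(view2)        # projections of the two augmented views
    p1, p2 = predictor(z1), predictor(z2)          # predictions
    def neg_cos(p, z):
        # the stop-gradient on z is what prevents the collapsing solution
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
    return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)
</code></pre>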
<h4 id="where-and-what-examining-interpretable-disentangled-representations-paper">Where and What? Examining Interpretable Disentangled Representations (<a href="https://arxiv.org/abs/2104.05622">Paper</a>)</h4>
<p>In disentangled representation learning, the objective is to produce a representation of an input where each dimension captures variations with
a semantic meaning. One of the main limitations of existing work is its inability to differentiate between entangled and disentangled representations in the solution pool.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/27.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2104.05622">Zhu et al.</a></figcaption>
</figure>
<p>To solve this non-uniqueness problem, the paper defines disentanglement from three perspectives: informativeness, independence, and <em>interpretability</em>. By adding an interpretability condition on the produced representations, we force the model to only produce disentangled representations that correspond to human-defined concepts. The question then becomes: how can we enforce such an interpretability constraint on the representations without supervision?
To answer it, the authors propose to exploit two hypotheses about interpretability to learn disentangled representations. The first is Spatial Constriction (SC): a representation is usually interpretable if we can consistently tell where the controlled variations are in an image. The second is Perceptual Simplicity: an interpretable code usually corresponds to a concept consisting of perceptually simple variations.</p>
<p>Based on these two hypotheses, a new model is introduced, where Spatial Constriction is enforced with an SC module that restricts the impact of each latent code to specific areas of the feature maps during generation. As for Perceptual Simplicity, the model is trained with a loss that encourages it to embed simple data variations along each latent dimension.</p>
<h4 id="audio-visual-instance-discrimination-with-cross-modal-agreement-paper">Audio-Visual Instance Discrimination with Cross-Modal Agreement (<a href="https://arxiv.org/abs/2004.12943">Paper</a>)</h4>
<p>The paper presents a cross-modal instance discrimination framework for self-supervised learning, used to learn audio-visual representations from video and audio. The proposed contrastive learning framework contrasts video representations against multiple audio representations at once (and vice versa), hence the cross-modal nature of the method.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/28.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2004.12943">Morgado et al.</a></figcaption>
</figure>
<p>The proposed approach learns a cross-modal similarity metric by grouping video and audio instances that co-occur across multiple instances, and also optimizes for within-modality visual similarity rather than cross-modal similarity alone.</p>
<h4 id="up-detr-unsupervised-pre-training-for-object-detection-with-transformers-paper">UP-DETR: Unsupervised Pre-Training for Object Detection With Transformers (<a href="https://arxiv.org/abs/2011.09094">Paper</a>)</h4>
<p>While <a href="https://arxiv.org/abs/2005.12872">DETR</a> proposed a simple end-to-end object detector, removing all hand-designed components, it still comes with some training and optimization challenges, requiring large-scale training data and long training times (up to 500 epochs). UP-DETR proposes a new unsupervised pre-training task to reduce the amount of training time and data required:
DETR is first pre-trained on a pretext task designed specifically for object detection as the desired downstream task, and then fine-tuned for object detection.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/29.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2011.09094">Dai et al.</a></figcaption>
</figure>
<p>Unsupervised Pre-training DETR (UP-DETR) defines random query patch detection as a pretext task to pre-train DETR in a self-supervised manner.
For this task, a set of patches is first cropped from the input image at random, and the model is then trained to predict the bounding boxes of these patches.
The objective of this pre-training stage is to equip the model with better localization while maintaining its classification features. So to avoid suppressing the learned
classification features of the pre-trained backbone, UP-DETR freezes the backbone and adds a patch feature reconstruction loss.
Additionally, in order to specify the query patch the model needs to detect, the CNN features of the patch itself are added to the object queries before feeding them
to the decoder. For multi-patch detection, the patches are grouped into sets, each set is assigned to given object queries, and an attention mask is applied so that each prediction does not depend on the rest (see section 3.2).</p>
<h4 id="fast-end-to-end-learning-on-protein-surfaces-paper">Fast End-to-End Learning on Protein Surfaces (<a href="https://www.biorxiv.org/content/10.1101/2020.12.28.424589v1">Paper</a>)</h4>
<p>Chemically, proteins are composed of a sequence of amino acids which determines the structure (called the fold) of the protein, and this structure in turn
determines the function the protein will have. Predicting the structure from a sequence of amino acids (what AlphaFold does) and designing a protein from a target structure are both major unsolved problems in structural biology. Another challenging problem, and the task of interest in this paper, is studying the interactions of a given molecule with other molecules given their composition, in order to identify interaction patterns on protein surfaces.</p>
<figure style="width: 65%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/30.png" />
<figcaption>Image source: <a href="https://www.biorxiv.org/content/10.1101/2020.12.28.424589v1">Sverrisson et al.</a></figcaption>
</figure>
<p>In this context, the paper proposes dMaSIF (differentiable molecular surface interaction fingerprinting), a new and computationally efficient deep learning approach
that operates directly on the large set of atoms that compose the protein. It first generates a point cloud representation for the protein surface, learns task-specific geometric and chemical features on the surface point cloud, and finally, applies a new convolutional operator that approximates geodesic coordinates in the tangent space (please see the paper since I’m in over my head here).</p>
<h4 id="natural-adversarial-examples-paper">Natural Adversarial Examples (<a href="https://arxiv.org/abs/1907.07174">Paper</a>)</h4>
<p>ImageNet test examples tend to be simple, clear, close-up images, so the current test set may be too easy and may not represent the harder images encountered in the real world; moreover, a large-capacity model can leverage spurious cues or shortcuts to solve the ImageNet classification problem.
To counteract this, the paper proposes two hard ImageNet test sets, <em>ImageNet-A</em> and <em>ImageNet-O</em>, of natural adversarial examples obtained with adversarial filtration.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/31.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1907.07174">Hendrycks et al.</a></figcaption>
</figure>
<p><em>ImageNet-A</em> consists of real-world adversarially filtered images that fool current ImageNet classifiers.
Starting from a large set of images related to an ImageNet class, adversarially filtered examples are found by removing the samples that are correctly classified by a ResNet-50 trained on ImageNet. <em>ImageNet-O</em>, on the other hand, is a dataset of adversarially filtered examples for ImageNet out-of-distribution detectors.
Starting from ImageNet-22K, all of the examples that belong to ImageNet-1K classes are first removed. Then, using a ResNet-50, only the examples that the model classifies into ImageNet-1K classes with high confidence are kept.</p>
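<p>The filtration step for <em>ImageNet-A</em> boils down to keeping only the candidates that a standard classifier gets wrong; the sketch below assumes a <code>candidate_loader</code> yielding (image, label) batches and is only meant to illustrate the idea, not reproduce the full curation pipeline.</p>
<pre><code class="language-python">
import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True).eval()

@torch.no_grad()
def filter_hard_examples(candidate_loader):
    kept = []
    for images, labels in candidate_loader:
        preds = model(images).argmax(dim=1)
        for image, label, pred in zip(images, labels, preds):
            if pred.item() != label.item():     # misclassified by ResNet-50, so keep it
                kept.append((image, label))
    return kept
</code></pre>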
<ul>
<li><a href="https://arxiv.org/abs/2008.03800">Spatiotemporal Contrastive Video Representation Learning</a></li>
<li><a href="https://arxiv.org/abs/2106.01899">Adversarially Adaptive Normalization for Single Domain Generalization</a></li>
<li><a href="https://arxiv.org/abs/2103.03114">Self-Supervised Geometric Perception</a></li>
<li><a href="https://arxiv.org/abs/2104.04015">CutPaste: Self-Supervised Learning for Anomaly Detection and Localization</a></li>
<li><a href="https://arxiv.org/abs/2005.07289">Taskology: Utilizing Task Relations at Scale</a></li>
<li><a href="https://arxiv.org/abs/2105.01879">MOS: Towards Scaling Out-of-Distribution Detection for Large Semantic Space</a></li>
</ul>
<h1 id="transfer-low-shot-semi--unsupervised-learning">Transfer, Low-shot, Semi & Unsupervised Learning</h1>
<h4 id="datasetgan-efficient-labeled-data-factory-with-minimal-human-effort-paper">DatasetGAN: Efficient Labeled Data Factory With Minimal Human Effort (<a href="https://arxiv.org/abs/2104.06490">Paper</a>)</h4>
<p>DatasetGAN is a method that generates massive datasets of high-quality semantically segmented images with minimal human effort.
Based on the observation that GANs acquire rich semantic knowledge in order to render diverse and realistic examples of objects, DatasetGAN exploits the feature space of a trained GAN and trains a shallow decoder to produce pixel-level labelings. This decoder is first trained on a very small number of labeled examples and can then be used to label a practically infinite amount of synthetic images. The generated synthetic dataset can be used to train a model in a semi-supervised manner, which
can then be tested on real-world images.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/40.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2104.064907">Zhang et al.</a></figcaption>
</figure>
<p>The architecture of DatasetGAN consists of two models: a StyleGAN that generates synthetic images, and a style interpreter in the form of an ensemble of three-layer MLP classifiers, where each classifier takes as input feature maps from StyleGAN (outputs of the AdaIN layers), upsamples them to the image resolution, and predicts the pixel-level labels. The final prediction is the aggregation of the predictions of all the MLP classifiers, which are trained with a small number of finely annotated examples and then used to label the synthetic images.</p>
<h4 id="ranking-neural-checkpoints-paper">Ranking Neural Checkpoints (<a href="https://arxiv.org/abs/2011.11200">Paper</a>)</h4>
<p>During a given deep learning experiment, it is common practice to collect many checkpoints, which are different versions of the final model at different training iterations. This paper is concerned with ranking such checkpoints to find the model that best transfers to the downstream task of interest.</p>
<p>More specifically, we are given a number of pretrained neural nets, called checkpoints \(\mathcal{C}\), and the goal is to find the best checkpoint over a distribution of downstream tasks \(\mathcal{T}\). Each task consists of training and testing sets, and an evaluation procedure \(\mathbf{G}\) that adapts a pretrained model by adding a task-specific head, fine-tunes it on the training set with a hyperparameter sweep under a given computation constraint, and returns the resulting performance.
The objective is then to find the checkpoint ranking measure \(\mathbf{R}\) that best agrees, according to an evaluation metric \(\mathcal{M}\) of ranking quality, with the performance obtained via \(\mathbf{G}\) over the task distribution:</p>
\[\mathbf{R}^{*} \leftarrow \underset{\mathbf{R} \in \mathcal{R}}{\arg \max } \mathbb{E}_{t \sim \mathcal{T}} \mathcal{M}\left(\mathbf{R}_{t}, \mathbf{G}_{t}\right)\]
<p>The paper proposes a measure called NLEEP, an extension of <a href="https://arxiv.org/abs/2002.12462">LEEP</a> that evaluates the degree of transferability of the learned representations from source to target data without training. LEEP computes the empirical conditional distribution of target labels given dummy source labels to measure the degree of transferability. NLEEP simply replaces the softmax classifier used for generating the dummy source labels with a Gaussian mixture model for more reliable class assignments. This way, we can evaluate the checkpoints on downstream tasks while greatly reducing the cost of the evaluation procedure, since LEEP does not require fine-tuning on the target data.</p>
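<p>For intuition, here is a compact NumPy sketch of the LEEP score that NLEEP builds on: it estimates the empirical conditional distribution of target labels given the source model’s (dummy) labels and averages the resulting log-likelihood; for NLEEP, <code>source_probs</code> would instead be posteriors from a Gaussian mixture fitted on the features. This is our reading of the formulas, not the authors’ code.</p>
<pre><code class="language-python">
import numpy as np

def leep_score(source_probs, target_labels, num_target_classes):
    # source_probs: (n, num_source_classes) soft predictions, rows sum to 1
    # target_labels: (n,) integer labels of the target task
    n = source_probs.shape[0]
    one_hot = np.eye(num_target_classes)[target_labels]                 # (n, C_target)
    joint = one_hot.T @ source_probs / n                                # P(y, z)
    conditional = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)    # P(y | z)
    expected = source_probs @ conditional.T                             # (n, C_target)
    return np.mean(np.log(expected[np.arange(n), target_labels] + 1e-12))
</code></pre>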
<h4 id="other-papers-to-check-out-4">Other papers to check out</h4>
<ul>
<li><a href="https://arxiv.org/abs/2104.00303">MeanShift++: Extremely Fast Mode-Seeking With Applications to Segmentation and Object Tracking</a></li>
<li><a href="https://arxiv.org/abs/2003.10580">Meta Pseudo Labels</a></li>
<li><a href="https://arxiv.org/abs/2104.01893">Adaptive Prototype Learning and Allocation for Few-Shot Segmentation</a></li>
<li><a href="https://arxiv.org/abs/2103.07600">Student-Teacher Learning From Clean Inputs to Noisy Inputs</a></li>
<li><a href="https://arxiv.org/abs/2102.01987">Learning Graph Embeddings for Compositional Zero-Shot Learning</a></li>
</ul>
<h1 id="computational-photography">Computational Photography</h1>
<h4 id="real-time-high-resolution-background-matting-paper">Real-Time High-Resolution Background Matting (<a href="https://arxiv.org/abs/2012.07810">Paper</a>)</h4>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/42.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2012.07810">Lin el al.</a></figcaption>
</figure>
<p>While many tools now provide background replacement functionality, they often yield artifacts at the boundaries, particularly in areas with fine detail like hair or glasses. Traditional image matting methods, on the other hand,
provide much higher quality results, but do not run in real time at high resolutions and frequently require manual input.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/41.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2012.07810">Lin el al.</a></figcaption>
</figure>
<p>This paper proposes a real-time and high-resolution background matting method capable of processing 4K (3840x2160) images at 30fps and HD (1920x1080) images at 60fps. To achieve this, the model needs to be trained on large volumes of images with high-quality alpha mattes to generalize. To this end, the paper introduces two datasets with high-resolution alpha mattes and foreground layers extracted with chroma-key software. The model is first trained on these datasets to learn strong priors, then fine-tuned on public datasets to learn fine-grained details. As for the network design, the model contains two processing paths: a base network that predicts the alpha matte and foreground layer at lower resolution, along with an error prediction map specifying areas that may need high-resolution refinement. Based on this map, a refinement network then takes the low-resolution result and the original image to generate the high-resolution output, but only at selected regions for efficiency.</p>
<h4 id="im2vec-synthesizing-vector-graphics-without-vector-supervision-paper">Im2Vec: Synthesizing Vector Graphics without Vector Supervision (<a href="https://arxiv.org/abs/2102.02798">Paper</a>)</h4>
<p>Despite the large number of generative methods for images, there are only a limited number of approaches that operate directly on
vector graphics, and they require direct vector supervision. To solve this, the paper proposes Im2Vec, a new neural network that generates complex vector graphics with varying topologies and only requires indirect supervision from readily-available training images with no vector counterparts.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR21/43.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2102.02798">Reddy et al.</a></figcaption>
</figure>
<p>Im2Vec consists of a standard encoder-decoder architecture. Given an input image, the encoder maps it into a latent variable, which is then decoded into a vector graphic structure. The decoder is designed so that it can generate complex graphics (see section 3 of the paper for more details).</p>
<h4 id="other-papers-to-check-out-5">Other papers to check out</h4>
<ul>
<li><a href="https://arxiv.org/abs/2104.00059">Passive Inter-Photon Imaging</a></li>
<li><a href="https://arxiv.org/abs/2103.02376">Event-Based Synthetic Aperture Imaging With a Hybrid Network</a></li>
<li><a href="https://arxiv.org/abs/2105.06070">GAN Prior Embedded Network for Blind Face Restoration in the Wild</a></li>
<li><a href="https://arxiv.org/abs/2012.06722">Mask Guided Matting via Progressive Refinement Network</a></li>
</ul>
<h1 id="other">Other</h1>
<h4 id="biometrics-face-gesture-and-body-pose">Biometrics, Face, Gesture and Body Pose</h4>
<ul>
<li><a href="https://arxiv.org/abs/2103.06871">SMPLicit: Topology-Aware Generative Model for Clothed People</a></li>
<li><a href="https://arxiv.org/abs/2104.03176">On Self-Contact and Human Pose</a></li>
<li><a href="https://arxiv.org/abs/2007.12287">Body2Hands: Learning To Infer 3D Hands From Conversational Gesture Body Dynamics</a></li>
<li><a href="https://arxiv.org/abs/2105.02465">PoseAug: A Differentiable Pose Augmentation Framework for 3D Human Pose Estimation</a></li>
<li><a href="https://arxiv.org/abs/2012.15370">OSTeC: One-Shot Texture Completion</a></li>
<li><a href="https://arxiv.org/abs/2104.03313">SCANimate: Weakly Supervised Learning of Skinned Clothed Avatar Networks</a></li>
<li><a href="https://arxiv.org/abs/2104.13682">HOTR: End-to-End Human-Object Interaction Detection with Transformers</a></li>
<li><a href="https://arxiv.org/abs/2105.09396">Birds of a Feather: Capturing Avian Shape Models From Images</a></li>
</ul>
<h4 id="vision--language">Vision & Language</h4>
<ul>
<li><a href="https://arxiv.org/abs/2102.06183">Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling</a></li>
<li><a href="https://arxiv.org/abs/2104.12836">Multimodal Contrastive Training for Visual Representation Learning</a></li>
<li><a href="https://arxiv.org/abs/2101.07396">ArtEmis: Affective Language for Visual Art</a></li>
<li><a href="https://arxiv.org/abs/2006.06666">VirTex: Learning Visual Representations From Textual Annotations</a></li>
<li><a href="https://arxiv.org/abs/2106.13156">Learning by Planning: Language-Guided Global Image Editing</a></li>
</ul>
<h4 id="datasets">Datasets</h4>
<ul>
<li><a href="https://arxiv.org/abs/2102.08981">Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts</a></li>
<li><a href="https://arxiv.org/abs/2011.11015">Enriching ImageNet With Human Similarity Judgments and Psychological Embeddings</a></li>
<li><a href="https://arxiv.org/abs/2104.12690">Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets</a></li>
<li><a href="https://arxiv.org/abs/2105.08612">SAIL-VOS 3D: A Synthetic Dataset and Baselines for Object Detection and 3D Mesh Reconstruction From Video Data</a></li>
</ul>
<h4 id="explainable-ai--privacy">Explainable AI & Privacy</h4>
<ul>
<li><a href="https://arxiv.org/abs/2006.06634">Privacy-Preserving Image Features via Adversarial Affine Subspace Embeddings</a></li>
<li><a href="https://arxiv.org/abs/2012.09838">Transformer Interpretability Beyond Attention Visualization</a></li>
<li><a href="https://arxiv.org/abs/2006.03204">Black-box Explanation of Object Detectors via Saliency Maps</a></li>
</ul>
<h4 id="video-analysis-and-understanding">Video Analysis and Understanding</h4>
<ul>
<li><a href="https://arxiv.org/abs/2104.10386">Guided Interactive Video Object Segmentation Using Reliability-Based Attention Maps</a></li>
<li><a href="https://arxiv.org/abs/2103.07941">Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion</a></li>
<li><a href="https://arxiv.org/abs/2105.06993">Omnimatte: Associating Objects and Their Effects in Video </a></li>
<li><a href="https://arxiv.org/abs/2101.08833">SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation</a></li>
</ul>Yassineouali.yasine@gmail.comThe 2021 CVPR conference concluded last week, with 1660 papers, 30 tutorials, 83 workshops. In this blog post, we'll take a quick look into the emerging trends at the conference by going through a portion of the accepted papers.ECCV 2020: Some Highlights2020-09-02T00:00:00+00:002020-09-02T00:00:00+00:00https://yassouali.github.io//ml-blog/eccv2020<p>The 2020 European Conference on Computer Vision took place online, from 23 to 28 August, and consisted of
1360 papers, divided into 104 orals, 160 spotlights, and 1096 posters, in addition to 45 workshops and 16 tutorials. As is the case in recent years with ML and CV conferences, the huge number of papers can be overwhelming at times. Similar to my <a href="https://yassouali.github.io/ml-blog/cvpr2020/">CVPR2020 post</a>, to get a grasp of the general trends of the conference this year, I will present in this blog post a sort of snapshot of the conference by summarizing some papers (& listing some) that grabbed my attention.</p>
<p>First, some useful links:</p>
<ul>
<li>All of the papers can be found here: <a href="https://www.ecva.net/papers.php">ECCV Conference Papers</a></li>
<li>A list of available presentations on YT: <a href="https://crossminds.ai/category/eccv%202020/">Crossminds ECCV</a>. In addition to this <a href="https://www.youtube.com/playlist?list=PL6liSIqFR4BXnfg7-HM5-f7LGEKL1EDYb">YT playlist</a>.</li>
<li>One sentence description of all ECCV-2020 papers: <a href="https://www.paperdigest.org/2020/08/eccv-2020-highlights/">ECCV Paper Digest</a></li>
<li>ECCV virtual website: <a href="https://papers.eccv2020.eu/paper/160/">ECCV papers and presentations</a></li>
</ul>
<p><em>Disclaimer: This post is not a representation of the papers and subjects presented in ECCV 2020; it is just a personal overview of what I found interesting. Any feedback is welcomed!</em></p>
<ul>
<li><a href="#general-statistics">General Statistics</a></li>
<li><a href="#recognition-detection-segmentation-and-pose-estimation">Recognition, Detection, Segmentation and Pose Estimation</a></li>
<li><a href="#semi-supervised-unsupervised-transfer-representation--few-shot-learning">Semi-Supervised, Unsupervised, Transfer, Representation & Few-Shot Learning</a></li>
<li><a href="#3d-computer-vision--robotics">3D Computer Vision & Robotics</a></li>
<li><a href="#image-and-video-synthesis">Image and Video Synthesis</a></li>
<li><a href="#vision-and-language">Vision and Language</a></li>
<li><a href="#the-rest">The Rest</a></li>
</ul>
<h1 id="general-statistics">General Statistics</h1>
<p>The statistics presented in this section are taken from the official Opening & Awards presentation. Let’s start
with some general statistics:</p>
<figure style="width: 65%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/growth.png" />
<figcaption>Image source: Official Opening & Awards presentation.</figcaption>
</figure>
<figure style="width: 65%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/acceptance.png" />
<figcaption>Image source: Official Opening & Awards presentation.</figcaption>
</figure>
<figure style="width: 70%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/growth_review.png" />
<figcaption>Image source: Official Opening & Awards presentation.</figcaption>
</figure>
<p>The trends of earlier years continued, with a more than 200% increase in submitted papers compared to the 2018 conference and a number of papers similar to CVPR 2020. As expected, this increase is accompanied by a corresponding increase in the number of reviewers and area chairs to accommodate the expansion.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/areas.png" />
<figcaption>Image source: Official Opening & Awards presentation.</figcaption>
</figure>
<p>As expected, the majority of the accepted papers focus on topics related to deep learning, recognition, detection, and understanding. Similar to CVPR 2020, we see an increasing interest in growing areas such as label-efficient methods (e.g., unsupervised learning) and low-level vision.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/institutions.png" />
<figcaption>Image source: Official Opening & Awards presentation.</figcaption>
</figure>
<p>In terms of institutions, and similar to ICML this year, Google takes the lead with 180 authors, followed by The Chinese University of Hong Kong with 140 authors and Peking University with 110 authors.</p>
<p>In the next sections, we’ll present some paper summaries by subject.</p>
<h1 id="recognition-detection-segmentation-and-pose-estimation">Recognition, Detection, Segmentation and Pose Estimation</h1>
<h4 id="end-to-end-object-detection-with-transformers-paper">End-to-End Object Detection with Transformers (<a href="https://arxiv.org/abs/2005.12872">paper</a>)</h4>
<p>The task of object detection consists of localizing and classifying the objects visible in a given input image.
The popular framework for object detection consists of pre-defining a set of boxes (i.e., a set of geometric priors like <a href="https://arxiv.org/abs/1708.02002">anchors</a> or <a href="https://arxiv.org/abs/1506.01497">region proposals</a>), which are first classified, followed by a regression step to adjust the dimensions of the predefined box, and then a post-processing step to remove duplicate predictions. However, this approach requires selecting a subset of candidate boxes to classify, and is not typically end-to-end differentiable.
In this paper, the authors propose <a href="https://github.com/facebookresearch/detr">DETR</a> (<strong>DE</strong>tection <strong>TR</strong>ansformer), an end-to-end
fully differentiable approach with no geometric priors. Below is a comparison of the DETR and Faster R-CNN pipelines (image taken
from the authors’ presentation), highlighting the holistic nature of the approach.</p>
<figure style="width: 80%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/detr_compare.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2005.12872">Carion et al.</a></figcaption>
</figure>
<p>DETR is based on the encoder-decoder transformer architecture. The model consists of three components: the CNN feature extractor,
the encoder, and the decoder. A given image is first passed through the feature extractor to get image features. Then, positional encodings generated using sinusoids at different frequencies are added to the features to retain the 2D structure of the image. The resulting features are passed through the transformer encoder to aggregate information across features and separate the object instances. For decoding, a fixed set of learned embeddings called object queries is passed to the decoder together with the encoded features, producing the output feature vectors; these object queries are randomly initialized, learned during training, and then fixed during evaluation, and their number defines an upper bound on the number of objects the model can detect. Finally, the output feature vectors are fed through a (shared) fully connected layer to predict the class and bounding box for each query. To compute the loss and train the model, the outputs are matched to the ground truths with a one-to-one matching using the <a href="https://en.wikipedia.org/wiki/Hungarian_algorithm">Hungarian algorithm</a>.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/detr.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2005.12872">Carion et al.</a></figcaption>
</figure>
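<p>The matching step is easy to sketch with SciPy: a cost matrix between the N predictions and the M ground-truth objects is built and solved with the Hungarian algorithm. The cost below only combines class probability and an L1 box distance; DETR’s full cost also includes a generalized IoU term, so this is an illustration rather than the reference implementation.</p>
<pre><code class="language-python">
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # pred_logits: (N, num_classes); pred_boxes: (N, 4); gt_labels: (M,); gt_boxes: (M, 4)
    class_cost = -pred_logits.softmax(dim=-1)[:, gt_labels]     # (N, M) negative class probability
    box_cost = torch.cdist(pred_boxes, gt_boxes, p=1)           # (N, M) L1 distance between boxes
    cost = (class_cost + box_cost).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))        # one-to-one matched pairs
</code></pre>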
<h4 id="mutualnet-adaptive-convnet-via-mutual-learning-from-network-width-and-resolution-paper">MutualNet: Adaptive ConvNet via Mutual Learning from Network Width and Resolution (<a href="https://arxiv.org/abs/1909.12978">paper</a>)</h4>
<p>A traditional neural network can only be used if a specific amount of compute is available; if the resource constraints are not met, the model becomes unusable, which can greatly limit its usage in real applications. For example,
if the model is used for on-device inference on a phone, the computational constraints are always changing depending on the load and the
phone’s battery charge. A simple solution is to keep several models of different sizes on the device and use the
one matching the current constraints each time, but this requires a large amount of memory and cannot be scaled to
arbitrary constraints. Recent methods like <a href="https://arxiv.org/abs/1812.08928">S-Net</a> and <a href="https://arxiv.org/abs/1903.05134">US-Net</a> sample sub-networks during training so that the model can be used at different widths during deployment, but their performance drops dramatically under very low constraints.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/MutualNet.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1909.12978">Yang et al.</a></figcaption>
</figure>
<p>This paper proposes to leverage both the network scale and the input scale to find a good trade-off between accuracy and
computational efficiency. As illustrated above, at each training iteration, four sub-networks are sampled: the full network
and three sub-networks with varying widths. The full network is trained on the original image size with the ground-truth labels using the standard cross-entropy loss, while the sub-networks are trained on randomly down-scaled versions of the input image using a KL divergence loss between their outputs and the output of the full network (i.e., a distillation loss).
This way, each sub-network learns multi-scale representations from both the input scale and the network scale. During deployment, and given a specific resource constraint, the optimal combination of network scale and input scale can
be chosen for inference.</p>
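<p>One training iteration can be sketched as below; <code>model.set_width()</code>, the width list, and the resolution choices are illustrative stand-ins for the mechanism described above rather than the released training code.</p>
<pre><code class="language-python">
import random
import torch.nn.functional as F

def mutualnet_step(model, images, labels, optimizer):
    optimizer.zero_grad()
    model.set_width(1.0)                                   # full network, full resolution
    full_logits = model(images)
    loss = F.cross_entropy(full_logits, labels)
    for width in [0.25, 0.5, 0.75]:                        # three sampled sub-networks
        scale = random.choice([0.5, 0.75, 1.0])            # randomly down-scaled input
        small = F.interpolate(images, scale_factor=scale, mode='bilinear',
                              align_corners=False)
        model.set_width(width)
        loss = loss + F.kl_div(F.log_softmax(model(small), dim=1),
                               F.softmax(full_logits.detach(), dim=1),
                               reduction='batchmean')      # distillation from the full network
    loss.backward()
    optimizer.step()
</code></pre>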
<h4 id="gradient-centralization-a-new-optimization-technique-for-deep-neural-networks-paper">Gradient Centralization: A New Optimization Technique for Deep Neural Networks (<a href="https://arxiv.org/abs/2004.01461">paper</a>)</h4>
<p>Using second-order statistics such as mean and variance during optimization to perform some form of standardization of the activations or the network’s weights, as in batch norm or weight norm, has become an important component of neural network training. Instead of operating on the weights or the activations with additional normalization modules, Gradient Centralization (GC) operates directly on the gradients by centralizing the gradient vectors to have zero mean, which can smooth and accelerate the training process of neural networks and even improve the model’s generalization performance.</p>
<figure style="width: 80%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/GC.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2004.01461">Yong et al.</a></figcaption>
</figure>
<p>The GC operator, given the computed gradients, first computes the mean of each gradient vector as illustrated above, and then subtracts this mean from it. Formally, for a weight vector \(\mathbf{w}_i\) whose gradient is \(\nabla_{\mathbf{w}_{i}} \mathcal{L}(i=1,2, \ldots, N)\), the GC operator \(\Phi_{G C}\) is defined as:</p>
\[\Phi_{G C}\left(\nabla_{\mathbf{w}_{i}} \mathcal{L}\right)=\nabla_{\mathbf{w}_{i}} \mathcal{L}-\frac{1}{M} \sum_{j=1}^{M} \nabla_{w_{i, j}} \mathcal{L}\]
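<p>The formula translates almost line-for-line into code; the snippet below is a minimal sketch (the paper embeds this step inside SGD/Adam and applies it to convolutional and fully connected layers).</p>
<pre><code class="language-python">
import torch

def centralize_gradient(grad):
    # subtract from each weight's gradient its mean over the remaining dimensions
    if grad.dim() > 1:                         # skip biases and other 1-D parameters
        dims = tuple(range(1, grad.dim()))
        grad = grad - grad.mean(dim=dims, keepdim=True)
    return grad

# toy usage: centralize the gradients of a small layer before the optimizer step
layer = torch.nn.Linear(8, 4)
layer(torch.randn(2, 8)).sum().backward()
for p in layer.parameters():
    if p.grad is not None:
        p.grad = centralize_gradient(p.grad)
</code></pre>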
<h4 id="smooth-ap-smoothing-the-path-towards-large-scale-image-retrieval-paper">Smooth-AP: Smoothing the Path Towards Large-Scale Image Retrieval (<a href="https://arxiv.org/abs/2007.12163">paper</a>)</h4>
<p>In image retrieval, the objective is to retrieve images of the same class as the query image from a large collection of
images. This task differs from classification, where the classes encountered during testing have already been seen during training; in image retrieval, we might get a query from a novel class and still need to fetch similar images, i.e., an open-set problem.
The general pipeline of image retrieval consists of extracting embeddings for the query image and for the whole image collection using a CNN feature extractor, computing the cosine similarity score between each pair, and then ranking the images in the collection based on this similarity.
The feature extractor is then trained to produce a good ranking. The ranking performance is measured using <a href="https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Average_precision">Average Precision</a> (AP), which sums, for each positive, its rank among the positives divided by its rank over the whole image collection. However, computing the rank of a given image involves a thresholding operation with a <a href="https://en.wikipedia.org/wiki/Heaviside_step_function">Heaviside step function</a>, making it non-differentiable, so we cannot train the model end-to-end to directly optimize the ranking.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/smooth_ap.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2007.12163">Brown et al.</a></figcaption>
</figure>
<p>To solve this, the authors propose to replace the Heaviside step function with a smooth, temperature-controlled sigmoid, making
the ranking differentiable and usable as a loss function for end-to-end training. Compared to the triplet loss, the Smooth-AP loss directly optimizes a ranking objective, while the triplet loss is a surrogate that only indirectly optimizes for a good ranking.</p>
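<p>The core trick can be illustrated in a few lines: the pairwise score differences are passed through a temperature-controlled sigmoid instead of a hard step, so the (approximate) rank of each item stays differentiable. The temperature value is an arbitrary choice for the sketch, not the paper’s setting.</p>
<pre><code class="language-python">
import torch

def soft_ranks(scores, temperature=0.01):
    # scores: (N,) similarity scores; diff[i, j] = scores[j] - scores[i]
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)
    relaxed_step = torch.sigmoid(diff / temperature)       # smooth replacement of the Heaviside step
    # rank of item i = 1 + number of items scoring higher, excluding the diagonal term
    return 1 + relaxed_step.sum(dim=1) - torch.diagonal(relaxed_step)

print(soft_ranks(torch.tensor([0.9, 0.2, 0.5])))           # approximately [1, 3, 2]
</code></pre>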
<h4 id="hybrid-models-for-open-set-recognition-paper">Hybrid Models for Open Set Recognition (<a href="https://arxiv.org/abs/2003.12506">paper</a>)</h4>
<p>Existing image classification methods are often based on a closed-set assumption, i.e., the training set covers
all possible classes that may appear in the testing phase. But this assumption is clearly unrealistic: even with
large-scale datasets such as ImageNet and its 1K classes, it is impossible to cover all possible real-world classes.
This is where open-set recognition comes in, trying to solve the problem by assuming that the test set contains
both known and unknown classes.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/hybrid_model.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2003.12506">Zhang et al.</a></figcaption>
</figure>
<p>In this paper, the authors use a flow-based model to tackle the problem of open-set classification. Flow-based models are able to fit a probability distribution to training samples in an unsupervised manner via maximum likelihood estimation, and can then be used to predict the probability density of each example. When the probability density of an input sample is large, it is likely to be part of the training distribution with a known class, while outliers will have a small density value. While previous methods stacked a classifier
on top of the flow model, the authors propose to learn a joint embedding for both the flow model and the classifier, since the embedding space learned by a flow-based model alone may not have sufficiently discriminative features for effective classification. As illustrated above,
during training, images are mapped into a latent feature space by the encoder, and the encoded features are fed into both the classifier, trained with a cross-entropy loss, and the flow model for density estimation. The whole architecture is trained in an end-to-end manner. For testing, the \(\log p(x)\) of each image is computed and compared with a threshold, the lowest \(\log p(x)\) taken over the training set. If it is greater than the threshold, the image is sent to the classifier to identify its specific known class; otherwise, it is rejected as an unknown sample.</p>
<h4 id="conditional-convolutions-for-instance-segmentation-paper">Conditional Convolutions for Instance Segmentation (<a href="https://arxiv.org/abs/2003.05664">paper</a>)</h4>
<p>Instance segmentation remains one of the most challenging tasks in computer vision, requiring a per-pixel mask and a class label for each visible object in a given image. The dominant approach is <a href="https://arxiv.org/abs/1703.06870">Mask R-CNN</a>, which consists of two steps: first, the object detector Faster R-CNN predicts a bounding box for each instance; then, for each detected instance, the region of interest is cropped from the output feature maps using ROI Align, resized to a fixed resolution, and fed into the mask head, a small fully convolutional network used to predict the segmentation mask. However, the authors point out the following limitations of such an architecture: (1) ROI Align might fetch irrelevant features belonging to the background or to other instances, (2) the resizing operation restricts the resolution of the instance segmentation, and (3) the mask head requires a stack of 3x3 convolutions to induce a large enough receptive field to predict the mask, which considerably increases its computational requirements.</p>
<figure style="width: 80%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/condinst.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2003.05664">Tian et al.</a></figcaption>
</figure>
<p>In this paper, the authors propose to adapt the FCNs used for semantic segmentation to instance segmentation. For effective instance segmentation, FCNs require two types of information: appearance information to categorize objects, and location information to distinguish multiple objects belonging to the same category. The proposed network, called CondInst (conditional convolutions for instance segmentation), is based on <a href="https://arxiv.org/abs/1904.04971">CondConv</a> and <a href="https://arxiv.org/abs/1609.09106">HyperNetworks</a>: for each instance, a sub-network generates the mask FCN head’s weights conditioned on the center area of that instance, and these weights are then used to predict the instance’s mask. Specifically, as shown above, the network consists of multiple heads applied at multiple scales of the feature map. Each head predicts the class of a given instance at pre-defined positions, together with the weights to be used by the mask FCN head. The mask prediction is then done using the parameters produced by each head.</p>
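<p>The core mechanism, applying convolution weights that were themselves predicted per instance, can be sketched as follows. This is a simplified illustration with made-up layer sizes (the actual head also receives relative coordinates as extra channels), not the authors’ implementation:</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def dynamic_mask_head(mask_feats, gen_params):
    """Sketch of a CondInst-style dynamic mask head.

    mask_feats: (1, 8, H, W) shared mask feature map.
    gen_params: flat vector of per-instance weights produced by the controller,
                split here into three tiny 1x1 conv layers (8 to 8 to 8 to 1).
    """
    w1, b1 = gen_params[:64].view(8, 8, 1, 1), gen_params[64:72]
    w2, b2 = gen_params[72:136].view(8, 8, 1, 1), gen_params[136:144]
    w3, b3 = gen_params[144:152].view(1, 8, 1, 1), gen_params[152:153]
    x = F.relu(F.conv2d(mask_feats, w1, b1))
    x = F.relu(F.conv2d(x, w2, b2))
    return F.conv2d(x, w3, b3)   # (1, 1, H, W) instance mask logits
</code></pre>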
<h4 id="multitask-learning-strengthens-adversarial-robustness-paper">Multitask Learning Strengthens Adversarial Robustness (<a href="https://arxiv.org/abs/2007.07236">paper</a>)</h4>
<p>One of the main limitations of deep neural networks is their vulnerability to adversarial attacks, where very small and invisible perturbations are injected into the input, resulting in the wrong outputs, even if the appearance of the input remains the same. In recent years, the adversarial robustness of deep nets was rigorously investigated at different stages of the pipeline, from the input data (eg., using unlabeled data and adversarial training) to the model itself using regularization (eg., <a href="https://arxiv.org/abs/1704.08847">Parseval Networks</a>), but the outputs of the model are still not utilized to
improve its robustness. In this paper, the authors investigate the effect of having multiple outputs for multi-task
learning on the robustness of the learned model. Such a setting is relevant since a growing number of machine learning applications call for
models capable of solving multiple tasks at once.</p>
<figure style="width: 80%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/multitask_robustness.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2007.07236">Mao et al.</a></figcaption>
</figure>
<p>The attacks considered are p-norm ball bounded, ie., the adversarial perturbation is searched within a p-norm ball of a given radius around the input example, and the vulnerability is measured as the total change in the loss. The authors show improved robustness when training on a pair of tasks (eg., two tasks chosen from: segmentation, depth, normals, reshading, input reconstruction, 2d and 3d keypoints…).
The improved robustness is observed for both single-task attacks (ie., the perturbation is computed using one output) and multi-task attacks (ie., the maximal perturbation among the perturbations computed using all of the outputs). The authors also show theoretically that such multi-task robustness is only obtained if the tasks are correlated.</p>
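<p>As a rough illustration of what a joint multi-task attack looks like, here is a one-step, FGSM-style sketch under an l-infinity ball. The model/loss/target dictionaries are hypothetical names used only for this example, not the paper’s attack implementation:</p>
<pre><code class="language-python">import torch

def multitask_fgsm(model, x, targets, losses, epsilon=0.03):
    """Sketch of a joint (multi-task) attack within an l-inf ball of radius epsilon.

    model(x) is assumed to return a dict of task outputs, `losses` a dict of
    per-task loss functions, and `targets` a dict of ground truths.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    outputs = model(x_adv)
    total = sum(losses[t](outputs[t], targets[t]) for t in losses)  # attack all outputs at once
    total.backward()
    # one FGSM step, then projection back onto the l-inf ball around x
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    x_adv = torch.max(torch.min(x_adv, x + epsilon), x - epsilon)
    return x_adv.detach()
</code></pre>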
<h4 id="dynamic-group-convolution-for-accelerating-convolutional-neural-networks-paper">Dynamic Group Convolution for Accelerating Convolutional Neural Networks (<a href="https://arxiv.org/abs/2007.04242">paper</a>)</h4>
<p>Group convolutions were first introduced in AlexNet to accelerate training, and subsequently adapted for efficient
CNNs such as <a href="https://arxiv.org/abs/1704.04861">MobileNet</a> and <a href="https://arxiv.org/abs/1707.01083">ShuffleNet</a>. They consist of equally splitting the input and output channels of a convolution layer into mutually exclusive sections or groups, while performing a normal convolution operation within each individual group. So for \(G\)
groups, the computation is reduced by a factor of \(G\). However, the authors argue that group convolutions introduce two key limitations:
(1) they weaken the representation capability of the normal convolution by introducing sparse neuron connections,
and (2) they use a fixed channel division regardless of the properties of each input.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/DGC.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2007.04242">Su et al.</a></figcaption>
</figure>
<p>In order to adaptively select the most relevant input channels for each group while keeping the full structure of the original networks, the authors propose dynamic group convolution (DGC).
DGC consists of two heads; in each head, a saliency score generator assigns an importance score to each
channel. Using these scores, the channels with low importance are pruned, and a normal convolution is conducted on the selected subset of input channels, generating the output channels of each head. Finally, the output channels from the different heads are concatenated and shuffled.</p>
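<p>A heavily simplified sketch of one such head is given below: score the channels, keep the top-scoring subset, and convolve only those. This ignores the paper’s differentiable gating and training details; shapes, the saliency module, and the keep ratio are assumptions for illustration:</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class DynamicGroupHead(nn.Module):
    """Sketch of one DGC-style head: score input channels, keep the top-k,
    and run a normal convolution on the selected subset."""

    def __init__(self, in_ch, out_ch, keep_ratio=0.5):
        super().__init__()
        self.keep = int(in_ch * keep_ratio)
        self.saliency = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(in_ch, in_ch))
        self.conv = nn.Conv2d(self.keep, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        scores = self.saliency(x)                        # (B, C) importance per channel
        idx = scores.topk(self.keep, dim=1).indices      # indices of the kept channels
        # gather the selected channels for each sample in the batch
        gathered = torch.stack([x[b, idx[b]] for b in range(x.size(0))])
        return self.conv(gathered)
</code></pre>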
<h4 id="disentangled-non-local-neural-networks-paper">Disentangled Non-local Neural Networks (<a href="https://arxiv.org/abs/2006.06668">paper</a>)</h4>
<p>The <a href="https://arxiv.org/abs/1711.07971">non-local block</a> models long-range dependency between pixels using the attention mechanism, and has been widely used for numerous visual recognition tasks, such as object detection, semantic segmentation, and video action recognition.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/disentangledNL.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2006.06668">Yin et al.</a></figcaption>
</figure>
<p>In this paper, the authors try to better understand the non-local block, find its limitations, and propose an improved version. First, they reformulate the similarity between a pixel \(i\) (referred to as a query pixel) and a pixel \(j\) (referred to as a key pixel) as the sum of two terms: a pairwise term, a whitened dot product representing the pure pairwise relation between the query and key pixels, and a unary term, where a given key pixel has the same impact on all query pixels. Then, to understand the impact of each term, they train with either one alone, and find that the pairwise term is responsible for category information, while the unary term is responsible for boundary information. However, by analyzing the gradients of the non-local block, they observe that when the two terms are combined in the standard attention operator their gradients are multiplied, so if the gradient of one of the two terms is zero, the non-zero gradient of the other does not contribute. To solve this, the authors propose a disentangled version of the non-local block, where each term is optimized separately.</p>
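<p>The sketch below illustrates how the two attention terms can be computed and combined additively rather than entangled in one softmax; it only returns the attention weights (not the full block), and the whitening/projection details are simplified assumptions rather than the paper’s exact formulation:</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def disentangled_nonlocal_attention(query, key, mask_proj):
    """Sketch of disentangled attention weights over flattened spatial positions.

    query, key: (N, C) per-pixel query / key embeddings.
    mask_proj:  (C,) projection used for the unary ("global impact") term.
    The pairwise term whitens q and k before the dot product; the two terms
    are kept as separate softmaxes instead of being multiplied together.
    """
    q_w = query - query.mean(dim=0, keepdim=True)    # whitened queries
    k_w = key - key.mean(dim=0, keepdim=True)        # whitened keys
    pairwise = F.softmax(q_w @ k_w.t(), dim=-1)      # (N, N) pure pairwise relation
    unary = F.softmax(key @ mask_proj, dim=0)        # (N,) per-key global impact
    return pairwise + unary.unsqueeze(0)             # disentangled, additive combination
</code></pre>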
<h4 id="hard-negative-examples-are-hard-but-useful-paper">Hard negative examples are hard, but useful (<a href="https://arxiv.org/abs/2007.12749">paper</a>)</h4>
<p>Deep metric learning optimizes an embedding function that maps semantically similar images to relatively nearby locations and maps semantically dissimilar images to distant locations. A popular way to learn the mapping is to define a loss function based on triplets of images: an anchor image, a positive image from the same class, and a negative image from a different class. The model is then penalized when the anchor is mapped closer to the negative image than it is to the positive image. However, during optimization, most triplet candidates already have the anchor much closer to the positive than to the negative, making them redundant. On the other hand, optimizing with the hardest negative examples leads to bad local minima in the early phase of training. This is because, in this case, the anchor-negative similarity is larger than the anchor-positive similarity, as measured by the cosine similarity, ie., the dot product between normalized feature vectors.</p>
<figure style="width: 70%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/hard_negatives.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2007.12749">Xuan et al.</a></figcaption>
</figure>
<p>The authors show that such problems with the usage of hard negatives come from the standard implementation of the triplet loss. Specifically, (1) if the normalization is not considered during
the gradient computation, a large part of the gradient is lost, and (2) if two images of different classes are close by in the embedding space, the gradient of the loss might pull them closer together instead of pushing them apart. To solve this, instead of pulling the anchor-positive pair together to be tightly clustered as done in the standard triplet loss, the authors propose to avoid updating the anchor-positive pairs, resulting in less tight clusters for a given class. This way, the network focuses only on directly pushing the hard negative examples away from the anchor.</p>
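<p>One simple way to express the "push the negative, don’t pull the positive" behaviour is to block gradients through the anchor-positive similarity. The snippet below is my own simplified sketch of that idea, not the exact loss proposed in the paper:</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def selective_triplet_loss(anchor, positive, negative, margin=0.2):
    """Sketch: the anchor-positive similarity is detached so gradients only
    push the hard negative away from the anchor, instead of also collapsing
    the positive pair into a tight cluster."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negative, dim=-1)
    sim_ap = (a * p).sum(dim=-1).detach()   # frozen: no gradient pulls a and p together
    sim_an = (a * n).sum(dim=-1)            # only this term receives gradients
    return F.relu(margin + sim_an - sim_ap).mean()
</code></pre>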
<h4 id="volumetric-transformer-networks-paper">Volumetric Transformer Networks (<a href="https://arxiv.org/abs/2007.09433">paper</a>)</h4>
<p>One of the keys behind the success of CNNs is their ability to learn discriminative feature representations of semantic object parts, which are very useful for computer vision tasks. However, CNNs still lack the ability to handle various spatial variations, such as scale, viewpoint and intra-class variations. Recent methods, such as <a href="https://arxiv.org/abs/1506.02025">spatial transformer networks</a> (STNs), try to suppress such variations by first warping the feature maps of spatially different images to a standard, canonical configuration, and then training classifiers on these standardized features. But such methods apply the same warping to all the feature channels, ignoring the fact that the individual feature channels can represent different semantic parts, which may require different spatial transformations with respect to the canonical configuration.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/VTN.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2007.09433">Kim et al.</a></figcaption>
</figure>
<p>To solve this, the paper introduces the Volumetric Transformer Network (VTN) shown above, a learnable module that predicts per-channel and per-spatial-location warping transforms, which are used to reconfigure the intermediate CNN features into spatially agnostic, canonical representations. VTN is an encoder-decoder network with modules dedicated to letting information flow across the feature channels, to account for the dependencies between the semantic parts.</p>
<h4 id="faster-autoaugment-learning-augmentation-strategies-using-backpropagation-paper">Faster AutoAugment: Learning Augmentation Strategies Using Backpropagation (<a href="https://arxiv.org/abs/1911.06987">paper</a>)</h4>
<p>Data augmentations (DA) have become an important and indispensable component of deep learning methods, and recent works
(eg., <a href="https://arxiv.org/abs/1805.09501">AutoAugment</a>, <a href="https://arxiv.org/abs/1905.00397">Fast AutoAugment</a> and <a href="https://arxiv.org/abs/1909.13719">RandAugment</a>) showed that augmentation strategies found by search algorithms outperform standard augmentations.
With a pre-defined set of possible transformations, such as geometric transformations like rotation or color enhancing transformations like solarization, the objective is to find the optimal data augmentation parameters, ie., the magnitude of the augmentation, the probability of applying it, and the number of transformations to combine as illustrated in the left figure below.
The optimal strategy is learned with a double optimization loop, so that the validation error of a given CNN trained with a given strategy is minimized. However, such an optimization method suffers from a large search space of possible policies, requiring sophisticated search strategies, and a single iteration of policy optimization requires the full training of the CNN. To solve this, the authors
propose to find the optimal strategy using density matching of original and augmented images with gradient based optimization.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/faster_aug.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1911.06987">Hataya et al.</a></figcaption>
</figure>
<p>By viewing DA as a way to fill in missing points of the original data, the objective becomes minimizing the distance between the distributions of the augmented data and the original data using adversarial learning, and in order to learn the optimal
augmentation strategy, the policy needs to be differentiable with respect to the parameters of the transformations.
For the probability of applying a given augmentation, the authors use a stochastic binary variable sampled from
a Bernoulli distribution and optimized using the <a href="https://francisbach.com/the-gumbel-trick/">Gumbel trick</a>, while
the magnitude is approximated with a straight-through estimator and the combinations are learned as weighted combinations
of one-hot vectors.</p>
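<p>To give a flavour of how a single augmentation operation can be made differentiable in both its probability and its magnitude, here is a small sketch using a relaxed Bernoulli gate and a straight-through magnitude. The class, the discretization grid, and <code>op_fn</code> are assumptions for illustration, not the paper’s implementation:</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class DifferentiableOp(nn.Module):
    """Sketch of one learnable augmentation op: a relaxed Bernoulli gate decides
    whether to apply it, and a straight-through estimator lets gradients reach
    the magnitude. `op_fn` is any image transform taking a magnitude in [0, 1]."""

    def __init__(self, op_fn, temperature=0.05):
        super().__init__()
        self.op_fn = op_fn
        self.logit_p = nn.Parameter(torch.zeros(1))    # probability of applying the op
        self.magnitude = nn.Parameter(torch.rand(1))   # strength of the op
        self.temperature = torch.tensor(temperature)

    def forward(self, images):
        # Gumbel / relaxed-Bernoulli sample of the "apply or not" gate
        gate = torch.distributions.RelaxedBernoulli(
            self.temperature, logits=self.logit_p).rsample()
        # straight-through magnitude: discretized in the forward pass,
        # identity gradient in the backward pass
        mag = self.magnitude.clamp(0, 1)
        mag_st = (mag * 10).round().div(10) + mag - mag.detach()
        augmented = self.op_fn(images, mag_st)
        return gate * augmented + (1 - gate) * images
</code></pre>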
<h4 id="other-papers">Other Papers</h4>
<ul>
<li><a href="https://arxiv.org/abs/2003.08983">Metric learning: cross-entropy vs. pairwise losses</a></li>
<li><a href="https://arxiv.org/abs/2002.10120">Semantic Flow for Fast and Accurate Scene Parsing</a></li>
<li><a href="https://arxiv.org/abs/1909.11065">Object-Contextual Representations for Semantic Segmentation</a></li>
<li><a href="https://arxiv.org/abs/2001.01536">Learning From Multiple Experts: Self-paced Knowledge Distillation for Long-tailed Classification</a></li>
<li><a href="http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123700664.pdf">Feature Normalized Knowledge Distillation for Image Classification</a></li>
<li><a href="http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123550630.pdf">Mixup Networks for Sample Interpolation via Cooperative Barycenter Learning</a></li>
<li><a href="https://arxiv.org/abs/2007.09271">OnlineAugment: Online Data Augmentation with Less Domain Knowledge</a></li>
<li><a href="https://arxiv.org/abs/2007.09654">Distribution-Balanced Loss for Multi-Label Classification in Long-Tailed Datasets</a></li>
<li><a href="https://arxiv.org/abs/2004.13458">DiVA: Diverse Visual Feature Aggregation for Deep Metric Learning</a></li>
<li><a href="https://arxiv.org/abs/1911.10782">Estimating People Flows to Better Count Them in Crowded Scenes</a></li>
<li><a href="https://arxiv.org/abs/1912.11474">SoundSpaces: Audio-Visual Navigation in 3D Environments</a></li>
<li><a href="https://arxiv.org/abs/2003.08866">Spatially Adaptive Inference with Stochastic Feature Sampling and Interpolation</a></li>
<li><a href="https://arxiv.org/abs/2003.03780">DADA: Differentiable Automatic Data Augmentation</a></li>
<li><a href="https://arxiv.org/abs/2003.08979">URIE: Universal Image Enhancement for Visual Recognition in the Wild</a></li>
<li><a href="https://arxiv.org/abs/2007.11056">BorderDet: Border Feature for Dense Object Detection</a></li>
<li><a href="https://arxiv.org/abs/2008.08115">TIDE: A General Toolbox for Understanding Errors in Object Detection</a></li>
<li><a href="https://arxiv.org/abs/2007.09336">AABO: Adaptive Anchor Box Optimization for Object Detection via Bayesian Sub-sampling</a></li>
<li><a href="https://arxiv.org/abs/2007.09584">PIoU Loss: Towards Accurate Oriented Object Detection in Complex Environments</a></li>
<li><a href="http://arxiv.org/abs/2007.05676">Learning Object Depth from Camera Motion and Video Object Segmentation</a></li>
<li><a href="https://arxiv.org/abs/1908.01259">Attentive Normalization</a></li>
<li><a href="[https://www4.comp.polyu.edu.hk/~cslzhang/paper/conf/ECCV20/ECCV_MBN.pdf">Momentum Batch Normalization for Deep Learning with Small Batch Size </a></li>
<li><a href="https://arxiv.org/abs/2001.06057">A Simple Way to Make Neural Networks Robust Against Diverse Image Corruptions</a></li>
</ul>
<h1 id="semi-supervised-unsupervised-transfer-representation--few-shot-learning">Semi-Supervised, Unsupervised, Transfer, Representation & Few-Shot Learning</h1>
<h4 id="big-transfer-bit-general-visual-representation-learning-paper">Big Transfer (BiT): General Visual Representation Learning (<a href="https://arxiv.org/abs/1912.11370">paper</a>)</h4>
<p>In this paper, the authors revisit the simple paradigm of transfer learning: pre-train on a large amount of labeled source data (e.g., <a href="https://arxiv.org/abs/1707.02968">JFT-300M</a> and <a href="https://github.com/dmlc/mxnet-model-gallery/blob/master/imagenet-21k-inception.md">ImageNet-21k</a> datasets), then fine-tune the pre-trained weights on the target tasks, reducing both the amount of data needed for the target tasks and the fine-tuning time. The proposed framework is BiT (Big Transfer), and consists of a number of components that are necessary to build an effective network capable of leveraging large scale datasets and learning general and transferable representations.</p>
<p>On the (upstream) pre-training side, BiT consists of the following:</p>
<ul>
<li>For very large datasets, the fact that Batch Norm (BN) uses statistics from the training data during testing results in a train/test discrepancy, where the training loss is correctly optimized while the validation loss is very unstable, in addition to BN's sensitivity to the batch size. To solve this, BiT replaces Batch Norm with <a href="https://arxiv.org/abs/1803.08494">Group Norm</a> combined with <a href="https://arxiv.org/abs/1903.10520">Weight Standardization</a> (see the sketch after this list).</li>
<li>A small model such as ResNet 50 does not benefit from large scale training data, so the size of the model needs to also be correspondingly scaled up.</li>
</ul>
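<p>As a minimal sketch of this BN-free recipe (my own illustration, not the BiT code; the group count and channel sizes are arbitrary), a weight-standardized convolution paired with GroupNorm can look like this:</p>
<pre><code class="language-python">import torch
import torch.nn as nn
import torch.nn.functional as F

class StdConv2d(nn.Conv2d):
    """Conv layer with Weight Standardization: the kernel is normalized to
    zero mean / unit variance per output channel before every forward pass."""

    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        var = w.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        w = (w - mean) / torch.sqrt(var + 1e-10)
        return F.conv2d(x, w, self.bias, self.stride, self.padding,
                        self.dilation, self.groups)

# example: a batch-size independent block in the spirit of BiT
block = nn.Sequential(StdConv2d(64, 128, 3, padding=1, bias=False),
                      nn.GroupNorm(32, 128), nn.ReLU(inplace=True))
</code></pre>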
<p>For (down-stream) target tasks, BiT proposes the following:</p>
<ul>
<li>The usage of standard SGD, without any layer freezing, dropout, L2-regularization or gradient adaptation, in addition to initializing the last prediction layer to all zeros.</li>
<li>Instead of resizing all inputs to a fixed size (eg., 224), during training the images are resized and cropped to a square with a randomly chosen size and randomly h-flipped. At test time, the image is resized to a fixed size.</li>
<li>While <a href="https://arxiv.org/abs/1710.09412">mixup</a> is not useful for large scale pre-training given the abundance of data, BiT finds that mixup regularization can be very beneficial for mid-sized dataset used for downstream tasks.</li>
</ul>
<h4 id="learning-visual-representations-with-caption-annotations-paper">Learning Visual Representations with Caption Annotations (<a href="https://arxiv.org/abs/2008.01392">paper</a>)</h4>
<p>Training deep models on large scale annotated datasets results in not only a good performance on the task at hand, but also enables the model
to learn useful representations for downstream tasks. But can we obtain such useful features without such expensive fine-grained annotations?
This paper investigates weakly-supervised pre-training using noisy labels, which in this case are image captions.</p>
<figure style="width: 80%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/ICMLM.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2008.01392">Sariyildiz et al.</a></figcaption>
</figure>
<p>With the objective of using a limited set of image-caption pairs to learn visual representations, how can the training objective be formulated to push for an effective
interaction between the images and their captions? The paper builds on the masked language modeling used in <a href="https://arxiv.org/abs/1810.04805">BERT</a>, which randomly masks 15%
of the input tokens and trains the model to reconstruct them using the encoder part of the <a href="https://arxiv.org/abs/1706.03762">transformer</a> model,
and proposes image-conditioned masked language modeling (ICMLM), where the images are leveraged to reconstruct the masked tokens of their corresponding captions.
To solve this objective, the authors propose two multi-modal architectures: (1) ICMLM tfm, where the image is encoded using a CNN and the masked caption using the
BERT model; the caption and image features are then concatenated and passed through a transformer encoder, resulting in a multi-modal embedding used to predict the masked token.
And (2) ICMLM att+fc: similarly, the caption and image features are first produced, then passed through a pairwise attention block to aggregate the information between the caption and the image. The resulting features are then pooled and passed through a fully connected layer for masked token prediction.</p>
<h4 id="memory-augmented-dense-predictive-coding-for-video-representation-learning-paper">Memory-augmented Dense Predictive Coding for Video Representation Learning (<a href="https://arxiv.org/abs/2008.01065">paper</a>)</h4>
<p>The recent progress in self-supervised representation learning for images has shown impressive results on downstream tasks. However,
although multi-modal representation learning for videos saw similar gains, self-supervision using video streams only, without any other modalities
such as text or audio, is still not as developed, even though the temporal information of videos provides a free supervisory signal to train a model to predict future states from the past in a self-supervised manner. The task remains hard to solve since the exact future is not deterministic, and at a given time step, there are many likely
and plausible hypotheses for future states (eg., when the action is “playing golf”, a future frame could have the hands and golf club in many possible positions).</p>
<figure style="width: 80%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/MemDPC.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2008.01065">Han et al.</a></figcaption>
</figure>
<p>This paper uses contrastive learning with a memory module to address these issues with future prediction. To reduce the uncertainty, the model predicts the future
at the feature level, and is trained using a contrastive loss to avoid overly strict constraints. To deal with multiple hypotheses, a memory module is used to infer multiple future states simultaneously. Given a set of successive frames, a 2d-3d CNN encoder (ie., \(f\)) produces context features and a GRU (ie., \(g\)) aggregates all the past information, which is then used to select slots from the shared memory module. A predicted future state is produced as a convex combination of the selected memory slots and compared with the true feature vectors of the future states using a contrastive loss. For downstream tasks, the features produced by \(g\) are pooled and then fed to the classifier.</p>
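<p>The memory-addressing step itself is compact; the sketch below illustrates it with a softmax over slot-selection logits, assuming a learnable memory bank and a linear addressing layer (names and shapes are illustrative, not the paper’s code):</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def predict_future_state(context, memory, query_proj):
    """Sketch of the memory-based prediction: the aggregated past context
    selects memory slots, and the predicted future feature is a convex
    combination of those slots.

    context: (B, C) output of the temporal aggregator g.
    memory:  (K, C) shared, learnable memory bank of future "hypotheses".
    query_proj: linear layer mapping the context to K slot-addressing logits.
    """
    logits = query_proj(context)            # (B, K) one score per memory slot
    weights = F.softmax(logits, dim=-1)     # convex combination weights
    return weights @ memory                 # (B, C) predicted future feature
</code></pre>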
<h4 id="scan-learning-to-classify-images-without-labels-paper">SCAN: Learning to Classify Images without Labels (<a href="https://arxiv.org/abs/2005.12320">paper</a>)</h4>
<p>To group unlabeled input images into semantically meaningful clusters, we need to find the solutions using visual similarities alone. Prior work either (1) learns
rich features with a self-supervised method and then applies k-means on the features to find the clusters, which can easily lead to degenerate solutions, or (2) uses end-to-end clustering approaches that either leverage CNN features for deep clustering or are based on mutual information maximization. However, the produced clusters depend heavily on the initialization
and are likely to latch onto low-level features.</p>
<figure style="width: 90%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/SCAN.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2005.12320">Gansbeke et al.</a></figcaption>
</figure>
<p>To solve the issues found in prior work, the paper proposes SCAN (semantic clustering by adopting nearest neighbors), consisting of a two-step procedure. In a first step, the feature representations are learned through a pretext task; then, to generate the initial clusters, SCAN mines the nearest neighbors of each image based on feature similarity instead of applying K-means. In a second step, the semantically meaningful nearest neighbors are used as a prior to train the model to classify each image and its mined neighbors together. This is optimized using a loss function that maximizes their dot product after the softmax, pushing the network to produce both consistent and discriminative (one-hot) predictions.</p>
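<p>The second-step objective is simple to write down. Below is a minimal sketch of such a clustering loss, with a consistency term over (image, mined neighbor) pairs and an entropy term that discourages assigning everything to one cluster; the entropy weight is an illustrative value:</p>
<pre><code class="language-python">import torch

def scan_loss(probs_anchor, probs_neighbor, entropy_weight=5.0):
    """Sketch of a SCAN-style clustering objective for a batch of (image, neighbor) pairs.

    probs_anchor, probs_neighbor: (B, K) softmax cluster predictions.
    The first term maximizes the dot product between an image and its mined
    neighbor (consistency); the entropy term spreads the average prediction
    over clusters to avoid the trivial single-cluster solution.
    """
    consistency = -torch.log((probs_anchor * probs_neighbor).sum(dim=1) + 1e-8).mean()
    mean_pred = probs_anchor.mean(dim=0)                       # average cluster usage
    entropy = -(mean_pred * torch.log(mean_pred + 1e-8)).sum()
    return consistency - entropy_weight * entropy
</code></pre>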
<h4 id="gatcluster-self-supervised-gaussian-attention-network-for-image-clustering-paper">GATCluster: Self-Supervised Gaussian-Attention Network for Image Clustering (<a href="https://arxiv.org/abs/2002.11863">paper</a>)</h4>
<p>Clustering consists of separating data into clusters according to sample similarity. Traditional methods use hand-crafted features and domain-specific distance
functions to measure the similarity, but such hand-crafted features are very limited in expressiveness. Subsequent work leveraged deep representations with clustering algorithms, but the performance of deep clustering still suffers when the input data is complex. For an effective clustering, the features must be both highly
discriminative and capture object semantics. In terms of the clustering step, trivial solutions such as assigning all samples to a single or few clusters
must be avoided, and the clustering needs to be efficient to be applied to large-sized images.</p>
<p>The paper proposes GATCluster, which directly outputs semantic cluster labels without further post-processing, where the learned features are one-hot encoded vectors to guarantee the avoidance of trivial solutions. GATCluster is trained in an unsupervised manner with four self-learning tasks under the constraints of transformation invariance, separability maximization, entropy analysis, and attention mapping.</p>
<h4 id="associative-alignment-for-few-shot-image-classification-paper">Associative Alignment for Few-shot Image Classification (<a href="https://arxiv.org/abs/1912.05094">paper</a>)</h4>
<p>In few-shot image classification, the objective is to produce a model that can learn to recognize novel image classes when very few training examples are available. One of the popular approaches is meta-learning, which extracts common knowledge from a large amount of labeled data containing the base classes and uses it to train a model that can then classify images from novel concepts with only a few examples. The meta objective is to find a good set of initial weights that converge rapidly when trained on the new concepts. Interestingly, recent works demonstrated that standard transfer learning without meta-learning, where a feature extractor is first pre-trained on the base classes and a classifier is then fine-tuned on top of the frozen extractor using the few new examples, performs on par with more sophisticated meta-learning strategies. However, the freezing of the extractor during fine-tuning, necessary to avoid overfitting, hinders the performance.</p>
<figure style="width: 90%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/AssociativeAlignment.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1912.05094">Afrasiyabi et al.</a></figcaption>
</figure>
<p>The paper proposes a two-step approach to solve this. First, the feature extractor is used to produce features for the novel examples. The feature of each example is then
mapped to one of the base classes using a similarity metric in the embedding space. The second step consists of associative alignment, where the feature extractor is fine-tuned so that the embeddings of the novel images are pushed closer to the embeddings of their corresponding base images. This is done by either centroid alignment, where the distance between the center of each base class and the associated novel examples is reduced, or adversarial alignment, where a discriminator pushes the feature extractor to align the base and novel examples in the embedding space.</p>
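<p>The centroid-alignment variant boils down to a distance term between novel embeddings and their matched base-class centroids; a minimal sketch (squared Euclidean distance is an assumption made here for illustration) is:</p>
<pre><code class="language-python">import torch

def centroid_alignment_loss(novel_feats, assigned_base, base_centroids):
    """Sketch of the centroid alignment step: pull each novel-class embedding
    towards the centroid of the base class it was associated with.

    novel_feats:    (N, D) embeddings of the few novel-class examples.
    assigned_base:  (N,) index of the related base class for each example.
    base_centroids: (C, D) per-class mean embeddings computed on the base set.
    """
    targets = base_centroids[assigned_base]          # (N, D) matched centroids
    return (novel_feats - targets).pow(2).sum(dim=1).mean()
</code></pre>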
<h4 id="other-papers-1">Other Papers</h4>
<ul>
<li><a href="https://arxiv.org/abs/2008.11911">Domain Adaptation through Task Distillation</a></li>
<li><a href="https://arxiv.org/abs/2003.12056">Are Labels Necessary for Neural Architecture Search?</a></li>
<li><a href="https://arxiv.org/abs/2008.10599">The Hessian Penalty: A Weak Prior for Unsupervised Disentanglement</a></li>
<li><a href="https://arxiv.org/abs/1906.01526">Cross-Domain Cascaded Deep Translation</a></li>
<li><a href="https://arxiv.org/abs/2007.02454">Self-Challenging Improves Cross-Domain Generalization</a></li>
<li><a href="https://arxiv.org/abs/2007.07695">Label Propagation with Augmented Anchors for UDA</a></li>
<li><a href="https://arxiv.org/abs/2007.07695">Regularization with Latent Space Virtual Adversarial Training</a></li>
<li><a href="https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123490494.pdf">Transporting Labels via Hierarchical Optimal Transport for Semi-Supervised Learning</a></li>
<li><a href="https://arxiv.org/abs/2003.12060">Negative Margin Matters: Understanding Margin in Few-shot Classification</a></li>
<li><a href="https://arxiv.org/abs/2007.09584">Rethinking Few-Shot Image Classification: a Good Embedding Is All You Need?</a></li>
<li><a href="https://arxiv.org/abs/1911.10713">Prototype Rectification for Few-Shot Learning</a></li>
</ul>
<h1 id="3d-computer-vision--robotics">3D Computer Vision & Robotics</h1>
<h4 id="nerf-representing-scenes-as-neural-radiance-fields-for-view-synthesis-paper">NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (<a href="https://arxiv.org/abs/2003.08934">paper</a>)</h4>
<p>3D view synthesis from 2D images is a challenging problem, especially if the input 2D images are sparsely sampled.
The goal is to train a model that takes a set of 2D images of a 3D scene (with the optional camera pose and its intrinsics),
then, using the trained model, we can render novel views of the 3D scene that were not found in the input 2D images.
A successful approach is voxel-based representations that represent the 3D scene on a discretized grid, where a 3D <a href="https://en.wikipedia.org/wiki/Voxel">voxel</a> grid of RGB-alpha values is predicted using a 3D CNN. However, such methods are memory inefficient since they scale cubically with the spatial resolution, can be hard to optimize, and are not able to parametrize scene surfaces smoothly.
A recent trend in the computer vision community is to represent a given 3D scene as a continuous function using a fully-connected neural network, so the neural network itself is a compressed representation of the 3D scene, trained using the set of 2D images and then used to render novel views. Still, such methods were so far not able to match the voxel-based ones.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/NERF.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2003.08934">Mildenhall et al.</a></figcaption>
</figure>
<p>NeRF (neural radiance fields) represents a scene as a continuous 5D function using a fully-connected network of 9 layers and 256 channels, whose input is a single continuous 5D coordinate, ie., a 3D spatial location (\(x\), \(y\), \(z\)) and a viewing direction (\(\theta\), \(\phi\)), and whose output is an RGB color and an opacity (volume density).
To synthesize a given view, the rendering procedure queries these 5D coordinates along camera rays and uses classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize the representation is a set of images with known camera poses. This way, NeRF is able to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance with a simple reconstruction loss between the rendered images and the ground truths, and demonstrates results that outperform prior work on neural rendering and view synthesis.</p>
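<p>The differentiable rendering step follows the standard alpha-compositing quadrature; a minimal sketch for a single ray (shapes are illustrative, and hierarchical sampling and positional encoding are omitted) looks like this:</p>
<pre><code class="language-python">import torch

def volume_render(rgb, sigma, deltas):
    """Minimal sketch of the volume rendering step used to supervise NeRF.

    rgb:    (S, 3) colors predicted at S samples along one camera ray.
    sigma:  (S,)   predicted densities at those samples.
    deltas: (S,)   distances between consecutive samples.
    Returns the composited pixel color, differentiable w.r.t. rgb and sigma.
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)              # opacity of each segment
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)  # accumulated transmittance
    weights = alpha * trans                                # contribution of each sample
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)        # (3,) rendered pixel color
</code></pre>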
<h4 id="towards-streaming-perception-paper">Towards Streaming Perception (<a href="https://arxiv.org/abs/2005.10420">paper</a>)</h4>
<p>Practical applications such as self-driving vehicles require fast reaction times similar to those of humans, which are typically around 200 milliseconds. In such settings, low-latency algorithms are required to ensure safe operation. However, even if the latency of computer vision algorithms is often studied, it has primarily been explored in an offline setting, while vision for online perception imposes quite different latency demands: by the time an algorithm finishes processing a particular image frame, say after 200ms, the surrounding world has already changed, as shown in the figure below. This forces the perception to be ultimately predictive of the future, which is a fundamental property of human vision (e.g., as required whenever a baseball player strikes a fast ball).</p>
<figure style="width: 70%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/streaming_perception.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2005.10420">Li et al.</a></figcaption>
</figure>
<p>To develop better benchmarks that reflect real-world scenarios and make comparing existing methods easier, the paper introduces
the objective of streaming perception, ie., real-time online perception, and proposes a new meta-benchmark that systematically converts any image understanding task into a streaming image understanding task. This benchmark is built on a key insight: streaming perception requires understanding the state of the world at all time instants. So when a new frame arrives, streaming algorithms must report the state of the world even if they have not finished processing the previous frame, forcing them to consider the amount of streaming data that must be ignored while the computation is occurring. Specifically, when comparing the model’s outputs and the ground truths, the alignment is done using time instead of the input index, so the model needs to give the correct prediction for time step \(t\) before having processed the corresponding input, ie., if the model takes \(\Delta t\) to process an input, it can
only use data before \(t - \Delta t\) to predict the output corresponding to time \(t\).</p>
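<p>The time-based matching rule can be sketched in a few lines: for every ground-truth timestamp, score whatever prediction had already been produced by that time. Function and variable names below are hypothetical, meant only to illustrate the alignment, not the benchmark’s actual evaluation code:</p>
<pre><code class="language-python">import bisect

def streaming_predictions(pred_times, preds, query_times):
    """Sketch of the time-based evaluation alignment: for every ground-truth
    timestamp, take the most recent prediction that was already finished at
    that time (or None if no prediction exists yet).

    pred_times:  sorted list of wall-clock times at which predictions completed.
    preds:       the corresponding predictions.
    query_times: timestamps of the ground-truth annotations.
    """
    matched = []
    for t in query_times:
        i = bisect.bisect_right(pred_times, t) - 1   # last prediction finished before t
        matched.append(None if i == -1 else preds[i])
    return matched
</code></pre>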
<h4 id="teaching-cameras-to-feel-estimating-tactile-physical-properties-of-surfaces-from-images-paper">Teaching Cameras to Feel: Estimating Tactile Physical Properties of Surfaces From Images (<a href="https://arxiv.org/abs/2004.14487">paper</a>)</h4>
<p>Humans are capable of forming a mental model at a young age that maps the perception of an object with a perceived sense of touch, which is based on previous experiences when interacting with different items. Having autonomous agents equipped with such a mental model can be a very valuable tool when interacting with novel objects, especially when a simple object class is not informative enough to accurately estimate tactile physical properties.</p>
<figure style="width: 90%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/camera_feels.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2004.14487">Purri et al.</a></figcaption>
</figure>
<p>In order to simulate such a mental model in a more direct manner, the paper proposes to estimate the physical properties directly, allowing the attributes of objects to be utilized directly. First, the authors propose a dataset of 400+ surface image sequences and tactile property measurements. Since, when estimating surface properties, people often unconsciously move their heads to acquire multiple views of a surface, the captured image sequences comprise multiple viewing angles for each material surface.
Then, they propose a cross-modal framework for learning the complex mapping from visual cues to tactile properties.
The training objective of the model is to generate precise tactile property estimates given visual information. Both visual and tactile information are embedded into a shared latent space through separate encoder networks. A generator function then estimates tactile property values from the embedded visual vector, while a discriminator network learns to predict whether a tactile-visual pair is a real or a synthetic example. During inference, the encoder-generator pair is used to infer the tactile properties of the input images.</p>
<figure style="width: 80%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/camera_feels_model.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2004.14487">Purri et al.</a></figcaption>
</figure>
<h4 id="convolutional-occupancy-networks-paper">Convolutional Occupancy Networks (<a href="https://arxiv.org/abs/2003.04618">paper</a>)</h4>
<p>3D reconstruction is an important problem in computer vision with numerous applications. An ideal representation of 3D geometry should be able to: a) encode complex geometries and arbitrary topologies, b) scale to large scenes, c) encapsulate local and global information, and d) be tractable in terms of memory and computation. However, existing representations for 3D reconstruction do not satisfy all of these requirements. While recent implicit neural representations have demonstrated impressive performances
in 3D reconstruction, they suffer from some limitations due to their simple fully-connected network architecture, which does not allow for integrating local information from the observations or incorporating inductive biases such as translational equivariance.</p>
<figure style="width: 90%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/CON.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2003.04618">Peng et al.</a></figcaption>
</figure>
<p>Convolutional Occupancy Networks use convolutional encoders with implicit occupancy decoders to incorporate inductive biases and enable structured reasoning in 3D space. This results in more fine-grained implicit 3D reconstructions of single objects, scales to large indoor scenes, and generalizes well from synthetic to real data.</p>
<h4 id="other-papers-2">Other Papers</h4>
<ul>
<li><a href="http://arxiv.org/abs/2008.01295">Tracking Emerges by Looking Around Static Scenes, with Neural 3D Mapping</a></li>
<li><a href="https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123460324.pdf">Privacy Preserving Structure-from-Motion</a></li>
<li><a href="http://arxiv.org/abs/2007.07247">Multiview Detection with Feature Perspective Transformation</a></li>
<li><a href="https://arxiv.org/abs/2008.07931">Motion Capture from Internet Videos</a></li>
<li><a href="https://arxiv.org/abs/2003.10432">Atlas: End-to-End 3D Scene Reconstruction from Posed Images</a></li>
<li><a href="https://arxiv.org/abs/2006.12356">Generative Sparse Detection Networks for 3D Single-shot Object Detection</a></li>
<li><a href="https://arxiv.org/abs/2005.02138">PointTriNet: Learned Triangulation of 3D Point Sets</a></li>
<li><a href="https://arxiv.org/abs/2007.10453">Points2Surf: Learning Implicit Surfaces from Point Cloud Patches</a></li>
<li><a href="https://arxiv.org/abs/1912.03310">Geometric Capsule Autoencoders for 3D Point Clouds</a></li>
<li><a href="https://arxiv.org/abs/1803.00092">Deep Feedback Inverse Problem Solver</a></li>
<li><a href="https://arxiv.org/abs/2007.09529">Single View Metrology in the Wild</a></li>
<li><a href="https://arxiv.org/abs/2007.10982">Shape and Viewpoint without Keypoints</a></li>
<li><a href="https://arxiv.org/abs/2003.04232">Hierarchical Kinematic Human Mesh Recovery</a></li>
<li><a href="https://arxiv.org/abs/2007.13666">3D Human Shape and Pose from a Single Low-Resolution Image with Self-Supervised Learning</a></li>
<li><a href="https://arxiv.org/abs/2004.06302">Few-Shot Single-View 3D Object Reconstruction with Compositional Priors</a></li>
<li><a href="https://arxiv.org/abs/1912.03207">NASA: Neural Articulated Shape Approximation</a></li>
<li><a href="https://cse.buffalo.edu/~jsyuan/papers/2020/4836.pdf">Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation</a></li>
<li><a href="https://arxiv.org/abs/2007.15649">Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild</a></li>
</ul>
<h1 id="image-and-video-synthesis">Image and Video Synthesis</h1>
<h4 id="transforming-and-projecting-images-into-class-conditional-generative-networks-paper">Transforming and Projecting Images into Class-conditional Generative Networks (<a href="https://arxiv.org/abs/2005.01703">paper</a>)</h4>
<p>GANs are capable of generating diverse images from different classes. For instance, <a href="https://arxiv.org/abs/1809.11096">BigGAN</a>, a class-conditional GAN, is capable of generating a new image of a given class from a noise vector \(z\) and a class embedding \(c\). The image can then be manipulated by editing the latent variables of the noise vector and class embedding. But is the inverse possible? ie., given an input image, can we find the latent variable \(z\) and the class embedding \(c\) that best match the image? This problem remains challenging since many input images cannot be generated by a GAN. Additionally, the objective function has many local minima, and search algorithms can easily get stuck in such regions.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/pix2latent.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2005.01703">Huh et al.</a></figcaption>
</figure>
<p>To address these problems, the paper proposes pix2latent with two new ideas: estimating input transformations at scale, and using a non-local search algorithm to find better solutions. As illustrated above, given an input image, pix2latent first finds the best transformation so that the transformed input is likely to be generated by a GAN, then the image is projected into the latent space using the proposed BasinCMA optimization method. The obtained latent variables are then edited and projected back into the image space, obtaining an edited image, which can then be transformed with the inverse of the initial transformation.</p>
<h4 id="contrastive-learning-for-unpaired-image-to-image-translation-paper">Contrastive Learning for Unpaired Image-to-Image Translation (<a href="https://arxiv.org/abs/2007.15651">paper</a>)</h4>
<p>Given two training sets of images with different properties or modes, eg., images of horses and zebras, the objective of
unpaired image-to-image translation is to learn a translation function between the two modes, eg., transforming horses into zebras and vice-versa, while retaining the sensible information such as pose or size, without having access to a set of one-to-one matches between the two modes.
Existing methods such as <a href="https://arxiv.org/abs/1703.10593">CycleGAN</a> force the model to produce back-translated images that are consistent with the original ones. But such methods assume a bijection, which is often too restrictive since a given translated image might have many plausible source images. An ideal loss should be invariant to differences of style, but discriminative with respect to the content that must be preserved.</p>
<figure style="width: 80%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/contrastive_img2img.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2007.15651">Park et al.</a></figcaption>
</figure>
<p>Contrastive Unpaired Translation (CUT) aims to learn such an embedding space. In addition to the standard GAN loss, where the generator is trained to generate realistic translated images while the discriminator tries to differentiate between the translated images and real ones, an additional loss pushes for similar embeddings between two corresponding patches from the input and translated images. It is optimized with a contrastive objective that pulls together the embeddings of the two corresponding patches while pushing away the embeddings of a given patch and its negatives, which are randomly sampled patches (ie., only internal patches from the same input image are used; external ones from other images decrease the performance).</p>
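<p>The patch-wise contrastive term is essentially an InfoNCE loss where the positive is the patch at the same spatial location; a minimal sketch, with illustrative shapes and temperature, is given below:</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def patch_nce_loss(feat_src, feat_tgt, temperature=0.07):
    """Sketch of the patch-wise contrastive loss: each translated patch should
    match the patch at the same location in the source image (positive) and
    not the other patches from the same source image (negatives).

    feat_src, feat_tgt: (N, D) embeddings of N corresponding patches from the
    input image and the translated image.
    """
    feat_src = F.normalize(feat_src, dim=-1)
    feat_tgt = F.normalize(feat_tgt, dim=-1)
    logits = feat_tgt @ feat_src.t() / temperature   # (N, N) patch similarities
    labels = torch.arange(feat_src.size(0))          # positive is the same location
    return F.cross_entropy(logits, labels)
</code></pre>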
<h4 id="rewriting-a-deep-generative-model-paper">Rewriting a Deep Generative Model (<a href="https://arxiv.org/abs/2007.15646">paper</a>)</h4>
<p>GANs are capable of modeling a rich set of semantic and physical rules about the data distribution, but up to now, it has remained obscure how such rules are encoded in the network, or how a rule could be changed. This paper introduces a new problem setting: the manipulation of specific rules encoded by a deep generative model. Given a generative model, the objective is to adjust its weights so that
the modified model follows new rules, and generates images that follow the new set of rules as shown below.</p>
<figure style="width: 80%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/RewritingGaNs.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2007.15646">Bau et al.</a></figcaption>
</figure>
<p>By viewing each layer as an associative memory that stores latent rules as a set of key-value relationships over hidden features, the model can be edited with a constrained optimization that adds or edits one specific rule within the associative memory while preserving the existing semantic relationships in the model as much as possible. The paper does this by directly measuring and manipulating the model’s internal structure, without requiring any new training data.</p>
<h4 id="learning-stereo-from-single-images-paper">Learning Stereo from Single Images (<a href="https://arxiv.org/abs/2008.01484">paper</a>)</h4>
<p>Given a pair of corresponding images, the goal of stereo matching is to estimate the per-pixel horizontal displacement (i.e. disparity) between the corresponding location of every pixel from the first view to the second, or vice-versa. While fully supervised methods give good results, the precise ground truth disparity between a pair of stereo images is often hard to acquire.
A possible alternative is to train on synthetic data, then fine-tune on a limited amount of real labeled data. But without
a fine-tuning step with enough labels, such models are not capable of generalizing well to real images.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/LearningStereo.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2008.01484">Watson et al.</a></figcaption>
</figure>
<p>The paper proposes a novel and fully automatic pipeline for generating stereo training data from unstructured collections of single images, given a depth-from-color model, requiring neither synthetic data nor stereo pairs of images for training. Using a depth estimation network, a given left input image is first converted into a synthesized right image by a forward warping operation based on the disparity derived from the predicted depth. With such stereo pairs of images, a stereo network can then be trained in a supervised manner, resulting in a model that generalizes well.</p>
<h4 id="what-makes-fake-images-detectable-understanding-properties-that-generalize-paper">What makes fake images detectable? Understanding properties that generalize (<a href="https://arxiv.org/abs/2008.10588">paper</a>)</h4>
<p>Although the quality of GAN-generated images is reaching impressive levels, deep networks trained to detect fake images can still pick up on the subtle artifacts in these generated images, and such trained networks can also find the same artifacts across
many models trained on different datasets and with different methods. This paper aims to visualize and understand which artifacts are shared between models and are easily detectable and transferable across different scenarios.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/fakeimages.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2008.10588">Chai et al.</a></figcaption>
</figure>
<p>Since the global facial structure can vary among different generators and datasets, local patches of the generated images are more stereotyped and may share redundant artifacts. To this end, a fully-convolutional patch-based classifier is used to focus on local patches rather than the global structure. The patch-level classifier can then be used to visualize and categorize the patches that are most indicative of real or fake images across various test datasets. Additionally, the generated images can be manipulated to exaggerate the characteristic attributes of fake images.</p>
<h4 id="other-papers-3">Other Papers</h4>
<ul>
<li><a href="https://arxiv.org/abs/2008.05511">Free View Synthesis</a></li>
<li><a href="http://arxiv.org/abs/2007.15068">Unselfie: Translating Selfies to Neutral-pose Portraits in the Wild</a></li>
<li><a href="https://arxiv.org/abs/2007.08509">World-Consistent Video-to-Video Synthesis</a></li>
<li><a href="http://arxiv.org/abs/2007.08513">RetrieveGAN: Image Synthesis via Differentiable Patch Retrieval</a></li>
<li><a href="https://arxiv.org/abs/1912.02401">Generating Videos of Zero-Shot Compositions of Actions and Objects</a></li>
<li><a href="https://arxiv.org/abs/2007.15649">Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild</a></li>
<li><a href="http://arxiv.org/abs/2003.08872">Across Scales & Across Dimensions: Temporal Super-Resolution using Deep Internal Learning</a></li>
<li><a href="https://arxiv.org/abs/2008.09180">Conditional Entropy Coding for Efficient Video Compression</a></li>
<li><a href="https://arxiv.org/abs/2008.10598">Semantic View Synthesis</a></li>
<li><a href="https://arxiv.org/abs/2008.09370">Learning Camera-Aware Noise Models</a></li>
<li><a href="https://arxiv.org/abs/2004.00049">In-Domain GAN Inversion for Real Image Editing</a></li>
</ul>
<h1 id="vision-and-language">Vision and Language</h1>
<h4 id="connecting-vision-and-language-with-localized-narratives-paper">Connecting Vision and Language with Localized Narratives (<a href="https://arxiv.org/abs/1912.03098">paper</a>)</h4>
<p>One of the popular ways of connecting vision and language is image captioning, where each image is paired with human-authored textual captions, but this link is only at the full image scale,
where the sentences describe the whole image.
To improve this linking, grounded image captioning adds links between specific parts of the image caption and object boxes
in the image. However, the links are still very sparse, the majority of objects and words are not grounded, and the annotation process is expensive.</p>
<figure style="width: 80%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/LocalizedNarratives.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1912.03098">Pont-Tuset et al.</a></figcaption>
</figure>
<p>The paper proposes a new and efficient form of multi-modal image annotations for connecting vision and language called Localized Narratives. Localized Narratives are generated by asking the annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. For instance, as shown in the figure above, the annotator says “woman” while using their mouse to indicate her spatial extent, thus providing visual grounding for this noun. Later they move the mouse from the woman to the balloon following its string, saying “holding”. This provides direct visual grounding of this relation. They also describe attributes like “clear blue sky” and “light blue jeans”. Since voice is synchronized to the mouse pointer, the image location of every single word in the description can be determined. This provides dense visual grounding in the form of a mouse trace segment for each word.
This rich form of annotation with multiple modalities (ie., image, text, speech and grounding) can be used for different tasks such as text-to-image generation, visual question answering and voice-driven environment navigation, or for a more fine-grained
control of tasks, such as conditioning captions on specific parts of the image, which can be used by a person with imperfect vision
to get descriptions of specific parts by hovering their finger over the image.</p>
<h4 id="uniter-universal-image-text-representation-learning-paper">UNITER: UNiversal Image-TExt Representation Learning (<a href="https://arxiv.org/abs/1909.11740">paper</a>)</h4>
<p>Most Vision-and-Language (V&L) tasks, such as Visual Question Answering (VQA), rely on joint multi-modal embeddings to bridge the semantic gap between the visual and textual clues in images and text, but such representations are usually tailored for specific tasks and require specific architectures. In order to learn general joint embeddings that
can be used for all V&L downstream tasks, the paper introduces UNITER, a large-scale pre-trained model for joint multi-modal embedding, illustrated below. Based on the transformer model, UNITER is pre-trained on 4 tasks: Masked Language Modeling (MLM) conditioned on the image, where randomly masked words are recovered using both image and text features; Masked Region Modeling (MRM) conditioned on the text, where the model reconstructs some regions of a given image; Image-Text Matching (ITM), where the model predicts if an image and a text instance are paired or not; and Word-Region Alignment (WRA), where the model learns the optimal alignment between words and image regions using optimal transport. To use UNITER on downstream tasks, they are first reformulated as a classification problem, then a classifier added on top of the [CLS] features is trained using a cross-entropy loss.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/UNITER.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1909.11740">Chen et al.</a></figcaption>
</figure>
<h4 id="learning-to-learn-words-from-visual-scenes-paper">Learning to Learn Words from Visual Scenes (<a href="https://arxiv.org/abs/1911.11237">paper</a>)</h4>
<p>The standard approach in vision and language consists of learning a common embedding space, however this approach is inefficient, often requiring millions of examples to learn, generalizes poorly to the natural compositional structure of language, and the learned embeddings are unable to adapt to novel words at inference time. So instead of learning the word embeddings, this paper proposes to learn the process for acquiring word embeddings.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/ECCV20/l2lwords.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1911.11237">Surís et al.</a></figcaption>
</figure>
<p>The model is based on the transformer model; at each iteration, the model receives an episode of image and language pairs, and then meta-learns a policy to acquire word representations from the episode. This produces a representation that is able to acquire novel words at inference time as well as generalize more robustly to novel compositions. Specifically, every task is formulated as a language acquisition task or an episode, consisting of training examples and testing examples, where the testing examples evaluate
the language acquired from the training examples. In the figure above, for instance, the model needs to acquire the word “chair” from the training samples, a word it has never seen before. The meta-training is done in the forward pass, where the model needs to point to the correct word, “chair”, in the training examples, and a matching loss is used to train the model.
After training on many episodes and tasks, the model is able to adapt very quickly to novel tasks during inference.</p>
<h4 id="other-papers-4">Other Papers</h4>
<ul>
<li><a href="https://arxiv.org/abs/2006.09920">Contrastive Learning for Weakly Supervised Phrase Grounding</a></li>
<li><a href="https://arxiv.org/abs/2004.02857">Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments</a></li>
<li><a href="http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123610052.pdf">Adaptive Text Recognition through Visual Matching</a></li>
</ul>
<h2 id="the-rest">The Rest</h2>
<p>Unfortunately, the number of papers makes the summarization task difficult and time-consuming, so for the rest of the papers I will simply list some that I came across, in case the reader is interested in the subjects.</p>
<details>
<summary>Click to expand</summary> <br />
<small> <div class="tip">
<p><strong>Deep Learning: Applications, Methodology, and Theory</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2007.09748">A Generic Visualization Approach for Convolutional Neural Networks</a></li>
<li><a href="https://arxiv.org/abs/2003.06696">Spike-FlowNet: Event-based Optical Flow Estimation</a></li>
<li><a href="https://arxiv.org/abs/2003.08505">A Metric Learning Reality Check</a></li>
<li><a href="https://arxiv.org/abs/1912.12773">Learning Predictive Models from Observation and Interaction</a></li>
<li><a href="https://arxiv.org/abs/2008.09269">Beyond Fixed Grid: Learning Geometric Image Representation with a Deformable Grid</a></li>
<li><a href="https://arxiv.org/abs/2008.05441">Stable Low-rank Tensor Decomposition for Compression of Convolutional Neural Network</a></li>
<li><a href="https://arxiv.org/abs/2007.02491">EagleEye: Fast Sub-net Evaluation for Efficient Neural Network Pruning</a></li>
<li><a href="https://arxiv.org/abs/2008.01777">Making Sense of CNNs: Interpreting Deep Representations & Their Invariances with INNs</a></li>
<li><a href="https://arxiv.org/abs/2003.09148">Event-based Asynchronous Sparse Convolutional Networks</a></li>
</ul>
<p><strong>Low level vision, Motion and Tracking</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2003.12039">RAFT: recurrent all pairs field transforms for optical flow</a></li>
<li><a href="https://arxiv.org/abs/2005.01616">VisualEchoes: Spatial Image Representation Learning through Echolocation</a></li>
<li><a href="https://arxiv.org/abs/2008.04237">Self-Supervised Learning of Audio-Visual Objects from Video</a></li>
<li><a href="https://arxiv.org/abs/2004.01177">Tracking Objects as Points</a></li>
</ul>
<p><strong>Face, Gesture, and Body Pose</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2007.12553">Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach</a></li>
<li><a href="https://arxiv.org/abs/2007.09355">Thinking in Frequency: Face Forgery Detection by Mining Frequency-aware Clues</a></li>
<li><a href="https://arxiv.org/abs/2003.09764">Lifespan Age Transformation Synthesis</a></li>
<li><a href="https://arxiv.org/abs/2008.09062">Monocular Expressive Body Regression through Body-Driven Attention</a></li>
<li><a href="https://arxiv.org/abs/2003.08386">DLow: Diversifying Latent Flows for Diverse Human Motion Prediction</a></li>
<li><a href="https://arxiv.org/abs/2008.10174">Fast Bi-layer Neural Synthesis of One-Shot Realistic Head Avatars</a></li>
<li><a href="https://arxiv.org/abs/2008.00418">Blind Face Restoration via Deep Multi-scale Component Dictionaries</a></li>
</ul>
<p><strong>Action Recognition, Understanding</strong>:</p>
<ul>
<li><a href="http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/-123640494.pdf">RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition</a></li>
<li><a href="https://arxiv.org/abs/2008.05861">Self-supervised Video Representation Learning by Pace Prediction</a></li>
<li><a href="https://arxiv.org/abs/2007.04515">Aligning Videos in Space and Time</a></li>
<li><a href="https://arxiv.org/abs/-1911.10967">Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video</a></li>
<li><a href="https://arxiv.org/abs/2007.10984">Foley Music: Learning to Generate Music from Videos</a></li>
</ul>
</div></small></details>
<h1 id="cvpr-2020-a-snapshot">CVPR 2020: A Snapshot</h1>
<p>The first virtual CVPR conference ended, with 1467 papers accepted, 29 tutorials, 64 workshops, and 7.6k virtual attendees. The huge number of papers and the new virtual format made navigating the conference overwhelming (and very slow) at times. To get a grasp of the general trends of the conference this year, I will present in this blog post a sort of snapshot of the conference by summarizing some papers (& listing some) that grabbed my attention.</p>
<ul>
<li>All of the papers can be found here: <a href="http://openaccess.thecvf.com/CVPR2020.py">CVPR2020 open access</a></li>
<li>CVPR virtual website: <a href="http://cvpr20.com/">CVPR2020 virtual</a></li>
</ul>
<p><em>Disclaimer: This post is not a representation of the papers and subjects presented in CVPR; it is just a personal overview of what I found interesting. Any feedback is welcomed!</em></p>
<ul>
<li><a href="#cvpr-2020-in-numbers">CVPR 2020 in numbers</a></li>
<li><a href="#recognition-detection-and-segmentation">Recognition, Detection and Segmentation</a></li>
<li><a href="#generative-models-and-image-synthesis">Generative models and image synthesis</a></li>
<li><a href="#representation-learning">Representation Learning</a></li>
<li><a href="#computational-photography">Computational photography</a></li>
<li><a href="#transferlow-shotsemiunsupervised-learning">Transfer/Low-shot/Semi/Unsupervised Learning</a></li>
<li><a href="#vision-and-language">Vision and Language</a></li>
<li><a href="#the-rest">The rest</a></li>
</ul>
<h1 id="cvpr-2020-in-numbers">CVPR 2020 in numbers</h1>
<p>The statistics presented in this section are taken from the official <a href="https://www.youtube.com/watch?v=aHUYXtbwl_8&t=1903s">Opening & Awards</a> presentation. Let’s start with some general statistics:</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/participants.png" />
<figcaption>Image source: <a href="https://www.youtube.com/watch?v=aHUYXtbwl_8&t=1903s">Opening & Awards presentation.</a></figcaption>
</figure>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/acceptance.png" />
<figcaption>Image source: <a href="https://www.youtube.com/watch?v=aHUYXtbwl_8&t=1903s">Opening & Awards presentation.</a></figcaption>
</figure>
<p>The trends of earlier years continued with a 20% increase in authors and a 29% increase in submitted papers, accompanied by a rise in the number of reviewers and area chairs to accommodate this expansion.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/authors_distribution.png" />
<figcaption>Image source: <a href="https://www.youtube.com/watch?v=aHUYXtbwl_8&t=1903s">Opening & Awards presentation.</a></figcaption>
</figure>
<p>Similar to last year, China is the largest contributor to CVPR in terms of accepted papers, with Tsinghua University having the highest number of authors, followed by the USA as the second contributor by country and Google by organization.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/subject_areas.png" />
<figcaption>Image source: <a href="https://www.youtube.com/watch?v=aHUYXtbwl_8&t=1903s">Opening & Awards presentation.</a></figcaption>
</figure>
<p>As expected, the majority of the accepted papers focus on topics related to learning, recognition, detection, and understanding. However, there is an increasing interest in relatively new areas such as label-efficient methods (e.g., transfer learning), image synthesis and robotic perception. Some emerging topics like fairness and explainable AI are also starting to gather more attention within the computer vision community.</p>
<h1 id="recognition-detection-and-segmentation">Recognition, Detection and Segmentation</h1>
<h4 id="pointrend-image-segmentation-as-rendering-paper">PointRend: Image segmentation as rendering (<a href="https://arxiv.org/abs/1912.08193">paper</a>)</h4>
<p>Image segmentation models, such as <a href="https://arxiv.org/abs/1703.06870">Mask R-CNN</a>, typically operate on regular grids: the input image is a regular grid of pixels, their hidden representations are feature vectors on a regular grid, and their outputs are label maps on a regular grid. However, a regular grid will unnecessarily oversample the smooth areas while simultaneously undersampling object boundaries, often resulting in blurry contours, as illustrated in the right figure below.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/pointrend.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1912.08193">Kirillov et al.</a></figcaption>
</figure>
<p>The paper proposes to view image segmentation as a rendering problem and adapts classical ideas from computer graphics to render high-quality label maps efficiently. This is done using a neural network module called PointRend, which takes as input a number of CNN feature maps defined over regular grids and outputs high-resolution predictions over a finer grid. These fine predictions are made only at carefully selected points, chosen to lie near high-frequency areas such as object boundaries where the predictions are uncertain (i.e., similar to <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.107.3997&rep=rep1&type=pdf">adaptive subdivision</a>); the coarse predictions are upsampled and a small subhead makes the final prediction from the point-wise features. A rough sketch of the point-selection idea follows below.</p>
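<p>As a rough illustration of the point-selection step (not the authors’ exact implementation), one can rank the locations of a coarse mask prediction by uncertainty, e.g., how close the foreground probability is to 0.5, and refine only the top-k most uncertain points:</p>
<pre><code class="language-python">
import torch

def select_uncertain_points(coarse_logits, k=1024):
    """coarse_logits: (B, 1, H, W) mask logits from the coarse prediction.
    Returns the flat indices of the k most uncertain locations per image,
    where uncertainty is how close the probability is to 0.5."""
    probs = coarse_logits.sigmoid().flatten(1)      # (B, H*W)
    uncertainty = -(probs - 0.5).abs()              # higher = more uncertain
    _, idx = uncertainty.topk(k, dim=1)
    return idx

coarse = torch.randn(2, 1, 56, 56)                  # toy coarse predictions
points = select_uncertain_points(coarse, k=100)     # (2, 100) flat indices
</code></pre>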
<h4 id="self-training-with-noisy-student-improves-imagenet-classification-paper">Self-training with Noisy Student improves ImageNet classification (<a href="https://arxiv.org/abs/1911.04252">paper</a>)</h4>
<p>Semi-supervised learning methods work quite well in the low-data regime, but with a large amount of labeled data, fully-supervised learning still works best.
In this paper, the authors revisit this assumption and show that noisy self-training works well even when labeled data is abundant.</p>
<figure style="width: 60%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/noisy-self-training.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1911.04252">Xie et al.</a></figcaption>
</figure>
<p>The method uses a large corpus of unlabeled images (i.e., from a different distribution than the ImageNet training set) and consists of three main steps. First, a teacher model is trained on the labeled images. The trained teacher is then used to generate pseudo-labels on the unlabeled images, which are used to train a student model on the combination of labeled and pseudo-labeled images; the student model is larger than the teacher (e.g., starting with <a href="https://arxiv.org/pdf/1905.11946.pdf">EfficientNetB0</a> then EfficientNetB3) and is trained with injected noise (e.g., dropout). The student is then considered as a teacher, and the last two steps are repeated a few times, relabeling the unlabeled data and training a new student. The final model achieves SOTA on ImageNet top-1 accuracy and shows a higher degree of robustness. A simplified sketch of one round of this loop is given below.</p>
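<p>Here is a highly simplified, runnable sketch of one round of the self-training loop; the linear models, random tensors, and dropout noise are placeholders standing in for the EfficientNet teachers/students, the ImageNet data, and the richer noise used in the paper.</p>
<pre><code class="language-python">
import torch
import torch.nn as nn

teacher = nn.Linear(128, 10)                   # stand-in for an already-trained teacher
student = nn.Sequential(nn.Dropout(0.5),       # student is noised (dropout here)
                        nn.Linear(128, 10))    # and would normally be larger
opt = torch.optim.SGD(student.parameters(), lr=0.1)

labeled_x, labeled_y = torch.randn(32, 128), torch.randint(0, 10, (32,))
unlabeled_x = torch.randn(64, 128)

# 1) the teacher pseudo-labels the unlabeled images (no noise at inference)
with torch.no_grad():
    pseudo_y = teacher(unlabeled_x).argmax(dim=1)

# 2) the noised student trains on labeled + pseudo-labeled data
x, y = torch.cat([labeled_x, unlabeled_x]), torch.cat([labeled_y, pseudo_y])
loss = nn.CrossEntropyLoss()(student(x), y)
loss.backward(); opt.step()

# 3) the student then becomes the new teacher and the loop repeats
</code></pre>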
<h4 id="designing-network-design-spaces-paper">Designing network design spaces (<a href="https://arxiv.org/abs/2003.13678">paper</a>)</h4>
<p>Instead of focusing on designing individual network instances, this paper focuses on designing network design spaces that parametrize populations of networks,
in order to find guiding design principles for fast and simple networks.</p>
<figure style="width: 70%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/designspaces.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2003.13678">Radosavovic et al.</a></figcaption>
</figure>
<p>The proposed method focuses on finding a good model population instead of good model instances (e.g., as in neural architecture search). Based on the comparison paradigm of <a href="https://arxiv.org/abs/1905.13214">distribution estimates</a>, the process consists of initializing a design space A, then introducing a new design principle to obtain a new and refined design space B containing simpler and better models. The process is repeated until the resulting population consists of models that are more likely to be robust and generalize well.</p>
<h4 id="efficientdet-scalable-and-efficient-object-detection-paper">EfficientDet: Scalable and Efficient Object Detection (<a href="https://arxiv.org/abs/1911.09070">paper</a>)</h4>
<p>EfficientDet achieves state-of-the-art results in object detection with better efficiency across a wide range of resource constraints.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/EfficientDet.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1911.09070">Tan et al.</a></figcaption>
</figure>
<p>The model’s architecture, with an <a href="https://arxiv.org/abs/1905.11946">EfficientNet</a> backbone, consists of two new design choices: a Feature Pyramid Network (<a href="https://arxiv.org/abs/1612.03144">FPN</a>) with a bidirectional topology, or BiFPN, and learned weights when merging the features from different scales (a sketch of this fusion is given below). Additionally, the network is designed with compound scaling, where the backbone, class/box network and input resolution are jointly adapted to meet a wide spectrum of resource constraints, instead of simply employing bigger backbone networks as done in previous works.</p>
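<p>The weighted feature fusion can be sketched roughly as follows, assuming the paper’s “fast normalized fusion” variant: non-negative learnable scalars normalized by their sum. The layer is a simplification; input shapes are illustrative.</p>
<pre><code class="language-python">
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fuse N same-shaped feature maps with learned, normalized weights."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, feats):                # list of (B, C, H, W) tensors
        w = F.relu(self.weights)             # keep the weights non-negative
        w = w / (w.sum() + self.eps)         # "fast normalized fusion"
        return sum(wi * fi for wi, fi in zip(w, feats))

fuse = WeightedFusion(num_inputs=2)
out = fuse([torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)])
</code></pre>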
<h4 id="dynamic-convolution-attention-over-convolution-kernels-paper">Dynamic Convolution: Attention Over Convolution Kernels (<a href="https://arxiv.org/abs/1912.03458">paper</a>)</h4>
<p>One of the main problems with light-weight CNNs, such as <a href="https://arxiv.org/abs/1801.04381">MobileNetV2</a>, is their limited representation capability due to the constrained depth (i.e., number of layers) and width (i.e., number of channels) to maintain low computational requirements. In this paper, the authors propose dynamic convolutions to boost the capability of the convolution layers by aggregating the results of multiple parallel convolutions with attention weights, without increasing the computation significantly.</p>
<figure style="width: 70%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/dynamicconvs.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1912.03458">Chen et al.</a></figcaption>
</figure>
<p>Dynamic convolutions consist of applying K convolution kernels that share the same kernel size and input/output dimensions instead of a single one; their results are then aggregated using attention weights produced by a small attention module. For faster training, the attention weights are constrained so that each lies in the range [0, 1] and they sum to one. A minimal sketch of this aggregation is shown below.</p>
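<p>The following is a minimal sketch of the aggregation step, assuming K parallel kernels and a small attention module producing a softmax over them; it omits the temperature schedule and efficiency tricks discussed in the paper.</p>
<pre><code class="language-python">
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Aggregate K conv kernels with input-dependent attention weights."""
    def __init__(self, in_ch, out_ch, k=3, K=4):
        super().__init__()
        self.K, self.pad = K, k // 2
        self.kernels = nn.Parameter(torch.randn(K, out_ch, in_ch, k, k) * 0.02)
        self.attn = nn.Linear(in_ch, K)            # tiny attention module

    def forward(self, x):                          # x: (B, in_ch, H, W)
        pooled = x.mean(dim=(2, 3))                # global average pooling
        pi = F.softmax(self.attn(pooled), dim=1)   # (B, K), in [0, 1], sums to 1
        outs = []
        for b in range(x.size(0)):                 # per-sample aggregated kernel
            w = (pi[b].view(self.K, 1, 1, 1, 1) * self.kernels).sum(dim=0)
            outs.append(F.conv2d(x[b:b + 1], w, padding=self.pad))
        return torch.cat(outs)

y = DynamicConv2d(16, 32)(torch.randn(2, 16, 8, 8))   # (2, 32, 8, 8)
</code></pre>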
<h4 id="polarmask-single-shot-instance-segmentation-with-polar-representation-paper">PolarMask: Single Shot Instance Segmentation with Polar Representation (<a href="https://arxiv.org/abs/1909.13226">paper</a>)</h4>
<p>PolarMask proposes to represent the masks for each detected object in an instance segmentation task using polar coordinates. Polar representation compared to Cartesian representation has many inherent advantages: (1) The origin point of the Polar coordinates can be seen as the center of the object. (2) Starting from the origin point, the contour of the object can be determined only by the distance from the center and the angle. (3) The angle is naturally directional (starting from 0° to 360°) and makes it very convenient to connect the points into a whole contour.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/polarmask.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1909.13226">Xie et al.</a></figcaption>
</figure>
<p>The model is based on <a href="https://arxiv.org/abs/1904.01355">FCOS</a>, where for a given instance, we have three outputs: the classification probabilities over \(k\) classes (e.g., \(k= 80\) on COCO dataset), the center of the object (Polar Centerness), and the distances from the center (Mask Regression). The paper proposes to use \(n = 36\) distances from the center, so the angle between two points in the contour is 10° in this case. Based on these outputs, the extent of each object can be detected easily in a single shot manner without needing a sub-head network for pixel-wise segmentation over each detected object as in <a href="https://arxiv.org/abs/1703.06870">Mask-RCNN</a>.</p>
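<p>To make the polar representation concrete, the sketch below converts a predicted object center and n = 36 ray lengths into contour points in Cartesian coordinates; this is only the decoding step, not the full detector.</p>
<pre><code class="language-python">
import math
import torch

def polar_to_contour(center, distances):
    """center: (2,) tensor (x, y); distances: (n,) ray lengths sampled every
    360/n degrees. Returns an (n, 2) tensor of contour points."""
    n = distances.numel()
    angles = torch.arange(n, dtype=torch.float32) * (2 * math.pi / n)
    xs = center[0] + distances * torch.cos(angles)
    ys = center[1] + distances * torch.sin(angles)
    return torch.stack([xs, ys], dim=1)

contour = polar_to_contour(torch.tensor([64.0, 48.0]), torch.rand(36) * 20)
</code></pre>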
<h4 id="other-papers">Other papers:</h4>
<ul>
<li><a href="https://arxiv.org/abs/2001.01629">Deep Snake for Real-Time Instance Segmentation</a></li>
<li><a href="https://arxiv.org/abs/2004.13621">Exploring Self-attention for Image Recognition</a></li>
<li><a href="https://arxiv.org/abs/1912.02424">Bridging the Gap Between Anchor-based and Anchor-free Detection</a></li>
<li><a href="https://arxiv.org/abs/1912.05027">SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization</a></li>
<li><a href="https://arxiv.org/abs/2003.14142">Look-into-Object: Self-supervised Structure Modeling for Object Recognition</a></li>
<li><a href="https://arxiv.org/abs/2004.00445">Learning to Cluster Faces via Confidence and Connectivity Estimation</a></li>
<li><a href="https://arxiv.org/abs/2003.11113">PADS: Policy-Adapted Sampling for Visual Similarity Learning</a></li>
<li><a href="https://arxiv.org/abs/2001.07437">Evaluating Weakly Supervised Object Localization Methods Right</a></li>
<li><a href="https://arxiv.org/abs/2001.00309">BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation</a></li>
<li><a href="http://openaccess.thecvf.com/content_CVPR_2020/papers/Liu_Hyperbolic_Visual_Embedding_Learning_for_Zero-Shot_Recognition_CVPR_2020_paper.pdf">Hyperbolic Visual Embedding Learning for Zero-Shot Recognition</a></li>
<li><a href="https://arxiv.org/abs/2005.08104">Single-Stage Semantic Segmentation from Image Labels</a></li>
</ul>
<h1 id="generative-models-and-image-synthesis">Generative models and image synthesis</h1>
<h4 id="learning-physics-guided-face-relighting-under-directional-light-paper">Learning Physics-Guided Face Relighting Under Directional Light (<a href="https://arxiv.org/abs/1906.03355">paper</a>)</h4>
<p>Relighting involves adjusting the lighting of an unseen source image with its corresponding directional light, towards the new desired directional light. The previous works give good results but are limited to smooth lighting and do not model non-diffuse effects such as cast shadows and specularities.</p>
<p>To create precise and believable relighting results that generalize to complex illumination conditions and challenging poses, the authors propose
an end-to-end deep learning architecture that both de-lights and relights an image of a human face. This is done in two stages, as shown below.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/relighting.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1906.03355">Nestmeyer et al.</a></figcaption>
</figure>
<p>The first stage consists of predicting the <a href="https://en.wikipedia.org/wiki/Albedo">albedo</a> and <a href="https://en.wikipedia.org/wiki/Normal_(geometry)">normals</a> of the input
image using a U-Net architecture; the desired directional light is then used with the normals to predict the shading and then the <a href="http://learnwebgl.brown37.net/09_lights/lights_diffuse.html">diffuse relighting</a>. The outputs of the first stage are used in the second stage to predict the correct shading. The whole model is trained end-to-end with a generative adversarial network (GAN) loss similar to the one used in the <a href="https://arxiv.org/abs/1611.07004">pix2pix</a> paper.</p>
<h4 id="synsin-end-to-end-view-synthesis-from-a-single-image-paper">SynSin: End-to-End View Synthesis From a Single Image (<a href="https://arxiv.org/abs/1912.08804">paper</a>)</h4>
<p>The goal of view synthesis is to generate new views of a scene given one or more images. But this can be challenging, requiring an understanding of the 3D scene from images. To overcome this, current methods rely on multiple images, train on ground-truth depth, or are limited to synthetic data. The authors propose a novel end-to-end model for view synthesis from a single image at test time while being trained on real images without any ground-truth 3D information (e.g., depth).</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/synsin.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1912.08804">Wiles et al.</a></figcaption>
</figure>
<p>SynSin takes an input image, the target image, and the desired relative pose (i.e., the desired rotation and translation). The input image is first passed through a feature network to embed it into a feature space at each pixel location, followed by depth prediction at each pixel via a depth regressor. Based on the features and the depth information, a <a href="https://en.wikipedia.org/wiki/Point_cloud">point cloud</a> representation is created, the relative pose (i.e., applying rotation and translation) is then used to render the features at the new view with a fully differentiable neural point cloud renderer. However, the projected features might have some artifacts (e.g., some unseen parts of the image are now visible in the new view, and need to be rendered), in order to fix this, a generator is used to fill the missing regions. The whole model is then trained end-to-end with: an L2 loss, a discriminator loss, and a <a href="https://arxiv.org/abs/1603.08155">perceptual loss</a>, without requiring any depth information. At test time, the network takes an image and the target relative pose and outputs the image with the desired view.</p>
<h4 id="novel-view-synthesis-of-dynamic-scenes-with-globally-coherent-depths-from-a-monocular-camera-paper">Novel View Synthesis of Dynamic Scenes with Globally Coherent Depths from a Monocular Camera (<a href="https://arxiv.org/abs/2004.01294">paper</a>)</h4>
<p>The objective of this paper is to synthesize an image from arbitrary views and times given a collection of images of a dynamic scene, i.e., a series of images captured by a single monocular camera from many locations (image below, left). The method can produce a novel view from an arbitrary location within the original range of locations (image below, middle), and can also reproduce the dynamic content that appeared across views at different times (image below, right). This is done using a single camera, without requiring a multiview system or human-specific priors as in previous methods.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/dynamic_scenes_sythesis.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2004.01294">Yoon et al.</a></figcaption>
</figure>
<p>The authors combine the depth from multiview stereo (DMV) with the depth from a single view (DSV) using a depth fusion network, with the help of the input image from the target view, producing a scale-invariant and complete depth map. With geometrically consistent depths across views, a novel view can be synthesized using a self-supervised rendering network that produces a photorealistic image in the presence of missing data, trained with an adversarial loss and a reconstruction loss.</p>
<figure style="width: 70%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/dynamic_scenes_sythesis_net.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2004.01294">Yoon et al.</a></figcaption>
</figure>
<h4 id="stefann-scene-text-editor-using-font-adaptive-neural-network-paper">STEFANN: Scene Text Editor using Font Adaptive Neural Network (<a href="https://arxiv.org/abs/1903.01192">paper</a>)</h4>
<p>This paper presents a method to directly modify text in an image at the character level while maintaining the original style.
This is done in two steps. First, a network called FANnet takes as input the source character we would like to modify and outputs the target character while
keeping the structural consistency and style of the source. Second, the coloring network, Colornet, takes the output of the first stage and the source character, and colors the target character while preserving visual consistency. After repeating this process for each character of the text, the characters are placed on the in-painted background while maintaining the correct spacing between characters. Below are some examples of the results from <a href="https://prasunroy.github.io/stefann/">the project’s webpage</a>.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/stefann.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1903.01192">Roy et al.</a></figcaption>
</figure>
<h4 id="mixnmatch-multifactor-disentanglement-and-encoding-for-conditional-image-generation-paper">MixNMatch: Multifactor Disentanglement and Encoding for Conditional Image Generation (<a href="https://arxiv.org/abs/1911.11758">paper</a>)</h4>
<p>MixNMatch is a conditional GAN capable of disentangling background, object pose, shape, and texture from real images with minimal supervision, i.e., bounding box annotations to model the background. A trained model can then be used to arbitrarily combine the four factors to generate new images, enabling applications such as sketch2color, cartoon2img, and img2gif.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/MixNMatch.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1911.11758">Li et al.</a></figcaption>
</figure>
<p>Given a collection of images of a single object category, the model is trained to simultaneously encode the background, object pose, shape, and texture factors associated with each image into a disentangled latent code space, and then generate realistic-looking images by combining latent factors from this disentangled code space. Four encoders are used to separately encode each latent code. Four different latent codes are then sampled and fed into the <a href="https://arxiv.org/abs/1811.11155">FineGAN</a> generator to hierarchically generate images, and the model is trained with four image-code pair discriminators that optimize the encoders and generator to match their joint image-code distributions.</p>
<h4 id="stargan-v2-diverse-image-synthesis-for-multiple-domains-paper">StarGAN v2: Diverse Image Synthesis for Multiple Domains (<a href="https://arxiv.org/abs/1912.01865">paper</a>)</h4>
<p>The main objective in image-to-image translation (i.e., changing some attributes of an image, such as hair color) is to increase the quality and the diversity of the generated images, while maintaining high scalability over multiple domains (i.e., a domain refers to a set of images having the same attribute value, like black hair).
Existing methods address only one of these issues, resulting in either limited diversity or multiple models for all domains. StarGAN v2 tries to solve both issues simultaneously, using style codes instead of explicit domain labels as in the first version of <a href="https://arxiv.org/abs/1711.09020">StarGAN</a>.</p>
<figure style="width: 90%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/starganv2.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1912.01865">Choi et al.</a></figcaption>
</figure>
<p>The StarGAN v2 model contains four modules: A generator that translates an input image into an output image with the desired domain-specific style code. A latent encoder (or a mapping network) that produces a style code for each domain, one of which is randomly selected during training. A style encoder that extracts the style code of an image, allowing the generator to perform reference-guided image synthesis, and a discriminator that distinguishes between real and fake (R/F) images from multiple domains. All modules except the generator contain multiple output branches, one of which is selected when training the corresponding domain. The model is then trained using an adversarial loss, a style reconstruction to force the generator to utilize the style code when generating the image, a style diversification loss to enable the generator to produce diverse images and a cycle loss to preserve the characteristics of each domain.</p>
<h4 id="gan-compression-efficient-architectures-for-interactive-conditional-gans-paper">GAN Compression: Efficient Architectures for Interactive Conditional GANs (<a href="https://arxiv.org/abs/2003.08936">paper</a>)</h4>
<p>Conditional GANs (cGANs) give the ability to do controllable image synthesis for many computer vision and graphics applications. However,
the computational resources needed for training them are orders of magnitude larger than that of traditional CNNs used for detection and recognition.
For example, GANs require 10x to 500x more computation than image recognition models. To solve this problem, the authors propose a GAN compression approach based on distillation, channel pruning, and neural architecture search (NAS), resulting in a compressed model while maintaining the same performance.</p>
<figure style="width: 90%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/gan_compression.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2003.08936">Li et al.</a></figcaption>
</figure>
<p>The proposed GAN Compression framework takes a pre-trained generator, considered as a teacher, which is first distilled into a smaller <em>once-for-all</em> student generator that contains all possible channel numbers through weight sharing, where different channel numbers are chosen for the student at each iteration.
Then, in order to choose the correct number of channels for each layer of the student, many sub-generators are extracted from the <em>once-for-all</em> (student) generator and evaluated, creating the candidate generator pool. Finally, the best sub-generator for the desired compression ratio and performance target (e.g., <a href="https://arxiv.org/abs/1706.08500">FID</a> or mIoU) is selected using <a href="https://arxiv.org/abs/1904.00420">one-shot NAS</a>; the selected generator is then fine-tuned, resulting in the final compressed model.</p>
<h4 id="semantic-pyramid-for-image-generation-paper">Semantic Pyramid for Image Generation (<a href="https://arxiv.org/abs/2003.06221">paper</a>)</h4>
<p>Semantic Pyramid tries to bridge the gap between discriminative and generative models. This is done using a novel GAN-based model that utilizes the space of deep features learned by a pre-trained classification model. Given a set of features extracted from a reference image, the model generates diverse image samples, each with matching features at each semantic level of the classification model.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/semantic_pyramid.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2003.06221">Shocher et al.</a></figcaption>
</figure>
<p>Concretely, given a pretrained classification network, a GAN is designed with a generator whose architecture mirrors that of the classification network. Each layer of the generator is trained to be conditioned on the previous generator layers and on the corresponding layer of the classification network. For example, conditioning the generator on the classification features close to the input results in an image similar to the input image of the classification model, with the possibility of exploring the sub-space of such images by sampling different noise vectors. On the other hand, conditioning on deeper layers results in a wider distribution of generated images. The model is trained with an adversarial loss to produce realistic images, a diversity loss to produce diverse images for different noise vectors, and a reconstruction loss to match the features of the generated image to those of the reference image. Different regions of the image can be conditioned on different semantic levels using a masking operation \(m\), which can be used to semantically modify the image.</p>
<h4 id="analyzing-and-improving-the-image-quality-of-stylegan-paper">Analyzing and Improving the Image Quality of StyleGAN (<a href="https://arxiv.org/abs/1912.04958">paper</a>)</h4>
<p>In the first version of <a href="https://arxiv.org/abs/1812.04948">StyleGAN</a>, the authors proposed an alternative generator architecture capable of producing high-quality images and of separating high-level attributes (e.g., pose and identity when trained on human faces). This architecture uses a mapping network from the latent space \(\mathcal{Z}\) into an intermediate space \(\mathcal{W}\) to more closely match the distribution of features in the training set and avoid the forbidden combinations present in \(\mathcal{Z}\). The intermediate latent vector is incorporated into the generator using Adaptive Instance Normalization (<a href="https://arxiv.org/abs/1703.06868">AdaIN</a>) layers, while noise is additively injected before each application of AdaIN, and the model is trained in a <a href="https://arxiv.org/abs/1710.10196">progressive</a> manner, yielding impressive results in data-driven unconditional generative image modeling. However, the generated images still contain some artifacts, like water-splotches (more details: <a href="http://www.whichfaceisreal.com/learn.html">whichfaceisreal</a>) and unchanged positions of face attributes like eyes.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/stylegan2.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1912.04958">Karras et al.</a></figcaption>
</figure>
<p>First, to avoid the droplet effects, which result from AdaIN discarding information in feature maps, AdaIN is replaced with a weight demodulation layer: some redundant operations are removed, the addition of the noise is moved outside of the active area of a style, and only the standard deviation per feature map is adjusted. The progressive GAN training is removed, to avoid the fixed preferred positions of face attributes, in favor of an architecture based on <a href="https://arxiv.org/abs/1903.06048">MSG-GAN</a>. Finally, StyleGAN2 introduces a new regularization term in the loss to enforce smoother latent space interpolations, based on the Jacobian of the generator at a single position in the intermediate latent space (sketched below).</p>
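<p>The smoothness regularizer can be sketched roughly as a path length penalty: perturb the generator output along a random direction, measure the norm of the resulting Jacobian-vector product with respect to the intermediate latent code, and pull it towards a running mean. The snippet below is a simplified version of this idea, with a toy generator standing in for StyleGAN2 and an in-batch mean standing in for the exponential moving average.</p>
<pre><code class="language-python">
import math
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 3 * 16 * 16))       # toy generator: w -> image
w = torch.randn(4, 64, requires_grad=True)
img = G(w).view(4, 3, 16, 16)

# random direction, scaled so the expected magnitude is resolution independent
y = torch.randn_like(img) / math.sqrt(16 * 16)
grad, = torch.autograd.grad((img * y).sum(), w, create_graph=True)
path_lengths = grad.pow(2).sum(dim=1).sqrt()         # per-sample Jacobian norm estimate

path_mean = path_lengths.detach().mean()             # in practice an EMA across steps
penalty = (path_lengths - path_mean).pow(2).mean()   # added to the generator loss
</code></pre>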
<h4 id="adversarial-latent-auto-encoders-paper">Adversarial Latent Auto-encoders (<a href="https://arxiv.org/abs/2004.04467">paper</a>)</h4>
<p>Auto-Encoders (AE) are characterized by their simplicity and their capability of combining generative and representational properties by learning an encoder-generator map simultaneously. However, they do not have the same generative capabilities as GANs. The proposed Adversarial Latent Autoencoder (ALAE) retains the generative properties of GANs by learning the output data distribution with an adversarial strategy, within an AE architecture where the latent distribution is learned from data to improve the disentanglement properties (i.e., similar to the \(\mathcal{W}\) intermediate latent space of StyleGAN).</p>
<figure style="width: 70%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/ALAE.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1912.04958">Pidhorskyi et al.</a></figcaption>
</figure>
<p>The ALAE architecture decomposes the generator G and the discriminator D into two networks each: F, G and E, D, where the latent space between F and G and the one between E and D are considered the same and referred to as the intermediate latent space \(\mathcal{W}\). In this case, the mapping network F is deterministic, while E and G are stochastic, depending on an injected noise. The pair of networks (G, E) forms a generator-encoder network that auto-encodes the latent space \(\mathcal{W}\) and is trained to minimize the discrepancy \(\Delta\) (e.g., an MSE loss) between the two distributions, i.e., the distribution at the input of G and the distribution at the output of E. As a whole, the model is trained by alternating between optimizing the GAN loss and the discrepancy \(\Delta\).</p>
<h4 id="other-papers-1">Other papers:</h4>
<ul>
<li><a href="https://arxiv.org/abs/1907.10786">Interpreting the Latent Space of GANs for Semantic Face Editing</a></li>
<li><a href="https://arxiv.org/abs/1907.11922">MaskGAN: Towards Diverse and Interactive Facial Image Manipulation</a></li>
<li><a href="https://arxiv.org/abs/2003.12697">Semantically Multi-modal Image Synthesis</a></li>
<li><a href="https://arxiv.org/abs/2003.14401">TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting</a></li>
<li><a href="https://arxiv.org/abs/2002.11812">Learning to Shadow Hand-drawn Sketches</a></li>
<li><a href="https://arxiv.org/abs/2005.10663">Wish You Were Here: Context-Aware Human Generation</a></li>
<li><a href="https://arxiv.org/abs/2004.12411">Disentangled Image Generation Through Structured Noise Injection</a></li>
<li><a href="https://arxiv.org/abs/1903.06048">MSG-GAN: Multi-Scale Gradients for Generative Adversarial Networks</a></li>
<li><a href="https://arxiv.org/abs/2004.03623">PatchVAE: Learning Local Latent Codes for Recognition</a></li>
<li><a href="https://arxiv.org/abs/2006.10728">Diverse Image Generation via Self-Conditioned GANs</a></li>
<li><a href="https://arxiv.org/abs/1912.05237">Towards Unsupervised Learning of Generative Models for 3D Controllable Image Synthesis</a></li>
</ul>
<h1 id="representation-learning">Representation Learning</h1>
<h4 id="self-supervised-learning-of-pretext-invariant-representations-paper">Self-Supervised Learning of Pretext-Invariant Representations (<a href="https://arxiv.org/abs/1912.01991">paper</a>)</h4>
<p>Existing self-supervised learning methods consist of creating a pretext task, for example, dividing an image into nine patches and solving a <a href="https://arxiv.org/abs/1603.09246">jigsaw puzzle</a> on the permuted patches. These pretext tasks involve transforming an image, computing a representation of the transformed image, and predicting properties of the transformation from that representation. As a result, the authors argue that the learned representation must covary with the transformation, reducing the amount of learned semantic information. To solve this, they propose PIRL (Pretext-Invariant Representation Learning) to learn representations that are invariant to the transformations and retain more semantic information.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/PIRL.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1912.01991">Misra et al.</a></figcaption>
</figure>
<p>PIRL trains a network that produces image representations that are invariant to image transformations, and this is done by minimizing a <a href="http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf">contrastive loss</a>, where the model is trained to differentiate a positive sample (i.e., an image and its transformed version) from N corresponding negative samples that are drawn uniformly at random from the dataset excluding the image used for the positive samples.
Using a large number of negative samples is critical for noise contrastive estimation based losses. To this end, PIRL uses a <a href="https://arxiv.org/abs/1805.01978">memory bank</a> containing feature representations for each example, where each representation at a given instance is an exponential moving average of previous representations.</p>
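<p>A tiny sketch of the memory bank mechanic only (not the full PIRL objective): each image keeps one entry in the bank, updated as an exponential moving average of its representations over training. Sizes and the momentum value are illustrative.</p>
<pre><code class="language-python">
import torch
import torch.nn.functional as F

num_images, dim, momentum = 10000, 128, 0.5
memory_bank = F.normalize(torch.randn(num_images, dim), dim=1)

def update_memory(indices, new_features):
    """EMA update of the bank entries for the images in the current batch."""
    old = memory_bank[indices]
    new = momentum * old + (1 - momentum) * F.normalize(new_features, dim=1)
    memory_bank[indices] = F.normalize(new, dim=1)

update_memory(torch.tensor([3, 7, 42]), torch.randn(3, dim))
</code></pre>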
<h4 id="clusterfit-improving-generalization-of-visual-representations-paper">ClusterFit: Improving generalization of visual representations (<a href="https://arxiv.org/abs/1912.03330">paper</a>)</h4>
<p>Weakly-supervised (e.g., hashtag prediction) and self-supervised (e.g., jigsaw puzzle) strategies are becoming increasingly popular for pretraining CNNs for visual downstream tasks. However, the learned representations using such methods may overfit to the pretraining objective given the limited training signal that can be extracted during pretraining, leading to a reduced generalization to downstream tasks.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/clusterfit.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1912.03330">Yan et al.</a></figcaption>
</figure>
<p>The idea of ClusterFit is quite simple: a network is first pretrained using some chosen pretraining task, be it self- or weakly-supervised; this network is then used to extract features for each image, k-means clustering is applied to these features, and a pseudo-label is assigned to each data point. The pseudo-labels can then be used to train a network from scratch, which will be better adapted to downstream tasks, with either linear probing or fine-tuning (see the sketch below).</p>
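<p>The re-labeling step can be sketched in a few lines, assuming the features have already been extracted by the pretrained network; scikit-learn’s KMeans is used here for the clustering, and the feature matrix is a random placeholder.</p>
<pre><code class="language-python">
import numpy as np
from sklearn.cluster import KMeans

# features extracted by the pretrained (self- or weakly-supervised) network
features = np.random.randn(10000, 512).astype(np.float32)

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
pseudo_labels = kmeans.labels_    # one cluster id per image

# pseudo_labels then serve as classification targets to train a new
# network from scratch, which is transferred to downstream tasks.
</code></pre>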
<h4 id="momentum-contrast-for-unsupervised-visual-representation-learning-paper">Momentum contrast for unsupervised visual representation learning (<a href="https://arxiv.org/abs/1911.05722">paper</a>)</h4>
<p>Recent works on unsupervised visual representation learning are based on minimizing the <a href="http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf">contrastive loss</a>, which can be seen as building dynamic dictionaries, where the <em>keys</em> in the dictionary are sampled from data (e.g., images or patches) and are represented by an encoder network, which is then trained so that a <em>query</em> \(q\) is similar to a given <em>key</em> \(k\) (a positive sample) and dissimilar to the other keys (negative samples) .</p>
<figure style="width: 90%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/moco.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1911.05722">He et al.</a></figcaption>
</figure>
<p>Momentum Contrast (MoCo) trains an encoder by matching an encoded query \(q\) to a dictionary of encoded keys using a contrastive loss. The dictionary keys are defined on-the-fly by a set of data samples, where the dictionary is built as a queue, with the current mini-batch enqueued and the oldest mini-batch dequeued, decoupling the dictionary size from the mini-batch size. By using a queue, a large number of negatives can be used, even beyond the current mini-batch. Additionally, the keys are encoded by a slowly progressing encoder, i.e., an exponential moving average of the query encoder; this way, the key encoder changes slowly over time, producing stable key representations during the course of training. Another benefit of this slowly changing key encoder is that the dequeued keys used as negatives are not too dissimilar from its current outputs, avoiding a trivial matching problem where the negatives are easily distinguishable from the positive sample. The two key mechanics are sketched below.</p>
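<p>A simplified sketch of the two mechanics, the momentum update of the key encoder and the queue of negatives, loosely following the pseudocode in the paper but with toy linear encoders and random data:</p>
<pre><code class="language-python">
import torch
import torch.nn as nn
import torch.nn.functional as F

f_q = nn.Linear(128, 64)                      # query encoder (trained by backprop)
f_k = nn.Linear(128, 64)                      # key encoder (momentum updated)
f_k.load_state_dict(f_q.state_dict())
queue = F.normalize(torch.randn(64, 4096), dim=0)   # dictionary of negative keys
m, T = 0.999, 0.07                            # momentum and temperature

x_q, x_k = torch.randn(32, 128), torch.randn(32, 128)   # two augmented views
q = F.normalize(f_q(x_q), dim=1)
with torch.no_grad():
    # momentum update: the key encoder slowly follows the query encoder
    for p_k, p_q in zip(f_k.parameters(), f_q.parameters()):
        p_k.data = m * p_k.data + (1 - m) * p_q.data
    k = F.normalize(f_k(x_k), dim=1)

l_pos = (q * k).sum(dim=1, keepdim=True)      # positive logits, shape (N, 1)
l_neg = q @ queue                             # negative logits, shape (N, 4096)
logits = torch.cat([l_pos, l_neg], dim=1) / T
loss = F.cross_entropy(logits, torch.zeros(32, dtype=torch.long))

queue = torch.cat([queue[:, 32:], k.t()], dim=1)   # enqueue new keys, dequeue oldest
</code></pre>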
<h4 id="steering-self-supervised-feature-learning-beyond-local-pixel-statistics-paper">Steering Self-Supervised Feature Learning Beyond Local Pixel Statistics (<a href="https://arxiv.org/abs/2004.02331">paper</a>)</h4>
<figure style="width: 40%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/LCI.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2004.02331">Jenni et al.</a></figcaption>
</figure>
<p>The authors argue that good image representations should capture both local and global image statistics to better generalize to downstream tasks, where local statistics capture the distribution of nearby pixels, e.g., texture, and global statistics capture the distribution of far away pixels and patches, e.g., shape. However, <a href="https://openreview.net/pdf?id=Bygh9j09KX">CNNs are more biased toward local statistics</a> and need to be explicitly forced to focus on global features for better generalization. To this end, the authors carefully choose a set of image transformations (i.e., warping, local inpainting and rotation) such that the network cannot predict the applied transformation by observing only local statistics, forcing it to focus on global pixel statistics. With the selected transformations, the network is then pretrained with a classification objective to predict the label corresponding to the applied transformation.</p>
<h4 id="other-papers-2">Other papers:</h4>
<ul>
<li><a href="https://arxiv.org/abs/1912.02783">Self-Supervised Learning of Video-Induced Visual Invariances</a></li>
<li><a href="https://arxiv.org/pdf/2002.10857.pdf">Circle Loss: A Unified Perspective of Pair Similarity Optimization</a></li>
<li><a href="https://arxiv.org/abs/2002.12247">Learning Representations by Predicting Bags of Visual Words</a></li>
</ul>
<h1 id="computational-photography">Computational photography</h1>
<h4 id="learning-to-see-through-obstructions-paper">Learning to See Through Obstructions (<a href="https://arxiv.org/abs/2004.01180">paper</a>)</h4>
<p>The paper proposes a learning-based approach for removing unwanted obstructions (examples below). The method uses a multi-frame obstruction removal algorithm that exploits the advantages of both optimization-based and learning-based methods, alternating between dense motion estimation and background/obstruction layer reconstruction steps in a coarse-to-fine manner. By modeling the dense motion, detailed content in the respective layers can be progressively recovered, gradually separating the background from the unwanted occlusion layers. The pipeline starts with flow decomposition, followed by the background and obstruction layer reconstruction stages, and finally optical flow refinement.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/remove_obstructions.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2004.01180">Liu et al.</a></figcaption>
</figure>
<h4 id="background-matting-the-world-is-your-green-screen-paper">Background Matting: The World is Your Green Screen (<a href="https://arxiv.org/abs/2004.00626">paper</a>)</h4>
<p>The process of separating an image into foreground and background, called matting, generally requires a green screen background or a manually created trimap to produce a good <a href="https://en.wikipedia.org/wiki/Matte_(filmmaking)">matte</a>, to then allow placing the extracted foreground in the desired background. In this paper, the authors propose to use a captured background as an estimate of the true background which is then used to solve for the foreground and alpha value (i.e., every pixel in the image is represented as a combination of foreground and background with a weight alpha).</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/bg_matting.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2004.00626">Sengupta et al.</a></figcaption>
</figure>
<p>The model takes as input an image or video of a person in front of a static, natural background, plus an image of just the background. A deep matting network then extracts foreground color and alpha at each spatial location for a given input frame, augmented with background, soft segmentation, and optionally nearby video frames, in addition to a discriminator network that guides the training to generate realistic results. The whole model is trained end-to-end using a combination of a supervised and self-supervised adversarial losses.</p>
<h4 id="3d-photography-using-context-aware-layered-depth-inpainting-paper">3D Photography using Context-aware Layered Depth Inpainting (<a href="https://arxiv.org/abs/2004.04727">paper</a>)</h4>
<p>The objective of the paper is to synthesize content in regions occluded in the input view, starting from a single RGB-D image. The proposed method consists of a three-step pipeline. First, given the RGB-D image, a preprocessing step filters the depth and color inputs using a <a href="https://en.wikipedia.org/wiki/Bilateral_filter">bilateral median filter</a>, and the raw discontinuities are detected using disparity thresholds to estimate the depth edges. This is followed by detecting context/synthesis regions for each depth edge. Given the color, depth and edge information, the last step consists of depth edge inpainting guided by color and depth inpainting, resulting in a new view as seen in the GIF below (taken from the <a href="https://www.youtube.com/watch?v=pCSI8YKdCPE">authors’ YT video</a>).</p>
<figure style="width: 60%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/rgbd_to_3d.gif" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2004.04727">Shih et al.</a></figcaption>
</figure>
<h4 id="pulse-self-supervised-photo-upsampling-via-latent-space-exploration-of-generative-models-paper">PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models (<a href="https://arxiv.org/abs/2003.03808">paper</a>)</h4>
<p>The goal of single-image super-resolution is to output a corresponding high-resolution (HR) image from a low-resolution (LR) one. Previous methods train with a supervised loss that measures the pixel-wise average distance between the ground-truth HR image and the output of the model. However, multiple HR images map to the same LR image, and such methods, by trying to match the true HR image, effectively output a per-pixel average of all the possible HR images, which lacks detail in high-frequency regions and results in a blurry HR output.</p>
<figure style="width: 80%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/PULSE.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2003.03808">Menon et al.</a></figcaption>
</figure>
<p>PULSE seeks to find a single plausible HR image from the set of possible HR images that downscale to the same LR input, and can be trained in a self-supervised manner without the need for a labeled dataset, making the method more flexible and not confined to a specific degradation operator. Specifically, instead of starting with the LR image and slowly adding detail, PULSE traverses the high-resolution natural image manifold, searching for images that downscale to the original LR image. This is done by minimizing a distance measure between the LR image and the downscaled output of a generator whose latent input is optimized. Additionally, the search space is restricted to guarantee that the outputs of the generator remain realistic, by using the unit sphere in \(d\)-dimensional Euclidean space as the latent space. The core search is sketched below.</p>
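<p>A rough sketch of the core search: optimize a latent vector so that the downscaled output of a (pretrained, here toy) generator matches the LR input, while keeping the latent on the sphere. The generator, image sizes, and optimizer settings are placeholders, not the StyleGAN setup used in the paper.</p>
<pre><code class="language-python">
import torch
import torch.nn as nn
import torch.nn.functional as F

G = nn.Sequential(nn.Linear(512, 3 * 64 * 64), nn.Tanh())   # toy "HR" generator
lr_image = torch.rand(1, 3, 16, 16)                          # low-resolution input

z = torch.randn(1, 512, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)

for _ in range(200):
    z_proj = z / z.norm()                                    # stay on the unit sphere
    hr = G(z_proj).view(1, 3, 64, 64)
    down = F.interpolate(hr, size=(16, 16), mode='bicubic', align_corners=False)
    loss = F.mse_loss(down, lr_image)                        # downscaling consistency
    opt.zero_grad(); loss.backward(); opt.step()

sr_image = G(z / z.norm()).view(1, 3, 64, 64).detach()       # plausible HR output
</code></pre>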
<h4 id="other-papers-3">Other papers:</h4>
<ul>
<li><a href="https://arxiv.org/abs/2004.12260">Learning to Autofocus</a></li>
<li><a href="https://arxiv.org/abs/2003.08367">Lighthouse: Predicting Lighting Volumes for Spatially-Coherent Illumination</a></li>
<li><a href="https://arxiv.org/abs/2002.11616">Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video Super-Resolution</a></li>
<li><a href="https://arxiv.org/abs/1912.01839">Explorable Super Resolution</a></li>
<li><a href="https://arxiv.org/abs/1908.00620">Deep Optics for Single-shot High-dynamic-range Imaging</a></li>
<li><a href="https://arxiv.org/abs/2001.04642">Seeing the World in a Bag of Chips</a></li>
</ul>
<h1 id="transferlow-shotsemiunsupervised-learning">Transfer/Low-shot/Semi/Unsupervised Learning</h1>
<h4 id="conditional-channel-gated-networks-for-task-aware-continual-learning-paper">Conditional Channel Gated Networks for Task-Aware Continual Learning (<a href="https://arxiv.org/abs/2004.00070">paper</a>)</h4>
<p>In the case where the training examples come in a sequence of sub-tasks, deep nets where gradient-based optimization is required are subject to catastrophic forgetting, where the learned information from previous tasks is lost. Continual learning tries to solve this by allowing the models to protect and preserve the acquired information while still being capable of extracting new information from new tasks. Similar to the gating mechanism in LSTMs/GRUs, the authors propose a channel-gating module where only a subset of the feature maps are selected depending on the current task. This way, the important filters are protected to avoid a loss in performance of the model on previously learned tasks, additionally, by selecting a limited set of kernels to be updated, the model will still have the capacity to learn new tasks.</p>
<figure style="width: 90%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/gated_cnns.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2004.00070">Abati et al.</a></figcaption>
</figure>
<p>The paper also introduces a task classifier to overcome the need to know which task the model is being applied to at test time, the task classifier is trained to predict the task at train time and selects which CNN features to pass to the fully-connected layers for classification. However, the task classifier is also subject to catastrophic forgetting, and the authors propose to train it with <a href="https://arxiv.org/abs/1706.08840">Episodic memory</a> and <a href="https://arxiv.org/abs/1710.10368">Generative memory</a> to avoid this.</p>
<h4 id="few-shot-learning-via-embedding-adaptation-with-set-to-set-functions-paper">Few-Shot Learning via Embedding Adaptation with Set-to-Set Functions (<a href="https://arxiv.org/abs/1812.03664">paper</a>)</h4>
<p>Few-shot learning consists of learning a well-performing model with N classes and K examples per class (i.e., referred to as an N-way, K-shot task), but a high-capacity deep net is easily prone to over-fitting with such limited training data. Many few-shot learning methods (e.g., <a href="https://arxiv.org/abs/1703.05175">Prototypical Networks</a>) address this by learning an instance embedding function from seen classes during training, where there are ample labeled instances, and then applying a simple function to the embeddings of the new instances from unseen classes with limited labels at test time. However, the learned embeddings are task-agnostic, given that the learned embedding function is not optimally discriminative with respect to the unseen classes.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/settoset.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1812.03664">Ye et al.</a></figcaption>
</figure>
<p>The authors propose to adapt the instance embeddings to the target classification task with a set-to-set function, yielding embeddings that are task-specific and discriminative. To obtain task-specific embeddings, an additional adaptation step is conducted, where the embedding function is transformed with a set-to-set function that contextualizes over the image instances of a set, enabling strong co-adaptation of each item. The authors tested several set-to-set functions, such as BiLSTMs, Graph Convolutional Networks, and Transformers, and found that Transformers work best in this case.</p>
<h4 id="towards-discriminability-and-diversity-batch-nuclear-norm-maximization-under-label-insufficient-situations-paper">Towards Discriminability and Diversity: Batch Nuclear-norm Maximization under Label Insufficient Situations (<a href="https://arxiv.org/abs/2003.12237">paper</a>)</h4>
<p>In cases where we are provided with a small labeled set, the performance of deep nets degrades on ambiguous samples as a result of placing the decision boundary close to high-density regions (right figure below). A common solution is to minimize the entropy, but one side effect caused by entropy minimization is the reduction of the prediction diversity, where ambiguous samples are classified as belonging to the most dominant classes, i.e., an increase in <em>discriminability</em> but a reduction in <em>diversity</em>.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/BNM.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/2003.12237">Cui et al.</a></figcaption>
</figure>
<p>The paper investigates ways to increase both the discriminability, i.e., outputting highly certain predictions, and the diversity, i.e., predicting all the categories somewhat equally. By analyzing the batch output matrix \(A \in \mathbb{R}^{B \times C}\), with \(B\) samples and \(C\) classes, the authors find that prediction discriminability and diversity can be separately measured by the Frobenius norm and the rank of \(A\), and propose Batch Nuclear-norm Maximization (BNM), applied to the output matrix \(A\), to increase performance in label-insufficient settings such as semi-supervised learning and domain adaptation (a one-line sketch of the loss is given below).</p>
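<p>The proposed objective is essentially one line: maximize the nuclear norm of the batch prediction matrix, here sketched as a penalty that would be added to the usual supervised loss; the batch size and number of classes are illustrative.</p>
<pre><code class="language-python">
import torch
import torch.nn.functional as F

logits = torch.randn(64, 10, requires_grad=True)   # B x C outputs on unlabeled data
A = F.softmax(logits, dim=1)                       # batch prediction matrix

# Batch Nuclear-norm Maximization: a larger nuclear norm of A means
# predictions that are both more discriminable and more diverse.
bnm_loss = -torch.linalg.matrix_norm(A, ord='nuc') / A.shape[0]
bnm_loss.backward()
</code></pre>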
<h4 id="other-papers-4">Other papers:</h4>
<ul>
<li><a href="https://arxiv.org/abs/1910.00701">Distilling Effective Supervision from Severe Label Noise</a></li>
<li><a href="https://arxiv.org/abs/2003.11712">Mask Encoding for Single Shot Instance Segmentation</a></li>
<li><a href="http://www.eecs.ucf.edu/~gqi/publications/CVPR2020_WCP.pdf">WCP: Worst-Case Perturbations for Semi-Supervised Deep Learning</a></li>
<li><a href="https://arxiv.org/abs/1911.11090">Meta-Learning of Neural Architectures for Few-Shot Learning</a></li>
<li><a href="https://arxiv.org/abs/2004.04388">Towards Inheritable Models for Open-Set Domain Adaptation</a></li>
<li><a href="https://arxiv.org/abs/1909.03403">Open Compound Domain Adaptation</a>)</li>
</ul>
<h1 id="vision-and-language">Vision and Language</h1>
<h4 id="12-in-1-multi-task-vision-and-language-representation-learning-paper">12-in-1: Multi-Task Vision and Language Representation Learning (<a href="https://arxiv.org/abs/1912.02315">paper</a>)</h4>
<p>Vision-and-language methods often focus on a small set of independent tasks that are studied in isolation. However, the authors point out that the visually-grounded language understanding skills required for success at each of these tasks overlap significantly. To this end, the paper proposes a large-scale, multi-task training regime with a single model trained on 12 datasets from four broad categories of tasks: visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification. Using a single model helps reduce the number of parameters from approximately 3 billion to 270 million while simultaneously improving performance across tasks.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/CVPR20/12in1.png" />
<figcaption>Image source: <a href="https://arxiv.org/abs/1912.02315">Lu et al.</a></figcaption>
</figure>
<p>The model is based on <a href="https://arxiv.org/abs/1908.02265">ViLBERT</a>, where each task has a task-specific <em>head</em> network that branches off a common, shared <em>trunk</em> (i.e., the ViLBERT model). With 6 task heads, 12 datasets, and over 4.4 million individual training instances, multi-task training at this scale is hard to control. To overcome this, all the models are first pretrained on the same dataset. Then, round-robin batch sampling is used to cycle through each task from the beginning of multi-task training, with early stopping applied to any task that starts to over-fit, and the possibility of resuming its training later to avoid catastrophic forgetting.</p>
<h4 id="other-papers-5">Other papers:</h4>
<ul>
<li><a href="https://arxiv.org/abs/2003.13830">Sign Language Transformers: Joint End-to-End Sign Language Recognition and Translation</a></li>
<li><a href="http://openaccess.thecvf.com/content_CVPR_2020/papers/Abbasnejad_Counterfactual_Vision_and_Language_Learning_CVPR_2020_paper.pdf">Counterfactual Vision and Language Learning</a></li>
<li><a href="https://arxiv.org/abs/2004.02194">Iterative Context-Aware Graph Inference for Visual Dialog</a></li>
<li><a href="https://arxiv.org/abs/1912.08226">Meshed-Memory Transformer for Image Captioning</a></li>
<li><a href="https://arxiv.org/abs/2003.05078">Visual Grounding in Video for Unsupervised Word Translation</a></li>
<li><a href="https://people.cs.umass.edu/~smaji/papers/phrasecut+supp-cvpr20.pdf">PhraseCut: Language-Based Image Segmentation in the Wild</a></li>
</ul>
<h1 id="the-rest">The rest</h1>
<p>This post turned into a long one very quickly, so in order to avoid ending up with an hour-long reading session, I will simply list some papers I came across in case the reader is interested in the subjects.</p>
<details>
<summary>Click to expand</summary> <br />
<small> <div class="tip">
<p><strong>Efficient training & inference</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/1912.01106">MnasFPN: Learning Latency-aware Pyramid Architecture for Object Detection on Mobile Devices</a></li>
<li><a href="https://arxiv.org/abs/1911.09723">Fast Sparse ConvNets</a></li>
<li><a href="https://arxiv.org/abs/1911.11907">GhostNet: More Features from Cheap Operations</a></li>
<li><a href="https://arxiv.org/abs/1909.10788">Forward and Backward Information Retention for Accurate Binary Neural Networks</a></li>
<li><a href="https://arxiv.org/abs/2001.06232">Sideways: Depth-Parallel Training of Video Models</a></li>
<li><a href="https://arxiv.org/abs/1906.02256">Butterfly Transform: An Efficient FFT Based Neural Architecture Design</a></li>
</ul>
<p><strong>3D applications and methods</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/1911.11763">SuperGlue: Learning Feature Matching with Graph Neural Networks</a></li>
<li><a href="https://arxiv.org/abs/1911.11130">Unsupervised Learning of Probably Symmetric Deformable 3D Objects From Images in the Wild</a></li>
<li><a href="https://arxiv.org/abs/2004.00452">PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization</a></li>
<li><a href="https://arxiv.org/abs/1911.06971">BSP-Net: Generating Compact Meshes via Binary Space Partitioning</a></li>
<li><a href="https://arxiv.org/abs/2004.11364">Single-view view synthesis with multiplane images</a></li>
<li><a href="http://openaccess.thecvf.com/content_CVPR_2020/html/Fieraru_Three-Dimensional_Reconstruction_of_Human_Interactions_CVPR_2020_paper.html">Three-Dimensional Reconstruction of Human Interactions</a></li>
<li><a href="https://arxiv.org/abs/1912.02923">Generating 3D People in Scenes Without People</a></li>
<li><a href="https://arxiv.org/abs/2005.08144">High-Dimensional Convolutional Networks for Geometric Pattern Recognition</a></li>
<li><a href="http://openaccess.thecvf.com/content_CVPR_2020/papers/Li_Shape_correspondence_using_anisotropic_Chebyshev_spectral_CNNs_CVPR_2020_paper.pdf">Shape correspondence using anisotropic Chebyshev spectral CNNs</a></li>
</ul>
<p><strong>Face, gesture, and body pose</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/1908.10357">HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation</a></li>
<li><a href="https://arxiv.org/abs/2003.08325">DeepCap: Monocular Human Performance Capture Using Weak Supervision</a></li>
<li><a href="https://arxiv.org/abs/2003.00080">Transferring Dense Pose to Proximal Animal Classes</a></li>
<li><a href="https://arxiv.org/abs/2006.08586">Coherent Reconstruction of Multiple Humans from a Single Image</a></li>
<li><a href="https://arxiv.org/abs/1912.05656">VIBE: Video Inference for Human Body Pose and Shape Estimation</a></li>
</ul>
<p><strong>Video & Scene analysis and understanding</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2004.02788">Self-Supervised Scene De-occlusion</a></li>
<li><a href="https://arxiv.org/abs/2002.11949">Unbiased Scene Graph Generation from Biased Training</a></li>
<li><a href="http://openaccess.thecvf.com/content_CVPR_2020/papers/Dwibedi_Counting_Out_Time_Class_Agnostic_Video_Repetition_Counting_in_the_CVPR_2020_paper.pdf">Counting Out Time: Class Agnostic Video Repetition Counting in the Wild</a></li>
<li><a href="https://arxiv.org/abs/2004.06376">Footprints and Free Space From a Single Color Image</a></li>
<li><a href="https://arxiv.org/abs/1912.06992">Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs</a></li>
<li><a href="https://arxiv.org/abs/1912.06430">End-to-End Learning of Visual Representations From Uncurated Instructional Videos</a></li>
</ul>
</div> </small>
</details>Yassineouali.yasine@gmail.comThe first virtual CVPR conference ended, with 1467 papers accepted, 29 tutorials, 64 workshops, and 7k virtual attendees. In this blog post, I present an overview of the conference by summarizing some papers that caught my attention.Deep Semi-Supervised Learning2020-06-11T08:00:00+00:002020-06-11T08:00:00+00:00https://yassouali.github.io//ml-blog/deep-semi-supervised<p><span style="color: #c8634d;">[A more detailed version of this post is available on <a href="https://arxiv.org/abs/2006.05278">arXiv</a>.]</span>
<br />
<span style="color: #c8634d;">[A curated and an up-to-date list of SSL papers is available on <a href="https://github.com/yassouali/awesome-semi-supervised-learning">github</a>.]</span></p>
<p>Deep neural networks have demonstrated their ability to deliver remarkable performance on certain supervised learning tasks (e.g., image classification) when trained on extensive collections of labeled data (e.g., ImageNet). However, creating such large collections of data requires a considerable amount of resources, time, and effort. Such resources may not be available in many practical cases, limiting the adoption and application of many deep learning (DL) methods.</p>
<p>In the search for more data-efficient DL methods that overcome the need for large annotated datasets, recent years have seen a lot of research interest in applying semi-supervised learning (SSL) to deep neural nets as a possible alternative, both by developing novel methods and by adapting existing SSL frameworks to the deep learning setting. This post discusses SSL in a deep learning setting and goes through some of the main deep learning SSL methods.</p>
<ul>
<li><a href="#semi-supervised-learning">Semi-supervised Learning</a>
<ul>
<li><a href="#what-is-semi-supervised-learning">What is Semi-supervised Learning?</a></li>
<li><a href="#semi-supervised-learning-methods">Semi-supervised learning methods</a></li>
<li><a href="#main-assumptions-in-ssl">Main Assumptions in SSL</a></li>
</ul>
</li>
<li><a href="#consistency-regularization">Consistency Regularization</a>
<ul>
<li><a href="#ladder-networks">Ladder Networks</a></li>
<li><a href="#π-model">Π-model</a></li>
<li><a href="#temporal-ensembling">Temporal Ensembling</a></li>
<li><a href="#mean-teachers">Mean Teachers</a></li>
<li><a href="#dual-students">Dual Students</a></li>
<li><a href="#virtual-adversarial-training">Virtual Adversarial Training</a></li>
<li><a href="#adversarial-dropout">Adversarial Dropout</a></li>
<li><a href="#interpolation-consistency-training">Interpolation Consistency Training</a></li>
<li><a href="#unsupervised-data-augmentation">Unsupervised Data Augmentation</a></li>
</ul>
</li>
<li><a href="#entropy-minimization">Entropy Minimization</a></li>
<li><a href="#proxy-label-methods">Proxy-label Methods</a>
<ul>
<li><a href="#self-training">Self-training</a></li>
<li><a href="#meta-pseudo-labels">Meta Pseudo Labels</a></li>
<li><a href="#multi-view-training">Multi-view training</a>
<ul>
<li><a href="#co-training">Co-training</a></li>
<li><a href="#tri-training">Tri-Training</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#holistic-methods">Holistic Methods</a>
<ul>
<li><a href="#mixmatch">MixMatch</a></li>
<li><a href="#remixmatch">ReMixMatch</a></li>
<li><a href="#fixmatch">FixMatch</a></li>
</ul>
</li>
<li><a href="#references">References</a></li>
</ul>
<h1 id="semi-supervised-learning">Semi-supervised Learning</h1>
<h2 id="what-is-semi-supervised-learning">What is Semi-supervised Learning?</h2>
<blockquote>
<p>Semi-supervised learning (SSL) is halfway between supervised and unsupervised learning.
In addition to unlabeled data, the algorithm is provided with some supervision
information – but not necessarily for all examples. Often, this information will
be the targets associated with some of the examples. In this case, the data set \( X=\left(x_{i}\right); i \in [n]\)
can be divided into two parts: the points \( X_{l}:=\left(x_{1}, \dots, x_{l}\right) \), for which labels
\( Y_{l}:=\left(y_{1}, \dots, y_{l}\right) \) are provided, and the points
\( X_{u}:=\left(x_{l+1}, \ldots, x_{l+u}\right) \), the labels of which are
not known.</p>
<footer><strong>Chapelle et al.</strong> —
<a href="http://www.acad.bg/ebook/ml/MITPress- SemiSupervised Learning.pdf">SSL book</a>
</footer>
</blockquote>
<p>As stated in the definition above, in SSL we are provided with a dataset containing both labeled and unlabeled examples. The portion of labeled examples is usually quite small compared to the unlabeled ones (e.g., 1 to 10% of the total number of examples). So, given a dataset \(\mathcal{D}\) containing a labeled subset \(\mathcal{D}_l\) and an unlabeled subset \(\mathcal{D}_u\), the objective, or rather the hope, is to leverage the unlabeled examples to train a better-performing model than what can be obtained using only the labeled portion, and hopefully get closer to the desired optimal performance, in which all of the dataset \(\mathcal{D}\) is labeled.</p>
<p>More formally, SSL’s goal is to leverage the unlabeled data \(\mathcal{D}_u\) to produce a prediction function
\(f_{\theta}\) with trainable parameters \(\theta\), that is more accurate than what would have been obtained by only using the labeled data \(\mathcal{D}_l\). For instance, \(\mathcal{D}_u\) might provide us with additional information about the structure of the data distribution \(p(x)\), to better estimate the decision boundary
between the different classes. As shown in Fig. 1 below, where data points with distinct labels are separated by low-density regions, leveraging unlabeled data with an SSL approach can provide us with additional information about the shape of the decision boundary between the two classes and reduce the ambiguity present in the supervised case.</p>
<figure style="width: 75%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/SSL/cluster_ssl.png" alt="" />
<figcaption>Fig. 1. The decision boundaries obtained on two moons dataset, with a supervised and different SSL approaches, using 6 labeled examples, 3 for each class and the rest of the points as unlabeled data. (Image source: <a href="https://arxiv.org/abs/1804.09170">Oliver et al</a>)
</figcaption>
</figure>
<p>Semi-supervised learning first appeared in the form of self-training, where a model is first trained on labeled data, and then, iteratively, at each training iteration, a portion of the unlabeled data is annotated using the trained model and added to the training set for the next iteration. SSL really took off in the 1970s after its success
with iterative algorithms such as the <a href="https://en.wikipedia.org/wiki/Expectation–maximization_algorithm">expectation-maximization</a> algorithm, using labeled and unlabeled
data to maximize the likelihood of the model. In this post, we are only interested in SSL applied to deep learning. For a detailed review of the field, <a href="http://www.acad.bg/ebook/ml/MITPress- SemiSupervised Learning.pdf">Semi-Supervised Learning Book</a> is a good resource.</p>
<h2 id="semi-supervised-learning-methods">Semi-supervised learning methods</h2>
<p>Many SSL methods and approaches have been introduced over the years; SSL algorithms can be broadly divided into the following categories:</p>
<ul>
<li><strong>Consistency Regularization (Consistency Training).</strong> Based on the assumption that if a realistic perturbation was applied to the unlabeled data points, the prediction should not change significantly. We can then train the model to have a consistent prediction on a given unlabeled example and its perturbed version.</li>
<li><strong>Proxy-label Methods.</strong> Such methods leverage a trained model on the labeled set to produce additional training examples extracted from the unlabeled set based on some heuristic. These approaches can also be referred to as <em>self-teaching</em> or <em>bootstrapping</em> algorithms;
we follow <a href="https://arxiv.org/abs/1804.09530">Ruder et al.</a> and refer to them as proxy-label methods. Some examples of such methods are <em>Self-training</em>, <em>Co-training</em>, and <em>Multi-View Learning</em>.</li>
<li><strong>Generative models.</strong> Similar to the supervised setting, where features learned on one task can be transferred to other downstream tasks, generative models that are able to generate images from the data distribution \(p(x)\) must learn features that are transferable to a supervised task \(p(y | x)\) for a given set of targets \(y\).</li>
<li><strong>Graph-Based Methods.</strong> The labeled and unlabeled data points constitute the nodes of a graph, and the objective is to propagate labels from the labeled nodes to the unlabeled ones. The similarity of two nodes \(n_i\) and \(n_j\) is reflected by the strength of the edge \(e_{ij}\) between them.</li>
</ul>
<p>In addition to these main categories, there is also some SSL work on <strong>entropy minimization</strong>, where we force the model to make confident predictions by minimizing the entropy of the predictions.
Consistency training can also be considered a proxy-label method, with the subtle difference that instead of considering the predictions as ground-truths and computing the cross-entropy loss, we enforce consistency of predictions by minimizing a given distance between the outputs.</p>
<p>In this post, we will focus more on consistency regularization based approaches, given that they are the most commonly used methods in deep learning, and we will present a brief introduction to the proxy-label, and holistic approaches.</p>
<h2 id="main-assumptions-in-ssl">Main Assumptions in SSL</h2>
<p>The first question we need to answer, is under what assumptions can we apply SSL algorithms? SSL algorithms only work under some conditions, where some assumptions about the structure of the data need to hold. Without such assumptions, it would not be possible to generalize from a finite training set to a set of possibly infinitely many unseen test cases.</p>
<p>The main assumptions in SSL are:</p>
<ul>
<li><strong>The Smoothness Assumption</strong>: <em>If two points \(x_1\), \(x_2\) that reside in a high-density region are close, then so should be their corresponding outputs \(y_1\), \(y_2\)</em>. Meaning that if two inputs are of the same class and belong to the same cluster, which is a high-density region of the input space, then their corresponding outputs need to be close. The converse also holds: if the two points are separated by a low-density region, the outputs must be distant from each other. This assumption can be quite helpful for a classification task, but not so much for regression.</li>
<li><strong>The Cluster Assumption</strong>: <em>If points are in the same cluster, they are likely to be of the same class.</em> In this special case of the smoothness assumption, we suppose that input data points form clusters, and each cluster corresponds to one of the output classes.
The cluster assumption can also be seen as the low-density separation assumption: <em>the decision boundary should lie in the low-density regions.</em>
The relation between the two assumptions is easy to see: if a given decision boundary lies in a high-density region, it will likely cut a cluster into two different classes, resulting in samples from different classes belonging to the same cluster, which is a violation of the cluster assumption. In this case, we can restrict our model to have consistent predictions on the unlabeled data under small perturbations, pushing its decision boundary toward low-density regions.</li>
<li><strong>The Manifold Assumption</strong>: <em>The (high-dimensional) data lie (roughly) on a low-dimensional manifold.</em> With high dimensional space, where the volume grows exponentially with the number of dimensions, it can be quite hard to estimate the true data distribution for generative tasks, and for discriminative tasks, the distances are similar regardless of the class type, making classification quite challenging. However, if our input data lies on some lower-dimensional manifold, we can try to find a low dimensional representation using the unlabeled data and then use the labeled data to solve the simplified task.</li>
</ul>
<h1 id="consistency-regularization">Consistency Regularization</h1>
<p>A recent line of work in deep semi-supervised learning utilizes the unlabeled data
to enforce the trained model to be in line with the cluster assumption, i.e., the
learned decision boundary must lie in low-density regions. These methods are based
on a simple concept that, if a realistic perturbation was to be applied to an unlabeled
example, the prediction should not change significantly, given that under the
cluster assumption: Data points with distinct labels are separated with low-density regions, so the likelihood of one example switching classes after a perturbation is small (see Figure 1).</p>
<p>More formally, with consistency regularization, we are favoring the functions \(f_\theta\) that give consistent
prediction for similar data points. So rather than minimizing the classification cost at the zero-dimensional data points
of the inputs space, the regularized model minimizes the cost on a manifold around each data point, pushing the
decision boundaries away from the unlabeled data points and smoothing the manifold on which the data
resides (<a href="http://pages.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf">Zhu, 2005</a>).
Given an unlabeled data point \(x_u \in \mathcal{D}_u\) and its perturbed version \(\hat{x}_u\),
the objective is to minimize the distance between the two outputs
\(d(f_{\theta}(x_u), f_{\theta}(\hat{x}_u))\). The popular distance measures \(d\) are
mean squared error (MSE), Kullback-Leibler divergence (KL)
and Jensen-Shannon divergence (JS). For two
outputs \(y_u = f_{\theta}(x_u)\) and \(\hat{y}_u = f_{\theta}(\hat{x}_u)\) in the form of a probability distribution over the \(C\)
classes,
and \(m=\frac{1}{2}(f_{\theta}(x_u) + f_{\theta}(\hat{x}_u))\), we can compute these measures as follows:</p>
\[\small d_{\mathrm{MSE}}(y_u, \hat{y}_u)=\frac{1}{C} \sum_{k=1}^{C}(f_{\theta}(x_u)_k -f_{\theta}(\hat{x}_u)_k)^{2}\]
\[\small d_{\mathrm{KL}}(y_u, \hat{y}_u)=\frac{1}{C} \sum_{k=1}^{C} f_{\theta}(x_u)_k \log \frac{f_{\theta}(x_u)_k}{f_{\theta}(\hat{x}_u)_k}\]
\[\small d_{\mathrm{JS}}(y_u, \hat{y}_u)=\frac{1}{2}
d_{\mathrm{KL}}(y_u, m)+\frac{1}{2} \mathrm{d}_{\mathrm{KL}}(\hat{y}_u, m)\]
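<p>As a concrete reference, the three distance measures above fit in a few lines; the sketch below assumes PyTorch tensors <code>p</code> and <code>q</code> holding the two batched class-probability outputs, and follows the per-class averaging used in the formulas above:</p>
<pre><code class="language-python">import torch

def d_mse(p, q):
    # Mean squared error between the two class-probability outputs (per sample).
    return ((p - q) ** 2).mean(dim=1)

def d_kl(p, q, eps=1e-8):
    # KL divergence, averaged over the C classes as in the formula above;
    # eps keeps the logarithms well-defined.
    return (p * (torch.log(p + eps) - torch.log(q + eps))).mean(dim=1)

def d_js(p, q):
    # Jensen-Shannon divergence: symmetrized KL to the mixture m.
    m = 0.5 * (p + q)
    return 0.5 * d_kl(p, m) + 0.5 * d_kl(q, m)
</code></pre>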
<p>Note that we can also enforce a consistency over two perturbed versions of \(x_u\),
\(\hat{x}_{u_1}\) and \(\hat{x}_{u_2}\). Now let’s go through the popular consistency regularization methods
in deep learning.</p>
<h2 id="ladder-networks">Ladder Networks</h2>
<p>With the objective of taking any well-performing feed-forward network trained on supervised data and augmenting it with additional branches so that it can utilize additional unlabeled data, <a href="https://arxiv.org/abs/1507.02672">Rasmus et al.</a> proposed to use Ladder Networks (<a href="https://arxiv.org/abs/1411.7783">Harri Valpola</a>) with an additional encoder and decoder for SSL.
As illustrated in Figure 2, the network consists of two encoders, a corrupted and clean one, and a decoder.
At each training iteration, the input \(x\) is passed through both encoders. In the corrupted encoder,
Gaussian noise is injected at each layer after batch normalization, producing two outputs, a clean prediction
\(y\) and a prediction based on corrupted activations \(\tilde{y}\). The output \(\tilde{y}\) is then fed into
the decoder to reconstruct the uncorrupted input and the clean hidden activations.
The unsupervised training loss \(\mathcal{L}_u\) is then computed as the MSE between the activations of the clean encoder \(\mathbf{z}\) and the reconstructed activations \(\hat{\mathbf{z}}\) (i.e., after batch normalization) in the decoder using the corrupted output \(\tilde{y}\). This loss is computed over all layers, from the input to the last layer \(L\), with a weighting \(\lambda_{l}\) for each layer’s contribution:
\[\mathcal{L}_u = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \sum_{l=0}^{L} \lambda_{l}\|\mathbf{z}^{(l)}-\hat{\mathbf{z}}^{(l)}\|^{2}\]
<p>If the input \(x\) is a labeled data point (\(x \in \mathcal{D}_l\)), we can add a supervised loss term to \(\mathcal{L}_u\) to obtain the final loss. Note that the supervised cross-entropy loss \(\mathrm{H}(\tilde{y}, t)\) is computed between the corrupted output \(\tilde{y}\) and the targets \(t\):
\[\mathcal{L} = \mathcal{L}_u + \mathcal{L}_s = \mathcal{L}_u +
\frac{1}{|\mathcal{D}_l|} \sum_{x, t \in \mathcal{D}_l} \mathrm{H}(\tilde{y}, t)\]
<figure style="width: 75%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/SSL/ladder_network.png" alt="" />
<figcaption>Fig. 2. An illustration of one forward pass of Ladder Networks, C refers
to the MSE loss between the activations at various layers.
(Image source: <a href="https://arxiv.org/abs/1507.02672">Rasmus et al</a>)
</figcaption>
</figure>
<p>The method can be easily adapted for convolutional neural networks (CNNs)
by replacing the fully-connected layers with
convolution and deconvolution layers for semi-supervised vision tasks.
However, the ladder network is quite heavy computationally, approximately tripling
the computation needed for one training iteration. To mitigate this,
the authors propose a variant of ladder networks called <strong>Γ-Model</strong> where
\(\lambda_{l}=0\) when \(l<L\). In this case, the decoder is omitted, and the unsupervised loss
is computed as the MSE between the two outputs \(y\) and \(\tilde{y}\).</p>
<h2 id="π-model">Π-model</h2>
<p>The <strong>Π-model</strong> (<a href="https://arxiv.org/abs/1610.02242">Laine et al.</a>) is a simplification of the <strong>Γ-Model</strong> of Ladder Networks,
where the corrupted encoder is removed, and the same network is used to get the prediction for both corrupted and uncorrupted inputs.
Specifically, <strong>Π-model</strong> takes advantage of the stochasticity of the prediction function \(f_\theta\) in neural networks, induced by conventional regularization techniques such as data augmentation and dropout, which should not meaningfully alter the model’s predictions.
For any given input \(x\), the objective is to reduce the distance between the two predictions of \(f_\theta\) obtained over two forward passes with \(x\) as input.
Concretely, as illustrated in Figure 3, we would like to minimize \(d(z, \tilde{z})\), where one of the two outputs is considered as the target.
Given the stochastic nature of the prediction function (i.e., using dropout as the noise source),
the two outputs \(f_\theta(x) = z\) and \(f_\theta(x) = \tilde{z}\) will be distinct. The objective is
to obtain consistent predictions for both of them. In case the input \(x\) is a labeled data point,
we also compute the cross-entropy supervised loss using the provided labels \(y\) and the total loss will be:</p>
\[\mathcal{L} = w \frac{1}{|\mathcal{D}_u|} \sum_{x \in \mathcal{D}_u}
d_{\mathrm{MSE}}(z, \tilde{z}) +
\frac{1}{|\mathcal{D}_l|} \sum_{x, y \in \mathcal{D}_l} \mathrm{H}(y, z)\]
<p>With \(w\) as a weighting function, starting from 0 up to a fixed weight \(\lambda\) (e.g., 30) after a given number of epochs (e.g., 20% of training time). This way, we avoid relying on the untrained and essentially random prediction function, which provides unstable predictions at the start of training, to extract the training signal from
the unlabeled examples.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/SSL/pi_model.png" alt="" />
<figcaption>Fig. 3. Loss computation for <b>Π-model</b>, we compute the MSE between the two outputs for the unsupervised loss, and if the input
is a labeled example, we add the supervised loss to the weighted unsupervised loss.
(Image source: <a href="https://arxiv.org/abs/1610.02242">Laine et al</a>)
</figcaption>
</figure>
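<p>A minimal sketch of such a training step might look as follows, assuming a PyTorch model in train mode whose stochasticity comes from dropout (per-pass data augmentation would ideally also differ between the two passes); the helper name and interface are illustrative:</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def pi_model_step(model, x_l, y_l, x_u, w):
    x = torch.cat([x_l, x_u], dim=0)
    # Two stochastic forward passes on the same batch (dropout acts as the
    # noise source), giving the two outputs z and z_tilde.
    logits_1, logits_2 = model(x), model(x)
    z = torch.softmax(logits_1, dim=1)
    z_tilde = torch.softmax(logits_2, dim=1)
    # Unsupervised consistency loss between the two predictions.
    unsup = F.mse_loss(z, z_tilde.detach())
    # Supervised cross-entropy on the labeled part of the batch.
    sup = F.cross_entropy(logits_1[: x_l.size(0)], y_l)
    return sup + w * unsup
</code></pre>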
<h2 id="temporal-ensembling">Temporal Ensembling</h2>
<p>Π-model can be divided into two stages. We first classify all of the training data without updating the weights of the model,
obtaining the predictions \(\tilde{z}\), and in the second stage, we consider the predictions \(\tilde{z}\) as targets for the unsupervised
loss and enforce consistency of predictions by minimizing the distance between the current outputs \(z\) and the outputs of
the first stage \(\tilde{z}\) under different dropout and augmentations.
The problem with this approach is that the targets \(\tilde{z}\) are based on a single evaluation of the network and can rapidly change. This instability in the targets can lead to instability during training and reduces the amount of training signal that can be extracted from the unlabeled examples. To solve this, <a href="https://arxiv.org/abs/1610.02242">Laine et al.</a> proposed a second
version of Π-model called <strong>Temporal Ensembling</strong>, where the targets \(\tilde{z}\) are the aggregation of all the previous predictions.
This way, during training, we only need a single forward pass to get the current predictions \(z\) and the aggregated targets \(\tilde{z}\),
speeding up the training time by approximately 2x. The training process is illustrated in Figure 4.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/SSL/temporal_ensembling.png" alt="" />
<figcaption>Fig. 4. Loss computation for <b>Temporal Ensembling</b>, we compute the MSE between the current prediction and
the aggregated target for the unsupervised loss, and if the input is a labeled example, we add the supervised loss to the weighted unsupervised loss.
(Image source: <a href="https://arxiv.org/abs/1610.02242">Laine et al</a>)
</figcaption>
</figure>
<p>For a target \(\tilde{z}\), at each training iteration, the current outputs \(z\) are accumulated into the <em>ensemble outputs</em> \(\tilde{z}\)
by an exponentially moving average update:</p>
\[\tilde{z} = \alpha \tilde{z}+(1-\alpha) z\]
<p>where \(\alpha\) is a momentum term that controls how far the ensemble reaches into training history. \(\tilde{z}\)
can also be seen as the output of an ensemble network \(f\) from previous training epochs, where the
recent ones have a greater weight than the distant ones.</p>
<p>At the start of training, temporal ensembling reduces to Π-model since the aggregated targets are very noisy. To overcome this, similar to the bias correction used in the Adam optimizer, the training targets \(\tilde{z}\) are corrected for the startup bias at a training step \(t\) as follows:</p>
\[\tilde{z} = (\alpha \tilde{z}+(1-\alpha) z) / (1-\alpha^{t})\]
<p>The loss computation in temporal ensembling remains the same as in Π-model, but with two critical benefits. First, training is faster since we only need a single forward pass through the network to obtain \(z\), while an exponential moving average (EMA) of the label predictions on each training example is maintained and predictions that are inconsistent with these targets are penalized.
Second, the targets are more stable during training, yielding better results.
The downside of such a method is the large amount of memory needed to keep an aggregate of the predictions for all of the training examples,
which can become quite memory intensive for large datasets and dense tasks (e.g., semantic segmentation).</p>
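<p>The target bookkeeping is essentially an EMA over a per-example buffer; a sketch of the update, assuming a PyTorch buffer <code>Z</code> of shape (N, C) over the whole training set and the batch indices <code>idx</code>, could be:</p>
<pre><code class="language-python">import torch

def update_ensemble_targets(Z, z_batch, idx, alpha, t):
    # Accumulate the current predictions into the per-example ensemble buffer.
    Z[idx] = alpha * Z[idx] + (1 - alpha) * z_batch.detach()
    # Correct for the startup bias (as in Adam) before using them as targets.
    return Z[idx] / (1 - alpha ** t)
</code></pre>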
<h2 id="mean-teachers">Mean Teachers</h2>
<p>In the previous approach, the same model plays a dual role as a <em>teacher</em> and a <em>student</em>. Given a set of
unlabeled data, as a teacher, the model generates the targets, which are then used by itself as a student for learning using a consistency loss.
These targets may very well be misclassified, and if the weight of the unsupervised loss outweighs that of the supervised loss, the model is prevented from learning new information and keeps predicting the same targets, resulting in a form of confirmation bias. To solve this, the quality of the targets must be improved.</p>
<p>The quality of targets can be improved by either: (1) carefully choosing the perturbations instead of simply injecting
additive or multiplicative noise, or (2) carefully choosing the teacher model responsible for generating the targets,
instead of using a replica of the student model.</p>
<p>Π-model and its improved version with Temporal Ensembling provide a better and more stable teacher model by maintaining an EMA of the predictions on each example, formed by an ensemble of the model’s current version and the earlier versions that evaluated the same example. This ensembling improves the quality of the predictions, and using them as the teacher predictions improves results. However, the newly learned information is incorporated into training at a slow pace, since each target is updated only once per epoch, and the larger the dataset, the bigger the span between updates gets. To overcome the limitations of Temporal Ensembling,
<a href="https://arxiv.org/abs/1703.01780">Tarvainen et al.</a> propose to average the model weights instead of its predictions and call this
method Mean Teacher, illustrated in Figure 5.</p>
<figure style="width: 100%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/SSL/mean_teacher.png" alt="" />
<figcaption>Fig. 5. The Mean Teacher method. The teacher model, which is an EMA of the student model, is responsible for generating the targets for consistency training. The student model is then trained to minimize the supervised loss over labeled examples and the consistency loss over unlabeled examples. At each training iteration, both models are evaluated with an injected noise (η, η'), and the weights of the teacher model are updated using the current student model to incorporate the learned information at a faster pace.
(Image source: <a href="https://arxiv.org/abs/1703.01780">Tarvainen et al.</a>)
</figcaption>
</figure>
<p>A training iteration of Mean Teacher is very similar to previous methods. The main difference is that whereas the Π-model uses the same model as student and teacher \(\theta^{\prime}=\theta\), and Temporal Ensembling approximates a stable teacher \(f_{\theta^{\prime}}\) as an ensemble function with a weighted average of successive predictions,
Mean Teacher defines the weights \(\theta^{\prime}_t\) of the teacher model \(f_{\theta^{\prime}}\)
at training step \(t\) as the EMA of successive student’s weights \(\theta\) as follows:</p>
\[\theta_{t}^{\prime}=\alpha \theta_{t-1}^{\prime}+(1-\alpha) \theta_{t}\]
<p>In this case, the loss computation is the sum of the supervised and unsupervised loss, where the teacher model is used to obtain the targets
for the unsupervised loss for a given input \(x_i\):</p>
\[\mathcal{L} = w \frac{1}{|\mathcal{D}_u|} \sum_{x \in \mathcal{D}_u}
d_{\mathrm{MSE}}(f_{\theta}(x), f_{\theta^{\prime}}(x)) +
\frac{1}{|\mathcal{D}_l|} \sum_{x, y \in \mathcal{D}_l} \mathrm{H}(y, f_{\theta}(x))\]
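<p>The teacher update itself is a one-liner over the model parameters; a minimal sketch, assuming two PyTorch modules with identical architectures:</p>
<pre><code class="language-python">import torch

@torch.no_grad()
def update_teacher(teacher, student, alpha=0.99):
    # theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1 - alpha)
</code></pre>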
<h2 id="dual-students">Dual Students</h2>
<p>One of the main drawbacks of using Mean Teacher, where the teacher’s weights are an EMA of the student’s weights,
is that, given a large number of training iterations, the weights of the teacher model will converge to those of the student model, and any biased and unstable predictions will be carried over to the student.</p>
<p>To solve this, <a href="https://arxiv.org/abs/1909.01804">Ke et al.</a> propose a dual students setup, where
two student models with different initialization are simultaneously trained, and at a given iteration, one of them
provides the targets for the other. To choose which one, we check for the most stable predictions that satisfy
the following stability conditions:</p>
<ul>
<li>The predictions using two input versions, a clean \(x\) and a perturbed version \(\tilde{x}\) give
the same results: \(f(x) = f(\tilde{x})\).</li>
<li>Both predictions are confident, i.e., far from the decision boundary. This can be tested by checking whether the highest probability in \(f(x)\) (resp. \(f(\tilde{x})\)) is greater than a confidence threshold \(\epsilon\), such as 0.1.</li>
</ul>
<p>Given two student models \(f_{\theta_1}\) and \(f_{\theta_2}\), an unlabeled input \(x_u\), and its perturbed version \(\tilde{x}_u\), we compute four predictions: \(f_{\theta_1}(x_u), f_{\theta_1}(\tilde{x}_u), f_{\theta_2}(x_u), f_{\theta_2}(\tilde{x}_u)\). Each model is trained to minimize both the supervised and unsupervised losses:</p>
\[\mathcal{L}_s = \frac{1}{|\mathcal{D}_l|} \sum_{x_l, y \in \mathcal{D}_l} \mathrm{H}(y, f_{\theta_i}(x_l))\]
\[\mathcal{L}_u = \frac{1}{|\mathcal{D}_u|} \sum_{x_u \in \mathcal{D}_u} d_{\mathrm{MSE}}(f_{\theta_i}(x_u), f_{\theta_i}(\tilde{x}_u))\]
<p>In addition, we force one of the students to have predictions similar to those of its counterpart. To choose which one to update, we check the stability constraint for both models. If the predictions of one of the models are unstable, we update its weights. If both are stable, we update the model with the largest variation \(\mathcal{E}^{i} =\left\|f_{i}(x_u)-f_{i}(\tilde{x}_u)\right\|^{2}\), i.e., the least stable one.
<figure style="width: 90%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/SSL/dualstudents.png" alt="" />
<figcaption>Fig. 6. An overview of the Dual Students setup, where two independently initialized students are trained simultaneously and the more stable one provides the targets for the other.
(Image source: <a href="https://arxiv.org/abs/1909.01804">Ke et al</a>)
</figcaption>
</figure>
<p>In the end, as depicted in Figure 6, the least stable model is trained with the following loss:</p>
\[\mathcal{L} = \mathcal{L}_s + \lambda_{1} \mathcal{L}_u + \lambda_{2}
\frac{1}{|\mathcal{D}_u|} \sum_{x_u \in \mathcal{D}_u}
d_{\mathrm{MSE}}(f_{\theta_i}(x_u), f_{\theta_j}(x_u))\]
<p>while the stable model is trained using traditional loss for consistency training: \(\lambda_{1} \mathcal{L}_u + \mathcal{L}_s\).</p>
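<p>The stability conditions above reduce to a couple of tensor comparisons; a sketch, assuming <code>probs_clean</code> and <code>probs_pert</code> hold one student’s softmax outputs on \(x\) and \(\tilde{x}\) (the helper name is illustrative):</p>
<pre><code class="language-python">import torch

def is_stable(probs_clean, probs_pert, eps=0.1):
    # Condition 1: the predicted class is unchanged under the perturbation.
    same_class = probs_clean.argmax(dim=1) == probs_pert.argmax(dim=1)
    # Condition 2: both predictions are confident (far from the boundary).
    confident = (probs_clean.max(dim=1).values > eps) & \
                (probs_pert.max(dim=1).values > eps)
    return same_class & confident
</code></pre>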
<h2 id="virtual-adversarial-training">Virtual Adversarial Training</h2>
<p>The previous approaches focused on applying random perturbations to each input to generate artificial input points, encouraging the model to assign similar outputs to the unlabeled data points and their perturbed versions. This way, we push for a smoother output distribution, and as a result, the generalization performance of the model can be improved. However, such random noise and random data augmentation often leave the predictor particularly vulnerable to small perturbations in a specific direction, namely the adversarial direction, i.e., the direction in the input space in which the label probability \(p(y|x)\) of the model is most sensitive.</p>
<p>To solve this, and inspired by adversarial training (<a href="https://arxiv.org/abs/1412.6572">Goodfellow et al.</a>), which trains the model to assign to each input a label similar to the labels of its neighbors in the adversarial direction, <a href="https://arxiv.org/abs/1704.03976">Miyato et al.</a> propose Virtual Adversarial Training (VAT), a regularization technique that enhances the model’s robustness around each input data point against random and local perturbations. The term “virtual” comes from the fact that the adversarial perturbation is approximated without label information and is hence applicable to semi-supervised learning to smooth the output distribution.</p>
<p>Concretely, VAT trains the output distribution to be isotropically smooth around each data point by selectively smoothing the model in its most adversarial direction.
For a given data point \(x\), we would like to compute the adversarial perturbation \(r_{adv}\) that will alter the model’s predictions the most.
We start by sampling Gaussian noise \(r\) of the same dimensions as the input \(x\). We then compute the gradient \(grad_r\) of the loss between the two predictions, with and without the injection of the noise \(r\), with respect to \(r\) (the KL-divergence is used as the distance measure \(d(.,.)\)).
\(r_{adv}\) can then be obtained by normalizing and scaling \(grad_r\) by a hyperparameter \(\epsilon\). This can be written as follows:</p>
\[1) \ \ r \sim \mathcal{N}(0, \frac{\xi}{\sqrt{\operatorname{dim}(x)}} I)\]
\[2) \ \ grad_{r}=\nabla_{r} d_{\mathrm{KL}}(f_{\theta}(x), f_{\theta}(x+r))\]
\[3) \ \ r_{adv}=\epsilon \frac{grad_{r}}{\|grad_{r}\|}\]
<p>Note that the computation above is a single iteration of the approximation of \(r_{adv}\); for a more accurate approximation, we can set \(r = r_{adv}\) and recompute \(r_{adv}\) by repeating the last two steps. In general, however, given how computationally expensive this is, requiring additional forward and backward passes, we only apply a single power iteration to compute the adversarial perturbation.</p>
<p>With the optimal perturbation \(r_{adv}\), we can then compute the unsupervised loss as the MSE
between the two predictions of the model, with and without the injection of \(r_{adv}\):</p>
\[\mathcal{L}_u = w \frac{1}{|\mathcal{D}_u|} \sum_{x_u \in \mathcal{D}_u}
d_{\mathrm{MSE}}(f_{\theta}(x_u), f_{\theta}(x_u + r_{adv}))\]
<p>For a more stable training, we can use a mean teacher to generate stable targets
by replacing \(f_{\theta}(x_u)\) with \(f_{\theta^{\prime}}(x_u)\), where \(f_{\theta^{\prime}}\)
is an EMA teacher model of the student \(f_{\theta}\).</p>
<figure style="width: 75%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/SSL/vat.png" alt="" />
<figcaption>Fig. 7. Examples of the perturbed inputs for different values of the scaling hyperparameter Ɛ.
(Image source: <a href="https://arxiv.org/abs/1704.03976">Miyato et al</a>)
</figcaption>
</figure>
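<p>A sketch of the single power-iteration approximation of \(r_{adv}\) is given below, assuming image inputs of shape (B, C, H, W) and a PyTorch model returning logits; as noted above, a mean-teacher model can be substituted to produce the clean prediction:</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def vat_perturbation(model, x, xi=1e-6, eps=8.0):
    with torch.no_grad():
        p = torch.softmax(model(x), dim=1)          # clean prediction
    # 1) random direction, normalized per sample and scaled by xi
    r = torch.randn_like(x)
    r = xi * r / (r.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-8)
    r.requires_grad_(True)
    # 2) gradient of the divergence between clean and perturbed predictions
    log_p_hat = torch.log_softmax(model(x + r), dim=1)
    div = F.kl_div(log_p_hat, p, reduction='batchmean')
    grad_r = torch.autograd.grad(div, r)[0]
    # 3) normalize the gradient and rescale by the perturbation budget eps
    r_adv = eps * grad_r / (grad_r.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-8)
    return r_adv.detach()
</code></pre>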
<h2 id="adversarial-dropout">Adversarial Dropout</h2>
<p>Instead of using an additive adversarial noise as VAT, <a href="https://arxiv.org/abs/1707.03631">Park et al.</a>
propose adversarial dropout (AdD), in which dropout masks are adversarially optimized to alter the model’s predictions.
With this type of perturbation, we induce a sparse structure in the neural network, whereas other, additive forms of noise do not directly change the structure of the network.</p>
<p>The first step is to find the dropout conditions that is most sensitive to the model’s predictions. In a SSL setting,
where we do not have access to the true labels, we use the model predictions on the unlabeled data points to approximate
the adversarial dropout mast \(\epsilon^{adv}\), which is subject to the boundary condition:
\(\|\epsilon^{adv}-\epsilon\|_{2} \leq \delta H\) with \(H\)
as the dropout layer dimension and a hyperparameter \(\delta\),
which restricts the adversarial dropout mask to be infinitesimally different from the random dropout mask \(\epsilon\).
Without this constraint, the adversarial dropout might induce a layer without any connections.
By restricting the adversarial dropout to be similar to the random dropout,
we prevent finding such an irrational layer, which does not support backpropagation.</p>
<p>Similar to VAT, we start from a random dropout mask and compute a KL-divergence loss between the outputs with and without dropout; given the gradients of the loss with respect to the activations before the dropout layer, we update the random dropout mask in an adversarial manner.
The prediction function \(f_{\theta}\) is divided into two parts, \(f_{\theta_1}\)
and \(f_{\theta_2}\), where \(f_{\theta}(x_i, \epsilon)=f_{\theta_{2}}(f_{\theta_{1}}(x_i) \odot \epsilon)\),
we start by computing an approximation of the jacobian matrix as follows:</p>
\[J(x_i, \epsilon) \approx f_{\theta_{1}}(x_i)\odot
\nabla_{f_{\theta_{1}}(x_i)} d_{\mathrm{KL}}(f_{\theta}(x_i),
f_{\theta}(x_i, \epsilon))\]
<p>Using \(J(x_i, \epsilon)\), we can then update the random dropout mask \(\epsilon\)
to obtain \(\epsilon^{adv}\), so that if \(\epsilon(i) = 0\) and \(J(x_i, \epsilon)(i) > 0\)
or \(\epsilon(i) = 1\) and \(J(x_i, \epsilon)(i) < 0\) at a given position \(i\), we invert the value of \(\epsilon\) at that location. This results in \(\epsilon^{adv}\), which can then be used to compute the unsupervised loss:</p>
\[\mathcal{L}_u = w \frac{1}{|\mathcal{D}_u|} \sum_{x_u \in \mathcal{D}_u}
d_{\mathrm{MSE}}(f_{\theta}(x_u), f_{\theta}(x_u, \epsilon^{adv}))\]
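<p>A simplified sketch of this mask update is shown below. It assumes the network is split into <code>f1</code> (layers before dropout, whose output <code>h</code> is precomputed) and <code>f2</code> (layers after), that <code>eps</code> is a 0/1 float mask, and it flips at most \(\delta H\) entries per sample; it is only a rough approximation of the procedure in the paper, with all names illustrative:</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def adversarial_dropout_mask(h, eps, f2, delta=0.05):
    # h: activations before dropout, f1(x), of shape (B, H); eps: random mask.
    h = h.detach().requires_grad_(True)
    with torch.no_grad():
        clean = torch.softmax(f2(h), dim=1)
    noisy = torch.log_softmax(f2(h * eps), dim=1)
    kl = F.kl_div(noisy, clean, reduction='batchmean')
    grad = torch.autograd.grad(kl, h)[0]
    # Approximate Jacobian: J(x, eps) ~= f1(x) * grad.
    J = h.detach() * grad
    # Flip candidates: entries whose sign disagrees with the current mask.
    candidates = ((eps == 0) & (J > 0)) | ((eps == 1) & (J.sign() == -1))
    # Keep only the delta*H most influential flips per sample (boundary condition).
    budget = max(1, int(delta * h.shape[1]))
    scores = torch.where(candidates, J.abs(), torch.zeros_like(J))
    flip = torch.zeros_like(eps).scatter_(1, scores.topk(budget, dim=1).indices, 1.0)
    flip = flip * candidates.float()
    return eps * (1 - flip) + (1 - eps) * flip
</code></pre>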
<h2 id="interpolation-consistency-training">Interpolation Consistency Training</h2>
<p>As discussed earlier, the random perturbations are inefficient in high dimensions, given that only a limited subset of the
input perturbations are capable of pushing the decision boundary into low-density regions. VAT
and AdD find the adversarial perturbations that maximize the change in the model’s predictions, which requires multiple forward and backward passes per step. This additional computation can be restrictive in many cases and makes such methods less appealing.
As an alternative, <a href="https://arxiv.org/abs/1903.03825">Verma et al.</a> propose Interpolation Consistency Training (ICT) as an
efficient consistency regularization technique for SSL.</p>
<p>Consider a mixup operation \(\operatorname{Mix}_{\lambda}(a, b)=\lambda \cdot a+(1-\lambda) \cdot b\) that outputs an interpolation of its two inputs with a weight \(\lambda \sim \operatorname{Beta}(\alpha, \alpha)\) for \(\alpha \in(0, \infty)\). As shown in Figure 8, ICT trains a prediction function \(f_{\theta}\) to provide consistent predictions at different interpolations
of unlabeled data points \(u_i\) and \(u_j\), where the targets are generated using a teacher model \(f_{\theta^{\prime}}\)
which is an EMA of \(f_{\theta}\):</p>
\[f_{\theta}(\operatorname{Mix}_{\lambda}(u_{j}, u_{k})) \approx
\operatorname{Mix}_{\lambda}(f_{\theta^{\prime}}(u_{j}), f_{\theta^{\prime}}(u_{k}))\]
<figure style="width: 90%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/SSL/ICT.png" alt="" />
<figcaption>Fig. 8. ICT, where a student model is trained to have consistent predictions at different interpolations of unlabeled data points, and a teacher is used to generate the targets before the mixup operation.
(Image source: <a href="https://arxiv.org/abs/1903.03825">Verma et al</a>)
</figcaption>
</figure>
<p>The unsupervised objective is to have similar values between the student model’s prediction given a mixed input of two unlabeled data points and the mixed outputs of the teacher model.</p>
\[\mathcal{L}_u = w \frac{1}{|\mathcal{D}_u|} \sum_{u_j, u_k \in \mathcal{D}_u}
d_{\mathrm{MSE}}(f_{\theta}(\operatorname{Mix}_{\lambda}(u_{j}, u_{k})),
\operatorname{Mix}_{\lambda}(f_{\theta^{\prime}}(u_{j}), f_{\theta^{\prime}}(u_{k})))\]
<p>The benefit of ICT compared to random noise can be analyzed by considering the mixup operation as a perturbation applied to a given unlabeled example: \(u_{j}+\delta=\operatorname{Mix}_{\lambda}(u_{j}, u_{k})\). For a large number of classes and a similar distribution of examples per class, it is likely that the pair of points \(\left(u_{j}, u_{k}\right)\) lie in different clusters and belong to different classes. If one of these two data points lies in a low-density region, the interpolation toward \(u_{k}\) points toward a low-density region, which is a good direction in which to move the decision boundary.</p>
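<p>A sketch of the ICT unsupervised loss for one pair of unlabeled batches, assuming a student model and an EMA teacher as above (names illustrative):</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def ict_loss(student, teacher, u1, u2, alpha=0.2):
    # Sample the interpolation weight lambda from a Beta(alpha, alpha).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed = lam * u1 + (1 - lam) * u2
    # Targets: the mixup of the teacher's predictions on the two batches.
    with torch.no_grad():
        target = lam * torch.softmax(teacher(u1), dim=1) \
                 + (1 - lam) * torch.softmax(teacher(u2), dim=1)
    pred = torch.softmax(student(mixed), dim=1)
    return F.mse_loss(pred, target)
</code></pre>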
<h2 id="unsupervised-data-augmentation">Unsupervised Data Augmentation</h2>
<p>Unsupervised Data Augmentation (<a href="https://arxiv.org/abs/1904.12848">Xie et al.</a>) uses
advanced data augmentation methods, such as <a href="https://arxiv.org/abs/1805.09501">AutoAugment</a>,
<a href="https://arxiv.org/abs/1909.13719">RandAugment</a> and <a href="https://arxiv.org/abs/1808.09381">Back Translation</a> as perturbations
for consistency training based SSL.
Similar to supervised learning, advanced data augmentation methods can also provide extra advantages
over simple augmentations and random noise
for consistency training, given that (1) they generate realistic augmented examples, making it safe to encourage consistency between predictions on the original and augmented examples; (2) they can generate a diverse set of examples, improving sample efficiency; and (3) they are capable of providing the missing inductive biases for different tasks.</p>
<p>Motivated by these points, <a href="https://arxiv.org/abs/1904.12848">Xie et al.</a> propose to apply the following
augmentations to generate transformed versions of the unlabeled inputs:</p>
<ul>
<li>RandAugment for Image Classification: consists of uniformly sampling from the same set of possible transformations in PIL, without requiring any labeled data to search for a good augmentation strategy.</li>
<li>Back-translation for Text Classification: consists of translating an existing example in language A
into another language B, and then translating it back into A to obtain an augmented example.</li>
</ul>
<figure style="width: 90%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/SSL/uda.png" alt="" />
<figcaption>Fig. 9. The training procedure in UDA.
(Image source: <a href="https://arxiv.org/abs/1904.12848">Qizhe Xie et al</a>)
</figcaption>
</figure>
<p>After defining the augmentations to be applied during training, the training procedure shown in Figure 9 is quite straightforward. The objective is to produce correct predictions on the labeled set, and consistent predictions on the original and augmented examples from the unlabeled set.</p>
<h1 id="entropy-minimization">Entropy Minimization</h1>
<p>In the previous section, in a setting where the cluster assumption is maintained, we enforce consistency of predictions
to push the decision boundary into low-density regions to avoid classifying
samples from the same cluster with distinct classes, which is a violation of the cluster assumption.
Another way to enforce this is to encourage the network to make confident (low-entropy) predictions on
unlabeled data regardless of the predicted class, discouraging the decision boundary from passing near data points
where it would otherwise be forced to produce low-confidence predictions.
This is done by adding a loss term which minimizes the entropy of the prediction function \(f_\theta(x)\),
e.g., for a categorical output space with \(C\) possible classes, the entropy minimization term
(<a href="http://papers.nips.cc/paper/2740-semi-supervised-learning-by-entropy-minimization.pdf">Grandvalet et al.</a>) is:</p>
\[-\sum_{k=1}^{C} f_{\theta}(x)_{k} \log f_{\theta}(x)_{k}\]
<p>However, with high-capacity models such as neural networks, the model can quickly satisfy this objective on low-confidence data points by simply outputting large logits, resulting in a model with overly confident predictions. On its own, entropy minimization doesn’t produce competitive results compared to other SSL methods, but it can produce state-of-the-art results when combined with other SSL approaches.</p>
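<p>The term itself is a one-liner on the model’s logits; a minimal PyTorch-style sketch:</p>
<pre><code class="language-python">import torch

def entropy_loss(logits):
    # Average entropy of the predicted class distributions over the batch;
    # minimizing it pushes the model toward confident (low-entropy) outputs.
    p = torch.softmax(logits, dim=1)
    log_p = torch.log_softmax(logits, dim=1)
    return -(p * log_p).sum(dim=1).mean()
</code></pre>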
<h1 id="proxy-label-methods">Proxy-label Methods</h1>
<p>Proxy-label methods (<a href="https://arxiv.org/abs/1804.09530">Ruder et al.</a>) are the class of SSL algorithms that produce proxy labels on unlabeled data, using the prediction function itself or some variant of it without any supervision. These proxy labels are then used as targets together with the labeled data, providing additional training information even if the produced labels are often noisy or weak and do not reflect the ground truth. Such methods can be divided mainly into two groups: self-training, where the model itself produces the proxy labels, and multi-view learning, where the proxy labels are produced by models trained on different views of the data.</p>
<h2 id="self-training">Self-training</h2>
<p>In self-training or bootstrapping, the small amount of labeled data \(\mathcal{D}_l\) is first used to train a prediction function
\(f_{\theta}\). The trained model is then used to assign pseudo-labels to the unlabeled data points in \(\mathcal{D}_u\).
Given an output \(f_{\theta}(x_u)\) for an unlabeled data point \(x_u\) in the form of a probability distribution
over the classes, the pair \((x_u, \text{argmax}f_{\theta}(x_u))\) is added to the labeled set if the probability assigned to
its most likely class is higher than a predetermined threshold \(\tau\). The process of training the model on the augmented labeled set and then using it to label the remainder of \(\mathcal{D}_u\) is repeated until the model is incapable of producing confident predictions.</p>
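<p>One round of this labeling step can be sketched as follows, assuming a PyTorch classifier and a data loader that yields batches of unlabeled inputs (the helper name and threshold value are illustrative):</p>
<pre><code class="language-python">import torch

def pseudo_label_round(model, unlabeled_loader, tau=0.95):
    # Keep only the unlabeled examples whose most likely class is predicted
    # with a probability above the threshold tau.
    model.eval()
    new_pairs = []
    with torch.no_grad():
        for x_u in unlabeled_loader:
            probs = torch.softmax(model(x_u), dim=1)
            conf, pseudo = probs.max(dim=1)
            keep = conf > tau
            new_pairs.append((x_u[keep], pseudo[keep]))
    return new_pairs
</code></pre>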
<p><strong>Pseudo-labeling</strong> can also be seen as a special case of self-training, differing
only in the heuristics used to decide which proxy labeled examples to retain, such as using the relative confidence
instead of the absolute confidence, where the top \(n\) unlabeled examples predicted with the highest
confidence after every epoch is added to the labeled training dataset \(\mathcal{D}_l\).</p>
<p>The main downside of such methods is that the model is unable to correct its own mistakes
and any biased and wrong classifications can be quickly amplified resulting in confident but erroneous proxy labels on the
unlabeled data points.</p>
<h2 id="meta-pseudo-labels">Meta Pseudo Labels</h2>
<p>Given how important the heuristics used to generate the proxy labels are, a proper method can lead to a sizable gain. <a href="https://arxiv.org/abs/2003.10580">Pham et al.</a> propose to use the student-teacher setting, where the teacher
model is responsible for producing the proxy labels based on an efficient meta-learning algorithm called Meta Pseudo Labels (MPL),
which encourages the teacher to adjust the target distributions
of training examples in a manner that improves the learning of the student model. The teacher is updated by policy gradients computed
by evaluating the student model on a held-out validation set.</p>
<p>A given training step of MPL consists of two phases (Figure 10):</p>
<ul>
<li><strong>Phase 1:</strong> The Student learns from the teacher. In this phase, given a single input example \(x_u\), the teacher \(f_{\theta^{\prime}}\)
produces a target class-distribution to train the student \(f_{\theta}\), where
the pair \((x_u, f_{\theta^{\prime}}(x_u))\) is shown to the student to update its parameters by back-propagating from the cross-entropy loss.</li>
<li><strong>Phase 2:</strong> The Teacher learns from the Student’s Validation Loss. After the student updates its parameters in the first phase, its new parameters \(\theta(t+1)\) are evaluated on an example \((x_{val},y_{val})\) from the held-out validation dataset using the cross-entropy loss. Since the validation loss
depends on \(\theta^{\prime}\) via the first step, this validation cross-entropy loss is also a function of the teacher’s weights \(\theta^{\prime}\).
This dependency allows us to compute the gradients of the validation loss with respect to the teacher’s weights, and then update \(\theta^{\prime}\)
to minimize the validation loss using policy gradients.</li>
</ul>
<figure style="width: 75%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/SSL/MPL.png" alt="" />
<figcaption>Fig. 10. The MPL training procedure.
(Image source: <a href="https://arxiv.org/abs/2003.10580">Pham et al</a>)
</figcaption>
</figure>
<p>While the student’s performance allows the teacher to adjust and adapt to the student’s learning state, this signal alone is not sufficient to train the teacher: by the time the teacher has observed enough evidence to produce meaningful target distributions for teaching the student, the student might have already entered a poor region of the parameter space. To overcome this, the teacher is also trained using the labeled pairs from the held-out validation set.</p>
<h2 id="multi-view-training">Multi-view training</h2>
<p>Multi-view training (MVL, <a href="https://www.sciencedirect.com/science/article/abs/pii/S1566253516302032">Zhao et al.</a>)
utilizes multi-view data that are very common in real-world applications, where
different views can be collected by different measuring methods (e.g., color information and texture information for images)
or by creating limited views of the original data. In such a setting, MVL aims to learn a distinct prediction function \(f_{\theta_i}\)
to model a given view \(v_{i}(x)\) of a data point \(x\), and jointly optimize all the functions to improve the generalization performance.
Ideally, the possible views complement each other so that the produced models can collaborate in improving each other’s performance.</p>
<h3 id="co-training">Co-training</h3>
<p>Co-training (<a href="https://www.cs.cmu.edu/~avrim/Papers/cotrain.pdf">Blum et al.</a>) requires that each data point \(x\) can be represented using two conditionally independent views \(v_1(x)\) and \(v_2(x)\), and that each view is sufficient to train a good model.</p>
<p>After training two prediction functions \(f_{\theta_1}\) and \(f_{\theta_2}\), each on a specific view of the labeled set \(\mathcal{D}_l\), we start the proxy-labeling procedure, where, at each iteration, an unlabeled data point is added to the training set of the model \(f_{\theta_i}\) if the other model \(f_{\theta_j}\) outputs a confident prediction with a probability higher than a threshold \(\tau\). This way, one of the models provides newly labeled examples where the other model is uncertain.
The two views \(v_1(x)\) and \(v_2(x)\) can also be generated using consistency training methods detailed in the previous section,
for example, <a href="https://arxiv.org/abs/1803.05984">Qiao et al.</a> use adversarial perturbations to produce new views for deep co-training
for image classification, where the models are encouraged to have the same predictions on \(\mathcal{D}_l\) but make different
errors when they are exposed to adversarial attacks.</p>
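<p>A minimal sketch of one co-training round is shown below, assuming two scikit-learn-style classifiers with <code>fit</code> / <code>predict_proba</code> methods and pre-computed per-view feature matrices; all names and the threshold value are illustrative:</p>
<pre><code class="language-python">import numpy as np

def co_training_round(model_1, model_2, X_l_v1, X_l_v2, y_l, X_u_v1, X_u_v2, tau=0.95):
    """One co-training round: each model pseudo-labels the points the other model is confident on."""
    model_1.fit(X_l_v1, y_l)
    model_2.fit(X_l_v2, y_l)

    p1 = model_1.predict_proba(X_u_v1)   # model 1's predictions on its view of the unlabeled set
    p2 = model_2.predict_proba(X_u_v2)

    # Points added to model 1's training set are those model 2 labels confidently, and vice versa.
    idx_for_1 = np.where(p2.max(axis=1) >= tau)[0]
    idx_for_2 = np.where(p1.max(axis=1) >= tau)[0]
    return (idx_for_1, p2.argmax(axis=1)[idx_for_1]), (idx_for_2, p1.argmax(axis=1)[idx_for_2])
</code></pre>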
<p><strong>Democratic Co-training</strong> (<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.76.3152&rep=rep1&type=pdf">Zhou et al.</a>),
an extension of Co-training, consists of replacing the different views of the input data with a number of models with different architectures
and learning algorithms, which are first trained on the labeled examples. The trained models are then used to label an unlabeled example \(x\) if a majority of the models confidently agree on its label.</p>
<h3 id="tri-training">Tri-Training</h3>
<p>Tri-training (<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.487.2431&rep=rep1&type=pdf">Zhou et al.</a>)
tries to reduce the bias of the predictions on unlabeled data produced with self-training
by utilizing the agreement of three independently trained models instead of a single model.
First, the labeled data \(\mathcal{D}_l\) is used to train three prediction functions: \(f_{\theta_1}\), \(f_{\theta_2}\) and \(f_{\theta_3}\).
An unlabeled data point \(x\) is then added to the supervised training set of the function \(f_{\theta_i}\) if the other two models
agree on its predicted label. The training stops if no data points are being added to any of the model’s training sets.</p>
<p>For a stronger heuristic when selecting the predictions to use as proxy labels,
<strong>Tri-training with disagreement</strong> (<a href="https://www.aclweb.org/anthology/P10-2038.pdf">Søgaard</a>),
in addition to only considering confident predictions with a probability higher than
a threshold \(\tau\),
only adds a data point \(x\) to the training set of the model \(f_{\theta_i}\)
if the other two models agree on its predicted label while \(f_{\theta_i}\) disagrees.
This way, the training set of a given model is only extended with data points where the model needs
to be strengthened, and the easy examples that can skew the labeled data are avoided.</p>
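<p>A minimal NumPy sketch of this selection rule might look as follows; the function name and the \(\tau\) default are illustrative:</p>
<pre><code class="language-python">import numpy as np

def select_for_model(i, probs, tau=0.9):
    """Tri-training with disagreement: indices and proxy labels of the unlabeled points
    added to model i's training set, given the three models' class probabilities.
    probs: list of three (N, C) arrays."""
    j, k = [m for m in range(3) if m != i]
    y_i, y_j, y_k = (p.argmax(axis=1) for p in (probs[i], probs[j], probs[k]))
    confident = np.minimum(probs[j].max(axis=1), probs[k].max(axis=1)) >= tau
    # The other two models confidently agree while model i disagrees.
    selected = confident & (y_j == y_k) & (y_j != y_i)
    return np.where(selected)[0], y_j[selected]
</code></pre>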
<p>Using Tri-training with neural networks can be very expensive, since it requires predictions from each of the three models on all of the unlabeled data. <a href="https://arxiv.org/abs/1804.09530">Ruder et al.</a> propose to sample a
limited number of unlabeled data points at each training epoch,
increasing the candidate pool size as training progresses and the models become more accurate.
<strong>Multi-task tri-training</strong> (<a href="https://arxiv.org/abs/1804.09530">Ruder et al.</a>)
can also be used to reduce the time and sample complexity, where all three models
share the same feature extractor with model-specific classification layers. This way, the models are trained jointly,
with an additional orthogonality constraint on two of the three classification layers added to the loss term to avoid
learning similar models and falling back to the standard case of self-training.</p>
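<p>Such an orthogonality constraint can be implemented as a simple penalty on the weights of two classification heads, as in the hedged PyTorch sketch below; the exact form and the weighting factor <code>gamma</code> are assumptions for illustration, not taken from the paper:</p>
<pre><code class="language-python">import torch

def orthogonality_penalty(head_1, head_2):
    """Soft orthogonality constraint between two linear classification heads,
    penalizing the squared Frobenius norm of W1 W2^T so the heads do not collapse to the same classifier."""
    W1, W2 = head_1.weight, head_2.weight          # each of shape (num_classes, feature_dim)
    return (W1 @ W2.t()).pow(2).sum()

# e.g., total_loss = ce_1 + ce_2 + ce_3 + gamma * orthogonality_penalty(head_1, head_2)
</code></pre>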
<h1 id="holistic-methods">Holistic Methods</h1>
<p>An emerging line of work in SSL consists of holistic approaches that unify the current dominant methods
in a single framework, achieving better performance.</p>
<h2 id="mixmatch">MixMatch</h2>
<p><a href="https://arxiv.org/abs/1905.02249">Berthelot et al.</a> propose a “holistic” approach which gracefully unifies
ideas and components from the dominant paradigms for SSL, resulting in an algorithm that is greater than the sum of its parts
and surpasses the performance of the traditional approaches.</p>
<figure style="width: 75%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/SSL/mixmatch.png" alt="" />
<figcaption>Fig. 11. MixMatch. The label guessing process used in MixMatch, taking as input a batch of unlabeled examples and
outputting K augmented versions of each input with their corresponding sharpened proxy labels.
(Image source: <a href="https://arxiv.org/abs/1905.02249">David Berthelot et al.</a>)
</figcaption>
</figure>
<p>MixMatch takes as input a batch from the labeled set \(\mathcal{D}_l\) containing inputs and their corresponding one-hot targets,
a batch from the unlabeled set \(\mathcal{D}_u\) containing only unlabeled data, and a set of hyperparameters: the sharpening softmax temperature \(T\),
the number of augmentations \(K\), and the Beta distribution parameter \(\alpha\) for MixUp. It produces a batch of augmented labeled examples
and a batch of augmented unlabeled examples with their proxy labels. These augmented examples can then be used to
compute the losses and train the model. Specifically, MixMatch consists of the following steps:</p>
<ul>
<li><strong>Step 1: Data Augmentation.</strong> Using a given transformation, a labeled example \(x^l\) from the labeled batch is transformed, generating
its augmented version \(\tilde{x}^l\). For an unlabeled example \(x^u\),
the augmentation function is applied \(K\) times, resulting in \(K\) augmented
versions of the unlabeled example {\(\tilde{x}_1^u\), …, \(\tilde{x}_K^u\)}.</li>
<li><strong>Step 2: Label Guessing.</strong> The second step consists of producing proxy labels for the unlabeled examples.
First, we generate the predictions for the \(K\) augmented versions of
each unlabeled example using the prediction function \(f_\theta\). The \(K\) predictions are then averaged together, obtaining
a proxy or a pseudo label \(\hat{y}^u = 1/K \sum_{k=1}^{K}(\hat{y}^u_k)\) for each one of the augmentations of the unlabeled example \(x^u\):
{(\(\tilde{x}_1^u, \hat{y}^u\)), …, (\(\tilde{x}_K^u, \hat{y}^u\))}.</li>
<li><strong>Step 3: Sharpening.</strong> To push the model to produce confident predictions and minimize the entropy of the output distribution, the generated
proxy labels \(\hat{y}^u\) in step 2 in the form of a probability distribution over \(C\) classes are sharpened by adjusting the temperature
of the categorical distribution, computed as follows where \((\hat{y}^u)_k\) refers to the probability of class \(k\) out of \(C\) classes:</li>
</ul>
\[(\hat{y}^u)_k = (\hat{y}^u)_k^{\frac{1}{T}} / \sum_{c=1}^{C} (\hat{y}^u)_c^{\frac{1}{T}}\]
<ul>
<li><strong>Step 4: MixUp.</strong> After the previous steps, we have created two new augmented batches: a batch \(\mathcal{L}\) of augmented labeled
examples and their targets, and a batch \(\mathcal{U}\) of augmented unlabeled examples and their sharpened proxy labels. Note that the size
of \(\mathcal{U}\) is \(K\) times larger than the original batch, given that each example \(x_u\) is replaced by its \(K\)
augmented versions. In the last step, we mix these two batches. First, a new batch merging both batches is created:
\(\mathcal{W}=\text{Shuffle}(\text{Concat}(\mathcal{L}, \mathcal{U}))\). \(\mathcal{W}\) is then
divided into two batches: \(\mathcal{W}_1\) of the same size as \(\mathcal{L}\) and \(\mathcal{W}_2\) of the same
size as \(\mathcal{U}\). Using a MixUp operation that is slightly adjusted so that the mixed
example stays closer to the first of the two examples being mixed, the final labeled and unlabeled batches are created as follows:</li>
</ul>
\[\mathcal{L}^{\prime}=\operatorname{MixUp}(\mathcal{L}, \mathcal{W}_1)\]
\[\mathcal{U}^{\prime}=\operatorname{MixUp}(\mathcal{U}, \mathcal{W}_2)\]
<p>After creating the two augmented batches \(\mathcal{L}^{\prime}\) and \(\mathcal{U}^{\prime}\) using MixMatch,
we can then train the model with the standard SSL objective, computing the cross-entropy loss for the supervised term and
the consistency loss for the unsupervised term on the augmented batches as follows:</p>
\[\mathcal{L}_s=\frac{1}{|\mathcal{L}^{\prime}|} \sum_{x_l, y \in \mathcal{L}^{\prime}}
\mathrm{H}(y, f_\theta(x_l))\]
\[\mathcal{L}_u=w \frac{1}{|\mathcal{U}^{\prime}|} \sum_{x_u, \hat{y} \in \mathcal{U}^{\prime}}
d_{\mathrm{MSE}}(\hat{y}, f_{\theta}(x_u))\]
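<p>To make the four steps concrete, here is a compact PyTorch-style sketch of the MixMatch batch construction; the helper names and default hyperparameters are illustrative, the labels are assumed to be one-hot, and the augmentation function is assumed to be given:</p>
<pre><code class="language-python">import numpy as np
import torch
import torch.nn.functional as F

def sharpen(p, T=0.5):
    """Temperature sharpening of a categorical distribution (rows of p)."""
    p = p ** (1.0 / T)
    return p / p.sum(dim=1, keepdim=True)

def mixup(x1, y1, x2, y2, alpha=0.75):
    """MixUp with lambda' = max(lambda, 1 - lambda) so the result stays closer to (x1, y1)."""
    lam = np.random.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def mixmatch_batch(model, augment, x_l, y_l, x_u, K=2, T=0.5, alpha=0.75):
    """MixMatch sketch: augmentation, label guessing, sharpening and MixUp."""
    xb_l = augment(x_l)                                   # Step 1: augment the labeled batch once
    xb_u = torch.cat([augment(x_u) for _ in range(K)])    # ... and the unlabeled batch K times

    with torch.no_grad():                                  # Steps 2 & 3: guess and sharpen labels
        p = F.softmax(model(xb_u), dim=1)
        p = p.view(K, x_u.size(0), -1).mean(dim=0)
        y_u = sharpen(p, T).repeat(K, 1)

    # Step 4: MixUp each batch with a shuffled mixture of both batches
    all_x = torch.cat([xb_l, xb_u]); all_y = torch.cat([y_l, y_u])
    idx = torch.randperm(all_x.size(0))
    n_l = xb_l.size(0)
    x_l_p, y_l_p = mixup(xb_l, y_l, all_x[idx[:n_l]], all_y[idx[:n_l]], alpha)
    x_u_p, y_u_p = mixup(xb_u, y_u, all_x[idx[n_l:]], all_y[idx[n_l:]], alpha)
    return (x_l_p, y_l_p), (x_u_p, y_u_p)
</code></pre>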
<h2 id="remixmatch">ReMixMatch</h2>
<p><a href="https://arxiv.org/abs/1911.09785">Berthelot et al.</a> propose to
improve MixMatch by introducing two new techniques: <strong>distribution alignment</strong> and <strong>augmentation anchoring</strong>.
Distribution alignment encourages the marginal distribution of predictions on unlabeled data
to be close to the marginal distribution of ground-truth labels. Augmentation anchoring feeds multiple strongly
augmented versions of the input into the model and encourages each output to be close to the prediction
for a weakly-augmented version of the same input.</p>
<figure style="width: 90%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/SSL/remixmatch.png" alt="" />
<figcaption>Fig. 12. ReMixMatch. Distribution alignment adjusts the guessed labels distributions to match the ground-truth class distribution divided by the average model predictions on unlabeled data. Augmentation anchoring uses the prediction obtained using a weakly augmented image as targets for a strongly augmented version of the same image.
(Image source: <a href="https://arxiv.org/abs/1911.09785">Berthelot et al.</a>)
</figcaption>
</figure>
<p><strong>Distribution alignment:</strong> The goal is to force the aggregate of predictions on unlabeled data to match
the distribution of the provided labels. Over the course of training, a running average \(\tilde{y}\) of the model’s predictions
on unlabeled data is maintained over the last 128 batches. The marginal class distribution \(p(y)\) is estimated based on the labeled
examples seen during training. Given a prediction \(f_{\theta}(x_u)\) on the unlabeled example \(x_u\), the output probability distribution
is aligned as follows: \(f_{\theta}(x_u) = \text{Normalize}(f_{\theta}(x_u) \times p(y) / \tilde{y})\).</p>
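<p>A small sketch of this running-average alignment is shown below; the class name, the buffer handling, and the \(10^{-6}\) stabilizer are assumptions for illustration:</p>
<pre><code class="language-python">import torch

class DistributionAlignment:
    """Aligns predictions on unlabeled data with the labeled class distribution (sketch).
    p_labels: (C,) marginal class distribution estimated from the labeled examples."""
    def __init__(self, p_labels, buffer_size=128):
        self.p_labels = p_labels
        self.history = []
        self.buffer_size = buffer_size

    def __call__(self, probs):                     # probs: (B, C) predictions on unlabeled data
        self.history.append(probs.mean(dim=0).detach())
        self.history = self.history[-self.buffer_size:]
        running_mean = torch.stack(self.history).mean(dim=0)
        aligned = probs * self.p_labels / (running_mean + 1e-6)
        return aligned / aligned.sum(dim=1, keepdim=True)   # re-normalize to a distribution
</code></pre>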
<p><strong>Augmentation Anchoring:</strong> While MixMatch uses a simple flip-and-crop augmentation strategy, ReMixMatch replaces the weak augmentations with strong augmentations
whose policies are learned with a control-theory-based strategy in the spirit of AutoAugment. With such augmentations, the model’s prediction for a weakly augmented unlabeled image is used as the guessed label for many strongly augmented versions of the same image in a standard cross-entropy loss.</p>
<p>For training, MixMatch is applied to the labeled and unlabeled batches, with the application of distribution alignment and with the \(K\) weakly
augmented versions replaced by strongly augmented ones, while the weakly augmented examples are used to predict the proxy labels
for the unlabeled, strongly augmented examples. With the two augmented batches \(\mathcal{L}^{\prime}\) and \(\mathcal{U}^{\prime}\), the
supervised and unsupervised losses are computed as follows:</p>
\[\mathcal{L}_s=\frac{1}{|\mathcal{L}^{\prime}|} \sum_{x_l, y \in \mathcal{L}^{\prime}}
\mathrm{H}(y, f_\theta(x_l))\]
\[\mathcal{L}_u=w \frac{1}{|\mathcal{U}^{\prime}|} \sum_{x_u, \hat{y} \in \mathcal{U}^{\prime}}
\mathrm{H}(\hat{y}, f_\theta(x_u))\]
<p>In addition to these losses, the authors add a self-supervised loss. First, a new unlabeled batch
\(\hat{\mathcal{U}}^{\prime}\) of examples is created by rotating all of the examples by an angle \(r \sim\{0,90,180,270\}\). The rotated
examples are then used to compute a self-supervised loss, where a rotation classification head on top of the model predicts the applied
rotation, in addition to the cross-entropy loss over the rotated examples:</p>
\[\mathcal{L}_{SL} = w^{\prime}
\frac{1}{|\hat{\mathcal{U}}^{\prime}|} \sum_{x_u, \hat{y} \in \hat{\mathcal{U}}^{\prime}}
\mathrm{H}(\hat{y}, f_\theta(x_u)) + \lambda
\frac{1}{|\hat{\mathcal{U}}^{\prime}|} \sum_{x_u \in \hat{\mathcal{U}}^{\prime}}
\mathrm{H}(r, f_\theta(x_u))\]
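<p>The rotated batch can be built in a few lines of PyTorch; the helper below is only an illustrative sketch, and the rotation head is assumed to be a separate linear layer on top of the backbone:</p>
<pre><code class="language-python">import torch

def rotation_batch(x):
    """Rotate each image in the batch by 0, 90, 180 or 270 degrees; return rotation indices as labels."""
    r = torch.randint(0, 4, (x.size(0),))
    x_rot = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2)) for img, k in zip(x, r)])
    return x_rot, r

# rotation_loss = F.cross_entropy(rotation_head(backbone(x_rot)), r)   # with torch.nn.functional as F
</code></pre>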
<h2 id="fixmatch">FixMatch</h2>
<p><a href="https://arxiv.org/abs/2001.07685">Kihyuk Sohn et al.</a> present FixMatch, a simple SSL algorithm that combines consistency regularization and pseudo-labeling.
In FixMatch (Figure 13), both the supervised and unsupervised losses are computed using a cross-entropy loss.
For labeled examples, the provided targets are used. For unlabeled examples \(x_u\), a weakly augmented version is first computed
using a weak augmentation function \(A_w\). As in self-training, the predicted label is then considered as a proxy label if
the highest class probability is greater than a threshold \(\tau\). With a proxy label for \(x_u\), \(K\) strongly augmented examples are generated
using a strong augmentation function \(A_s\), and we then assign to these strongly augmented versions the proxy label obtained from the weakly
augmented version. With a batch of unlabeled examples \(\mathcal{D}_u\), the unsupervised loss can be written as follows:</p>
\[\mathcal{L}_u = w \frac{1}{K |\mathcal{D}_u|} \sum_{x_u \in \mathcal{D}_u} \sum_{i=1}^{K}
\mathbb{1}(\max (f_\theta(A_w(x_u))) \geq \tau)
\mathrm{H} (f_\theta(A_w(x_u)), f_\theta(A_s(x_u)))\]
<figure style="width: 75%" class="align-center">
<img src="https://yassouali.github.io//ml-blog/images/SSL/fixmatch.png" alt="" />
<figcaption>Fig. 13. FixMatch. The model prediction on a weakly augmented input is considered as the target if the maximum output
class probability is above a threshold; this target is then used to train the model on a strongly augmented version of the same input using a standard cross-entropy loss.
(Image source: <a href="https://arxiv.org/abs/2001.07685">Sohn et al.</a>)
</figcaption>
</figure>
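<p>A minimal PyTorch sketch of this masked consistency loss is shown below, using a single strong augmentation per example for simplicity; the augmentation callables and the \(\tau\) default are assumptions:</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def fixmatch_unsup_loss(model, x_u, weak_aug, strong_aug, tau=0.95):
    """FixMatch-style masked consistency loss (sketch with one strong augmentation per example)."""
    with torch.no_grad():
        probs = F.softmax(model(weak_aug(x_u)), dim=1)
        max_p, pseudo = probs.max(dim=1)
        mask = (max_p >= tau).float()                 # keep only confident pseudo labels
    loss = F.cross_entropy(model(strong_aug(x_u)), pseudo, reduction='none')
    return (mask * loss).mean()
</code></pre>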
<p><strong>Augmentations.</strong> Weak augmentations consist of a standard flip-and-shift augmentation strategy.
Specifically, the images are flipped horizontally with a probability of 50% on all datasets except SVHN, in addition to randomly translating
images by up to 12.5% vertically and horizontally. For the strong augmentations, RandAugment and CTAugment are used,
where a given transformation (e.g., color inversion, translation, contrast adjustment, etc.) is randomly selected for each sample in a batch
of training examples, and the magnitude of each transformation is either randomly sampled (RandAugment) or learned online during training (CTAugment).</p>
<p>Other important factors in FixMatch are the choice of optimizer (the authors found SGD with momentum to work better than Adam),
weight decay regularization, and the learning rate schedule: the authors propose a cosine learning rate decay of the form
\(\eta \cos (\frac{7 \pi t}{16 T})\),
where \(\eta\) is the initial learning rate, \(t\) is the current training step, and \(T\) is the total number of training steps.</p>
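<p>This schedule is easy to reproduce; a short sketch (the function name is illustrative) could look like:</p>
<pre><code class="language-python">import math

def fixmatch_lr(eta, step, total_steps):
    """Cosine learning-rate decay of the form eta * cos(7 * pi * t / (16 * T))."""
    return eta * math.cos(7 * math.pi * step / (16 * total_steps))

# e.g., with PyTorch: scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lambda step: math.cos(7 * math.pi * step / (16 * total_steps)))
</code></pre>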
<h1 id="references">References</h1>
<p><sup>
[1] Chapelle et al. <a href="http://pages.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf">Semi-supervised learning book</a>. IEEE Transactions on Neural Networks, 2009.<br />
[2] Xiaojin Jerry Zhu. <a href="http://www.acad.bg/ebook/ml/MITPress- SemiSupervised Learning.pdf">Semi-supervised learning literature survey</a>. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2005.<br />
[3] Rasmus et al. <a href="https://arxiv.org/abs/1507.02672">Semi-supervised learning with ladder networks</a>. NIPS 2015.<br />
[4] Samuli Laine, Timo Aila. <a href="https://arxiv.org/abs/1610.02242">Temporal Ensembling for Semi-Supervised Learning</a>. ICLR 2017.<br />
[5] Harri Valpola. <a href="https://arxiv.org/abs/1411.7783">From neural PCA to deep unsupervised learning</a>. Advances in Independent Component Analysis and Learning Machines 2015.<br />
[6] Antti Tarvainen, Harri Valpola. <a href="https://arxiv.org/abs/1703.01780">Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results</a>. NIPS 2017.<br />
[7] Takeru Miyato et al. <a href="https://arxiv.org/abs/1704.03976">Virtual adversarial training: a regularization method for supervised and semi-supervised learning.</a> Transactions on Pattern Analysis and Machine Intelligence 2018.<br />
[8] Ian Goodfellow et al. <a href="https://arxiv.org/abs/1412.6572">Explaining and harnessing adversarial examples.</a> ICLR 2015.<br />
[9] Sungrae Park et al. <a href="https://arxiv.org/abs/1707.03631">Adversarial Dropout for Supervised and Semi-Supervised Learning.</a> AAAI 2018.<br />
[10] Vikas Verma et al. <a href="https://arxiv.org/abs/1903.03825">Interpolation Consistency Training for Semi-Supervised Learning.</a> IJCAI 2019.<br />
[11] Qizhe Xie et al. <a href="https://arxiv.org/abs/1904.12848">Unsupervised Data Augmentation for Consistency Training.</a> arXiv 2019.<br />
[12] Zhanghan Ke et al. <a href="https://arxiv.org/abs/1909.01804">Dual Student: Breaking the Limits of the Teacher in Semi-supervised Learning.</a> ICCV 2019.<br />
[13] Sebastian Ruder et al. <a href="https://arxiv.org/abs/1804.09530">Strong Baselines for Neural Semi-supervised Learning under Domain Shift.</a> ACL 2018.<br />
[14] Hieu Pham et al. <a href="https://arxiv.org/abs/2003.10580">Meta Pseudo Labels.</a> Preprint 2020.<br />
[15] Jing Zhao et al. <a href="https://www.sciencedirect.com/science/article/abs/pii/S1566253516302032">Multi-view learning overview: Recent progress and new challenges.</a> Information Fusion, 2017.<br />
[16] Avrim Blum, Tom Mitchell. <a href="https://www.cs.cmu.edu/~avrim/Papers/cotrain.pdf">Combining labeled and unlabeled data with co-training.</a> COLT 1998.<br />
[17] Siyuan Qiao, Wei Shen, Zhishuai Zhang, Bo Wang, Alan Yuille. <a href="https://arxiv.org/abs/1803.05984">Deep Co-Training for Semi-Supervised Image Recognition.</a> ECCV 2018.<br />
[18] Yan Zhou, Sally Goldman. <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.76.3152&rep=rep1&type=pdf">Democratic Co-Learning.</a> ICTAI 2004.<br />
[19] Zhi-Hua Zhou, Ming Li. <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.487.2431&rep=rep1&type=pdf">Tri-Training: Exploiting Unlabeled Data Using Three Classifiers.</a> IEEE Transactions on Knowledge and Data Engineering, 2005.<br />
[20] Anders Søgaard. <a href="https://www.aclweb.org/anthology/P10-2038.pdf">Simple Semi-Supervised Training of Part-Of-Speech Taggers.</a> ACL 2010.<br />
[21] Yves Grandvalet et al. <a href="https://papers.nips.cc/paper/2740-semi-supervised-learning-by-entropy-minimization.pdf">Semi-supervised learning by entropy minimization.</a> NIPS 2005.<br />
[22] David Berthelot et al. <a href="https://arxiv.org/abs/1905.02249">MixMatch: A Holistic Approach to Semi-Supervised Learning.</a> NIPS 2019.<br />
[23] David Berthelot et al. <a href="https://arxiv.org/abs/1911.09785">ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring.</a> ICLR 2020.<br />
[24] Kihyuk Sohn et al. <a href="https://arxiv.org/abs/2001.07685">FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence.</a> Preprint 2020.</sup></p>