Semi-Supervised Semantic Segmentation with Cross-Consistency Training

Yassine Ouali
MICS, CentraleSupélec, Université Paris-Saclay
Céline Hudelot
MICS, CentraleSupélec, Université Paris-Saclay
Myriam Tami
MICS, CentraleSupélec, Université Paris-Saclay

Code [GitHub]
arXiv [Paper]
Poster [pdf]
Slides [pdf]


Figure 1: The proposed Cross-Consistency Training (CCT). For the labeled examples, the encoder and the main decoder are trained in a supervised manner. For the unlabeled examples, a consistency between the main decoder’s predictions and those of the auxiliary decoders is enforced, over different types of perturbations applied to the inputs of the auxiliary decoders.


In this paper, we present a novel cross-consistency based semi-supervised approach for semantic segmentation. Consistency training has proven to be a powerful semi-supervised learning framework for leveraging unlabeled data under the cluster assumption, in which the decision boundary should lie in low-density regions. In this work, we first observe that for semantic segmentation, the low-density regions are more apparent within the hidden representations than within the inputs. We thus propose cross-consistency training, where an invariance of the predictions is enforced over different perturbations applied to the outputs of the encoder. Concretely, a shared encoder and a main decoder are trained in a supervised manner using the available labeled examples. To leverage the unlabeled examples, we enforce a consistency between the main decoder's predictions and those of the auxiliary decoders, which take as inputs different perturbed versions of the encoder's output; this, in turn, improves the encoder's representations. The proposed method is simple and can easily be extended to use additional training signals, such as image-level labels or pixel-level labels across different domains. We perform an ablation study to tease apart the effectiveness of each component, and conduct extensive experiments to demonstrate that our method achieves state-of-the-art results on several datasets.


(1) Consistency Training for semantic segmentation.
We observe that for semantic segmentation, due to the dense nature of the task, the cluster assumption is more easily enforced over the hidden representations than over the inputs.

(2) Cross-Consistency Training.
We propose CCT (Cross-Consistency Training) for semi-supervised semantic segmentation, where we define several novel perturbations, and show the effectiveness of enforcing consistency over the encoder's outputs rather than the inputs.

(3) Using weak labels and pixel-level labels from multiple domains.
The proposed method is simple and flexible, and can easily be extended to use image-level labels and pixel-level labels from multiple domains.

(4) Competitive results on a number of benchmarks.
We show competitive results on several semantic segmentation benchmarks: in the semi-supervised setting, when image-level labels are also available, and when training on multiple domains with partially or completely non-overlapping label spaces.


The objective of consistency training is to enforce an invariance of the model's predictions over small perturbations applied to the inputs. As a result, the learned decision boundary will reside in low-density regions, and the learned model will be robust to such small changes. The effectiveness of consistency training depends heavily on the behavior of the data distribution, i.e., on whether the cluster assumption holds, meaning that low-density regions must separate the classes. In semantic segmentation, we do not observe low-density regions separating the classes within the inputs, but rather within the encoder's outputs. Based on this observation, we propose to enforce the consistency over different forms of perturbations applied to the encoder's output. Specifically, we consider a shared encoder and a main decoder that are trained using the labeled examples. To leverage unlabeled data, we then consider multiple auxiliary decoders whose inputs are perturbed versions of the output of the shared encoder. The consistency is imposed between the main decoder's predictions and those of the auxiliary decoders. This way, the shared encoder's representation is enhanced by the additional training signal extracted from the unlabeled data, while robustness to the introduced perturbations is still enforced in an end-to-end manner. The added auxiliary decoders have a negligible number of parameters compared to the encoder. Additionally, during inference, only the main decoder is used, which keeps the computational overhead low in both training and inference.
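
The objective described above can be written compactly. The following is a sketch under assumed notation, not a formula taken from this page: h denotes the shared encoder, g the main decoder, g_a^k the k-th of K auxiliary decoders, p_k the perturbation applied before the k-th auxiliary decoder, D_l and D_u the labeled and unlabeled sets, H a supervised loss (for example cross-entropy), d a distance between output distributions (for example mean squared error), and ω_u a weight on the unsupervised term. The specific choices of H, d, and ω_u are assumptions.

```latex
\mathcal{L} = \mathcal{L}_s + \omega_u \, \mathcal{L}_u,
\qquad
\mathcal{L}_s = \frac{1}{|\mathcal{D}_l|} \sum_{(x_i, y_i) \in \mathcal{D}_l} \mathbf{H}\big(y_i, \, g(h(x_i))\big),
\qquad
\mathcal{L}_u = \frac{1}{|\mathcal{D}_u|} \, \frac{1}{K} \sum_{x_i \in \mathcal{D}_u} \sum_{k=1}^{K} \mathbf{d}\big(g(z_i), \, g_a^k(p_k(z_i))\big),
\quad z_i = h(x_i)
```

Only the unsupervised term involves the auxiliary decoders, which is consistent with discarding them at inference time.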


Figure 2: Illustration of the method. For one training iteration, we sample a labeled input image and its pixel-level label, together with an unlabeled image. We pass both images through the encoder and main decoder, obtaining two main predictions, one for the labeled example and one for the unlabeled example. We compute the supervised loss using the pixel-level label and the main prediction. We apply various perturbations to z, the encoder's output for the unlabeled input, and generate K auxiliary predictions from the K perturbed versions of z. The unsupervised loss is then computed between the outputs of the auxiliary decoders and that of the main decoder.
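
The training iteration in Figure 2 can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the released implementation: images are lists of per-pixel feature vectors, the encoder and decoders are small linear maps, the supervised loss is cross-entropy, the consistency distance is mean squared error, and the names `cct_losses`, `feature_noise`, and `identity` are hypothetical. A real implementation would use a deep learning framework with automatic differentiation.

```python
import math
import random


def softmax(logits):
    """Turn one pixel's class logits into probabilities."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Toy stand-in for an encoder or decoder: a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def cross_entropy(logits, label):
    """Assumed supervised loss: negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])


def mse(p, q):
    """Assumed consistency distance between two probability vectors."""
    return mean((a - b) ** 2 for a, b in zip(p, q))


def identity(z, rng):
    """No perturbation; useful as a sanity check."""
    return list(z)


def feature_noise(z, rng, scale=0.3):
    """Hypothetical perturbation: multiplicative uniform noise on z."""
    return [v * (1.0 + rng.uniform(-scale, scale)) for v in z]


def cct_losses(encoder, main_dec, aux_decs, perturbations, x_l, y_l, x_u, rng):
    """Return (supervised loss, unsupervised loss) for one iteration."""
    # Labeled branch: encoder -> main decoder -> loss against pixel labels.
    z_l = [linear(encoder, px) for px in x_l]
    sup = mean(cross_entropy(linear(main_dec, z), y) for z, y in zip(z_l, y_l))

    # Unlabeled branch: the main prediction serves as the target.
    z_u = [linear(encoder, px) for px in x_u]
    target = [softmax(linear(main_dec, z)) for z in z_u]

    # Each auxiliary decoder sees its own perturbed version of z.
    unsup = 0.0
    for dec, perturb in zip(aux_decs, perturbations):
        preds = [softmax(linear(dec, perturb(z, rng))) for z in z_u]
        unsup += mean(mse(p, t) for p, t in zip(preds, target))
    unsup /= len(aux_decs)  # average over the K auxiliary decoders
    return sup, unsup
```

The total loss for the iteration would then be `sup + w_u * unsup`, where `w_u` is an assumed weighting coefficient on the unsupervised term. With the `identity` perturbation and an auxiliary decoder that copies the main decoder, the unsupervised loss is zero, which gives a quick sanity check of the wiring.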

Main Results

CCT outperforms previous works relying on the same level of supervision, and even methods that exploit image-level labels. We also obtain impressive results when using image-level labels and when training on multiple domains, confirming the flexibility of CCT.



This work was supported by the Randstad corporate research chair. We would also like to thank the Saclay-IA platform of Université Paris-Saclay and the Mésocentre computing center of CentraleSupélec and École Normale Supérieure Paris-Saclay for providing the computational resources.

Related Publications

Acknowledgement: This page is based on this template by Yonglong Tian