Search | arXiv e-print repository

Boosting Unsupervised Semantic Segmentation with Principal Mask Proposals

Authors: Oliver Hahn, Nikita Araslanov, Simone Schaub-Meyer, Stefan Roth

Abstract: Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage these large pre-trained models for the downstream task of unsupervised segmentation. We present Pri… ▽ More Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage these large pre-trained models for the downstream task of unsupervised segmentation. We present PriMaPs - Principal Mask Proposals - decomposing images into semantically meaningful masks based on their feature representation. This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with a stochastic expectation-maximization algorithm, PriMaPs-EM. Despite its conceptual simplicity, PriMaPs-EM leads to competitive results across various pre-trained backbone models, including DINO and DINOv2, and across datasets, such as Cityscapes, COCO-Stuff, and Potsdam-3. Importantly, PriMaPs-EM is able to boost results when applied orthogonally to current state-of-the-art unsupervised semantic segmentation pipelines. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: Code: https://github.com/visinf/primaps

arXiv:2404.03778 [pdf, other]

Flattening the Parent Bias: Hierarchical Semantic Segmentation in the Poincaré Ball

Authors: Simon Weber, Barış Zöngür, Nikita Araslanov, Daniel Cremers

Abstract: Hierarchy is a natural representation of semantic taxonomies, including the ones routinely used in image segmentation. Indeed, recent work on semantic segmentation reports improved accuracy from supervised training leveraging hierarchical label structures. Encouraged by these results, we revisit the fundamental assumptions behind that work. We postulate and then empirically verify that the reasons… ▽ More Hierarchy is a natural representation of semantic taxonomies, including the ones routinely used in image segmentation. Indeed, recent work on semantic segmentation reports improved accuracy from supervised training leveraging hierarchical label structures. Encouraged by these results, we revisit the fundamental assumptions behind that work. We postulate and then empirically verify that the reasons for the observed improvement in segmentation accuracy may be entirely unrelated to the use of the semantic hierarchy. To demonstrate this, we design a range of cross-domain experiments with a representative hierarchical approach. We find that on the new testing domains, a flat (non-hierarchical) segmentation network, in which the parents are inferred from the children, has superior segmentation accuracy to the hierarchical approach across the board. Complementing these findings and inspired by the intrinsic properties of hyperbolic spaces, we study a more principled approach to hierarchical segmentation using the Poincaré ball model. The hyperbolic representation largely outperforms the previous (Euclidean) hierarchical approach as well and is on par with our flat Euclidean baseline in terms of segmentation accuracy. However, it additionally exhibits surprisingly strong calibration quality of the parent nodes in the semantic hierarchy, especially on the more challenging domains. Our combined analysis suggests that the established practice of hierarchical segmentation may be limited to in-domain settings, whereas flat classifiers generalize substantially better, especially if they are modeled in the hyperbolic space. △ Less

Submitted 15 April, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

arXiv:2212.10368 [pdf, other]

Masked Event Modeling: Self-Supervised Pretraining for Event Cameras

Authors: Simon Klenk, David Bonello, Lukas Koestler, Nikita Araslanov, Daniel Cremers

Abstract: Event cameras asynchronously capture brightness changes with low latency, high temporal resolution, and high dynamic range. However, annotation of event data is a costly and laborious process, which limits the use of deep learning methods for classification and other semantic tasks with the event modality. To reduce the dependency on labeled event data, we introduce Masked Event Modeling (MEM), a… ▽ More Event cameras asynchronously capture brightness changes with low latency, high temporal resolution, and high dynamic range. However, annotation of event data is a costly and laborious process, which limits the use of deep learning methods for classification and other semantic tasks with the event modality. To reduce the dependency on labeled event data, we introduce Masked Event Modeling (MEM), a self-supervised framework for events. Our method pretrains a neural network on unlabeled events, which can originate from any event camera recording. Subsequently, the pretrained model is finetuned on a downstream task, leading to a consistent improvement of the task accuracy. For example, our method reaches state-of-the-art classification accuracy across three datasets, N-ImageNet, N-Cars, and N-Caltech101, increasing the top-1 accuracy of previous work by significant margins. When tested on real-world event data, MEM is even superior to supervised RGB-based pretraining. The models pretrained with MEM are also label-efficient and generalize well to the dense task of semantic image segmentation. △ Less

Submitted 23 December, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: To appear at WACV 2024. Code: https://github.com/tum-vision/mem

arXiv:2208.05788 [pdf, other]

Semantic Self-adaptation: Enhancing Generalization with a Single Sample

Authors: Sherwin Bahmani, Oliver Hahn, Eduard Zamfir, Nikita Araslanov, Daniel Cremers, Stefan Roth

Abstract: The lack of out-of-domain generalization is a critical weakness of deep networks for semantic segmentation. Previous studies relied on the assumption of a static model, i. e., once the training process is complete, model parameters remain fixed at test time. In this work, we challenge this premise with a self-adaptive approach for semantic segmentation that adjusts the inference process to each in… ▽ More The lack of out-of-domain generalization is a critical weakness of deep networks for semantic segmentation. Previous studies relied on the assumption of a static model, i. e., once the training process is complete, model parameters remain fixed at test time. In this work, we challenge this premise with a self-adaptive approach for semantic segmentation that adjusts the inference process to each input sample. Self-adaptation operates on two levels. First, it fine-tunes the parameters of convolutional layers to the input image using consistency regularization. Second, in Batch Normalization layers, self-adaptation interpolates between the training and the reference distribution derived from a single test sample. Despite both techniques being well known in the literature, their combination sets new state-of-the-art accuracy on synthetic-to-real generalization benchmarks. Our empirical study suggests that self-adaptation may complement the established practice of model regularization at training time for improving deep network generalization to out-of-domain data. Our code and pre-trained models are available at https://github.com/visinf/self-adaptive. △ Less

Submitted 13 December, 2023; v1 submitted 10 August, 2022; originally announced August 2022.

Comments: Published in TMLR (July 2023) | OpenReview: https://openreview.net/forum?id=ILNqQhGbLx | Code: https://github.com/visinf/self-adaptive | Video: https://youtu.be/s4DG65ic0EA

arXiv:2111.06265 [pdf, other]

Dense Unsupervised Learning for Video Segmentation

Authors: Nikita Araslanov, Simone Schaub-Meyer, Stefan Roth

Abstract: We present a novel approach to unsupervised learning for video object segmentation (VOS). Unlike previous work, our formulation allows to learn dense feature representations directly in a fully convolutional regime. We rely on uniform grid sampling to extract a set of anchors and train our model to disambiguate between them on both inter- and intra-video levels. However, a naive scheme to train su… ▽ More We present a novel approach to unsupervised learning for video object segmentation (VOS). Unlike previous work, our formulation allows to learn dense feature representations directly in a fully convolutional regime. We rely on uniform grid sampling to extract a set of anchors and train our model to disambiguate between them on both inter- and intra-video levels. However, a naive scheme to train such a model results in a degenerate solution. We propose to prevent this with a simple regularisation scheme, accommodating the equivariance property of the segmentation task to similarity transformations. Our training objective admits efficient implementation and exhibits fast training convergence. On established VOS benchmarks, our approach exceeds the segmentation accuracy of previous work despite using significantly less training data and compute power. △ Less

Submitted 11 November, 2021; originally announced November 2021.

Comments: To appear at NeurIPS*2021. Code: https://github.com/visinf/dense-ulearn-vos

arXiv:2105.00097 [pdf, other]

Self-supervised Augmentation Consistency for Adapting Semantic Segmentation

Authors: Nikita Araslanov, Stefan Roth

Abstract: We propose an approach to domain adaptation for semantic segmentation that is both practical and highly accurate. In contrast to previous work, we abandon the use of computationally involved adversarial objectives, network ensembles and style transfer. Instead, we employ standard data augmentation techniques $-$ photometric noise, flipping and scaling $-$ and ensure consistency of the semantic pre… ▽ More We propose an approach to domain adaptation for semantic segmentation that is both practical and highly accurate. In contrast to previous work, we abandon the use of computationally involved adversarial objectives, network ensembles and style transfer. Instead, we employ standard data augmentation techniques $-$ photometric noise, flipping and scaling $-$ and ensure consistency of the semantic predictions across these image transformations. We develop this principle in a lightweight self-supervised framework trained on co-evolving pseudo labels without the need for cumbersome extra training rounds. Simple in training from a practitioner's standpoint, our approach is remarkably effective. We achieve significant improvements of the state-of-the-art segmentation accuracy after adaptation, consistent both across different choices of the backbone architecture and adaptation scenarios. △ Less

Submitted 30 April, 2021; originally announced May 2021.

Comments: To appear at CVPR 2021. Code: https://github.com/visinf/da-sac

arXiv:2005.08104 [pdf, other]

Single-Stage Semantic Segmentation from Image Labels

Authors: Nikita Araslanov, Stefan Roth

Abstract: Recent years have seen a rapid growth in new approaches improving the accuracy of semantic segmentation in a weakly supervised setting, i.e. with only image-level labels available for training. However, this has come at the cost of increased model complexity and sophisticated multi-stage training procedures. This is in contrast to earlier work that used only a single stage $-$ training one segment… ▽ More Recent years have seen a rapid growth in new approaches improving the accuracy of semantic segmentation in a weakly supervised setting, i.e. with only image-level labels available for training. However, this has come at the cost of increased model complexity and sophisticated multi-stage training procedures. This is in contrast to earlier work that used only a single stage $-$ training one segmentation network on image labels $-$ which was abandoned due to inferior segmentation accuracy. In this work, we first define three desirable properties of a weakly supervised method: local consistency, semantic fidelity, and completeness. Using these properties as guidelines, we then develop a segmentation-based network model and a self-supervised training scheme to train for semantic masks from image-level annotations in a single stage. We show that despite its simplicity, our method achieves results that are competitive with significantly more complex pipelines, substantially outperforming earlier single-stage methods. △ Less

Submitted 16 May, 2020; originally announced May 2020.

Comments: To appear at CVPR 2020; minor corrections in Eq. (9). Code: https://github.com/visinf/1-stage-wseg

arXiv:1909.12400 [pdf, other]

Markov Decision Process for Video Generation

Authors: Vladyslav Yushchenko, Nikita Araslanov, Stefan Roth

Abstract: We identify two pathological cases of temporal inconsistencies in video generation: video freezing and video looping. To better quantify the temporal diversity, we propose a class of complementary metrics that are effective, easy to implement, data agnostic, and interpretable. Further, we observe that current state-of-the-art models are trained on video samples of fixed length thereby inhibiting l… ▽ More We identify two pathological cases of temporal inconsistencies in video generation: video freezing and video looping. To better quantify the temporal diversity, we propose a class of complementary metrics that are effective, easy to implement, data agnostic, and interpretable. Further, we observe that current state-of-the-art models are trained on video samples of fixed length thereby inhibiting long-term modeling. To address this, we reformulate the problem of video generation as a Markov Decision Process (MDP). The underlying idea is to represent motion as a stochastic process with an infinite forecast horizon to overcome the fixed length limitation and to mitigate the presence of temporal artifacts. We show that our formulation is easy to integrate into the state-of-the-art MoCoGAN framework. Our experiments on the Human Actions and UCF-101 datasets demonstrate that our MDP-based model is more memory efficient and improves the video quality both in terms of the new and established metrics. △ Less

Submitted 26 September, 2019; originally announced September 2019.

Comments: To appear at 2019 ICCV Workshop on Large Scale Holistic Video Understanding

arXiv:1904.05126 [pdf, other]

Actor-Critic Instance Segmentation

Authors: Nikita Araslanov, Constantin Rothkopf, Stefan Roth

Abstract: Most approaches to visual scene analysis have emphasised parallel processing of the image elements. However, one area in which the sequential nature of vision is apparent, is that of segmenting multiple, potentially similar and partially occluded objects in a scene. In this work, we revisit the recurrent formulation of this challenging problem in the context of reinforcement learning. Motivated by… ▽ More Most approaches to visual scene analysis have emphasised parallel processing of the image elements. However, one area in which the sequential nature of vision is apparent, is that of segmenting multiple, potentially similar and partially occluded objects in a scene. In this work, we revisit the recurrent formulation of this challenging problem in the context of reinforcement learning. Motivated by the limitations of the global max-matching assignment of the ground-truth segments to the recurrent states, we develop an actor-critic approach in which the actor recurrently predicts one instance mask at a time and utilises the gradient from a concurrently trained critic network. We formulate the state, action, and the reward such as to let the critic model long-term effects of the current prediction and incorporate this information into the gradient signal. Furthermore, to enable effective exploration in the inherently high-dimensional action space of instance masks, we learn a compact representation using a conditional variational auto-encoder. We show that our actor-critic model consistently provides accuracy benefits over the recurrent baseline on standard instance segmentation benchmarks. △ Less

Submitted 10 April, 2019; originally announced April 2019.

Comments: To appear at CVPR 2019

arXiv:1810.01345 [pdf, other]

doi 10.1002/rob.21677

NimbRo Rescue: Solving Disaster-Response Tasks through Mobile Manipulation Robot Momaro

Authors: Max Schwarz, Tobias Rodehutskors, David Droeschel, Marius Beul, Michael Schreiber, Nikita Araslanov, Ivan Ivanov, Christian Lenz, Jan Razlaw, Sebastian Schüller, David Schwarz, Angeliki Topalidou-Kyniazopoulou, Sven Behnke

Abstract: Robots that solve complex tasks in environments too dangerous for humans to enter are desperately needed, e.g. for search and rescue applications. We describe our mobile manipulation robot Momaro, with which we participated successfully in the DARPA Robotics Challenge. It features a unique locomotion design with four legs ending in steerable wheels, which allows it both to drive omnidirectionally… ▽ More Robots that solve complex tasks in environments too dangerous for humans to enter are desperately needed, e.g. for search and rescue applications. We describe our mobile manipulation robot Momaro, with which we participated successfully in the DARPA Robotics Challenge. It features a unique locomotion design with four legs ending in steerable wheels, which allows it both to drive omnidirectionally and to step over obstacles or climb. Furthermore, we present advanced communication and teleoperation approaches, which include immersive 3D visualization, and 6D tracking of operator head and arm motions. The proposed system is evaluated in the DARPA Robotics Challenge, the DLR SpaceBot Cup Qualification and lab experiments. We also discuss the lessons learned from the competitions. △ Less

Submitted 15 October, 2018; v1 submitted 2 October, 2018; originally announced October 2018.

Journal ref: Journal of Field Robotics 34(2): 400-425 (2017)

Showing 1–10 of 10 results for author: Araslanov, N