
Boosting Unsupervised Semantic Segmentation with Principal Mask Proposals

Authors: Oliver Hahn, Nikita Araslanov, Simone Schaub-Meyer, Stefan Roth

Abstract: Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage these large pre-trained models for the downstream task of unsupervised segmentation. We present PriMaPs - Principal Mask Proposals - decomposing images into semantically meaningful masks based on their feature representation. This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with a stochastic expectation-maximization algorithm, PriMaPs-EM. Despite its conceptual simplicity, PriMaPs-EM leads to competitive results across various pre-trained backbone models, including DINO and DINOv2, and across datasets, such as Cityscapes, COCO-Stuff, and Potsdam-3. Importantly, PriMaPs-EM is able to boost results when applied orthogonally to current state-of-the-art unsupervised semantic segmentation pipelines.

Submitted 25 April, 2024; originally announced April 2024.

Comments: Code: https://github.com/visinf/primaps

arXiv:2404.12330 [pdf, other]

A Perspective on Deep Vision Performance with Standard Image and Video Codecs

Authors: Christoph Reich, Oliver Hahn, Daniel Cremers, Stefan Roth, Biplob Debnath

Abstract: Resource-constrained hardware, such as edge devices or cell phones, often relies on cloud servers to provide the required computational resources for inference in deep vision models. However, transferring image and video data from an edge or mobile device to a cloud server requires coding to deal with network constraints. The use of standardized codecs, such as JPEG or H.264, is prevalent and required to ensure interoperability. This paper aims to examine the implications of employing standardized codecs within deep vision pipelines. We find that using JPEG and H.264 coding significantly deteriorates the accuracy across a broad range of vision tasks and models. For instance, strong compression rates reduce semantic segmentation accuracy by more than 80% in mIoU. In contrast to previous findings, our analysis extends beyond image and action classification to localization and dense prediction tasks, thus providing a more comprehensive perspective.

Submitted 18 April, 2024; originally announced April 2024.

Comments: Accepted at CVPR 2024 Workshop on AI for Streaming (AIS)

arXiv:2402.13773 [pdf, other]

Spatial-Domain Wireless Jamming with Reconfigurable Intelligent Surfaces

Authors: Philipp Mackensen, Paul Staat, Stefan Roth, Aydin Sezgin, Christof Paar, Veelasha Moonsamy

Abstract: Today, we rely heavily on the constant availability of wireless communication systems. As a result, wireless jamming continues to prevail as an imminent threat: Attackers can create deliberate radio interference to overshadow desired signals, leading to denial of service. Although the broadcast nature of radio signal propagation makes such an attack possible in the first place, it likewise poses a challenge for the attacker, preventing precise targeting of single devices. In particular, the jamming signal will likely not only reach the victim receiver but also other neighboring devices. In this work, we introduce spatial control of wireless jamming signals, granting a new degree of freedom to leverage for jamming attacks. Our novel strategy employs an environment-adaptive reconfigurable intelligent surface (RIS), exploiting multipath signal propagation to spatially focus jamming signals on particular victim devices. We investigate this effect through extensive experimentation and show that our approach can disable the wireless communication of a victim device while leaving neighboring devices unaffected. In particular, we demonstrate complete denial-of-service of a Wi-Fi device while a second device located at a distance as close as 5 mm remains unaffected, sustaining wireless communication at a data rate of 60 Mbit/s. We also show that the attacker can change the attack target on-the-fly, dynamically selecting the device to be jammed.

Submitted 21 February, 2024; originally announced February 2024.

arXiv:2312.14791 [pdf, other]

EMF-Constrained Artificial Noise for Secrecy Rates with Stochastic Eavesdropper Channels

Authors: Stefan Roth, Aydin Sezgin

Abstract: An information-theoretic confidential communication is achievable if the eavesdropper has a degraded channel compared to the legitimate receiver. In wireless channels, beamforming and artificial noise can enable such confidentiality. However, only distribution knowledge of the eavesdropper channels can be assumed. Moreover, the transmission of artificial noise can lead to an increased electromagnetic field (EMF) exposure, which depends on the considered location and can thus also be seen as a random variable. Hence, we optimize the $\varepsilon$-outage secrecy rate under a $\delta$-outage exposure constraint in a setup where the base station (BS) is communicating to a user equipment (UE), while a single-antenna eavesdropper with Rayleigh distributed channels is present. Therefore, we calculate the secrecy outage probability (SOP) in closed-form. Based on this, we convexify the optimization problem and optimize the $\varepsilon$-outage secrecy rate iteratively. Numerical results show that for a moderate exposure constraint, artificial noise from the BS has a relatively large impact due to beamforming, while for a strict exposure constraint artificial noise from the UE is more important.

Submitted 22 December, 2023; originally announced December 2023.

arXiv:2310.07706 [pdf, other]

Pixel State Value Network for Combined Prediction and Planning in Interactive Environments

Authors: Sascha Rosbach, Stefan M. Leupold, Simon Großjohann, Stefan Roth

Abstract: Automated vehicles operating in urban environments have to reliably interact with other traffic participants. Planning algorithms often utilize separate prediction modules forecasting probabilistic, multi-modal, and interactive behaviors of objects. Designing prediction and planning as two separate modules introduces significant challenges, particularly due to the interdependence of these modules. This work proposes a deep learning methodology to combine prediction and planning. A conditional GAN with the U-Net architecture is trained to predict two high-resolution image sequences. The sequences represent explicit motion predictions, mainly used to train context understanding, and pixel state values suitable for planning encoding kinematic reachability, object dynamics, safety, and driving comfort. The model can be trained offline on target images rendered by a sampling-based model-predictive planner, leveraging real-world driving data. Our results demonstrate intuitive behavior in complex situations, such as lane changes amidst conflicting objectives.

Submitted 11 October, 2023; originally announced October 2023.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2308.09472 [pdf, other]

Vision Relation Transformer for Unbiased Scene Graph Generation

Authors: Gopika Sudhakaran, Devendra Singh Dhami, Kristian Kersting, Stefan Roth

Abstract: Recent years have seen a growing interest in Scene Graph Generation (SGG), a comprehensive visual scene understanding task that aims to predict entity relationships using a relation encoder-decoder pipeline stacked on top of an object encoder-decoder backbone. Unfortunately, current SGG methods suffer from an information loss regarding the entities' local-level cues during the relation encoding process. To mitigate this, we introduce the Vision rElation TransfOrmer (VETO), consisting of a novel local-level entity relation encoder. We further observe that many existing SGG methods claim to be unbiased, but are still biased towards either head or tail classes. To overcome this bias, we introduce a Mutually Exclusive ExperT (MEET) learning strategy that captures important relation features without bias towards head or tail classes. Experimental results on the VG and GQA datasets demonstrate that VETO + MEET boosts the predictive performance by up to 47 percentage over the state of the art while being 10 times smaller.

Submitted 18 August, 2023; originally announced August 2023.

Comments: Accepted for publication in ICCV 2023

arXiv:2308.06248 [pdf, other]

FunnyBirds: A Synthetic Vision Dataset for a Part-Based Analysis of Explainable AI Methods

Authors: Robin Hesse, Simone Schaub-Meyer, Stefan Roth

Abstract: The field of explainable artificial intelligence (XAI) aims to uncover the inner workings of complex deep neural models. While being crucial for safety-critical domains, XAI inherently lacks ground-truth explanations, making its automatic evaluation an unsolved problem. We address this challenge by proposing a novel synthetic vision dataset, named FunnyBirds, and accompanying automatic evaluation protocols. Our dataset allows performing semantically meaningful image interventions, e.g., removing individual object parts, which has three important implications. First, it enables analyzing explanations on a part level, which is closer to human comprehension than existing methods that evaluate on a pixel level. Second, by comparing the model output for inputs with removed parts, we can estimate ground-truth part importances that should be reflected in the explanations. Third, by mapping individual explanations into a common space of part importances, we can analyze a variety of different explanation types in a single common framework. Using our tools, we report results for 24 different combinations of neural models and XAI methods, demonstrating the strengths and weaknesses of the assessed methods in a fully automatic and systematic manner.

Submitted 11 August, 2023; originally announced August 2023.

Comments: Accepted at ICCV 2023. Code: https://github.com/visinf/funnybirds

arXiv:2305.09504 [pdf, other]

Content-Adaptive Downsampling in Convolutional Neural Networks

Authors: Robin Hesse, Simone Schaub-Meyer, Stefan Roth

Abstract: Many convolutional neural networks (CNNs) rely on progressive downsampling of their feature maps to increase the network's receptive field and decrease computational cost. However, this comes at the price of losing granularity in the feature maps, limiting the ability to correctly understand images or recover fine detail in dense prediction tasks. To address this, common practice is to replace the last few downsampling operations in a CNN with dilated convolutions, which retains the feature map resolution without reducing the receptive field, albeit at an increased computational cost. This makes it possible to trade off predictive performance against cost, depending on the output feature resolution. By either regularly downsampling or not downsampling the entire feature map, existing work implicitly treats all regions of the input image and subsequent feature maps as equally important, which generally does not hold. We propose an adaptive downsampling scheme that generalizes the above idea by processing informative regions at a higher resolution than less informative ones. In a variety of experiments, we demonstrate the versatility of our adaptive downsampling strategy and empirically show that it improves the cost-accuracy trade-off of various established CNNs.

Submitted 16 May, 2023; originally announced May 2023.

Comments: Accepted at CVPR 2023 Workshop on Efficient Deep Learning for Computer Vision (ECV). Code: https://github.com/visinf/cad
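
Illustrative note (not part of the listing): the abstract above mentions the common practice of replacing late downsampling operations with dilated convolutions so that resolution is retained without shrinking the receptive field. A minimal sketch of the standard receptive-field recurrence makes this concrete. The function name `receptive_field` and the two toy layer stacks are hypothetical and are not taken from the paper or its code.

```python
def receptive_field(layers):
    """Return (receptive field, output stride) for a stack of conv layers.

    Each layer is a (kernel_size, stride, dilation) tuple. The recurrence is
    the standard one: the receptive field grows by (k - 1) * dilation * jump,
    where jump is the product of the strides of all earlier layers.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf, jump


# Toy stack with two stride-2 convolutions followed by a plain 3x3 convolution.
strided = [(3, 2, 1), (3, 2, 1), (3, 1, 1)]
# Same stack with the second stride removed and the next layer dilated by 2.
dilated = [(3, 2, 1), (3, 1, 1), (3, 1, 2)]

print(receptive_field(strided))
print(receptive_field(dilated))
```

In this toy case both stacks reach the same receptive field, while the dilated variant keeps twice the feature-map resolution per axis, which is where the extra computational cost comes from.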

arXiv:2302.01998 [pdf, ps, other]

Integrated Communication and Control Systems: A Data Significance Perspective

Authors: Stefan Roth, Yasemin Karacora, Christina Chaccour, Aydin Sezgin, Walid Saad

Abstract: The interconnected smart devices and industrial internet of things devices require low-latency communication to fulfill control objectives despite limited resources. In essence, such devices have a time-critical nature but also require a highly accurate data input based on its significance. In this paper, we investigate various coordinated and distributed semantic scheduling schemes with a data significance perspective. In particular, novel algorithms are proposed to analyze the benefit of such schemes for the significance in terms of estimation accuracy. Then, we derive the bounds of the achievable estimation accuracy. Our numerical results showcase the superiority of semantic scheduling policies that adopt an integrated control and communication strategy. In essence, such policies can reduce the weighted sum of mean squared errors compared to traditional policies.

Submitted 3 February, 2023; originally announced February 2023.

arXiv:2211.14005 [pdf, other]

Efficient Feature Extraction for High-resolution Video Frame Interpolation

Authors: Moritz Nottebaum, Stefan Roth, Simone Schaub-Meyer

Abstract: Most deep learning methods for video frame interpolation consist of three main components: feature extraction, motion estimation, and image synthesis. Existing approaches are mainly distinguishable in terms of how these modules are designed. However, when interpolating high-resolution images, e.g. at 4K, the design choices for achieving high accuracy within reasonable memory requirements are limited. The feature extraction layers help to compress the input and extract relevant information for the later stages, such as motion estimation. However, these layers are often costly in parameters, computation time, and memory. We show how ideas from dimensionality reduction combined with a lightweight optimization can be used to compress the input representation while keeping the extracted information suitable for frame interpolation. Further, we require neither a pretrained flow network nor a synthesis network, additionally reducing the number of trainable parameters and required memory. When evaluating on three 4K benchmarks, we achieve state-of-the-art image quality among the methods without pretrained flow while having the lowest network complexity and memory requirements overall.

Submitted 25 November, 2022; originally announced November 2022.

Comments: Accepted to BMVC 2022. Code: https://github.com/visinf/fldr-vfi

arXiv:2211.12209 [pdf, other]

$S^2$-Flow: Joint Semantic and Style Editing of Facial Images

Authors: Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth

Abstract: The high-quality images yielded by generative adversarial networks (GANs) have motivated investigations into their application for image editing. However, GANs are often limited in the control they provide for performing specific edits. One of the principal challenges is the entangled latent space of GANs, which is not directly suitable for performing independent and detailed edits. Recent editing methods allow for either controlled style edits or controlled semantic edits. In addition, methods that use semantic masks to edit images have difficulty preserving the identity and are unable to perform controlled style edits. We propose a method to disentangle a GAN's latent space into semantic and style spaces, enabling controlled semantic and style edits for face images independently within the same framework. To achieve this, we design an encoder-decoder based network architecture ($S^2$-Flow), which incorporates two proposed inductive biases. We show the suitability of $S^2$-Flow quantitatively and qualitatively by performing various semantic and style edits.

Submitted 22 November, 2022; originally announced November 2022.

Comments: Accepted to BMVC 2022

arXiv:2211.05797 [pdf, other]

doi 10.1109/ICC45041.2023.10279561

Optimizing the Age of Information in Mixed-Critical Wireless Communication Networks

Authors: Robert-Jeron Reifert, Stefan Roth, Aydin Sezgin

Abstract: Beyond fifth generation wireless communication networks (B5G) are applied in many use-cases, such as industrial control systems, smart public transport, and power grids. Those applications require innovative techniques for timely transmission and increased wireless network capacities. Hence, this paper proposes optimizing the data freshness measured by the age of information (AoI) in dense internet of things (IoT) sensor-actuator networks. Given different priorities of data-streams, i.e., different sensitivities to outdated information, mixed-criticality is introduced by analyzing different functions of the age, i.e., we consider linear and exponential aging functions. An intricate non-convex optimization problem managing the physical transmission time and packet outage probability is derived. This problem is tackled using stochastic reformulations, successive convex approximations, and fractional programming, resulting in an efficient iterative algorithm for AoI optimization. Simulation results validate the proposed scheme's performance in terms of AoI, mixed-criticality, and scalability. The proposed non-orthogonal transmission is shown to outperform an orthogonal access scheme in various deployment cases. Results emphasize the potential gains for dense B5G empowered IoT networks in minimizing the AoI.

Submitted 10 November, 2022; originally announced November 2022.

Comments: 6 pages, 5 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Journal ref: ICC 2023 - IEEE International Conference on Communications
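
Illustrative note (not part of the listing): the abstract above distinguishes linear and exponential aging functions for streams of different criticality. The sketch below shows one simple way such functions could look. The slotted age model (age resets to 1 in a delivery slot), the function names, and the `weight` and `rate` parameters are all assumptions made for illustration; the paper's exact definitions are not given in the abstract.

```python
import math


def age_trace(delivery_slots, horizon):
    """Age of information per time slot under a simplified slotted model.

    Assumption: the age resets to 1 in every slot where an update is
    delivered and grows by 1 otherwise.
    """
    ages, age = [], 0
    for slot in range(horizon):
        age = 1 if slot in delivery_slots else age + 1
        ages.append(age)
    return ages


def linear_age_cost(age, weight=1.0):
    """Penalty that grows proportionally with the age (less critical stream)."""
    return weight * age


def exponential_age_cost(age, rate=0.5):
    """Penalty that grows exponentially with the age (more critical stream)."""
    return math.exp(rate * age) - 1.0


ages = age_trace({0, 3}, horizon=6)
print(ages)
print([linear_age_cost(a) for a in ages])
print([round(exponential_age_cost(a), 3) for a in ages])
```

The exponential penalty overtakes the linear one as updates become stale, which is one way to express that a high-criticality stream is more sensitive to outdated information.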

arXiv:2208.05991 [pdf, ps, other]

Approximation-based Threshold Optimization from Single Antenna to Massive SIMO Authentication

Authors: Stefan Roth, Aydin Sezgin, Roman Bessel, H. Vincent Poor

Abstract: In a wireless sensor network, data from various sensors are gathered to estimate the system-state of the process system. However, adversaries aim at distorting the system-state estimate, for which they may infiltrate sensors or position additional devices in the environment. To authenticate the received process values, the integrity of the measurements from different sensors can be evaluated jointly with the temporal integrity of channel measurements from each sensor. For this purpose, we design a security protocol, in which Kalman filters are used to predict the system-state and the channel-state values, and the received data are authenticated by a hypothesis test. We theoretically analyze the adversarial success probability and the reliability rate obtained in the hypothesis test in two ways, based on a chi-square approximation and on a Gaussian approximation. The two approximations are exact for small and large data vectors, respectively. The Gaussian approximation is suitable for analyzing massive single-input multiple-output (SIMO) setups. To obtain additional insights, the approximation is further adapted for the case of channel hardening, which occurs in massive SIMO fading channels. As adversaries always look for the weakest point of a system, a time-constant security level is required. To provide such a service, the approximations are used to propose time-varying threshold values for the hypothesis test, which approximately attain a constant security level. Numerical results show that a constant security level can only be achieved by a time-varying threshold choice, while a constant threshold value leads to a time-varying security level.

Submitted 11 August, 2022; originally announced August 2022.

arXiv:2208.05788 [pdf, other]

Semantic Self-adaptation: Enhancing Generalization with a Single Sample

Authors: Sherwin Bahmani, Oliver Hahn, Eduard Zamfir, Nikita Araslanov, Daniel Cremers, Stefan Roth

Abstract: The lack of out-of-domain generalization is a critical weakness of deep networks for semantic segmentation. Previous studies relied on the assumption of a static model, i.e., once the training process is complete, model parameters remain fixed at test time. In this work, we challenge this premise with a self-adaptive approach for semantic segmentation that adjusts the inference process to each input sample. Self-adaptation operates on two levels. First, it fine-tunes the parameters of convolutional layers to the input image using consistency regularization. Second, in Batch Normalization layers, self-adaptation interpolates between the training and the reference distribution derived from a single test sample. Despite both techniques being well known in the literature, their combination sets new state-of-the-art accuracy on synthetic-to-real generalization benchmarks. Our empirical study suggests that self-adaptation may complement the established practice of model regularization at training time for improving deep network generalization to out-of-domain data. Our code and pre-trained models are available at https://github.com/visinf/self-adaptive.

Submitted 13 December, 2023; v1 submitted 10 August, 2022; originally announced August 2022.

Comments: Published in TMLR (July 2023) | OpenReview: https://openreview.net/forum?id=ILNqQhGbLx | Code: https://github.com/visinf/self-adaptive | Video: https://youtu.be/s4DG65ic0EA
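
Illustrative note (not part of the listing): the abstract above describes interpolating, in Batch Normalization layers, between the training distribution and statistics derived from a single test sample. A minimal sketch of that idea follows, assuming a plain convex combination of per-channel means and variances. The blending weight `alpha`, the function names, and the exact blending rule are assumptions for illustration and are not taken from the authors' released code.

```python
def interpolate_bn_stats(train_mean, train_var, sample_mean, sample_var, alpha):
    """Blend per-channel training statistics with single-sample statistics.

    alpha = 0 keeps the training statistics unchanged; alpha = 1 uses only
    the statistics of the current test sample (assumed convex combination).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mean = [(1 - alpha) * t + alpha * s for t, s in zip(train_mean, sample_mean)]
    var = [(1 - alpha) * t + alpha * s for t, s in zip(train_var, sample_var)]
    return mean, var


def normalize(features, mean, var, eps=1e-5):
    """Normalize one value per channel with the blended statistics."""
    return [(x - m) / (v + eps) ** 0.5 for x, m, v in zip(features, mean, var)]


mean, var = interpolate_bn_stats(
    train_mean=[0.0, 2.0], train_var=[1.0, 1.0],
    sample_mean=[1.0, 4.0], sample_var=[3.0, 1.0],
    alpha=0.5,
)
print(mean, var)
print(normalize([0.5, 3.0], mean, var))
```

Values at the blended mean normalize to zero, so a larger `alpha` shifts the layer further toward the statistics of the individual test image.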

arXiv:2205.01813 [pdf, other]

Diverse Image Captioning with Grounded Style

Authors: Franz Klein, Shweta Mahajan, Stefan Roth

Abstract: Stylized image captioning as presented in prior work aims to generate captions that reflect characteristics beyond a factual description of the scene composition, such as sentiments. Such prior work relies on given sentiment identifiers, which are used to express a certain global style in the caption, e.g. positive or negative, however without taking into account the stylistic content of the visual scene. To address this shortcoming, we first analyze the limitations of current stylized captioning datasets and propose COCO attribute-based augmentations to obtain varied stylized captions from COCO annotations. Furthermore, we encode the stylized information in the latent space of a Variational Autoencoder; specifically, we leverage extracted image attributes to explicitly structure its sequential latent space according to different localized style characteristics. Our experiments on the Senticap and COCO datasets show the ability of our approach to generate accurate captions with diversity in styles that are grounded in the image.

Submitted 3 May, 2022; originally announced May 2022.

Comments: In the 43rd DAGM German Conference on Pattern Recognition (GCPR) 2021

Journal ref: In Proceedings of the German Conference on Pattern Recognition (GCPR), Ed. by C. Bauckhage, J. Gall, and A. G. Schwing, Vol. 13024, Lecture Notes in Computer Science, Springer, 2021, pp. 421-436

arXiv:2204.11878 [pdf, other]

doi 10.1109/TVT.2023.3296977

Comeback Kid: Resilience for Mixed-Critical Wireless Network Resource Management

Authors: Robert-Jeron Reifert, Stefan Roth, Alaa Alameer Ahmad, Aydin Sezgin

Abstract: The future sixth generation (6G) of communication systems is envisioned to provide numerous applications in safety-critical contexts, e.g., driverless traffic, modular industry, and smart cities, which require outstanding performance, high reliability and fault tolerance, as well as autonomy. Ensuring criticality awareness for diverse functional safety applications and providing fault tolerance in an autonomous manner are essential for future 6G systems. Therefore, this paper proposes jointly employing the concepts of resilience and mixed criticality. In this work, we conduct physical layer resource management in cloud-based networks under the rate-splitting paradigm, which is a promising factor towards achieving high resilience. We recapitulate the concepts individually, outline a joint metric to measure the criticality-aware resilience, and verify its merits in a case study. We, thereby, formulate a non-convex optimization problem, derive an efficient iterative algorithm, propose four resilience mechanisms differing in quality and time of adaption, and conduct extensive numerical simulations. Towards this end, we propose a highly autonomous rate-splitting-enabled physical layer resource management algorithm for future 6G networks respecting mixed-critical quality of service (QoS) levels and providing high levels of resilience. Results emphasize the considerable improvements of incorporating a mixed criticality-aware resilience strategy under channel outages and strict QoS demands. The rate-splitting paradigm is particularly shown to overcome state-of-the-art interference management techniques, and the resilience and throughput adaption over consecutive outage events reveals the proposed scheme's contribution towards enabling future 6G networks.

Submitted 11 June, 2022; v1 submitted 25 April, 2022; originally announced April 2022.

Comments: 16 pages, 13 figures. Submitted to IEEE for possible publication

Journal ref: IEEE Transactions on Vehicular Technology, 2023

arXiv:2202.07951 [pdf, other]

doi 10.1109/ICCWorkshops53468.2022.9813735

Energy Efficiency in Rate-Splitting Multiple Access with Mixed Criticality

Authors: Robert-Jeron Reifert, Stefan Roth, Alaa Alameer Ahmad, Aydin Sezgin

Abstract: Future sixth generation (6G) wireless communication networks face the need to similarly meet unprecedented quality of service (QoS) demands while also providing a larger energy efficiency (EE) to minimize their carbon footprint. Moreover, due to the diverseness of network participants, mixed criticality QoS levels are assigned to the users of such networks. In this work, with a focus on a cloud-radio access network (C-RAN), the fulfillment of desired QoS and minimized transmit power use is optimized jointly within a rate-splitting paradigm. Thereby, the optimization problem is non-convex. Hence, a low-complexity algorithm is proposed based on fractional programming. Numerical results validate that there is a trade-off between the QoS fulfillment and power minimization. Moreover, the energy efficiency of the proposed rate-splitting algorithm is larger than in comparative schemes, especially with mixed criticality.

Submitted 16 February, 2022; originally announced February 2022.

Comments: 7 pages, 6 figures, 1 table. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Journal ref: 2022 IEEE International Conference on Communications Workshops (ICC Workshops)

arXiv:2201.01882 [pdf]

Trust-based Symbolic Motion Planning for Multi-robot Bounding Overwatch

Authors: Huanfei Zheng, Jonathon M. Smereka, Dariusz Mikulski, Stephanie Roth, Yue Wang

Abstract: Multi-robot bounding overwatch requires timely coordination of robot team members. Symbolic motion planning (SMP) can provide provably correct solutions for robot motion planning with high-level temporal logic task requirements. This paper aims to develop a framework for safe and reliable SMP of multi-robot systems (MRS) to satisfy complex bounding overwatch tasks constrained by temporal logics. A decentralized SMP framework is first presented, which guarantees both correctness and parallel execution of the complex bounding overwatch tasks by the MRS. A computational trust model is then constructed by referring to the traversability and line of sight of robots in the terrain. The trust model predicts the trustworthiness of each robot team's potential behavior in executing a task plan. The most trustworthy task and motion plan is explored with a Dijkstra searching strategy to guarantee the reliability of MRS bounding overwatch. A robot simulation is implemented in ROS Gazebo to demonstrate the effectiveness of the proposed framework.

Submitted 5 January, 2022; originally announced January 2022.

arXiv:2112.01967 [pdf, other]

IRShield: A Countermeasure Against Adversarial Physical-Layer Wireless Sensing

Authors: Paul Staat, Simon Mulzer, Stefan Roth, Veelasha Moonsamy, Markus Heinrichs, Rainer Kronberger, Aydin Sezgin, Christof Paar

Abstract: Wireless radio channels are known to contain information about the surrounding propagation environment, which can be extracted using established wireless sensing methods. Thus, today's ubiquitous wireless devices are attractive targets for passive eavesdroppers to launch reconnaissance attacks. In particular, by overhearing standard communication signals, eavesdroppers obtain estimations of wireless channels which can give away sensitive information about indoor environments. For instance, by applying simple statistical methods, adversaries can infer human motion from wireless channel observations, allowing to remotely monitor premises of victims. In this work, building on the advent of intelligent reflecting surfaces (IRSs), we propose IRShield as a novel countermeasure against adversarial wireless sensing. IRShield is designed as a plug-and-play privacy-preserving extension to existing wireless networks. At the core of IRShield, we design an IRS configuration algorithm to obfuscate wireless channels. We validate the effectiveness with extensive experimental evaluations. In a state-of-the-art human motion detection attack using off-the-shelf Wi-Fi devices, IRShield lowered detection rates to 5% or less.

Submitted 7 April, 2022; v1 submitted 3 December, 2021; originally announced December 2021.

arXiv:2111.07668 [pdf, other]

Fast Axiomatic Attribution for Neural Networks

Authors: Robin Hesse, Simone Schaub-Meyer, Stefan Roth

Abstract: Mitigating the dependence on spurious correlations present in the training dataset is a quickly emerging and important topic of deep learning. Recent approaches include priors on the feature attribution of a deep neural network (DNN) into the training process to reduce the dependence on unwanted features. However, until now one needed to trade off high-quality attributions, satisfying desirable axioms, against the time required to compute them. This in turn either led to long training times or ineffective attribution priors. In this work, we break this trade-off by considering a special class of efficiently axiomatically attributable DNNs for which an axiomatic feature attribution can be computed with only a single forward/backward pass. We formally prove that nonnegatively homogeneous DNNs, here termed $\mathcal{X}$-DNNs, are efficiently axiomatically attributable and show that they can be effortlessly constructed from a wide range of regular DNNs by simply removing the bias term of each layer. Various experiments demonstrate the advantages of $\mathcal{X}$-DNNs, beating state-of-the-art generic attribution methods on regular DNNs for training with attribution priors.

Submitted 15 November, 2021; originally announced November 2021.

Comments: To appear at NeurIPS*2021. Project page and code: https://visinf.github.io/fast-axiomatic-attribution

arXiv:2111.06265 [pdf, other]

Dense Unsupervised Learning for Video Segmentation

Authors: Nikita Araslanov, Simone Schaub-Meyer, Stefan Roth

Abstract: We present a novel approach to unsupervised learning for video object segmentation (VOS). Unlike previous work, our formulation allows to learn dense feature representations directly in a fully convolutional regime. We rely on uniform grid sampling to extract a set of anchors and train our model to disambiguate between them on both inter- and intra-video levels. However, a naive scheme to train such a model results in a degenerate solution. We propose to prevent this with a simple regularisation scheme, accommodating the equivariance property of the segmentation task to similarity transformations. Our training objective admits efficient implementation and exhibits fast training convergence. On established VOS benchmarks, our approach exceeds the segmentation accuracy of previous work despite using significantly less training data and compute power.

Submitted 11 November, 2021; originally announced November 2021.

Comments: To appear at NeurIPS*2021. Code: https://github.com/visinf/dense-ulearn-vos

arXiv:2110.08787 [pdf, other]

PixelPyramids: Exact Inference Models from Lossless Image Pyramids

Authors: Shweta Mahajan, Stefan Roth

Abstract: Autoregressive models are a class of exact inference approaches with highly flexible functional forms, yielding state-of-the-art density estimates for natural images. Yet, the sequential ordering on the dimensions makes these models computationally expensive and limits their applicability to low-resolution imagery. In this work, we propose Pixel-Pyramids, a block-autoregressive approach employing a lossless pyramid decomposition with scale-specific representations to encode the joint distribution of image pixels. Crucially, it affords a sparser dependency structure compared to fully autoregressive approaches. Our PixelPyramids yield state-of-the-art results for density estimation on various image datasets, especially for high-resolution data. For CelebA-HQ 1024 x 1024, we observe that the density estimates (in terms of bits/dim) are improved to ~44% of the baseline despite sampling speeds superior even to easily parallelizable flow-based models.

Submitted 17 October, 2021; originally announced October 2021.

Comments: To appear at ICCV 2021

arXiv:2109.06082 [pdf, other]

xGQA: Cross-Lingual Visual Question Answering

Authors: Jonas Pfeiffer, Gregor Geigle, Aishwarya Kamath, Jan-Martin O. Steitz, Stefan Roth, Ivan Vulić, Iryna Gurevych

Abstract: Recent advances in multimodal vision and language modeling have predominantly focused on the English language, mostly due to the lack of multilingual multimodal datasets to steer modeling efforts. In this work, we address this gap and provide xGQA, a new multilingual evaluation benchmark for the visual question answering task. We extend the established English GQA dataset to 7 typologically diverse languages, enabling us to detect and explore crucial challenges in cross-lingual visual question answering. We further propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual, and -- vice versa -- multilingual models to become multimodal. Our proposed methods outperform current state-of-the-art multilingual multimodal models (e.g., M3P) in zero-shot cross-lingual settings, but the accuracy remains low across the board; a performance drop of around 38 accuracy points in target languages showcases the difficulty of zero-shot cross-lingual transfer for this task. Our results suggest that simple cross-lingual transfer of multimodal models yields latent multilingual multimodal misalignment, calling for more sophisticated methods for vision and multilingual language modeling.

Submitted 17 March, 2022; v1 submitted 13 September, 2021; originally announced September 2021.

Comments: Findings of ACL 2022

arXiv:2109.04422 [pdf, other]

TxT: Crossmodal End-to-End Learning with Transformers

Authors: Jan-Martin O. Steitz, Jonas Pfeiffer, Iryna Gurevych, Stefan Roth

Abstract: Reasoning over multiple modalities, e.g. in Visual Question Answering (VQA), requires an alignment of semantic concepts across domains. Despite the widespread success of end-to-end learning, today's multimodal pipelines by and large leverage pre-extracted, fixed features from object detectors, typically Faster R-CNN, as representations of the visual world. The obvious downside is that the visual representation is not specifically tuned to the multimodal task at hand. At the same time, while transformer-based object detectors have gained popularity, they have not been employed in today's multimodal pipelines. We address both shortcomings with TxT, a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components on the downstream task in a fully end-to-end manner. We overcome existing limitations of transformer-based detectors for multimodal reasoning regarding the integration of global context and their scalability. Our transformer-based multimodal model achieves considerable gains from end-to-end learning for multimodal question answering.

Submitted 9 September, 2021; originally announced September 2021.

Comments: To appear at the 43rd DAGM German Conference on Pattern Recognition (GCPR) 2021

arXiv:2105.02216 [pdf, other]

Self-Supervised Multi-Frame Monocular Scene Flow

Authors: Junhwa Hur, Stefan Roth

Abstract: Estimating 3D scene flow from a sequence of monocular images has been gaining increased attention due to the simple, economical capture setup. Owing to the severe ill-posedness of the problem, the accuracy of current methods has been limited, especially that of efficient, real-time approaches. In this paper, we introduce a multi-frame monocular scene flow network based on self-supervised learning, improving the accuracy over previous networks while retaining real-time efficiency. Based on an advanced two-frame baseline with a split-decoder design, we propose (i) a multi-frame model using a triple frame input and convolutional LSTM connections, (ii) an occlusion-aware census loss for better accuracy, and (iii) a gradient detaching strategy to improve training stability. On the KITTI dataset, we observe state-of-the-art accuracy among monocular scene flow methods based on self-supervised learning.

Submitted 5 May, 2021; originally announced May 2021.

Comments: To appear at CVPR 2021. Code available: https://github.com/visinf/multi-mono-sf

arXiv:2105.00097 [pdf, other]

Self-supervised Augmentation Consistency for Adapting Semantic Segmentation

Authors: Nikita Araslanov, Stefan Roth

Abstract: We propose an approach to domain adaptation for semantic segmentation that is both practical and highly accurate. In contrast to previous work, we abandon the use of computationally involved adversarial objectives, network ensembles and style transfer. Instead, we employ standard data augmentation techniques $-$ photometric noise, flipping and scaling $-$ and ensure consistency of the semantic predictions across these image transformations. We develop this principle in a lightweight self-supervised framework trained on co-evolving pseudo labels without the need for cumbersome extra training rounds. Simple in training from a practitioner's standpoint, our approach is remarkably effective. We achieve significant improvements of the state-of-the-art segmentation accuracy after adaptation, consistent both across different choices of the backbone architecture and adaptation scenarios.

Submitted 30 April, 2021; originally announced May 2021.

Comments: To appear at CVPR 2021. Code: https://github.com/visinf/da-sac

arXiv:2103.09962 [pdf, other]

Deep Wiener Deconvolution: Wiener Meets Deep Learning for Image Deblurring

Authors: Jiangxin Dong, Stefan Roth, Bernt Schiele

Abstract: We present a simple and effective approach for non-blind image deblurring, combining classical techniques and deep learning. In contrast to existing methods that deblur the image directly in the standard image space, we propose to perform an explicit deconvolution process in a feature space by integrating a classical Wiener deconvolution framework with learned deep features. A multi-scale feature refinement module then predicts the deblurred image from the deconvolved deep features, progressively recovering detail and small-scale structures. The proposed model is trained in an end-to-end manner and evaluated on scenarios with both simulated and real-world image blur. Our extensive experimental results show that the proposed deep Wiener deconvolution network facilitates deblurred results with visibly fewer artifacts. Moreover, our approach quantitatively outperforms state-of-the-art non-blind image deblurring methods by a wide margin.

Submitted 17 March, 2021; originally announced March 2021.

Comments: Accepted to NeurIPS 2020 as an oral presentation. Project page: https://gitlab.mpi-klsb.mpg.de/jdong/dwdn

arXiv:2103.08497 [pdf, other]

Sampling-free Variational Inference for Neural Networks with Multiplicative Activation Noise

Authors: Jannik Schmitt, Stefan Roth

Abstract: To adopt neural networks in safety critical domains, knowing whether we can trust their predictions is crucial. Bayesian neural networks (BNNs) provide uncertainty estimates by averaging predictions with respect to the posterior weight distribution. Variational inference methods for BNNs approximate the intractable weight posterior with a tractable distribution, yet mostly rely on sampling from the variational distribution during training and inference. Recent sampling-free approaches offer an alternative, but incur a significant parameter overhead. We here propose a more efficient parameterization of the posterior approximation for sampling-free variational inference that relies on the distribution induced by multiplicative Gaussian activation noise. This allows us to combine parameter efficiency with the benefits of sampling-free variational inference. Our approach yields competitive results for standard regression problems and scales well to large-scale image classification tasks including ImageNet.

Submitted 16 March, 2021; v1 submitted 15 March, 2021; originally announced March 2021.

arXiv:2012.07727 [pdf, ps, other]

Localization Attack by Precoder Feedback Overhearing in 5G Networks and Countermeasures

Authors: Stefan Roth, Stefano Tomasin, Marco Maso, Aydin Sezgin

Abstract: In fifth-generation (5G) cellular networks, users feed back to the base station the index of the precoder (from a codebook) to be used for downlink transmission. The precoder is strongly related to the user channel and in turn to the user position within the cell. We propose a method by which an external attacker determines the user position by passively overhearing this unencrypted layer-2 feedback signal. The attacker first builds a map of fed back precoder indices in the cell. Then, by overhearing the precoder index fed back by the victim user, the attacker finds its position on the map. We focus on the type-I single-panel codebook, which today is the only mandatory solution in the 3GPP standard. We analyze the attack and assess the obtained localization accuracy against various parameters. We analyze the localization error of a simplified precoder feedback model and describe its asymptotic localization precision. We also propose a mitigation against our attack, wherein the user randomly selects the precoder among those providing the highest rate. Simulations confirm that the attack can achieve a high localization accuracy, which is significantly reduced when the mitigation solution is adopted, at the cost of a negligible rate degradation.

Submitted 14 December, 2020; originally announced December 2020.

arXiv:2011.00966 [pdf, other]

Diverse Image Captioning with Context-Object Split Latent Spaces

Authors: Shweta Mahajan, Stefan Roth

Abstract: Diverse image captioning models aim to learn one-to-many mappings that are innate to cross-domain datasets, such as of images and texts. Current methods for this task are based on generative latent variable models, e.g. VAEs with structured latent spaces. Yet, the amount of multimodality captured by prior work is limited to that of the paired training data -- the true diversity of the underlying generative process is not fully captured. To address this limitation, we leverage the contextual descriptions in the dataset that explain similar contexts in different visual scenes. To this end, we introduce a novel factorization of the latent space, termed context-object split, to model diversity in contextual descriptions across images and texts within the dataset. Our framework not only enables diverse captioning through context-based pseudo supervision, but extends this to images with novel objects and without paired captions in the training data. We evaluate our COS-CVAE approach on the standard COCO dataset and on the held-out COCO dataset consisting of images with novel objects, showing significant gains in accuracy and diversity.

Submitted 2 November, 2020; originally announced November 2020.

Comments: To appear at NeurIPS 2020

arXiv:2010.07548 [pdf, other]

MOTChallenge: A Benchmark for Single-Camera Multiple Target Tracking

Authors: Patrick Dendorfer, Aljoša Ošep, Anton Milan, Konrad Schindler, Daniel Cremers, Ian Reid, Stefan Roth, Laura Leal-Taixé

Abstract: Standardized benchmarks have been crucial in pushing the performance of computer vision algorithms, especially since the advent of deep learning. Although leaderboards should not be over-claimed, they often provide the most objective measure of performance and are therefore important guides for research. We present MOTChallenge, a benchmark for single-camera Multiple Object Tracking (MOT) launched in late 2014, to collect existing and new data, and create a framework for the standardized evaluation of multiple object tracking methods. The benchmark is focused on multiple people tracking, since pedestrians are by far the most studied object in the tracking community, with applications ranging from robot navigation to self-driving cars. This paper collects the first three releases of the benchmark: (i) MOT15, along with numerous state-of-the-art results that were submitted in the last years, (ii) MOT16, which contains new challenging videos, and (iii) MOT17, that extends MOT16 sequences with more precise labels and evaluates tracking performance on three different object detectors. The second and third releases not only offer a significant increase in the number of labeled boxes but also provide labels for multiple object classes beside pedestrians, as well as the level of visibility for every single object of interest. We finally provide a categorization of state-of-the-art trackers and a broad error analysis. This will help newcomers understand the related work and research trends in the MOT community, and hopefully shed some light on potential future research directions.

Submitted 8 December, 2020; v1 submitted 15 October, 2020; originally announced October 2020.

Comments: Accepted at IJCV

arXiv:2007.05798 [pdf, other]

doi 10.1109/IROS45743.2020.9340636

Planning on the fast lane: Learning to interact using attention mechanisms in path integral inverse reinforcement learning

Authors: Sascha Rosbach, Xing Li, Simon Großjohann, Silviu Homoceanu, Stefan Roth

Abstract: General-purpose trajectory planning algorithms for automated driving utilize complex reward functions to perform a combined optimization of strategic, behavioral, and kinematic features. The specification and tuning of a single reward function is a tedious task and does not generalize over a large set of traffic situations. Deep learning approaches based on path integral inverse reinforcement learning have been successfully applied to predict local situation-dependent reward functions using features of a set of sampled driving policies. Sample-based trajectory planning algorithms are able to approximate a spatio-temporal subspace of feasible driving policies that can be used to encode the context of a situation. However, the interaction with dynamic objects requires an extended planning horizon, which depends on sequential context modeling. In this work, we are concerned with the sequential reward prediction over an extended time horizon. We present a neural network architecture that uses a policy attention mechanism to generate a low-dimensional context vector by concentrating on trajectories with a human-like driving style. Apart from this, we propose a temporal attention mechanism to identify context switches and allow for stable adaptation of rewards. We evaluate our results on complex simulated driving situations, including other moving vehicles. Our evaluation shows that our policy attention mechanism learns to focus on collision-free policies in the configuration space. Furthermore, the temporal attention mechanism learns persistent interaction with other vehicles over an extended planning horizon.

Submitted 12 September, 2020; v1 submitted 11 July, 2020; originally announced July 2020.

Comments: To appear in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, October 2020

Journal ref: 2020 IEEE/RSJ Int. Conf. on Intelligent Robots and Syst. (IROS), Las Vegas, USA, 2020, pp. 5187-5193

arXiv:2005.14264 [pdf, other]

LR-CNN: Local-aware Region CNN for Vehicle Detection in Aerial Imagery

Authors: Wentong Liao, Xiang Chen, Jingfeng Yang, Stefan Roth, Michael Goesele, Michael Ying Yang, Bodo Rosenhahn

Abstract: State-of-the-art object detection approaches such as Fast/Faster R-CNN, SSD, or YOLO have difficulties detecting dense, small targets with arbitrary orientation in large aerial images. The main reason is that using interpolation to align RoI features can result in a lack of accuracy or even loss of location information. We present the Local-aware Region Convolutional Neural Network (LR-CNN), a novel two-stage approach for vehicle detection in aerial imagery. We enhance translation invariance to detect dense vehicles and address the boundary quantization issue amongst dense vehicles by aggregating the high-precision RoIs' features. Moreover, we resample high-level semantic pooled features, making them regain location information from the features of a shallower convolutional block. This strengthens the local feature invariance for the resampled features and enables detecting vehicles in an arbitrary orientation. The local feature invariance enhances the learning ability of the focal loss function, and the focal loss further helps to focus on the hard examples. Taken together, our method better addresses the challenges of aerial imagery. We evaluate our approach on several challenging datasets (VEDAI, DOTA), demonstrating a significant improvement over state-of-the-art methods. We demonstrate the good generalization ability of our approach on the DLR 3K dataset.

Submitted 28 May, 2020; originally announced May 2020.

Comments: 8 pages

arXiv:2005.08104 [pdf, other]

Single-Stage Semantic Segmentation from Image Labels

Authors: Nikita Araslanov, Stefan Roth

Abstract: Recent years have seen a rapid growth in new approaches improving the accuracy of semantic segmentation in a weakly supervised setting, i.e. with only image-level labels available for training. However, this has come at the cost of increased model complexity and sophisticated multi-stage training procedures. This is in contrast to earlier work that used only a single stage $-$ training one segmentation network on image labels $-$ which was abandoned due to inferior segmentation accuracy. In this work, we first define three desirable properties of a weakly supervised method: local consistency, semantic fidelity, and completeness. Using these properties as guidelines, we then develop a segmentation-based network model and a self-supervised training scheme to train for semantic masks from image-level annotations in a single stage. We show that despite its simplicity, our method achieves results that are competitive with significantly more complex pipelines, substantially outperforming earlier single-stage methods.

Submitted 16 May, 2020; originally announced May 2020.

Comments: To appear at CVPR 2020; minor corrections in Eq. (9). Code: https://github.com/visinf/1-stage-wseg

arXiv:2005.05292 [pdf, ps, other]

Remote Short Blocklength Process Monitoring: Trade-off Between Resolution and Data Freshness

Authors: Stefan Roth, Ahmed Arafa, H. Vincent Poor, Aydin Sezgin

Abstract: In cyber-physical systems, as in 5G and beyond, multiple physical processes require timely online monitoring at a remote device. There, the received information is used to estimate current and future process values. When transmitting the process data over a communication channel, source-channel coding is used in order to reduce data errors. During transmission, a high data resolution is helpful to capture the value of the process variables precisely. However, this typically comes with long transmission delays reducing the utilizability of the data, since the estimation quality gets reduced over time. In this paper, the trade-off between having recent data and precise measurements is captured for a Gauss-Markov process. An Age-of-Information (AoI) metric is used to assess data timeliness, while mean square error (MSE) is used to assess the precision of the predicted process values. AoI appears inherently within the MSE expressions, yet it can be relatively easier to optimize. Our goal is to minimize a time-averaged version of both metrics. We follow a short blocklength source-channel coding approach, and optimize the parameters of the codes being used in order to describe an achievability region between MSE and AoI.

Submitted 11 May, 2020; originally announced May 2020.

Comments: To appear in the 2020 IEEE International Conference on Communications

arXiv:2004.04143 [pdf, other]

Self-Supervised Monocular Scene Flow Estimation

Authors: Junhwa Hur, Stefan Roth

Abstract: Scene flow estimation has been receiving increasing attention for 3D environment perception. Monocular scene flow estimation -- obtaining 3D structure and 3D motion from two temporally consecutive images -- is a highly ill-posed problem, and practical solutions are lacking to date. We propose a novel monocular scene flow method that yields competitive accuracy and real-time performance. By taking an inverse problem view, we design a single convolutional neural network (CNN) that successfully estimates depth and 3D motion simultaneously from a classical optical flow cost volume. We adopt self-supervised learning with 3D loss functions and occlusion reasoning to leverage unlabeled data. We validate our design choices, including the proxy loss and augmentation setup. Our model achieves state-of-the-art accuracy among unsupervised/self-supervised learning approaches to monocular scene flow, and yields competitive results for the optical flow and monocular depth estimation sub-tasks. Semi-supervised fine-tuning further improves the accuracy and yields promising results in real-time.

Submitted 15 April, 2020; v1 submitted 8 April, 2020; originally announced April 2020.

Comments: To appear at CVPR 2020 (Oral); a typo corrected in the reference section

arXiv:2004.03891 [pdf, other]

Normalizing Flows with Multi-Scale Autoregressive Priors

Authors: Shweta Mahajan, Apratim Bhattacharyya, Mario Fritz, Bernt Schiele, Stefan Roth

Abstract: Flow-based generative models are an important class of exact inference models that admit efficient inference and sampling for image synthesis. Owing to the efficiency constraints on the design of the flow layers, e.g. split coupling flow layers in which approximately half the pixels do not undergo further transformations, they have limited expressiveness for modeling long-range data dependencies compared to autoregressive models that rely on conditional pixel-wise generation. In this work, we improve the representational power of flow-based models by introducing channel-wise dependencies in their latent space through multi-scale autoregressive priors (mAR). Our mAR prior for models with split coupling flow layers (mAR-SCF) can better capture dependencies in complex multimodal data. The resulting model achieves state-of-the-art density estimation results on MNIST, CIFAR-10, and ImageNet. Furthermore, we show that mAR-SCF allows for improved image generation quality, with gains in FID and Inception scores compared to state-of-the-art flow-based models.

Submitted 8 April, 2020; originally announced April 2020.

Comments: To appear in CVPR 2020

arXiv:2004.02853 [pdf, other]

Optical Flow Estimation in the Deep Learning Age

Authors: Junhwa Hur, Stefan Roth

Abstract: Akin to many subareas of computer vision, the recent advances in deep learning have also significantly influenced the literature on optical flow. Previously, the literature had been dominated by classical energy-based models, which formulate optical flow estimation as an energy minimization problem. However, as the practical benefits of Convolutional Neural Networks (CNNs) over conventional methods have become apparent in numerous areas of computer vision and beyond, they have also seen increased adoption in the context of motion estimation to the point where the current state of the art in terms of accuracy is set by CNN approaches. We first review this transition as well as the developments from early work to the current state of CNNs for optical flow estimation. Alongside, we discuss some of their technical details and compare them to recapitulate which technical contribution led to the most significant accuracy improvements. Then we provide an overview of the various optical flow approaches introduced in the deep learning age, including those based on alternative learning paradigms (e.g., unsupervised and semi-supervised methods) as well as the extension to the multi-frame case, which is able to yield further accuracy improvements.

Submitted 6 April, 2020; originally announced April 2020.

Comments: To appear as a book chapter in Modelling Human Motion, N. Noceti, A. Sciutti and F. Rea, Eds., Springer, 2020

arXiv:2003.14407 [pdf, other]

Probabilistic Pixel-Adaptive Refinement Networks

Authors: Anne S. Wannenwetsch, Stefan Roth

Abstract: Encoder-decoder networks have found widespread use in various dense prediction tasks. However, the strong reduction of spatial resolution in the encoder leads to a loss of location information as well as boundary artifacts. To address this, image-adaptive post-processing methods have proven beneficial by leveraging the high-resolution input image(s) as guidance data. We extend such approaches by considering an important orthogonal source of information: the network's confidence in its own predictions. We introduce probabilistic pixel-adaptive convolutions (PPACs), which not only depend on image guidance data for filtering, but also respect the reliability of per-pixel predictions. As such, PPACs allow for image-adaptive smoothing and simultaneously propagating pixels of high confidence into less reliable regions, while respecting object boundaries. We demonstrate their utility in refinement networks for optical flow and semantic segmentation, where PPACs lead to a clear reduction in boundary artifacts. Moreover, our proposed refinement step is able to substantially improve the accuracy on various widely used benchmarks.

Submitted 31 March, 2020; originally announced March 2020.

Comments: To appear at CVPR 2020

arXiv:2003.09003 [pdf, other]

MOT20: A benchmark for multi object tracking in crowded scenes

Authors: Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, Laura Leal-Taixé

Abstract: Standardized benchmarks are crucial for the majority of computer vision applications. Although leaderboards and ranking tables should not be over-claimed, benchmarks often provide the most objective measure of performance and are therefore important guides for research. The benchmark for Multiple Object Tracking, MOTChallenge, was launched with the goal to establish a standardized evaluation of multiple object tracking methods. The challenge focuses on multiple people tracking, since pedestrians are well studied in the tracking community, and precise tracking and detection has high practical relevance. Since the first release, MOT15, MOT16, and MOT17 have tremendously contributed to the community by introducing a clean dataset and precise framework to benchmark multi-object trackers. In this paper, we present our MOT20 benchmark, consisting of 8 new sequences depicting very crowded challenging scenes. The benchmark was presented first at the 4th BMTT MOT Challenge Workshop at the Computer Vision and Pattern Recognition Conference (CVPR) 2019, and gives the chance to evaluate state-of-the-art methods for multiple object tracking when handling extremely crowded scenarios.

Submitted 19 March, 2020; originally announced March 2020.

Comments: The sequences of the new MOT20 benchmark were previously presented in the CVPR 2019 tracking challenge (arXiv:1906.04567). The differences between the two challenges are: (1) new and corrected annotations; (2) new sequences, as we had to crop and transform some old sequences to achieve higher quality in the annotations; (3) new baseline evaluations and different sets of public detections.

arXiv:2002.06661 [pdf, other]

Latent Normalizing Flows for Many-to-Many Cross-Domain Mappings

Authors: Shweta Mahajan, Iryna Gurevych, Stefan Roth

Abstract: Learned joint representations of images and text form the backbone of several important cross-domain tasks such as image captioning. Prior work mostly maps both domains into a common latent representation in a purely supervised fashion. This is rather restrictive, however, as the two domains follow distinct generative processes. Therefore, we propose a novel semi-supervised framework, which models shared information between domains and domain-specific information separately. The information shared between the domains is aligned with an invertible neural network. Our model integrates normalizing flow-based priors for the domain-specific information, which allows us to learn diverse many-to-many mappings between the two domains. We demonstrate the effectiveness of our model on diverse tasks, including image captioning and text-to-image synthesis.

Submitted 16 February, 2020; originally announced February 2020.

Comments: Published as a conference paper at ICLR 2020

arXiv:1912.03509 [pdf, other]

doi 10.1109/ICRA40945.2020.9196778

Driving Style Encoder: Situational Reward Adaptation for General-Purpose Planning in Automated Driving

Authors: Sascha Rosbach, Vinit James, Simon Großjohann, Silviu Homoceanu, Xing Li, Stefan Roth

Abstract: General-purpose planning algorithms for automated driving combine mission, behavior, and local motion planning. Such planning algorithms map features of the environment and driving kinematics into complex reward functions. To achieve this, planning experts often rely on linear reward functions. The specification and tuning of these reward functions is a tedious process and requires significant experience. Moreover, a manually designed linear reward function does not generalize across different driving situations. In this work, we propose a deep learning approach based on inverse reinforcement learning that generates situation-dependent reward functions. Our neural network provides a mapping between features and actions of sampled driving policies of a model-predictive control-based planner and predicts reward functions for upcoming planning cycles. In our evaluation, we compare the driving style of reward functions predicted by our deep network against clustered and linear reward functions. Our proposed deep learning approach outperforms clustered linear reward functions and is at par with linear reward functions with a-priori knowledge about the situation.

Submitted 13 September, 2020; v1 submitted 7 December, 2019; originally announced December 2019.

Comments: To appear in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris, France, June 2020 (Virtual Conference). Accepted version. Corrected figure font

Journal ref: IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 2020, pp. 6419-6425

arXiv:1909.12400 [pdf, other]

Markov Decision Process for Video Generation

Authors: Vladyslav Yushchenko, Nikita Araslanov, Stefan Roth

Abstract: We identify two pathological cases of temporal inconsistencies in video generation: video freezing and video looping. To better quantify the temporal diversity, we propose a class of complementary metrics that are effective, easy to implement, data agnostic, and interpretable. Further, we observe that current state-of-the-art models are trained on video samples of fixed length thereby inhibiting long-term modeling. To address this, we reformulate the problem of video generation as a Markov Decision Process (MDP). The underlying idea is to represent motion as a stochastic process with an infinite forecast horizon to overcome the fixed length limitation and to mitigate the presence of temporal artifacts. We show that our formulation is easy to integrate into the state-of-the-art MoCoGAN framework. Our experiments on the Human Actions and UCF-101 datasets demonstrate that our MDP-based model is more memory efficient and improves the video quality both in terms of the new and established metrics.

Submitted 26 September, 2019; originally announced September 2019.

Comments: To appear at 2019 ICCV Workshop on Large Scale Holistic Video Understanding

arXiv:1909.12196 [pdf, other]

Deep Video Deblurring: The Devil is in the Details

Authors: Jochen Gast, Stefan Roth

Abstract: Video deblurring for hand-held cameras is a challenging task, since the underlying blur is caused by both camera shake and object motion. State-of-the-art deep networks exploit temporal information from neighboring frames, either by means of spatio-temporal transformers or by recurrent architectures. In contrast to these involved models, we found that a simple baseline CNN can perform astonishingly well when particular care is taken w.r.t. the details of model and training procedure. To that end, we conduct a comprehensive study regarding these crucial details, uncovering extreme differences in quantitative and qualitative performance. Exploiting these details allows us to boost the architecture and training procedure of a simple baseline CNN by a staggering 3.15dB, such that it becomes highly competitive w.r.t. cutting-edge networks. This raises the question whether the reported accuracy difference between models is always due to technical contributions or also subject to such orthogonal, but crucial details.

Submitted 26 September, 2019; originally announced September 2019.

Comments: To appear at ICCVW 2019

arXiv:1909.06635 [pdf, other]

Joint Wasserstein Autoencoders for Aligning Multimodal Embeddings

Authors: Shweta Mahajan, Teresa Botschen, Iryna Gurevych, Stefan Roth

Abstract: One of the key challenges in learning joint embeddings of multiple modalities, e.g. of images and text, is to ensure coherent cross-modal semantics that generalize across datasets. We propose to address this through joint Gaussian regularization of the latent representations. Building on Wasserstein autoencoders (WAEs) to encode the input in each domain, we enforce the latent embeddings to be similar to a Gaussian prior that is shared across the two domains, ensuring compatible continuity of the encoded semantic representations of images and texts. Semantic alignment is achieved through supervision from matching image-text pairs. To show the benefits of our semi-supervised representation, we apply it to cross-modal retrieval and phrase localization. We not only achieve state-of-the-art accuracy, but significantly better generalization across datasets, owing to the semantic continuity of the latent space.

Submitted 14 September, 2019; originally announced September 2019.

Comments: Accepted at ICCV 2019 Workshop on Cross-Modal Learning in Real World

arXiv:1909.03677 [pdf, other]

Learning Task-Specific Generalized Convolutions in the Permutohedral Lattice

Authors: Anne S. Wannenwetsch, Martin Kiefel, Peter V. Gehler, Stefan Roth

Abstract: Dense prediction tasks typically employ encoder-decoder architectures, but the prevalent convolutions in the decoder are not image-adaptive and can lead to boundary artifacts. Different generalized convolution operations have been introduced to counteract this. We go beyond these by leveraging guidance data to redefine their inherent notion of proximity. Our proposed network layer builds on the permutohedral lattice, which performs sparse convolutions in a high-dimensional space allowing for powerful non-local operations despite small filters. Multiple features with different characteristics span this permutohedral space. In contrast to prior work, we learn these features in a task-specific manner by generalizing the basic permutohedral operations to learnt feature representations. As the resulting objective is complex, a carefully designed framework and learning procedure are introduced, yielding rich feature embeddings in practice. We demonstrate the general applicability of our approach in different joint upsampling tasks. When adding our network layer to state-of-the-art networks for optical flow and semantic segmentation, boundary artifacts are removed and the accuracy is improved.

Submitted 9 September, 2019; originally announced September 2019.

Comments: To appear at GCPR 2019

arXiv:1906.04567 [pdf, other]

CVPR19 Tracking and Detection Challenge: How crowded can it get?

Authors: Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, Laura Leal-Taixe

Abstract: Standardized benchmarks are crucial for the majority of computer vision applications. Although leaderboards and ranking tables should not be over-claimed, benchmarks often provide the most objective measure of performance and are therefore important guides for research. The benchmark for Multiple Object Tracking, MOTChallenge, was launched with the goal to establish a standardized evaluation of multiple object tracking methods. The challenge focuses on multiple people tracking, since pedestrians are well studied in the tracking community, and precise tracking and detection has high practical relevance. Since the first release, MOT15, MOT16 and MOT17 have tremendously contributed to the community by introducing a clean dataset and precise framework to benchmark multi-object trackers. In this paper, we present our CVPR19 benchmark, consisting of 8 new sequences depicting very crowded challenging scenes. The benchmark will be presented at the 4th BMTT MOT Challenge Workshop at the Computer Vision and Pattern Recognition Conference (CVPR) 2019, and will evaluate the state-of-the-art in multiple object tracking when handling extremely crowded scenarios.

Submitted 10 June, 2019; originally announced June 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1603.00831, arXiv:1504.01942

arXiv:1905.00229 [pdf, other]

doi 10.1109/IROS40897.2019.8968205

Driving with Style: Inverse Reinforcement Learning in General-Purpose Planning for Automated Driving

Authors: Sascha Rosbach, Vinit James, Simon Großjohann, Silviu Homoceanu, Stefan Roth

Abstract: Behavior and motion planning play an important role in automated driving. Traditionally, behavior planners instruct local motion planners with predefined behaviors. Due to the high scene complexity in urban environments, unpredictable situations may occur in which behavior planners fail to match predefined behavior templates. Recently, general-purpose planners have been introduced, combining behavior and local motion planning. These general-purpose planners allow behavior-aware motion planning given a single reward function. However, two challenges arise: First, this function has to map a complex feature space into rewards. Second, the reward function has to be manually tuned by an expert. Manually tuning this reward function becomes a tedious task. In this paper, we propose an approach that relies on human driving demonstrations to automatically tune reward functions. This study offers important insights into the driving style optimization of general-purpose planners with maximum entropy inverse reinforcement learning. We evaluate our approach based on the expected value difference between learned and demonstrated policies. Furthermore, we compare the similarity of human driven trajectories with optimal policies of our planner under learned and expert-tuned reward functions. Our experiments show that we are able to learn reward functions exceeding the level of manual expert tuning without prior domain knowledge.

Submitted 12 September, 2020; v1 submitted 1 May, 2019; originally announced May 2019.

Comments: Appeared at IROS 2019. Accepted version. Added/updated footnote, minor correction in preliminaries

Journal ref: 2019 IEEE/RSJ Int. Conf. on Intelligent Robots and Syst. (IROS), Macau, China, 2019, pp. 2658-2665

arXiv:1904.05290 [pdf, other]

Iterative Residual Refinement for Joint Optical Flow and Occlusion Estimation

Authors: Junhwa Hur, Stefan Roth

Abstract: Deep learning approaches to optical flow estimation have seen rapid progress over the recent years. One common trait of many networks is that they refine an initial flow estimate either through multiple stages or across the levels of a coarse-to-fine representation. While leading to more accurate results, the downside of this is an increased number of parameters. Taking inspiration from both classical energy minimization approaches as well as residual networks, we propose an iterative residual refinement (IRR) scheme based on weight sharing that can be combined with several backbone networks. It reduces the number of parameters, improves the accuracy, or even achieves both. Moreover, we show that integrating occlusion prediction and bi-directional flow estimation into our IRR scheme can further boost the accuracy. Our full network achieves state-of-the-art results for both optical flow and occlusion estimation across several standard datasets.

Submitted 10 April, 2019; originally announced April 2019.

Comments: To appear in CVPR 2019

arXiv:1904.05126 [pdf, other]

Actor-Critic Instance Segmentation

Authors: Nikita Araslanov, Constantin Rothkopf, Stefan Roth

Abstract: Most approaches to visual scene analysis have emphasised parallel processing of the image elements. However, one area in which the sequential nature of vision is apparent, is that of segmenting multiple, potentially similar and partially occluded objects in a scene. In this work, we revisit the recurrent formulation of this challenging problem in the context of reinforcement learning. Motivated by the limitations of the global max-matching assignment of the ground-truth segments to the recurrent states, we develop an actor-critic approach in which the actor recurrently predicts one instance mask at a time and utilises the gradient from a concurrently trained critic network. We formulate the state, action, and the reward such as to let the critic model long-term effects of the current prediction and incorporate this information into the gradient signal. Furthermore, to enable effective exploration in the inherently high-dimensional action space of instance masks, we learn a compact representation using a conditional variational auto-encoder. We show that our actor-critic model consistently provides accuracy benefits over the recurrent baseline on standard instance segmentation benchmarks.

Submitted 10 April, 2019; originally announced April 2019.

Comments: To appear at CVPR 2019

Showing 1–50 of 74 results for author: Roth, S