Search | arXiv e-print repository

The Third Monocular Depth Estimation Challenge

Authors: Jaime Spencer, Fabio Tosi, Matteo Poggi, Ripudaman Singh Arora, Chris Russell, Simon Hadfield, Richard Bowden, GuangYuan Zhou, ZhengXin Li, Qiang Rao, YiPing Bao, Xiao Liu, Dohyeong Kim, Jinseong Kim, Myunghyun Kim, Mykola Lavreniuk, Rui Li, Qing Mao, Jiang Wu, Yu Zhu, Jinqiu Sun, Yanning Zhang, Suraj Patni, Aradhye Agarwal, Chetan Arora, et al. (16 additional authors not shown)

Abstract: This paper discusses the results of the third edition of the Monocular Depth Estimation Challenge (MDEC). The challenge focuses on zero-shot generalization to the challenging SYNS-Patches dataset, featuring complex scenes in natural and indoor settings. As with the previous edition, methods can use any form of supervision, i.e. supervised or self-supervised. The challenge received a total of 19 submissions outperforming the baseline on the test set: 10 of them submitted a report describing their approach, highlighting the widespread use of foundational models such as Depth Anything at the core of their method. The challenge winners drastically improved 3D F-Score performance, from 17.51% to 23.72%.

Submitted 27 April, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

Comments: To appear in CVPRW2024
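
The abstract above reports a 3D F-Score without defining it. As a rough illustration only, the sketch below computes a commonly used pointcloud F-Score: the harmonic mean of precision (predicted points that lie near a ground-truth point) and recall (ground-truth points that lie near a predicted point). The function name, the brute-force nearest-neighbour search and the 0.1 distance threshold are assumptions made for this example; the challenge's actual evaluation protocol is not given in this listing.

```python
import numpy as np


def pointcloud_f_score(pred, gt, threshold=0.1):
    """F-Score between two (N, 3) point sets at a distance threshold.

    Hypothetical helper: the threshold and the brute-force search are
    illustrative assumptions, not the challenge's evaluation code.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Pairwise Euclidean distances, shape (len(pred), len(gt)).
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    # Precision: fraction of predicted points close to some ground-truth point.
    precision = float(np.mean(dists.min(axis=1) < threshold))
    # Recall: fraction of ground-truth points close to some predicted point.
    recall = float(np.mean(dists.min(axis=0) < threshold))
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

The brute-force distance matrix keeps the sketch short but scales quadratically with the number of points; a spatial index such as a k-d tree would be the usual choice for real pointclouds.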

arXiv:2404.12103 [pdf, other]

S3R-Net: A Single-Stage Approach to Self-Supervised Shadow Removal

Authors: Nikolina Kubiak, Armin Mustafa, Graeme Phillipson, Stephen Jolly, Simon Hadfield

Abstract: In this paper we present S3R-Net, the Self-Supervised Shadow Removal Network. The two-branch WGAN model achieves self-supervision relying on the unify-and-adapt phenomenon - it unifies the style of the output data and infers its characteristics from a database of unaligned shadow-free reference images. This approach stands in contrast to the large body of supervised frameworks. S3R-Net also differentiates itself from the few existing self-supervised models operating in a cycle-consistent manner, as it is a non-cyclic, unidirectional solution. The proposed framework achieves comparable numerical scores to recent self-supervised shadow removal models while exhibiting superior qualitative performance and keeping the computational cost low.

Submitted 18 April, 2024; originally announced April 2024.

Comments: NTIRE workshop @ CVPR 2024. Code & models available at https://github.com/n-kubiak/S3R-Net

arXiv:2403.01569 [pdf, other]

Kick Back & Relax++: Scaling Beyond Ground-Truth Depth with SlowTV & CribsTV

Authors: Jaime Spencer, Chris Russell, Simon Hadfield, Richard Bowden

Abstract: Self-supervised learning is the key to unlocking generic computer vision systems. By eliminating the reliance on ground-truth annotations, it allows scaling to much larger data quantities. Unfortunately, self-supervised monocular depth estimation (SS-MDE) has been limited by the absence of diverse training data. Existing datasets have focused exclusively on urban driving in densely populated cities, resulting in models that fail to generalize beyond this domain. To address these limitations, this paper proposes two novel datasets: SlowTV and CribsTV. These are large-scale datasets curated from publicly available YouTube videos, containing a total of 2M training frames. They offer an incredibly diverse set of environments, ranging from snowy forests to coastal roads, luxury mansions and even underwater coral reefs. We leverage these datasets to tackle the challenging task of zero-shot generalization, outperforming every existing SS-MDE approach and even some state-of-the-art supervised methods. The generalization capabilities of our models are further enhanced by a range of components and contributions: 1) learning the camera intrinsics, 2) a stronger augmentation regime targeting aspect ratio changes, 3) support frame randomization, 4) flexible motion estimation, 5) a modern transformer-based architecture. We demonstrate the effectiveness of each component in extensive ablation experiments. To facilitate the development of future research, we make the datasets, code and pretrained models available to the public at https://github.com/jspenmar/slowtv_monodepth.

Submitted 3 March, 2024; originally announced March 2024.

arXiv:2312.15363 [pdf, other]

BEV-CV: Birds-Eye-View Transform for Cross-View Geo-Localisation

Authors: Tavis Shore, Simon Hadfield, Oscar Mendez

Abstract: Cross-view image matching for geo-localisation is a challenging problem due to the significant visual difference between aerial and ground-level viewpoints. The method provides localisation capabilities from geo-referenced images, eliminating the need for external devices or costly equipment. This enhances the capacity of agents to autonomously determine their position, navigate, and operate effectively in environments where GPS signals are unavailable. Current research employs a variety of techniques to reduce the domain gap such as applying polar transforms to aerial images or synthesising between perspectives. However, these approaches generally rely on having a 360° field of view, limiting real-world feasibility. We propose BEV-CV, an approach which introduces two key novelties. Firstly we bring ground-level images into a semantic Birds-Eye-View before matching embeddings, allowing for direct comparison with aerial segmentation representations. Secondly, we introduce the use of a Normalised Temperature-scaled Cross Entropy Loss to the sub-field, achieving faster convergence than with the standard triplet loss. BEV-CV achieves state-of-the-art recall accuracies, improving feature extraction Top-1 rates by more than 300%, and Top-1% rates by approximately 150% for 70° crops, and for orientation-aware application we achieve a 35% Top-1 accuracy increase with 70° crops.

Submitted 23 December, 2023; originally announced December 2023.

Comments: 8 pages, 6 figures
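
The abstract above names a Normalised Temperature-scaled Cross Entropy (NT-Xent) loss as the replacement for the triplet loss. The sketch below is not the authors' implementation; it is a minimal NumPy illustration of a symmetric cross-view variant, in which row i of each embedding matrix is assumed to form a matched ground/aerial pair and every other row in the batch acts as a negative. The original SimCLR formulation of NT-Xent also uses same-view negatives, which this simplified version omits. The function name, argument names and default temperature are assumptions made for the example.

```python
import numpy as np


def nt_xent_loss(ground_emb, aerial_emb, temperature=0.1):
    """Symmetric temperature-scaled cross entropy over matched pairs.

    Assumes ground_emb[i] and aerial_emb[i] are a positive pair and all
    other rows in the batch are negatives (illustrative assumption).
    """
    ground_emb = np.asarray(ground_emb, dtype=float)
    aerial_emb = np.asarray(aerial_emb, dtype=float)
    # L2-normalise so that dot products become cosine similarities.
    g = ground_emb / np.linalg.norm(ground_emb, axis=1, keepdims=True)
    a = aerial_emb / np.linalg.norm(aerial_emb, axis=1, keepdims=True)
    # Similarity logits, shape (N, N); the diagonal holds the positives.
    logits = g @ a.T / temperature

    def diagonal_cross_entropy(scores):
        # Subtract the row maximum for numerical stability before softmax.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -float(np.mean(np.diag(log_prob)))

    # Average the ground-to-aerial and aerial-to-ground directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

Unlike a triplet loss, which compares one positive against one negative per anchor, this formulation contrasts each positive against every other item in the batch at once, with the temperature controlling how sharply hard negatives are weighted.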

arXiv:2311.18491 [pdf, other]

ZeST-NeRF: Using temporal aggregation for Zero-Shot Temporal NeRFs

Authors: Violeta Menéndez González, Andrew Gilbert, Graeme Phillipson, Stephen Jolly, Simon Hadfield

Abstract: In the field of media production, video editing techniques play a pivotal role. Recent approaches have had great success at performing novel view image synthesis of static scenes. But adding temporal information adds an extra layer of complexity. Previous models have focused on implicitly representing static and dynamic scenes using NeRF. These models achieve impressive results but are costly at training and inference time. They overfit an MLP to describe the scene implicitly as a function of position. This paper proposes ZeST-NeRF, a new approach that can produce temporal NeRFs for new scenes without retraining. We can accurately reconstruct novel views using multi-view synthesis techniques and scene flow-field estimation, trained only with unrelated scenes. We demonstrate how existing state-of-the-art approaches from a range of fields cannot adequately solve this new task and demonstrate the efficacy of our solution. The resulting network improves quantitatively by 15% and produces significantly better visual results.

Submitted 30 November, 2023; originally announced November 2023.

Comments: VUA BMVC 2023

arXiv:2309.08301 [pdf, other]

RaSpectLoc: RAman SPECTroscopy-dependent robot LOCalisation

Authors: Christopher Thomas Thirgood, Oscar Alejandro Mendez Maldonado, Chao Ling, Jonathan Storey, Simon J Hadfield

Abstract: This paper presents a new information source for supporting robot localisation: material composition. The proposed method complements the existing visual, structural, and semantic cues utilized in the literature. However, it has a distinct advantage in its ability to differentiate structurally, visually or categorically similar objects such as different doors, by using Raman spectrometers. Such devices can identify the material of objects they probe through the bonds between the material's molecules. Unlike similar sensors, such as mass spectroscopy, they do so without damaging the material or environment. In addition to introducing the first material-based localisation algorithm, this paper supports the future growth of the field by presenting a Gazebo plugin for Raman spectrometers, material sensing demonstrations, as well as the first-ever localisation data-set with benchmarks for material-based localisation. This benchmarking shows that the proposed technique results in a significant improvement over current state-of-the-art localisation techniques, achieving 16% more accurate localisation than the leading baseline.

Submitted 21 September, 2023; v1 submitted 15 September, 2023; originally announced September 2023.

Comments: 8 pages, 5 figures. This work will be presented at IROS 2023

arXiv:2308.12423 [pdf, other]

Design and execution of quantum circuits using tens of superconducting qubits and thousands of gates for dense Ising optimization problems

Authors: Filip B. Maciejewski, Stuart Hadfield, Benjamin Hall, Mark Hodson, Maxime Dupont, Bram Evert, James Sud, M. Sohaib Alam, Zhihui Wang, Stephen Jeffrey, Bhuvanesh Sundar, P. Aaron Lott, Shon Grabbe, Eleanor G. Rieffel, Matthew J. Reagor, Davide Venturelli

Abstract: We develop a hardware-efficient ansatz for variational optimization, derived from existing ansatze in the literature, that parametrizes subsets of all interactions in the Cost Hamiltonian in each layer. We treat gate orderings as a variational parameter and observe that doing so can provide significant performance boosts in experiments. We carried out experimental runs of a compilation-optimized implementation of fully-connected Sherrington-Kirkpatrick Hamiltonians on a 50-qubit linear-chain subsystem of the Rigetti Aspen-M-3 transmon processor. Our results indicate that, for the best circuit designs tested, the average performance at optimized angles and gate orderings increases with circuit depth (using more parameters), despite the presence of a high level of noise. We report performance significantly better than using a random guess oracle for circuits involving up to approximately 5000 two-qubit and approximately 5000 one-qubit native gates. We additionally discuss various takeaways of our results toward more effective utilization of current and future quantum processors for optimization.

Submitted 2 May, 2024; v1 submitted 17 August, 2023; originally announced August 2023.

Comments: v2: extended experimental results, updated references, fixed typos; v3: improved main narrations, added new experimental data and analysis, updated references, fixed typos; 15+8 pages; 3+5 figures

arXiv:2307.10713 [pdf, other]

Kick Back & Relax: Learning to Reconstruct the World by Watching SlowTV

Authors: Jaime Spencer, Chris Russell, Simon Hadfield, Richard Bowden

Abstract: Self-supervised monocular depth estimation (SS-MDE) has the potential to scale to vast quantities of data. Unfortunately, existing approaches limit themselves to the automotive domain, resulting in models incapable of generalizing to complex environments such as natural or indoor settings. To address this, we propose a large-scale SlowTV dataset curated from YouTube, containing an order of magnitude more data than existing automotive datasets. SlowTV contains 1.7M images from a rich diversity of environments, such as worldwide seasonal hiking, scenic driving and scuba diving. Using this dataset, we train an SS-MDE model that provides zero-shot generalization to a large collection of indoor/outdoor datasets. The resulting model outperforms all existing SSL approaches and closes the gap on supervised SoTA, despite using a more efficient architecture. We additionally introduce a collection of best-practices to further maximize performance and zero-shot generalization. This includes 1) aspect ratio augmentation, 2) camera intrinsic estimation, 3) support frame randomization and 4) flexible motion estimation. Code is available at https://github.com/jspenmar/slowtv_monodepth.

Submitted 20 July, 2023; originally announced July 2023.

Comments: Accepted to ICCV2023

arXiv:2304.07051 [pdf, other]

The Second Monocular Depth Estimation Challenge

Authors: Jaime Spencer, C. Stella Qian, Michaela Trescakova, Chris Russell, Simon Hadfield, Erich W. Graf, Wendy J. Adams, Andrew J. Schofield, James Elder, Richard Bowden, Ali Anwar, Hao Chen, Xiaozhi Chen, Kai Cheng, Yuchao Dai, Huynh Thai Hoa, Sadat Hossain, Jianmian Huang, Mohan Jing, Bo Li, Chao Li, Baojun Li, Zhiwen Liu, Stefano Mattoccia, Siegfried Mercelis, et al. (18 additional authors not shown)

Abstract: This paper discusses the results for the second edition of the Monocular Depth Estimation Challenge (MDEC). This edition was open to methods using any form of supervision, including fully-supervised, self-supervised, multi-task or proxy depth. The challenge was based around the SYNS-Patches dataset, which features a wide diversity of environments with high-quality dense ground-truth. This includes complex natural environments, e.g. forests or fields, which are greatly underrepresented in current benchmarks. The challenge received eight unique submissions that outperformed the provided SotA baseline on any of the pointcloud- or image-based metrics. The top supervised submission improved relative F-Score by 27.62%, while the top self-supervised improved it by 16.61%. Supervised submissions generally leveraged large collections of datasets to improve data diversity. Self-supervised submissions instead updated the network architecture and pretrained backbones. These results represent significant progress in the field, while highlighting avenues for future research, such as reducing interpolation artifacts at depth boundaries, improving self-supervised indoor performance and overall natural image accuracy.

Submitted 26 April, 2023; v1 submitted 14 April, 2023; originally announced April 2023.

Comments: Published at CVPRW2023

arXiv:2302.07667 [pdf, other]

CERiL: Continuous Event-based Reinforcement Learning

Authors: Celyn Walters, Simon Hadfield

Abstract: This paper explores the potential of event cameras to enable continuous time reinforcement learning. We formalise this problem where a continuous stream of unsynchronised observations is used to produce a corresponding stream of output actions for the environment. This lack of synchronisation enables greatly enhanced reactivity. We present a method to train on event streams derived from standard RL environments, thereby solving the proposed continuous time RL problem. The CERiL algorithm uses specialised network layers which operate directly on an event stream, rather than aggregating events into quantised image frames. We show the advantages of event streams over less-frequent RGB images. The proposed system outperforms networks typically used in RL, even succeeding at tasks which cannot be solved traditionally. We also demonstrate the value of our CERiL approach over a standard SNN baseline using event streams.

Submitted 15 February, 2023; originally announced February 2023.

Comments: 9 pages, 10 figures

arXiv:2211.12174 [pdf, other]

The Monocular Depth Estimation Challenge

Authors: Jaime Spencer, C. Stella Qian, Chris Russell, Simon Hadfield, Erich Graf, Wendy Adams, Andrew J. Schofield, James Elder, Richard Bowden, Heng Cong, Stefano Mattoccia, Matteo Poggi, Zeeshan Khan Suri, Yang Tang, Fabio Tosi, Hao Wang, Youmin Zhang, Yusheng Zhang, Chaoqiang Zhao

Abstract: This paper summarizes the results of the first Monocular Depth Estimation Challenge (MDEC) organized at WACV2023. This challenge evaluated the progress of self-supervised monocular depth estimation on the challenging SYNS-Patches dataset. The challenge was organized on CodaLab and received submissions from 4 valid teams. Participants were provided a devkit containing updated reference implementations for 16 State-of-the-Art algorithms and 4 novel techniques. The threshold for acceptance for novel techniques was to outperform every one of the 16 SotA baselines. All participants outperformed the baseline in traditional metrics such as MAE or AbsRel. However, pointcloud reconstruction metrics were challenging to improve upon. We found predictions were characterized by interpolation artefacts at object boundaries and errors in relative object positioning. We hope this challenge is a valuable contribution to the community and encourage authors to participate in future editions.

Submitted 22 November, 2022; originally announced November 2022.

Comments: WACV-Workshops 2023

arXiv:2211.07301 [pdf, other]

SVS: Adversarial refinement for sparse novel view synthesis

Authors: Violeta Menéndez González, Andrew Gilbert, Graeme Phillipson, Stephen Jolly, Simon Hadfield

Abstract: This paper proposes Sparse View Synthesis. This is a view synthesis problem where the number of reference views is limited, and the baseline between target and reference view is significant. Under these conditions, current radiance field methods fail catastrophically due to inescapable artifacts such as 3D floating blobs, blurring and structural duplication, whenever the number of reference views is limited, or the target view diverges significantly from the reference views. Advances in network architecture and loss regularisation are unable to satisfactorily remove these artifacts. The occlusions within the scene ensure that the true contents of these regions are simply not available to the model. In this work, we instead focus on hallucinating plausible scene contents within such regions. To this end we unify radiance field models with adversarial learning and perceptual losses. The resulting system provides up to 60% improvement in perceptual accuracy compared to current state-of-the-art radiance field models on this problem.

Submitted 14 November, 2022; originally announced November 2022.

Comments: BMVC 2022

arXiv:2209.04362 [pdf, other]

EDeNN: Event Decay Neural Networks for low latency vision

Authors: Celyn Walters, Simon Hadfield

Abstract: Despite the success of neural networks in computer vision tasks, digital 'neurons' are a very loose approximation of biological neurons. Today's learning approaches are designed to function on digital devices with digital data representations such as image frames. In contrast, biological vision systems are generally much more capable and efficient than state-of-the-art digital computer vision algorithms. Event cameras are an emerging sensor technology which imitates biological vision with asynchronously firing pixels, eschewing the concept of the image frame. To leverage modern learning techniques, many event-based algorithms are forced to accumulate events back to image frames, somewhat squandering the advantages of event cameras. We follow the opposite paradigm and develop a new type of neural network which operates closer to the original event data stream. We demonstrate state-of-the-art performance in angular velocity regression and competitive optical flow estimation, while avoiding difficulties related to training SNNs. Furthermore, the processing latency of our proposed approach is less than 1/10 that of any other implementation, while continuous inference increases this improvement by another order of magnitude.

Submitted 9 May, 2023; v1 submitted 9 September, 2022; originally announced September 2022.

Comments: 14 pages, 5 figures

arXiv:2208.01489 [pdf, other]

Deconstructing Self-Supervised Monocular Reconstruction: The Design Decisions that Matter

Authors: Jaime Spencer, Chris Russell, Simon Hadfield, Richard Bowden

Abstract: This paper presents an open and comprehensive framework to systematically evaluate state-of-the-art contributions to self-supervised monocular depth estimation. This includes pretraining, backbone, architectural design choices and loss functions. Many papers in this field claim novelty in either architecture design or loss formulation. However, simply updating the backbone of historical systems results in relative improvements of 25%, allowing them to outperform the majority of existing systems. A systematic evaluation of papers in this field was not straightforward. The need to compare like-with-like in previous papers means that longstanding errors in the evaluation protocol are ubiquitous in the field. It is likely that many papers were not only optimized for particular datasets, but also for errors in the data and evaluation criteria. To aid future research in this area, we release a modular codebase (https://github.com/jspenmar/monodepth_benchmark), allowing for easy evaluation of alternate design decisions against corrected data and evaluation criteria. We re-implement, validate and re-evaluate 16 state-of-the-art contributions and introduce a new dataset (SYNS-Patches) containing dense outdoor depth maps in a variety of both natural and urban scenes. This allows for the computation of informative metrics in complex regions such as depth boundaries.

Submitted 21 December, 2022; v1 submitted 2 August, 2022; originally announced August 2022.

Comments: https://github.com/jspenmar/monodepth_benchmark

Journal ref: Transactions of Machine Learning Research 2022

arXiv:2207.13460 [pdf, other]

Adaptive sampling for scanning pixel cameras

Authors: Yusuf Duman, Jean-Yves Guillemaut, Simon Hadfield

Abstract: A scanning pixel camera is a novel low-cost, low-power sensor that is not diffraction limited. It produces data as a sequence of samples extracted from various parts of the scene during the course of a scan. It can provide very detailed images at the expense of sample rate and slow image acquisition time. This paper proposes a new algorithm which allows the sensor to adapt the sample rate over the course of this sequence. This makes it possible to overcome some of these limitations by minimising the bandwidth and time required to image and transmit a scene, while maintaining image quality. We examine applications to image classification and semantic segmentation and are able to achieve similar results compared to a fully sampled input, while using 80% fewer samples.

Submitted 1 August, 2022; v1 submitted 27 July, 2022; originally announced July 2022.

arXiv:2205.07716 [pdf, other]

Generalizing to New Tasks via One-Shot Compositional Subgoals

Authors: Xihan Bian, Oscar Mendez, Simon Hadfield

Abstract: The ability to generalize to previously unseen tasks with little to no supervision is a key challenge in modern machine learning research. It is also a cornerstone of a future "General AI". Any artificially intelligent agent deployed in a real-world application must adapt on the fly to unknown environments. Researchers often rely on reinforcement and imitation learning to provide online adaptation to new tasks, through trial and error learning. However, this can be challenging for complex tasks which require many timesteps or large numbers of subtasks to complete. These "long horizon" tasks suffer from sample inefficiency and can require extremely long training times before the agent can learn to perform the necessary long-term planning. In this work, we introduce CASE, which attempts to address these issues by training an Imitation Learning agent using adaptive "near future" subgoals. These subgoals are recalculated at each step using compositional arithmetic in a learned latent representation space. In addition to improving learning efficiency for standard long-term tasks, this approach also makes it possible to perform one-shot generalization to previously unseen tasks, given only a single reference trajectory for the task in a different environment. Our experiments show that the proposed approach consistently outperforms the previous state-of-the-art compositional Imitation Learning approach by 30%.

Submitted 25 July, 2022; v1 submitted 16 May, 2022; originally announced May 2022.

Comments: Presented at ICRA 2022 "Compositional Robotics: Mathematics and Tools"

arXiv:2205.07014 [pdf, other]

SaiNet: Stereo aware inpainting behind objects with generative networks

Authors: Violeta Menéndez González, Andrew Gilbert, Graeme Phillipson, Stephen Jolly, Simon Hadfield

Abstract: In this work, we present an end-to-end network for stereo-consistent image inpainting with the objective of inpainting large missing regions behind objects. The proposed model consists of an edge-guided UNet-like network using Partial Convolutions. We enforce multi-view stereo consistency by introducing a disparity loss. More importantly, we develop a training scheme where the model is learned from realistic stereo masks representing object occlusions, instead of the more common random masks. The technique is trained in a supervised way. Our evaluation shows competitive results compared to previous state-of-the-art techniques.

Submitted 14 May, 2022; originally announced May 2022.

Comments: Presented at AI4CC workshop at CVPR

arXiv:2205.03130 [pdf, other]

SKILL-IL: Disentangling Skill and Knowledge in Multitask Imitation Learning

Authors: Bian Xihan, Oscar Mendez, Simon Hadfield

Abstract: In this work, we introduce a new perspective for learning transferable content in multi-task imitation learning. Humans are able to transfer skills and knowledge. If we can cycle to work and drive to the store, we can also cycle to the store and drive to work. We take inspiration from this and hypothesize that the latent memory of a policy network can be disentangled into two partitions. These contain either the knowledge of the environmental context for the task or the generalizable skill needed to solve the task. This allows improved training efficiency and better generalization over previously unseen combinations of skills in the same environment, and the same task in unseen environments. We used the proposed approach to train a disentangled agent for two different multi-task IL environments. In both cases we outperformed the SOTA by 30% in task success rate. We also demonstrated this for navigation on a real robot.

Submitted 26 July, 2022; v1 submitted 6 May, 2022; originally announced May 2022.

Comments: Submitted to IROS 2022, under review

arXiv:2204.05698 [pdf, other]

Medusa: Universal Feature Learning via Attentional Multitasking

Authors: Jaime Spencer, Richard Bowden, Simon Hadfield

Abstract: Recent approaches to multi-task learning (MTL) have focused on modelling connections between tasks at the decoder level. This leads to a tight coupling between tasks, which need retraining if a new task is inserted or removed. We argue that MTL is a stepping stone towards universal feature learning (UFL), which is the ability to learn generic features that can be applied to new tasks without retraining. We propose Medusa to realize this goal, designing task heads with dual attention mechanisms. The shared feature attention masks relevant backbone features for each task, allowing it to learn a generic representation. Meanwhile, a novel Multi-Scale Attention head allows the network to better combine per-task features from different scales when making the final prediction. We show the effectiveness of Medusa in UFL (+13.18% improvement), while maintaining MTL performance and being 25% more efficient than previous approaches.

Submitted 12 April, 2022; originally announced April 2022.

Comments: Accepted @ CVPRW 2022 (CLVision, 3rd Edition)

arXiv:2203.14432 [pdf, other]

doi 10.22331/q-2023-09-14-1111

Encoding trade-offs and design toolkits in quantum algorithms for discrete optimization: coloring, routing, scheduling, and other problems

Authors: Nicolas PD Sawaya, Albert T Schmitz, Stuart Hadfield

Abstract: Challenging combinatorial optimization problems are ubiquitous in science and engineering. Several quantum methods for optimization have recently been developed, in different settings including both exact and approximate solvers. Addressing this field of research, this manuscript has three distinct purposes. First, we present an intuitive method for synthesizing and analyzing discrete (i.e., integer-based) optimization problems, wherein the problem and corresponding algorithmic primitives are expressed using a discrete quantum intermediate representation (DQIR) that is encoding-independent. This compact representation often allows for more efficient problem compilation, automated analyses of different encoding choices, easier interpretability, more complex runtime procedures, and richer programmability, as compared to previous approaches, which we demonstrate with a number of examples. Second, we perform numerical studies comparing several qubit encodings; the results exhibit a number of preliminary trends that help guide the choice of encoding for a particular set of hardware and a particular problem and algorithm. Our study includes problems related to graph coloring, the traveling salesperson problem, factory/machine scheduling, financial portfolio rebalancing, and integer linear programming. Third, we design low-depth graph-derived partial mixers (GDPMs) up to 16-level quantum variables, demonstrating that compact (binary) encodings are more amenable to QAOA than previously understood. We expect this toolkit of programming abstractions and low-level building blocks to aid in designing quantum algorithms for discrete combinatorial problems.

Submitted 8 September, 2023; v1 submitted 27 March, 2022; originally announced March 2022.

Comments: 48 pages; 11 figures; Accepted to Quantum Journal

Journal ref: Quantum 7, 1111 (2023)

arXiv:2110.12914 [pdf, other]

SILT: Self-supervised Lighting Transfer Using Implicit Image Decomposition

Authors: Nikolina Kubiak, Armin Mustafa, Graeme Phillipson, Stephen Jolly, Simon Hadfield

Abstract: We present SILT, a Self-supervised Implicit Lighting Transfer method. Unlike previous research on scene relighting, we do not seek to apply arbitrary new lighting configurations to a given scene. Instead, we wish to transfer the lighting style from a database of other scenes, to provide a uniform lighting style regardless of the input. The solution operates as a two-branch network that first aims to map input images of any arbitrary lighting style to a unified domain, with extra guidance achieved through implicit image decomposition. We then remap this unified input domain using a discriminator that is presented with the generated outputs and the style reference, i.e. images of the desired illumination conditions. Our method is shown to outperform supervised relighting solutions across two different datasets without requiring lighting supervision.

Submitted 15 March, 2022; v1 submitted 25 October, 2021; originally announced October 2021.

Comments: Accepted to BMVC 2021. The code and pre-trained models can be found at https://github.com/n-kubiak/SILT

arXiv:2109.10833 [pdf, other]

doi 10.22331/q-2022-07-07-757

Bounds on approximating Max $k$XOR with quantum and classical local algorithms

Authors: Kunal Marwaha, Stuart Hadfield

Abstract: We consider the power of local algorithms for approximately solving Max $k$XOR, a generalization of two constraint satisfaction problems previously studied with classical and quantum algorithms (MaxCut and Max E3LIN2). In Max $k$XOR each constraint is the XOR of exactly $k$ variables and a parity bit. On instances with either random signs (parities) or no overlapping clauses and $D+1$ clauses per variable, we calculate the expected satisfying fraction of the depth-1 QAOA from Farhi et al [arXiv:1411.4028] and compare with a generalization of the local threshold algorithm from Hirvonen et al [arXiv:1402.2543]. Notably, the quantum algorithm outperforms the threshold algorithm for $k > 4$. On the other hand, we highlight potential difficulties for the QAOA to achieve computational quantum advantage on this problem. We first compute a tight upper bound on the maximum satisfying fraction of nearly all large random regular Max $k$XOR instances by numerically calculating the ground state energy density $P(k)$ of a mean-field $k$-spin glass [arXiv:1606.02365]. The upper bound grows with $k$ much faster than the performance of both one-local algorithms. We also identify a new obstruction result for low-depth quantum circuits (including the QAOA) when $k=3$, generalizing a result of Bravyi et al [arXiv:1910.08980] when $k=2$. We conjecture that a similar obstruction exists for all $k$.

Submitted 30 June, 2022; v1 submitted 22 September, 2021; originally announced September 2021.

Comments: 21+4 pages, 6 figures, code online at https://nbviewer.jupyter.org/github/marwahaha/QuAIL-2021/blob/main/maxkxor.ipynb and https://nbviewer.jupyter.org/github/marwahaha/QuAIL-2021/blob/main/parisi.ipynb

Journal ref: Quantum 6, 757 (2022)

arXiv:2109.10658 [pdf, other]

TACTIC: Joint Rate-Distortion-Accuracy Optimisation for Low Bitrate Compression

Authors: Nikolina Kubiak, Simon Hadfield

Abstract: We present TACTIC: Task-Aware Compression Through Intelligent Coding. Our lossy compression model learns based on the rate-distortion-accuracy trade-off for a specific task. By considering what information is important for the follow-on problem, the system trades off visual fidelity for good task performance at a low bitrate. When compared against JPEG at the same bitrate, our approach is able to improve the accuracy of ImageNet subset classification by 4.5%. We also demonstrate the applicability of our approach to other problems, providing improvements of 3.4% in accuracy and 4.9% in mean IoU over task-agnostic compression for semantic segmentation.

Submitted 22 September, 2021; originally announced September 2021.

arXiv:2109.00405 [pdf, other]

doi 10.1109/IROS51168.2021.9636327

EVReflex: Dense Time-to-Impact Prediction for Event-based Obstacle Avoidance

Authors: Celyn Walters, Simon Hadfield

Abstract: The broad scope of obstacle avoidance has led to many kinds of computer vision-based approaches. Despite its popularity, it is not a solved problem. Traditional computer vision techniques using cameras and depth sensors often focus on static scenes, or rely on priors about the obstacles. Recent developments in bio-inspired sensors present event cameras as a compelling choice for dynamic scenes. Although these sensors have many advantages over their frame-based counterparts, such as high dynamic range and temporal resolution, event-based perception has largely remained in 2D. This often leads to solutions reliant on heuristics and specific to a particular task. We show that the fusion of events and depth overcomes the failure cases of each individual modality when performing obstacle avoidance. Our proposed approach unifies event camera and lidar streams to estimate metric time-to-impact without prior knowledge of the scene geometry or obstacles. In addition, we release an extensive event-based dataset with six visual streams spanning over 700 scanned scenes.

Submitted 1 September, 2021; originally announced September 2021.

Comments: To be published in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2021

arXiv:2106.01434 [pdf, other]

Robot in a China Shop: Using Reinforcement Learning for Location-Specific Navigation Behaviour

Authors: Xihan Bian, Oscar Mendez, Simon Hadfield

Abstract: Robots need to be able to work in multiple different environments. Even when performing similar tasks, different behaviour should be deployed to best fit the current environment. In this paper, we propose a new approach to navigation, where it is treated as a multi-task learning problem. This enables the robot to learn to behave differently in visual navigation tasks for different environments while also learning shared expertise across environments. We evaluated our approach both in simulated environments and on real-world data. Our method allows our system to converge with a 26% reduction in training time, while also increasing accuracy.

Submitted 2 June, 2021; originally announced June 2021.

Comments: Published at ICRA 2021

arXiv:2106.00371 [pdf, other]

Markov Localisation using Heatmap Regression and Deep Convolutional Odometry

Authors: Oscar Mendez, Simon Hadfield, Richard Bowden

Abstract: In the context of self-driving vehicles there is strong competition between approaches based on visual localisation and LiDAR. While LiDAR provides important depth information, it is sparse in resolution and expensive. On the other hand, cameras are low-cost and recent developments in deep learning mean they can provide high localisation performance. However, several fundamental problems remain, particularly in the domain of uncertainty, where learning-based approaches can be notoriously over-confident. Markov, or grid-based, localisation was an early solution to the localisation problem but fell out of favour due to its computational complexity. Representing the likelihood field as a grid (or volume) means there is a trade-off between accuracy and memory size. Furthermore, it is necessary to perform expensive convolutions across the entire likelihood volume. Despite the benefit of simultaneously maintaining a likelihood for all possible locations, grid-based approaches were superseded by more efficient particle filters and Monte Carlo Localisation (MCL). However, MCL introduces its own problems, e.g. particle deprivation. Recent advances in deep learning hardware allow large likelihood volumes to be stored directly on the GPU, along with the hardware necessary to efficiently perform GPU-bound 3D convolutions, and this obviates many of the disadvantages of grid-based methods. In this work, we present a novel CNN-based localisation approach that can leverage modern deep learning hardware. By implementing a grid-based Markov localisation approach directly on the GPU, we create a hybrid CNN that can perform image-based localisation and odometry-based likelihood propagation within a single neural network. The resulting approach is capable of outperforming direct pose regression methods as well as state-of-the-art localisation systems.

Submitted 1 June, 2021; originally announced June 2021.

Comments: IEEE International Conference on Robotics and Automation (ICRA) 2021

arXiv:2103.09641 [pdf, other]

doi 10.1109/IROS40897.2019.8968244

A Robust Extrinsic Calibration Framework for Vehicles with Unscaled Sensors

Authors: Celyn Walters, Oscar Mendez, Simon Hadfield, Richard Bowden

Abstract: Accurate extrinsic sensor calibration is essential for both autonomous vehicles and robots. Traditionally this is an involved process requiring calibration targets and known fiducial markers, and is generally performed in a lab. Moreover, even a small change in the sensor layout requires recalibration. With the anticipated arrival of consumer autonomous vehicles, there is demand for a system which can do this automatically, after deployment and without specialist human expertise. To address these limitations, we propose a flexible framework which can estimate extrinsic parameters without an explicit calibration stage, even for sensors with unknown scale. Our first contribution builds upon standard hand-eye calibration by jointly recovering scale. Our second contribution is that our system is made robust to imperfect and degenerate sensor data, by collecting independent sets of poses and automatically selecting those which are most ideal. We show that our approach's robustness is essential for the target scenario. Unlike previous approaches, ours runs in real time and constantly estimates the extrinsic transform. For both an ideal experimental setup and a real use case, comparison against these approaches shows that we outperform the state-of-the-art. Furthermore, we demonstrate that the recovered scale may be applied to the full trajectory, circumventing the need for scale estimation via sensor fusion.

Submitted 17 March, 2021; originally announced March 2021.

Journal ref: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 2019, pp. 36-42

arXiv:2103.04502 [pdf, other]

doi 10.22331/q-2021-09-28-550

Quantum-accelerated constraint programming

Authors: Kyle E. C. Booth, Bryan O'Gorman, Jeffrey Marshall, Stuart Hadfield, Eleanor Rieffel

Abstract: Constraint programming (CP) is a paradigm used to model and solve constraint satisfaction and combinatorial optimization problems. In CP, problems are modeled with constraints that describe acceptable solutions and solved with backtracking tree search augmented with logical inference. In this paper, we show how quantum algorithms can accelerate CP, at both the levels of inference and search. Leveraging existing quantum algorithms, we introduce a quantum-accelerated filtering algorithm for the $\texttt{alldifferent}$ global constraint and discuss its applicability to a broader family of global constraints with similar structure. We propose frameworks for the integration of quantum filtering algorithms within both classical and quantum backtracking search schemes, including a novel hybrid classical-quantum backtracking search method. This work suggests that CP is a promising candidate application for early fault-tolerant quantum computers and beyond.

Submitted 20 September, 2021; v1 submitted 7 March, 2021; originally announced March 2021.

Comments: published in Quantum

Journal ref: Quantum 5, 550 (2021)

arXiv:2012.04713 [pdf, other]

doi 10.1007/s11128-021-03298-4

Classical symmetries and the Quantum Approximate Optimization Algorithm

Authors: Ruslan Shaydulin, Stuart Hadfield, Tad Hogg, Ilya Safro

Abstract: We study the relationship between the Quantum Approximate Optimization Algorithm (QAOA) and the underlying symmetries of the objective function to be optimized. Our approach formalizes the connection between quantum symmetry properties of the QAOA dynamics and the group of classical symmetries of the objective function. The connection is general and includes but is not limited to problems defined on graphs. We show a series of results exploring the connection and highlight examples of hard problem classes where a nontrivial symmetry subgroup can be obtained efficiently. In particular we show how classical objective function symmetries lead to invariant measurement outcome probabilities across states connected by such symmetries, independent of the choice of algorithm parameters or number of layers. To illustrate the power of the developed connection, we apply machine learning techniques towards predicting QAOA performance based on symmetry considerations. We provide numerical evidence that a small set of graph symmetry properties suffices to predict the minimum QAOA depth required to achieve a target approximation ratio on the MaxCut problem, in a practically important setting where QAOA parameter schedules are constrained to be linear and hence easier to optimize.

Submitted 27 October, 2021; v1 submitted 8 December, 2020; originally announced December 2020.

Journal ref: Quantum Inf Process 20, 359 (2021)

arXiv:2011.05064 [pdf, other]

What Did You Think Would Happen? Explaining Agent Behaviour Through Intended Outcomes

Authors: Herman Yau, Chris Russell, Simon Hadfield

Abstract: We present a novel form of explanation for Reinforcement Learning, based around the notion of intended outcome. These explanations describe the outcome an agent is trying to achieve by its actions. We provide a simple proof that general methods for post-hoc explanations of this nature are impossible in traditional reinforcement learning. Rather, the information needed for the explanations must be collected in conjunction with training the agent. We derive approaches designed to extract local explanations based on intention for several variants of Q-function approximation and prove consistency between the explanations and the Q-values learned. We demonstrate our method on multiple reinforcement learning problems, and provide code to help researchers introspect their RL environments and algorithms.

Submitted 10 November, 2020; originally announced November 2020.

arXiv:2010.01668 [pdf, other]

Diagonal Memory Optimisation for Machine Learning on Micro-controllers

Authors: Peter Blacker, Christopher Paul Bridges, Simon Hadfield

Abstract: As machine learning spreads into more and more application areas, micro-controllers and low-power CPUs are increasingly being used to perform inference with machine learning models. The capability to deploy onto these limited hardware targets is enabling machine learning models to be used across a diverse range of new domains. Optimising the inference process on these targets poses different challenges from either desktop CPU or GPU implementations, as the small amounts of RAM available on these targets set limits on the size of models which can be executed. Analysis of the memory use patterns of eleven machine learning models was performed. Memory load and store patterns were observed using a modified version of the Valgrind debugging tool, identifying memory areas holding values necessary for the calculation as inference progressed. These analyses identified opportunities to optimise the memory use of these models by overlapping the input and output buffers of individual tensor operations. Three methods are presented which can calculate the safe overlap of input and output buffers for tensor operations, ranging from a computationally expensive approach with the ability to operate on compiled layer operations, to a versatile analytical solution which requires access to the original source code of the layer. The diagonal memory optimisation technique is described and shown to achieve memory savings of up to 34.5% when applied to eleven common models. Micro-controller targets are identified where it is only possible to deploy some models if diagonal memory optimisation is used.

Submitted 16 November, 2020; v1 submitted 4 October, 2020; originally announced October 2020.

Comments: 10 pages, journal pre-print

arXiv:2009.00299 [pdf, other]

Multi-channel Transformers for Multi-articulatory Sign Language Translation

Authors: Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, Richard Bowden

Abstract: Sign languages use multiple asynchronous information channels (articulators), not just the hands but also the face and body, which computational approaches often ignore. In this paper we tackle the multi-articulatory sign language translation task and propose a novel multi-channel transformer architecture. The proposed architecture allows both the inter and intra contextual relationships between different sign articulators to be modelled within the transformer network itself, while also maintaining channel specific information. We evaluate our approach on the RWTH-PHOENIX-Weather-2014T dataset and report competitive translation performance. Importantly, we overcome the reliance on gloss annotations which underpin other state-of-the-art approaches, thereby removing future need for expensive curated datasets.

Submitted 1 September, 2020; originally announced September 2020.

arXiv:2003.13830 [pdf, other]

Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation

Authors: Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, Richard Bowden

Abstract: Prior work on Sign Language Translation has shown that having a mid-level sign gloss representation (effectively recognizing the individual signs) improves the translation performance drastically. In fact, the current state-of-the-art in translation requires gloss level tokenization in order to work. We introduce a novel transformer based architecture that jointly learns Continuous Sign Language Recognition and Translation while being trainable in an end-to-end manner. This is achieved by using a Connectionist Temporal Classification (CTC) loss to bind the recognition and translation problems into a single unified architecture. This joint approach does not require any ground-truth timing information, simultaneously solves two co-dependent sequence-to-sequence learning problems, and leads to significant performance gains. We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset. We report state-of-the-art sign language recognition and translation results achieved by our Sign Language Transformers. Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models, in some cases more than doubling the performance (9.58 vs. 21.80 BLEU-4 Score). We also share new baseline translation results using transformer networks for several other text-to-text sign language translation tasks.

Submitted 30 March, 2020; originally announced March 2020.

arXiv:2003.13446 [pdf, other]

DeFeat-Net: General Monocular Depth via Simultaneous Unsupervised Representation Learning

Authors: Jaime Spencer, Richard Bowden, Simon Hadfield

Abstract: In the current monocular depth research, the dominant approach is to employ unsupervised training on large datasets, driven by warped photometric consistency. Such approaches lack robustness and are unable to generalize to challenging domains such as nighttime scenes or adverse weather conditions where assumptions about photometric consistency break down. We propose DeFeat-Net (Depth & Feature network), an approach to simultaneously learn a cross-domain dense feature representation, alongside a robust depth-estimation framework based on warped feature consistency. The resulting feature representation is learned in an unsupervised manner with no explicit ground-truth correspondences required. We show that within a single domain, our technique is comparable to both the current state of the art in monocular depth estimation and supervised feature representation learning. However, by simultaneously learning features, depth and motion, our technique is able to generalize to challenging domains, allowing DeFeat-Net to outperform the current state-of-the-art with around 10% reduction in all error measures on more challenging sequences such as nighttime driving.

Submitted 30 March, 2020; originally announced March 2020.

arXiv:2003.13431 [pdf, other]

Same Features, Different Day: Weakly Supervised Feature Learning for Seasonal Invariance

Authors: Jaime Spencer, Richard Bowden, Simon Hadfield

Abstract: "Like night and day" is a commonly used expression to imply that two things are completely different. Unfortunately, this tends to be the case for current visual feature representations of the same scene across varying seasons or times of day. The aim of this paper is to provide a dense feature representation that can be used to perform localization, sparse matching or image retrieval, regardless of the current seasonal or temporal appearance. Recently, there have been several proposed methodologies for deep learning dense feature representations. These methods make use of ground truth pixel-wise correspondences between pairs of images and focus on the spatial properties of the features. As such, they don't address temporal or seasonal variation. Furthermore, obtaining the required pixel-wise correspondence data to train in cross-seasonal environments is highly complex in most scenarios. We propose Deja-Vu, a weakly supervised approach to learning season-invariant features that does not require pixel-wise ground truth data. The proposed system only requires coarse labels indicating whether two images correspond to the same location or not. From these labels, the network is trained to produce "similar" dense feature maps for corresponding locations despite environmental changes. Code will be made available at: https://github.com/jspenmar/DejaVu_Features

Submitted 30 March, 2020; originally announced March 2020.

arXiv:1908.03185 [pdf, other]

Optimizing quantum heuristics with meta-learning

Authors: Max Wilson, Sam Stromswold, Filip Wudarski, Stuart Hadfield, Norm M. Tubman, Eleanor Rieffel

Abstract: Variational quantum algorithms, a class of quantum heuristics, are promising candidates for the demonstration of useful quantum computation. Finding the best way to amplify the performance of these methods on hardware is an important task. Here, we evaluate the optimization of quantum heuristics with an existing class of techniques called 'meta-learners'. We compare the performance of a meta-learner to Bayesian optimization, evolutionary strategies, L-BFGS-B and Nelder-Mead approaches, for two quantum heuristics (quantum alternating operator ansatz and variational quantum eigensolver), on three problems, in three simulation environments. We show that the meta-learner comes near to the global optima more frequently than all other optimizers we tested in a noisy parameter setting environment. We also find that the meta-learner is generally more resistant to noise, for example seeing a smaller reduction in performance in Noisy and Sampling environments, and that it performs better on average by a 'gain' metric than its closest comparable competitor, L-BFGS-B. These results are an important indication that meta-learning and associated machine learning methods will be integral to the useful application of noisy near-term quantum computers.

Submitted 8 August, 2019; originally announced August 2019.

arXiv:1903.10427 [pdf, other]

doi 10.1109/CVPR.2019.00636

Scale-Adaptive Neural Dense Features: Learning via Hierarchical Context Aggregation

Authors: Jaime Spencer, Richard Bowden, Simon Hadfield

Abstract: How do computers and intelligent agents view the world around them? Feature extraction and representation constitute one of the basic building blocks towards answering this question. Traditionally, this has been done with carefully engineered hand-crafted techniques such as HOG, SIFT or ORB. However, there is no "one size fits all" approach that satisfies all requirements. In recent years, the rising popularity of deep learning has resulted in a myriad of end-to-end solutions to many computer vision problems. These approaches, while successful, tend to lack scalability and can't easily exploit information learned by other systems. Instead, we propose SAND features, a dedicated deep learning solution to feature extraction capable of providing hierarchical context information. This is achieved by employing sparse relative labels indicating relationships of similarity/dissimilarity between image locations. The nature of these labels results in an almost infinite set of dissimilar examples to choose from. We demonstrate how the selection of negative examples during training can be used to modify the feature space and vary its properties. To demonstrate the generality of this approach, we apply the proposed features to a multitude of tasks, each requiring different properties. This includes disparity estimation, semantic segmentation, self-localisation and SLAM. In all cases, we show how incorporating SAND features results in better or comparable results to the baseline, whilst requiring little to no additional training.
Code can be found at: https://github.com/jspenmar/SAND_features

Submitted 25 March, 2019; originally announced March 2019.

Comments: CVPR2019

arXiv:1811.07583 [pdf, other]

doi 10.1007/978-3-030-11021-5_44

Localisation via Deep Imagination: learn the features not the map

Authors: Jaime Spencer, Oscar Mendez, Richard Bowden, Simon Hadfield

Abstract: How many times does a human have to drive through the same area to become familiar with it? To begin with, we might first build a mental model of our surroundings. Upon revisiting this area, we can use this model to extrapolate to new unseen locations and imagine their appearance. Based on this, we propose an approach where an agent is capable of modelling new environments after a single visitation. To this end, we introduce "Deep Imagination", a combination of classical Visual-based Monte Carlo Localisation and deep learning. By making use of a feature-embedded 3D map, the system can "imagine" the view from any novel location. These "imagined" views are contrasted with the current observation in order to estimate the agent's current location. In order to build the embedded map, we train a deep Siamese Fully Convolutional U-Net to perform dense feature extraction. By training these features to be generic, no additional training or fine-tuning is required to adapt to new environments. Our results demonstrate the generality and transfer capability of our learnt dense features by training and evaluating on multiple datasets. Additionally, we include several visualizations of the feature representations and resulting 3D maps, as well as their application to localisation.

Submitted 19 November, 2018; originally announced November 2018.

Comments: VNAD @ ECCV2018

arXiv:1709.01500 [pdf, other]

SeDAR - Semantic Detection and Ranging: Humans can localise without LiDAR, can robots?

Authors: Oscar Mendez, Simon Hadfield, Nicolas Pugeault, Richard Bowden

Abstract: How does a person work out their location using a floorplan? It is probably safe to say that we do not explicitly measure depths to every visible surface and try to match them against different pose estimates in the floorplan. And yet, this is exactly how most robotic scan-matching algorithms operate. Similarly, we do not extrude the 2D geometry present in the floorplan into 3D and try to align it to the real world. And yet, this is how most vision-based approaches localise. Humans do the exact opposite. Instead of depth, we use high-level semantic cues. Instead of extruding the floorplan up into the third dimension, we collapse the 3D world into a 2D representation. Evidence of this is that many of the floorplans we use in everyday life are not accurate, opting instead for high levels of discriminative landmarks. In this work, we use this insight to present a global localisation approach that relies solely on the semantic labels present in the floorplan and extracted from RGB images. While our approach is able to use range measurements if available, we demonstrate that they are unnecessary as we can achieve results comparable to state-of-the-art without them.

Submitted 2 May, 2018; v1 submitted 5 September, 2017; originally announced September 2017.
