Showing 1–50 of 221 results for author: Zisserman, A

Searching in archive cs.
  1. arXiv:2404.16828  [pdf, other]

    cs.CV cs.LG

    Made to Order: Discovering monotonic temporal changes via self-supervised video ordering

    Authors: Charig Yang, Weidi Xie, Andrew Zisserman

    Abstract: Our objective is to discover and localize monotonic temporal changes in a sequence of images. To achieve this, we exploit a simple proxy task of ordering a shuffled image sequence, with 'time' serving as a supervisory signal since only changes that are monotonic with time can give rise to the correct ordering. We also introduce a flexible transformer-based model for general-purpose ordering of ima…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: Project page: https://charigyang.github.io/order/

  2. arXiv:2404.14412  [pdf, other]

    cs.CV

    AutoAD III: The Prequel -- Back to the Pixels

    Authors: Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman

    Abstract: Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and also their evaluation is hampered by using performance measures not specialized to the AD domain. In this paper, we make three c…

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: CVPR2024. Project page: https://www.robots.ox.ac.uk/~vgg/research/autoad/

  3. arXiv:2404.12389  [pdf, other]

    cs.CV

    Moving Object Segmentation: All You Need Is SAM (and Flow)

    Authors: Junyu Xie, Charig Yang, Weidi Xie, Andrew Zisserman

    Abstract: The objective of this paper is motion segmentation -- discovering and segmenting the moving objects in a video. This is a much studied area with numerous careful, and sometimes complex, approaches and training schemes including: self-supervised learning, learning from synthetic datasets, object-centric representations, amodal representations, and many more. Our interest in this paper is to determin…

    Submitted 18 April, 2024; originally announced April 2024.

    Comments: Project Page: https://www.robots.ox.ac.uk/~vgg/research/flowsam/

  4. arXiv:2404.05559  [pdf, other]

    cs.CV

    TIM: A Time Interval Machine for Audio-Visual Action Recognition

    Authors: Jacob Chalk, Jaesung Huh, Evangelos Kazakos, Andrew Zisserman, Dima Damen

    Abstract: Diverse actions give rise to rich audio-visual signals in long videos. Recent works showcase that the two modalities of audio and video exhibit different temporal extents of events and distinct labels. We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events. We propose the Time Interval Machine (TIM) where a modalit…

    Submitted 9 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024. Project Webpage: https://jacobchalk.github.io/TIM-Project

  5. arXiv:2403.12026  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    FlexCap: Generating Rich, Localized, and Flexible Captions in Images

    Authors: Debidatta Dwibedi, Vidhi Jain, Jonathan Tompson, Andrew Zisserman, Yusuf Aytar

    Abstract: We introduce a versatile flexible-captioning vision-language model (VLM) capable of generating region-specific descriptions of varying lengths. The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, and this allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To ach…

    Submitted 18 March, 2024; originally announced March 2024.

  6. arXiv:2403.10997  [pdf, other]

    cs.CV cs.AI cs.GR cs.LG

    N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields

    Authors: Yash Bhalgat, Iro Laina, João F. Henriques, Andrew Zisserman, Andrea Vedaldi

    Abstract: Understanding complex scenes at multiple levels of abstraction remains a formidable challenge in computer vision. To address this, we introduce Nested Neural Feature Fields (N2F2), a novel approach that employs hierarchical supervision to learn a single feature field, wherein different dimensions within the same high-dimensional feature encode scene properties at varying granularities. Our method…

    Submitted 16 March, 2024; originally announced March 2024.

  7. arXiv:2402.19106  [pdf, other]

    eess.AS cs.IR cs.SD

    A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval

    Authors: Andreea-Maria Oncescu, João F. Henriques, Andrew Zisserman, Samuel Albanie, A. Sophia Koepke

    Abstract: Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, they commonly are not very detailed, making them unsuited for text-audio retrieval. To exploit relevant audio in…

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: 9 pages, 2 figures, 9 tables, Accepted at ICASSP 2024

  8. arXiv:2402.00847  [pdf, other]

    cs.CV stat.ML

    BootsTAP: Bootstrapped Training for Tracking-Any-Point

    Authors: Carl Doersch, Yi Yang, Dilara Gokay, Pauline Luc, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ross Goroshin, João Carreira, Andrew Zisserman

    Abstract: To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to be able to track any point corresponding to a solid surface in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP i…

    Submitted 1 February, 2024; originally announced February 2024.

  9. arXiv:2401.16423  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Synchformer: Efficient Synchronization from Sparse Cues

    Authors: Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman

    Abstract: Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: Extended version of the ICASSP 24 paper. Project page: https://www.robots.ox.ac.uk/~vgg/research/synchformer/ Code: https://github.com/v-iashin/Synchformer

  10. arXiv:2401.12039  [pdf, other]

    cs.CV cs.SD eess.AS

    Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling

    Authors: Bruno Korbar, Jaesung Huh, Andrew Zisserman

    Abstract: The goal of this paper is automatic character-aware subtitle generation. Given a video and a minimal amount of metadata, we propose an audio-visual method that generates a full transcript of the dialogue, with precise speech timestamps, and the character speaking identified. The key idea is to first use audio-visual cues to select a set of high-precision audio exemplars for each character, and the…

    Submitted 22 January, 2024; originally announced January 2024.

    Comments: Accepted for publication in ICASSP 2024

  11. arXiv:2401.10224  [pdf, other]

    cs.CV

    The Manga Whisperer: Automatically Generating Transcriptions for Comics

    Authors: Ragav Sachdeva, Andrew Zisserman

    Abstract: In the past few decades, Japanese comics, commonly referred to as Manga, have transcended both cultural and linguistic boundaries to become a true worldwide sensation. Yet, the inherent reliance on visual cues and illustration within manga renders it largely inaccessible to individuals with visual impairments. In this work, we seek to address this substantial barrier, with the aim of ensuring that…

    Submitted 21 March, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

    Comments: Accepted at CVPR'24

  12. arXiv:2312.17247  [pdf, other]

    cs.CV

    Amodal Ground Truth and Completion in the Wild

    Authors: Guanqi Zhan, Chuanxia Zheng, Weidi Xie, Andrew Zisserman

    Abstract: This paper studies amodal image segmentation: predicting entire object segmentation masks including both visible and invisible (occluded) parts. In previous work, the amodal segmentation ground truth on real images is usually predicted by manual annotation and thus is subjective. In contrast, we use 3D data to establish an automatic pipeline to determine authentic ground truth amodal masks for part…

    Submitted 29 April, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

    Comments: CVPR 2024

  13. arXiv:2312.13090  [pdf, other]

    cs.CV

    Perception Test 2023: A Summary of the First Challenge And Outcome

    Authors: Joseph Heyward, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean

    Abstract: The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023, with the goal of benchmarking state-of-the-art video models on the recently proposed Perception Test benchmark. The challenge had six tracks covering low-level and high-level tasks, with both a language and non-language interface, across video, audio,…

    Submitted 20 December, 2023; originally announced December 2023.

  14. arXiv:2312.11897  [pdf, other]

    cs.CV

    Text-Conditioned Resampler For Long Form Video Understanding

    Authors: Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, Federico Tombari

    Abstract: In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to a LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can pro…

    Submitted 25 March, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

  15. arXiv:2312.11463  [pdf, other]

    cs.CV

    Appearance-based Refinement for Object-Centric Motion Segmentation

    Authors: Junyu Xie, Weidi Xie, Andrew Zisserman

    Abstract: The goal of this paper is to discover, segment, and track independently moving objects in complex visual scenes. Previous approaches have explored the use of optical flow for motion segmentation, leading to imperfect predictions due to partial motion, background distraction, and object articulations and interactions. To address this issue, we introduce an appearance-based refinement method that le…

    Submitted 18 December, 2023; originally announced December 2023.

    Comments: Total 26 pages, 13 figures (including main text: 9 pages, 5 figures)

  16. arXiv:2312.07395  [pdf, other]

    cs.CV cs.CL

    A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames

    Authors: Pinelopi Papalampidi, Skanda Koppula, Shreya Pathak, Justin Chiu, Joe Heyward, Viorica Patraucean, Jiajun Shen, Antoine Miech, Andrew Zisserman, Aida Nematzdeh

    Abstract: Understanding long, real-world videos requires modeling of long-range visual dependencies. To this end, we explore video-first architectures, building on the common paradigm of transferring large-scale, image--text models to video via shallow temporal fusion. However, we expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video--language alignment in stan…

    Submitted 12 December, 2023; originally announced December 2023.

  17. arXiv:2312.00598  [pdf, other]

    cs.CV cs.AI

    Learning from One Continuous Video Stream

    Authors: João Carreira, Michael King, Viorica Pătrăucean, Dilara Gokay, Cătălin Ionescu, Yi Yang, Daniel Zoran, Joseph Heyward, Carl Doersch, Yusuf Aytar, Dima Damen, Andrew Zisserman

    Abstract: We introduce a framework for online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling. This poses great challenges given the high correlation between consecutive video frames and there is very little prior work on it. Our framework allows us to do a first deep dive into the topic and includes a collection of str…

    Submitted 28 March, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

    Comments: CVPR camera ready version

  18. arXiv:2311.17055  [pdf, other]

    cs.CV cs.AI cs.IT cs.LG

    No Representation Rules Them All in Category Discovery

    Authors: Sagar Vaze, Andrea Vedaldi, Andrew Zisserman

    Abstract: In this paper we tackle the problem of Generalized Category Discovery (GCD). Specifically, given a dataset with labelled and unlabelled images, the task is to cluster all images in the unlabelled subset, whether or not they belong to the labelled categories. Our first contribution is to recognize that most existing GCD benchmarks only contain labels for a single clustering of the data, making it d…

    Submitted 28 November, 2023; originally announced November 2023.

    Comments: NeurIPS 2023

  19. arXiv:2311.09424  [pdf, other]

    cs.CV

    Predicting Spine Geometry and Scoliosis from DXA Scans

    Authors: Amir Jamaludin, Timor Kadir, Emma Clark, Andrew Zisserman

    Abstract: Our objective in this paper is to estimate spine curvature in DXA scans. To this end we first train a neural network to predict the middle spine curve in the scan, and then use an integral-based method to determine the curvature along the spine curve. We use the curvature to compare to the standard angle scoliosis measure obtained using the DXA Scoliosis Method (DSM). The performance improves over…

    Submitted 15 November, 2023; originally announced November 2023.

    Comments: CSI@MICCAI 2019 Submission

  20. arXiv:2310.16477  [pdf, other]

    cs.CV

    Show from Tell: Audio-Visual Modelling in Clinical Settings

    Authors: Jianbo Jiao, Mohammad Alsharid, Lior Drukker, Aris T. Papageorghiou, Andrew Zisserman, J. Alison Noble

    Abstract: Auditory and visual signals usually present together and correlate with each other, not only in natural environments but also in clinical settings. However, the audio-visual modelling in the latter case can be more challenging, due to the different sources of audio/video signals and the noise (both signal-level and semantic-level) in auditory signals -- usually speech. In this paper, we consider a…

    Submitted 25 October, 2023; originally announced October 2023.

  21. arXiv:2310.06838  [pdf, other]

    cs.CV

    AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description

    Authors: Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman

    Abstract: Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences. For movies, this presents notable challenges -- AD must occur only during existing pauses in dialogue, should refer to characters by name, and ought to aid understanding of the storyline as a whole. To this end, we develop a new model for auto…

    Submitted 10 October, 2023; originally announced October 2023.

    Comments: ICCV2023. Project page: https://www.robots.ox.ac.uk/vgg/research/autoad/

  22. arXiv:2310.06836  [pdf, other]

    cs.CV

    What Does Stable Diffusion Know about the 3D Scene?

    Authors: Guanqi Zhan, Chuanxia Zheng, Weidi Xie, Andrew Zisserman

    Abstract: Recent advances in generative models like Stable Diffusion enable the generation of highly photo-realistic images. Our objective in this paper is to probe the diffusion network to determine to what extent it 'understands' different properties of the 3D scene depicted in an image. To this end, we make the following contributions: (i) We introduce a protocol to evaluate whether features of an off-th…

    Submitted 4 March, 2024; v1 submitted 10 October, 2023; originally announced October 2023.

  23. arXiv:2310.05304  [pdf, other]

    cs.CV

    GestSync: Determining who is speaking without a talking head

    Authors: Sindhu B Hegde, Andrew Zisserman

    Abstract: In this paper we introduce a new synchronisation task, Gesture-Sync: determining if a person's gestures are correlated with their speech or not. In comparison to Lip-Sync, Gesture-Sync is far more challenging as there is a far looser relationship between the voice and body movement than there is between voice and lip motion. We introduce a dual-encoder model for this task, and compare a number of…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: Accepted in BMVC 2023, 10 pages paper, 7 pages supplementary, 7 Figures

  24. arXiv:2309.03899  [pdf, other]

    cs.CV

    The Making and Breaking of Camouflage

    Authors: Hala Lamdouar, Weidi Xie, Andrew Zisserman

    Abstract: Not all camouflages are equally effective, as even a partially visible contour or a slight color difference can make the animal stand out and break its camouflage. In this paper, we address the question of what makes a camouflage successful, by proposing three scores for automatically assessing its effectiveness. In particular, we show that camouflage can be measured by the similarity between back…

    Submitted 7 September, 2023; originally announced September 2023.

    Comments: ICCV 2023

  25. arXiv:2308.10417  [pdf, other]

    cs.CV

    The Change You Want to See (Now in 3D)

    Authors: Ragav Sachdeva, Andrew Zisserman

    Abstract: The goal of this paper is to detect what has changed, if anything, between two "in the wild" images of the same 3D scene acquired from different camera positions and at different temporal instances. The open-set nature of this problem, occlusions/dis-occlusions due to the shift in viewpoint, and the lack of suitable training datasets, presents substantial challenges in devising a solution. To ad…

    Submitted 11 September, 2023; v1 submitted 20 August, 2023; originally announced August 2023.

  26. arXiv:2308.07918  [pdf, other]

    cs.CV

    Helping Hands: An Object-Aware Ego-Centric Video Recognition Model

    Authors: Chuhan Zhang, Ankush Gupta, Andrew Zisserman

    Abstract: We introduce an object-aware decoder for improving the performance of spatio-temporal representations on ego-centric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object positions, and the semantic label of the objects using paired captions when available. At inference time the model only requires RGB frames as inputs, and is ab…

    Submitted 15 August, 2023; originally announced August 2023.

    Comments: ICCV2023

  27. arXiv:2307.09006  [pdf, other]

    cs.SD cs.LG eess.AS

    OxfordVGG Submission to the EGO4D AV Transcription Challenge

    Authors: Jaesung Huh, Max Bain, Andrew Zisserman

    Abstract: This report presents the technical details of our submission on the EGO4D Audio-Visual (AV) Automatic Speech Recognition Challenge 2023 from the OxfordVGG team. We present WhisperX, a system for efficient speech transcription of long-form audio with word-level time alignment, along with two text normalisers which are publicly available. Our final submission obtained 56.0% of the Word Error Rate (W…

    Submitted 18 July, 2023; originally announced July 2023.

    Comments: Technical Report

  28. arXiv:2306.08637  [pdf, other]

    cs.CV

    TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement

    Authors: Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, Andrew Zisserman

    Abstract: We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on loc…

    Submitted 30 August, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

    Comments: Published at ICCV 2023

  29. arXiv:2306.05493  [pdf, other]

    cs.CV cs.AI cs.LG

    Multi-Modal Classifiers for Open-Vocabulary Object Detection

    Authors: Prannay Kaul, Weidi Xie, Andrew Zisserman

    Abstract: The goal of this paper is open-vocabulary object detection (OVOD) -- building a model that can detect objects beyond the set of categories seen at training, thus enabling the user to specify categories of interest at inference without the need for model retraining. We adopt a standard two-stage object detector architecture, and explore three ways for specifying novel categories: via…

    Submitted 8 June, 2023; originally announced June 2023.

    Comments: ICML 2023, project page: https://www.robots.ox.ac.uk/vgg/research/mm-ovod/

    ACM Class: I.4.6; I.4.8; I.4.9; I.2.10

  30. arXiv:2306.04633  [pdf, other]

    cs.CV cs.AI cs.LG

    Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion

    Authors: Yash Bhalgat, Iro Laina, João F. Henriques, Andrew Zisserman, Andrea Vedaldi

    Abstract: Instance segmentation in 3D is a challenging task due to the lack of large-scale annotated datasets. In this paper, we show that this task can be addressed effectively by leveraging instead 2D pre-trained models for instance segmentation. We propose a novel approach to lift 2D segments to 3D and fuse them by means of a neural field representation, which encourages multi-view consistency across fra…

    Submitted 1 December, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023 (Spotlight). Code: https://github.com/yashbhalgat/Contrastive-Lift

  31. arXiv:2306.01851  [pdf, other]

    cs.CV

    Open-world Text-specified Object Counting

    Authors: Niki Amini-Naieni, Kiana Amini-Naieni, Tengda Han, Andrew Zisserman

    Abstract: Our objective is open-world object counting in images, where the target object class is specified by a text description. To this end, we propose CounTX, a class-agnostic, single-stage model using a transformer decoder counting head on top of pre-trained joint text-image representations. CounTX is able to count the number of instances of any class given only an image and a text description of the t…

    Submitted 15 September, 2023; v1 submitted 2 June, 2023; originally announced June 2023.

    Comments: BMVC 2023

  32. arXiv:2305.13786  [pdf, other]

    cs.CV cs.AI cs.LG

    Perception Test: A Diagnostic Benchmark for Multimodal Video Models

    Authors: Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira

    Abstract: We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning…

    Submitted 30 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks

  33. arXiv:2304.06708  [pdf, other]

    cs.CV cs.AI cs.CL

    Verbs in Action: Improving verb understanding in video-language models

    Authors: Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, Cordelia Schmid

    Abstract: Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In th…

    Submitted 13 April, 2023; originally announced April 2023.

  34. arXiv:2303.17644  [pdf, other]

    cs.CV

    Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime

    Authors: Rhydian Windsor, Amir Jamaludin, Timor Kadir, Andrew Zisserman

    Abstract: This paper explores training medical vision-language models (VLMs) -- where the visual and language inputs are embedded into a common space -- with a particular focus on scenarios where training data is limited, as is often the case in clinical datasets. We explore several candidate methods to improve low-data performance, including: (i) adapting generic pre-trained models to novel image and text…

    Submitted 30 March, 2023; originally announced March 2023.

    Comments: Accepted to MIDL 2023

  35. arXiv:2303.16899  [pdf, other]

    cs.CV

    AutoAD: Movie Description in Context

    Authors: Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman

    Abstract: The objective of this paper is an automatic Audio Description (AD) model that ingests movies and outputs AD in text form. Generating high-quality movie AD is challenging due to the dependency of the descriptions on context, and the limited amount of training data available. In this work, we leverage the power of pretrained foundation models, such as GPT and CLIP, and only train a mapping network t…

    Submitted 29 March, 2023; originally announced March 2023.

    Comments: CVPR2023 Highlight. Project page: https://www.robots.ox.ac.uk/~vgg/research/autoad/

  36. arXiv:2303.13518  [pdf, other]

    cs.CV cs.AI cs.LG

    Three ways to improve feature alignment for open vocabulary detection

    Authors: Relja Arandjelović, Alex Andonian, Arthur Mensch, Olivier J. Hénaff, Jean-Baptiste Alayrac, Andrew Zisserman

    Abstract: The core problem in zero-shot open vocabulary detection is how to align visual and text features, so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining, and struggles to prevent the language model from forgetting unseen classes. We propose t…

    Submitted 23 March, 2023; originally announced March 2023.

  37. arXiv:2303.00747  [pdf, other]

    cs.SD eess.AS

    WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

    Authors: Max Bain, Jaesung Huh, Tengda Han, Andrew Zisserman

    Abstract: Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or sliding window approaches is prone to drifting, hallucination & repetition; and prohibits batched transcription due to their sequential nature. Further, timestamps c…

    Submitted 11 July, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

    Comments: Accepted to INTERSPEECH 2023

  38. arXiv:2302.10248  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge

    Authors: Jaesung Huh, Andrew Brown, Jee-weon Jung, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman

    Abstract: This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022. The goal of this challenge was to evaluate how well state-of-the-art speaker recognition systems can diarise and recognise speakers from speech obtained "in the wild". The challenge consisted of: (i) the provision of publicly available speaker re…

    Submitted 6 March, 2023; v1 submitted 20 February, 2023; originally announced February 2023.

  39. arXiv:2302.00646  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Epic-Sounds: A Large-scale Dataset of Actions That Sound

    Authors: Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, Andrew Zisserman

    Abstract: We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos. We propose an annotation pipeline where annotators temporally label distinguishable audio segments and describe the action that could have caused this sound. We identify actions that can be discriminated purely from audio, through groupi…

    Submitted 1 February, 2023; originally announced February 2023.

    Comments: 6 pages, 4 figures

  40. arXiv:2301.09595  [pdf, other]

    cs.CV

    Zorro: the masked multimodal transformer

    Authors: Adrià Recasens, Jason Lin, Joāo Carreira, Drew Jaegle, Luyu Wang, Jean-baptiste Alayrac, Pauline Luc, Antoine Miech, Lucas Smaira, Ross Hemsley, Andrew Zisserman

    Abstract: Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering. The resulting representations are however fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires in…

    Submitted 22 February, 2023; v1 submitted 23 January, 2023; originally announced January 2023.

  41. arXiv:2211.15107  [pdf, other]

    cs.CV cs.AI cs.LG

    A Light Touch Approach to Teaching Transformers Multi-view Geometry

    Authors: Yash Bhalgat, Joao F. Henriques, Andrew Zisserman

    Abstract: Transformers are powerful visual learners, in large part due to their conspicuous lack of manually-specified priors. This flexibility can be problematic in tasks that involve multiple-view geometry, due to the near-infinite possible variations in 3D shapes and viewpoints (requiring flexibility), and the precise nature of projective geometry (obeying rigid laws). To resolve this conundrum, we propo…

    Submitted 2 April, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

    Comments: Camera-ready version. Accepted to CVPR 2023

  42. arXiv:2211.08954  [pdf, other]

    cs.CV

    Weakly-supervised Fingerspelling Recognition in British Sign Language Videos

    Authors: K R Prajwal, Hannah Bull, Liliane Momeni, Samuel Albanie, Gül Varol, Andrew Zisserman

    Abstract: The goal of this work is to detect and recognize sequences of letters signed using fingerspelling in British Sign Language (BSL). Previous fingerspelling recognition methods have not focused on BSL, which has a very different signing alphabet (e.g., two-handed instead of one-handed) to American Sign Language (ASL). They also use manual annotations for training. In contrast to previous methods, our…

    Submitted 16 November, 2022; originally announced November 2022.

    Comments: Appears in: British Machine Vision Conference 2022 (BMVC 2022)

  43. arXiv:2211.03726  [pdf, other]

    cs.CV stat.ML

    TAP-Vid: A Benchmark for Tracking Any Point in a Video

    Authors: Carl Doersch, Ankush Gupta, Larisa Markeeva, Adrià Recasens, Lucas Smaira, Yusuf Aytar, João Carreira, Andrew Zisserman, Yi Yang

    Abstract: Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful to make inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation e…

    Submitted 31 March, 2023; v1 submitted 7 November, 2022; originally announced November 2022.

    Comments: Published in NeurIPS Datasets and Benchmarks track, 2022

  44. arXiv:2210.14601  [pdf, other]

    cs.CV

    End-to-end Tracking with a Multi-query Transformer

    Authors: Bruno Korbar, Andrew Zisserman

    Abstract: Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time. Our aim in this paper is to move beyond tracking-by-detection approaches, that perform well on datasets where the object classes are known, to class-agnostic tracking that performs well also for unknown object classes. To this end,…

    Submitted 26 October, 2022; originally announced October 2022.

  45. arXiv:2210.10046  [pdf, other]

    cs.CV

    A Tri-Layer Plugin to Improve Occluded Detection

    Authors: Guanqi Zhan, Weidi Xie, Andrew Zisserman

    Abstract: Detecting occluded objects still remains a challenge for state-of-the-art object detectors. The objective of this work is to improve the detection for such objects, and thereby improve the overall performance of a modern object detector. To this end we make the following four contributions: (1) We propose a simple 'plugin' module for the detection head of two-stage object detectors to improve th…

    Submitted 18 October, 2022; originally announced October 2022.

    Comments: BMVC 2022

  46. arXiv:2210.07055  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors

    Authors: Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman

    Abstract: The objective of this paper is audio-visual synchronisation of general videos 'in the wild'. For such videos, the events that may be harnessed for synchronisation cues may be spatially small and may occur only infrequently during a many seconds-long video clip, i.e. the synchronisation signal is 'sparse in space and time'. This contrasts with the case of synchronising videos of talking heads, wher…

    Submitted 13 October, 2022; originally announced October 2022.

    Comments: Accepted as a spotlight presentation at BMVC 2022. Code: https://github.com/v-iashin/SparseSync Project page: https://v-iashin.github.io/SparseSync

  47. arXiv:2210.04889  [pdf, other]

    cs.CV

    Turbo Training with Token Dropout

    Authors: Tengda Han, Weidi Xie, Andrew Zisserman

    Abstract: The objective of this paper is an efficient training method for video tasks. We make three contributions: (1) We propose Turbo training, a simple and versatile training paradigm for Transformers on multiple video tasks. (2) We illustrate the advantages of Turbo training on action classification, video-language representation learning, and long-video activity classification, showing that Turbo trai…

    Submitted 10 October, 2022; originally announced October 2022.

    Comments: BMVC2022

  48. arXiv:2210.02995  [pdf, other]

    cs.CV

    Compressed Vision for Efficient Video Understanding

    Authors: Olivia Wiles, Joao Carreira, Iain Barr, Andrew Zisserman, Mateusz Malinowski

    Abstract: Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos requires more scalable approaches even to process them. In this work, we propose a framework enabling research on hour-long v…

    Submitted 6 October, 2022; originally announced October 2022.

    Comments: ACCV

  49. arXiv:2209.14341  [pdf, other]

    cs.CV

    The Change You Want to See

    Authors: Ragav Sachdeva, Andrew Zisserman

    Abstract: We live in a dynamic world where things change all the time. Given two images of the same scene, being able to automatically detect the changes in them has practical applications in a variety of domains. In this paper, we tackle the change detection problem with the goal of detecting "object-level" changes in an image pair despite differences in their viewpoint and illumination. To this end, we ma…

    Submitted 28 September, 2022; originally announced September 2022.

    Comments: Paper accepted at WACV 2023

  50. arXiv:2208.13721  [pdf, other]

    cs.CV

    CounTR: Transformer-based Generalised Visual Counting

    Authors: Chang Liu, Yujie Zhong, Andrew Zisserman, Weidi Xie

    Abstract: In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of "exemplars", i.e. zero-shot or few-shot counting. To this end, we make the following four contributions: (1) We introduce a novel transformer-based architecture for generalise…

    Submitted 2 June, 2023; v1 submitted 29 August, 2022; originally announced August 2022.

    Comments: Accepted by BMVC2022