-
MoST: Multi-modality Scene Tokenization for Motion Prediction
Authors:
Norman Mu,
Jingwei Ji,
Zhenpei Yang,
Nate Harada,
Haotian Tang,
Kan Chen,
Charles R. Qi,
Runzhou Ge,
Kratarth Goel,
Zoey Yang,
Scott Ettinger,
Rami Al-Rfou,
Dragomir Anguelov,
Yin Zhou
Abstract:
Many existing motion prediction approaches rely on symbolic perception outputs to generate agent trajectories, such as bounding boxes, road graph information and traffic lights. This symbolic representation is a high-level abstraction of the real world, which may render the motion prediction model vulnerable to perception errors (e.g., failures in detecting open-vocabulary obstacles) while missing salient information from the scene context (e.g., poor road conditions). An alternative paradigm is end-to-end learning from raw sensors. However, this approach suffers from the lack of interpretability and requires significantly more training resources. In this work, we propose tokenizing the visual world into a compact set of scene elements and then leveraging pre-trained image foundation models and LiDAR neural networks to encode all the scene elements in an open-vocabulary manner. The image foundation model enables our scene tokens to encode the general knowledge of the open world while the LiDAR neural network encodes geometry information. Our proposed representation can efficiently encode the multi-frame multi-modality observations with a few hundred tokens and is compatible with most transformer-based architectures. To evaluate our method, we have augmented Waymo Open Motion Dataset with camera embeddings. Experiments over Waymo Open Motion Dataset show that our approach leads to significant performance improvements over the state-of-the-art.
Submitted 30 April, 2024;
originally announced April 2024.
-
TransportationGames: Benchmarking Transportation Knowledge of (Multimodal) Large Language Models
Authors:
Xue Zhang,
Xiangyu Shi,
Xinyue Lou,
Rui Qi,
Yufeng Chen,
Jinan Xu,
Wenjuan Han
Abstract:
Large language models (LLMs) and multimodal large language models (MLLMs) have shown excellent general capabilities, even exhibiting adaptability in many professional domains such as law, economics, transportation, and medicine. Currently, many domain-specific benchmarks have been proposed to verify the performance of (M)LLMs in specific fields. Among various domains, transportation plays a crucial role in modern society as it impacts the economy, the environment, and the quality of life for billions of people. However, it is unclear how much traffic knowledge (M)LLMs possess and whether they can reliably perform transportation-related tasks. To address this gap, we propose TransportationGames, a carefully designed and thorough evaluation benchmark for assessing (M)LLMs in the transportation domain. By comprehensively considering the applications in real-world scenarios and referring to the first three levels in Bloom's Taxonomy, we test the performance of various (M)LLMs in memorizing, understanding, and applying transportation knowledge by the selected tasks. The experimental results show that although some models perform well in some tasks, there is still much room for improvement overall. We hope the release of TransportationGames can serve as a foundation for future research, thereby accelerating the implementation and application of (M)LLMs in the transportation domain.
Submitted 9 January, 2024;
originally announced January 2024.
-
AI Mobile Application for Archaeological Dating of Bronze Dings
Authors:
Chuntao Li,
Ruihua Qi,
Chuan Tang,
Jiafu Wei,
Xi Yang,
Qian Zhang,
Rixin Zhou
Abstract:
We develop an AI application for archaeological dating of bronze Dings. A classification model is employed to predict the period of the input Ding, and a detection model is used to show the feature parts that inform the dating decision. To train the two deep learning models, we collected a large number of Ding images from published materials, and archaeological experts annotated the period and the feature parts on each image. Furthermore, we design a user system and deploy our pre-trained models on the WeChat Mini Program platform for ease of use. With only a smartphone that has the WeChat app installed, users can easily obtain the result of intelligent archaeological dating, the feature parts, and other reference artifacts by taking a photo of a bronze Ding. To use our application, please scan this QR code by WeChat.
Submitted 5 September, 2023;
originally announced January 2024.
-
Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving
Authors:
Mahyar Najibi,
Jingwei Ji,
Yin Zhou,
Charles R. Qi,
Xinchen Yan,
Scott Ettinger,
Dragomir Anguelov
Abstract:
Closed-set 3D perception models trained on only a pre-defined set of object categories can be inadequate for safety-critical applications such as autonomous driving, where new object types can be encountered after deployment. In this paper, we present a multi-modal auto labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels. Our pipeline exploits motion cues inherent in point cloud sequences in combination with the freely available 2D image-text pairs to identify and track all traffic participants. Compared to the recent studies in this domain, which can only provide class-agnostic auto labels limited to moving objects, our method can handle both static and moving objects in an unsupervised manner and is able to output open-vocabulary semantic labels thanks to the proposed vision-language knowledge distillation. Experiments on the Waymo Open Dataset show that our approach outperforms the prior work by significant margins on various unsupervised 3D perception tasks.
Submitted 25 September, 2023;
originally announced September 2023.
-
Toward Zero-shot Character Recognition: A Gold Standard Dataset with Radical-level Annotations
Authors:
Xiaolei Diao,
Daqian Shi,
Jian Li,
Lida Shi,
Mingzhe Yue,
Ruihua Qi,
Chuntao Li,
Hao Xu
Abstract:
Optical character recognition (OCR) methods have been applied to diverse tasks, e.g., street view text recognition and document analysis. Recently, zero-shot OCR has piqued the interest of the research community because it considers a practical OCR scenario with unbalanced data distribution. However, there is a lack of benchmarks for evaluating such zero-shot methods that apply a divide-and-conquer recognition strategy by decomposing characters into radicals. Meanwhile, radical recognition, as another important OCR task, also lacks radical-level annotation for model training. In this paper, we construct an ancient Chinese character image dataset that contains both radical-level and character-level annotations to satisfy the requirements of the above-mentioned methods, namely, ACCID, where radical-level annotations include radical categories, radical locations, and structural relations. To increase the adaptability of ACCID, we propose a splicing-based synthetic character algorithm to augment the training samples and apply an image denoising method to improve the image quality. By introducing character decomposition and recombination, we propose a baseline method for zero-shot OCR. The experimental results demonstrate the validity of ACCID and the baseline model quantitatively and qualitatively.
Submitted 1 August, 2023;
originally announced August 2023.
-
MoDAR: Using Motion Forecasting for 3D Object Detection in Point Cloud Sequences
Authors:
Yingwei Li,
Charles R. Qi,
Yin Zhou,
Chenxi Liu,
Dragomir Anguelov
Abstract:
Occluded and long-range objects are ubiquitous and challenging for 3D object detection. Point cloud sequence data provide unique opportunities to improve such cases, as an occluded or distant object can be observed from different viewpoints or gain better visibility over time. However, the efficiency and effectiveness in encoding long-term sequence data can still be improved. In this work, we propose MoDAR, using motion forecasting outputs as a type of virtual modality, to augment LiDAR point clouds. The MoDAR modality propagates object information from temporal contexts to a target frame, represented as a set of virtual points, one for each object from a waypoint on a forecasted trajectory. A fused point cloud of both raw sensor points and the virtual points can then be fed to any off-the-shelf point-cloud based 3D object detector. Evaluated on the Waymo Open Dataset, our method significantly improves prior art detectors by using motion forecasting from extra-long sequences (e.g. 18 seconds), achieving a new state of the art while adding little computation overhead.
Submitted 5 June, 2023;
originally announced June 2023.
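The fusion step described in the MoDAR abstract above (virtual points taken from forecasted trajectories, concatenated with raw LiDAR points) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Forecast` container, the choice of the waypoint nearest in time to the target frame, and the extra `is_virtual` flag channel are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x, y, z, is_virtual): the flag channel is an assumption of this sketch.
FusedPoint = Tuple[float, float, float, float]


@dataclass
class Forecast:
    """Hypothetical container: one forecasted trajectory for one object."""
    object_id: int
    # Each waypoint is (time_in_seconds, x, y, z) for the predicted object center.
    waypoints: List[Tuple[float, float, float, float]]


def virtual_points(forecasts: List[Forecast], target_time: float) -> List[FusedPoint]:
    """Emit one virtual point per object, using the waypoint nearest in time
    to the target frame (an assumed selection rule)."""
    points = []
    for forecast in forecasts:
        if not forecast.waypoints:
            continue  # nothing forecasted for this object
        _, x, y, z = min(forecast.waypoints, key=lambda w: abs(w[0] - target_time))
        points.append((x, y, z, 1.0))
    return points


def fuse(raw_points: List[Tuple[float, float, float]],
         forecasts: List[Forecast],
         target_time: float) -> List[FusedPoint]:
    """Concatenate raw sensor points (flag 0.0) with virtual points (flag 1.0)."""
    raw = [(x, y, z, 0.0) for x, y, z in raw_points]
    return raw + virtual_points(forecasts, target_time)
```

The fused list could then be handed to a point-cloud detector that accepts a per-point feature channel; how the detector consumes that channel is outside the scope of this sketch.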
-
Explanation Strategies for Image Classification in Humans vs. Current Explainable AI
Authors:
Ruoxi Qi,
Yueyuan Zheng,
Yi Yang,
Caleb Chen Cao,
Janet H. Hsiao
Abstract:
Explainable AI (XAI) methods provide explanations of AI models, but our understanding of how they compare with human explanations remains limited. In image classification, we found that humans adopted more explorative attention strategies for explanation than the classification task itself. Two representative explanation strategies were identified through clustering: One involved focused visual scanning on foreground objects with more conceptual explanations diagnostic for inferring class labels, whereas the other involved explorative scanning with more visual explanations rated higher for effectiveness. Interestingly, XAI saliency-map explanations had the highest similarity to the explorative attention strategy in humans, and explanations highlighting discriminative features from invoking observable causality through perturbation had higher similarity to human strategies than those highlighting internal features associated with higher class score. Thus, humans differ in information and strategy use for explanations, and XAI methods that highlight features informing observable causality match better with human explanations, potentially more accessible to users.
Submitted 10 April, 2023;
originally announced April 2023.
-
WOMD-LiDAR: Raw Sensor Dataset Benchmark for Motion Forecasting
Authors:
Kan Chen,
Runzhou Ge,
Hang Qiu,
Rami Al-Rfou,
Charles R. Qi,
Xuanyu Zhou,
Zoey Yang,
Scott Ettinger,
Pei Sun,
Zhaoqi Leng,
Mustafa Baniodeh,
Ivan Bogun,
Weiyue Wang,
Mingxing Tan,
Dragomir Anguelov
Abstract:
Widely adopted motion forecasting datasets substitute the observed sensory inputs with higher-level abstractions such as 3D boxes and polylines. These sparse shapes are inferred through annotating the original scenes with perception systems' predictions. Such intermediate representations tie the quality of the motion forecasting models to the performance of computer vision models. Moreover, the human-designed explicit interfaces between perception and motion forecasting typically pass only a subset of the semantic information present in the original sensory input. To study the effect of these modular approaches, design new paradigms that mitigate these limitations, and accelerate the development of end-to-end motion forecasting models, we augment the Waymo Open Motion Dataset (WOMD) with large-scale, high-quality, diverse LiDAR data for the motion forecasting task.
The new augmented dataset WOMD-LiDAR consists of over 100,000 scenes, each spanning 20 seconds, with well-synchronized and calibrated high-quality LiDAR point clouds captured across a range of urban and suburban geographies (https://waymo.com/open/data/motion/). Compared to the Waymo Open Dataset (WOD), the WOMD-LiDAR dataset contains 100x more scenes. Furthermore, we integrate the LiDAR data into the motion forecasting model training and provide a strong baseline. Experiments show that the LiDAR data brings improvement in the motion forecasting task. We hope that WOMD-LiDAR will provide new opportunities for boosting end-to-end motion forecasting models.
Submitted 18 February, 2024; v1 submitted 7 April, 2023;
originally announced April 2023.
-
GINA-3D: Learning to Generate Implicit Neural Assets in the Wild
Authors:
Bokui Shen,
Xinchen Yan,
Charles R. Qi,
Mahyar Najibi,
Boyang Deng,
Leonidas Guibas,
Yin Zhou,
Dragomir Anguelov
Abstract:
Modeling the 3D world from sensor data for simulation is a scalable way of developing testing and validation environments for robotic learning problems such as autonomous driving. However, manually creating or re-creating real-world-like environments is difficult, expensive, and not scalable. Recent generative model techniques have shown promising progress to address such challenges by learning 3D assets using only plentiful 2D images -- but still suffer limitations as they leverage either human-curated image datasets or renderings from manually-created synthetic 3D environments. In this paper, we introduce GINA-3D, a generative model that uses real-world driving data from camera and LiDAR sensors to create realistic 3D implicit neural assets of diverse vehicles and pedestrians. Compared to the existing image datasets, the real-world driving setting poses new challenges due to occlusions, lighting-variations and long-tail distributions. GINA-3D tackles these challenges by decoupling representation learning and generative modeling into two stages with a learned tri-plane latent structure, inspired by recent advances in generative modeling of images. To evaluate our approach, we construct a large-scale object-centric dataset containing over 1.2M images of vehicles and pedestrians from the Waymo Open Dataset, and a new set of 80K images of long-tail instances such as construction equipment, garbage trucks, and cable cars. We compare our model with existing approaches and demonstrate that it achieves state-of-the-art performance in quality and diversity for both generated images and geometries.
Submitted 28 August, 2023; v1 submitted 4 April, 2023;
originally announced April 2023.
-
Multi-Granularity Archaeological Dating of Chinese Bronze Dings Based on a Knowledge-Guided Relation Graph
Authors:
Rixin Zhou,
Jiafu Wei,
Qian Zhang,
Ruihua Qi,
Xi Yang,
Chuntao Li
Abstract:
The archaeological dating of bronze dings has played a critical role in the study of ancient Chinese history. Current archaeology depends on trained experts to carry out bronze dating, which is time-consuming and labor-intensive. For such dating, in this study, we propose a learning-based approach to integrate advanced deep learning techniques and archaeological knowledge. To achieve this, we first collect a large-scale image dataset of bronze dings, which contains richer attribute information than other existing fine-grained datasets. Second, we introduce a multihead classifier and a knowledge-guided relation graph to mine the relationship between attributes and the ding era. Third, we conduct comparison experiments with various existing methods, the results of which show that our dating method achieves a state-of-the-art performance. We hope that our data and applied networks will enrich fine-grained classification research relevant to other interdisciplinary areas of expertise. The dataset and source code used are included in our supplementary materials, and will be open after submission owing to the anonymity policy. Source codes and data are available at: https://github.com/zhourixin/bronze-Ding.
Submitted 2 June, 2023; v1 submitted 27 March, 2023;
originally announced March 2023.
-
New Approximation Algorithms for Touring Regions
Authors:
Benjamin Qi,
Richard Qi,
Xinyang Chen
Abstract:
We analyze the touring regions problem: find a $(1+\epsilon)$-approximate Euclidean shortest path in $d$-dimensional space that starts at a given starting point, ends at a given ending point, and visits given regions $R_1, R_2, R_3, \dots, R_n$ in that order.
Our main result is an $\mathcal{O}\left(\frac{n}{\sqrt{\epsilon}}\log\frac{1}{\epsilon} + \frac{1}{\epsilon}\right)$-time algorithm for touring disjoint disks. We also give an $\mathcal{O}\left(\min\left(\frac{n}{\epsilon}, \frac{n^2}{\sqrt{\epsilon}}\right)\right)$-time algorithm for touring disjoint two-dimensional convex fat bodies. Both of these results naturally generalize to larger dimensions; we obtain $\mathcal{O}\left(\frac{n}{\epsilon^{d-1}}\log^2\frac{1}{\epsilon}+\frac{1}{\epsilon^{2d-2}}\right)$ and $\mathcal{O}\left(\frac{n}{\epsilon^{2d-2}}\right)$-time algorithms for touring disjoint $d$-dimensional balls and convex fat bodies, respectively.
Submitted 13 March, 2023; v1 submitted 12 March, 2023;
originally announced March 2023.
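For readers who want the touring regions objective from the entry above in symbols, one way to write it is sketched here. The notation ($s$, $t$, $p_i$, $L$, $L^*$) is introduced only for this illustration and is not taken from the abstract. With start point $s$ and end point $t$, set $p_0 = s$ and $p_{n+1} = t$, and choose one point $p_i \in R_i$ per region so as to minimize

$$L(p_1, \dots, p_n) = \sum_{i=0}^{n} \lVert p_{i+1} - p_i \rVert_2 .$$

Writing $L^*$ for the minimum value, a path is $(1+\epsilon)$-approximate if its length is at most $(1+\epsilon)\,L^*$.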
-
ABODE-Net: An Attention-based Deep Learning Model for Non-intrusive Building Occupancy Detection Using Smart Meter Data
Authors:
Zhirui Luo,
Ruobin Qi,
Qingqing Li,
Jun Zheng,
Sihua Shao
Abstract:
Occupancy information is useful for efficient energy management in the building sector. The massive high-resolution electrical power consumption data collected by smart meters in the advanced metering infrastructure (AMI) network make it possible to infer buildings' occupancy status in a non-intrusive way. In this paper, we propose a deep learning model called ABODE-Net which employs a novel Parallel Attention (PA) block for building occupancy detection using smart meter data. The PA block combines the temporal, variable, and channel attention modules in a parallel way to signify important features for occupancy detection. We adopt two smart meter datasets widely used for building occupancy detection in our performance evaluation. A set of state-of-the-art shallow machine learning and deep learning models are included for performance comparison. The results show that ABODE-Net significantly outperforms other models in all experimental cases, which proves its validity as a solution for non-intrusive building occupancy detection.
Submitted 21 December, 2022;
originally announced December 2022.
-
NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors
Authors:
Congyue Deng,
Chiyu "Max" Jiang,
Charles R. Qi,
Xinchen Yan,
Yin Zhou,
Leonidas Guibas,
Dragomir Anguelov
Abstract:
2D-to-3D reconstruction is an ill-posed problem, yet humans are good at solving this problem due to their prior knowledge of the 3D world developed over years. Driven by this observation, we propose NeRDi, a single-view NeRF synthesis framework with general image priors from 2D diffusion models. Formulating single-view reconstruction as an image-conditioned 3D generation problem, we optimize the NeRF representations by minimizing a diffusion loss on its arbitrary view renderings with a pretrained image diffusion model under the input-view constraint. We leverage off-the-shelf vision-language models and introduce a two-section language guidance as conditioning inputs to the diffusion model. This is essentially helpful for improving multiview content coherence as it narrows down the general image prior conditioned on the semantic and visual features of the single-view input image. Additionally, we introduce a geometric loss based on estimated depth maps to regularize the underlying 3D geometry of the NeRF. Experimental results on the DTU MVS dataset show that our method can synthesize novel views with higher quality even compared to existing methods trained on this dataset. We also demonstrate our generalizability in zero-shot NeRF synthesis for in-the-wild images.
Submitted 6 December, 2022;
originally announced December 2022.
-
Improving the Intra-class Long-tail in 3D Detection via Rare Example Mining
Authors:
Chiyu Max Jiang,
Mahyar Najibi,
Charles R. Qi,
Yin Zhou,
Dragomir Anguelov
Abstract:
Continued improvements in deep learning architectures have steadily advanced the overall performance of 3D object detectors to levels on par with humans for certain tasks and datasets, where the overall performance is mostly driven by common examples. However, even the best performing models suffer from the most naive mistakes when it comes to rare examples that do not appear frequently in the training data, such as vehicles with irregular geometries. Most studies in the long-tail literature focus on class-imbalanced classification problems with known imbalanced label counts per class, but they are not directly applicable to the intra-class long-tail examples in problems with large intra-class variations such as 3D object detection, where instances with the same class label can have drastically varied properties such as shapes and sizes. Other works propose to mitigate this problem using active learning based on the criteria of uncertainty, difficulty, or diversity. In this study, we identify a new conceptual dimension - rareness - to mine new data for improving the long-tail performance of models. We show that rareness, as opposed to difficulty, is the key to data-centric improvements for 3D detectors, since rareness is the result of a lack in data support while difficulty is related to the fundamental ambiguity in the problem. We propose a general and effective method to identify the rareness of objects based on density estimation in the feature space using flow models, and propose a principled cost-aware formulation for mining rare object tracks, which improves overall model performance, but more importantly - significantly improves the performance for rare objects (by 30.97\%
Submitted 15 October, 2022;
originally announced October 2022.
-
LESS: Label-Efficient Semantic Segmentation for LiDAR Point Clouds
Authors:
Minghua Liu,
Yin Zhou,
Charles R. Qi,
Boqing Gong,
Hao Su,
Dragomir Anguelov
Abstract:
Semantic segmentation of LiDAR point clouds is an important task in autonomous driving. However, training deep models via conventional supervised methods requires large datasets which are costly to label. It is critical to have label-efficient segmentation approaches to scale up the model to new operational domains or to improve performance on rare cases. While most prior works focus on indoor scenes, we are one of the first to propose a label-efficient semantic segmentation pipeline for outdoor scenes with LiDAR point clouds. Our method co-designs an efficient labeling process with semi/weakly supervised learning and is applicable to nearly any 3D semantic segmentation backbones. Specifically, we leverage geometry patterns in outdoor scenes to have a heuristic pre-segmentation to reduce the manual labeling and jointly design the learning targets with the labeling process. In the learning step, we leverage prototype learning to get more descriptive point embeddings and use multi-scan distillation to exploit richer semantics from temporally aggregated point clouds to boost the performance of single-scan models. Evaluated on the SemanticKITTI and the nuScenes datasets, we show that our proposed method outperforms existing label-efficient methods. With extremely limited human annotations (e.g., 0.1% point labels), our proposed method is even highly competitive compared to the fully supervised counterpart with 100% labels.
Submitted 14 October, 2022;
originally announced October 2022.
-
Motion Inspired Unsupervised Perception and Prediction in Autonomous Driving
Authors:
Mahyar Najibi,
Jingwei Ji,
Yin Zhou,
Charles R. Qi,
Xinchen Yan,
Scott Ettinger,
Dragomir Anguelov
Abstract:
Learning-based perception and prediction modules in modern autonomous driving systems typically rely on expensive human annotation and are designed to perceive only a handful of predefined object categories. This closed-set paradigm is insufficient for the safety-critical autonomous driving task, where the autonomous vehicle needs to process arbitrarily many types of traffic participants and their motion behaviors in a highly dynamic world. To address this difficulty, this paper pioneers a novel and challenging direction, i.e., training perception and prediction models to understand open-set moving objects, with no human supervision. Our proposed framework uses self-learned flow to trigger an automated meta labeling pipeline to achieve automatic supervision. 3D detection experiments on the Waymo Open Dataset show that our method significantly outperforms classical unsupervised approaches and is even competitive to the counterpart with supervised scene flow. We further show that our approach generates highly promising results in open-set 3D detection and trajectory prediction, confirming its potential in closing the safety gap of fully supervised systems.
Submitted 14 October, 2022;
originally announced October 2022.
-
LidarNAS: Unifying and Searching Neural Architectures for 3D Point Clouds
Authors:
Chenxi Liu,
Zhaoqi Leng,
Pei Sun,
Shuyang Cheng,
Charles R. Qi,
Yin Zhou,
Mingxing Tan,
Dragomir Anguelov
Abstract:
Developing neural models that accurately understand objects in 3D point clouds is essential for the success of robotics and autonomous driving. However, arguably due to the higher-dimensional nature of the data (as compared to images), existing neural architectures exhibit a large variety in their designs, including but not limited to the views considered, the format of the neural features, and the neural operations used. Lack of a unified framework and interpretation makes it hard to put these designs in perspective, as well as systematically explore new ones. In this paper, we begin by proposing a unified framework of such, with the key idea being factorizing the neural networks into a series of view transforms and neural layers. We demonstrate that this modular framework can reproduce a variety of existing works while allowing a fair comparison of backbone designs. Then, we show how this framework can easily materialize into a concrete neural architecture search (NAS) space, allowing a principled NAS-for-3D exploration. In performing evolutionary NAS on the 3D object detection task on the Waymo Open Dataset, not only do we outperform the state-of-the-art models, but also report the interesting finding that NAS tends to discover the same macro-level architecture concept for both the vehicle and pedestrian classes.
Submitted 10 October, 2022;
originally announced October 2022.
-
ForecastTKGQuestions: A Benchmark for Temporal Question Answering and Forecasting over Temporal Knowledge Graphs
Authors:
Zifeng Ding,
Zongyue Li,
Ruoxia Qi,
Jingpei Wu,
Bailan He,
Yunpu Ma,
Zhao Meng,
Shuo Chen,
Ruotong Liao,
Zhen Han,
Volker Tresp
Abstract:
Question answering over temporal knowledge graphs (TKGQA) has recently found increasing interest. TKGQA requires temporal reasoning techniques to extract the relevant information from temporal knowledge bases. The only existing TKGQA dataset, i.e., CronQuestions, consists of temporal questions based on the facts from a fixed time period, where a temporal knowledge graph (TKG) spanning the same period can be fully used for answer inference, allowing the TKGQA models to use even the future knowledge to answer the questions based on the past facts. In real-world scenarios, however, it is also common that given the knowledge until now, we wish the TKGQA systems to answer the questions asking about the future. As humans constantly seek plans for the future, building TKGQA systems for answering such forecasting questions is important. Nevertheless, this has still been unexplored in previous research. In this paper, we propose a novel task: forecasting question answering over temporal knowledge graphs. We also propose a large-scale TKGQA benchmark dataset, i.e., ForecastTKGQuestions, for this task. It includes three types of questions, i.e., entity prediction, yes-no, and fact reasoning questions. For every forecasting question in our dataset, QA models can only have access to the TKG information before the timestamp annotated in the given question for answer inference. We find that the state-of-the-art TKGQA methods perform poorly on forecasting questions, and they are unable to answer yes-no questions and fact reasoning questions. To this end, we propose ForecastTKGQA, a TKGQA model that employs a TKG forecasting module for future inference, to answer all three types of questions. Experimental results show that ForecastTKGQA outperforms recent TKGQA methods on the entity prediction questions, and it also shows great effectiveness in answering the other two types of questions.
Submitted 18 July, 2023; v1 submitted 12 August, 2022;
originally announced August 2022.
-
Depth Estimation Matters Most: Improving Per-Object Depth Estimation for Monocular 3D Detection and Tracking
Authors:
Longlong Jing,
Ruichi Yu,
Henrik Kretzschmar,
Kang Li,
Charles R. Qi,
Hang Zhao,
Alper Ayvaci,
Xu Chen,
Dillon Cower,
Yingwei Li,
Yurong You,
Han Deng,
Congcong Li,
Dragomir Anguelov
Abstract:
Monocular image-based 3D perception has become an active research area in recent years owing to its applications in autonomous driving. Approaches to monocular 3D perception including detection and tracking, however, often yield inferior performance when compared to LiDAR-based techniques. Through systematic analysis, we identified that per-object depth estimation accuracy is a major factor bounding the performance. Motivated by this observation, we propose a multi-level fusion method that combines different representations (RGB and pseudo-LiDAR) and temporal information across multiple frames for objects (tracklets) to enhance per-object depth estimation. Our proposed fusion method achieves the state-of-the-art performance of per-object depth estimation on the Waymo Open Dataset, the KITTI detection dataset, and the KITTI MOT dataset. We further demonstrate that by simply replacing estimated depth with fusion-enhanced depth, we can achieve significant improvements in monocular 3D perception tasks, including detection and tracking.
Submitted 7 June, 2022;
originally announced June 2022.
-
RIDDLE: Lidar Data Compression with Range Image Deep Delta Encoding
Authors:
Xuanyu Zhou,
Charles R. Qi,
Yin Zhou,
Dragomir Anguelov
Abstract:
Lidars are depth measuring sensors widely used in autonomous driving and augmented reality. However, the large volume of data produced by lidars can lead to high costs in data storage and transmission. While lidar data can be represented in two interchangeable forms, 3D point clouds and range images, most previous work focuses on compressing the generic 3D point clouds. In this work, we show that directly compressing the range images can leverage the lidar scanning pattern, compared to compressing the unprojected point clouds. We propose a novel data-driven range image compression algorithm, named RIDDLE (Range Image Deep DeLta Encoding). At its core is a deep model that predicts the next pixel value in a raster scanning order, based on contextual laser shots from both the current and past scans (represented as a 4D point cloud of spherical coordinates and time). The deltas between predictions and original values can then be compressed by entropy encoding. Evaluated on the Waymo Open Dataset and KITTI, our method demonstrates significant improvement in the compression rate (under the same distortion) compared to widely used point cloud and range image compression algorithms as well as recent deep methods.
Submitted 2 June, 2022;
originally announced June 2022.
-
Global Contrast Masked Autoencoders Are Powerful Pathological Representation Learners
Authors:
Hao Quan,
Xingyu Li,
Weixing Chen,
Qun Bai,
Mingchen Zou,
Ruijie Yang,
Tingting Zheng,
Ruiqun Qi,
Xinghua Gao,
Xiaoyu Cui
Abstract:
Based on digital pathology slice scanning technology, artificial intelligence algorithms represented by deep learning have achieved remarkable results in the field of computational pathology. Compared to other medical images, pathology images are more difficult to annotate, and thus, there is an extreme lack of available datasets for conducting supervised learning to train robust deep learning models. In this paper, we propose a self-supervised learning (SSL) model, the global contrast-masked autoencoder (GCMAE), which trains the encoder to represent local-global features of pathological images and significantly improves the performance of transfer learning across datasets. In this study, the ability of the GCMAE to learn migratable representations was demonstrated through extensive experiments using a total of three different disease-specific hematoxylin and eosin (HE)-stained pathology datasets: Camelyon16, NCTCRC and BreakHis. In addition, this study designed an effective automated pathology diagnosis process based on the GCMAE for clinical applications. The source code of this paper is publicly available at https://github.com/StarUniversus/gcmae.
Submitted 15 November, 2023; v1 submitted 18 May, 2022;
originally announced May 2022.
-
Multi-Class 3D Object Detection with Single-Class Supervision
Authors:
Mao Ye,
Chenxi Liu,
Maoqing Yao,
Weiyue Wang,
Zhaoqi Leng,
Charles R. Qi,
Dragomir Anguelov
Abstract:
While multi-class 3D detectors are needed in many robotics applications, training them with fully labeled datasets can be expensive in labeling cost. An alternative approach is to have targeted single-class labels on disjoint data samples. In this paper, we are interested in training a multi-class 3D object detection model, while using these single-class labeled data. We begin by detailing the unique stance of our "Single-Class Supervision" (SCS) setting with respect to related concepts such as partial supervision and semi supervision. Then, based on the case study of training the multi-class version of Range Sparse Net (RSN), we adapt a spectrum of algorithms -- from supervised learning to pseudo-labeling -- to fully exploit the properties of our SCS setting, and perform extensive ablation studies to identify the most effective algorithm and practice. Empirical experiments on the Waymo Open Dataset show that proper training under SCS can approach or match full supervision training while saving labeling costs.
Submitted 11 May, 2022;
originally announced May 2022.
-
Random Ensemble Reinforcement Learning for Traffic Signal Control
Authors:
Ruijie Qi,
Jianbin Huang,
He Li,
Qinglin Tan,
Longji Huang,
Jiangtao Cui
Abstract:
Traffic signal control is a significant part of the construction of intelligent transportation. An efficient traffic signal control strategy can reduce traffic congestion, improve urban road traffic efficiency and facilitate people's lives. Existing reinforcement learning approaches for traffic signal control mainly focus on learning through a separate neural network. Such an independent neural network may fall into the local optimum of the training results. Worse still, the collected data can only be sampled once, so the data utilization rate is low. Therefore, we propose the Random Ensemble Double DQN Light (RELight) model. It can dynamically learn traffic signal control strategies through reinforcement learning and combine random ensemble learning to avoid falling into the local optimum to reach the optimal strategy. Moreover, we introduce the Update-To-Data (UTD) ratio to control the number of data reuses and thereby mitigate the problem of low data utilization. In addition, we have conducted sufficient experiments on synthetic data and real-world data to prove that our proposed method can achieve better traffic signal control effects than the existing optimal methods.
Submitted 10 March, 2022;
originally announced March 2022.
-
Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in Autonomous Driving
Authors:
Jingxiao Zheng,
Xinwei Shi,
Alexander Gorban,
Junhua Mao,
Yang Song,
Charles R. Qi,
Ting Liu,
Visesh Chari,
Andre Cornman,
Yin Zhou,
Congcong Li,
Dragomir Anguelov
Abstract:
3D human pose estimation (HPE) in autonomous vehicles (AV) differs from other use cases in many factors, including the 3D resolution and range of data, absence of dense depth maps, failure modes for LiDAR, relative location between the camera and LiDAR, and a high bar for estimation accuracy. Data collected for other use cases (such as virtual reality, gaming, and animation) may therefore not be usable for AV applications. This necessitates the collection and annotation of a large amount of 3D data for HPE in AV, which is time-consuming and expensive. In this paper, we propose one of the first approaches to alleviate this problem in the AV setting. Specifically, we propose a multi-modal approach which uses 2D labels on RGB images as weak supervision to perform 3D HPE. The proposed multi-modal architecture incorporates LiDAR and camera inputs with an auxiliary segmentation branch. On the Waymo Open Dataset, our approach achieves a 22% relative improvement over the camera-only 2D HPE baseline, and a 6% improvement over the LiDAR-only model. Finally, careful ablation studies and parts-based analysis illustrate the advantages of each of our contributions.
Submitted 22 December, 2021;
originally announced December 2021.
-
Revisiting 3D Object Detection From an Egocentric Perspective
Authors:
Boyang Deng,
Charles R. Qi,
Mahyar Najibi,
Thomas Funkhouser,
Yin Zhou,
Dragomir Anguelov
Abstract:
3D object detection is a key module for safety-critical robotics applications such as autonomous driving. For these applications, we care most about how the detections affect the ego-agent's behavior and safety (the egocentric perspective). Intuitively, we seek more accurate descriptions of object geometry when it's more likely to interfere with the ego-agent's motion trajectory. However, current detection metrics, based on box Intersection-over-Union (IoU), are object-centric and aren't designed to capture the spatio-temporal relationship between objects and the ego-agent. To address this issue, we propose a new egocentric measure to evaluate 3D object detection, namely Support Distance Error (SDE). Our analysis based on SDE reveals that the egocentric detection quality is bounded by the coarse geometry of the bounding boxes. Given the insight that SDE would benefit from more accurate geometry descriptions, we propose to represent objects as amodal contours, specifically amodal star-shaped polygons, and devise a simple model, StarPoly, to predict such contours. Our experiments on the large-scale Waymo Open Dataset show that SDE better reflects the impact of detection quality on the ego-agent's safety compared to IoU; and the estimated contours from StarPoly consistently improve the egocentric detection quality over recent 3D object detectors.
Submitted 14 December, 2021;
originally announced December 2021.
-
The Point-to-Set Principle and the Dimensions of Hamel Bases
Authors:
Jack H. Lutz,
Renrui Qi,
Liang Yu
Abstract:
We prove that every real number in [0,1] is the Hausdorff dimension of a Hamel basis of the vector space of reals over the field of rationals.
The logic of our proof is of particular interest. The statement of our theorem is classical; it does not involve the theory of computing. However, our proof makes essential use of algorithmic fractal dimension--a computability-theoretic construct--and the point-to-set principle of J. Lutz and N. Lutz (2018).
Submitted 21 September, 2023; v1 submitted 22 September, 2021;
originally announced September 2021.
-
SPG: Unsupervised Domain Adaptation for 3D Object Detection via Semantic Point Generation
Authors:
Qiangeng Xu,
Yin Zhou,
Weiyue Wang,
Charles R. Qi,
Dragomir Anguelov
Abstract:
In autonomous driving, a LiDAR-based object detector should perform reliably at different geographic locations and under various weather conditions. While recent 3D detection research focuses on improving performance within a single domain, our study reveals that the performance of modern detectors can drop drastically cross-domain. In this paper, we investigate unsupervised domain adaptation (UDA) for LiDAR-based 3D object detection. On the Waymo Domain Adaptation dataset, we identify the deteriorating point cloud quality as the root cause of the performance drop. To address this issue, we present Semantic Point Generation (SPG), a general approach to enhance the reliability of LiDAR detectors against domain shifts. Specifically, SPG generates semantic points at the predicted foreground regions and faithfully recovers missing parts of the foreground objects, which are caused by phenomena such as occlusions, low reflectance or weather interference. By merging the semantic points with the original points, we obtain an augmented point cloud, which can be directly consumed by modern LiDAR-based detectors. To validate the wide applicability of SPG, we experiment with two representative detectors, PointPillars and PV-RCNN. On the UDA task, SPG significantly improves both detectors across all object categories of interest and at all difficulty levels. SPG can also benefit object detection in the original domain. On the Waymo Open Dataset and KITTI, SPG improves 3D detection results of these two methods across all categories. Combined with PV-RCNN, SPG achieves state-of-the-art 3D detection results on KITTI.
Submitted 15 August, 2021;
originally announced August 2021.
-
Offboard 3D Object Detection from Point Cloud Sequences
Authors:
Charles R. Qi,
Yin Zhou,
Mahyar Najibi,
Pei Sun,
Khoa Vo,
Boyang Deng,
Dragomir Anguelov
Abstract:
While current 3D object recognition research mostly focuses on the real-time, onboard scenario, there are many offboard use cases of perception that are largely under-explored, such as using machines to automatically generate high-quality 3D labels. Existing 3D object detectors fail to satisfy the high-quality requirement for offboard uses due to the limited input and speed constraints. In this paper, we propose a novel offboard 3D object detection pipeline using point cloud sequence data. Observing that different frames capture complementary views of objects, we design the offboard detector to make use of the temporal points through both multi-frame object detection and novel object-centric refinement models. Evaluated on the Waymo Open Dataset, our pipeline named 3D Auto Labeling shows significant gains compared to the state-of-the-art onboard detectors and our offboard baselines. Its performance is even on par with human labels verified through a human label study. Further experiments demonstrate the application of auto labels for semi-supervised learning and provide extensive analysis to validate various design choices.
Submitted 8 March, 2021;
originally announced March 2021.
-
Modeling, Vibration Control, and Trajectory Tracking of a Kinematically Constrained Planar Hybrid Cable-Driven Parallel Robot
Authors:
Ronghuai Qi,
Amir Khajepour,
William W. Melek
Abstract:
This paper presents a kinematically constrained planar hybrid cable-driven parallel robot (HCDPR) for warehousing applications as well as other potential applications such as rehabilitation. The proposed HCDPR can harness the strengths and benefits of serial and cable-driven parallel robots. Based on this robotic platform, the goal in this paper is to develop an integrated control system to reduce vibrations and improve the trajectory accuracy and performance of the HCDPR, including deriving kinematic and dynamic equations, proposing solutions for redundancy resolution and optimization of stiffness, and developing two motion and vibration control strategies (controllers I and II). Finally, different case studies are conducted to evaluate the control performance, and the results show that controller II achieves the goal better.
Submitted 27 December, 2020;
originally announced December 2020.
-
Workspace Analysis and Optimal Design of Cable-Driven Parallel Robots via Auxiliary Counterbalances
Authors:
Ronghuai Qi,
Hamed Jamshidifar,
Amir Khajepour
Abstract:
Cable-driven parallel robots (CDPRs) are widely investigated and applied worldwide; however, traditional configurations prevent them from reaching their maximum workspace due to constraints such as the maximum allowable tensions of cables. In this paper, we introduce auxiliary counterbalances to tackle this problem and focus on workspace analysis and optimal design of CDPRs with such systems. In addition, kinematics, dynamics, and parameter optimization formulas and an algorithm are provided to maximize the reachable workspace of CDPRs. Case studies for different configurations are presented and discussed. Numerical results suggest the effectiveness of the aforementioned approaches, and the obtained parameters can also be applied to the design of actual CDPRs.
Submitted 22 December, 2020;
originally announced December 2020.
-
Argument Mining Driven Analysis of Peer-Reviews
Authors:
Michael Fromm,
Evgeniy Faerman,
Max Berrendorf,
Siddharth Bhargava,
Ruoxia Qi,
Yao Zhang,
Lukas Dennert,
Sophia Selle,
Yang Mao,
Thomas Seidl
Abstract:
Peer reviewing is a central process in modern research and essential for ensuring high quality and reliability of published work. At the same time, it is a time-consuming process and increasing interest in emerging fields often results in a high review workload, especially for senior researchers in this area. How to cope with this problem is an open question and it is vividly discussed across all major conferences. In this work, we propose an Argument Mining based approach for the assistance of editors, meta-reviewers, and reviewers. We demonstrate that the decision process in the field of scientific publications is driven by arguments and automatic argument identification is helpful in various use-cases. One of our findings is that arguments used in the peer-review process differ from arguments in other domains making the transfer of pre-trained models difficult. Therefore, we provide the community with a new peer-review dataset from different computer science conferences with annotated arguments. In our extensive empirical evaluation, we show that Argument Mining can be used to efficiently extract the most relevant parts from reviews, which are paramount for the publication decision. The process remains interpretable since the extracted arguments can be highlighted in a review without detaching them from their context.
Submitted 10 December, 2020;
originally announced December 2020.
-
Redundancy Resolution and Disturbance Rejection via Torque Optimization in Hybrid Cable-Driven Robots
Authors:
Ronghuai Qi,
Amir Khajepour,
William W. Melek
Abstract:
This paper presents redundancy resolution and disturbance rejection via torque optimization in Hybrid Cable-Driven Robots (HCDRs). To begin with, we initiate a redundant HCDR for nonlinear whole-body system modeling and model reduction. Based on the reduced dynamic model, two new methods are proposed to solve the redundancy resolution problem: joint-space torque optimization for actuated joints (TOAJ) and joint-space torque optimization for actuated and unactuated joints (TOAUJ), and they can be extended to other HCDRs. Compared to the existing approaches, this paper provides the first solution (TOAUJ-based method) for HCDRs that can solve the redundancy resolution problem as well as disturbance rejection. Additionally, this paper develops detailed algorithms targeting TOAJ and TOAUJ implementation. A simple yet effective controller is designed for generated data analysis and validation. Case studies are conducted to evaluate the performance of TOAJ and TOAUJ, and the results suggest the effectiveness of the aforementioned approaches.
Submitted 24 November, 2020;
originally announced November 2020.
-
Emora: An Inquisitive Social Chatbot Who Cares For You
Authors:
Sarah E. Finch,
James D. Finch,
Ali Ahmadvand,
Ingyu Choi,
Xiangjue Dong,
Ruixiang Qi,
Harshita Sahijwani,
Sergey Volokhin,
Zihan Wang,
Zihao Wang,
Jinho D. Choi
Abstract:
Inspired by studies on the overwhelming presence of experience-sharing in human-human conversations, Emora, the social chatbot developed by Emory University, aims to bring such experience-focused interaction to the current field of conversational AI. The traditional approach of information-sharing topic handlers is balanced with a focus on opinion-oriented exchanges that Emora delivers, and new conversational abilities are developed that support dialogues that consist of a collaborative understanding and learning process of the partner's life experiences. We present a curated dialogue system that leverages highly expressive natural language templates, powerful intent classification, and ontology resources to provide an engaging and interesting conversational experience to every user.
Submitted 9 September, 2020;
originally announced September 2020.
-
PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding
Authors:
Saining Xie,
Jiatao Gu,
Demi Guo,
Charles R. Qi,
Leonidas J. Guibas,
Or Litany
Abstract:
Arguably one of the top success stories of deep learning is transfer learning. The finding that pre-training a network on a rich source set (e.g., ImageNet) can help boost performance once fine-tuned on a usually much smaller target set has been instrumental to many applications in language and vision. Yet, very little is known about its usefulness in 3D point cloud understanding. We see this as an opportunity considering the effort required for annotating data in 3D. In this work, we aim at facilitating research on 3D representation learning. Different from previous works, we focus on high-level scene understanding tasks. To this end, we select a suite of diverse datasets and tasks to measure the effect of unsupervised pre-training on a large source set of 3D scenes. Our findings are extremely encouraging: using a unified triplet of architecture, source dataset, and contrastive loss for pre-training, we achieve improvement over recent best results in segmentation and detection across 6 different benchmarks for indoor and outdoor, real and synthetic datasets -- demonstrating that the learned representation can generalize across domains. Furthermore, the improvement was similar to supervised pre-training, suggesting that future efforts should favor scaling data collection over more detailed annotation. We hope these findings will encourage more research on unsupervised pretext task design for 3D deep learning.
Submitted 20 November, 2020; v1 submitted 21 July, 2020;
originally announced July 2020.
-
Object-Centric Multi-View Aggregation
Authors:
Shubham Tulsiani,
Or Litany,
Charles R. Qi,
He Wang,
Leonidas J. Guibas
Abstract:
We present an approach for aggregating a sparse set of views of an object in order to compute a semi-implicit 3D representation in the form of a volumetric feature grid. Key to our approach is an object-centric canonical 3D coordinate system into which views can be lifted, without explicit camera pose estimation, and then combined -- in a manner that can accommodate a variable number of views and is view order independent. We show that computing a symmetry-aware mapping from pixels to the canonical coordinate system allows us to better propagate information to unseen regions, as well as to robustly overcome pose ambiguities during inference. Our aggregate representation enables us to perform 3D inference tasks like volumetric reconstruction and novel view synthesis, and we use these tasks to demonstrate the benefits of our aggregation approach as compared to implicit or camera-centric alternatives.
Submitted 21 July, 2020; v1 submitted 20 July, 2020;
originally announced July 2020.
-
ImVoteNet: Boosting 3D Object Detection in Point Clouds with Image Votes
Authors:
Charles R. Qi,
Xinlei Chen,
Or Litany,
Leonidas J. Guibas
Abstract:
3D object detection has seen quick progress thanks to advances in deep learning on point clouds. A few recent works have even shown state-of-the-art performance with just point clouds input (e.g. VoteNet). However, point cloud data have inherent limitations. They are sparse, lack color information and often suffer from sensor noise. Images, on the other hand, have high resolution and rich texture. Thus they can complement the 3D geometry provided by point clouds. Yet how to effectively use image information to assist point cloud based detection is still an open question. In this work, we build on top of VoteNet and propose a 3D detection architecture called ImVoteNet specialized for RGB-D scenes. ImVoteNet is based on fusing 2D votes in images and 3D votes in point clouds. Compared to prior work on multi-modal detection, we explicitly extract both geometric and semantic features from the 2D images. We leverage camera parameters to lift these features to 3D. To improve the synergy of 2D-3D feature fusion, we also propose a multi-tower training scheme. We validate our model on the challenging SUN RGB-D dataset, advancing state-of-the-art results by 5.7 mAP. We also provide rich ablation studies to analyze the contribution of each design choice.
Submitted 29 January, 2020;
originally announced January 2020.
-
Generalized Flexible Hybrid Cable-Driven Robot (HCDR): Modeling, Control, and Analysis
Authors:
Ronghuai Qi,
Amir Khajepour,
William W. Melek
Abstract:
This paper presents a generalized flexible Hybrid Cable-Driven Robot (HCDR). For the proposed HCDR, the derivation of the equations of motion and proof provide a very effective way to find items for generalized system modeling. The proposed dynamic modeling approach avoids the drawback of traditional methods and can be easily extended to other types of hybrid robots, such as a robot arm mounted on an aircraft platform.
Additionally, another goal of this paper is to develop integrated control systems to reduce vibrations and improve the accuracy and performance of the HCDR. To achieve this goal, redundancy resolution, stiffness optimization, and control strategies are studied. The proposed optimization problem and algorithm address the limitations of existing stiffness optimization approaches. Three types of control architecture are proposed, and their performances (i.e., reducing undesirable vibrations and trajectory tracking errors, especially for the end-effector) are evaluated using several well-designed case studies. Results show that the fully integrated control strategy can improve the tracking performance of the end-effector significantly.
Submitted 3 April, 2020; v1 submitted 14 November, 2019;
originally announced November 2019.
-
Deep Hough Voting for 3D Object Detection in Point Clouds
Authors:
Charles R. Qi,
Or Litany,
Kaiming He,
Leonidas J. Guibas
Abstract:
Current 3D object detection methods are heavily influenced by 2D detectors. In order to leverage architectures in 2D detectors, they often convert 3D point clouds to regular grids (i.e., to voxel grids or to bird's eye view images), or rely on detection in 2D images to propose 3D boxes. Few works have attempted to directly detect objects in point clouds. In this work, we return to first principles to construct a 3D detection pipeline for point cloud data that is as generic as possible. However, due to the sparse nature of the data -- samples from 2D manifolds in 3D space -- we face a major challenge when directly predicting bounding box parameters from scene points: a 3D object centroid can be far from any surface point and thus hard to regress accurately in one step. To address the challenge, we propose VoteNet, an end-to-end 3D object detection network based on a synergy of deep point set networks and Hough voting. Our model achieves state-of-the-art 3D detection on two large datasets of real 3D scans, ScanNet and SUN RGB-D, with a simple design, compact model size and high efficiency. Remarkably, VoteNet outperforms previous methods by using purely geometric information without relying on color images.
Submitted 22 August, 2019; v1 submitted 21 April, 2019;
originally announced April 2019.
-
KPConv: Flexible and Deformable Convolution for Point Clouds
Authors:
Hugues Thomas,
Charles R. Qi,
Jean-Emmanuel Deschaud,
Beatriz Marcotegui,
François Goulette,
Leonidas J. Guibas
Abstract:
We present Kernel Point Convolution (KPConv), a new design of point convolution, i.e. that operates on point clouds without any intermediate representation. The convolution weights of KPConv are located in Euclidean space by kernel points, and applied to the input points close to them. Its capacity to use any number of kernel points gives KPConv more flexibility than fixed grid convolutions. Furthermore, these locations are continuous in space and can be learned by the network. Therefore, KPConv can be extended to deformable convolutions that learn to adapt kernel points to local geometry. Thanks to a regular subsampling strategy, KPConv is also efficient and robust to varying densities. Whether they use deformable KPConv for complex tasks, or rigid KPConv for simpler tasks, our networks outperform state-of-the-art classification and segmentation approaches on several datasets. We also offer ablation studies and visualizations to provide understanding of what has been learned by KPConv and to validate the descriptive power of deformable KPConv.
Submitted 19 August, 2019; v1 submitted 18 April, 2019;
originally announced April 2019.
-
Generating 3D Adversarial Point Clouds
Authors:
Chong Xiang,
Charles R. Qi,
Bo Li
Abstract:
Deep neural networks are known to be vulnerable to adversarial examples, which are carefully crafted instances that cause the models to make wrong predictions. While adversarial examples for 2D images and CNNs have been extensively studied, less attention has been paid to 3D data such as point clouds. Given many safety-critical 3D applications such as autonomous driving, it is important to study how adversarial point clouds could affect current deep 3D models. In this work, we propose several novel algorithms to craft adversarial point clouds against PointNet, a widely used deep neural network for point cloud processing. Our algorithms work in two ways: adversarial point perturbation and adversarial point generation. For point perturbation, we shift existing points negligibly. For point generation, we generate either a set of independent and scattered points or a small number (1-3) of point clusters with meaningful shapes such as balls and airplanes which could be hidden in the human psyche. In addition, we formulate six perturbation measurement metrics tailored to the attacks in point clouds and conduct extensive experiments to evaluate the proposed algorithms on the ModelNet40 3D shape classification dataset. Overall, our attack algorithms achieve a success rate higher than 99% for all targeted attacks.
Submitted 12 July, 2019; v1 submitted 19 September, 2018;
originally announced September 2018.
-
FlowNet3D: Learning Scene Flow in 3D Point Clouds
Authors:
Xingyu Liu,
Charles R. Qi,
Leonidas J. Guibas
Abstract:
Many applications in robotics and human-computer interaction can benefit from understanding 3D motion of points in a dynamic environment, widely known as scene flow. While most previous methods focus on stereo and RGB-D images as input, few try to estimate scene flow directly from point clouds. In this work, we propose a novel deep neural network named FlowNet3D that learns scene flow from point clouds in an end-to-end fashion. Our network simultaneously learns deep hierarchical features of point clouds and flow embeddings that represent point motions, supported by two newly proposed learning layers for point sets. We evaluate the network on both challenging synthetic data from FlyingThings3D and real Lidar scans from KITTI. Trained on synthetic data only, our network successfully generalizes to real scans, outperforming various baselines and showing competitive results to the prior art. We also demonstrate two applications of our scene flow output (scan registration and motion segmentation) to show its potential wide use cases.
Submitted 21 July, 2019; v1 submitted 4 June, 2018;
originally announced June 2018.
-
Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks
Authors:
Zhihao Jia,
Sina Lin,
Charles R. Qi,
Alex Aiken
Abstract:
The past few years have witnessed growth in the computational requirements for training deep convolutional neural networks. Current approaches parallelize training onto multiple devices by applying a single parallelization strategy (e.g., data or model parallelism) to all layers in a network. Although easy to reason about, these approaches result in suboptimal runtime performance in large-scale distributed training, since different layers in a network may prefer different parallelization strategies. In this paper, we propose layer-wise parallelism that allows each layer in a network to use an individual parallelization strategy. We jointly optimize how each layer is parallelized by solving a graph search problem. Our evaluation shows that layer-wise parallelism outperforms state-of-the-art approaches by increasing training throughput, reducing communication costs, achieving better scalability to multiple GPUs, while maintaining original network accuracy.
Submitted 9 June, 2018; v1 submitted 13 February, 2018;
originally announced February 2018.
-
Frustum PointNets for 3D Object Detection from RGB-D Data
Authors:
Charles R. Qi,
Wei Liu,
Chenxia Wu,
Hao Su,
Leonidas J. Guibas
Abstract:
In this work, we study 3D object detection from RGB-D data in both indoor and outdoor scenes. While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects. Benefiting from learning directly in raw point clouds, our method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points. Evaluated on KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins while having real-time capability.
Submitted 12 April, 2018; v1 submitted 22 November, 2017;
originally announced November 2017.
-
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space
Authors:
Charles R. Qi,
Li Yi,
Hao Su,
Leonidas J. Guibas
Abstract:
Few prior works study deep learning on point sets. PointNet by Qi et al. is a pioneer in this direction. However, by design PointNet does not capture local structures induced by the metric space points live in, limiting its ability to recognize fine-grained patterns and generalizability to complex scenes. In this work, we introduce a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set. By exploiting metric space distances, our network is able to learn local features with increasing contextual scales. With further observation that point sets are usually sampled with varying densities, which results in greatly decreased performance for networks trained on uniform densities, we propose novel set learning layers to adaptively combine features from multiple scales. Experiments show that our network called PointNet++ is able to learn deep point set features efficiently and robustly. In particular, results significantly better than state-of-the-art have been obtained on challenging benchmarks of 3D point clouds.
Submitted 7 June, 2017;
originally announced June 2017.
-
PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
Authors:
Charles R. Qi,
Hao Su,
Kaichun Mo,
Leonidas J. Guibas
Abstract:
Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds and well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par or even better than state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption.
Submitted 10 April, 2017; v1 submitted 2 December, 2016;
originally announced December 2016.
-
Shape Completion using 3D-Encoder-Predictor CNNs and Shape Synthesis
Authors:
Angela Dai,
Charles Ruizhongtai Qi,
Matthias Nießner
Abstract:
We introduce a data-driven approach to complete partial 3D shapes through a combination of volumetric deep neural networks and 3D shape synthesis. From a partially-scanned input shape, our method first infers a low-resolution -- but complete -- output. To this end, we introduce a 3D-Encoder-Predictor Network (3D-EPN) which is composed of 3D convolutional layers. The network is trained to predict and fill in missing data, and operates on an implicit surface representation that encodes both known and unknown space. This allows us to predict global structure in unknown areas at high accuracy. We then correlate these intermediary results with 3D geometry from a shape database at test time. In a final pass, we propose a patch-based 3D shape synthesis method that imposes the 3D geometry from these retrieved shapes as constraints on the coarsely-completed mesh. This synthesis process enables us to reconstruct fine-scale detail and generate high-resolution output while respecting the global mesh structure obtained by the 3D-EPN. Although our 3D-EPN outperforms state-of-the-art completion methods, the main contribution in our work lies in the combination of a data-driven shape predictor and analytic 3D shape synthesis. In our results, we show extensive evaluations on a newly-introduced shape completion benchmark for both real-world and synthetic data.
Submitted 11 April, 2017; v1 submitted 30 November, 2016;
originally announced December 2016.
-
FPNN: Field Probing Neural Networks for 3D Data
Authors:
Yangyan Li,
Soeren Pirk,
Hao Su,
Charles R. Qi,
Leonidas J. Guibas
Abstract:
Building discriminative representations for 3D data has been an important task in computer graphics and computer vision research. Convolutional Neural Networks (CNNs) have been shown to operate on 2D images with great success for a variety of tasks. Lifting convolution operators to 3D (3D CNNs) seems like a plausible and promising next step. Unfortunately, the computational complexity of 3D CNNs grows cubically with respect to voxel resolution. Moreover, since most 3D geometry representations are boundary based, occupied regions do not increase proportionately with the size of the discretization, resulting in wasted computation. In this work, we represent 3D spaces as volumetric fields, and propose a novel design that employs field probing filters to efficiently extract features from them. Each field probing filter is a set of probing points -- sensors that perceive the space. Our learning algorithm optimizes not only the weights associated with the probing points, but also their locations, which deforms the shape of the probing filters and adaptively distributes them in 3D space. The optimized probing points sense the 3D space "intelligently", rather than operating blindly over the entire domain. We show that field probing is significantly more efficient than 3D CNNs, while providing state-of-the-art performance, on classification tasks for 3D object recognition benchmark datasets.
Submitted 24 October, 2016; v1 submitted 20 May, 2016;
originally announced May 2016.
-
Volumetric and Multi-View CNNs for Object Classification on 3D Data
Authors:
Charles R. Qi,
Hao Su,
Matthias Niessner,
Angela Dai,
Mengyuan Yan,
Leonidas J. Guibas
Abstract:
3D shape models are becoming widely available and easier to capture, making available 3D information crucial for progress in object classification. Current state-of-the-art methods rely on CNNs to address this problem. Recently, we witness two types of CNNs being developed: CNNs based upon volumetric representations versus CNNs based upon multi-view representations. Empirical results from these two types of CNNs exhibit a large gap, indicating that existing volumetric CNN architectures and approaches are unable to fully exploit the power of 3D representations. In this paper, we aim to improve both volumetric CNNs and multi-view CNNs according to extensive analysis of existing approaches. To this end, we introduce two distinct network architectures of volumetric CNNs. In addition, we examine multi-view CNNs, where we introduce multi-resolution filtering in 3D. Overall, we are able to outperform current state-of-the-art methods for both volumetric CNNs and multi-view CNNs. We provide extensive experiments designed to evaluate underlying design choices, thus providing a better understanding of the space of methods available for object classification on 3D data.
Submitted 29 April, 2016; v1 submitted 12 April, 2016;
originally announced April 2016.
-
Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views
Authors:
Hao Su,
Charles R. Qi,
Yangyan Li,
Leonidas Guibas
Abstract:
Object viewpoint estimation from 2D images is an essential task in computer vision. However, two issues hinder its progress: scarcity of training data with viewpoint annotations, and a lack of powerful features. Inspired by the growing availability of 3D models, we propose a framework to address both issues by combining render-based image synthesis and CNNs. We believe that 3D models have the potential to generate a large number of images of high variation, which can be well exploited by deep CNNs with a high learning capacity. Towards this goal, we propose a scalable and overfit-resistant image synthesis pipeline, together with a novel CNN specifically tailored for the viewpoint estimation task. Experimentally, we show that the viewpoint estimation from our pipeline can significantly outperform state-of-the-art methods on the PASCAL 3D+ benchmark.
Submitted 21 May, 2015;
originally announced May 2015.
-
High Level Path Planning with Uncertainty
Authors:
Runping Qi,
David L. Poole
Abstract:
For high level path planning, environments are usually modeled as distance graphs, and path planning problems are reduced to computing the shortest path in distance graphs. One major drawback of this modeling is the inability to model uncertainties, which are often encountered in practice. In this paper, a new tool, called U-graph, is proposed for environment modeling. A U-graph is an extension of distance graphs with the ability to handle a kind of uncertainty. By modeling an uncertain environment as a U-graph, and a navigation problem as a Markovian decision process, we can precisely define a new optimality criterion for navigation plans, and more importantly, we can come up with a general algorithm for computing optimal plans for navigation tasks.
Submitted 20 March, 2013;
originally announced March 2013.