Showing 1–50 of 190 results for author: Shi, J

Searching in archive eess.
  1. arXiv:2405.05244  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

    Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan

    Abstract: The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specializ… ▽ More

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: Evaluation plan of the SVDD Challenge @ SLT 2024

  2. arXiv:2405.03482  [pdf]

    eess.SY

    Managing Renewable Energy Resources Using Equity-Market Risk Tools - the Efficient Frontiers

    Authors: Haim Grebel, Divya Vikas, Jim Shi

    Abstract: The energy market, and specifically the renewable sector, carries volatility and risks, similar to the financial market. Here, we leverage a well-established, return-risk approach, commonly used by equity portfolio-managers and apply it to energy resources. We visualize the relationship between the resources' costs and their risks in terms of efficient frontiers. We apply this analysis to public… ▽ More

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: 9 pages, 3 figures, 10 ref

  3. arXiv:2404.09385  [pdf, other]

    eess.AS cs.CL eess.SP

    A Large-Scale Evaluation of Speech Foundation Models

    Authors: Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li, Abdelrahman Mohamed, Shinji Watanabe, Hung-yi Lee

    Abstract: The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work,… ▽ More

    Submitted 29 May, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

    Comments: The extended journal version for SUPERB and SUPERB-SG. Published in IEEE/ACM TASLP. The Arxiv version is preferred

  4. arXiv:2403.07938  [pdf, other]

    cs.SD cs.AI cs.CV cs.LG cs.MM eess.AS

    Text-to-Audio Generation Synchronized with Videos

    Authors: Shentong Mo, Jing Shi, Yapeng Tian

    Abstract: In recent times, the focus on text-to-audio (TTA) generation has intensified, as researchers strive to synthesize audio from textual descriptions. However, most existing methods, though leveraging latent diffusion models to learn the correlation between audio and text embeddings, fall short when it comes to maintaining a seamless synchronization between the produced audio and its video. This often… ▽ More

    Submitted 8 March, 2024; originally announced March 2024.

    Comments: arXiv admin note: text overlap with arXiv:2305.12903

  5. arXiv:2402.10505  [pdf, other]

    eess.SY math.OC

    A Survey of Resilient Coordination for Cyber-Physical Systems Against Malicious Attacks

    Authors: Zirui Liao, Jian Shi, Yuwei Zhang, Shaoping Wang, Zhiyong Sun

    Abstract: Cyber-physical systems (CPSs) facilitate the integration of physical entities and cyber infrastructures through the utilization of pervasive computational resources and communication units, leading to improved efficiency, automation, and practical viability in both academia and industry. Due to its openness and distributed characteristics, a critical issue prevalent in CPSs is to guarantee resilie… ▽ More

    Submitted 16 February, 2024; originally announced February 2024.

    Comments: 35 pages, 7 figures, 5 tables

  6. arXiv:2402.02724  [pdf, other]

    eess.IV cs.CV cs.LG

    FDNet: Frequency Domain Denoising Network For Cell Segmentation in Astrocytes Derived From Induced Pluripotent Stem Cells

    Authors: Haoran Li, Jiahua Shi, Huaming Chen, Bo Du, Simon Maksour, Gabrielle Phillips, Mirella Dottori, Jun Shen

    Abstract: Artificially generated induced pluripotent stem cells (iPSCs) from somatic cells play an important role for disease modeling and drug screening of neurodegenerative diseases. Astrocytes differentiated from iPSCs are important targets to investigate neuronal metabolism. The astrocyte differentiation progress can be monitored through the variations of morphology observed from microscopy images at di… ▽ More

    Submitted 4 February, 2024; originally announced February 2024.

    Comments: Accepted by The IEEE International Symposium on Biomedical Imaging (ISBI) 2024

  7. arXiv:2401.17619  [pdf, other]

    cs.SD eess.AS

    Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and KiSing-v2

    Authors: Jiatong Shi, Yueqian Lin, Xinyi Bai, Keyi Zhang, Yuning Wu, Yuxun Tang, Yifeng Yu, Qin Jin, Shinji Watanabe

    Abstract: In singing voice synthesis (SVS), generating singing voices from musical scores faces challenges due to limited data availability, a constraint less common in text-to-speech (TTS). This study proposes a new approach to address this data scarcity. We utilize an existing singing voice synthesizer for data augmentation and apply precise manual tuning to reduce unnatural voice synthesis. Our developme… ▽ More

    Submitted 31 January, 2024; originally announced January 2024.

  8. arXiv:2401.17230  [pdf, other]

    cs.SD cs.AI eess.AS

    ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models

    Authors: Jee-weon Jung, Wangyou Zhang, Jiatong Shi, Zakaria Aldeneh, Takuya Higuchi, Barry-John Theobald, Ahmed Hussen Abdelaziz, Shinji Watanabe

    Abstract: This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to recent SKA-TDNN. Through the modularized architecture design, variants can be developed easily. We also… ▽ More

    Submitted 30 January, 2024; originally announced January 2024.

    Comments: 5 pages, 3 figures, 7 tables

  9. arXiv:2401.16658  [pdf, ps, other]

    cs.CL eess.AS

    OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

    Authors: Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe

    Abstract: Recent studies have advocated for fully open foundation models to promote transparency and open science. As an initial step, the Open Whisper-style Speech Model (OWSM) reproduced OpenAI's Whisper using publicly available data and open-source toolkits. With the aim of reproducing Whisper, the previous OWSM v1 through v3 models were still based on Transformer, which might lead to inferior performanc… ▽ More

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: Project webpage: https://www.wavlab.org/activities/2024/owsm/

  10. arXiv:2312.16998  [pdf, other]

    eess.IV cs.CV

    Deep Unfolding Network with Spatial Alignment for multi-modal MRI reconstruction

    Authors: Hao Zhang, Qi Wang, Jun Shi, Shihui Ying, Zhijie Wen

    Abstract: Multi-modal Magnetic Resonance Imaging (MRI) offers complementary diagnostic information, but some modalities are limited by the long scanning time. To accelerate the whole acquisition process, MRI reconstruction of one modality from highly undersampled k-space data with another fully-sampled reference modality is an efficient solution. However, the misalignment between modalities, which is common… ▽ More

    Submitted 28 December, 2023; originally announced December 2023.

  11. arXiv:2312.15424  [pdf, other]

    eess.SY

    Integrating Renewable Energy Sources as Reserve Providers: Modeling, Pricing, and Properties

    Authors: Wenli Wu, Ye Guo, Jiantao Shi

    Abstract: In pursuit of carbon neutrality, many countries have adopted renewable portfolio standards to facilitate the integration of renewable energy. However, increasing penetration of renewable energy resources will also pose higher requirements on system flexibility. Allowing renewables themselves to participate in the reserve market could be a viable solution. To this end, this paper proposes an optimal… ▽ More

    Submitted 24 December, 2023; originally announced December 2023.

    Comments: 13 pages, 5 figures

  12. arXiv:2312.06995  [pdf, other]

    cs.CV eess.IV

    Transformer-based No-Reference Image Quality Assessment via Supervised Contrastive Learning

    Authors: Jinsong Shi, Pan Gao, Jie Qin

    Abstract: Image Quality Assessment (IQA) has long been a research hotspot in the field of image processing, especially No-Reference Image Quality Assessment (NR-IQA). Due to the powerful feature extraction ability, existing Convolution Neural Network (CNN) and Transformers based NR-IQA methods have achieved considerable progress. However, they still exhibit limited capability when facing unknown authentic d… ▽ More

    Submitted 12 December, 2023; originally announced December 2023.

    Comments: Accepted by AAAI24

  13. arXiv:2312.06668  [pdf]

    cs.CL cs.SD eess.AS

    Evaluating Self-supervised Speech Models on a Taiwanese Hokkien Corpus

    Authors: Yi-Hui Chou, Kalvin Chang, Meng-Ju Wu, Winston Ou, Alice Wen-Hsin Bi, Carol Yang, Bryan Y. Chen, Rong-Wei Pai, Po-Yen Yeh, Jo-Peng Chiang, Iu-Tshian Phoann, Winnie Chang, Chenxuan Cui, Noel Chen, Jiatong Shi

    Abstract: Taiwanese Hokkien is declining in use and status due to a language shift towards Mandarin in Taiwan. This is partly why it is a low resource language in NLP and speech research today. To ensure that the state of the art in speech processing does not leave Taiwanese Hokkien behind, we contribute a 1.5-hour dataset of Taiwanese Hokkien to ML-SUPERB's hidden set. Evaluating ML-SUPERB's suite of self-… ▽ More

    Submitted 5 December, 2023; originally announced December 2023.

    Comments: Accepted to ASRU 2023

  14. arXiv:2312.06466  [pdf, other]

    cs.SD eess.AS

    Towards Domain-Specific Cross-Corpus Speech Emotion Recognition Approach

    Authors: Yan Zhao, Yuan Zong, Hailun Lian, Cheng Lu, Jingang Shi, Wenming Zheng

    Abstract: Cross-corpus speech emotion recognition (SER) poses a challenge due to feature distribution mismatch, potentially degrading the performance of established SER methods. In this paper, we tackle this challenge by proposing a novel transfer subspace learning method called acoustic knowledge-guided transfer linear regression (AKTLR). Unlike existing approaches, which often overlook domain-specific know… ▽ More

    Submitted 11 December, 2023; originally announced December 2023.

  15. arXiv:2312.03376  [pdf, other]

    eess.SY

    Beacon-enabled TDMA Ultraviolet Communication Network System Design and Realization

    Authors: Yuchen Pan, Fei Long, Ping Li, Haotian Shi, Jiazhao Shi, Hanlin Xiao, Chen Gong, Zhengyuan Xu

    Abstract: Non-line-of-sight (NLOS) ultraviolet (UV) scattering communication can serve as a good candidate for outdoor optical wireless communication (OWC) in the cases of non-perfect transmitter-receiver alignment and radio silence. We design and demonstrate a NLOS UV scattering communication network system in this paper, where a beacon-enabled time division multiple access (TDMA) scheme is adopted. In our… ▽ More

    Submitted 15 April, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

  16. arXiv:2310.13208  [pdf]

    eess.SY

    Online energy management system for a fuel cell/battery hybrid system with multiple fuel cell stacks

    Authors: Junzhe Shi, Ulf Jakob Flø Aarsnes, Dagfinn Nærheim, Scott Moura

    Abstract: In recent years, fuel cell/battery hybrid systems have attracted substantial attention due to their high energy density and low emissions. The online energy management system (EMS) is essential for these hybrid systems, tasked with controlling the energy flow and ensuring optimal system performance, encompassing fuel efficiency and mitigating fuel cell and battery degradation. This research propos… ▽ More

    Submitted 19 October, 2023; originally announced October 2023.

  17. arXiv:2310.05513  [pdf, other]

    cs.SD cs.CL eess.AS

    Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond

    Authors: Jiatong Shi, William Chen, Dan Berrebbi, Hsiu-Hsuan Wang, Wei-Ping Huang, En-Pei Hu, Ho-Lam Chuang, Xuankai Chang, Yuxun Tang, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Shinji Watanabe

    Abstract: The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework, emphasizing self-supervised models in multilingual speech recognition and language identification. The challenge comprises a research track focused on applying ML-SUPERB to specific multilingual subjects, a Challenge Track for model submissions, and a New Language Track w… ▽ More

    Submitted 9 October, 2023; originally announced October 2023.

    Comments: Accepted by ASRU

  18. arXiv:2310.05369  [pdf, other]

    cs.SD eess.AS

    AdvSV: An Over-the-Air Adversarial Attack Dataset for Speaker Verification

    Authors: Li Wang, Jiaqi Li, Yuhao Luo, Jiahao Zheng, Lei Wang, Hao Li, Ke Xu, Chengfang Fang, Jie Shi, Zhizheng Wu

    Abstract: It is known that deep neural networks are vulnerable to adversarial attacks. Although Automatic Speaker Verification (ASV) built on top of deep neural networks exhibits robust performance in controlled scenarios, many studies confirm that ASV is vulnerable to adversarial attacks. The lack of a standard dataset is a bottleneck for further research, especially reproducible research. In this study, w… ▽ More

    Submitted 16 January, 2024; v1 submitted 8 October, 2023; originally announced October 2023.

    Comments: Accepted by ICASSP2024

  19. arXiv:2310.03938  [pdf, other]

    cs.SD eess.AS

    EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Multilingual and Low Resource Scenarios

    Authors: Tejes Srivastava, Jiatong Shi, William Chen, Shinji Watanabe

    Abstract: Self-Supervised Learning (SSL) models have demonstrated exceptional performance in various speech tasks, particularly in low-resource and multilingual domains. Recent works show that fusing SSL models could achieve superior performance compared to using one SSL model. However, fusion models have increased model parameter size, leading to longer inference times. In this paper, we propose a novel ap… ▽ More

    Submitted 5 October, 2023; originally announced October 2023.

    Comments: 7 pages, 2 figures, 7 tables

  20. arXiv:2310.02720  [pdf, other]

    cs.SD eess.AS

    Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction

    Authors: Jiatong Shi, Hirofumi Inaguma, Xutai Ma, Ilia Kulikov, Anna Sun

    Abstract: Existing Self-Supervised Learning (SSL) models for speech typically process speech signals at a fixed resolution of 20 milliseconds. This approach overlooks the varying informational content present at different resolutions in speech signals. In contrast, this paper aims to incorporate multi-resolution information into speech self-supervised representation learning. We introduce a SSL model that l… ▽ More

    Submitted 30 January, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Accepted at ICLR2024 as spotlight

  21. arXiv:2310.00704  [pdf, other]

    cs.SD eess.AS

    UniAudio: An Audio Foundation Model Toward Universal Audio Generation

    Authors: Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, Zhou Zhao, Shinji Watanabe, Helen Meng

    Abstract: Large Language models (LLM) have demonstrated the capability to handle a variety of generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific approaches, leverages LLM techniques to generate multiple types of audio (including speech, sounds, music, and singing) with given input conditions. UniAudio 1) first tokenizes all types of target audio along with other con… ▽ More

    Submitted 11 December, 2023; v1 submitted 1 October, 2023; originally announced October 2023.

  22. arXiv:2309.15800  [pdf, other]

    cs.CL cs.SD eess.AS

    Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

    Authors: Xuankai Chang, Brian Yan, Kwanghee Choi, Jeeweon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang

    Abstract: Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning repre… ▽ More

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Submitted to IEEE ICASSP 2024

  23. arXiv:2309.15317  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning

    Authors: William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, Shinji Watanabe

    Abstract: Multilingual self-supervised learning (SSL) has often lagged behind state-of-the-art (SOTA) methods due to the expenses and complexity required to handle many languages. This further harms the reproducibility of SSL, which is already limited to few research groups due to its resource usage. We show that more powerful techniques can actually lead to more efficient pre-training, opening SSL to more… ▽ More

    Submitted 27 September, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted to ASRU 2023

  24. arXiv:2309.13876  [pdf, other]

    cs.CL cs.SD eess.AS

    Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

    Authors: Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe

    Abstract: Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessib… ▽ More

    Submitted 24 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted at ASRU 2023

  25. arXiv:2309.13755  [pdf, other]

    eess.SY

    Efficient Recursive Data-enabled Predictive Control (Extended Version)

    Authors: Jicheng Shi, Yingzhao Lian, Colin N. Jones

    Abstract: In the field of model predictive control, Data-enabled Predictive Control (DeePC) offers direct predictive control, bypassing traditional modeling. However, challenges emerge with increased computational demand due to recursive data updates. This paper introduces a novel recursive updating algorithm for DeePC. It emphasizes the use of Singular Value Decomposition (SVD) for efficient low-dimensiona… ▽ More

    Submitted 24 March, 2024; v1 submitted 24 September, 2023; originally announced September 2023.

  26. arXiv:2309.09776  [pdf, other]

    eess.IV

    MAD: Meta Adversarial Defense Benchmark

    Authors: X. Peng, D. Zhou, G. Sun, J. Shi, L. Wu

    Abstract: Adversarial training (AT) is a prominent technique employed by deep learning models to defend against adversarial attacks, and to some extent, enhance model robustness. However, there are three main drawbacks of the existing AT-based defense methods: expensive computational cost, low generalization ability, and the dilemma between the original model and the defense model. To this end, we propose a… ▽ More

    Submitted 18 September, 2023; originally announced September 2023.

    Comments: 12 pages, 11 figures, IEEE Transactions on Neural Networks and Learning Systems

  27. arXiv:2309.09510  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech

    Authors: Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, Hung-yi Lee

    Abstract: Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies in speech processing primarily focus on limited or specific tasks. Moreover, the lack of standardized benchmarks hinders a fair comparison across different approaches. Thus, we present Dynamic-SUPERB, a benchmark designed for bui… ▽ More

    Submitted 22 March, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: To appear in the proceedings of ICASSP 2024

  28. arXiv:2309.00494  [pdf, other]

    eess.IV cs.CV cs.LG

    Multi-stage Deep Learning Artifact Reduction for Computed Tomography

    Authors: Jiayang Shi, Daniel M. Pelt, K. Joost Batenburg

    Abstract: In Computed Tomography (CT), an image of the interior structure of an object is computed from a set of acquired projection images. The quality of these reconstructed images is essential for accurate analysis, but this quality can be degraded by a variety of imaging artifacts. To improve reconstruction quality, the acquired projection images are often processed by a pipeline consisting of multiple… ▽ More

    Submitted 1 September, 2023; originally announced September 2023.

  29. arXiv:2308.04112  [pdf, other]

    eess.SY

    Multi-Interval Rolling-Window Joint Dispatch and Pricing of Energy and Reserve under Uncertainty

    Authors: Jiantao Shi, Ye Guo, Wenchuan Wu, Hongbin Sun

    Abstract: In this paper, the intra-day multi-interval rolling-window joint dispatch and pricing of energy and reserve is studied under increasing volatile and uncertain renewable generations. A look-ahead energy-reserve co-optimization model is proposed for the rolling-window dispatch, where possible contingencies and load/renewable forecast errors over the look-ahead window are modeled as several scenario… ▽ More

    Submitted 8 August, 2023; originally announced August 2023.

  30. arXiv:2308.02867  [pdf, other]

    cs.SD eess.AS

    A Systematic Exploration of Joint-training for Singing Voice Synthesis

    Authors: Yuning Wu, Yifeng Yu, Jiatong Shi, Tao Qian, Qin Jin

    Abstract: There has been a growing interest in using end-to-end acoustic models for singing voice synthesis (SVS). Typically, these models require an additional vocoder to transform the generated acoustic features into the final waveform. However, since the acoustic model and the vocoder are not jointly optimized, a gap can exist between the two models, leading to suboptimal performance. Although a similar… ▽ More

    Submitted 5 August, 2023; originally announced August 2023.

  31. arXiv:2307.08866  [pdf, other]

    eess.SY

    Adaptive Data-Driven Prediction in a Building Control Hierarchy: A Case Study of Demand Response in Switzerland

    Authors: Jicheng Shi, Yingzhao Lian, Christophe Salzmann, Colin N. Jones

    Abstract: By providing various services, such as Demand Response (DR), buildings can play a crucial role in the energy market due to their significant energy consumption. However, effectively commissioning buildings for such desired functionalities requires significant expert knowledge and design effort, considering the variations in building dynamics and intended use. In this study, we introduce an adaptiv… ▽ More

    Submitted 29 December, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

  32. arXiv:2307.06143  [pdf, other]

    cs.CV eess.IV

    Learning Kernel-Modulated Neural Representation for Efficient Light Field Compression

    Authors: Jinglei Shi, Yihong Xu, Christine Guillemot

    Abstract: Light field is a type of image data that captures the 3D scene information by recording light rays emitted from a scene at various orientations. It offers a more immersive perception than classic 2D images but at the cost of huge data volume. In this paper, we draw inspiration from the visual characteristics of Sub-Aperture Images (SAIs) of light field and design a compact neural network represent… ▽ More

    Submitted 12 July, 2023; originally announced July 2023.

  33. arXiv:2307.01486  [pdf, other]

    eess.IV cs.CV

    H-DenseFormer: An Efficient Hybrid Densely Connected Transformer for Multimodal Tumor Segmentation

    Authors: Jun Shi, Hongyu Kan, Shulan Ruan, Ziqi Zhu, Minfan Zhao, Liang Qiao, Zhaohui Wang, Hong An, Xudong Xue

    Abstract: Recently, deep learning methods have been widely used for tumor segmentation of multimodal medical images with promising results. However, most existing methods are limited by insufficient representational ability, specific modality number and high computational complexity. In this paper, we propose a hybrid densely connected network for tumor segmentation, named H-DenseFormer, which combines the… ▽ More

    Submitted 4 July, 2023; originally announced July 2023.

    Comments: 11 pages, 2 figures. This paper has been accepted by Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2023

  34. arXiv:2306.14646  [pdf, other]

    eess.IV cs.CV

    Multi-View Attention Learning for Residual Disease Prediction of Ovarian Cancer

    Authors: Xiangneng Gao, Shulan Ruan, Jun Shi, Guoqing Hu, Wei Wei

    Abstract: In the treatment of ovarian cancer, precise residual disease prediction is significant for clinical and surgical decision-making. However, traditional methods are either invasive (e.g., laparoscopy) or time-consuming (e.g., manual analysis). Recently, deep learning methods make many efforts in automatic analysis of medical images. Despite the remarkable progress, most of them underestimated the im… ▽ More

    Submitted 26 June, 2023; originally announced June 2023.

  35. arXiv:2306.14422  [pdf, other]

    cs.SD cs.CL eess.AS

    The Singing Voice Conversion Challenge 2023

    Authors: Wen-Chin Huang, Lester Phillip Violeta, Songxiang Liu, Jiatong Shi, Tomoki Toda

    Abstract: We present the latest iteration of the voice conversion challenge (VCC) series, a bi-annual scientific event aiming to compare and understand different voice conversion (VC) systems based on a common dataset. This year we shifted our focus to singing voice conversion (SVC), thus named the challenge the Singing Voice Conversion Challenge (SVCC). A new database was constructed for two tasks, namely… ▽ More

    Submitted 6 July, 2023; v1 submitted 26 June, 2023; originally announced June 2023.

  36. arXiv:2306.09650  [pdf, other]

    cs.IT eess.SP

    Reconfigurable Intelligent Surface Assisted Semantic Communication Systems

    Authors: Jiajia Shi, Tse-Tin Chan, Haoyuan Pan, Tat-Ming Lok

    Abstract: Semantic communication, which focuses on conveying the meaning of information rather than exact bit reconstruction, has gained considerable attention in recent years. Meanwhile, reconfigurable intelligent surface (RIS) is a promising technology that can achieve high spectral and energy efficiency by dynamically reflecting incident signals through programmable passive components. In this paper, we… ▽ More

    Submitted 29 June, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

  37. arXiv:2306.06982  [pdf]

    eess.IV cs.CV cs.LG

    Weakly Supervised Lesion Detection and Diagnosis for Breast Cancers with Partially Annotated Ultrasound Images

    Authors: Jian Wang, Liang Qiao, Shichong Zhou, Jin Zhou, Jun Wang, Juncheng Li, Shihui Ying, Cai Chang, Jun Shi

    Abstract: Deep learning (DL) has proven highly effective for ultrasound-based computer-aided diagnosis (CAD) of breast cancers. In an automatic CAD system, lesion detection is critical for the following diagnosis. However, existing DL-based methods generally require voluminous manually-annotated region of interest (ROI) labels and class labels to train both the lesion detection and diagnosis models. In clini… ▽ More

    Submitted 12 June, 2023; originally announced June 2023.

  38. arXiv:2306.01084  [pdf, other]

    cs.SD eess.AS

    Exploration on HuBERT with Multiple Resolutions

    Authors: Jiatong Shi, Yun Tang, Hirofumi Inaguma, Hongyu Gong, Juan Pino, Shinji Watanabe

    Abstract: Hidden-unit BERT (HuBERT) is a widely-used self-supervised learning (SSL) model in speech processing. However, we argue that its fixed 20ms resolution for hidden representations would not be optimal for various speech-processing tasks since their attributes (e.g., speaker characteristics and semantics) are based on different time scales. To address this limitation, we propose utilizing HuBERT repr… ▽ More

    Submitted 22 June, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to Interspeech2023

  39. arXiv:2305.19972  [pdf, other]

    eess.AS cs.AI cs.CL

    VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition

    Authors: Ziyi Ni, Minglun Han, Feilong Chen, Linghui Meng, Jing Shi, Pin Lv, Bo Xu

    Abstract: Enhancing automatic speech recognition (ASR) performance by leveraging additional multimodal information has shown promising results in previous studies. However, most of these works have primarily focused on utilizing visual cues derived from human lip motions. In fact, context-dependent visual and linguistic cues can also benefit in many scenarios. In this paper, we first propose ViLaS (Vision a… ▽ More

    Submitted 18 December, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

    Comments: Accepted to ICASSP 2024

  40. arXiv:2305.10615  [pdf, other]

    cs.SD cs.CL eess.AS

    ML-SUPERB: Multilingual Speech Universal PERformance Benchmark

    Authors: Jiatong Shi, Dan Berrebbi, William Chen, Ho-Lam Chung, En-Pei Hu, Wei Ping Huang, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Shinji Watanabe

    Abstract: Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks. However, SUPERB largely considers English speech in its evaluation. This paper presents multilingual SUPERB (ML-SUPERB), covering 143 languages (ranging from high-resource to endangered), and considering both automatic… ▽ More

    Submitted 11 August, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

    Comments: Accepted by Interspeech

  41. arXiv:2305.09353  [pdf, other]

    cs.CV eess.IV

    Blind Image Quality Assessment via Transformer Predicted Error Map and Perceptual Quality Token

    Authors: Jinsong Shi, Pan Gao, Aljosa Smolic

    Abstract: Image quality assessment is a fundamental problem in the field of image processing, and due to the lack of reference images in most practical scenarios, no-reference image quality assessment (NR-IQA), has gained increasing attention recently. With the development of deep learning technology, many deep neural network-based NR-IQA methods have been developed, which try to learn the image quality bas…

    Submitted 16 May, 2023; originally announced May 2023.

    Comments: Submitted to TMM

  42. arXiv:2305.07455  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Cascaded Unsupervised Speech Translation with Denoising Back-translation

    Authors: Yu-Kuan Fu, Liang-Hsuan Tseng, Jiatong Shi, Chen-An Li, Tsu-Yuan Hsu, Shinji Watanabe, Hung-yi Lee

    Abstract: Most of the speech translation models heavily rely on parallel data, which is hard to collect especially for low-resource languages. To tackle this issue, we propose to build a cascaded speech translation system without leveraging any kind of paired data. We use fully unpaired data to train our unsupervised systems and evaluate our results on CoVoST 2 and CVSS. The results show that our work is co…

    Submitted 12 May, 2023; originally announced May 2023.

  43. arXiv:2305.04160  [pdf, other]

    cs.CL cs.AI cs.CV eess.AS

    X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

    Authors: Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, Bo Xu

    Abstract: Large language models (LLMs) have demonstrated remarkable language abilities. GPT-4, based on advanced LLMs, exhibits extraordinary multimodal capabilities beyond previous visual language models. We attribute this to the use of more advanced LLMs compared with previous multimodal models. Unfortunately, the model architecture and training strategies of GPT-4 are unknown. To endow LLMs with multimod…

    Submitted 21 May, 2023; v1 submitted 6 May, 2023; originally announced May 2023.

  44. arXiv:2305.02774  [pdf, other]

    eess.IV cs.CV physics.med-ph

    Spatial and Modal Optimal Transport for Fast Cross-Modal MRI Reconstruction

    Authors: Qi Wang, Zhijie Wen, Jun Shi, Qian Wang, Dinggang Shen, Shihui Ying

    Abstract: Multi-modal magnetic resonance imaging (MRI) plays a crucial role in comprehensive disease diagnosis in clinical medicine. However, acquiring certain modalities, such as T2-weighted images (T2WIs), is time-consuming and prone to be with motion artifacts. It negatively impacts subsequent multi-modal image analysis. To address this issue, we propose an end-to-end deep learning framework that utilize…

    Submitted 21 May, 2024; v1 submitted 4 May, 2023; originally announced May 2023.

  45. arXiv:2304.12995  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

    Authors: Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe

    Abstract: Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements…

    Submitted 25 April, 2023; originally announced April 2023.

  46. Fast MRI Reconstruction via Edge Attention

    Authors: Hanhui Yang, Juncheng Li, Lok Ming Lui, Shihui Ying, Jun Shi, Tieyong Zeng

    Abstract: Fast and accurate MRI reconstruction is a key concern in modern clinical practice. Recently, numerous Deep-Learning methods have been proposed for MRI reconstruction, however, they usually fail to reconstruct sharp details from the subsampled k-space data. To solve this problem, we propose a lightweight and accurate Edge Attention MRI Reconstruction Network (EAMRI) to reconstruct images with edge…

    Submitted 22 April, 2023; originally announced April 2023.

    Comments: 10 figures, 5 tables

  47. arXiv:2304.06322   

    cs.CV eess.IV

    Learning-based Spatial and Angular Information Separation for Light Field Compression

    Authors: Jinglei Shi, Yihong Xu, Christine Guillemot

    Abstract: Light fields are a type of image data that capture both spatial and angular scene information by recording light rays emitted by a scene from different orientations. In this context, spatial information is defined as features that remain static regardless of perspectives, while angular information refers to features that vary between viewpoints. We propose a novel neural network that, by design, c…

    Submitted 6 September, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

    Comments: The authors would like to withdraw this paper, as it has been superseded by arXiv:2307.06143

  48. arXiv:2304.04618  [pdf, other]

    cs.SD cs.CL eess.AS

    Enhancing Speech-to-Speech Translation with Multiple TTS Targets

    Authors: Jiatong Shi, Yun Tang, Ann Lee, Hirofumi Inaguma, Changhan Wang, Juan Pino, Shinji Watanabe

    Abstract: It has been known that direct speech-to-speech translation (S2ST) models usually suffer from the data scarcity issue because of the limited existing parallel materials for both source and target speech. Therefore to train a direct S2ST system, previous works usually utilize text-to-speech (TTS) systems to generate samples in the target language by augmenting the data from speech-to-text translatio…

    Submitted 10 April, 2023; originally announced April 2023.

  49. arXiv:2304.04596  [pdf, other]

    cs.SD cs.CL eess.AS

    ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

    Authors: Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng, Siddharth Dalmia, Peter Polák, Patrick Fernandes, Dan Berrebbi, Tomoki Hayashi, Xiaohui Zhang, Zhaoheng Ni, Moto Hira, Soumi Maiti, Juan Pino, Shinji Watanabe

    Abstract: ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST) -- each task is supported with a wide variety of approaches, differentiating ESPnet-…

    Submitted 6 July, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

    Comments: ACL 2023; System Demonstration

  50. arXiv:2303.14349  [pdf, other]

    eess.IV cs.LG

    Causal Image Synthesis of Brain MR in 3D

    Authors: Yujia Li, Jiong Shi, S. Kevin Zhou

    Abstract: Clinical decision making requires counterfactual reasoning based on a factual medical image and thus necessitates causal image synthesis. To this end, we present a novel method for modeling the causality between demographic variables, clinical indices and brain MR images for Alzheimer's Diseases. Specifically, we leverage a structural causal model to depict the causality and a styled generator to…

    Submitted 24 March, 2023; originally announced March 2023.

    Comments: 11 pages