- survey, May 2024, JUST ACCEPTED
Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators - Trends in Quantum Computing, Heterogeneous Systems and Reliability
Rapid progress in the CMOS technology for the past 25 years has increased the vulnerability of processors towards faults. Subsequently, focus of computer architects shifted towards designing fault-tolerance methods for processor architectures. ...
- research-article, April 2024
Energy Management for Fault-tolerant (m,k)-constrained Real-time Systems That Use Standby-Sparing
ACM Transactions on Embedded Computing Systems (TECS), Volume 23, Issue 3, Article No.: 36, pp 1–36
https://doi.org/10.1145/3648365
Fault tolerance, energy management, and quality of service (QoS) are essential aspects for the design of real-time embedded systems. In this work, we focus on exploring methods that can simultaneously address the above three critical issues under standby-...
- research-article, April 2024
Optimal Deployment of Cloud-native Applications with Fault-Tolerance and Time-Critical End-to-End Constraints
UCC '23: Proceedings of the IEEE/ACM 16th International Conference on Utility and Cloud Computing, December 2023, Article No.: 14, pp 1–10
https://doi.org/10.1145/3603166.3632139
Cloud environments are becoming increasingly interesting to host time-critical use cases with far more stringent latency requirements than conventional cloud-native applications, such as smart industrial control systems or cloud-enabled autonomous ...
- research-article, April 2024
BNN-Flip: Enhancing the Fault Tolerance and Security of Compute-in-Memory Enabled Binary Neural Network Accelerators
ASPDAC '24: Proceedings of the 29th Asia and South Pacific Design Automation Conference, January 2024, pp 146–152
https://doi.org/10.1109/ASP-DAC58780.2024.10473947
Compute-in-memory based binary neural networks or CiM-BNNs offer high energy/area efficiency for the design of edge deep neural network (DNN) accelerators, with only a mild accuracy reduction. However, for successful deployment, the design of CiM-BNNs ...
- Article, March 2024
Recovery of Real-Time Clusters with the Division of Computing Resources into the Execution of Functional Queries and the Restoration of Data Generated Since the Last Backup
Distributed Computer and Communication Networks: Control, Computation, Communications, Sep 2023, pp 236–250
https://doi.org/10.1007/978-3-031-50482-2_19
The possibilities of increasing the readiness of fault-tolerant cluster systems for the timely execution of functional requests are investigated. Duplicated systems containing two computers and two two-input memory nodes are considered as cluster ...
- research-article, March 2024
Wireless Sensor Networks: Target-Barrier Coverage with Static and Mobile Sensors
ICNCC '23: Proceedings of the 2023 12th International Conference on Networks, Communication and Computing, December 2023, pp 1–5
https://doi.org/10.1145/3638837.3638838
This paper investigates the issue of target-barrier coverage in wireless sensor networks that utilize both static and mobile sensors, resulting in enhanced fault tolerance and reliability. The main objective is to establish a reliable and dependable ...
- research-article, February 2024
Energy-Constrained Scheduling for Weakly Hard Real-Time Systems Using Standby-Sparing
ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 29, Issue 2, Article No.: 29, pp 1–35
https://doi.org/10.1145/3631587
For real-time embedded systems, QoS (Quality of Service), fault tolerance, and energy budget constraint are among the primary design concerns. In this research, we investigate the problem of energy constrained standby-sparing for both periodic and ...
- research-article, January 2024, JUST ACCEPTED
Experimentation and Implementation of BFT++ Cyber-attack Resilience Mechanism for Cyber Physical Systems
Cyber-physical systems (CPS) are used in various safety-critical domains such as robotics, industrial manufacturing systems, and power systems. Faults and cyber attacks have been shown to cause safety violations, which can damage the system and endanger ...
- survey, January 2024
Reaching Consensus in the Byzantine Empire: A Comprehensive Review of BFT Consensus Algorithms
- Gengrui Zhang,
- Fei Pan,
- Yunhao Mao,
- Sofia Tijanic,
- Michael Dang’ana,
- Shashank Motepalli,
- Shiquan Zhang,
- Hans-Arno Jacobsen
ACM Computing Surveys (CSUR), Volume 56, Issue 5, Article No.: 134, pp 1–41
https://doi.org/10.1145/3636553
Byzantine fault-tolerant (BFT) consensus algorithms are at the core of providing safety and liveness guarantees for distributed systems that must operate in the presence of arbitrary failures. Recently, numerous new BFT algorithms have been proposed, not ...
- research-article, January 2024, JUST ACCEPTED
On Cyber-Physical Fault Resilience in Data Communication: A Case From A LoRaWAN Network Systems Design
Systems offering fault-resilient, energy-efficient, soft real-time data communication have wide applications in Industrial Internet-of-Things (IIoT). While there have been extensive studies for fault resilience in real-time embedded systems, ...
- research-article, December 2023
Partial Network Partitioning
- Basil Alkhatib,
- Sreeharsha Udayashankar,
- Sara Qunaibi,
- Ahmed Alquraan,
- Mohammed Alfatafta,
- Wael Al-Manasrah,
- Alex Depoutovitch,
- Samer Al-Kiswany
ACM Transactions on Computer Systems (TOCS), Volume 41, Issue 1-4, Article No.: 1, pp 1–34
https://doi.org/10.1145/3576192
We present an extensive study focused on partial network partitioning. Partial network partitions disrupt the communication between some but not all nodes in a cluster. First, we conduct a comprehensive study of system failures caused by this fault in 13 ...
- research-article, December 2023
gPPM: A Generalized Matrix Operation and Parallel Algorithm to Accelerate the Encoding/Decoding Process of Erasure Codes
ACM Transactions on Architecture and Code Optimization (TACO), Volume 20, Issue 4, Article No.: 51, pp 1–25
https://doi.org/10.1145/3625005
Erasure codes are widely deployed in modern storage systems, leading to frequent usage of their encoding/decoding operations. The encoding/decoding process for erasure codes is generally carried out using the parity-check matrix approach. However, this ...
- research-article, November 2023
Mars Attacks!: Software Protection Against Space Radiation
HotNets '23: Proceedings of the 22nd ACM Workshop on Hot Topics in Networks, November 2023, pp 245–253
https://doi.org/10.1145/3626111.3628199
Due to their low cost and the need to run computationally-intensive algorithms locally, satellites and spacecraft are increasingly employing off-the-shelf computing hardware. However, hardware in space is exposed to significantly higher amounts of ...
- research-article, November 2023
UNION: Fault-tolerant Cooperative Computing in Opportunistic Mobile Edge Cloud
ACM Transactions on Internet Technology (TOIT), Volume 23, Issue 4, Article No.: 59, pp 1–27
https://doi.org/10.1145/3617994
Opportunistic Mobile Edge Cloud in which opportunistically connected mobile devices run in a cooperative way to augment the capability of a single device has become a timely and essential topic due to its widespread prospect under resource-constrained ...
- research-article, November 2023
Elastic deep learning through resilient collective operations
SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, November 2023, pp 44–50
https://doi.org/10.1145/3624062.3626080
A robust solution that incorporates fault tolerance and elastic scaling capabilities for distributed deep learning. Taking advantage of MPI resilient capabilities, aka. User-Level Failure Mitigation (ULFM), this novel approach promotes efficient and ...
- research-article, October 2023
Evicting for the greater good: The case for Reactive Checkpointing in serverless computing
WORDS '23: Proceedings of the 4th Workshop on Resource Disaggregation and Serverless, October 2023, pp 44–50
https://doi.org/10.1145/3605181.3626289
Evictable cloud resources give providers extra flexibility in managing their infrastructure and help reduce resource idleness. Due to its user-transparent and ephemeral nature, serverless computing appears to be a logical fit to make use of this type ...
- research-article, October 2023
Understanding Silent Data Corruptions in a Large Production CPU Population
SOSP '23: Proceedings of the 29th Symposium on Operating Systems Principles, October 2023, pp 216–230
https://doi.org/10.1145/3600006.3613149
Silent Data Corruption (SDC) in processors can lead to various application-level issues, such as incorrect calculations and even data loss. Since traditional techniques are not effective in detecting processor SDCs, it is very hard to address problems ...
- research-article, October 2023
GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
SOSP '23: Proceedings of the 29th Symposium on Operating Systems Principles, October 2023, pp 364–381
https://doi.org/10.1145/3600006.3613145
Large deep learning models have recently garnered substantial attention from both academia and industry. Nonetheless, frequent failures are observed during large model training due to large-scale resources involved and extended training time. Existing ...
- research-article, October 2023
An Intelligent Blockchain-based Secure Link Failure Recovery Framework for Software-defined Internet-of-Things
Journal of Grid Computing (SPJGC), Volume 21, Issue 4, Dec 2023
https://doi.org/10.1007/s10723-023-09693-8
The frequency of link failures in Internet-of-Things (IoT) network are more than the node failures. Hence, effective link recovery schemes are required for a seamless communication in the IoT. In contrast to traditional networking, the IoT is ...
- research-article, October 2023
Restorable Shortest Path Tiebreaking for Edge-Faulty Graphs
Journal of the ACM (JACM), Volume 70, Issue 5, Article No.: 28, pp 1–24
https://doi.org/10.1145/3603542
The restoration lemma by Afek et al. [3] proves that, in an undirected unweighted graph, any replacement shortest path avoiding a failing edge can be expressed as the concatenation of two original shortest paths. However, the lemma is tiebreaking-...