-
Security Analysis of Filecoin's Expected Consensus in the Byzantine vs Honest Model
Authors:
Xuechao Wang,
Sarah Azouvi,
Marko Vukolić
Abstract:
Filecoin is the largest storage-based open-source blockchain, both by storage capacity (>11EiB) and market capitalization. This paper provides the first formal security analysis of Filecoin's consensus (ordering) protocol, Expected Consensus (EC). Specifically, we show that EC is secure against an arbitrary adversary that controls a fraction $β$ of the total storage for $βm< 1- e^{-(1-β)m}$, where $m$ is a parameter that corresponds to the expected number of blocks per round, currently $m=5$ in Filecoin. We then present an attack, the $n$-split attack, where an adversary splits the honest miners between multiple chains, and show that it is successful for $βm \ge 1- e^{-(1-β)m}$, thus proving that $βm= 1- e^{-(1-β)m}$ is the tight security threshold of EC. This corresponds roughly to an adversary with $20\%$ of the total storage pledged to the chain. Finally, we propose two improvements to EC security that would increase this threshold. One of these two fixes is being implemented as a Filecoin Improvement Proposal (FIP).
Submitted 14 August, 2023;
originally announced August 2023.
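The threshold quoted in the abstract above, $βm = 1 - e^{-(1-β)m}$, can be checked numerically. The following is a minimal sketch, not taken from the paper: the function names `ec_margin` and `ec_threshold` are hypothetical, and bisection is simply one convenient way to locate the root. It relies only on the fact that $1 - e^{-(1-β)m} - βm$ is strictly decreasing in $β$ on $[0, 1]$.

```python
import math


def ec_margin(beta: float, m: float) -> float:
    """Return (1 - e^{-(1-beta) m}) - beta*m.

    A positive value means the condition beta*m < 1 - e^{-(1-beta) m}
    from the abstract holds for this beta and m.
    """
    return (1.0 - math.exp(-(1.0 - beta) * m)) - beta * m


def ec_threshold(m: float, tol: float = 1e-12) -> float:
    """Bisect for the beta in (0, 1) where the margin crosses zero.

    The margin is positive at beta = 0 (for m > 0), negative at beta = 1,
    and strictly decreasing in between, so the root is unique.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if ec_margin(mid, m) > 0.0:
            lo = mid  # condition still holds, so the threshold is larger
        else:
            hi = mid  # condition fails, so the threshold is smaller
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # m = 5 is the value the abstract gives for Filecoin.
    print(f"threshold for m=5: {ec_threshold(5.0):.4f}")
```

With $m=5$ the root lies slightly below 0.2, which agrees with the abstract's "roughly $20\%$" figure.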
-
Modeling Resources in Permissionless Longest-chain Total-order Broadcast
Authors:
Sarah Azouvi,
Christian Cachin,
Duc V. Le,
Marko Vukolic,
Luca Zanolini
Abstract:
Blockchain protocols implement total-order broadcast in a permissionless setting, where processes can freely join and leave. In such a setting, to safeguard against Sybil attacks, correct processes rely on cryptographic proofs tied to a particular type of resource to make them eligible to order transactions. For example, in the case of Proof-of-Work (PoW), this resource is computation, and the proof is a solution to a computationally hard puzzle. Conversely, in Proof-of-Stake (PoS), the resource corresponds to the number of coins that every process in the system owns, and a secure lottery selects a process for participation proportionally to its coin holdings.
Although many resource-based blockchain protocols are formally proven secure in the literature, the existing security proofs fail to demonstrate why particular types of resources cause the blockchain protocols to be vulnerable to distinct classes of attacks. For instance, PoS systems are more vulnerable to long-range attacks, where an adversary corrupts past processes to re-write the history, than Proof-of-Work and Proof-of-Storage systems. Proof-of-Storage-based and Proof-of-Stake-based protocols are both more susceptible to private double-spending attacks than Proof-of-Work-based protocols; in this case, an adversary mines its chain in secret without sharing its blocks with the rest of the processes until the end of the attack.
In this paper, we formally characterize the properties of resources through an abstraction called resource allocator and give a framework for understanding longest-chain consensus protocols based on different underlying resources. In addition, we use this resource allocator to demonstrate security trade-offs between various resources focusing on well-known attacks (e.g., the long-range attack and nothing-at-stake attacks).
Submitted 22 November, 2022;
originally announced November 2022.
-
Pikachu: Securing PoS Blockchains from Long-Range Attacks by Checkpointing into Bitcoin PoW using Taproot
Authors:
Sarah Azouvi,
Marko Vukolić
Abstract:
Blockchain systems based on a reusable resource, such as proof-of-stake (PoS), provide weaker security guarantees than those based on proof-of-work. Specifically, they are vulnerable to long-range attacks, where an adversary can corrupt prior participants in order to rewrite the full history of the chain. To prevent this attack on a PoS chain, we propose a protocol that checkpoints the state of the PoS chain to a proof-of-work blockchain such as Bitcoin. Our checkpointing protocol hence does not rely on any central authority. Our work uses Schnorr signatures and leverages Bitcoin's recent Taproot upgrade, allowing us to create a checkpointing transaction of constant size. We argue for the security of our protocol and present an open-source implementation that was tested on the Bitcoin testnet.
Submitted 13 October, 2022; v1 submitted 10 August, 2022;
originally announced August 2022.
-
State-Machine Replication Scalability Made Simple (Extended Version)
Authors:
Chrysoula Stathakopoulou,
Matej Pavlovic,
Marko Vukolić
Abstract:
Consensus, state-machine replication (SMR) and total order broadcast (TOB) protocols are notorious for being poorly scalable with the number of participating nodes. Despite the recent race to reduce overall message complexity of leader-driven SMR/TOB protocols, scalability remains poor and the throughput is typically inversely proportional to the number of nodes. We present Insanely Scalable State-Machine Replication, a generic construction to turn leader-driven protocols into scalable multi-leader ones. For our scalable SMR construction we use a novel primitive called Sequenced (Total Order) Broadcast (SB) which we wrap around PBFT, HotStuff and Raft leader-driven protocols to make them scale. Our construction is general enough to accommodate most leader-driven ordering protocols (BFT or CFT) and make them scale. Our implementation improves the peak throughput of PBFT, HotStuff, and Raft by 37x, 56x, and 55x, respectively, at a scale of 128 nodes.
Submitted 10 March, 2022;
originally announced March 2022.
-
BMS: Secure Decentralized Reconfiguration for Blockchain and BFT Systems
Authors:
Selma Steinhoff,
Chrysoula Stathakopoulou,
Matej Pavlovic,
Marko Vukolić
Abstract:
Reconfiguration of long-lived blockchain and Byzantine fault-tolerant (BFT) systems poses fundamental security challenges. In case of state-of-the-art Proof-of-Stake (PoS) blockchains, stake reconfiguration enables so-called long-range attacks, which can lead to forks. Similarly, permissioned blockchain systems, typically based on BFT, reconfigure internally, which makes them susceptible to a similar "I still work here" attack.
In this work, we propose BMS (Blockchain/BFT Membership Service) offering a secure and dynamic reconfiguration service for BFT and blockchain systems, preventing long-range and similar attacks. In particular: (1) we propose a root BMS for permissioned blockchains, implemented as an Ethereum smart contract and evaluate it reconfiguring the recently proposed Mir-BFT protocol, (2) we discuss how our BMS extends to PoS blockchains and how it can reduce PoS stake unbonding time from weeks/months to the order of minutes, and (3) we discuss possible extensions of BMS to hierarchical deployments as well as to multiple root BMSs.
Submitted 8 September, 2021;
originally announced September 2021.
-
Multi-Shard Private Transactions for Permissioned Blockchains
Authors:
Elli Androulaki,
Angelo De Caro,
Kaoutar Elkhiyaoui,
Christian Gorenflo,
Alessandro Sorniotti,
Marko Vukolic
Abstract:
Traditionally, blockchain systems involve sharing transaction information across all blockchain network participants. Clearly, this introduces barriers to the adoption of the technology by the enterprise world, where preserving the privacy of the business data is a necessity. Previous efforts to bring privacy and blockchains together either still leak partial information, are restricted in their functionality, or use costly mechanisms like zk-SNARKs. In this paper, we propose the Multi-Shard Private Transaction (MSPT) protocol, a novel privacy-preserving protocol for permissioned blockchains, which relies only on simple cryptographic primitives and targeted dissemination of information to achieve atomicity and high performance.
Submitted 16 October, 2020;
originally announced October 2020.
-
On the Interoperability of Decentralized Exposure Notification Systems
Authors:
Marko Vukolic
Abstract:
This report summarizes the requirements and proposes a high-level solution for interoperability across recently proposed COVID-19 exposure notification efforts. Our focus is on interoperability across exposure notification (EN) applications which are based on the decentralized Bluetooth Low Energy (BLE) protocol driven by Google/Apple Exposure Notifications API (including DP3T and similar protocols). We distinguish different interoperability use cases, such as worldwide public EN interoperability, as well as interoperability in the enterprise EN systems. This report also proposes an API and a backend implementation architecture for EN interoperability. Finally, we propose using a permissioned blockchain-based solution for managing EN backend certificates and configurations (without storing any users' data on the blockchain) for helping address EN interoperability challenges across different vendors.
Submitted 23 June, 2020;
originally announced June 2020.
-
Can 100 Machines Agree?
Authors:
Rachid Guerraoui,
Jad Hamza,
Dragos-Adrian Seredinschi,
Marko Vukolic
Abstract:
Agreement protocols have been typically deployed at small scale, e.g., using three to five machines. This is because these protocols seem to suffer from a sharp performance decay. More specifically, as the size of a deployment---i.e., degree of replication---increases, the protocol performance greatly decreases. There is not much experimental evidence for this decay in practice, however, notably for larger system sizes, e.g., beyond a handful of machines.
In this paper we execute agreement protocols on up to 100 machines and observe their performance decay. We consider well-known agreement protocols that are part of mature systems, such as Apache ZooKeeper, etcd, and BFT-Smart, as well as a chain and a novel ring-based agreement protocol which we implement ourselves.
We provide empirical evidence that current agreement protocols execute gracefully on 100 machines. We observe that throughput decay is initially sharp (consistent with previous observations); but intriguingly---as each system grows beyond a few tens of replicas---the decay dampens. For chain- and ring-based replication, this decay is slower than for the other systems. The positive takeaway from our evaluation is that mature agreement protocol implementations can sustain out-of-the-box 300 to 500 requests per second when executing on 100 replicas on a wide-area public cloud platform. Chain- and ring-based replication can reach between 4K and 11K (up to 20x improvements) depending on the fault assumptions.
Submitted 18 November, 2019;
originally announced November 2019.
-
Mir-BFT: High-Throughput Robust BFT for Decentralized Networks
Authors:
Chrysoula Stathakopoulou,
Tudor David,
Matej Pavlovic,
Marko Vukolić
Abstract:
This paper presents Mir-BFT, a robust Byzantine fault-tolerant (BFT) total order broadcast protocol aimed at maximizing throughput on wide-area networks (WANs), targeting deployments in decentralized networks, such as permissioned and Proof-of-Stake permissionless blockchain systems.
Mir-BFT is the first BFT protocol that allows multiple leaders to propose request batches independently (i.e., parallel leaders), in a way that precludes request duplication attacks by malicious (Byzantine) clients, by rotating the assignment of a partitioned request hash space to leaders.
As this mechanism removes a single-leader bandwidth bottleneck and exposes a computation bottleneck related to authenticating clients even on a WAN, our protocol further boosts throughput using a client signature verification sharding optimization.
Our evaluation shows that Mir-BFT outperforms the state of the art and orders more than 60000 signed Bitcoin-sized (500-byte) transactions per second on a widely distributed 100-node, 1 Gbps WAN setup, with typical latencies of a few seconds.
We also evaluate Mir-BFT under different crash and Byzantine faults, demonstrating its performance robustness.
Mir-BFT relies on classical BFT protocol constructs, which simplifies reasoning about its correctness.
Specifically, Mir-BFT is a generalization of the celebrated and scrutinized PBFT protocol. In a nutshell, Mir-BFT follows PBFT "safety-wise", with changes needed to accommodate novel features restricted to PBFT liveness.
Submitted 22 January, 2021; v1 submitted 13 June, 2019;
originally announced June 2019.
-
StreamChain: Rethinking Blockchain for Datacenters
Authors:
Lucas Kuhring,
Zsolt István,
Alessandro Sorniotti,
Marko Vukolić
Abstract:
Permissioned blockchains promise secure decentralized data management in business-to-business use-cases. In contrast to Bitcoin and similar public blockchains which rely on Proof-of-Work for consensus and are deployed on thousands of geo-distributed nodes, business-to-business use-cases (such as supply chain management and banking) require significantly fewer nodes, cheaper consensus, and are often deployed in datacenter-like environments with fast networking. However, permissioned blockchains often follow the architectural thinking behind their WAN-oriented public relatives, which results in end-to-end latencies several orders of magnitude higher than necessary.
In this work, we propose StreamChain, a permissioned blockchain design that eliminates blocks in favor of processing transactions in a streaming fashion. This results in a drastically lower latency without reducing throughput or forfeiting reliability and security guarantees. To demonstrate the wide applicability of our design, we prototype StreamChain based on the Hyperledger Fabric, and show that it delivers latency two orders of magnitude lower than Fabric, while sustaining similar throughput. This performance makes StreamChain a potential alternative to traditional databases and, thanks to its streaming paradigm, enables further research around reducing latency through relying on modern hardware in datacenters.
Submitted 10 February, 2020; v1 submitted 25 August, 2018;
originally announced August 2018.
-
Hyperledger Fabric: A Distributed Operating System for Permissioned Blockchains
Authors:
Elli Androulaki,
Artem Barger,
Vita Bortnikov,
Christian Cachin,
Konstantinos Christidis,
Angelo De Caro,
David Enyeart,
Christopher Ferris,
Gennady Laventman,
Yacov Manevich,
Srinivasan Muralidharan,
Chet Murthy,
Binh Nguyen,
Manish Sethi,
Gari Singh,
Keith Smith,
Alessandro Sorniotti,
Chrysoula Stathakopoulou,
Marko Vukolić,
Sharon Weed Cocco,
Jason Yellick
Abstract:
Fabric is a modular and extensible open-source system for deploying and operating permissioned blockchains and one of the Hyperledger projects hosted by the Linux Foundation (www.hyperledger.org).
Fabric is the first truly extensible blockchain system for running distributed applications. It supports modular consensus protocols, which allows the system to be tailored to particular use cases and trust models. Fabric is also the first blockchain system that runs distributed applications written in standard, general-purpose programming languages, without systemic dependency on a native cryptocurrency. This stands in sharp contrast to existing blockchain platforms that require "smart-contracts" to be written in domain-specific languages or rely on a cryptocurrency. Fabric realizes the permissioned model using a portable notion of membership, which may be integrated with industry-standard identity management. To support such flexibility, Fabric introduces an entirely novel blockchain design and revamps the way blockchains cope with non-determinism, resource exhaustion, and performance attacks.
This paper describes Fabric, its architecture, the rationale behind various design decisions, its most prominent implementation aspects, as well as its distributed application programming model. We further evaluate Fabric by implementing and benchmarking a Bitcoin-inspired digital currency. We show that Fabric achieves end-to-end throughput of more than 3500 transactions per second in certain popular deployment configurations, with sub-second latency, scaling well to over 100 peers.
Submitted 17 April, 2018; v1 submitted 30 January, 2018;
originally announced January 2018.
-
A Byzantine Fault-Tolerant Ordering Service for the Hyperledger Fabric Blockchain Platform
Authors:
João Sousa,
Alysson Bessani,
Marko Vukolić
Abstract:
Hyperledger Fabric (HLF) is a flexible permissioned blockchain platform designed for business applications beyond the basic digital coin addressed by Bitcoin and other existing networks. A key property of HLF is its extensibility, and in particular the support for multiple ordering services for building the blockchain. Nonetheless, the version 1.0 was launched in early 2017 without an implementation of a Byzantine fault-tolerant (BFT) ordering service. To overcome this limitation, we designed, implemented, and evaluated a BFT ordering service for HLF on top of the BFT-SMaRt state machine replication/consensus library, implementing also optimizations for wide-area deployment. Our results show that HLF with our ordering service can achieve up to ten thousand transactions per second and write a transaction irrevocably in the blockchain in half a second, even with peers spread in different continents.
Submitted 20 September, 2017;
originally announced September 2017.
-
Blockchain Consensus Protocols in the Wild
Authors:
Christian Cachin,
Marko Vukolić
Abstract:
A blockchain is a distributed ledger for recording transactions, maintained by many nodes without central authority through a distributed cryptographic protocol. All nodes validate the information to be appended to the blockchain, and a consensus protocol ensures that the nodes agree on a unique order in which entries are appended. Consensus protocols for tolerating Byzantine faults have received renewed attention because they also address blockchain systems. This work discusses the process of assessing and gaining confidence in the resilience of consensus protocols exposed to faults and adversarial nodes. We advocate following the established practice in cryptography and computer security, relying on public reviews, detailed models, and formal proofs; the designers of several practical systems appear to be unaware of this. Moreover, we review the consensus protocols in some prominent permissioned blockchain platforms with respect to their fault models and resilience against attacks. The protocol comparison covers Hyperledger Fabric, Tendermint, Symbiont, R3 Corda, Iroha, Kadena, Chain, Quorum, MultiChain, Sawtooth Lake, Ripple, Stellar, and IOTA.
Submitted 7 July, 2017; v1 submitted 6 July, 2017;
originally announced July 2017.
-
Bleach: A Distributed Stream Data Cleaning System
Authors:
Yongchao Tian,
Pietro Michiardi,
Marko Vukolic
Abstract:
In this paper we address the problem of rule-based stream data cleaning, which sets stringent requirements on latency, rule dynamics and ability to cope with the unbounded nature of data streams.
We design a system, called Bleach, which achieves real-time violation detection and data repair on a dirty data stream. Bleach relies on efficient, compact and distributed data structures to maintain the necessary state to repair data, using an incremental version of the equivalence class algorithm. Additionally, it supports rule dynamics and uses a "cumulative" sliding window operation to improve cleaning accuracy.
We evaluate a prototype of Bleach using a TPC-DS derived dirty data stream and observe its high throughput, low latency and high cleaning accuracy, even with rule dynamics. Experimental results indicate superior performance of Bleach compared to a baseline system built on the micro-batch streaming paradigm.
Submitted 16 September, 2016;
originally announced September 2016.
-
DiNoDB: an Interactive-speed Query Engine for Ad-hoc Queries on Temporary Data
Authors:
Yongchao Tian,
Ioannis Alagiannis,
Erietta Liarou,
Anastasia Ailamaki,
Pietro Michiardi,
Marko Vukolic
Abstract:
As data sets grow in size, analytics applications struggle to get instant insight into large datasets. Modern applications involve heavy batch processing jobs over large volumes of data and at the same time require efficient ad-hoc interactive analytics on temporary data. Existing solutions, however, typically focus on one of these two aspects, largely ignoring the need for synergy between the two. Consequently, interactive queries need to re-iterate costly passes through the entire dataset (e.g., data loading) that may provide meaningful return on investment only when data is queried over a long period of time. In this paper, we propose DiNoDB, an interactive-speed query engine for ad-hoc queries on temporary data. DiNoDB avoids the expensive loading and transformation phase that characterizes both traditional RDBMSs and current interactive analytics solutions. It is tailored to modern workflows found in machine learning and data exploration use cases, which often involve iterations of cycles of batch and interactive analytics on data that is typically useful for a narrow processing window. The key innovation of DiNoDB is to piggyback on the batch processing phase the creation of metadata that DiNoDB exploits to expedite the interactive queries. Our experimental analysis demonstrates that DiNoDB achieves very good performance for a wide range of ad-hoc queries compared to alternatives such as Hive, Stado, SparkSQL and Impala.
Submitted 16 September, 2016;
originally announced September 2016.
-
Non-determinism in Byzantine Fault-Tolerant Replication
Authors:
Christian Cachin,
Simon Schubert,
Marko Vukolić
Abstract:
Service replication distributes an application over many processes for tolerating faults, attacks, and misbehavior among a subset of the processes. The established state-machine replication paradigm inherently requires the application to be deterministic. This paper distinguishes three models for dealing with non-determinism in replicated services, where some processes are subject to faults and arbitrary behavior (so-called Byzantine faults): first, a modular approach that does not require any changes to the potentially non-deterministic application (and neither access to its internal data); second, a master-slave approach, in which ties are broken by a leader and the other processes validate the choices of the leader; and finally, a treatment of applications that use cryptography and secret keys. Cryptographic operations and secrets must be treated specially because they require strong randomness to satisfy their goals.
The paper also introduces two new protocols. The first uses the modular approach for filtering out non-deterministic operations in an application. It ensures that all correct processes produce the same outputs and that their internal states do not diverge. The second protocol implements cryptographically secure randomness generation with a verifiable random function and is appropriate for certain security models. All protocols are described in a generic way and do not assume a particular implementation of the underlying consensus primitive.
Submitted 19 December, 2016; v1 submitted 23 March, 2016;
originally announced March 2016.
-
Consistency in Non-Transactional Distributed Storage Systems
Authors:
Paolo Viotti,
Marko Vukolić
Abstract:
Over the years, different meanings have been associated with the word consistency in the distributed systems community. While in the '80s "consistency" typically meant strong consistency, later defined also as linearizability, in recent years, with the advent of highly available and scalable systems, the notion of "consistency" has been at the same time both weakened and blurred.
In this paper we aim to fill the void in the literature, by providing a structured and comprehensive overview of different consistency notions that appeared in distributed systems, and in particular storage systems research, in the last four decades. We overview more than 50 different consistency notions, ranging from linearizability to eventual and weak consistency, defining precisely many of these, in particular where the previous definitions were ambiguous. We further provide a partial order among different consistency predicates, ordering them by their semantic "strength", which we believe will prove useful in future research. Finally, we map the consistency semantics to different practical systems and research prototypes.
The scope of this paper is restricted to non-transactional semantics, i.e., those that apply to single storage object operations. As such, our paper complements the existing surveys done in the context of transactional, database consistency semantics.
Submitted 12 April, 2016; v1 submitted 1 December, 2015;
originally announced December 2015.
-
XFT: Practical Fault Tolerance Beyond Crashes
Authors:
Shengyun Liu,
Paolo Viotti,
Christian Cachin,
Vivien Quéma,
Marko Vukolić
Abstract:
Despite years of intensive research, Byzantine fault-tolerant (BFT) systems have not yet been adopted in practice. This is due to the additional cost of BFT in terms of resources, protocol complexity and performance, compared with crash fault-tolerance (CFT). This overhead of BFT comes from the assumption of a powerful adversary that can fully control not only the Byzantine faulty machines but, at the same time, also the message delivery schedule across the entire network, effectively inducing communication asynchrony and partitioning otherwise correct machines at will. To many practitioners, however, such strong attacks appear irrelevant.
In this paper, we introduce cross fault tolerance or XFT, a novel approach to building reliable and secure distributed systems and apply it to the classical state-machine replication (SMR) problem. In short, an XFT SMR protocol provides the reliability guarantees of widely used asynchronous CFT SMR protocols such as Paxos and Raft, but also tolerates Byzantine faults in combination with network asynchrony, as long as a majority of replicas are correct and communicate synchronously. This allows the development of XFT systems at the price of CFT (already paid for in practice), yet with strictly stronger resilience than CFT --- sometimes even stronger than BFT itself.
As a showcase for XFT, we present XPaxos, the first XFT SMR protocol, and deploy it in a geo-replicated setting. Although it offers much stronger resilience than CFT SMR at no extra resource cost, the performance of XPaxos matches that of the state-of-the-art CFT protocols.
Submitted 8 November, 2016; v1 submitted 20 February, 2015;
originally announced February 2015.
-
Erasure-Coded Byzantine Storage with Separate Metadata
Authors:
Elli Androulaki,
Christian Cachin,
Dan Dobre,
Marko Vukolic
Abstract:
Although many distributed storage protocols have been introduced, a solution that combines the strongest properties in terms of availability, consistency, fault-tolerance, storage complexity and the supported level of concurrency has been elusive for a long time. Combining these properties is difficult, especially if the resulting solution is required to be efficient and incur low cost. We present AWE, the first erasure-coded distributed implementation of a multi-writer multi-reader read/write storage object that is, at the same time: (1) asynchronous, (2) wait-free, (3) atomic, (4) amnesic (i.e., with data nodes storing a bounded number of values), and (5) Byzantine fault-tolerant (BFT) using the optimal number of nodes. Furthermore, AWE is efficient since it does not use public-key cryptography and requires data nodes that support only reads and writes, further reducing the cost of deployment and ownership of a distributed storage solution. Notably, AWE stores metadata separately from $k$-out-of-$n$ erasure-coded fragments. This enables AWE to be the first BFT protocol that uses as few as $2t+k$ data nodes to tolerate $t$ Byzantine nodes, for any $k \ge 1$.
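The $2t+k$ figure quoted in this abstract can be made concrete with a short sketch. The helper name `awe_data_nodes` is hypothetical and only evaluates the stated bound; it is not part of AWE itself.

```python
def awe_data_nodes(t: int, k: int) -> int:
    """Data-node count quoted in the abstract: 2t + k.

    t: number of Byzantine data nodes to tolerate (assumed t >= 0)
    k: erasure-code parameter, i.e. fragments needed to decode (k >= 1)
    """
    if t < 0 or k < 1:
        raise ValueError("requires t >= 0 and k >= 1")
    return 2 * t + k


# Tabulate the bound for a few illustrative (t, k) pairs.
for t, k in [(1, 1), (1, 2), (2, 3)]:
    print(f"t={t}, k={k}: {awe_data_nodes(t, k)} data nodes")
```

With $k = 1$ the expression reduces to $2t+1$.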
Submitted 20 February, 2014;
originally announced February 2014.
-
Asynchronous BFT Storage with 2t+1 Data Replicas
Authors:
Christian Cachin,
Dan Dobre,
Marko Vukolic
Abstract:
The cost of Byzantine Fault Tolerant (BFT) storage is the main concern preventing its adoption in practice. This cost stems from the need to maintain at least 3t+1 replicas in different storage servers in the asynchronous model, so that t Byzantine replica faults can be tolerated. In this paper, we present MDStore, the first fully asynchronous read/write BFT storage protocol that reduces the number of data replicas to as few as 2t+1, maintaining 3t+1 replicas of metadata at (possibly) different servers. At the heart of MDStore is its metadata service, which is built upon a new abstraction we call timestamped storage. Timestamped storage both allows for conditional writes (facilitating the implementation of a metadata service) and has consensus number one (making it implementable wait-free in an asynchronous system despite faults). In addition to its low data replication factor, MDStore offers very strong guarantees, implementing multi-writer multi-reader atomic wait-free semantics and tolerating any number of Byzantine readers and crash-faulty writers. We further show that MDStore's data replication overhead is optimal; namely, we prove a lower bound of 2t+1 on the number of data replicas that applies even to crash-tolerant storage with a fault-free metadata service oracle. Finally, we prove that separating data from metadata for reducing the cost of BFT storage is not possible without cryptographic assumptions. However, our MDStore protocol uses only lightweight cryptographic hash functions.
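To illustrate the replica counts mentioned in this abstract, the following sketch compares the classical 3t+1 requirement with the 2t+1 data replicas plus 3t+1 metadata replicas described for MDStore. The function name and dictionary keys are hypothetical, chosen only for this example.

```python
def replica_counts(t: int) -> dict:
    """Replica counts stated in the abstract for tolerating t Byzantine faults."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return {
        "classical_full_replicas": 3 * t + 1,    # asynchronous BFT storage baseline
        "mdstore_data_replicas": 2 * t + 1,      # bulk data
        "mdstore_metadata_replicas": 3 * t + 1,  # metadata only
    }


# Tabulate the counts for small fault thresholds.
for t in (1, 2, 3):
    print(t, replica_counts(t))
```

The difference between the two data-replica counts is t, so the number of full data copies saved grows linearly with the fault threshold.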
Submitted 13 February, 2014; v1 submitted 21 May, 2013;
originally announced May 2013.
-
Proofs of Writing for Efficient and Robust Storage
Authors:
Dan Dobre,
Ghassan Karame,
Wenting Li,
Matthias Majuntke,
Neeraj Suri,
Marko Vukolic
Abstract:
We present PoWerStore, the first efficient robust storage protocol that achieves optimal latency without using digital signatures. PoWerStore's robustness comprises tolerating asynchrony, the maximum number of Byzantine storage servers, any number of Byzantine readers and crash-faulty writers, and guaranteeing wait-freedom and linearizability of read/write operations. PoWerStore's efficiency stems from combining lightweight authentication, erasure coding and metadata write-backs, where readers write back only metadata to achieve linearizability. At the heart of PoWerStore are Proofs of Writing (PoW): a novel storage technique based on lightweight cryptography. PoW enable reads and writes in the single-writer variant of PoWerStore to have a latency of 2 rounds of communication between a client and storage servers in the worst case (which we show to be optimal). We further present and implement a multi-writer PoWerStore variant featuring 3-round writes/reads, where the third read round is invoked only under active attacks, and show that it outperforms existing robust storage protocols, including crash-tolerant ones.
Submitted 24 December, 2012; v1 submitted 14 December, 2012;
originally announced December 2012.