Crash Fault Tolerance: A Key To Reliable System Design

What is Crash Fault Tolerance (CFT)?

In distributed systems, Crash Fault Tolerance (CFT) is a fault-tolerance model designed for cases in which components fail simply by crashing or becoming unresponsive. Such failures are often called benign faults. The fundamental idea behind CFT is that the system keeps working correctly even when a bounded number of its components crash.

Important Features of CFT

Failure Model: Nodes in a CFT system are assumed to fail only by crashing: they stop working and stop responding, but they never behave maliciously, for example by sending false or misleading information.
Limited Malicious Behaviour: CFT systems generally do not handle malicious (Byzantine) faults. A compromised node, or one that deliberately misbehaves, can cause the system to malfunction.
Implementation Simplicity: Because it covers a narrower class of failures, CFT is typically simpler to implement and less computationally expensive than Byzantine Fault Tolerance (BFT).
Common Usage: CFT is widely used in government infrastructure and business settings where components are assumed to be trusted and hardware and software problems are the main causes of failure.

Comparison with Byzantine Fault Tolerance (BFT)

CFT is a weaker form of fault tolerance than BFT. CFT systems are built to withstand crashes, whereas BFT systems can also handle arbitrary and malicious behaviour, such as nodes that send conflicting messages or act unpredictably.

The two models also differ in the minimum number of correct nodes needed for consensus:

CFT needs a simple majority, about half plus one (N/2 + 1) of the N nodes in total, to be correct in order to reach consensus.
BFT, on the other hand, usually needs two-thirds plus one (2N/3 + 1) of the N nodes to be correct in order to reach consensus.

Despite their greater complexity and processing demands, BFT systems are required when nodes are untrusted or potentially hostile.
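
The thresholds above can be turned into a small calculation. The following is a minimal sketch, assuming the usual integer forms of the bounds: a majority quorum of floor(N/2) + 1 with N >= 2f + 1 for crash faults, and a quorum of floor(2N/3) + 1 with N >= 3f + 1 for Byzantine faults. The function names are illustrative and do not come from any particular library.

```python
def cft_quorum(n: int) -> int:
    """Smallest majority of n nodes: floor(n/2) + 1."""
    return n // 2 + 1


def cft_max_crashes(n: int) -> int:
    """Crash faults tolerable by n nodes, from n >= 2f + 1."""
    return (n - 1) // 2


def bft_quorum(n: int) -> int:
    """Two-thirds-plus-one quorum of n nodes: floor(2n/3) + 1."""
    return 2 * n // 3 + 1


def bft_max_faults(n: int) -> int:
    """Byzantine faults tolerable by n nodes, from n >= 3f + 1."""
    return (n - 1) // 3


if __name__ == "__main__":
    print("N  CFT quorum  CFT faults  BFT quorum  BFT faults")
    for n in (3, 4, 5, 7):
        print(n, cft_quorum(n), cft_max_crashes(n),
              bft_quorum(n), bft_max_faults(n))
```

With five nodes, for example, a CFT cluster can lose two nodes and still form a majority of three, while a BFT cluster of the same size tolerates only one faulty node.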

Importance and Benefits

The value and advantages of consensus algorithms, such as Crash Fault Tolerance

For a number of reasons, consensus algorithms like CFT are essential in distributed systems.

Data Consistency: They give every node a consistent view of the system by ensuring that all nodes agree on the same data values or state.
Fault Tolerance: They are designed to withstand a bounded number of node failures without impairing the overall operation of the system.
Coordination and Synchronization: They help nodes coordinate by making sure that all of them base their decisions on the same information and rules.
Scalability: They provide techniques that let the system keep reaching consensus efficiently as the number of nodes and transactions grows. For instance, the CFT algorithm Raft is often cited for its high throughput in permissioned settings, low cost, good energy efficiency, and high scalability.

Consensus Algorithms Implementing CFT

Several widely used consensus algorithms are crash-fault-tolerant:

Paxos: This family of protocols is known for its resilience, reaching consensus even in the presence of crash faults and unreliable communication. Systems such as Microsoft’s Azure Storage and Google’s Chubby use it.
Raft: Designed to be easier to understand and implement than Paxos, Raft reaches consensus by electing a leader that manages the replication of log entries to the other nodes. It is used in CockroachDB, Consul, etcd, and other systems.
Multi-Paxos: A Paxos extension in which a single leader drives many consensus rounds, improving performance by reducing the overhead of repeated leader elections.
Proof of Elapsed Time (PoET) CFT: This PoET variant provides crash fault tolerance only, not BFT, and runs in a simulated Software Guard Extensions (SGX) environment.
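
The leader-driven replication used by Raft and Multi-Paxos rests on one rule: an entry counts as committed once a majority of the cluster has stored it. The sketch below illustrates only that rule. It is a deliberately simplified, single-process model with hypothetical Node and replicate names, and it omits terms, elections, retries, and log matching, so it is not an implementation of either protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    alive: bool = True
    log: list = field(default_factory=list)

    def append(self, entry: str) -> bool:
        """Store the entry and acknowledge it, unless this node has crashed."""
        if not self.alive:
            return False
        self.log.append(entry)
        return True


def replicate(leader: Node, followers: list, entry: str) -> bool:
    """Return True if the entry is stored on a majority of the cluster."""
    cluster_size = len(followers) + 1
    majority = cluster_size // 2 + 1
    if not leader.append(entry):
        return False  # a crashed leader cannot replicate anything
    acks = 1 + sum(1 for follower in followers if follower.append(entry))
    return acks >= majority


if __name__ == "__main__":
    leader = Node("n0")
    followers = [Node(f"n{i}") for i in range(1, 5)]
    followers[0].alive = False
    followers[1].alive = False
    # Five nodes, two crashed: three acknowledgements still form a majority.
    print(replicate(leader, followers, "set x = 1"))
```

In a real deployment the acknowledgements arrive over the network and a crashed leader triggers a new election; the majority check is the part that gives the protocol its crash tolerance.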

CFT algorithms are frequently used in permissioned blockchain networks, where participants are known and trusted and malicious behaviour is therefore less likely. For instance, Hyperledger Fabric can use Raft as a crash-fault-tolerant ordering service, and Quorum supports Raft for deployments that need only crash tolerance. CFT solutions usually rely on data replication and synchronized replicas across network nodes to improve availability and fault tolerance.

Challenges for CFT Algorithms

There are various difficulties in putting consensus algorithms, like Crash Fault Tolerance, into practice.

Fault Tolerance: Designing algorithms that handle benign failures such as node crashes and network partitions gracefully.
Scalability: Maintaining performance as the number of nodes and transactions grows, particularly with regard to message overhead and bottlenecks at centralized points such as leaders.
Security: Although they do not handle Byzantine faults, CFT deployments must still withstand DoS attacks, double-spending in blockchain environments, and Sybil attacks, which some algorithms mitigate by requiring computational work or stake.
Synchronization: Network delay and the difficulty of keeping clocks precisely synchronized make it hard to ensure that every node sees the same system state.
Configuration Management: Tuning parameters such as timeouts and message intervals precisely, and handling dynamic changes in network membership (adding or removing nodes) without disrupting the consensus process.

CFT in Waves Enterprise Blockchain Platform

The Waves Enterprise blockchain platform uses a CFT consensus algorithm to guarantee network consistency, especially in business blockchains with intensive information exchange. It is intended to prevent events that could disrupt business operations, such as block rollbacks and forks of the main blockchain.

Waves Enterprise’s CFT algorithm is based on Proof of Authority (PoA) and adds a voting step for the validators of each mining round. Once more than half of the participants (validators) have received and validated a block, that block will not be rolled back and no parallel chain will form. To accomplish this, each created block must be finalized, and its broadcast depends on a majority (50% + 1 vote) of the round validators. If this majority is not reached, mining stops until network connectivity is restored.
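
The majority rule described above can be sketched as a vote tally. This is a minimal illustration and not Waves Enterprise code: the Vote type, the is_finalized function, and the signature_ok flag (a stand-in for real signature verification) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    sender: str         # address of the voting node
    block_id: str       # block the vote refers to
    signature_ok: bool  # stand-in for real signature verification


def is_finalized(block_id: str, votes: list, validators: set) -> bool:
    """True if a majority (50% + 1) of round validators cast a valid vote."""
    required = len(validators) // 2 + 1
    # A set ensures that each validator is counted at most once.
    valid_senders = {
        vote.sender
        for vote in votes
        if vote.block_id == block_id
        and vote.signature_ok
        and vote.sender in validators
    }
    return len(valid_senders) >= required


if __name__ == "__main__":
    validators = {"A", "B", "C", "D", "E"}
    votes = [
        Vote("A", "block-1", True),
        Vote("B", "block-1", True),
        Vote("B", "block-1", True),   # duplicate, counted once
        Vote("X", "block-1", True),   # not a validator, ignored
        Vote("C", "block-1", False),  # bad signature, ignored
    ]
    print(is_finalized("block-1", votes, validators))
```

In this run only two distinct validators cast valid votes, which is below the three required for five validators, so the block is not treated as finalized.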

Parameters of CFT

The following are important CFT parameters and features in Waves Enterprise:

Time Dependency: Like PoA, the algorithm is time-dependent: the start and end times of each mining round are calculated from the timestamp of the genesis block.
Configuration Parameters: Three additional CFT parameters are added to the consensus block in the node configuration file:
- max-validators (Vmax): The maximum number of validators that can be involved in a given round.
- finalization-timeout (tfin): The amount of time a miner waits for the last block to be finalized. If this time is exceeded, the transactions are returned to the UTX pool and mining restarts.
- full-vote-set-timeout: An optional parameter that specifies how long a miner waits after the end of the round to receive a complete set of votes from every validator.
Voting Process: Nodes with the miner role vote in each round, beginning at tsync (round start time + round length) and concluding by tend + tfin. Votes from both the validators and the current round miner are counted. The max-validators parameter limits the number of validators; if there are more active miners than Vmax, a pseudorandom selection mechanism may be used. A vote is considered valid if it is signed correctly and comes from an address on the list of validators for the current round.
Mining Features: CFT adds an extra fault tolerance mechanism, but the basic rules are the same as those of PoA. A mining attempt is considered unsuccessful if the most recent block received is not finalized (that is, a microblock with valid votes has not been applied to the state). If this state lasts longer than tstart + tfin, the node returns the transactions to the UTX pool and restarts the mining round. Working only with finalized blocks is strongly advised in order to avoid transaction rollbacks.
Synchronization: By continuously updating the list of active channels and restricting synchronization time, CFT distributes the burden evenly, in contrast to PoS and PoA, which choose the strongest network.
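
The source says only that a pseudorandom selection mechanism may be used when there are more active miners than max-validators. The sketch below shows one way such a selection could be made identical on every node, by ranking miner addresses by a hash of a shared round seed. The select_validators function, the round_seed argument, and the hash-ranking scheme are assumptions for illustration, not the platform's actual mechanism.

```python
import hashlib


def select_validators(active_miners: list, max_validators: int,
                      round_seed: bytes) -> list:
    """Pick at most max_validators miners, identically on every node."""
    if len(active_miners) <= max_validators:
        return sorted(active_miners)

    def rank(address: str) -> str:
        # The same seed and address always yield the same rank.
        return hashlib.sha256(round_seed + address.encode("utf-8")).hexdigest()

    return sorted(active_miners, key=rank)[:max_validators]


if __name__ == "__main__":
    miners = [f"miner-{i}" for i in range(10)]
    chosen = select_validators(miners, max_validators=7, round_seed=b"round-42")
    print(len(chosen), chosen)
```

Because the ranking depends only on the seed and the addresses, any node that knows both computes the same validator list without extra communication. Changing the seed each round changes which miners are selected.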

CFT-Forensics: Byzantine Accountability for CFT Protocols

Although CFT protocols assume trusted components, a single corrupt node can still break CFT consensus. CFT-Forensics is an accountability framework that addresses this: if a corrupt node violates the protocol and compromises consensus safety, the responsible components can be identified, with cryptographic integrity, from the node states.

Efficiency: Because CFT-Forensics is tailored to the underlying CFT protocol, it runs at a fraction of the cost of the PeerReview protocol, which records a signed transcript of every message and so incurs considerable communication and storage overhead.
Provable Guarantees: CFT-Forensics offers provable accountability guarantees for a class of “forensics-compliant” CFT protocols, which includes popular algorithms such as Raft and Multi-Paxos.
Real-world Application: Raft-Forensics, an instantiation of CFT-Forensics for Raft, has been implemented as an extension of the well-known NuRaft library. Experiments show that it adds little overhead to vanilla Raft: with 256-byte messages it reaches 87.8% of vanilla Raft’s peak throughput at 46% higher latency (an extra 44 ms). It has also been integrated into the open-source central bank digital currency OpenCBDC, where wide-area network experiments show 97.8% of Raft’s throughput at 14.5% higher latency (an extra 326 ms).
