types of failures in distributed databse

Designing a reliable system that can recover from failures requires identifying the types of failures with which the system has to deal. In a distributed database system, we need to deal with four types of failures: transaction failures (aborts), site (system) failures, media (disk) failures, and communication line failures. Some of these are due to hardware and others are due to software.

1. Transaction Failures: Transactions can fail for a number of reasons. Failure can be due to an error in the transaction caused by incorrect input data as well as the detection of a present or potential deadlock. Furthermore, some concurrency control algorithms do not permit a transaction to proceed or even to wait if the data that they attempt to access are currently being accessed by another transaction. This might also be considered a failure.The usual approach to take in cases of transaction failure is to abort the transaction, thus resetting the database to its state prior to the start of this transaction.

2. Site (System) Failures: The reasons for system failure can be traced back to a hardware or to a software failure. The system failure is always assumed to result in the loss of main memory contents. Therefore, any part of the database that was in main memory buffers is lost as a result of a system failure. However, the database that is stored in secondary storage is assumed to be safe and correct. In distributed database terminology, system failures are typically referred to as site failures, since they result in the failed site being unreachable from other sites in the distributed system. We typically differentiate between partial and total failures in a distributed system. Total failure refers to the simultaneous failure of all sites in the distributed system; partial failure indicates the failure of only some sites while the others remain operational.

3. Media Failures: Media failure refers to the failures of the secondary storage devices that store the database. Such failures may be due to operating system errors,as well as to hardware faults such as head crashes or controller failures. The important point is that all or part of the database that is on the secondary storage is considered to be destroyed and inaccessible. Duplexing of disk storage and maintaining archival copies of the database are common techniques that deal with this sort of catastrophic problem. Media failures are frequently treated as problems local to one site and therefore not specifically addressed in the reliability mechanisms of distributed DBMSs.

4. Communication Failures There are a number of types of communication failures. The most common ones are the errors in the messages, improperly ordered messages, lost messages, and communication line failures. The first two errors are the responsibility of the computer network; we will not consider them further. Therefore, in our discussions of distributed DBMS reliability, we expect the underlying computer network hardware and software to ensure that two messages sent from a process at some originating site to another process at some destination site are delivered without error and in the order in which they were sent. Lost or undeliverable messages are typically the consequence of communication line failures or (destination) site failures. If a communication line fails, in addition to losing the message(s) in transit, it may also divide the network into two or more disjoint groups. This is called network partitioning. If the network is partitioned, the sites in each partition may continue to operate. In this case, executing transactions that access data stored in multiple partitions becomes a major issue.

1 Answer

The known errors or failures like software errors, hardware failures, hard disk failures, and power failures are very common in both centralized database system and distributed database systems. Apart from these common failures, a distributed database system may suffer from some of the failures as listed below;

1.Failure of a site -

Distributed database consists of two or more servers. These servers are otherwise called as site. Any of these sites might fail. Though it is a hardware or software failure, as a distributed system it must be treated differently.

2.Loss of messages -

The messages which are shared in between a set of sites might be lost. TCP/IP protocols are responsible to handle these losses.

3.Failure of a communication link -

A connection/communication links between a set of sites might be failed. In such case, the distributed database system may try to identify an alternate route to send the messages.

4.Network partition -

A distributed database system is said to be partitioned if it has two or more subsystems. A subsystem may be a set of one or more sites which has one connection to the other subsystems. For example, consider a distributed database system which manages sites at three different college campuses. Every campus may be internally having more sites. But they are connected to other campuses through a single connection. Now the problem is, if this connection is failed, the distributed database system cannot differentiate or diagnose the actual problem. The failure can be treated as Failure of a site, loss of messages, or a communication link failure.

Among all the failures discussed above, Failure of a site and Network partition need extra care when handling failures in a distributed database.

Please log in to add an answer.