What problems can occur in a distributed system due to link failure and network partitioning? What are the ways in which recovery can take place?



Problems

The site where the transaction enters is designated as the controlling site. The controlling site sends messages to the sites where the data items are located, asking them to lock the items, and then waits for confirmation. Once all the sites have confirmed that they have locked the data items, the transaction starts. If any site or communication link fails, the transaction has to wait until it has been repaired.

Though the implementation is simple, this approach has some drawbacks:

i. Pre-acquisition of locks takes a long time because of communication delays. This increases the total time required for the transaction.

ii. In case of a site or link failure, a transaction has to wait a long time for the sites to recover. Meanwhile, the items stay locked at the sites that are still running, which may prevent other transactions from executing.

iii. If the controlling site fails, it cannot communicate with the other sites. These sites keep the data items locked, which results in blocking.
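The blocking described in the drawbacks above can be illustrated with a minimal Python sketch. The `Site` class, the `pre_acquire` function, and the `"started"`/`"waiting"` return values are hypothetical names introduced here for illustration; failures are simulated with an `up` flag instead of real network calls.

```python
class Site:
    """A hypothetical data site with a lock table for its items."""

    def __init__(self, name):
        self.name = name
        self.up = True      # False simulates a site or link failure
        self.locks = {}     # item -> id of the transaction holding the lock

    def lock(self, item, txn):
        """Grant the lock unless another transaction already holds it."""
        if not self.up:
            raise ConnectionError(f"{self.name} is unreachable")
        if self.locks.get(item, txn) != txn:
            return False
        self.locks[item] = txn
        return True


def pre_acquire(txn, requests):
    """Controlling site: lock every (site, item) pair before starting.

    Returns "started" once all sites have confirmed, otherwise "waiting".
    Locks granted before the failure stay held, which is what blocks
    other transactions.
    """
    for site, item in requests:
        try:
            if not site.lock(item, txn):
                return "waiting"
        except ConnectionError:
            return "waiting"
    return "started"


# Example: site S2 is down, so T1 waits while still holding x at S1.
s1, s2 = Site("S1"), Site("S2")
s2.up = False
state_t1 = pre_acquire("T1", [(s1, "x"), (s2, "y")])
state_t2 = pre_acquire("T2", [(s1, "x")])  # blocked by T1's lock on x
print(state_t1, state_t2, s1.locks)
```

In this sketch `T2` cannot proceed even though the only site it needs is running, because `T1` still holds the lock it acquired before the failure was detected.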

Recovery

i. Recovery must guarantee the atomicity and durability of transactions.

ii. Failures include the usual types found in a centralized system, plus:

a) loss of messages

b) site failure

c) link failure

iii. Network partitioning

a) Link failures split the network into groups of nodes. Nodes within a group can communicate with one another but are isolated from the other groups.

iv. Handling Node Failure

a) The system flags the node as failed.

b) The system aborts and rolls back the affected transactions.

c) The system checks periodically to see whether the node has recovered, or the node reports its recovery itself.

d) After restart, the failed node performs local recovery.

e) The failed node catches up to the current state of the database, using the system log of changes made while it was unavailable.
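The node-failure steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the `Node` and `System` classes and all their method names are hypothetical, the system log is an in-memory list of committed `(key, value)` changes, rollback is modeled simply as discarding the transaction, and the node's own local recovery (step d) is not modeled.

```python
class Node:
    """Hypothetical replica: a key-value store plus a log position."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.applied = 0    # number of system-log entries already applied
        self.up = True


class System:
    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}
        self.failed = set()   # names of nodes flagged as failed
        self.log = []         # committed changes, in commit order
        self.active = {}      # transaction id -> names of nodes it uses

    def begin(self, txn, node_names):
        self.active[txn] = set(node_names)

    def commit_write(self, key, value):
        """Log a committed change and apply it on every available node."""
        self.log.append((key, value))
        for node in self.nodes.values():
            if node.name not in self.failed:
                self._catch_up(node)

    def mark_failed(self, name):
        """Steps a and b: flag the node and abort affected transactions."""
        self.nodes[name].up = False
        self.failed.add(name)
        aborted = [t for t, used in self.active.items() if name in used]
        for t in aborted:
            del self.active[t]   # rollback modeled as discarding the txn
        return aborted

    def check_recovered(self, name):
        """Steps c and e: if the node is back, replay the missed changes."""
        node = self.nodes[name]
        if name in self.failed and node.up:
            self._catch_up(node)
            self.failed.discard(name)
            return True
        return False

    def _catch_up(self, node):
        for key, value in self.log[node.applied:]:
            node.data[key] = value
        node.applied = len(self.log)
```

A caller would invoke `check_recovered` periodically (or when the node reports in); once it returns `True`, the node's `data` matches that of the nodes that stayed available.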
