What is fault management? Describe the five-step process in fault management.



Fault Management:

A fault in a network is normally associated with the failure of a network component and the subsequent loss of connectivity. Fault management involves a five-step process:

(1) Fault detection, (2) Fault location, (3) Restoration of service, (4) Identification of root cause of the problem, and (5) Problem resolution.

i. The fault should be detected as quickly as possible by the centralized management system, preferably before, or at about the same time as, the users notice it.

ii. Fault location involves identifying where the problem is located. We distinguish this from problem isolation, although in practice it could be the same.

iii. The reason for distinguishing location from isolation is that, once the fault has been located, service must be restored to the users as quickly as possible, using alternative means.

iv. Restoration of service takes priority over diagnosing the problem and fixing it.

v. Identification of the root cause of the problem can be a complex process, which is covered in more detail under Root Cause Analysis.

vi. After identifying the source of the problem, a trouble ticket can be generated to resolve the problem.

vii. In an automated network operations center, the trouble ticket could be generated automatically by the NMS.

Fault Detection:

i. Fault detection is accomplished either by polling (the NMS periodically polls management agents for status) or by traps (management agents send unsolicited alarms to the NMS, based on information from the network elements).

ii. In the polling scheme, an application program in the NMS issues ping commands periodically and waits for a response. Connectivity is declared broken when a preset number of consecutive responses is not received.

iii. The ping frequency and the preset number of missed responses can be tuned to balance traffic overhead against how quickly a failure must be detected.

iv. The alternative detection scheme is to use traps. One of the advantages of traps is that failure detection is accomplished faster with less traffic overhead.
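
The polling rule in (ii) can be sketched as follows. This is a minimal illustration, not any particular NMS implementation: the function name `detect_connectivity_loss` and the parameter `max_consecutive_misses` are hypothetical, and the ping results are supplied as a list of booleans rather than obtained from the network.

```python
from typing import Iterable, Optional


def detect_connectivity_loss(
    probe_results: Iterable[bool],
    max_consecutive_misses: int = 3,
) -> Optional[int]:
    """Return the index of the poll at which connectivity is declared broken.

    probe_results yields True when a ping response arrived and False when
    it did not. Returns None if the threshold is never reached.
    """
    misses = 0
    for poll_index, responded in enumerate(probe_results):
        if responded:
            misses = 0  # any response resets the run of consecutive misses
        else:
            misses += 1
            if misses >= max_consecutive_misses:
                return poll_index
    return None


# Assumed example data: one isolated miss, then three misses in a row.
history = [True, False, True, False, False, False, True]
print(detect_connectivity_loss(history, max_consecutive_misses=3))  # prints 5
```

A lower threshold or a shorter polling interval detects failures sooner, but it generates more traffic and is more likely to raise an alarm on an isolated lost packet.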

Fault Location and Isolation Techniques:

i. A simple approach to fault location is to detect all the network components that have failed. The origin of the problem can then be traced by walking down the topology tree to the point where the failures start.

ii. Thus, if an interface card on a router has failed, all managed components connected to that interface would indicate failure.

iii. After having located where the fault is, the next step is to isolate the fault (i.e. determine the source of the problem).

iv. First, we should distinguish between failure of the component and failure of the physical link. Thus, in the above example, the interface card may be functioning well, but the link to the interface may be down. We need to use various diagnostic tools to isolate the cause.

v. Let us assume for the moment that the link is not the problem but that the interface card is. We then proceed to isolate the problem to the layer that is causing it. It is possible that excessive packet loss is causing disconnection.

vi. We can measure packet loss by pinging, if pinging can be used. We can query the various Management Information Base (MIB) parameters on the node itself or other related nodes to further localize the cause of the problem.

vii. For example, error rates calculated from the interface group parameters ifInDiscards, ifInErrors, ifOutDiscards, and ifOutErrors, taken relative to the input and output packet rates, could help us isolate the problem in the interface card.
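
The two ideas above, walking down the topology tree to find where failures start and computing an error rate from interface counters, can be sketched as follows. This is an illustration under stated assumptions: the function names, the example topology, and the component names are all hypothetical, and the counter values are passed in as plain integers rather than read over SNMP. Which packet counters make up the denominator is also left to the caller.

```python
from typing import Dict, List, Optional, Set


def locate_fault_origins(
    children: Dict[str, List[str]],
    root: str,
    failed: Set[str],
) -> List[str]:
    """Walk down the topology tree and return the topmost failed components.

    A failed component reached through healthy ancestors is where a problem
    starts; failures beneath it are treated as consequences and not reported.
    """
    origins: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node in failed:
            origins.append(node)  # do not descend below a failed component
            continue
        stack.extend(children.get(node, []))
    return origins


def error_rate(error_delta: int, packet_delta: int) -> Optional[float]:
    """Errors (or discards) per packet over one polling interval.

    Both arguments are differences between two successive counter samples,
    for example two readings of ifInErrors and of an input packet counter.
    Counter wrap-around is not handled in this sketch.
    """
    if packet_delta <= 0:
        return None  # no traffic in the interval, so the rate is undefined
    return error_delta / packet_delta


# Assumed example topology: a router with two interfaces.
topology = {
    "router1": ["if1", "if2"],
    "if1": ["switchA"],
    "switchA": ["host1", "host2"],
    "if2": ["switchB"],
}
failed_components = {"if1", "switchA", "host1", "host2"}
print(locate_fault_origins(topology, "router1", failed_components))  # prints ['if1']
```

Here everything reachable only through `if1` reports failure, so the walk stops at `if1` and reports it as the place to begin isolation.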

Service Restoration:

i. Whenever there is a service failure, it is the NOC's responsibility to restore service as soon as possible. This involves detection and isolation of the problem causing the failure, and restoration of service.

ii. In several failure situations, the network will do this automatically. This network feature is called self-healing. In other situations, the NMS can detect the failure of components and raise appropriate alarms.

iii. Restoration of service does not include fixing the cause of the problem. That responsibility usually rests with the installation and maintenance (I&M) group.

iv. A trouble ticket is generated and followed up for resolution of the problem by the I&M group.
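
The hand-off described above, where a trouble ticket is opened and then followed up by the I&M group, could be modeled roughly as below. This is only a sketch: the `TroubleTicket` class, its fields, the status strings, and `open_ticket_for_alarm` are hypothetical names chosen for illustration and do not correspond to any specific NMS or ticketing product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class TroubleTicket:
    component: str            # managed object the alarm refers to
    description: str          # alarm text or operator summary
    service_restored: bool    # restoration is tracked separately from the fix
    assigned_group: str = "I&M"
    status: str = "open"
    opened_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    notes: List[str] = field(default_factory=list)

    def resolve(self, note: str) -> None:
        """Record what was repaired or replaced and close the ticket."""
        self.notes.append(note)
        self.status = "resolved"


def open_ticket_for_alarm(component: str, alarm_text: str, service_restored: bool) -> TroubleTicket:
    """Open a ticket from an alarm, as an automated NOC might do."""
    return TroubleTicket(
        component=component,
        description=alarm_text,
        service_restored=service_restored,
    )


ticket = open_ticket_for_alarm("router1/if1", "interface down", service_restored=True)
ticket.resolve("interface card replaced")
print(ticket.status)  # prints resolved
```

Keeping `service_restored` separate from `status` reflects the point that service can be back up through alternative means while the underlying problem is still open.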

Root Cause Analysis (RCA):

Root Cause Analysis (RCA) is a widely used technique that helps people answer the question of why the problem occurred in the first place.

It seeks to identify the origin of a problem using a specific set of steps, with associated tools, to find the primary cause of the problem, so that you can:

Determine what happened.
Determine why it happened.
Figure out what to do to reduce the likelihood that it will happen again.

Problem Resolution:

Problem resolution means correcting the problem using hardware and software techniques: managed objects are repaired or replaced, and operations are returned to normal. Resolution indicates that the problem has been solved.
