Tasks covered in this chapter include:

- System Error States
- Unexpected Reboots
- Information to Gather During Troubleshooting

Familiarity with a wide variety of equipment, and experience with a particular machine's common failure modes, can be invaluable when troubleshooting system problems. Establishing a systematic approach to investigating and solving a particular system's problems can help ensure that you can quickly identify and remedy most issues as they arise. The Netra server indicates and logs events and errors in a variety of ways. Depending on the system's configuration and software, certain types of errors are captured only temporarily.

Therefore, you must observe and record all available information immediately before you attempt any corrective action. POST, for instance, accumulates a list of failed components across resets, but this failed-component information is cleared after a system power cycle. Similarly, the state of LEDs in a hung system is lost when the system reboots or resets. If you encounter any system problems that are not familiar to you, gather as much information as you can before you attempt any remedial actions.
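
One way to preserve that volatile information is to log everything that appears on the system console. The following is a minimal sketch that assumes a serial connection to the console at /dev/ttyS0 and the third-party pyserial package; the device name, baud rate, and file name are placeholders for your environment.

    # capture_console.py - append timestamped system console output to a file
    # so that transient indications are preserved before any reset.
    # Assumes: pyserial is installed and the console is reachable at /dev/ttyS0.
    import datetime
    import serial  # third-party package: pip install pyserial

    def capture(port="/dev/ttyS0", baud=9600, logfile="console-capture.log"):
        with serial.Serial(port, baud, timeout=1) as console, \
             open(logfile, "a") as log:
            while True:
                line = console.readline()      # returns b"" when the read times out
                if not line:
                    continue
                stamp = datetime.datetime.now().isoformat()
                log.write(f"{stamp} {line.decode(errors='replace')}")
                log.flush()                    # keep the capture durable as you work

    if __name__ == "__main__":
        capture()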

The following task listing outlines a basic approach to information gathering. Gather as much information as you can about the system by reviewing and verifying the system's operating system, firmware, and hardware configuration. To accurately analyze error indications and messages, you or a Sun support services engineer must know the system's operating system and patch revision levels as well as the specific hardware configuration. See Recording Information About the System.

Compare the specifics of your situation to the latest published information about your system. Often, unfamiliar problems you encounter have been seen, diagnosed, and fixed by others. This information might help you avoid the unnecessary expense of replacing parts that are not actually failing. See Updated Troubleshooting Information for information sources. On the Netra server, the ALOM system controller provides you with access to a variety of system logs and other information about the system, even when the system is powered off. For more information about ALOM, see the ALOM documentation.

Output from show-post-results and show-obdiag-results commands - From the ok prompt, issue the show-post-results command or show-obdiag-results command to view summaries of the results from the most recent POST and OpenBoot Diagnostics tests, respectively. The test results are saved across power cycles and provide an indication of which components passed and which components failed POST or OpenBoot Diagnostics tests. Be sure to check any network port LEDs for activity as you examine the system.

Any information about the state of the system from the LEDs is lost when the system is reset. The system controller also provides you access to boot log information from the latest system reset. For more information about the system console, refer to the Netra Server System Administration Guide.

See The Core Dump Process for more information.

Recording Information About the System

As part of your standard operating procedures, it is important to have the following information about your system readily available:

- Solaris OS version
- Specific hardware configuration information
- Optional equipment and driver information
- Recent service records

Having all of this information available and verified makes it easier for you to recognize any problems already identified by others. This information is also required if you contact Sun support or your authorized support provider.
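
As an illustration, a small collection script along the following lines can snapshot that information into one file. This is only a sketch: the commands shown (uname, showrev, prtconf, prtdiag) are standard on Solaris systems, but their paths and options should be verified for your particular OS release and platform.

    # collect_sysinfo.py - snapshot OS, patch, and hardware configuration data.
    # The commands listed are commonly available on Solaris; verify the exact
    # paths and options for your OS release before relying on this script.
    import datetime
    import subprocess

    COMMANDS = [
        ["uname", "-a"],                                  # OS and kernel version
        ["showrev", "-p"],                                # installed patch revisions
        ["prtconf", "-v"],                                # hardware configuration
        ["/usr/platform/sun4u/sbin/prtdiag", "-v"],       # platform diagnostics (path varies)
    ]

    def collect(outfile="sysinfo-snapshot.txt"):
        with open(outfile, "w") as out:
            out.write(f"Snapshot taken {datetime.datetime.now().isoformat()}\n")
            for cmd in COMMANDS:
                out.write(f"\n===== {' '.join(cmd)} =====\n")
                try:
                    result = subprocess.run(cmd, capture_output=True, text=True)
                    out.write(result.stdout)
                    out.write(result.stderr)
                except FileNotFoundError:
                    out.write("command not found on this system\n")

    if __name__ == "__main__":
        collect()

Keeping such a snapshot with the service records makes it easy to compare the system's state before and after any change.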

It is vital to know the version and patch revision levels of the system's operating system, patch revision levels of the firmware, and your specific hardware configuration before you attempt to fix any problems.

Problems often occur after changes have been made to the system. Some errors are caused by hardware and software incompatibilities and interactions. If you have all system information available, you might be able to quickly fix a problem by simply updating the system's firmware. Knowing about recent upgrades or component replacements might help you avoid replacing components that are not faulty. When troubleshooting, it is important to understand what kind of error has occurred, to distinguish between real and apparent system hangs, and to respond appropriately to error conditions so as to preserve valuable information.

Depending on the severity of a system error, a Netra server might or might not respond to commands you issue to the system. Once you have gathered all available information, you can begin taking action. Your actions depend on the information you have already gathered and the state of the system. If your system appears to be hung, attempt multiple approaches to get the system to respond.

See Responding to System Hang States.

Responding to System Hang States

Troubleshooting a hanging system can be a difficult process because the root cause of the hang might be masked by false error indications from another part of the system. Therefore, it is important that you carefully examine all the information sources available to you before you attempt any remedy.

Also, it is helpful to understand the type of hang the system is experiencing. This hang state information is especially important to Sun support services engineers, should you contact them. A system soft hang can be characterized by any of the following symptoms:

- New attempts to access the system fail.
- Some parts of the system appear to stop responding.
- You can drop the system into the OpenBoot ok prompt level.

Some soft hangs might dissipate on their own, while others will require that the system be interrupted to gather information at the OpenBoot prompt level.

A soft hang should respond to a break signal that is sent through the system console. A system hard hang leaves the system unresponsive to a system break sequence. You will know that a system is in a hard hang state when you have attempted all the soft hang remedies with no success. Hardware Fatal Reset errors are the result of an "illegal" hardware state that is detected by the system. A hardware Fatal Reset error can either be a transient error or a hard error. A transient error causes intermittent failures. A hard error causes persistent failures that occur in the same way each time.

A RED State Exception causes a loss of system integrity, which would jeopardize the system if Solaris software continued to operate. Typically, these are device driver problems that can be identified easily. You can obtain this information through SunSolve Online (see Web Sites), or by contacting Sun or the third-party driver vendor. The recent service history of systems that encounter Fatal Reset errors or RED State Exceptions is also valuable. Capturing system console indications and messages at the time of the error can help you isolate the true cause of the error.

In some cases, the true cause of the original error might be masked by false error indications from another part of the system. For example, POST results shown by the output from the prtdiag command might indicate failed components, when, in fact, the "failed" components are not the actual cause of the Fatal Reset error. In most cases, a good component will actually report the Fatal Reset error. By analyzing the system console output at the time of the error, you can avoid replacing components based on these false error indications.

In addition, knowing the service history of a system experiencing transient errors can help you avoid repeatedly replacing "failed" components that do not fix the problem. Sometimes, a system might reboot unexpectedly. In that case, ensure that the reboot was not caused by a panic.

For example, L2-cache errors that occur in user space (not kernel space) might cause Solaris software to log the L2-cache failure data and reboot the system. The information logged might be sufficient to troubleshoot and correct the problem. If POST is not invoked during the reboot process, or if the system diagnostics level is not set to max, you might need to run system diagnostics at a higher level of coverage to determine the source of the reboot, particularly when the system message and system console files do not clearly indicate it.
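
One quick check, sketched below, is to scan the system log for panic-related entries around the time of the reboot. The log location /var/adm/messages is the usual Solaris default, and the keyword list is only a starting point.

    # find_panics.py - list panic-related entries from the Solaris system log.
    # Assumes the default message log location /var/adm/messages.
    LOGFILE = "/var/adm/messages"
    KEYWORDS = ("panic", "bad trap", "red state")    # strings worth reviewing

    def find_panics(path=LOGFILE):
        hits = []
        with open(path, errors="replace") as log:
            for lineno, line in enumerate(log, start=1):
                lowered = line.lower()
                if any(word in lowered for word in KEYWORDS):
                    hits.append((lineno, line.rstrip()))
        return hits

    if __name__ == "__main__":
        for lineno, line in find_panics():
            print(f"{lineno}: {line}")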

This procedure assumes that the system console is in its default configuration, so that you are able to switch between the system controller and the system console. Examine the ALOM event log. The ALOM event log shows system events such as reset events and LED indicator state changes that have occurred since the last system boot. Note that a single event might generate messages that appear to be logged at different times in different logs.

Examine system environment status. The showenvironment command reports much useful data such as temperature readings; state of system and component LEDs; motherboard voltages; and status of system disks, fans, motherboard circuit breakers, and CPU module DC-to-DC converters. When reviewing the complete output from the showenvironment command, check the state of all Service Required LEDs and verify that all components show a status of OK.
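
If you save the showenvironment output to a file, a simple filter such as the following sketch can highlight anything that needs attention. The exact output layout varies by firmware release, so the script only looks for obvious trouble keywords rather than parsing individual columns; the capture file name is a placeholder.

    # check_environment.py - flag suspicious lines in a saved showenvironment capture.
    # The output layout varies by firmware release, so this only searches for
    # obvious trouble keywords instead of parsing individual columns.
    TROUBLE = ("fail", "fault", "error", "warning")

    def suspicious_lines(path="showenvironment.txt"):
        with open(path, errors="replace") as capture:
            for line in capture:
                if any(word in line.lower() for word in TROUBLE):
                    yield line.rstrip()

    if __name__ == "__main__":
        for line in suspicious_lines():
            print(line)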

Examine the output of the prtdiag -v command. Any information from this command about the current state of the system is lost if the system is reset. A clear indication of a failing part is an ALOM environmental message about that part, such as a message about a failing fan or power supply. If there is no clear indication of a failing part, investigate the installed applications, the network, or the disk configuration. If you have clear indications that a part has failed or is failing, replace that part as soon as possible.

If the problem is a confirmed environmental failure, replace the fan or power supply as soon as possible. A system with a redundant configuration might still operate in a degraded state, but the stability and performance of the system will be affected. Since the system is still operational, attempt to isolate the fault using several methods and tools to ensure that the part you suspect as faulty really is causing the problems you are experiencing. See Isolating Faults in the System. For information about installing and replacing field-replaceable parts, refer to the Netra Server Service Manual xx.

Novice pilots are taught that their first responsibility in an emergency is to fly the airplane [Gaw09]; troubleshooting is secondary to getting the plane and everyone on it safely onto the ground.

This approach is also applicable to computer systems: making the system work as well as it can takes priority over hunting for the root cause. This realization is often quite unsettling and counterintuitive for new SREs, particularly those whose prior experience was in product development organizations.

Ideally, a monitoring system is recording metrics for your system, as discussed in Practical Alerting from Time-Series Data. Graphing time-series and operations on time-series can be an effective way to understand the behavior of specific pieces of a system and find correlations that might suggest where problems began.
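
As a simple illustration of the idea, computing the correlation between two exported time series can hint at whether, for example, latency and queue depth started misbehaving together. The numbers below are invented purely for illustration, and the sketch is not tied to any particular monitoring system.

    # correlate_series.py - Pearson correlation between two aligned time series.
    # The sample data is invented; real values would come from your monitoring system.
    from statistics import correlation  # available in Python 3.10 and later

    latency_ms  = [12, 13, 12, 14, 55, 61, 58, 60]   # per-minute p50 latency
    queue_depth = [ 3,  4,  3,  5, 40, 44, 41, 45]   # per-minute queue length

    if __name__ == "__main__":
        r = correlation(latency_ms, queue_depth)
        # Values near 1.0 suggest the two series move together.
        print(f"correlation = {r:.2f}")

Correlation alone does not prove causation, but it is a quick way to decide which pieces of the system deserve a closer look.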

Logging is another invaluable tool. Exporting information about each operation and about system state makes it possible to understand exactly what a process was doing at a given point in time. You may need to analyze system logs across one or many processes. Tracing requests through the whole stack using tools such as Dapper [Sig10] provides a very powerful way to understand how a distributed system is working, though varying use cases imply significantly different tracing designs [Sam14]. Exposing current state is the third trick in our toolbox.
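
A minimal sketch of that kind of per-operation, structured logging is shown below; the field names are illustrative only and not taken from any particular system.

    # oplog.py - emit one structured log record per operation so that later
    # analysis can reconstruct what the process was doing at a given time.
    import json
    import logging
    import time

    logging.basicConfig(level=logging.INFO, format="%(message)s")

    def log_operation(op, status, duration_ms, **extra):
        record = {
            "ts": time.time(),        # when the operation finished
            "op": op,                 # what the process was doing
            "status": status,         # outcome, e.g. "ok" or "error"
            "duration_ms": duration_ms,
            **extra,                  # any additional state worth exporting
        }
        logging.info(json.dumps(record))

    if __name__ == "__main__":
        log_operation("search", "ok", 42, query_terms=3, backend="index-7")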

Finally, you may even need to instrument a client to experiment with, in order to discover what a component is returning in response to requests. Ideally, components in a system have well-defined interfaces and perform known transformations from their input to their output (in our example, given an input search text, a component might return output containing possible matches).

Injecting known test data in order to check that the resulting output is expected (a form of black-box testing) at each step can be especially effective, as can injecting data intended to probe possible causes of errors. Having a solid reproducible test case makes debugging much faster, and it may be possible to use the case in a non-production environment where more invasive or riskier techniques are available than would be possible in production.
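
The sketch below shows that idea with two hypothetical stand-in functions playing the role of pipeline stages; the stages, inputs, and expected outputs are all invented for illustration.

    # probe_pipeline.py - send known test inputs through each stage and compare
    # the results against expected outputs (a form of black-box probing).
    # The stages and expected values below are hypothetical placeholders.

    def tokenize(text):          # stand-in for a real pipeline stage
        return text.lower().split()

    def match(tokens):           # stand-in for a second stage
        index = {"hello": ["doc1"], "world": ["doc2"]}
        return sorted(doc for token in tokens for doc in index.get(token, []))

    # (stage, known input, expected output) triples used as probes
    PROBES = [
        (tokenize, "Hello World", ["hello", "world"]),
        (match, ["hello", "world"], ["doc1", "doc2"]),
    ]

    if __name__ == "__main__":
        for stage, test_input, expected in PROBES:
            actual = stage(test_input)
            verdict = "ok" if actual == expected else f"MISMATCH (got {actual})"
            print(f"{stage.__name__}: {verdict}")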

Dividing and conquering is a very useful general-purpose solution technique. This strategy is also well-suited for use with data processing pipelines. In exceptionally large systems, proceeding linearly may be too slow; an alternative, bisection, splits the system in half and examines the communication paths between components on one side and the other. A malfunctioning system is often still trying to do something, just not the thing you want it to be doing.
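
Returning to bisection, the following is a minimal sketch over an ordered list of pipeline stages. It assumes there is some check that can tell whether the data flowing out of a given stage still looks correct; the stage names and the check used here are placeholders.

    # bisect_pipeline.py - binary search for the first stage whose output is bad.
    # Assumes corruption, once introduced, persists downstream, so the sequence of
    # "good after stage i" answers is monotonic.

    def first_bad_stage(stages, is_good_after):
        lo, hi = 0, len(stages) - 1
        first_bad = None
        while lo <= hi:
            mid = (lo + hi) // 2
            if is_good_after(mid):
                lo = mid + 1            # the problem is further downstream
            else:
                first_bad = mid         # remember the suspect, keep looking upstream
                hi = mid - 1
        return first_bad

    if __name__ == "__main__":
        stages = ["ingest", "parse", "enrich", "aggregate", "serve"]
        # Pretend corruption is introduced by the "enrich" stage (index 2).
        is_good_after = lambda i: i < 2
        bad = first_bad_stage(stages, is_good_after)
        print(f"first bad stage: {stages[bad]}")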

Well-designed systems should have extensive production logging to track new version deployments and configuration changes at all layers of the stack, from the server binaries handling user traffic down to the packages installed on individual nodes in the cluster. While the generic tools described previously are helpful across a broad range of problem domains, you will likely find it helpful to build tools and systems to help with diagnosing your particular services. Google SREs spend much of their time building such tools. While many of these tools are necessarily specific to a given system, be sure to look for commonalities between services and teams to avoid duplicating effort.

Using the experimental method, we can try to rule in or rule out our hypotheses. For instance, suppose we think a problem is caused by either a network failure between an application logic server and a database server, or by the database refusing connections. Trying to connect to the database with the same credentials the application logic server uses can refute the second hypothesis, while pinging the database server may be able to refute the first, depending on network topology, firewall rules, and other factors (a connection-probe sketch follows the considerations below). There are a number of considerations to keep in mind when designing tests, which may be as simple as sending a ping or as complicated as removing traffic from a cluster and injecting specially formed requests to find a race condition.

An ideal test should have mutually exclusive alternatives, so that it can rule one group of hypotheses in and rule another set out. In practice, this may be difficult to achieve. Consider the obvious possibilities first. An experiment may provide misleading results due to confounding factors. Active tests may have side effects that change future test results. For instance, allowing a process to use more CPUs may make operations faster, but might increase the likelihood of encountering data races.

Similarly, turning on verbose logging might make a latency problem even worse and confuse your results. Some tests may not be definitive, only suggestive.
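
Returning to the database example above, the sketch below shows one way to run such a test. The hostname, port, and timeout are placeholders; a refused TCP connection means the host answered but nothing accepted the connection, while a timeout points more toward a network or firewall problem, subject to the caveats already mentioned.

    # probe_db.py - distinguish "connection refused" from "unreachable" for a
    # database host. The hostname, port, and timeout are placeholders.
    import socket

    def probe(host="db.example.internal", port=5432, timeout=3.0):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return "connected: network path and listener both look fine"
        except ConnectionRefusedError:
            return "refused: host reachable, but nothing accepting on that port"
        except socket.timeout:
            return "timeout: suggests a network or firewall problem"
        except OSError as exc:
            return f"other failure: {exc}"   # e.g. no route to host, DNS failure

    if __name__ == "__main__":
        print(probe())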

It can be very difficult to make race conditions or deadlocks happen in a timely and reproducible manner, so you may have to settle for less certain evidence that these are the causes. Take clear notes of what ideas you had, which tests you ran, and the results you saw. Negative results should not be ignored or discounted: a negative result is an experimental outcome in which the expected effect is absent, and this includes new designs, heuristics, or human processes that fail to improve upon the systems they replace.
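
One lightweight way to keep such notes, including the negative results, in a consistent and searchable form is sketched below; the file name and field names are arbitrary and should be adapted to your team's conventions.

    # notes.py - append structured troubleshooting notes to a shared log file.
    # The file name and field names are arbitrary conventions, not a standard.
    import datetime
    import json

    def note(hypothesis, test, result, path="troubleshooting-notes.jsonl"):
        entry = {
            "when": datetime.datetime.now().isoformat(),
            "hypothesis": hypothesis,   # the idea you had
            "test": test,               # what you actually ran
            "result": result,           # what you saw, including negative results
        }
        with open(path, "a") as log:
            log.write(json.dumps(entry) + "\n")

    if __name__ == "__main__":
        note("database is refusing connections",
             "TCP connect to the database port from the application host",
             "connection refused; the network path itself is fine")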

Often a team has two seemingly reasonable designs but progress in one direction has to address vague and speculative questions about whether the other direction might be better. Experiments with negative results are conclusive. They tell us something certain about production, or the design space, or the performance limits of an existing system. They can help others determine whether their own experiments or designs are worthwhile.

When a subsequent development team decides to evaluate web servers, instead of starting from scratch, they can use this already well-documented negative result as a starting point to decide quickly whether (a) they need fewer connections than the documented limit or (b) the lock contention problems have been resolved. Microbenchmarks, documented antipatterns, and project postmortems all fit this category.

You should consider the scope of the negative result when designing an experiment, because a broad or especially robust negative result will help your peers even more. Tools and methods can outlive the experiment and inform future work. As an example, benchmarking tools and load generators can result just as easily from a disconfirming experiment as a supporting one.
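
Even a crude load generator, such as the sketch below, produces latency data that stays useful whether the experiment confirms or refutes the original hypothesis. The target URL and request count are placeholders, and this is far simpler than a real tool such as Apache Bench.

    # miniload.py - issue sequential HTTP requests and report latency percentiles.
    # The target URL and request count are placeholders for your environment.
    import statistics
    import time
    import urllib.request

    def run(url="http://localhost:8080/health", n=100):
        latencies = []
        for _ in range(n):
            start = time.perf_counter()
            try:
                urllib.request.urlopen(url, timeout=5).read()
            except OSError:
                continue                   # count only successful requests
            latencies.append((time.perf_counter() - start) * 1000)
        if latencies:
            latencies.sort()
            print(f"requests: {len(latencies)}")
            print(f"p50: {statistics.median(latencies):.1f} ms")
            print(f"p95: {latencies[int(0.95 * len(latencies)) - 1]:.1f} ms")

    if __name__ == "__main__":
        run()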

Many webmasters have benefited from the difficult, detail-oriented work that produced Apache Bench, a web server load-testing tool, even though its first results were likely disappointing.

Building tools for repeatable experiments can have indirect benefits as well. Accounting for negative results and statistical insignificance reduces the bias in our metrics and provides an example to others of how to maturely accept uncertainty. By publishing everything, you encourage others to do the same, and everyone in the industry collectively learns much more quickly. SRE has already learned this lesson with high-quality postmortems, which have had a large positive effect on production stability.

When you publish the results, others do not have to design and run a similar experiment themselves. Many more experiments simply go unreported because people mistakenly believe that negative results are not progress. Encourage your peers by recognizing that negative results are part of thoughtful risk taking and that every well-designed experiment has merit.

A report that never mentions failure is potentially either too heavily filtered, or the author was not rigorous in his or her methods.

Definitively proving that a given factor caused a problem, by reproducing it at will, can be difficult to do in production systems; often, we can only find probable causal factors, for reasons such as the following: reproducing the problem in a live production system may not be an option, either because of the complexity of getting the system into a state where the failure can be triggered, or because further downtime may be unacceptable.

Having a nonproduction environment can mitigate these challenges, though at the cost of having another copy of the system to run. In other words, you need to write a postmortem (although ideally, the system is alive at this point!).

Our investigation discovered that latency had indeed increased by nearly an order of magnitude. Simultaneously, the amount of CPU time and the number of serving processes had nearly quadrupled. Clearly something was wrong. It was time to start troubleshooting.

Clearly something was wrong. It was time to start troubleshooting. Typically a sudden increase in latency and resource usage indicates either an increase in traffic sent to the system or a change in system configuration. However, we could easily rule out both of these possible causes:
