Debug

  1. Make sure your kernels have magic SysRq support built in, so that when the box hangs you can at least attempt an emergency sync and unmount (read-only remount) of your filesystems, followed by an emergency reboot. That is a lot better than just hitting the power button. SysRq can also be used to capture stack traces of running programs, get memory dumps, etc. See Documentation/admin-guide/sysrq.rst for details.

  2. Capture the Oops output.

  3. The system may be in no state to write the Oops to your logs, you may be in X and never see it printed on the console, or the system may just hang. Be prepared to capture bugs, either by running a serial console to another machine where you can record the Oopses when they happen, or by using netconsole (useful, but not as reliable as a serial console in my experience; it also requires wired Ethernet, wireless won't do).

  4. If you have a printer, then a console on a line printer (parallel port, not USB) can also be useful.

  5. If you can do none of those, then taking a photo of the screen with the Oops on it or writing it down by hand (all of it) will have to do.

  6. A *kernel panic* is the action the operating system takes when it detects an internal fatal error from which it cannot safely recover: it forces a controlled system hang or reboot in response to a detected run-time malfunction (not necessarily an Oops). The behaviour of the panic feature can be controlled through run-time sysctl settings, for example those for hung task handling.

  7. An Oops occurs when the kernel exception handler is executed; this includes macros such as BUG(), which is defined as an invalid instruction. Each exception has a unique number. Some Oopses are bad enough that the kernel decides to invoke the panic feature and stop running immediately. An Oops is therefore a kernel crash, optionally followed by a panic.

    1. When a running kernel encounters an Oops, a message like ([ 67.994624] Internal error: Oops: 5 [#1] XXXXXXXXXXX) is displayed on the screen. The Oops message contains the values of the CPU registers, the address at which the failure occurred (i.e. the PC, or program counter), the stack, and the name of the currently executing process. Using this message, you can begin to debug the specific problem in the kernel. Sometimes, however, the Oops message alone is insufficient.

    2. How to interpret Oopses with System.map: in Linux, the System.map file is a symbol table used by the kernel. It is required whenever the address of a symbol name, or the symbol name for an address, is needed, which makes it especially useful for debugging kernel panics and Oopses. When CONFIG_KALLSYMS is enabled, the kernel does the address-to-name translation itself, so tools like ksymoops are not required. Note: addresses inside System.map may change from one build to the next; in other words, a new System.map is generated for each kernel build. To debug a problem you must therefore use the System.map of the same kernel build on which the panic or Oops was reported. Note also that in the kernel backtraces in the logs, the kernel reports the nearest symbol to the address being analysed. Not all function symbols are available, because of inlining, static functions, and optimisation, so the reported function name is sometimes not the actual location of the failure.
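
The nearest-symbol lookup described above can be sketched in a few lines of Python. This is only an illustration, not the kernel's own implementation: the System.map excerpt, its addresses, and the helper names `load_symbols` and `resolve` are all made up for the example. It assumes the usual "address type name" layout of System.map lines and resolves an address to the closest symbol at or below it, plus an offset.

```python
import bisect

# Hypothetical System.map excerpt (addresses and symbols are invented);
# real files list "address type name" on each line.
SAMPLE_SYSTEM_MAP = """\
ffffffff81000000 T _text
ffffffff81001000 T do_one_initcall
ffffffff81001200 t helper_static
ffffffff81002000 T schedule
"""


def load_symbols(text):
    """Parse System.map text into a sorted list of (address, name) tuples."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return (name, offset) for the nearest symbol at or below address.

    Returns None when the address lies below every known symbol.
    """
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None
    base, name = symbols[index]
    return name, address - base


symbols = load_symbols(SAMPLE_SYSTEM_MAP)
name, offset = resolve(symbols, 0xFFFFFFFF81001234)
print(f"{name}+{offset:#x}")  # prints "helper_static+0x34"
```

Because the lookup only knows about symbols present in the table, an address inside an inlined or stripped function resolves to whichever listed symbol precedes it, which is why the reported name can be misleading.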

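The emergency sync, read-only remount, and reboot sequence from step 1 can also be issued from software by writing SysRq command letters to /proc/sysrq-trigger, as described in Documentation/admin-guide/sysrq.rst. The sketch below is a hedged illustration, assuming root privileges and a kernel built with magic SysRq support; the function names and the pause length are arbitrary choices for the example. As written it only performs a dry run that records the letters instead of touching the real trigger file.

```python
import time

# SysRq command letters from Documentation/admin-guide/sysrq.rst.
EMERGENCY_SEQUENCE = [
    ("s", "sync all mounted filesystems"),
    ("u", "remount all filesystems read-only"),
    ("b", "reboot immediately"),
]


def write_trigger(key, trigger_path="/proc/sysrq-trigger"):
    """Send one SysRq command letter to the kernel (needs root)."""
    with open(trigger_path, "w") as trigger:
        trigger.write(key)


def emergency_reboot(write=write_trigger, pause_seconds=2.0):
    """Issue sync, read-only remount, then reboot, pausing between steps.

    The pause is an assumption of this sketch: it gives the sync and
    remount a moment to finish before the next command is sent.
    """
    for key, description in EMERGENCY_SEQUENCE:
        print(f"sysrq {key}: {description}")
        write(key)
        time.sleep(pause_seconds)


# Dry run: collect the letters in a list instead of writing to /proc.
sent = []
emergency_reboot(write=sent.append, pause_seconds=0)
print(sent)  # prints ['s', 'u', 'b']
```

Calling `emergency_reboot()` with its defaults would really reboot the machine, so keep the dry-run form unless that is what you want.
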