I decided to get back to the basics and see how much the field of diagnosing kernel crashes has improved, by looking at the Kdump utilities [2].
The problem with the older tools was the fact that you cant quite trust a kernel that just crashed. The questions you need to ask are if the kernel structures and data are intact and, if all the important drivers are still correctly initialized, so that the dumps can be written to the file system (local, NFS or scp). It is better to use an entity outside of the kernel to save the kernel dumps.
Kdump addresses these concerns by providing two components - a fast-booting mechanism that allows booting a Linux kernel from the context of an already running kernel without going through BIOS, called Kexec and capturing the crash dump from the context of a freshly booted kernel and not from the context of the crashed kernel, which guarantees that the necessary kernel drivers will be initialized, named Kdump (confusingly the same name as the main package).
In the following example I'll demonstrate a simple scenario on how to install, configure, crash and diagnose a kernel dump file with kdump and the crash [3] utility on CentOS 6.3.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
[root@host1 ~]# uname -r | |
2.6.32-279.19.1.el6.x86_64 | |
[root@host1 ~]# cat /etc/redhat-release | |
CentOS release 6.3 (Final) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
[root@host1 ~]# yum -y install kexec-tools | |
[root@host1 ~]# yum -y install kernel-debug | |
[root@host1 ~]# yum -y install kernel-debug-devel | |
[root@host1 ~]# wget http://debuginfo.centos.org/6/x86_64/kernel-debuginfo-2.6.32-279.19.1.el6.x86_64.rpm | |
[root@host1 ~]# wget http://debuginfo.centos.org/6/x86_64/kernel-debuginfo-common-x86_64-2.6.32-279.19.1.el6.x86_64.rpm | |
[root@host1 ~]# wget http://debuginfo.centos.org/6/x86_64/kernel-debug-debuginfo-2.6.32-279.19.1.el6.x86_64.rpm | |
[root@host1 ~]# rpm -Uvh kernel-debuginfo-common-x86_64-2.6.32-279.19.1.el6.x86_64.rpm | |
[root@host1 ~]# rpm -Uvh kernel-debuginfo-2.6.32-279.19.1.el6.x86_64.rpm | |
[root@host1 ~]# rpm -Uvh kernel-debug-debuginfo-2.6.32-279.19.1.el6.x86_64.rpm |
Next make sure that the kernel is compiled with support for saving crash dumps etc:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
[root@host1 ~]# egrep 'KEXEC|CRASH|CAL_S' /boot/config-2.6.32-279.el6.x86_64 | |
CONFIG_KEXEC=y | |
CONFIG_KEXEC_AUTO_RESERVE=y | |
CONFIG_CRASH_DUMP=y | |
CONFIG_KEXEC_JUMP=y | |
CONFIG_PHYSICAL_START=0x1000000 | |
CONFIG_CRASH=y |
Another prerequisite for the crash kernel to be loaded at the time of the problem is to reserve some memory area for it. This needs to be configured in the original kernel's bootloader and can be done with the bootloader option crashkernel=XM[ @YM], where X is the size of the reserved area in megabytes and Y specifies the start address. If you omit the second parameter or set it to 0M, the system automatically selects an appropriate location. This is already done on CentOS and to verify it run:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
[root@host1 ~]# grep -i crash /proc/iomem | |
03000000-0b0fffff : Crash kernel |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
[root@host1 ~]# cat /etc/grub.conf | |
[truncated] | |
kernel /vmlinuz-2.6.32-279.19.1.el6.x86_64.debug ro root=/dev/mapper/vg_testbed-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto rd_LVM_LV=vg_testbed/lv_swap KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rd_LVM_LV=vg_testbed/lv_root rhgb quiet | |
[truncated] |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
[root@host1 ~]# grep crash /proc/cmdline | |
ro root=/dev/mapper/vg_testbed-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=129M@0M rd_LVM_LV=vg_testbed/lv_swap KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rd_LVM_LV=vg_testbed/lv_root rhgb quiet |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
[root@host1 ~]# /etc/init.d/kdump start | |
[root@host1 ~]# chkconfig kdump on |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
[root@host1 ~]# cat /etc/kdump.conf | |
[root@host1 ~]# cat /etc/sysconfig/kdump |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
[root@host1 ~]# yum install -y crash |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
[root@host1 ~]# echo 1 > /proc/sys/kernel/sysrq | |
[root@host1 ~]# echo c > /proc/sysrq-trigger |
Now observe the console and you should see the crash kernel dumping the memory of the production kernel and creating the vmcore file on the specified file system (by default the file will go in a sub-directory in /var/crash/).
To analyse what caused the crash we can run the crash utility like so:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
[root@host1 ~]# crash /usr/lib/debug/lib/modules/2.6.32-279.19.1.el6.x86_64/vmlinux vmcore | |
crash 6.0.4-2.el6 | |
KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.19.1.el6.x86_64/vmlinux | |
DUMPFILE: vmcore [PARTIAL DUMP] | |
CPUS: 2 | |
DATE: Tue Jan 29 06:12:56 2013 | |
UPTIME: 00:15:58 | |
LOAD AVERAGE: 0.01, 0.13, 0.08 | |
TASKS: 111 | |
NODENAME: testbed | |
RELEASE: 2.6.32-279.19.1.el6.x86_64 | |
VERSION: #1 SMP Wed Dec 19 07:05:20 UTC 2012 | |
MACHINE: x86_64 (2128 Mhz) | |
MEMORY: 2 GB | |
PANIC: "Oops: 0002 [#1] SMP " (check log for details) | |
PID: 1295 | |
COMMAND: "bash" | |
TASK: ffff8800376c3540 [THREAD_INFO: ffff88007c820000] | |
CPU: 1 | |
STATE: TASK_RUNNING (PANIC) | |
crash> |
We can see a backtrace and the processes by running bt and ps respectively:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
crash> bt | |
PID: 1295 TASK: ffff8800376c3540 CPU: 1 COMMAND: "bash" | |
#0 [ffff88007c8219e0] machine_kexec at ffffffff81031f7b | |
#1 [ffff88007c821a40] crash_kexec at ffffffff810b8c22 | |
#2 [ffff88007c821b10] oops_end at ffffffff814ed6b0 | |
#3 [ffff88007c821b40] no_context at ffffffff81042a0b | |
#4 [ffff88007c821b90] __bad_area_nosemaphore at ffffffff81042c95 | |
#5 [ffff88007c821be0] bad_area at ffffffff81042dbe | |
#6 [ffff88007c821c10] __do_page_fault at ffffffff81043570 | |
#7 [ffff88007c821d30] do_page_fault at ffffffff814ef68e | |
#8 [ffff88007c821d60] page_fault at ffffffff814eca45 | |
[exception RIP: sysrq_handle_crash+22] | |
RIP: ffffffff81321a66 RSP: ffff88007c821e18 RFLAGS: 00010096 | |
RAX: 0000000000000010 RBX: 0000000000000063 RCX: 000000000000abf5 | |
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000063 | |
RBP: ffff88007c821e18 R8: 0000000000000000 R9: 0000000000000004 | |
R10: ffffffff8163cac0 R11: ffff8800373d269f R12: 0000000000000000 | |
R13: ffffffff81afb7a0 R14: 0000000000000286 R15: 0000000000000004 | |
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 | |
#9 [ffff88007c821e20] __handle_sysrq at ffffffff81321d22 | |
#10 [ffff88007c821e70] write_sysrq_trigger at ffffffff81321dde | |
#11 [ffff88007c821ea0] proc_reg_write at ffffffff811db48e | |
#12 [ffff88007c821ef0] vfs_write at ffffffff81176588 | |
#13 [ffff88007c821f30] sys_write at ffffffff81176e81 | |
#14 [ffff88007c821f80] system_call_fastpath at ffffffff8100b072 | |
RIP: 00007f61ecd44e60 RSP: 00007fff8c5f04f0 RFLAGS: 00010202 | |
RAX: 0000000000000001 RBX: ffffffff8100b072 RCX: 0000000000000063 | |
RDX: 0000000000000002 RSI: 00007f61ed63e000 RDI: 0000000000000001 | |
RBP: 00007f61ed63e000 R8: 000000000000000a R9: 00007f61ed63a700 | |
R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000002 | |
R13: 00007f61ecff7780 R14: 0000000000000002 R15: 00007f61ecff7780 | |
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b | |
crash> |
For more information refer to the free Linux Kernel Crash Book by Igor Ljubuncic [4].
Resources:
[1] http://lkcd.sourceforge.net/
[2] http://lse.sourceforge.net/kdump/
[3] http://people.redhat.com/anderson/crash_whitepaper/
[4] https://docs.google.com/file/d/0B0x5iwJdUkt0LW54VVVxSlBPdE0/edit