Troubleshooting a Process That Does Not Respond to kill




A user begins rewinding a tape but realizes that the wrong tape is in the drive. The user tries to kill the job but must wait for the process to finish.
Why?
The mt command has made an ioctl call to the SCSI tape driver (st) and must wait for the driver to release the process back to user space before user signals will be handled.
# mt -f /dev/st0 rewind
# ps -emo state,pid,ppid,pri,size,stime,time,comm,wchan | grep mt
D  9225  8916  24 112 20:46 00:00:00 mt             wait_for_completion
 
[root@atlorca2 root]# kill -9 9225
[root@atlorca2 root]# echo $?   # Prints the return code of the previous command; 0 = success
0
[root@atlorca2 root]# ps -elf | grep 9225
0 D root     9225 8916  0  24   0   -    112 wait_f 20:46 pts/1    00:00:00 mt -f /dev/st0

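The kill command returned 0 because the signal was queued successfully, not because it was delivered. The mt process is in uninterruptible sleep (state D) in a wait channel inside the driver, and the signal will be processed only after the code returns from the driver.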
Let's check the pending signals:
# cat /proc/9225/status
Name:   mt
State:  D (disk sleep)
Tgid:   9225
Pid:    9225
PPid:   8916
TracerPid:      0
Uid:    0       0      0     0
Gid:    0       0      0     0
FDSize: 256
Groups: 0 1 2 3 4 6 10
VmSize:    2800 kB
VmLck:        0 kB
VmRSS:      640 kB
VmData:      96 kB
VmStk:       16 kB
VmExe:       32 kB
VmLib:     2560 kB
SigPnd: 0000000000000100 <-- SigPnd is a bit mask, printed in hexadecimal,
which indicates the pending signals. Each hexadecimal digit accounts for
4 bits, and signal N is represented by bit N-1. In this case, the pending
signal has a value of 9, so the lowest bit of the third digit from the
right is set. This algorithm is detailed in linux/fs/proc/array.c under
the render_sigset_t() function. The following table illustrates this
function.
 
Signal    : 1 2 3 4 . 5 6 7 8 . 9 10 11 12 . 13 14 15 16
bit value : 1 2 4 8 . 1 2 4 8 . 1  2  4  8 . 1  2  4  8
 
kill -3 yields bit mask 0000000000000004
kill -9 yields bit mask 0000000000000100
 
ShdPnd: 0000000000000100
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 00000000fffffeff
CapEff: 00000000fffffeff
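
The bit-mask arithmetic described for SigPnd can be checked with a small shell function. This is a minimal sketch, not taken from the kernel source: the name decode_sigmask is hypothetical, and it assumes the mask fits the shell's signed 64-bit arithmetic, so a mask with signal 64 set may not decode correctly in every shell.

```shell
# Print the signal numbers set in a signal mask from /proc/<pid>/status.
# Argument: the hexadecimal mask without a 0x prefix, e.g. 0000000000000100
# (decode_sigmask is a hypothetical helper name used for illustration.)
decode_sigmask() (
    mask=$((0x$1))
    sig=1
    out=""
    while [ "$sig" -le 64 ]; do
        # Signal N is represented by bit N-1 of the mask.
        if [ $(( (mask >> (sig - 1)) & 1 )) -eq 1 ]; then
            out="$out${out:+ }$sig"
        fi
        sig=$((sig + 1))
    done
    echo "$out"
)
```

For example, `decode_sigmask 0000000000000100` prints 9 and `decode_sigmask 0000000000000004` prints 3, matching the kill -9 and kill -3 masks shown above.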

Troubleshooting the hung process involves these steps:
1. Identify all the tasks (threads) for the program.
2. Assess the hanging process. Is it easily reproducible?
3. Assess the other things going on. What else is the machine doing? Check load and other applications' response time.
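
The first step can be approximated from /proc on kernels that expose a task directory for each process. The following is a sketch under that assumption: list_tasks and its optional second argument (an alternate proc root, useful for testing) are hypothetical names, not part of any standard tool.

```shell
# List each task (thread) of a process with its state and wait channel.
# Usage: list_tasks PID [PROC_ROOT]
# Assumes /proc/<pid>/task/<tid>/ exists; PROC_ROOT defaults to /proc.
list_tasks() (
    pid=$1
    root=${2:-/proc}
    for dir in "$root/$pid"/task/*; do
        # Skip the unexpanded glob or tasks that exited in the meantime.
        [ -r "$dir/status" ] || continue
        tid=${dir##*/}
        # Second field of the State: line is the one-letter state (R, S, D, ...).
        state=$(awk '/^State:/ { print $2 }' "$dir/status")
        # wchan may be missing or unreadable; print "-" in that case.
        wchan=$(cat "$dir/wchan" 2>/dev/null)
        echo "$tid $state ${wchan:--}"
    done
)
```

Run as `list_tasks 9225` against the example above, this would be expected to show the mt task in state D. What appears in the wait channel column depends on what the running kernel exposes in the wchan file.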