I'm having a problem with a stalling Linux system, and I have found that sysstat/sar reports huge peaks in disk I/O utilization, average service time and average wait time at the time of the stall.
How could I go about determining which process is causing these peaks the next time it happens?
Is it possible to do with sar (i.e. can I find this info in the already recorded sar files)?
Output of “sar -d”; the system stall happened around 12:58-13:01 pm.
This is a follow-up question to a thread I started yesterday: Sudden peaks in load and disk block wait. I hope it's OK that I created a new topic/question on the matter, since I have not been able to resolve the problem yet.
asked Aug 12 ’10 at 7:48
You can use pidstat to print cumulative I/O statistics per process every 20 seconds with this command:
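(The exact flags below are an assumption based on the column list that follows; -d selects disk I/O statistics and -l shows the full command line.)

```shell
# Per-process disk I/O statistics, one report every 20 seconds.
# Requires the sysstat package; the kernel must expose per-task I/O accounting.
pidstat -dl 20
```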
Each row will have the following columns:
- PID – process ID
- kB_rd/s – Number of kilobytes the task has caused to be read from disk per second.
- kB_wr/s – Number of kilobytes the task has caused, or shall cause to be written to disk per second.
- kB_ccwr/s – Number of kilobytes whose writing to disk has been cancelled by the task. This may occur when the task truncates some dirty pagecache. In this case, some I/O for which another task has been accounted will not happen.
- Command – The command name of the task.
Output looks like this:
Nothing beats ongoing monitoring; you simply cannot get time-sensitive data back after the event.
There are a couple of things you might be able to check to implicate or eliminate candidates, however; /proc is your friend.
In /proc/diskstats, fields 10 and 11 are accumulated sectors written and accumulated time (ms) spent writing. This will show your hot file-system partitions.
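Those counters can be dumped straight from /proc/diskstats (a sketch; the field positions assume the 2.6+ diskstats layout):

```shell
# Device name (field 3), sectors written (field 10), ms spent writing (field 11)
awk '{ print $3, $10, $11 }' /proc/diskstats
```

Run it twice a few seconds apart and diff the numbers to see which partition is taking the writes.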
In /proc/[PID]/stat, the relevant fields are PID (field 1), command (field 2) and cumulative block-I/O delay ticks (field 42). This will show your hot processes, though only if they are still running. (You probably want to ignore your filesystem journalling threads.)
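One way to pull those fields for every process is straight from /proc/[PID]/stat (a sketch; field 42 holds the cumulative block-I/O delay ticks, and the field numbering assumes no spaces in the command name):

```shell
# PID (field 1), command (field 2), aggregated block-I/O delay ticks (field 42),
# sorted so the processes that have waited longest on disk come first.
# Processes that exit mid-scan are silently skipped.
awk '{ print $1, $2, $42 }' /proc/[0-9]*/stat 2>/dev/null | sort -k3 -rn | head
```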
The usefulness of the above depends on uptime, the nature of your long running processes, and how your file systems are used.
Caveats: this does not apply to pre-2.6 kernels; check your documentation if unsure.
(Now go and do your future self a favour, install Munin/Nagios/Cacti/whatever 😉)
answered Jan 11 ’13 at 22:52
Write the data to a compressed file that atop can read later in interactive mode. Take a reading (delta) every 10 seconds, and do it 1080 times (3 hours, so if you forget about it the output file won't run you out of disk):
atop -a -w historical_everything.atop 10 1080
After bad thing happens again:
(even if it is still running in the background, it just appends every 10 seconds)
atop -r historical_everything.atop
Since you said I/O, I would hit 3 keys: t (advance to the next sample), d (show per-process disk activity), D (sort by disk activity).
answered Oct 16 ’15 at 16:41
Use btrace. It's easy to use; for example: btrace /dev/sda. If the command is not available, it is probably in the blktrace package.
EDIT: Since debugfs is not enabled in your kernel, you might try
date >> /tmp/wtf && ps -eo "cmd,pid,min_flt,maj_flt" >> /tmp/wtf
or similar. Logging page faults is of course not at all the same as using btrace, but if you are lucky, it MAY give you some hint about the most disk-hungry processes. I just tried that on one of my most I/O-intensive servers and the list included the processes I know are consuming lots of I/O.
answered Aug 12 ’10 at 8:02
Hello Janne, the kernel is unfortunately not compiled with the debug file system, and it's a live system so I am unable to recompile the kernel. Is there any other way to do this without recompiling? Avada Kedavra Aug 12 ’10 at 8:13
OK, I edited my reply a bit 🙂 Janne Pikkarainen Aug 12 ’10 at 8:27
Great, now we're getting somewhere! I'm thinking about putting this into a cron job and executing it concurrently with the sar cron job. Then, next time the server stalls, I should be able to compare the page-fault rates to see which process or processes have an increased rate. I guess I could be unlucky and see a rise in disk I/O for all processes during the stall, but it's definitely worth a good try. Thanks Janne! (I would vote up your answer if I could :S) Avada Kedavra Aug 12 ’10 at 9:02