I have a cluster of systems running a particular kind of networking software that, even under heavy traffic and while performing deep packet inspection, remains relatively underutilized in terms of CPU load and memory consumption.
One day, though, the CPU load average spiked from 0.5 to 18 on a 12-core system, on every node in that cluster, including the ones that were just passive fail-overs and did no processing at all, merely being members of the same multicast domain. Nothing seemed out of the ordinary: the traffic pattern was the same, and there had been no upgrades, updates, or any system changes whatsoever. OSSEC did not report any file checksum changes, so I knew the systems had not been compromised.
This is a perfect case for demonstrating a way to troubleshoot sudden, persistent CPU utilization spikes and how to look for memory leaks. Of course this will not apply to every case, but it demonstrates a certain tool set and the methodology behind approaching similar problems.
First I checked the load averages on the server:
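Any of the usual ways of reading them will do; for example:

```
uptime
cat /proc/loadavg
```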
Then I got the list of processes that consume the most CPU:
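One way to do this (top sorted by CPU usage works just as well; the column selection here is just a convenient one):

```
ps -eo pid,ppid,pcpu,pmem,rss,comm --sort=-pcpu | head -15
```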
No surprises here: the main networking software is the one causing the load, as no other services are running on that server. I'll focus on the biggest consumer, PID 18354.
Next I performed a strace on that pid for about 10 seconds:
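Roughly along these lines, attaching to the PID found above, following forked children, and timestamping each call (the output file name is arbitrary):

```
timeout 10 strace -f -tt -p 18354 -o /tmp/strace-18354.out
```

A separate run with the -c flag is also handy for a summary of which system calls dominate.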
The trace returned a large number of repetitive lines, and the only thing that stood out was a recurring failure to read file descriptor 11.
Then I checked to see what this file descriptor is:
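Either of these works, using the file descriptor number from the trace:

```
ls -l /proc/18354/fd/11
lsof -p 18354 | awk '$4 ~ /^11[a-z]?$/'
```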
So at least I learned that it’s a socket.
Got some more information about that particular socket:
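One way is to take the socket inode reported in /proc and match it against the kernel's socket tables, or simply to filter the ss output by process (the inode number below is illustrative):

```
# the inode comes from the socket:[inode] entry shown above
grep 142973 /proc/net/tcp /proc/net/tcp6
ss -tanpi | grep 18354
```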
Then I checked the memory consumption and any indicators for a memory leak for that PID:
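pmap gives a quick per-mapping breakdown, and /proc/<pid>/smaps can be summed for a total resident figure:

```
pmap -x 18354
awk '/^Rss:/ {sum += $2} END {print sum " kB"}' /proc/18354/smaps
```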
Pretty high memory usage; each of the networking child processes has close to that much mapped memory.
Apart from a few regions that are shared and uninitialized (rwxs- zero (deleted)), nothing else really stood out.
After this I checked to see if we have enough file descriptors for connection allocation:
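The system-wide figures live in /proc, and the per-process count can be compared against its limit:

```
cat /proc/sys/fs/file-nr     # allocated, unused, maximum
cat /proc/sys/fs/file-max
ls /proc/18354/fd | wc -l
```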
We have plenty available.
Next I checked to see how much memory the kernel is configured to dedicate to TCP:
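The limits are expressed in pages and consist of three values - low, pressure, and max:

```
cat /proc/sys/net/ipv4/tcp_mem
```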
And compared the 'max' number with how much of that memory TCP actually uses:
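The current allocation shows up in the 'mem' field of the TCP line (also in pages):

```
cat /proc/net/sockstat
```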
The number of pages currently allocated to TCP (9413) is well below the maximum reserved by the kernel (6961536).
Then I checked to see if there are too many orphaned sockets:
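The ceiling is a sysctl, and the current count is in the 'orphan' field of the sockstat output:

```
cat /proc/sys/net/ipv4/tcp_max_orphans
grep orphan /proc/net/sockstat
```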
As you can see, the current number of orphaned sockets (529) is well below the maximum number of TCP sockets not attached to any user file handle shown above (262144), so we are good on that front.
At this point I wanted to see the different states of the INET sockets, so I wrote a quick script that counts connections in four important TCP states - ESTABLISHED, SYN_RECEIVED, TIME_WAIT, and LAST_ACK - and shows the top 5 peers for each:
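A minimal sketch of the idea, assuming netstat output (netstat reports SYN_RECEIVED as SYN_RECV, and the remote address is in the fifth column):

```
#!/bin/bash
# For each of the four TCP states, count connections and print the
# top 5 remote peers seen in that state.
for state in ESTABLISHED SYN_RECV TIME_WAIT LAST_ACK; do
    echo "### ${state}"
    netstat -ant | awk -v s="$state" '$6 == s {sub(/:[0-9]+$/, "", $5); print $5}' \
        | sort | uniq -c | sort -rn | head -5
done
```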
Running it showed nothing unusual in the connection states or the top peers. A tcpdump capture and the kernel messages did not reveal anything out of the ordinary either, and restarting the software, or even rebooting the server, did not help.
At this point I ran out of ideas as to what to look for. I knew it was the networking software causing the high load and memory consumption, as shown by the process list (or top), but since the software is closed source I could not go and look at the actual code.
The last thing I tried, though, was to start gathering the memory consumption of all running processes and their children every minute for 5 days, by running a job like the following through cron and storing the output in a file along with the PIDs:
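The collection job was essentially a one-minute cron entry dumping a timestamp, PID, and resident set size for every process into a log; the path and field list here are illustrative:

```
# /etc/cron.d/memlog - the % sign must be escaped inside a crontab
* * * * * root ps -eo pid,ppid,rss,vsz,comm --no-headers | sed "s/^/$(date +\%s) /" >> /var/log/proc-mem.log
```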
After 5 days I graphed the results with Graphite, with the memory usage of all the processes superimposed on each other. There was a single process (the same one with the highest CPU usage) whose memory consumption kept increasing over time, indicative of a memory leak.
I sent the results to the product vendor and, sure enough, they found two bugs in their code: one where sockets were not being cleaned up properly, and a memory leak.