Elevated number of dropped TCP connections to a listening network socket.
The symptom:
“SYNs to LISTEN sockets dropped” increments at a high rate:
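On Linux this counter is the TcpExt ListenDrops statistic. `netstat -s | grep -i listen` shows it in human-readable form; a sketch that reads it straight from /proc:

```shell
# "SYNs to LISTEN sockets dropped" corresponds to the TcpExt
# ListenDrops counter (ListenOverflows is the related accept-queue one).
awk '/^TcpExt:/ { if (!h) h = $0; else v = $0 }
     END { n = split(h, H); split(v, V)
           for (i = 1; i <= n; i++)
             if (H[i] == "ListenDrops" || H[i] == "ListenOverflows")
               print H[i], V[i] }' /proc/net/netstat
```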
Obtaining the baseline:
Starting from a lower level, let's check the size of the transmit queue on the network interface and make sure there aren’t any collisions:
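For example, with iproute2 (the interface name here is a placeholder; substitute the NIC in question):

```shell
DEV=lo   # replace with the NIC in question, e.g. eth0
# qlen in the first line is the transmit queue length; the TX block
# includes errors, dropped, carrier and collisions (collsns) counters.
ip -s -s link show dev "$DEV"
```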
Next, let’s check to see if the interface is dropping packets due to the transmit queue:
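The qdisc statistics expose drops on the transmit side (again, the interface name is a placeholder):

```shell
DEV=lo   # replace with the NIC in question
# Look at the "dropped" and "overlimits" counters on the qdisc:
tc -s qdisc show dev "$DEV"
```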
Finally, check for any fragmentation problems:
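The fragmentation and reassembly counters live in the Ip line of /proc/net/snmp (`netstat -s` shows the same numbers in its IP section); a sketch:

```shell
# Print the Frag*/Reasm* counters; non-zero ReasmFails/FragFails
# would point at a fragmentation problem.
awk '/^Ip:/ { if (!h) h = $0; else v = $0 }
     END { n = split(h, H); split(v, V)
           for (i = 1; i <= n; i++)
             if (H[i] ~ /^(Frag|Reasm)/) print H[i], V[i] }' /proc/net/snmp
```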
Moving up the stack, print the Accept Queue sizes for the listening service. Recv-Q shows the number of sockets in the Accept Queue, and Send-Q shows the backlog parameter:
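For listening sockets, `ss` reports exactly these two columns:

```shell
# For sockets in LISTEN, Recv-Q is the current Accept Queue depth
# (connections waiting to be accept()ed) and Send-Q is the backlog limit.
ss -lnt
```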
Nothing really in the accept queue, let’s check how many connections are in SYN-RECV state for the receiving process in question:
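A quick count with an `ss` state filter (the port is a placeholder for the service in question):

```shell
# Count sockets currently in SYN-RECV for the listener on :8080.
# Subtract one for the header line, or use ss -H on newer iproute2.
ss -nt state syn-recv sport = :8080 | wc -l
```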
Connections are moving to ESTABLISHED pretty quickly. Let’s make sure we have enough file descriptors available (the current number of allocated file handles, the number of unused but allocated file handles, the system-wide maximum):
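The three numbers described above come straight from /proc:

```shell
# allocated handles / unused-but-allocated handles / system-wide maximum
cat /proc/sys/fs/file-nr
# per-process limit of the service itself (PID is a placeholder):
# grep 'open files' /proc/<pid>/limits
```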
Check for half-closed connections waiting on FIN-ACK, and the total number of established connections:
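One way to get both at once is to tally sockets by TCP state:

```shell
# FIN-WAIT-* / CLOSE-WAIT entries are the half-closed connections,
# ESTAB is the established total.
ss -ant | awk 'NR > 1 { s[$1]++ } END { for (k in s) print k, s[k] }'
```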
No concerns here based on the total number of connections. Let’s check the rate of new (NEW) connections:
The current rate is about 180 NEW connections per second. Observing the rate on a single node over a 24-hour period, we peak at about 250 connections per second. Checking the CPU and memory utilization shows a pretty idle system, even during peak send times:
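A quick snapshot can be taken from /proc (`vmstat`, `top` or `free` give the same picture):

```shell
# load averages over 1/5/15 minutes, plus total and available memory
cat /proc/loadavg
grep -E 'MemTotal|MemAvailable' /proc/meminfo
```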
Finally, checking the counter for dropped SYN packets shows an ever-increasing number, at a rate of about 20/sec:
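The rate can be measured by sampling the ListenDrops counter over an interval, same as with PassiveOpens above:

```shell
# Rate of "SYNs to LISTEN sockets dropped" (TcpExt ListenDrops) per second.
listen_drops() {
  awk '/^TcpExt:/ { if (!h) h = $0; else v = $0 }
       END { n = split(h, H); split(v, V)
             for (i = 1; i <= n; i++)
               if (H[i] == "ListenDrops") print V[i] }' /proc/net/netstat
}
a=$(listen_drops); sleep 1; b=$(listen_drops)
echo "$((b - a)) drops/sec"
```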
The main reason for dropped SYN packets is a SYN Queue that is filling up, but I was not able to see that in any of the above diagnostics. For better visibility, let’s install some kernel hooks with SystemTap to print details on exactly which connections suffer from Accept Queue overflow. This should help identify periodically hung applications that fail to accept() connections fast enough:
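A sketch of such a hook, not a drop-in script: kernel struct layouts and function names vary by version, so the probe point and field accesses below are assumptions that need checking against the running kernel. `tcp_v4_syn_recv_sock()` returns NULL when it cannot create the child socket, one cause being a full Accept Queue on the listener:

```systemtap
#!/usr/bin/env stap
# Report listeners whose incoming connections are being dropped
# because no child socket could be created (e.g. Accept Queue full).
probe kernel.function("tcp_v4_syn_recv_sock").return {
  if ($return == 0) {
    printf("%s: child-socket/accept-queue drop on listener %s:%d\n",
           ctime(gettimeofday_s()),
           ip_ntop($sk->__sk_common->skc_rcv_saddr),
           $sk->__sk_common->skc_num)
  }
}
```

Run with `stap -v acceptq-drops.stp` (requires root and matching kernel debuginfo).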
Unfortunately, running the kernel hook for 24 hours did not yield any results. The SYN and Accept Queues were nearly empty, though the “SYNs to LISTEN sockets dropped” issue persisted.
Tuning the kernel for better network performance:
After each incremental change, I measured the rate of SYN errors and checked the SYN and Accept queue utilizations.
Increased the incoming connections backlog queue. This queue sets the maximum number of packets queued on the INPUT side:
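For example (the value is illustrative, not a recommendation):

```shell
sysctl -w net.core.netdev_max_backlog=65535
```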
Increased the overall TCP memory, in pages (number of guaranteed pages for TCP, the threshold at which TCP should start to conserve pages, maximum number of allocatable pages):
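For example (the three values are illustrative):

```shell
# min / pressure / max, in pages
sysctl -w net.ipv4.tcp_mem="786432 1048576 1572864"
```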
Increased the absolute maximum of the core system socket read and write buffers, in bytes. Applications cannot request more than this value:
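For example (16 MB here is illustrative):

```shell
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
```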
Increased the system socket read and write buffers (min, default and max size in bytes):
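For example (illustrative values):

```shell
# min, default and max, in bytes
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
```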
Ensured TCP window scaling is enabled:
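```shell
sysctl -w net.ipv4.tcp_window_scaling=1
```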
Updated how many times to retry SYN connections. With the default, the final timeout for an active TCP connection attempt happens after 127 seconds:
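For example (the default is 6, giving the 127-second total; 3 is an illustrative value):

```shell
sysctl -w net.ipv4.tcp_syn_retries=3
```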
And, arguably most importantly, I’ve increased the limit on the socket listen() backlog, the maximum value that net.ipv4.tcp_max_syn_backlog can take. The kernel documentation states that if this limit is reached, SYN packets will be dropped:
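This limit is net.core.somaxconn; for example:

```shell
sysctl -w net.core.somaxconn=65535
```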
Even though that huge number was accepted (the default varies by kernel version, from 128 to 4096), it seems the queue can’t be more than 65535:
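One way to observe this (the exact capping behaviour varies by kernel version; on older kernels the per-socket backlog field is only 16 bits wide, which would explain the 65535 ceiling):

```shell
sysctl -w net.core.somaxconn=1048576   # accepted without error...
cat /proc/sys/net/core/somaxconn       # ...but the effective backlog
                                       # still caps out at 65535
```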
Increased the Listener queue length for unacknowledged SYN_RECV connection attempts. A SYN_RECV request socket consumes about 304 bytes of memory:
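For example (the value is illustrative; at ~304 bytes per request socket, 262144 entries is roughly 80 MB worst case):

```shell
sysctl -w net.ipv4.tcp_max_syn_backlog=262144
```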
Checking to see how many connections are in SYN_RECV state after the above change:
Increased the number of times SYNACKs for a passive TCP connection attempt will be retransmitted. With the default, the final timeout for a passive TCP connection happens after 63 seconds:
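For example (the default is 5, giving the 63-second total; 3 is an illustrative value):

```shell
sysctl -w net.ipv4.tcp_synack_retries=3
```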
Finally, disabling the reuse of TCP connections (at the expense of an increased number of TIME_WAIT connections and about 120MB of extra memory usage) yielded the best result: dropped SYN packets went down to about 3 per 15 minutes!
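The exact knob isn’t named above; presumably this refers to net.ipv4.tcp_tw_reuse, which controls reuse of TIME_WAIT sockets for new outbound connections:

```shell
# 0 disables reuse, so TIME_WAIT sockets linger for their full lifetime
sysctl -w net.ipv4.tcp_tw_reuse=0
```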