#The Incident
At 01:45 AM, netdata started screaming about CPU hitting 100%. All OpenResty workers were pegged. The nginx error logs showed Redis connection timeouts:
[error] *142191 [lua] access_by_lua(nginx.conf:180):44: Failed to connect to Redis for counter: timeout
Here’s how I tracked it down.
#Establishing the network baseline
First thing - check if we’re getting hammered at the network level:
netstat -ant | grep SYN_RECV | wc -l
496
netstat -s | grep -i syn
16864284 SYN cookies sent
24096291 SYN cookies received
7468944 invalid SYN cookies received
32381 resets received for embryonic SYN_RECV sockets
149466373 SYNs to LISTEN sockets dropped
149 million SYNs dropped (these counters are cumulative since boot) and 496 connections currently sitting in SYN_RECV. That’s bad. Check what the system limits are:
sysctl net.ipv4.tcp_max_syn_backlog
net.ipv4.tcp_max_syn_backlog = 3240000
sysctl net.core.somaxconn
net.core.somaxconn = 10000000
System limits are fine. But what about nginx?
ss -lnt 'sport = :80'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 512 511 0.0.0.0:80 0.0.0.0:*
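There it is. For a listening socket, Send-Q is the backlog limit and Recv-Q is the current accept queue length. Send-Q shows 511 - the default nginx backlog on Linux - and Recv-Q is sitting at 512, so the queue is full. For a box taking this much traffic, 511 is nothing.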
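Eyeballing Recv-Q against Send-Q gets tedious on a box with many listeners. Here is a rough sketch of a filter for it. `full_listen_queues` is a name I made up for this example, and it assumes the column layout shown in the `ss -lnt` output above.

```shell
#!/usr/bin/env bash
# full_listen_queues: hypothetical helper, not part of ss or nginx.
# Reads `ss -lnt` output on stdin. For LISTEN sockets, column 2 (Recv-Q) is
# the current accept queue length and column 3 (Send-Q) is the backlog limit.
# Prints every listener whose queue has reached its limit.
full_listen_queues() {
    awk '$1 == "LISTEN" && ($2 + 0) >= ($3 + 0) {
        print $4, "queue=" $2, "backlog=" $3
    }'
}
```

On the affected host this would be run as `ss -lnt | full_listen_queues`. An empty result means every listener still has headroom.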
#Measuring actual traffic rates
Before assuming it’s an attack, measure what’s actually happening:
sar -n TCP 1 10
Linux 6.8.0-71-generic 11/21/25 _x86_64_ (48 CPU)
active/s passive/s iseg/s oseg/s
00:53:33 768.00 1027.00 37056.00 10202.00
00:53:34 2732.00 3093.00 46410.00 19514.00
00:53:35 2069.00 3122.00 48362.00 21623.00
00:53:36 1339.00 2247.00 47031.00 20671.00
...
Average: 1709.40 2802.50 47715.30 21809.50
~2800 passive opens/sec (incoming connections), plus ~1700 active opens/sec (outbound connections we initiate). High but not insane for a busy ad exchange. This isn’t a DDoS - we’re just not keeping up.
#Finding the actual bottleneck
Time for strace. Attach to a worker and see where it’s spending time:
strace -c -p $(pgrep -f 'nginx: worker' | head -1)
^C
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.57 1.883669 17604 107 107 connect
0.08 0.001581 14 108 socket
0.07 0.001409 24 58 close
0.06 0.001308 15 86 write
...
99.57% of syscall time in failed connect() calls. 107 calls, 107 errors, averaging about 17.6 ms each. Every single connect is failing.
This isn’t about accepting incoming connections - it’s about outbound connections. In our case, Lua code connecting to Redis on every request.
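`strace -c` only gives totals. To see where the connects are going and which errno they return, I would trace just connect() and summarize the result. A rough sketch: `connect_summary` is a made-up helper name, and the parsing assumes strace's usual IPv4 line format (`sin_port=htons(...)`, `sin_addr=inet_addr("...")`), so Unix-socket and IPv6 connects are silently skipped.

```shell
#!/usr/bin/env bash
# connect_summary: hypothetical helper. Reads strace output on stdin and prints
# "<count> <ip:port> <errno or OK>" for every IPv4 connect(), busiest first.
#
# Capture a trace first, for example (10 seconds, one worker):
#   timeout 10 strace -e trace=connect -o /tmp/connect.trace -p <worker_pid>
#   connect_summary < /tmp/connect.trace
connect_summary() {
    # Pull out destination port, destination address, return value and errno name.
    sed -E -n 's/^.*sin_port=htons\(([0-9]+)\), sin_addr=inet_addr\("([^"]+)"\).*\) = (-?[0-9]+) ?([A-Z]*).*$/\2:\1 \3 \4/p' |
        # A successful connect() has no errno name, so label it OK.
        awk '{ key = $1 " " ($3 == "" ? "OK" : $3); n[key]++ }
             END { for (k in n) print n[k], k }' |
        sort -k1,1nr -k2
}
```

One thing to keep in mind when reading the result: a non-blocking connect() normally returns EINPROGRESS, and strace counts that as an error too. EADDRNOTAVAIL is the one that points directly at ephemeral port exhaustion.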
#The root cause
With 2000+ requests/sec hitting multiple backend services, and every request opening a new connection, we’re burning through ephemeral ports. On Linux, each connection we close sits in TIME_WAIT for 60 seconds. Math:
- 2000 connections/sec
- 60 second TIME_WAIT
- 2000 × 60 = 120,000 sockets in TIME_WAIT
Check it:
netstat -ant | grep TIME_WAIT | wc -l
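The workers can’t establish new connections because we’ve exhausted the port range. Assuming the default net.ipv4.ip_local_port_range of 32768-60999, there are only about 28,000 ephemeral ports available per destination address and port, nowhere near 120,000.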
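The same arithmetic can be run against the port range the box actually has. This is a back-of-the-envelope sketch: `time_wait_budget` is a name I made up, and it treats all outbound connections as going to a single destination ip:port, which is the worst case for one Redis endpoint.

```shell
#!/usr/bin/env bash
# time_wait_budget RATE [TIME_WAIT_SECS]   (hypothetical helper)
# Reads the ephemeral port range ("low high") on stdin, in the format of
# /proc/sys/net/ipv4/ip_local_port_range, and compares the ports available
# with the sockets that RATE new connections/sec would park in TIME_WAIT.
time_wait_budget() {
    local rate=$1 tw_secs=${2:-60} low high
    read -r low high
    local ports=$((high - low + 1))
    local demand=$((rate * tw_secs))
    echo "ports=$ports demand=$demand sustainable_rate=$((ports / tw_secs))/s"
}
```

Usage: `time_wait_budget 2000 < /proc/sys/net/ipv4/ip_local_port_range`. With the Linux default range (32768 60999) it prints `ports=28232 demand=120000 sustainable_rate=470/s`, meaning that without connection reuse a single destination tops out at roughly 470 new connections per second.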
#The fixes
#1. Increase the listen backlog and enable reuseport
listen 80 backlog=65535 reuseport;
The reuseport option is critical under high load - it gives each worker its own listen socket, eliminating lock contention. Without it, all workers fight over one socket.
#2. Enable keepalive to upstream services
Stop creating new connections for every request. For Lua/OpenResty Redis connections, use connection pooling:
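local red = redis:new()
-- ... connect and run commands as usual ...
-- When done, return the connection to the pool instead of calling red:close()
red:set_keepalive(10000, 100) -- 10s max idle time, pool size 100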
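For HTTP upstreams, use nginx keepalive (on most nginx versions the proxied location also needs proxy_http_version 1.1 and proxy_set_header Connection "" for upstream keepalive to take effect):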
upstream backend {
server 127.0.0.1:8080;
keepalive 100;
}
#3. Tune TIME_WAIT handling
sysctl -w net.ipv4.tcp_tw_reuse=1
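This lets the kernel reuse TIME_WAIT sockets for new outbound connections (it relies on TCP timestamps, which are on by default). sysctl -w doesn’t survive a reboot, so add the setting to /etc/sysctl.conf or a file under /etc/sysctl.d/ as well.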
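To check whether pooling and tcp_tw_reuse are actually helping, I find it more useful to break TIME_WAIT down by remote endpoint than to count lines. A minimal sketch, assuming the six-column `netstat -ant` layout (foreign address in column 5, state in column 6). `time_wait_by_dest` is a made-up name.

```shell
#!/usr/bin/env bash
# time_wait_by_dest: hypothetical helper. Reads `netstat -ant` output on stdin
# and prints "<count> <remote ip:port>" for sockets in TIME_WAIT, largest first.
time_wait_by_dest() {
    awk '$6 == "TIME_WAIT" { n[$5]++ }
         END { for (d in n) print n[d], d }' |
        sort -k1,1nr -k2
}
```

Run it as `netstat -ant | time_wait_by_dest | head`. If one backend endpoint still dominates the list after the change, connections to it are probably not being returned to the pool.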
#After the fix
Traffic rates stayed the same, but CPU dropped from 100% to normal levels. The strace profile shifted from 99% failed connects to actual useful work - epoll_wait, recvfrom, write. The workers were finally doing their job instead of spinning on failed connection attempts.
#The debugging flow
When OpenResty/nginx is pegging CPU under load:
- Check network baseline - netstat -s | grep -i syn for drops
- Check listen backlog - ss -lnt to see if queue is full
- Measure traffic - sar -n TCP 1 10 for connection rates
- Find the bottleneck - strace -c -p <worker_pid> to see where time goes
- Follow the syscalls - failed connects point to outbound issues, not inbound
The symptom was “can’t accept connections” but the cause was “can’t make outbound connections to Redis.” Without strace, I would’ve kept tuning the wrong thing.