The Incident

At 01:45 AM, netdata started screaming about CPU hitting 100%. All OpenResty workers were pegged. The nginx error logs showed Redis connection timeouts:

[error] *142191 [lua] access_by_lua(nginx.conf:180):44: Failed to connect to Redis for counter: timeout

Here’s how I traced it down.

Establishing the network baseline

First thing - check if we’re getting hammered at the network level:

netstat -ant | grep SYN_RECV | wc -l
496

netstat -s | grep -i syn
    16864284 SYN cookies sent
    24096291 SYN cookies received
    7468944 invalid SYN cookies received
    32381 resets received for embryonic SYN_RECV sockets
    149466373 SYNs to LISTEN sockets dropped
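
These counters are cumulative since boot, so a single reading cannot show whether SYNs are being dropped right now. A minimal sketch for turning two readings into a rate, assuming the `netstat -s` text format shown above; `listen_drops` and `drop_rate` are hypothetical helper names, not part of any tool:

```shell
#!/usr/bin/env bash
# Hypothetical helpers; they only parse text in the format shown above.

# Read `netstat -s` output on stdin and print the
# "SYNs to LISTEN sockets dropped" counter (0 if the line is absent).
listen_drops() {
    awk '/SYNs to LISTEN sockets dropped/ { print $1; found = 1 }
         END { if (!found) print 0 }'
}

# drop_rate BEFORE AFTER SECONDS -> whole drops per second between two readings.
drop_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# Example usage on a live box (needs net-tools):
#   before=$(netstat -s | listen_drops); sleep 10
#   after=$(netstat -s | listen_drops)
#   drop_rate "$before" "$after" 10
```

A rate near zero would mean the big number is history; a rate in the thousands means the listen queue is overflowing as you watch.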

149 million SYNs dropped and 496 connections stuck in SYN_RECV. That’s bad. Check what the system limits are:

sysctl net.ipv4.tcp_max_syn_backlog
net.ipv4.tcp_max_syn_backlog = 3240000

sysctl net.core.somaxconn
net.core.somaxconn = 10000000

System limits are fine. But what about nginx?

ss -lnt 'sport = :80'
State    Recv-Q    Send-Q    Local Address:Port    Peer Address:Port
LISTEN   512       511       0.0.0.0:80             0.0.0.0:*

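There it is. On a listening socket, Send-Q is the accept queue limit and Recv-Q is how many connections are currently waiting in it. Send-Q shows 511 - the default nginx backlog on Linux - and Recv-Q shows 512, so the queue is full. With millions of incoming connections, 511 is nothing.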
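
To run the same comparison across every listener instead of one port, the check can be scripted. A minimal sketch, assuming the `ss -lnt` column layout shown above; `full_listen_queues` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Read `ss -lnt` output on stdin and print every listener whose accept
# queue (Recv-Q, field 2) has reached its limit (Send-Q, field 3).
full_listen_queues() {
    awk '$1 == "LISTEN" && $2 + 0 >= $3 + 0 {
             print $4, "queued=" $2, "backlog=" $3
         }'
}

# Example usage on a live box:
#   ss -lnt | full_listen_queues
```

Linux lets the queue hold one more entry than the configured limit, which is why 512 can show up against a backlog of 511.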

Measuring actual traffic rates

Before assuming it’s an attack, measure what’s actually happening:

sar -n TCP 1 10
Linux 6.8.0-71-generic    11/21/25    _x86_64_    (48 CPU)

             active/s passive/s    iseg/s    oseg/s
00:53:33       768.00   1027.00  37056.00  10202.00
00:53:34      2732.00   3093.00  46410.00  19514.00
00:53:35      2069.00   3122.00  48362.00  21623.00
00:53:36      1339.00   2247.00  47031.00  20671.00
...
Average:      1709.40   2802.50  47715.30  21809.50

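~2800 passive connections/sec (incoming). High but not insane for a busy ad exchange. This isn’t a DDoS - we’re just not keeping up. Note the active/s column too: this box is itself opening about 1700 outbound connections per second.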

Finding the actual bottleneck

Time for strace. Attach to a worker and see where it’s spending time:

strace -c -p $(pgrep -f 'nginx: worker' | head -1)
^C
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.57    1.883669       17604       107       107 connect
  0.08    0.001581          14       108           socket
  0.07    0.001409          24        58           close
  0.06    0.001308          15        86           write
...

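99.57% of the time is spent in connect(), at about 17.6 ms per call, and all 107 calls returned an error. One caveat: strace -c counts the EINPROGRESS that a non-blocking connect() normally returns as an error, so check the actual errno with strace -e trace=connect before concluding that every connect is failing. Either way, a connect() call should return in microseconds, not milliseconds.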

This isn’t about accepting incoming connections - it’s about outbound connections. In our case, Lua code connecting to Redis on every request.

The root cause

With 2000+ requests/sec, each one opening new connections to backend services, we’re burning through ephemeral ports. Each closed connection sits in TIME_WAIT for 60 seconds on Linux. Math:

  • 2000 connections/sec
  • 60 second TIME_WAIT
  • = 120,000 sockets in TIME_WAIT
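
The arithmetic above only becomes alarming once it is set against the number of ephemeral ports available. A rough sketch, assuming Linux's default `net.ipv4.ip_local_port_range` of 32768-60999 (the incident host's setting is not shown, so treat the range as an assumption); the function names are hypothetical:

```shell
#!/usr/bin/env bash
# time_wait_estimate RATE SECONDS -> sockets parked in TIME_WAIT at steady state.
time_wait_estimate() {
    echo $(( $1 * $2 ))
}

# port_count LOW HIGH -> number of ports in an inclusive range.
port_count() {
    echo $(( $2 - $1 + 1 ))
}

# Assumed values: 2000 conn/s and 60 s come from the text,
# 32768-60999 is the Linux default range. Check yours with:
#   sysctl net.ipv4.ip_local_port_range
need=$(time_wait_estimate 2000 60)
have=$(port_count 32768 60999)
echo "need=$need have=$have"                      # need=120000 have=28232
echo "sustainable conn/s to one destination: $(( have / 60 ))"   # 470
```

The port limit applies per destination address and port, so what matters is the connection rate to each individual backend, not the total.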

Check it:

netstat -ant | grep TIME_WAIT | wc -l
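
A single total hides which backend is eating the ports. A small sketch that groups TIME_WAIT sockets by remote address, assuming the usual `netstat -ant` column order (foreign address in field 5, state in field 6); `time_wait_by_dest` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Read `netstat -ant` output on stdin and print "COUNT REMOTE_ADDR:PORT"
# for sockets in TIME_WAIT, busiest destination first.
time_wait_by_dest() {
    awk '$6 == "TIME_WAIT" { n[$5]++ }
         END { for (d in n) print n[d], d }' | sort -rn
}

# Example usage on a live box:
#   netstat -ant | time_wait_by_dest | head
```

If one destination dominates the list, that is the connection to pool first.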

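If the box is on Linux’s default ephemeral port range (32768-60999, about 28,000 ports), that is far more TIME_WAIT sockets than there are ports to go around. The workers can’t establish new connections because we’ve exhausted the port range.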

The fixes

1. Increase the listen backlog and enable reuseport

listen 80 backlog=65535 reuseport;

The reuseport option is critical under high load - it gives each worker its own listen socket, eliminating lock contention. Without it, all workers fight over one socket.
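
One detail worth checking before relying on a large backlog: the kernel caps whatever the application requests at net.core.somaxconn. A tiny sketch of that rule; `effective_backlog` is a hypothetical helper and 4096 is just an example of a lower limit:

```shell
#!/usr/bin/env bash
# effective_backlog REQUESTED SOMAXCONN -> backlog the kernel will actually apply.
effective_backlog() {
    if [ "$1" -gt "$2" ]; then echo "$2"; else echo "$1"; fi
}

effective_backlog 65535 10000000   # 65535: somaxconn on this host is high enough
effective_backlog 65535 4096       # 4096: the request would be silently capped
```

After reloading nginx, ss -lnt 'sport = :80' should show the new limit in Send-Q.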

2. Enable keepalive to upstream services

Stop creating new connections for every request. For Lua/OpenResty Redis connections, use connection pooling:

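local redis = require "resty.redis"
local red = redis:new()
-- connect and run commands here, then return the connection to the pool instead of closing it
red:set_keepalive(10000, 100)  -- 10s max idle time, pool of 100 connections per worker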

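For HTTP upstreams, use nginx keepalive. The proxying location also needs proxy_http_version 1.1 and proxy_set_header Connection "" for upstream connections to be reused: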

upstream backend {
    server 127.0.0.1:8080;
    keepalive 100;
}

3. Tune TIME_WAIT handling

sysctl -w net.ipv4.tcp_tw_reuse=1

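This lets the kernel reuse TIME_WAIT sockets for new outbound connections. It only affects outgoing connections and depends on TCP timestamps being enabled (net.ipv4.tcp_timestamps, on by default). A setting made with sysctl -w is lost on reboot, so also add it to /etc/sysctl.conf or a file under /etc/sysctl.d/.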

After the fix

Traffic rates stayed the same, but CPU dropped from 100% to normal levels. The strace profile shifted from 99% failed connects to actual useful work - epoll_wait, recvfrom, write. The workers were finally doing their job instead of spinning on failed connection attempts.

The debugging flow

When OpenResty/nginx is pegging CPU under load:

  1. Check network baseline - netstat -s | grep -i syn for drops
  2. Check listen backlog - ss -lnt to see if queue is full
  3. Measure traffic - sar -n TCP 1 10 for connection rates
  4. Find the bottleneck - strace -c -p <worker_pid> to see where time goes
  5. Follow the syscalls - failed connects point to outbound issues, not inbound

The symptom was “can’t accept connections” but the cause was “can’t make outbound connections to Redis.” Without strace, I would’ve kept tuning the wrong thing.