Our ingestion pipeline started stalling. Disk consumption skyrocketed and drain rates dropped close to zero. Fluent Bit instances were accepting data and queueing it, but nothing was making it to S3. No errors in the logs - just silent accumulation.
Note: This was on Fluent Bit 3.2.6. The behavior might be fixed in newer releases.
#Checking drain rates
First I measured whether data was actually flowing. I sampled the backlog directory size before and after a 30-second interval:
before=$(du -sb /var/cache/fluent-bit-8880 | cut -f1)
sleep 30
after=$(du -sb /var/cache/fluent-bit-8880 | cut -f1)
change=$((after - before))
If change < 0, data is draining. If change == 0 but after > 0, the
instance might be stuck. If change > 0, the backlog is growing faster than
it’s draining.
I wrapped this into a script to check all instances at once:
PORT   BEFORE         AFTER          CHANGE         STATUS
------ -------------- -------------- -------------- --------
8880   2416640000     2416640000     0              STUCK?
8881   3768320000     3768320000     0              STUCK?
8882   2592768000     2590000000     -2768000       DRAINING
8883   2527232000     2527232000     0              STUCK?
8884   1740800000     1738000000     -2800000       DRAINING
...
Several instances with zero change and large backlogs.
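The script itself isn’t reproduced here. A minimal Python sketch of the same check might look like the following. It assumes every instance keeps its backlog under /var/cache/fluent-bit-PORT, as in the du example; the function names and the GROWING and EMPTY labels are my own additions for illustration. It sums file sizes only, so the numbers can differ slightly from du -sb, which also counts directory entries.

```python
import os
import time


def dir_size(path):
    """Total size in bytes of all regular files under path (0 if path is missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A chunk can be deleted between listing and stat; skip it.
                pass
    return total


def classify(before, after):
    """Map two backlog sizes to a status label."""
    change = after - before
    if change < 0:
        return "DRAINING"
    if change > 0:
        return "GROWING"  # hypothetical label, not in the original output
    # No change: only suspicious if there is still a backlog on disk.
    return "STUCK?" if after > 0 else "EMPTY"  # EMPTY is also hypothetical


def check_instances(ports, base="/var/cache/fluent-bit-{}", interval=30):
    """Sample each instance's backlog twice, interval seconds apart."""
    paths = {port: base.format(port) for port in ports}
    before = {port: dir_size(path) for port, path in paths.items()}
    time.sleep(interval)
    rows = []
    for port, path in paths.items():
        after = dir_size(path)
        change = after - before[port]
        rows.append((port, before[port], after, change, classify(before[port], after)))
    return rows
```

Calling check_instances([8880, 8881, 8882]) returns one (port, before, after, change, status) tuple per instance, which can then be printed as a table.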
#Confirming processing was stuck
Zero change in backlog size isn’t conclusive on its own; it could simply mean no new data is coming in. To confirm the instance was actually stuck, I queried the Fluent Bit HTTP API and checked whether the processed-record count was increasing:
curl -s "http://127.0.0.1:2030/api/v1/metrics" | \
jq '.output["s3.0"].proc_records'
I ran this twice with a 30-second gap. proc_records wasn’t increasing - the
instance wasn’t processing anything.
I wrapped this into a script that runs two consecutive samples and only marks an instance stuck if neither showed progress:
Checking for stuck instances via API (2x 30s samples)...
Sample 1/2...
Sample 2/2...
STUCK: 8880 8881 8883
...
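For reference, here is a rough Python sketch of how such a two-sample check could be written. It reads the same proc_records counter from the /api/v1/metrics endpoint as the curl command. The function names are hypothetical, and the list of API ports is left to the caller, since the mapping between instances and their HTTP API ports isn’t shown here.

```python
import json
import time
import urllib.request


def fetch_proc_records(api_port, output="s3.0", host="127.0.0.1", timeout=5):
    """Read proc_records for one output from the v1 metrics endpoint."""
    url = "http://{}:{}/api/v1/metrics".format(host, api_port)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        metrics = json.load(resp)
    return metrics["output"][output]["proc_records"]


def made_progress(readings):
    """True if the counter increased between any two consecutive readings."""
    return any(later > earlier for earlier, later in zip(readings, readings[1:]))


def find_stuck(api_ports, samples=2, interval=30, fetch=fetch_proc_records):
    """Return the ports whose counter never moved across all sample windows."""
    # One baseline reading, then one more reading per sample window.
    readings = {port: [fetch(port)] for port in api_ports}
    for _ in range(samples):
        time.sleep(interval)
        for port in api_ports:
            readings[port].append(fetch(port))
    return [port for port in api_ports if not made_progress(readings[port])]
```

Requiring every window to show no progress keeps a single quiet 30-second interval from flagging a healthy instance. The fetch parameter exists only so the logic can be exercised without a running Fluent Bit.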
#Inspecting the buffered data
Once I knew which instances were stuck, I looked at what was actually in the backlog. Fluent Bit stores chunks as msgpack. I skipped the 28-byte header and unpacked:
import msgpack

# Read the whole chunk file into memory
with open("/var/cache/fluent-bit-8880/http.0/chunk.flb", "rb") as f:
    data = f.read()

unpacker = msgpack.Unpacker(raw=False)
unpacker.feed(data[28:])  # skip the 28-byte header
for item in unpacker:
    # Each event is a two-element array; the record map is the second element
    if isinstance(item, list) and len(item) == 2:
        record = item[1]
        print(record.get('domain', '(empty)'))
The stuck chunks had records with an empty domain field. This was significant because our config uses tag_key domain to route records: the domain value becomes the tag that determines which output picks up the record. Without a valid domain, the record can’t be routed. It gets queued by the input, but no output ever claims it.
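To compare chunks rather than eyeball printed values, the same decoding can be turned into a count of records with and without a usable domain. This is a sketch under the same assumptions as the snippet above (a 28-byte header followed by two-element msgpack events); the helper names are made up for illustration, and treating whitespace-only or non-string values as missing is my own choice.

```python
def domain_is_missing(record):
    """True if the record has no usable domain value."""
    if not isinstance(record, dict):
        return True
    value = record.get("domain")
    # Treat absent, non-string, empty, and whitespace-only values as missing.
    return not isinstance(value, str) or value.strip() == ""


def count_domains(events):
    """Return (ok, missing) counts for decoded two-element events."""
    ok = missing = 0
    for event in events:
        # Same shape check as the snippet above: two-element arrays only.
        if not (isinstance(event, list) and len(event) == 2):
            continue
        if domain_is_missing(event[1]):
            missing += 1
        else:
            ok += 1
    return ok, missing


def scan_chunk(path, header_size=28):
    """Decode one chunk file and count records with and without a domain."""
    import msgpack  # third-party; imported here so the helpers above need no extra packages

    with open(path, "rb") as f:
        data = f.read()
    unpacker = msgpack.Unpacker(raw=False)
    unpacker.feed(data[header_size:])
    return count_domains(unpacker)
```

Running scan_chunk over every file in a stuck instance’s backlog and in a healthy one makes the difference between the two easy to see at a glance.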
#Verifying by reproduction
Suspecting bad data wasn’t enough. I proved it by moving suspect chunks to a healthy instance:
mv /var/cache/fluent-bit-8880/http.0/bad-chunk.flb \
/var/cache/fluent-bit-8882/http.0/
After I restarted 8882, it stopped uploading too. That confirmed the chunks themselves were poisonous.
#The fix
I cleared the bad chunks and added defensive filtering to reject malformed records at ingestion:
[FILTER]
    Name   grep
    Match  *
    Regex  domain ^[a-zA-Z0-9.-]+$
Better to drop malformed records explicitly than let them silently poison your workers.
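Before rolling out a filter like this, it can help to check which sample values the pattern accepts. The sketch below approximates the rule with Python’s re module. Fluent Bit uses its own regex engine, so treat this as a quick sanity check rather than an exact reproduction of the filter; would_keep is a hypothetical helper.

```python
import re

# Same pattern as the grep filter above, compiled with Python's re module.
DOMAIN_RE = re.compile(r"^[a-zA-Z0-9.-]+$")


def would_keep(record):
    """Approximate the filter: keep only records whose domain matches."""
    value = record.get("domain")
    return isinstance(value, str) and DOMAIN_RE.match(value) is not None


samples = [
    {"domain": "example.com"},  # plain hostname
    {"domain": ""},             # empty value, the case that stalled the pipeline
    {},                         # field missing entirely
    {"domain": "bad domain"},   # contains a space
]
for record in samples:
    print(record, "keep" if would_keep(record) else "drop")
```

Only the first sample is kept; the empty, missing, and space-containing values are dropped.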
#Takeaways
Silent stalls are worse than crashes because there’s no error to alert on. When investigating stuck pipelines:
- Check drain rates to identify which instances aren’t flowing
- Confirm via the API that processing is actually stuck
- Inspect the buffered data to find what’s different about stuck chunks
- Prove your hypothesis by reproduction
- Add defensive filters for required fields at the ingestion boundary