Our ingestion pipeline started stalling. Disk consumption skyrocketed and drain rates dropped close to zero. The Fluent Bit instances were accepting data and queueing it, but nothing was making it to S3. There were no errors in the logs, just silent accumulation.

Note: This was on Fluent Bit 3.2.6. The behavior might be fixed in newer releases.

Checking drain rates

First I measured whether data was actually flowing by sampling the backlog directory size before and after a 30-second interval:

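# Apparent size of the backlog directory in bytes (GNU du)
before=$(du -sb /var/cache/fluent-bit-8880 | cut -f1)
sleep 30
after=$(du -sb /var/cache/fluent-bit-8880 | cut -f1)
# Negative means the backlog shrank during the interval
change=$((after - before))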

If change < 0, data is draining. If change == 0 but after > 0, the instance might be stuck. If change > 0, the backlog is growing faster than it’s draining.

I wrapped this into a script to check all instances at once:

PORT    BEFORE          AFTER           CHANGE          STATUS
------  --------------  --------------  --------------  --------
8880    2416640000      2416640000      0               STUCK?
8881    3768320000      3768320000      0               STUCK?
8882    2592768000      2590000000      -2768000        DRAINING
8883    2527232000      2527232000      0               STUCK?
8884    1740800000      1738000000      -2800000        DRAINING
...

Several instances showed zero change despite large backlogs.
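A minimal Python sketch of the same all-instances check, not the original script. The function names (`dir_size`, `classify`, `check_instances`) and the `GROWING` and `EMPTY` labels are made up for illustration, and it assumes every instance keeps its backlog under `/var/cache/fluent-bit-<port>` like the one above. It sums file sizes only, so the numbers can differ slightly from `du -sb`, which also counts directory entries.

```python
import os
import time


def dir_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A chunk can be deleted between listing and stat; skip it.
                pass
    return total


def classify(before, after):
    """Map two backlog sizes (bytes) to a status label."""
    change = after - before
    if change < 0:
        return "DRAINING"
    if change > 0:
        return "GROWING"  # hypothetical label, not in the table above
    # No change: only suspicious if there is still a backlog.
    return "STUCK?" if after > 0 else "EMPTY"


def check_instances(ports, interval=30, base="/var/cache/fluent-bit-{}"):
    """Sample every instance's backlog twice, interval seconds apart."""
    before = {p: dir_size(base.format(p)) for p in ports}
    time.sleep(interval)
    rows = []
    for p in ports:
        after = dir_size(base.format(p))
        rows.append((p, before[p], after, after - before[p],
                     classify(before[p], after)))
    return rows
```

Calling `check_instances([8880, 8881, 8882])` returns one `(port, before, after, change, status)` tuple per instance, which can then be printed as a table. Sampling all instances before the sleep keeps the total runtime at one interval rather than one interval per instance.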

Confirming processing was stuck

A zero drain rate isn't proof by itself; it could just mean no new data is coming in. To confirm the instance was actually stuck, I queried the Fluent Bit HTTP API and checked whether the processed record count was increasing:

curl -s "http://127.0.0.1:2030/api/v1/metrics" | \
    jq '.output["s3.0"].proc_records'

I ran this twice with a 30-second gap. proc_records wasn’t increasing - the instance wasn’t processing anything.

I wrapped this into a script that runs two consecutive samples and only marks an instance stuck if neither showed progress:

Checking for stuck instances via API (2x 30s samples)...
Sample 1/2...
Sample 2/2...

STUCK: 8880 8881 8883
...
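A rough Python sketch of that two-sample check, using only the standard library. It is an illustration rather than the script that produced the output above: `proc_records`, `made_progress`, and `find_stuck` are hypothetical names, the ports passed in are assumed to be each instance's HTTP API port (the mapping from instance to API port is not shown here), and `s3.0` is taken from the curl example.

```python
import json
import time
import urllib.request


def proc_records(api_port, output="s3.0", host="127.0.0.1", timeout=5):
    """Read proc_records for one output from the v1 metrics endpoint."""
    url = "http://{}:{}/api/v1/metrics".format(host, api_port)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        metrics = json.load(resp)
    return metrics["output"][output]["proc_records"]


def made_progress(readings):
    """True if the counter increased between any two consecutive readings."""
    return any(b > a for a, b in zip(readings, readings[1:]))


def find_stuck(api_ports, samples=2, interval=30, read=proc_records):
    """Return the ports whose counter never moved across all samples."""
    # One baseline reading, then one more reading per sample interval.
    readings = {p: [read(p)] for p in api_ports}
    for _ in range(samples):
        time.sleep(interval)
        for p in api_ports:
            readings[p].append(read(p))
    return [p for p in api_ports if not made_progress(readings[p])]
```

Requiring every sample to be flat before flagging an instance avoids false positives from a single quiet interval. The `read` parameter exists only so the logic can be exercised without a running Fluent Bit.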

Inspecting the buffered data

Once I knew which instances were stuck, I looked at what was actually in the backlog. Fluent Bit stores chunks as msgpack, so I skipped the 28-byte header and unpacked the rest:

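import msgpack  # third-party: pip install msgpack

# Read one buffered chunk from the stuck instance's backlog
with open("/var/cache/fluent-bit-8880/http.0/chunk.flb", "rb") as f:
    data = f.read()

# Skip the 28-byte header; the remainder is a stream of msgpack objects
unpacker = msgpack.Unpacker(raw=False)
unpacker.feed(data[28:])

for item in unpacker:
    # Each event is a two-element array; the record map is the second element
    if isinstance(item, list) and len(item) == 2 and isinstance(item[1], dict):
        record = item[1]
        # A missing or empty domain prints as "(empty)"
        print(record.get('domain') or '(empty)')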

The stuck chunks had records with an empty domain field. This was significant because our config uses tag_key domain to route records: the domain value becomes the tag that determines which output picks up the record. Without a valid domain, the record can’t be routed. It gets queued by the input but no output ever claims it.
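To find every affected chunk rather than inspecting files one at a time, the same decoding can be wrapped in a small scanner. This is a sketch under the same assumptions as the snippet above (28-byte header, two-element events with the record as the second element); the helper names are hypothetical, and it does not handle truncated or corrupt chunks.

```python
import glob
import os


def empty_domain_count(items, key="domain"):
    """Count decoded [header, record] pairs whose key is missing or empty."""
    count = 0
    for item in items:
        if (isinstance(item, (list, tuple)) and len(item) == 2
                and isinstance(item[1], dict)):
            if not item[1].get(key):
                count += 1
    return count


def decode_chunk(path, header_len=28):
    """Yield msgpack objects from one chunk file, skipping the header."""
    import msgpack  # third-party: pip install msgpack

    with open(path, "rb") as f:
        data = f.read()
    unpacker = msgpack.Unpacker(raw=False)
    unpacker.feed(data[header_len:])
    yield from unpacker


def suspect_chunks(chunk_dir):
    """Return (path, bad_record_count) for chunks with any empty domain."""
    suspects = []
    for path in sorted(glob.glob(os.path.join(chunk_dir, "*.flb"))):
        bad = empty_domain_count(decode_chunk(path))
        if bad:
            suspects.append((path, bad))
    return suspects
```

For example, `suspect_chunks("/var/cache/fluent-bit-8880/http.0")` would list the chunk files in that directory that contain at least one record without a usable domain.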

Verifying by reproduction

Suspecting bad data wasn’t enough, so I tested the hypothesis by moving suspect chunks to a healthy instance:

mv /var/cache/fluent-bit-8880/http.0/bad-chunk.flb \
   /var/cache/fluent-bit-8882/http.0/

After restarting 8882, it stopped uploading too. That confirmed the chunks themselves were poisonous.

The fix

I cleared the bad chunks and added defensive filtering to reject malformed records at ingestion:

[FILTER]
    Name    grep
    Match   *
    Regex   domain ^[a-zA-Z0-9.-]+$

Better to drop malformed records explicitly than let them silently poison your workers.
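Before deploying a filter like this, it helps to check which values the pattern accepts. The Python sketch below approximates the rule with the `re` module. Fluent Bit uses the Onigmo regex engine, so edge cases may differ, and treating a missing key as a drop is an assumption about the filter's behavior rather than something verified here. The `keeps` helper and the sample records are made up for illustration.

```python
import re

# Same pattern as the Regex rule in the filter above.
DOMAIN_RE = re.compile(r"^[a-zA-Z0-9.-]+$")


def keeps(record, key="domain"):
    """Approximate whether the filter would keep this record."""
    value = record.get(key)
    # Assumption: a missing or non-string value counts as no match.
    if not isinstance(value, str):
        return False
    return DOMAIN_RE.search(value) is not None


samples = [
    {"domain": "example.com"},
    {"domain": ""},
    {"domain": "bad domain"},
    {"path": "/no-domain"},
]
for rec in samples:
    print(rec, "keep" if keeps(rec) else "drop")
```

Only the first sample is kept: the empty string fails because `+` requires at least one character, the space is outside the character class, and the last record has no domain at all.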

Takeaways

Silent stalls are worse than crashes because there’s no error to alert on. When investigating stuck pipelines:

  1. Check drain rates to identify which instances aren’t flowing
  2. Confirm via the API that processing is actually stuck
  3. Inspect the buffered data to find what’s different about stuck chunks
  4. Prove your hypothesis by reproduction
  5. Add defensive filters for required fields at the ingestion boundary