Kafka Production Issues

Below is a practical, production‑focused checklist of common Kafka issues and their solutions, written from a DevOps / SRE / cloud production perspective. It is aligned with real incidents seen in Kafka clusters and with commonly documented production failure modes. [confluent.io], [klogic.io]

1. Consumer Lag (Most Common Issue)

Symptoms

  • Increasing lag in kafka-consumer-groups.sh (see the check below)
  • Delayed processing / SLA breach
  • Frequent consumer rebalances
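
Lag can be confirmed from the broker side with the consumer-groups tool; the group name and broker address below are illustrative:

    kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
      --describe --group orders-processor

The LAG column shows how far each partition's committed offset trails the log end offset.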

Root Causes

  • Consumers slower than producers
  • Too few consumers vs partitions
  • Downstream system (DB / API) slow
  • Large message size or burst traffic

Solutions

  • Scale consumers (max = number of partitions)
  • Optimize consumer logic (batch processing)
  • Increase:
    • max.poll.records
    • fetch.min.bytes
  • Reduce downstream dependency latency
  • Add partitions if required (carefully; see the command below)
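
Partitions can only be increased, and doing so changes the key-to-partition mapping for keyed topics, which is why it must be done carefully. Topic name and count are illustrative:

    kafka-topics.sh --bootstrap-server broker1:9092 \
      --alter --topic orders --partitions 12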

👉 This is almost always a consumer-side stability or performance issue, not just a Kafka issue.

Below is a structured, production-grade troubleshooting and solution guide.


Common Root Causes

🔹 1. Consumer Processing Is Slow

  • Heavy transformation logic
  • Blocking I/O (DB calls, REST calls)
  • Synchronous processing

🔹 2. Insufficient Consumers

  • Not enough partitions
  • Not enough consumer instances

🔹 3. Max Poll Misconfiguration

If processing a polled batch takes longer than max.poll.interval.ms, Kafka assumes the consumer is dead and triggers a rebalance. For example, at max.poll.records=500 and roughly one second of work per record, a single loop iteration can take ~500 s, well past the default max.poll.interval.ms of 300000 ms (5 minutes).

✅ Fixes

✔ Increase Consumer Parallelism

  • Add more partitions
  • Add more consumer instances
  • Use multi-threaded processing inside the consumer (see the sketch below)
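
A hedged sketch of the multi-threaded approach using the Java client (imports from java.util.concurrent and org.apache.kafka.clients.consumer assumed; the pool size and process() are illustrative). Offsets are committed only after the whole polled batch finishes, preserving at-least-once semantics:

    // Sketch: process each polled batch on a thread pool, then commit.
    // Assumes the consumer was created with enable.auto.commit=false.
    static void runLoop(KafkaConsumer<String, String> consumer)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        while (true) {
            ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofMillis(500));
            List<Callable<Void>> tasks = new ArrayList<>();
            for (ConsumerRecord<String, String> r : records) {
                tasks.add(() -> { process(r); return null; }); // process() is hypothetical
            }
            pool.invokeAll(tasks);   // blocks until the whole batch is done
            consumer.commitSync();   // commit only after every record succeeded
        }
    }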


✔ Tune Consumer Config

    max.poll.records=500
    max.poll.interval.ms=900000
    session.timeout.ms=15000
    heartbeat.interval.ms=5000
    fetch.min.bytes=1
    fetch.max.wait.ms=500

If processing takes a long time, increase max.poll.interval.ms.

File location: /opt/kafka/config/consumer.properties (picked up by the CLI consumer tools when passed via --consumer.config; application consumers set these properties in client code)
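
For example, the console consumer can load this file directly (topic, group, and broker are illustrative):

    kafka-console-consumer.sh --bootstrap-server broker1:9092 \
      --topic orders --group orders-processor \
      --consumer.config /opt/kafka/config/consumer.properties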




2. Broker CPU / Memory / Disk Exhaustion

Symptoms

  • High CPU (>80%)
  • JVM GC pauses
  • Disk I/O wait
  • Request timeouts 

Root Causes

  • Uneven partition distribution
  • Too many small messages
  • Inadequate disk IOPS
  • Incorrect JVM heap sizing

Solutions

  • Rebalance partitions across brokers
  • Increase:
    • num.network.threads
    • num.io.threads
  • Tune JVM heap (usually 25–30% RAM)
  • Use SSD / high‑IOPS disks
  • Enable compression (lz4 / zstd)
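
A minimal sketch of these broker-side settings, assuming an Apache Kafka install under /opt/kafka (values are illustrative, not prescriptive):

    # /opt/kafka/config/server.properties
    num.network.threads=8
    num.io.threads=16

    # Compression is usually set on the producer, not the broker:
    # compression.type=lz4   (or zstd)

    # JVM heap is set via the environment before starting the broker,
    # e.g. roughly 25-30% of RAM on a 24 GB host:
    # export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"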

3. Under‑Replicated / Offline Partitions

Symptoms

  • UnderReplicatedPartitions > 0
  • Producers blocked (acks=all)
  • Partition leaders unavailable

Root Causes

  • Broker failure
  • Network partition
  • ISR shrink due to slow replicas

Solutions

  • Fix broker connectivity
  • Increase disk/network capacity
  • Ensure:
    • replication.factor ≥ 3
    • min.insync.replicas = 2
  • Keep:
    • unclean.leader.election.enable=false
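
For example, a new topic with these guarantees could be created like this (broker address and topic name are illustrative):

    kafka-topics.sh --bootstrap-server broker1:9092 \
      --create --topic orders \
      --partitions 6 --replication-factor 3 \
      --config min.insync.replicas=2

unclean.leader.election.enable=false is the broker-wide default and can be left as-is in server.properties.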

4. Message Loss / Duplicate Messages

Symptoms

  • Missing events
  • Duplicate processing
  • Inconsistent offsets

Root Causes

  • Producer acks=1 or acks=0
  • Retries without idempotence
  • Consumer auto‑commit issues

Solutions

Producer

  • acks=all
  • enable.idempotence=true
  • Proper retry configuration
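
An illustrative producer.properties for idempotent, at-least-once delivery (values are a starting point, not a prescription):

    acks=all
    enable.idempotence=true
    retries=2147483647                        # effectively infinite; bounded by delivery.timeout.ms
    max.in.flight.requests.per.connection=5   # the maximum allowed with idempotence
    delivery.timeout.ms=120000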

Consumer

  • Disable auto commit (enable.auto.commit=false)
  • Commit offsets only after successful processing (see the sketch below)
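
A minimal sketch with the Java client (broker, group, and topic names are illustrative):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class SafeConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("group.id", "orders-processor");
            props.put("enable.auto.commit", "false");   // no auto commit
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders"));
                while (true) {
                    ConsumerRecords<String, String> records =
                            consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record);       // hypothetical business logic
                    }
                    consumer.commitSync();     // commit only after the batch succeeded
                }
            }
        }

        private static void process(ConsumerRecord<String, String> record) {
            System.out.println(record.value()); // placeholder for real work
        }
    }

If the process crashes before commitSync(), the batch is re-read on restart, so duplicates are possible but loss is not (at-least-once).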

5. Frequent Consumer Rebalancing

Symptoms

  • Rebalances every few minutes
  • Consumers constantly restarting

Root Causes

  • Long processing time
  • max.poll.interval.ms exceeded
  • Unstable consumer pods / VMs

Solutions

  • Increase max.poll.interval.ms
  • Reduce processing time per poll
  • Use cooperative rebalancing (see the config below)
  • Avoid frequent consumer restarts
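
Cooperative rebalancing and static membership are both consumer-side settings; illustrative values:

    partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
    group.instance.id=orders-consumer-1   # static membership: a restart within
                                          # session.timeout.ms does not trigger a rebalance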

6. Disk Full / Retention Misconfiguration

Symptoms

  • Brokers crash
  • No new messages accepted
  • Log directory errors

Root Causes

  • Retention too high
  • Unexpected traffic spikes
  • No disk alerts

Solutions

  • Set:
    • log.retention.hours
    • log.retention.bytes
  • Enable log cleanup (delete or compact)
  • Add disk monitoring & alerts
  • Separate Kafka data disks
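
Illustrative retention settings (server.properties defaults, or per-topic overrides):

    log.retention.hours=72
    log.retention.bytes=107374182400   # ~100 GB per partition, as a size backstop
    log.cleanup.policy=delete          # or "compact" for changelog-style topics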
