Kafka Production Issues

Below is a practical, production‑focused checklist of common Kafka issues and their solutions, written from a DevOps / SRE / cloud production perspective. It is aligned with real incidents seen in Kafka clusters and with commonly documented production failure modes. [confluent.io], [klogic.io]

1. Consumer Lag (Most Common Issue)

Symptoms

  • Increasing lag in kafka-consumer-groups.sh (see the check below)
  • Delayed processing / SLA breach
  • Frequent consumer rebalances
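
Lag can be confirmed from the broker side with the consumer-groups tool; the group name and broker address below are illustrative:

    kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
      --describe --group orders-processor

The LAG column shows how far each partition's committed offset trails the log end offset.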

Root Causes

  • Consumers slower than producers
  • Too few consumers vs partitions
  • Downstream system (DB / API) slow
  • Large message size or burst traffic

Solutions

  • Scale consumers (max = number of partitions)
  • Optimize consumer logic (batch processing)
  • Increase:
    • max.poll.records
    • fetch.min.bytes
  • Reduce downstream dependency latency
  • Add partitions if required (carefully; see the command below)
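
Partitions can only be increased, and doing so changes the key-to-partition mapping for keyed topics, which is why it must be done carefully. Topic name and count are illustrative:

    kafka-topics.sh --bootstrap-server broker1:9092 \
      --alter --topic orders --partitions 12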

👉 This is almost always a consumer-side stability or performance issue, not just a Kafka issue.

Below is a structured, production-grade troubleshooting and solution guide.


Common Root Causes

🔹 1. Consumer Processing Is Slow

  • Heavy transformation logic
  • Blocking I/O (DB calls, REST calls)
  • Synchronous processing

🔹 2. Insufficient Consumers

  • Not enough partitions
  • Not enough consumer instances

🔹 3. Max Poll Misconfiguration

If processing a polled batch takes longer than max.poll.interval.ms, Kafka assumes the consumer is dead and triggers a rebalance. For example, at max.poll.records=500 and roughly one second of work per record, a single loop iteration can take ~500 s, well past the default max.poll.interval.ms of 300000 ms (5 minutes).

✅ Fixes

✔ Increase Consumer Parallelism

  • Add more partitions
  • Add more consumer instances
  • Use multi-threaded processing inside the consumer (see the sketch below)
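
A hedged sketch of the multi-threaded approach using the Java client (imports from java.util.concurrent and org.apache.kafka.clients.consumer assumed; the pool size and process() are illustrative). Offsets are committed only after the whole polled batch finishes, preserving at-least-once semantics:

    // Sketch: process each polled batch on a thread pool, then commit.
    // Assumes the consumer was created with enable.auto.commit=false.
    static void runLoop(KafkaConsumer<String, String> consumer)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        while (true) {
            ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofMillis(500));
            List<Callable<Void>> tasks = new ArrayList<>();
            for (ConsumerRecord<String, String> r : records) {
                tasks.add(() -> { process(r); return null; }); // process() is hypothetical
            }
            pool.invokeAll(tasks);   // blocks until the whole batch is done
            consumer.commitSync();   // commit only after every record succeeded
        }
    }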


✔ Tune Consumer Config

    max.poll.records=500
    max.poll.interval.ms=900000
    session.timeout.ms=15000
    heartbeat.interval.ms=5000
    fetch.min.bytes=1
    fetch.max.wait.ms=500

If processing takes a long time, increase max.poll.interval.ms.

File location: /opt/kafka/config/consumer.properties (picked up by the CLI consumer tools when passed via --consumer.config; application consumers set these properties in client code)
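
For example, the console consumer can load this file directly (topic, group, and broker are illustrative):

    kafka-console-consumer.sh --bootstrap-server broker1:9092 \
      --topic orders --group orders-processor \
      --consumer.config /opt/kafka/config/consumer.properties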




2. Broker CPU / Memory / Disk Exhaustion

Symptoms

  • High CPU (>80%)
  • JVM GC pauses
  • Disk I/O wait
  • Request timeouts 

Root Causes

  • Uneven partition distribution
  • Too many small messages
  • Inadequate disk IOPS
  • Incorrect JVM heap sizing

Solutions

  • Rebalance partitions across brokers
  • Increase:
    • num.network.threads
    • num.io.threads
  • Tune JVM heap (usually 25–30% RAM)
  • Use SSD / high‑IOPS disks
  • Enable compression (lz4 / zstd)
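
A minimal sketch of these broker-side settings, assuming an Apache Kafka install under /opt/kafka (values are illustrative, not prescriptive):

    # /opt/kafka/config/server.properties
    num.network.threads=8
    num.io.threads=16

    # Compression is usually set on the producer, not the broker:
    # compression.type=lz4   (or zstd)

    # JVM heap is set via the environment before starting the broker,
    # e.g. roughly 25-30% of RAM on a 24 GB host:
    # export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"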

3. Under‑Replicated / Offline Partitions

Symptoms

  • UnderReplicatedPartitions > 0
  • Producers blocked (acks=all)
  • Partition leaders unavailable

Root Causes

  • Broker failure
  • Network partition
  • ISR shrink due to slow replicas

Solutions

  • Fix broker connectivity
  • Increase disk/network capacity
  • Ensure:
    • replication.factor ≥ 3
    • min.insync.replicas = 2
  • Keep:
    • unclean.leader.election.enable=false
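
For example, a new topic with these guarantees could be created like this (broker address and topic name are illustrative):

    kafka-topics.sh --bootstrap-server broker1:9092 \
      --create --topic orders \
      --partitions 6 --replication-factor 3 \
      --config min.insync.replicas=2

unclean.leader.election.enable=false is the broker-wide default and can be left as-is in server.properties.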

4. Message Loss / Duplicate Messages

Symptoms

  • Missing events
  • Duplicate processing
  • Inconsistent offsets

Root Causes

  • Producer acks=1 or acks=0
  • Retries without idempotence
  • Consumer auto‑commit issues

Solutions

Producer

  • acks=all
  • enable.idempotence=true
  • Proper retry configuration
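
An illustrative producer.properties for idempotent, at-least-once delivery (values are a starting point, not a prescription):

    acks=all
    enable.idempotence=true
    retries=2147483647                        # effectively infinite; bounded by delivery.timeout.ms
    max.in.flight.requests.per.connection=5   # the maximum allowed with idempotence
    delivery.timeout.ms=120000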

Consumer

  • Disable auto commit (enable.auto.commit=false)
  • Commit offsets only after successful processing (see the sketch below)
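
A minimal sketch with the Java client (broker, group, and topic names are illustrative):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class SafeConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("group.id", "orders-processor");
            props.put("enable.auto.commit", "false");   // no auto commit
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders"));
                while (true) {
                    ConsumerRecords<String, String> records =
                            consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record);       // hypothetical business logic
                    }
                    consumer.commitSync();     // commit only after the batch succeeded
                }
            }
        }

        private static void process(ConsumerRecord<String, String> record) {
            System.out.println(record.value()); // placeholder for real work
        }
    }

If the process crashes before commitSync(), the batch is re-read on restart, so duplicates are possible but loss is not (at-least-once).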

5. Frequent Consumer Rebalancing

Symptoms

  • Rebalances every few minutes
  • Consumers constantly restarting

Root Causes

  • Long processing time
  • max.poll.interval.ms exceeded
  • Unstable consumer pods / VMs

Solutions

  • Increase max.poll.interval.ms
  • Reduce processing time per poll
  • Use cooperative rebalancing (see the config below)
  • Avoid frequent consumer restarts
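
Cooperative rebalancing and static membership are both consumer-side settings; illustrative values:

    partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
    group.instance.id=orders-consumer-1   # static membership: a restart within
                                          # session.timeout.ms does not trigger a rebalance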

6. Disk Full / Retention Misconfiguration

Symptoms

  • Brokers crash
  • No new messages accepted
  • Log directory errors

Root Causes

  • Retention too high
  • Unexpected traffic spikes
  • No disk alerts

Solutions

  • Set:
    • log.retention.hours
    • log.retention.bytes
  • Enable log cleanup (delete or compact)
  • Add disk monitoring & alerts
  • Separate Kafka data disks
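
Illustrative retention settings (server.properties defaults, or per-topic overrides):

    log.retention.hours=72
    log.retention.bytes=107374182400   # ~100 GB per partition, as a size backstop
    log.cleanup.policy=delete          # or "compact" for changelog-style topics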
