Kafka Production Issues
1. Consumer Lag (Most Common Issue)
Symptoms
- Increasing lag in kafka-consumer-groups.sh
- Delayed processing / SLA breach
- Frequent consumer rebalances
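As a rough illustration of what the lag figure means: for each partition, lag is the log end offset minus the group's committed offset. The sketch below uses made-up offsets and hypothetical helper names (partition_lag, total_lag); it does not talk to a cluster.

```python
def partition_lag(log_end_offset: int, committed_offset: int) -> int:
    """Lag for one partition: messages written but not yet committed by the group."""
    return max(log_end_offset - committed_offset, 0)


def total_lag(end_offsets: dict, committed: dict) -> int:
    """Sum the lag over every partition of a topic."""
    return sum(
        partition_lag(end, committed[partition])
        for partition, end in end_offsets.items()
    )


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1400, 1: 1200, 2: 400}
print(total_lag(end_offsets, committed))  # 600
```

A total that keeps growing between two samples is the signal to act on; a large but stable number usually just reflects a past burst.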
Root Causes
- Consumers slower than producers
- Too few consumers vs partitions
- Downstream system (DB / API) slow
- Large message size or burst traffic
Solutions
- Scale consumers (within one group, at most one active consumer per partition)
- Optimize consumer logic (batch processing)
- Increase max.poll.records and fetch.min.bytes
- Reduce downstream dependency latency
- Add partitions if required (carefully)
👉 This is almost always a consumer-side stability or performance issue, not just a Kafka issue.
Below is a structured production-grade troubleshooting + solution guide.
Common Root Causes
🔹 1. Consumer Processing Is Slow
- Heavy transformation logic
- Blocking I/O (DB calls, REST calls)
- Synchronous processing
🔹 2. Insufficient Consumers
- Not enough partitions
- Not enough consumer instances
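To make the partition limit concrete, here is a small sizing sketch. The rates are invented and consumers_needed is a hypothetical helper; it assumes every consumer processes at roughly the same speed and that load is spread evenly over partitions.

```python
import math


def consumers_needed(produce_rate: float, per_consumer_rate: float, partitions: int):
    """Return (wanted, usable): consumers required to keep up with producers,
    and how many of them can actually be assigned a partition in one group."""
    wanted = math.ceil(produce_rate / per_consumer_rate)
    usable = min(wanted, partitions)
    return wanted, usable


# Hypothetical numbers: 5000 msg/s produced, 800 msg/s per consumer, 6 partitions.
wanted, usable = consumers_needed(produce_rate=5000, per_consumer_rate=800, partitions=6)
if wanted > usable:
    print(f"need {wanted} consumers but only {usable} can be active; add partitions")
```

When wanted exceeds usable, extra consumer instances would sit idle, so the topic needs more partitions before scaling out helps.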
🔹 3. Max Poll Misconfiguration
If the time between two poll() calls exceeds max.poll.interval.ms, Kafka considers the consumer failed, removes it from the group, and triggers a rebalance.
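A quick way to check whether a consumer is at risk is to compare the worst-case time for one batch against the interval. This is a back-of-the-envelope sketch: poll_budget_ok and the 50% headroom are assumptions for illustration, and it assumes records are processed one at a time.

```python
def poll_budget_ok(max_poll_records, per_record_ms, max_poll_interval_ms, headroom=0.5):
    """True if a full batch, processed one record at a time, finishes within
    a fraction (headroom) of max.poll.interval.ms."""
    worst_case_ms = max_poll_records * per_record_ms
    return worst_case_ms <= max_poll_interval_ms * headroom


# 500 records at 200 ms each = 100 s, inside half of a 300 s interval.
print(poll_budget_ok(500, 200, 300_000))   # True
# 500 records at 1 s each = 500 s, which exceeds the interval outright.
print(poll_budget_ok(500, 1000, 300_000))  # False
```

If the check fails, either lower max.poll.records, raise max.poll.interval.ms, or make per-record processing faster.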
✅ Fixes
✔ Increase Consumer Parallelism
- Add more partitions
- Add more consumer instances
- Use multi-threaded processing inside consumer
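One possible shape for multi-threaded processing is to hand each polled batch to a thread pool and wait for the whole batch before committing. The sketch below uses only the Python standard library; process_batch and the handler are hypothetical, and the polling and commit calls of the actual client are left out.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(records, handler, workers=4):
    """Run handler on every record in parallel and return results in input order.
    Raises if any handler call fails, so the caller can skip the offset commit."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handler, records))


# Hypothetical handler standing in for a DB or REST call.
results = process_batch([1, 2, 3], lambda value: value * 2)
print(results)  # [2, 4, 6]
```

Note the trade-off: records from the same partition may finish out of order, so this fits workloads where per-key ordering does not matter or is enforced downstream.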
✔ Tune Consumer Config
max.poll.records=500
max.poll.interval.ms=900000
session.timeout.ms=15000
heartbeat.interval.ms=5000
fetch.min.bytes=1
fetch.max.wait.ms=500
If processing takes a long time, increase max.poll.interval.ms.
File location: /opt/kafka/config/consumer.properties
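The values above can be sanity-checked before deployment. The sketch below mirrors them in a plain dictionary and applies one rule from the Kafka documentation: heartbeat.interval.ms should typically be no higher than one third of session.timeout.ms. The helper name config_warnings is hypothetical, and the dictionary is not passed to any client here.

```python
consumer_config = {
    "max.poll.records": 500,
    "max.poll.interval.ms": 900000,
    "session.timeout.ms": 15000,
    "heartbeat.interval.ms": 5000,
    "fetch.min.bytes": 1,
    "fetch.max.wait.ms": 500,
}


def config_warnings(cfg):
    """Return human-readable warnings for risky timeout combinations."""
    warnings = []
    if cfg["heartbeat.interval.ms"] * 3 > cfg["session.timeout.ms"]:
        warnings.append(
            "heartbeat.interval.ms should be at most 1/3 of session.timeout.ms"
        )
    return warnings


print(config_warnings(consumer_config))  # []
```

With the values shown, the heartbeat interval is exactly one third of the session timeout, so no warning is raised.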
2. Broker CPU / Memory / Disk Exhaustion
Symptoms
- High CPU (>80%)
- JVM GC pauses
- Disk I/O wait
- Request timeouts
Root Causes
- Uneven partition distribution
- Too many small messages
- Inadequate disk IOPS
- Incorrect JVM heap sizing
Solutions
- Rebalance partitions across brokers
- Increase num.network.threads and num.io.threads
- Tune JVM heap (usually 25–30% RAM)
- Use SSD / high‑IOPS disks
- Enable compression (lz4 / zstd)
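Why many small messages hurt, and why compression helps most when messages are batched, can be seen with a toy measurement. Kafka producers compress whole record batches rather than single records. The sketch below uses zlib from the Python standard library as a stand-in for lz4 / zstd, and the events are invented, so the absolute numbers are only illustrative.

```python
import json
import zlib

# Hypothetical small events with a repetitive shape, as event streams often have.
messages = [
    json.dumps({"event": "page_view", "user_id": i, "path": "/home"}).encode()
    for i in range(1000)
]

raw_bytes = sum(len(m) for m in messages)
# Compressing each message alone pays the codec overhead 1000 times.
individually = sum(len(zlib.compress(m)) for m in messages)
# Compressing the whole batch lets the codec exploit repetition across messages.
batched = len(zlib.compress(b"".join(messages)))

print(raw_bytes, individually, batched)
```

On the producer side, larger batches (for example via linger.ms and batch.size) are what give the codec enough data to work with.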
3. Under‑Replicated / Offline Partitions
Symptoms
- UnderReplicatedPartitions > 0
- Producers blocked (acks=all)
- Partition leaders unavailable
Root Causes
- Broker failure
- Network partition
- ISR shrink due to slow replicas
Solutions
- Fix broker connectivity
- Increase disk/network capacity
- Ensure replication.factor ≥ 3 and min.insync.replicas = 2
- Keep unclean.leader.election.enable=false
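The reasoning behind replication.factor ≥ 3 with min.insync.replicas = 2 can be written down in a few lines: with acks=all, a partition accepts writes only while its in-sync replica set has at least min.insync.replicas members. The helper names below are hypothetical and the sketch models only this one rule.

```python
def accepts_acks_all_writes(isr_size: int, min_insync_replicas: int) -> bool:
    """A partition accepts acks=all writes only while the ISR is large enough."""
    return isr_size >= min_insync_replicas


def tolerated_replica_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """How many replicas can drop out before acks=all producers are rejected."""
    return replication_factor - min_insync_replicas


print(tolerated_replica_failures(3, 2))  # 1
print(accepts_acks_all_writes(isr_size=1, min_insync_replicas=2))  # False
```

With a replication factor of 2 and the same min.insync.replicas, a single slow or failed replica already blocks acks=all producers, which is why 3 is the usual minimum.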
4. Message Loss / Duplicate Messages
Symptoms
- Missing events
- Duplicate processing
- Inconsistent offsets
Root Causes
- Producer acks=1 or acks=0
- Retries without idempotence
- Consumer auto-commit issues
Solutions
Producer
- acks=all
- enable.idempotence=true
- Proper retry configuration
Consumer
- Disable auto commit
- Commit offsets after successful processing
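The consumer-side advice can be sketched as a loop body that commits only after the handler returns. The consumer object is assumed to expose poll(timeout) and commit(message=..., asynchronous=False) in the style of the confluent-kafka Python client, created with enable.auto.commit disabled; consume_once and handler are hypothetical names, and no real client is imported here.

```python
def consume_once(consumer, handler, timeout=1.0):
    """Poll one message, process it, and commit only if processing succeeded.
    Returns True if a message was processed and its offset committed."""
    msg = consumer.poll(timeout)
    if msg is None or msg.error():
        # Nothing to process, or a client-level error event: commit nothing.
        return False
    handler(msg.value())  # an exception here skips the commit below
    consumer.commit(message=msg, asynchronous=False)
    return True
```

This gives at-least-once delivery: if the process dies after handling a message but before the commit, the message is redelivered, so the handler should tolerate duplicates.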
5. Frequent Consumer Rebalancing
Symptoms
- Rebalances every few minutes
- Consumers constantly restarting
Root Causes
- Long processing time
- max.poll.interval.ms exceeded
- Unstable consumer pods / VMs
Solutions
- Increase max.poll.interval.ms
- Reduce processing time per poll
- Use cooperative rebalancing
- Avoid frequent consumer restarts
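As one example of what these settings might look like, the dictionary below uses the property names of librdkafka-based clients such as confluent-kafka for Python, where the cooperative strategy is called cooperative-sticky. The Java client expresses the same choice with the class org.apache.kafka.clients.consumer.CooperativeStickyAssignor. The timeout values are reused from the consumer lag section and should be treated as starting points, not recommendations.

```python
# Settings in the style of the confluent-kafka Python client (librdkafka names).
rebalance_friendly_config = {
    # Incremental rebalancing: only the partitions that move are revoked.
    "partition.assignment.strategy": "cooperative-sticky",
    # Allow long batches to finish before the consumer is considered failed.
    "max.poll.interval.ms": 900000,
    "session.timeout.ms": 15000,
}
```

All members of a group need to use a compatible assignment strategy, so a change like this is usually rolled out across every consumer instance of the group.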
6. Disk Full / Retention Misconfiguration
Symptoms
- Brokers crash
- No new messages accepted
- Log directory errors
Root Causes
- Retention too high
- Unexpected traffic spikes
- No disk alerts
Solutions
- Set log.retention.hours and log.retention.bytes
- Enable log cleanup (delete or compact)
- Add disk monitoring & alerts
- Separate Kafka data disks
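Before choosing retention values, it helps to estimate how much disk they imply. The sketch below is a rough capacity model with invented traffic numbers; disk_per_broker_gb is a hypothetical helper that assumes partitions are spread evenly across brokers and ignores compression and index overhead.

```python
def disk_per_broker_gb(ingest_mb_per_s, retention_hours, replication_factor, brokers):
    """Approximate retained data per broker in GB for time-based retention."""
    total_mb = ingest_mb_per_s * retention_hours * 3600 * replication_factor
    return total_mb / brokers / 1024


# Hypothetical cluster: 10 MB/s ingest, 168 h retention, 3 replicas, 6 brokers.
estimate = disk_per_broker_gb(10, 168, 3, 6)
print(f"{estimate:.0f} GB per broker")
```

Keep in mind that log.retention.bytes applies per partition, not per broker, so a byte-based cap has to be multiplied by the number of partition replicas a broker hosts.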