Kafka Consumer Groups in Production: Lag, Rebalancing, and the Mistakes That Hurt

Most Kafka problems in production are not Kafka problems. They are consumer problems nobody thought about at design time.

You added Kafka to your architecture. Messages are flowing. Everything looks fine.

Then traffic spikes. Or you deploy a new version. Or a single consumer slows down. And suddenly your lag graph looks like a mountain range and your on-call rotation gets a lot less boring.

Here’s the principle that stops that from happening.

Consumer lag is a symptom, not a metric

Consumer lag is the number of messages in a partition that have been produced but not yet consumed. High lag means your consumers are falling behind producers.

Most teams treat lag as an alerting threshold. “Alert if lag > 10,000.” That’s better than nothing, but it misses the point. Lag tells you that something is wrong. It does not tell you why.

Lag grows for three reasons: the consumer is slow, the consumer is stuck, or the consumer stopped altogether. Each has a different fix. Treating them all as “lag is high, scale up consumers” is how you end up with more consumers that are equally slow.

The useful companion metric is consumer lag rate: how fast lag is growing, not just its current value. Lag at 50,000 and stable is a different problem from lag at 2,000 and doubling every minute.

Key detail to remember: Monitor the rate of change, not just the absolute number. A consumer that falls behind during a traffic spike and recovers is healthy. A consumer that falls behind and never catches up has a problem you need to find.
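A minimal sketch of the rate calculation, assuming you already sample lag at a fixed interval from whatever exporter you use. The class name, method names, and the tolerated growth threshold are illustrative, not part of any Kafka API.

```java
final class LagTrend {
    /** Lag growth in records per second between two samples; negative means the consumer is catching up. */
    static double ratePerSecond(long lagThen, long lagNow, long elapsedSeconds) {
        if (elapsedSeconds <= 0) {
            throw new IllegalArgumentException("elapsedSeconds must be positive");
        }
        return (double) (lagNow - lagThen) / elapsedSeconds;
    }

    /** True when lag grew faster than the tolerated rate over the sampling window. */
    static boolean isFallingBehind(long lagThen, long lagNow, long elapsedSeconds,
                                   double toleratedGrowthPerSecond) {
        return ratePerSecond(lagThen, lagNow, elapsedSeconds) > toleratedGrowthPerSecond;
    }
}
```

With a tolerated growth of 5 records per second, lag that sits at 50,000 for a minute is not flagged, while lag that goes from 1,000 to 2,000 in the same minute is. Alerting on the second condition catches the doubling consumer long before an absolute threshold would.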

What rebalancing actually does, and why deployments break it

A consumer group is a set of consumers that share the work of reading from a topic. Kafka assigns partitions to consumers within the group. Within a single group, each partition is owned by exactly one consumer at any given time (different groups can read the same partition independently).

When the group membership changes (a consumer joins, leaves, or crashes), Kafka redistributes the partition assignments. That redistribution is a rebalance.

During a rebalance, consumption stops. Every consumer in the group pauses until the new assignment is complete. On a topic with moderate traffic this is a blip. On a high-throughput topic with many partitions, this is a service degradation you’ll see in your dashboards.

The part that surprises teams: rolling deployments trigger rebalances. You restart one pod, Kafka sees a consumer leave. A few seconds later the new pod starts, Kafka sees a consumer join. Two rebalances per pod in your rollout. On a 20-pod deployment, that’s 40 rebalances.

I have seen teams spend weeks tuning their Kafka setup and then undo all of it with an aggressive deployment strategy that triggers constant rebalancing.

There is a feature designed exactly for this: static membership (KIP-345). Set group.instance.id to a stable identifier that is unique per pod (the pod name of a StatefulSet replica, for example, not the StatefulSet name, which every replica shares) and Kafka treats a restart that completes within session.timeout.ms as the same member returning. No rebalance, no partition shuffle. Combine it with a graceful shutdown that commits offsets before exit and most rolling-deploy lag spikes disappear.
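A minimal sketch of that configuration, assuming each pod can read its own stable name (in a StatefulSet the pod hostname is usually that name). The class and method names are made up for illustration, and the property keys are plain strings so the snippet compiles without kafka-clients on the classpath.

```java
import java.util.Properties;

final class StaticMembershipConfig {
    /** Builds consumer properties whose group.instance.id stays the same across restarts of one pod. */
    static Properties forPod(String groupId, String podName) {
        if (podName == null || podName.isEmpty()) {
            throw new IllegalArgumentException("podName must be a stable, non-empty identifier");
        }
        Properties props = new Properties();
        props.setProperty("group.id", groupId);
        // Must be unique within the group and identical after a restart of the same pod.
        props.setProperty("group.instance.id", groupId + "-" + podName);
        // The restart has to finish inside this window for the member to be treated as returning.
        props.setProperty("session.timeout.ms", "45000");
        return props;
    }
}
```

You would typically call it as StaticMembershipConfig.forPod("orders", System.getenv("HOSTNAME")), add your bootstrap servers and deserializers, and pass the result to the KafkaConsumer constructor. The group name "orders" here is only an example.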

The configuration knobs that matter

These four properties control how quickly Kafka decides a consumer is dead and triggers a rebalance. Getting them wrong is the single most common source of unnecessary rebalancing.

session.timeout.ms: How long Kafka waits without a heartbeat before declaring a consumer dead. Default is 45 seconds since Kafka 3.0 (it was 10 seconds before that). Too low and healthy-but-busy consumers get evicted. Too high and truly dead consumers keep their partitions for too long.

heartbeat.interval.ms: How often the consumer sends a heartbeat to the group coordinator. Keep it at or below one third of session.timeout.ms so a couple of lost heartbeats do not get the consumer evicted. This is not about the consumer’s health; it’s about the broker knowing the consumer is still alive.

max.poll.interval.ms: The maximum time between two poll() calls before Kafka considers the consumer dead. This is the one teams get wrong most often. If your processing logic takes longer than this value, Kafka evicts the consumer mid-batch, you get a rebalance, and the messages get reprocessed by the next consumer. Hello, duplicates, which is why every consumer in production needs an idempotency strategy on the receiving end.

max.poll.records: How many records are returned per poll() call. Reducing this gives each poll cycle less work to do, which makes it easier to stay within max.poll.interval.ms. The trade-off is throughput.
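The duplicates that follow a max.poll.interval.ms eviction are easier to reason about with code. Here is a minimal sketch of one idempotency strategy: deduplicating on topic, partition, and offset. The class is hypothetical, and the in-memory set is a stand-in for a durable store (for example a unique-keyed table written in the same transaction as the side effect), because an in-memory set does not survive the partition moving to another consumer.

```java
import java.util.HashSet;
import java.util.Set;

final class IdempotentHandler {
    // Stand-in for a durable, shared store of already-applied records.
    private final Set<String> processed = new HashSet<>();
    private int sideEffects = 0;

    /** Returns true if the record was applied, false if it was a duplicate delivery. */
    boolean handle(String topic, int partition, long offset) {
        String key = topic + "-" + partition + "@" + offset;
        if (!processed.add(key)) {
            return false; // already applied, e.g. redelivered after a rebalance
        }
        sideEffects++; // the real work (database write, API call) would go here
        return true;
    }

    /** Number of times the side effect actually ran. */
    int sideEffects() {
        return sideEffects;
    }
}
```

Delivering the same record twice runs the side effect once. If your messages carry a business key (an order ID, say), deduplicating on that key instead also covers producer-side retries.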

A starting configuration that works for most backend services:

session.timeout.ms=45000
heartbeat.interval.ms=15000
max.poll.interval.ms=300000
max.poll.records=500

Adjust max.poll.interval.ms based on your actual processing time. Measure it before you set it.
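Once you have a measured worst-case per-record processing time, the arithmetic is simple. The sketch below is illustrative only (the class, method names, and the headroom factor are assumptions, not Kafka settings): it checks how long a full batch can take and derives a max.poll.records value that leaves room to spare.

```java
final class PollBudget {
    /** Worst-case time to process one full batch, in milliseconds. */
    static long worstCaseBatchMillis(int maxPollRecords, long worstCasePerRecordMillis) {
        return (long) maxPollRecords * worstCasePerRecordMillis;
    }

    /**
     * Largest max.poll.records that keeps a worst-case batch within the given
     * fraction (headroom, between 0 and 1) of max.poll.interval.ms.
     */
    static int safeMaxPollRecords(long maxPollIntervalMillis, long worstCasePerRecordMillis,
                                  double headroom) {
        if (worstCasePerRecordMillis <= 0 || headroom <= 0 || headroom > 1) {
            throw new IllegalArgumentException("need a positive per-record time and headroom in (0, 1]");
        }
        long budgetMillis = (long) (maxPollIntervalMillis * headroom);
        return (int) Math.max(1, budgetMillis / worstCasePerRecordMillis);
    }
}
```

With the starting configuration above, 500 records per poll and a 300000 ms interval leave a budget of 600 ms per record before the consumer is evicted. If your slowest records take a full second and you want to use only half the interval, safeMaxPollRecords(300000, 1000, 0.5) suggests 150 records per poll.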

How to monitor consumer health without drowning in metrics

You do not need 30 dashboards. You need four signals:

Lag per partition: not just total group lag. A partition with zero consumers assigned is invisible in aggregate lag until it becomes a crisis.

Rebalance rate: how often your consumer group is rebalancing. More than once per deployment window is a signal to investigate.

Poll interval duration: how long your application actually takes between poll() calls. If this is approaching max.poll.interval.ms, you are one slow database query away from a rebalance cascade.

Consumer count vs partition count: if you have more consumers than partitions, some consumers are idle. If you have fewer, some consumers are handling multiple partitions. Neither is wrong, but both affect how rebalances play out.
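For the last signal, a quick sanity check is enough. This sketch assumes a single topic and an assignor that spreads partitions as evenly as possible; the class and method names are illustrative.

```java
final class AssignmentShape {
    /** Consumers that receive no partition because the group is larger than the topic. */
    static int idleConsumers(int partitions, int consumers) {
        return Math.max(0, consumers - partitions);
    }

    /** Most partitions any single consumer handles under an even spread. */
    static int maxPartitionsPerConsumer(int partitions, int consumers) {
        if (consumers <= 0) {
            throw new IllegalArgumentException("consumers must be positive");
        }
        return (partitions + consumers - 1) / consumers; // ceiling division
    }
}
```

A 12-partition topic read by 20 consumers leaves 8 of them idle. The same topic read by 5 consumers puts up to 3 partitions on one consumer, so losing that consumer moves 3 partitions at once.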

Most teams use Kafka’s built-in kafka-consumer-groups.sh or a metrics exporter like the Kafka JMX exporter feeding into Prometheus. The specific tool matters less than having these four signals visible and alerted on.

If you are exporting JMX, the metrics that map to the four signals above are:

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*,topic=*,partition=* → records-lag (per-partition lag)
kafka.consumer:type=consumer-coordinator-metrics,client-id=* → rebalance-rate-per-hour and last-rebalance-seconds-ago
kafka.consumer:type=consumer-metrics,client-id=* → time-between-poll-avg and time-between-poll-max (compare against max.poll.interval.ms)
kafka.consumer:type=consumer-coordinator-metrics,client-id=* → assigned-partitions (compare with topic partition count)

The rebalancing strategy that stopped unnecessary disruptions

The default rebalancing strategy in Kafka is eager rebalancing: on any membership change, all consumers drop their partitions, wait for the new assignment, and resume. Full stop.

Since Kafka 2.4, there is an alternative: cooperative sticky rebalancing (also called incremental cooperative rebalancing). Instead of dropping everything, only the partitions that need to move actually move. Consumers that keep their assignments never stop.

To enable it:

partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor

The impact on rolling deployments is significant. Instead of stopping all consumers during each pod restart, only the partitions being redistributed pause. The rest of the group keeps processing.

One caveat: if you are migrating from the default RangeAssignor, do a two-step rollout. Run with both strategies listed during the transition, then remove the old one after all consumers have restarted. Skipping this step causes a full rebalance anyway.
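The two-step rollout comes down to three values of partition.assignment.strategy over time. A small sketch, with a hypothetical helper class and phase names, using the assignor class names as plain strings:

```java
final class AssignorRollout {
    static final String RANGE = "org.apache.kafka.clients.consumer.RangeAssignor";
    static final String COOPERATIVE = "org.apache.kafka.clients.consumer.CooperativeStickyAssignor";

    /** Where the consumer fleet is in the migration. */
    enum Phase { BEFORE, TRANSITION, AFTER }

    /** Value of partition.assignment.strategy to deploy in each phase. */
    static String strategyFor(Phase phase) {
        switch (phase) {
            case BEFORE:
                return RANGE;
            case TRANSITION:
                // First rolling restart: every consumer lists both assignors.
                return COOPERATIVE + "," + RANGE;
            default:
                // Second rolling restart: the old assignor is removed.
                return COOPERATIVE;
        }
    }
}
```

Deploy the TRANSITION value, wait until every consumer in the group has restarted with it, then deploy the AFTER value.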

Looking ahead, KIP-848 introduces a new consumer rebalance protocol that moves the assignment work to the broker side and removes the eager/cooperative distinction altogether. It is generally available as of Kafka 4.0, where consumers opt in with group.protocol=consumer, and it will eventually become the default. Worth tracking if you are on Kafka 4.x.

Shutting down without triggering a rebalance you didn’t need

The cheapest mitigation for deploy-time rebalances is also the most overlooked: shut the consumer down on purpose, instead of letting the JVM disappear.

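import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

public class GracefulConsumer {
    private final KafkaConsumer<String, String> consumer;
    private final AtomicBoolean running = new AtomicBoolean(true);

    public GracefulConsumer(KafkaConsumer<String, String> consumer) {
        this.consumer = consumer; // already configured and subscribed by the caller
    }

    public void run() {
        final Thread pollThread = Thread.currentThread();
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            running.set(false);
            consumer.wakeup(); // the only consumer method that is safe to call from another thread
            try {
                pollThread.join(); // keep the JVM alive until close() below has finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));

        try {
            while (running.get()) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                process(records);
                consumer.commitSync(); // commit only after the batch has been processed
            }
        } catch (WakeupException ignored) {
            // expected on shutdown
        } finally {
            consumer.close(Duration.ofSeconds(30));
        }
    }

    private void process(ConsumerRecords<String, String> records) {
        // application-specific handling goes here
    }
}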

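For a regular (dynamic) member, consumer.close() sends a LeaveGroup request, so the group rebalances immediately instead of waiting out the session timeout. It also commits final offsets when auto-commit is enabled; the loop above has already committed them with commitSync(). With group.instance.id set, close() does not send LeaveGroup: the broker keeps the member’s partitions reserved until session.timeout.ms expires, giving the new pod a chance to rejoin first.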

The principle that ties it together

Consumer lag is not a Kafka problem to solve with more partitions or more consumers. It is a signal about your consumer’s ability to keep up.

Rebalancing is not a failure. It is Kafka doing its job. Unnecessary rebalancing (caused by misconfigured timeouts, slow processing, or thoughtless deployments) is entirely preventable.

Get the configuration right. Instrument the four signals. Switch to cooperative rebalancing. Then your on-call rotation goes back to being boring, which is exactly what you want.