Bug---some-users'-features-are-not-updated

Title: Some users' features are not updated while others are

Summary:

Very rarely (about once a year), some users' features are updated while other users' features are not. This can look like a bug in risk that skips events unexpectedly. The cause is that risk-features-data-service runs two threads that process events from a Kinesis data stream. When one thread stops while the other keeps working, only some of the events are processed, which produces this result.

We use two threads because the Kinesis data stream has two shards. We use two shards because Kinesis limits I/O throughput at the shard level.

Each thread in risk-features-data-service processes events from one shard, which is assigned when the service starts. When a thread stops working, none of the events in its shard are processed.
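The effect of one stalled worker can be illustrated with a minimal sketch. It assumes in-memory lists as stand-ins for the two shards; `SHARDS`, `run_workers`, and the event strings are hypothetical names, not taken from the service. The stalled worker simply returns without processing anything so that the example terminates.

```python
import threading

# Hypothetical in-memory stand-ins for the two Kinesis shards.
SHARDS = {
    "shard-0": ["user-a:login", "user-c:purchase"],
    "shard-1": ["user-b:login", "user-d:purchase"],
}

def run_workers(shards, stalled=frozenset()):
    """Start one worker thread per shard and return what each one processed."""
    processed = {shard_id: [] for shard_id in shards}

    def worker(shard_id):
        if shard_id in stalled:
            return  # simulate a worker that makes no progress on its shard
        for event in shards[shard_id]:
            processed[shard_id].append(event)

    threads = [threading.Thread(target=worker, args=(sid,)) for sid in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed

# Only shard-1's worker is stalled; shard-0 is processed normally.
result = run_workers(SHARDS, stalled={"shard-1"})
```

Here users whose events land in shard-0 are updated, while users whose events land in shard-1 are not, which matches the symptom described in the summary.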

Why the thread stops

Technically, the thread does not stop. It calls the Kinesis driver to fetch events from the Kinesis stream, and the method call never returns, so the thread looks as if it has stopped.
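The difference between a stopped thread and a blocked one can be shown with a small sketch. `fetch_events` is a hypothetical stand-in for the driver call, not the real Kinesis API; it blocks on a `threading.Event` to imitate a call that does not return.

```python
import threading
import time

release = threading.Event()   # stays unset while the call is "hung"
progress = {"fetches": 0}

def fetch_events():
    """Hypothetical stand-in for the driver fetch call; blocks until released."""
    release.wait()
    return []

def worker():
    fetch_events()
    progress["fetches"] += 1

t = threading.Thread(target=worker, daemon=True)
t.start()
time.sleep(0.2)

# The thread is still alive, yet it has completed no fetches.
alive_while_blocked = t.is_alive()
fetches_while_blocked = progress["fetches"]

# Release the call so the example terminates cleanly.
release.set()
t.join(timeout=2)
```

Because the thread is alive the whole time, checking only whether the thread exists would not detect the problem; some measure of progress is needed.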

Solution

We include heartbeats from both threads in the liveness health probe. If either thread fails to report a heartbeat for a configured period (currently 120 minutes), risk-features-data-service reports Unhealthy to Kubernetes and is restarted.
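The heartbeat check described above might look roughly like the following sketch. `HeartbeatMonitor`, its method names, and the injectable `clock` are assumptions made for illustration; only the 120-minute timeout comes from this page. How the result is exposed to the Kubernetes liveness probe is not shown.

```python
import threading
import time

class HeartbeatMonitor:
    """Tracks the last heartbeat per worker and reports overall health."""

    def __init__(self, thread_names, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._lock = threading.Lock()
        now = clock()
        # Treat startup as the first heartbeat for every worker.
        self._last_beat = {name: now for name in thread_names}

    def beat(self, name):
        """Called by a worker each time it makes progress."""
        with self._lock:
            self._last_beat[name] = self._clock()

    def is_healthy(self):
        """Healthy only if every worker has reported within the timeout."""
        now = self._clock()
        with self._lock:
            return all(now - last <= self._timeout
                       for last in self._last_beat.values())

# Usage with a fake clock (in seconds) so the timeout can be exercised instantly.
fake_now = [0.0]
monitor = HeartbeatMonitor(["shard-0", "shard-1"],
                           timeout_seconds=120 * 60,
                           clock=lambda: fake_now[0])

fake_now[0] = 60 * 60            # 1 hour in: both workers report
monitor.beat("shard-0")
monitor.beat("shard-1")
healthy_after_1h = monitor.is_healthy()

fake_now[0] = 4 * 60 * 60        # 4 hours in: only shard-0 keeps reporting
monitor.beat("shard-0")
healthy_after_stall = monitor.is_healthy()
```

In this sketch one silent worker is enough to make the whole service unhealthy, which is the behavior the solution relies on: the restart recovers the blocked thread along with the working one.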