# Technical Issue Report: RabbitMQ Connectivity Failure in Development Environment
## Issue Summary
The RabbitMQ cluster in our development environment experienced a critical issue in which it rejected incoming connections. Initial observations showed an anomalous state: two of the five pods were non-operational while the remaining three appeared active. With a majority of nodes still running, the cluster should have maintained functionality, which suggested underlying complications that required further investigation.
## Detailed Analysis
Upon inspecting the cluster pods, we found that two pods were down. In theory, the remaining three pods should have sustained cluster operations; in practice, the cluster unexpectedly began rejecting incoming connections.
Logs were examined to determine the cause of the pod failures, revealing that the pods were being evicted due to Out of Memory (OOM) errors. These pods would restart but then quickly fail again during the RabbitMQ startup process because of the same OOM issues.
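For reference, the commands below are a minimal sketch of how such OOM failures can be confirmed from the Kubernetes side. The namespace, label selector, and pod name are illustrative assumptions, not necessarily the exact values in our environment.

```bash
# List the RabbitMQ pods and their restart counts (namespace and labels are assumed).
kubectl -n rabbitmq get pods -l app.kubernetes.io/name=rabbitmq

# Describe a failing pod; a container killed for exceeding its memory limit
# reports "OOMKilled" as the reason in its last container state.
kubectl -n rabbitmq describe pod rabbitmq-1

# Fetch logs from the previous (crashed) container instance to see how far
# RabbitMQ got during startup before it was killed.
kubectl -n rabbitmq logs rabbitmq-1 --previous
```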
## Resolution Steps Taken
- **Pod Deletion and Cluster State Anomaly:**
  - We deleted one problematic pod (`rabbitmq-1`), which inadvertently resulted in a split-brain scenario within the cluster. The node `rabbitmq-1` was effectively isolated from the cluster, yet the remaining nodes still considered it part of the cluster. This division created two distinct clusters:
    - One cluster consisting solely of the isolated `rabbitmq-1` pod.
    - Another cluster composed of the three operational pods.
- **Cluster Synchronization Efforts** (see the command sketch after this list):
  - On the operational nodes:
    - Executed `rabbitmqctl cluster_status` to verify current cluster membership.
    - Performed `rabbitmqctl forget_cluster_node` to remove the stale node entry.
  - On the isolated `rabbitmq-1` pod:
    - Checked cluster status with `rabbitmqctl cluster_status`.
    - Stopped the RabbitMQ application using `rabbitmqctl stop_app`.
    - Reset the node configuration with `rabbitmqctl reset`.
    - Rejoined the node to the cluster using `rabbitmqctl join_cluster`.
    - Started the RabbitMQ application with `rabbitmqctl start_app`.
- **Replicate Queues over Multiple Nodes** (see the example after this list):
  - Existing queues do not automatically replicate to the newly joined nodes. One option is to delete a queue, which recreates it on all nodes once it is re-declared.
  - Alternatively, use `rabbitmq-queues grow` to replicate queues (all, or those matching a pattern) onto the given node:
    `rabbitmq-queues grow rabbit@minority-rabbitmq-1.minority-rabbitmq-headless.rabbitmq.svc.cluster.local all --queue-pattern batch-feature-update-event-handler`
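For completeness, the sequence below is a minimal sketch of the synchronization commands from the second step above, run via `kubectl exec`. The namespace, pod names, and Erlang node names are illustrative assumptions; the real node names in this environment include the headless-service FQDN (as in the `grow` command above) and should be taken from the `rabbitmqctl cluster_status` output.

```bash
# On a healthy pod: confirm current membership, then drop the stale entry
# for the isolated node (pod and node names below are illustrative).
kubectl -n rabbitmq exec rabbitmq-0 -- rabbitmqctl cluster_status
kubectl -n rabbitmq exec rabbitmq-0 -- rabbitmqctl forget_cluster_node rabbit@rabbitmq-1

# On the isolated pod: stop the app, wipe its local state, rejoin via a
# healthy node, restart the app, and re-check membership.
kubectl -n rabbitmq exec rabbitmq-1 -- rabbitmqctl stop_app
kubectl -n rabbitmq exec rabbitmq-1 -- rabbitmqctl reset
kubectl -n rabbitmq exec rabbitmq-1 -- rabbitmqctl join_cluster rabbit@rabbitmq-0
kubectl -n rabbitmq exec rabbitmq-1 -- rabbitmqctl start_app
kubectl -n rabbitmq exec rabbitmq-1 -- rabbitmqctl cluster_status
```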
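Similarly, here is a hedged example of the queue replication from the third step, assuming the affected queues are quorum queues and reusing the node name from the command above; the namespace, executing pod, and queue pattern are assumptions for illustration.

```bash
# Grow replicas of matching queues onto the rejoined node.
kubectl -n rabbitmq exec rabbitmq-0 -- rabbitmq-queues grow \
  rabbit@minority-rabbitmq-1.minority-rabbitmq-headless.rabbitmq.svc.cluster.local \
  all --queue-pattern "batch-feature-update-event-handler"

# Verify that the queue now has replicas on the expected nodes (quorum queues only).
kubectl -n rabbitmq exec rabbitmq-0 -- rabbitmq-queues quorum_status "batch-feature-update-event-handler"
```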
## Outstanding Concerns

- **Cluster Redundancy and Fault Tolerance:**
  - Despite having three operational pods, the cluster failed to serve connections. The root cause of this failure mode needs further analysis to ensure redundancy and fault tolerance.
- **Proper Pod Management in Kubernetes:**
  - The method used for restarting pods (`kubectl delete pod`) may not be appropriate for maintaining a stable cluster state. For future interventions, a `kubectl rollout restart` might be more suitable to ensure smoother transitions and consistent cluster performance (see the sketch below).
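As a sketch of that suggestion, assuming the brokers are managed by a StatefulSet named `rabbitmq` in the `rabbitmq` namespace (both names are assumptions), a rolling restart could be performed and monitored like this:

```bash
# Restart the brokers one pod at a time via the StatefulSet's rolling-update
# strategy, instead of deleting individual pods by hand.
kubectl -n rabbitmq rollout restart statefulset rabbitmq

# Wait until every replica is back up and Ready.
kubectl -n rabbitmq rollout status statefulset rabbitmq
```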