RabbitMQ Connectivity Failure in dev

Technical Issue Report: RabbitMQ Connectivity Failure in Development Environment

Issue Summary

The RabbitMQ cluster in our development environment began rejecting connections. Two of its five pods were down while the other three appeared healthy. Three of five nodes should have been enough to keep the cluster serving traffic, so the outage pointed to an underlying problem that required further investigation.

Detailed Analysis

On inspecting the cluster's pods, we found two of them down. In theory, the remaining three pods, which still form a majority of the five-node cluster, should have kept the cluster operational. Instead, the cluster began rejecting incoming connections.

We examined the logs to determine why the pods had failed and found that they were being evicted due to Out of Memory (OOM) errors. The pods would restart, then fail again shortly afterwards during RabbitMQ startup with the same OOM errors.
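The checks described above could be scripted roughly as follows. This is a minimal sketch rather than the exact commands used during the incident: the namespace `rabbitmq` is an assumption inferred from the node hostname that appears later in this report, and the function names are hypothetical helpers.

```shell
# Assumption: the cluster runs in the "rabbitmq" namespace (override via NAMESPACE).
NAMESPACE="${NAMESPACE:-rabbitmq}"

# Print the pod-level status reason; an evicted pod reports "Evicted" here.
pod_status_reason() {
  kubectl get pod "$1" -n "$NAMESPACE" -o jsonpath='{.status.reason}'
}

# Print why the previous instance of the first container terminated.
# A container killed for exceeding its memory limit reports "OOMKilled".
last_termination_reason() {
  kubectl get pod "$1" -n "$NAMESPACE" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
}

# Show the tail of the logs from the previous (crashed) container instance.
previous_logs() {
  kubectl logs "$1" -n "$NAMESPACE" --previous --tail=100
}
```

For example, `last_termination_reason rabbitmq-1` followed by `previous_logs rabbitmq-1` would show whether the container was killed for memory reasons and how far RabbitMQ startup got before it failed.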

Resolution Steps Taken

Pod Deletion and Cluster State Anomaly:
- We deleted one problematic pod (rabbitmq-1), which inadvertently produced a split-brain scenario. The node rabbitmq-1 was effectively isolated from the cluster, yet the remaining nodes still listed it as a cluster member. The result was two distinct clusters:
  - One cluster consisting solely of the isolated rabbitmq-1 pod.
  - Another cluster composed of the three operational pods.
Cluster Synchronization Efforts:
- On the operational nodes:
  - Executed rabbitmqctl cluster_status to verify current cluster membership.
  - Performed rabbitmqctl forget_cluster_node to remove the stale node entry.
- On the isolated rabbitmq-1 pod:
  - Checked cluster status with rabbitmqctl cluster_status.
  - Stopped the RabbitMQ application using rabbitmqctl stop_app.
  - Reset the node configuration with rabbitmqctl reset.
  - Rejoined the node to the cluster using rabbitmqctl join_cluster.
  - Started the RabbitMQ application with rabbitmqctl start_app.
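The synchronization steps above can be sketched as two small shell helpers. This is an illustration under stated assumptions, not a record of the exact invocations: it assumes the commands are run through `kubectl exec` in a `rabbitmq` namespace, and the pod and node names passed in are placeholders supplied by the operator. Note that `rabbitmqctl reset` wipes the node's local state, so it should only be run on the node being rejoined.

```shell
# Assumption: the cluster runs in the "rabbitmq" namespace (override via NAMESPACE).
NAMESPACE="${NAMESPACE:-rabbitmq}"

# Run a rabbitmqctl subcommand inside the given pod.
rmq() {
  pod="$1"; shift
  kubectl exec "$pod" -n "$NAMESPACE" -- rabbitmqctl "$@"
}

# On a healthy node: check membership, then drop the stale node entry.
forget_stale_node() {
  healthy_pod="$1"; stale_node="$2"
  rmq "$healthy_pod" cluster_status &&
  rmq "$healthy_pod" forget_cluster_node "$stale_node"
}

# On the isolated node: stop the app, reset (destroys local state),
# join the cluster via a healthy seed node, then start the app again.
rejoin_node() {
  isolated_pod="$1"; seed_node="$2"
  rmq "$isolated_pod" stop_app &&
  rmq "$isolated_pod" reset &&
  rmq "$isolated_pod" join_cluster "$seed_node" &&
  rmq "$isolated_pod" start_app
}
```

A run might look like `forget_stale_node rabbitmq-0 rabbit@rabbitmq-1` followed by `rejoin_node rabbitmq-1 rabbit@rabbitmq-0`, where both node names are hypothetical and must match what `cluster_status` reports. Chaining with `&&` stops the sequence at the first failing step so that a failed `reset` is never followed by `join_cluster`.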
Queue Replication Across Nodes:
- Existing queues are not automatically replicated to newly joined nodes. One option is to delete the affected queue and let it be recreated, which should place it on all nodes.
- Alternatively, use the following command to add replicas of all queues, or of those matching a pattern, on a given node:
rabbitmq-queues grow rabbit@minority-rabbitmq-1.minority-rabbitmq-headless.rabbitmq.svc.cluster.local all --queue-pattern batch-feature-update-event-handler
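After growing a queue, its membership can be checked. The sketch below wraps the command above and adds a verification step with `rabbitmq-queues quorum_status`; it assumes the queues are quorum queues in the default virtual host, that the commands are run through `kubectl exec` in the `rabbitmq` namespace, and the function names are hypothetical.

```shell
# Assumption: the cluster runs in the "rabbitmq" namespace (override via NAMESPACE).
NAMESPACE="${NAMESPACE:-rabbitmq}"

# Add a replica on target_node for every quorum queue matching the pattern.
grow_queues() {
  pod="$1"; target_node="$2"; pattern="$3"
  kubectl exec "$pod" -n "$NAMESPACE" -- \
    rabbitmq-queues grow "$target_node" all --queue-pattern "$pattern"
}

# List the members of a single quorum queue to confirm the new replica exists.
queue_members() {
  kubectl exec "$1" -n "$NAMESPACE" -- rabbitmq-queues quorum_status "$2"
}
```

For example, `queue_members minority-rabbitmq-0 batch-feature-update-event-handler` should list the rejoined node among the queue's members once the grow operation has completed; the pod name here is an assumption.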

Outstanding Concerns

Cluster Redundancy and Fault Tolerance:
- Despite having three operational pods, the cluster failed to serve connections. The root cause of this failure mode needs further analysis to ensure redundancy and fault tolerance.
Proper Pod Management in Kubernetes:
- The method used for restarting pods (kubectl delete pod) may not be appropriate for maintaining stable cluster states. For future interventions, a kubectl rollout restart might be more suitable to ensure smoother transitions and consistent cluster performance.
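A rolling restart along these lines might look like the sketch below. It assumes the cluster is managed by a StatefulSet in the `rabbitmq` namespace; the StatefulSet name is passed in by the caller and is not confirmed by this report, and the ten-minute timeout is an arbitrary choice.

```shell
# Assumption: the cluster runs in the "rabbitmq" namespace (override via NAMESPACE).
NAMESPACE="${NAMESPACE:-rabbitmq}"

# Restart the pods of a StatefulSet one at a time, then wait for the
# rollout to finish (or fail after the timeout).
rolling_restart() {
  kubectl rollout restart "statefulset/$1" -n "$NAMESPACE" &&
  kubectl rollout status "statefulset/$1" -n "$NAMESPACE" --timeout=10m
}
```

Unlike deleting a pod by hand, `kubectl rollout status` blocks until the restarted pods are ready again, which makes it easier to notice a pod that fails to come back before the next one is touched.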