RabbitMQ upgrade issue

What is the issue - We are facing an issue when upgrading RabbitMQ from 3.12 to 3.13 (i.e. Helm chart version 13 to 14). Details of the issue are described in this GitHub issue.

To resolve this, we need to move to a persistent disk.

Pros of ephemeral disk

Lower Latency: Typically, ephemeral storage offers lower latency since the storage is local to the node, providing faster read and write operations.

Simplicity: Easier to configure and manage as it does not require persistent volume claims or storage classes, simplifying the overall Kubernetes setup.

Cost-Effective: Generally, ephemeral storage is less expensive as it utilizes local node storage, which is often free or cheaper than persistent storage solutions.

Pros of persistent disk

Resilience: Enhances the resilience of RabbitMQ by allowing data to persist across node failures, making it suitable for production environments.

Seamless Upgrades: Persistent volumes (with node affinity handled by Azure, see the additional finding below) let queue data survive pod restarts, enabling seamless upgrades for RabbitMQ.

Next steps -

We have installed another RabbitMQ cluster (green) on dev and stage, this time with a persistent volume, so that we can compare its performance with the old cluster (blue).
Traffic from these environments now goes to the new cluster. We will let it run for a couple of days and compare the latencies using the histograms below.
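As a minimal sketch of how the green cluster can be installed with persistence enabled, assuming the Bitnami RabbitMQ chart and illustrative values (the release name is taken from the AMQP URI used in the load test below; storage class and size are placeholders):

# Install the green cluster with a persistent volume per replica
# (persistence.* and replicaCount are standard Bitnami RabbitMQ chart values)
helm upgrade --install minority-rabbitmq-persist bitnami/rabbitmq \
  --namespace rabbitmq \
  --set replicaCount=3 \
  --set persistence.enabled=true \
  --set persistence.storageClass=<premium-v2-storage-class> \
  --set persistence.size=100Gi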

Environment | Disk Type | IOPS | Histogram Link
Production | Ephemeral | 38000 | https://minorityv2.kb.us-east4.gcp.elastic-cloud.com:9243/app/r/s/3urEs
Staging | Persistent (Premium v2) | 15000 | https://minorityv2.kb.us-east4.gcp.elastic-cloud.com:9243/app/r/s/U2oy0

RabbitMQ Load Test Results:

Currently, during peak hours, we have about 500-600 messages per second published to RabbitMQ and Service Bus combined. For the test, we generated 5000 EOUT messages per second on the green cluster; the result logs are below:

java -jar perf-test.jar -x 1 -y 3 -u "testload3" --id "test-12" -q 100 --rate 5000 --quorum-queue --queue-args x-quorum-initial-group-size=5 --uri "amqp://user:password@minority-rabbitmq-persist.rabbitmq:5672" --time 120 --size 3000 -ms

id: test-12, time 116.000 s, sent: 5002 msg/s, received: 5003 msg/s, min/median/75th/95th/99th consumer latency: 4/6/7/12/19 ms
id: test-12, time 117.000 s, sent: 5005 msg/s, received: 5000 msg/s, min/median/75th/95th/99th consumer latency: 4/6/6/8/12 ms
id: test-12, time 118.000 s, sent: 5000 msg/s, received: 5006 msg/s, min/median/75th/95th/99th consumer latency: 4/6/6/8/10 ms
id: test-12, time 119.000 s, sent: 5005 msg/s, received: 4967 msg/s, min/median/75th/95th/99th consumer latency: 4/6/7/9/11 ms
id: test-12, time 120.000 s, sent: 5005 msg/s, received: 5034 msg/s, min/median/75th/95th/99th consumer latency: 4/6/10/32/40 ms
test stopped (Reached time limit)
id: test-12, sending rate avg: 5007 msg/s
id: test-12, receiving rate avg: 5007 msg/s

id: test-12, *****consumer latency min/median/75th/95th/99th 4/6/6/10/22 ms*****

Test results - id: test-12, consumer latency min/median/75th/95th/99th 4/6/6/10/22 ms

95th percentile consumer latency is 10 ms.

99th percentile consumer latency is 22 ms.

Pending Test -
Since it is difficult to generate production-level load on stage using the backend APIs, I will generate a random load of 5000-6000 messages per second using the perf tool and then run regressions to monitor latency on the histograms, as sketched below.
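A possible sketch of that run, assuming PerfTest's variable publishing rate option (--variable-rate RATE:DURATION, which cycles through the given rates); the queue name, test id, and durations are illustrative:

# Alternate between 5000 and 6000 msg/s every 60 s, running for one hour
java -jar perf-test.jar -x 1 -y 3 -u "testload4" --id "test-13" -q 100 \
  --variable-rate 5000:60 --variable-rate 6000:60 \
  --quorum-queue --queue-args x-quorum-initial-group-size=5 \
  --uri "amqp://user:password@minority-rabbitmq-persist.rabbitmq:5672" \
  --time 3600 --size 3000 -ms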

Additional finding - Azure ensures that the PV has node affinity to nodes in the same availability zone. I have also verified this manually; see the commands below.
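One way to do that check, assuming the rabbitmq namespace used above and an illustrative PV name:

# Find the PVs bound to the RabbitMQ PVCs
kubectl get pvc -n rabbitmq
# Inspect the node affinity of a PV (the availability zone appears in the selector terms)
kubectl get pv <pv-name> -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms}'
# Compare with the zone labels on the nodes
kubectl get nodes -L topology.kubernetes.io/zone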