Don't be fooled!
There is a fundamental part of the Patroni architecture that is often grossly overlooked or misunderstood: the role and sizing of the Distributed Consensus Store (DCS). In the context of Patroni, this is typically etcd.
Patroni uses etcd to elect the primary, register cluster members, and ensure that only one node believes it is the leader at any given time. etcd can only provide these guarantees while a majority of its members agree, a concept known as a quorum. If you come from the mindset that the number of etcd nodes you need is simply the number of Postgres nodes divided by two, plus one, you have been profoundly misled.
etcd nodes = ( postgres nodes / 2 ) + 1
This rule of thumb is a common source of confusion and instability! The size of your etcd cluster is independent of your Postgres node count and is governed only by the need to maintain a reliable quorum for etcd itself.
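To see how arbitrary the rule is, here is a minimal Python sketch of it (the function name is mine, purely for illustration). Notice that it even produces even-sized etcd clusters of 4 and 6 nodes, which tolerate no more failures than clusters of 3 and 5 would:

    import math

    def misguided_etcd_count(postgres_nodes: int) -> int:
        """The flawed rule: derive the etcd size from the Postgres node count."""
        return math.ceil(postgres_nodes / 2) + 1

    for pg in (2, 3, 5, 9):
        # Prints 2 -> 2, 3 -> 3, 5 -> 4, 9 -> 6; the 4- and 6-node results
        # waste a node, since an even member adds no extra fault tolerance.
        print(f"{pg} Postgres nodes -> {misguided_etcd_count(pg)} etcd nodes")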
Understanding why, and how to size your etcd cluster correctly, is essential for true high availability. Read on for the proper methodology.
Why Your Patroni Node Count Doesn't Determine Your etcd Quorum
The core misunderstanding is failing to distinguish between the Patroni cluster (your data layer) and the etcd cluster (your consensus layer).
Patroni/Postgres Nodes (Data): these hold your actual databases. Their count is driven by your replication, read-scaling, and recovery requirements.
etcd Nodes (Consensus): these hold the cluster state and the leader lock. Their count is driven solely by how many simultaneous etcd failures you need to tolerate.
The availability of your Patroni cluster relies entirely on the availability of its etcd quorum. If the etcd cluster loses its quorum, Patroni cannot safely elect a new primary or switch roles, even if the underlying Postgres data nodes are healthy.
On a side note, this heavy dependency on etcd, and on the Patroni layer managing Postgres, is why I favor pgPool in some cases.
The Correct etcd Quorum Sizing Rule
The sizing of the etcd cluster is based on the concept of fault tolerance, defined by the number of simultaneous etcd node failures you want to survive.
Let's take the common misconception with a scenario where you have a Postgres cluster of 3 database servers managed by Patroni. Most likely, you placed the etcd service on each of the Postgres database servers, and you probably think that you just need 3 etcd nodes. Why not use the Postgres servers to host them? After all, the etcd footprint is fairly light. No big deal.
ceiling of ( 3 / 2 ) + 1 = 2 + 1 = 3
Well, if more than one of your Postgres servers were to go down, you would be in a crisis trying to figure out why the last database server of the 3 has stopped accepting writes: with etcd quorum gone, Patroni cannot renew the leader lock and demotes the surviving node.
The fact is, you have to take into account how many etcd node failures you are willing to tolerate in order to do a proper calculation.
If you have etcd running on the 3 database servers, and 2 of the database servers go down, you have just lost 2 of your etcd nodes, leaving you with just 1. Well, 1 out of 3 won't cut it for a quorum, which requires a majority of 2.
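The arithmetic behind that failure is worth a quick sketch. Quorum for an etcd cluster of N members is the majority, floor(N / 2) + 1, which is the check etcd performs internally; the variable names below are mine, purely for illustration:

    def quorum(cluster_size: int) -> int:
        """Majority etcd requires to keep serving: floor(N / 2) + 1."""
        return cluster_size // 2 + 1

    etcd_nodes = 3
    failures = 2
    survivors = etcd_nodes - failures  # 1 node left standing

    print(f"quorum needed: {quorum(etcd_nodes)}")  # 2
    print(f"survivors:     {survivors}")           # 1
    # With 1 survivor against a required majority of 2, writes stop.
    print("quorum held" if survivors >= quorum(etcd_nodes) else "quorum LOST")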
To survive 2 failures, you need to have a system where the remaining nodes can still form a majority.
The correct formula for the number of etcd nodes (N) needed to tolerate a given number of simultaneous failures (F) is as follows:
N = ( 2 * F ) + 1
If F = 2 (two failures), then N = (2 * 2) + 1 = 5
You need enough nodes in your etcd cluster so that even when two are taken away, the survivors still meet the minimum quorum.
Let's break it down. With N = 5, the quorum is floor(5 / 2) + 1 = 3. Losing 2 nodes leaves 3 survivors, which still meets quorum, so etcd keeps serving and Patroni can still fail over. The 3-node cluster in the same situation is left with 1 survivor against a quorum of 2, and everything stops.
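Here is a minimal sketch tying both rules together (helper names are mine); for every tolerance level, the survivors still meet quorum:

    def nodes_needed(failures: int) -> int:
        """N = (2 * F) + 1: the smallest cluster that tolerates F failures."""
        return 2 * failures + 1

    def quorum(cluster_size: int) -> int:
        """Majority etcd requires to keep serving: floor(N / 2) + 1."""
        return cluster_size // 2 + 1

    for f in (1, 2, 3):
        n = nodes_needed(f)
        # tolerate 1 -> 3 nodes, tolerate 2 -> 5 nodes, tolerate 3 -> 7 nodes
        print(f"tolerate {f} failure(s): {n} etcd nodes, "
              f"quorum {quorum(n)}, survivors after {f} failures: {n - f}")

Note that the result is always an odd number; adding an even member raises the quorum without raising the fault tolerance, which is why 5 independent etcd nodes is the standard answer for surviving 2 failures.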