Primary log-based replication
Reads must return data written by any write which completed (that is, any write for which the client could possibly have received a commit message). There are many ways to handle this, but Ceph's architecture makes it easy for everyone at any map epoch to know who the primary is. Thus, the easy answer is to route all writes for a particular PG through a single ordering primary and then out to the replicas.

Though we only actually need to serialize writes on a single object (and even then, the partial ordering only really needs to provide an ordering between writes on overlapping regions), we might as well serialize writes on the whole PG, since that lets us represent the current state of the PG using two numbers:

1. The epoch of the map on the primary in which the most recent write started. This is a bit stranger than it might seem, since map distribution itself is asynchronous; see Peering and the concept of interval changes.
2. An increasing per-PG version number. This is referred to in the code with type eversion_t and stored as pg_info_t::last_update.

Furthermore, we maintain a log of "recent" operations extending back at least far enough to include any unstable writes (writes which have been started but not committed) and objects which aren't up to date locally (see recovery and backfill). In practice, the log will extend much further (osd_min_pg_log_entries when clean, osd_max_pg_log_entries when not clean) because it's handy for quickly performing recovery.
Using this log, as long as we talk to a non-empty subset of the OSDs which must have accepted any completed writes from the most recent interval in which we accepted writes, we can determine a conservative log which must contain any write which has been reported to a client as committed. There is some freedom here: we can choose as the new head any log entry between the oldest head remembered by an element of that set (any newer write cannot have completed without that log containing it) and the newest head remembered (clearly, all writes in the log were started, so it's fine for us to remember them). This is the main point of divergence between replicated pools and EC pools in PG/PrimaryLogPG: replicated pools try to choose the newest valid option, to avoid the client needing to replay those operations, and instead recover the other copies. EC pools instead try to choose the oldest option available to them.
The reason for this gets to the heart of the rest of the differences in implementation: one copy will not generally be enough to reconstruct an EC object. Indeed, there are encodings where some log combinations would leave unrecoverable objects (as with a 4+2 encoding where 3 of the shards remember a write but the other 3 do not: we don't have the 4 shards needed to reconstruct either version). For this reason, log entries representing unstable writes (writes not yet committed to the client) must be rollbackable using only local information on EC pools. Log entries in general may therefore be rollbackable (in that case, either via a delayed application of the write or via a set of instructions for rolling back an in-place update) or not. Replicated pool log entries are never rollbackable.
For more details, see PGLog.h/cc, osd_types.h:pg_log_t, osd_types.h:pg_log_entry_t, and peering in general.