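This document introduces the concepts behind how YugaByte uses Raft to handle log replication and consistency.
Background
Every table in DocDB is sharded into a set of tablets. Each tablet consists of a set of tablet-peers, and each tablet-peer stores one replica of the data belonging to that tablet. Data is replicated between tablet-peers with the Raft protocol, which keeps the replicas strongly consistent.
Raft basics
Let's start with some Raft basics.
States
First, the Raft algorithm maintains three states:
- Leader
- Follower
- Candidate
In operation, the cluster has exactly one Leader and every other node is a Follower. Followers respond only to requests from the Leader or from a Candidate, while the Leader handles requests from clients. If a client sends a request to a Follower, the request is forwarded to the Leader.
Raft calls the tenure of one Leader a Term. A Term is a time slice of arbitrary length, and Term numbers increase sequentially.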
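The role of the Term number can be illustrated with a minimal sketch. The class and method names below (`Node`, `start_election`, `observe_term`) are hypothetical and follow the rules of the Raft paper, not YugaByte's code: a node increments its term when it starts an election, rejects messages from older terms, and falls back to Follower when it sees a newer term.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


@dataclass
class Node:
    """Hypothetical node holding only the fields needed to show term handling."""
    current_term: int = 0
    state: State = State.FOLLOWER

    def start_election(self) -> None:
        # A node that suspects the leader is gone opens a new term.
        self.current_term += 1
        self.state = State.CANDIDATE

    def observe_term(self, remote_term: int) -> bool:
        """Return True if a message carrying remote_term should be processed."""
        if remote_term < self.current_term:
            return False  # stale sender: reject the message
        if remote_term > self.current_term:
            # A newer term exists, so adopt it and fall back to follower.
            self.current_term = remote_term
            self.state = State.FOLLOWER
        return True
```

The sketch leaves out the case where a Candidate receives a valid message from a Leader of the same term, which also makes it step down in Raft.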
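In YugaByte's implementation, a single config in steady state (that is, when no peers are failing) has two main types of peers: one leader peer and the follower peers. All write requests go directly to the leader, which then replicates them to a majority of the peers. In steady state the follower peers are entirely passive: they only receive data from the leader and acknowledge it. Only when the leader process stops do the follower peers become active and elect one of themselves as the new leader.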
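A minimal sketch of what "replicated to a majority" means for the leader, assuming each voter reports the highest log index it has stored. The function names are hypothetical and this is not YugaByte's code; real Raft additionally requires the entry to belong to the leader's current term before it counts as committed.

```python
from typing import List


def majority(num_voters: int) -> int:
    """Smallest number of voters that forms a majority."""
    return num_voters // 2 + 1


def committed_index(leader_last_index: int, follower_acked: List[int]) -> int:
    """Highest log index stored on a majority of voters, leader included."""
    indexes = sorted([leader_last_index, *follower_acked], reverse=True)
    # After sorting in descending order, the entry at position majority-1
    # is held by at least `majority` voters.
    return indexes[majority(len(indexes)) - 1]
```

With three voters, for example, the leader can treat an index as replicated to a majority once one follower has acknowledged it, because the leader's own copy counts as the second.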
YugaByte also explicitly defines an additional learner role, which has no voting rights.
// A peer in a configuration.
message RaftPeerPB {
  // The possible roles for peers.
  enum Role {
    // Indicates this node is a follower in the configuration, i.e. that it participates
    // in majorities and accepts Consensus::Update() calls.
    FOLLOWER = 0;

    // Indicates this node is the current leader of the configuration, i.e. that it
    // participates in majorities and accepts Consensus::Append() calls.
    LEADER = 1;

    // New peers joining a quorum will be in this role for both PRE_VOTER and PRE_OBSERVER
    // while the tablet data is being remote bootstrapped. The peer does not participate
    // in starting elections or majorities.
    LEARNER = 2;

    // Indicates that this node is not a participant of the configuration, i.e. does
    // not accept Consensus::Update() or Consensus::Append() and cannot
    // participate in elections or majorities. This is usually the role of a node
    // that leaves the configuration.
    NON_PARTICIPANT = 3;

    // This peer is a read (async) replica and gets informed of the quorum write
    // activity and provides time-line consistent reads.
    READ_REPLICA = 4;

    UNKNOWN_ROLE = 7;
  };
  // (The remaining fields of RaftPeerPB are omitted from this excerpt.)
}
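The comments in the enum above state which roles take part in majorities. As a rough illustration, the sketch below mirrors the enum values in Python and filters a peer list down to its voters. The helper names are hypothetical, and treating READ_REPLICA and UNKNOWN_ROLE as non-voting is an assumption: the source comments only say so explicitly for LEARNER and NON_PARTICIPANT.

```python
from enum import IntEnum
from typing import Dict, List


class Role(IntEnum):
    # Values copied from the RaftPeerPB.Role enum shown above.
    FOLLOWER = 0
    LEADER = 1
    LEARNER = 2
    NON_PARTICIPANT = 3
    READ_REPLICA = 4
    UNKNOWN_ROLE = 7


# Roles whose comments say they participate in majorities.
# Excluding READ_REPLICA and UNKNOWN_ROLE is an assumption of this sketch.
VOTING_ROLES = frozenset({Role.FOLLOWER, Role.LEADER})


def counts_toward_majority(role: Role) -> bool:
    return role in VOTING_ROLES


def voters(peers: Dict[str, Role]) -> List[str]:
    """Return the sorted names of peers that count toward a majority."""
    return sorted(name for name, role in peers.items() if counts_toward_majority(role))
```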
The diagram below shows the possible state transitions:
                        +------------+
                        |  NON_PART  +---+
                        +-----+------+   |
       Exist. RaftConfig?     |          |
                        +-----v------+   |
                        |  LEARNER   +   | New RaftConfig?
                        +-----+------+   |
                              |          |
                        +-----v------+   |
                    +-->+  FOLLOW.   +<--+
                    |   +-----+------+
                    |         |
                    |   +-----v------+
     Step Down      +<--+ CANDIDATE  |
                    ^   +-----+------+
                    |         |
                    |   +-----v------+
                    +<--+   LEADER   |
                        +------------+
In addition, every state can transition to NON_PARTICIPANT on a configuration change and/or when a peer times out or dies.
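The transitions described above can be written down as a small lookup table. This is only a sketch of the diagram as read here, with a hypothetical `can_transition` helper; it is not taken from YugaByte's code.

```python
# Edges read off the state diagram. NON_PARTICIPANT goes to LEARNER when a
# RaftConfig already exists and straight to FOLLOWER for a new RaftConfig.
ALLOWED = {
    "NON_PARTICIPANT": {"LEARNER", "FOLLOWER"},
    "LEARNER": {"FOLLOWER"},
    "FOLLOWER": {"CANDIDATE"},
    "CANDIDATE": {"LEADER", "FOLLOWER"},  # win the election, or step down
    "LEADER": {"FOLLOWER"},               # step down
}


def can_transition(src: str, dst: str) -> bool:
    """Return True if the diagram allows moving from src to dst."""
    if src == dst:
        return False  # staying in place is not a transition
    if dst == "NON_PARTICIPANT":
        # Any state may drop out on a config change or peer timeout/death.
        return src in ALLOWED
    return dst in ALLOWED.get(src, set())
```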
The write ahead log (WAL)
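The earlier article that introduced RocksDB ("WriteAheadLog-介绍") explained that DocDB, to avoid duplicating functionality, drops RocksDB's own WAL and uses the Raft WAL instead.
In YugaByte, a Quorum is a set of cooperating processes whose purpose is to maintain a consistent, replicated operation log for a given data set (for example, a tablet). This log also serves as the tablet's write-ahead log (WAL).
The main job of the WAL is to restore the in-memory (memtable) data as it was before the failure after RocksDB exits abnormally.
The WAL is implemented in log.cc.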
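The recovery idea can be shown with a toy write-ahead log: every write is appended to a file before it is considered durable, and after a crash the in-memory table is rebuilt by replaying that file. The `TinyWal` class and its JSON-lines file format are hypothetical and have nothing to do with the actual format used by log.cc.

```python
import json
import os
from typing import Dict


class TinyWal:
    """Toy write-ahead log: one JSON record per line (hypothetical format)."""

    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, key: str, value: str) -> None:
        # Persist the record before the write is acknowledged.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"k": key, "v": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self) -> Dict[str, str]:
        """Rebuild the in-memory table by applying records in log order."""
        memtable: Dict[str, str] = {}
        if not os.path.exists(self.path):
            return memtable
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                memtable[record["k"]] = record["v"]
        return memtable
```

A process that restarts after a crash would create a new `TinyWal` on the same path and call `replay()` to get back the state that was in memory before the failure.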
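Protocol communication
Raft communication mainly uses two types of RPC:
- AppendEntries RPC: replicates log entries and sends heartbeats
- RequestVote RPC: sent by a Candidate to request votes
YugaByte's implementation replaces AppendEntries with UpdateConsensus and keeps RequestVote, under the name RequestConsensusVote.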
// A Raft implementation.
service ConsensusService {
  // Analogous to AppendEntries in Raft, but only used for followers.
  rpc UpdateConsensus(ConsensusRequestPB) returns (ConsensusResponsePB);

  // RequestVote() from Raft.
  rpc RequestConsensusVote(VoteRequestPB) returns (VoteResponsePB);

  // Implements all of the one-by-one config change operations, including
  // AddServer() and RemoveServer() from the Raft specification, as well as
  // an operation to change the role of a server between VOTER and PRE_VOTER.
  // An OK response means the operation was successful.
  rpc ChangeConfig(ChangeConfigRequestPB) returns (ChangeConfigResponsePB);

  rpc GetNodeInstance(GetNodeInstanceRequestPB) returns (GetNodeInstanceResponsePB);

  // Force this node to run a leader election.
  rpc RunLeaderElection(RunLeaderElectionRequestPB) returns (RunLeaderElectionResponsePB);

  // Notify originator about lost election, so it could reset its timeout.
  rpc LeaderElectionLost(LeaderElectionLostRequestPB) returns (LeaderElectionLostResponsePB);

  // Force this node to step down as leader.
  rpc LeaderStepDown(LeaderStepDownRequestPB) returns (LeaderStepDownResponsePB);

  // Get the latest committed or received opid on the server.
  rpc GetLastOpId(GetLastOpIdRequestPB) returns (GetLastOpIdResponsePB);

  // Returns the committed Consensus state.
  rpc GetConsensusState(GetConsensusStateRequestPB) returns (GetConsensusStateResponsePB);

  // Instruct this server to remotely bootstrap a tablet from another host.
  rpc StartRemoteBootstrap(StartRemoteBootstrapRequestPB) returns (StartRemoteBootstrapResponsePB);
}
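Since UpdateConsensus is described as analogous to AppendEntries, a sketch of the follower-side AppendEntries rules from the Raft paper may help. This is not YugaByte's UpdateConsensus implementation: the `FollowerLog` class, the `handle_update` method, and its parameters are hypothetical, and entries are simplified to `(term, payload)` tuples.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, payload)


@dataclass
class FollowerLog:
    # Log index 1 is stored at entries[0], as in the Raft paper.
    entries: List[Entry] = field(default_factory=list)

    def handle_update(self, prev_index: int, prev_term: int,
                      new_entries: List[Entry]) -> bool:
        """Apply an AppendEntries-style request; return False to reject it."""
        # Consistency check: the follower must already hold the entry that
        # immediately precedes the new ones, with the same term.
        if prev_index > len(self.entries):
            return False
        if prev_index > 0 and self.entries[prev_index - 1][0] != prev_term:
            return False
        for offset, entry in enumerate(new_entries):
            pos = prev_index + offset  # zero-based slot for this entry
            if pos < len(self.entries):
                if self.entries[pos][0] != entry[0]:
                    # Conflict: drop this entry and everything after it.
                    del self.entries[pos:]
                    self.entries.append(entry)
            else:
                self.entries.append(entry)
        return True
```

A request with an empty `new_entries` list acts as a heartbeat: it still runs the consistency check but changes nothing.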
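Subproblems
Built around state management and the log, Raft can essentially be divided into three subproblems:
- Leader election
- Log replication
- Safety
The following sections cover these three problems one by one.
About Raft
For more details, see https://github.com/maemual/raft-zh_cn/blob/master/raft-zh_cn.md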