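This document introduces the concepts behind how YugaByte uses Raft to handle log replication and consistency.

Background

Every table in DocDB is sharded into a set of tablets. Each tablet consists of a set of tablet-peers, and each tablet-peer stores one replica of the data that belongs to that tablet. Data is replicated between tablet-peers with the Raft protocol, which keeps the replicas strongly consistent.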

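Raft basics

Let's start with some Raft basics.

States

First, every node in the Raft algorithm is in one of three states:

  • Leader
  • Follower
  • Candidate

In normal operation the cluster has exactly one Leader, and all other nodes are Followers. Followers only respond to requests from the Leader or from Candidates, while the Leader handles requests from clients. If a client sends a request to a Follower, the request is forwarded to the Leader.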
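The forwarding rule above can be sketched in a few lines of Python. This is an illustrative toy, not YugaByte code: the Node class, its fields, and handle_client_request are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Role(Enum):
    LEADER = "leader"
    FOLLOWER = "follower"
    CANDIDATE = "candidate"


@dataclass
class Node:
    """Hypothetical cluster node; not a YugaByte type."""
    name: str
    role: Role = Role.FOLLOWER
    leader: Optional["Node"] = None            # the leader this node currently knows about
    handled: List[str] = field(default_factory=list)

    def handle_client_request(self, request: str) -> str:
        """Serve the request if we are the leader, otherwise forward it.

        Returns the name of the node that actually served the request.
        """
        if self.role is Role.LEADER:
            self.handled.append(request)
            return self.name
        if self.leader is None:
            raise RuntimeError("no known leader to forward to")
        return self.leader.handle_client_request(request)


n1 = Node("n1", role=Role.LEADER)
n2 = Node("n2", leader=n1)
served_by = n2.handle_client_request("put k v")   # sent to a follower, served by n1
```

Forwarding is modelled here as a direct method call; a real system would return a redirect or proxy the RPC.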

Raft calls the tenure of a Leader a Term. A Term is a time slice of arbitrary length, and Term numbers increase monotonically.
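As a minimal sketch of how monotonically increasing Term numbers are used, the snippet below follows the rule from the Raft paper: a node that sees a higher term adopts it and gives up leadership, and a message with a lower term is treated as stale. TermTracker and its method names are assumptions made for this example, not part of YugaByte.

```python
class TermTracker:
    """Hypothetical helper that tracks the current Raft term of one node."""

    def __init__(self) -> None:
        self.current_term = 0
        self.is_leader = False

    def start_election(self) -> int:
        """A candidate bumps its term by one before asking for votes."""
        self.current_term += 1
        return self.current_term

    def observe_term(self, remote_term: int) -> bool:
        """Process the term carried by an incoming message.

        Returns True if the message is current, False if it is stale.
        """
        if remote_term > self.current_term:
            # A newer term exists: adopt it and fall back to follower.
            self.current_term = remote_term
            self.is_leader = False
            return True
        return remote_term == self.current_term


tracker = TermTracker()
tracker.start_election()          # term becomes 1
tracker.is_leader = True          # pretend the election was won
stale = tracker.observe_term(0)   # message from an older term
newer = tracker.observe_term(3)   # message from a newer term
```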

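In YugaByte's implementation, a single config in steady state, that is, when no peer has failed, mainly contains two types of peers: one leader peer and a number of follower peers. All write requests are sent directly to the leader, which then replicates them to a majority of the followers. Follower peers are completely passive in steady state: they only receive data from the leader and reply. Only when the leader process stops do the follower peers become active and elect one of themselves as the new leader.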
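The majority rule can be made concrete with a short sketch. It assumes the usual Raft bookkeeping in which the leader tracks, for every voting peer including itself, the highest log index known to be stored on that peer. The function names are hypothetical, and the extra Raft condition that the entry must belong to the leader's current term is left out for brevity.

```python
from typing import List


def majority_size(num_voters: int) -> int:
    """Smallest number of voters that forms a majority."""
    return num_voters // 2 + 1


def highest_majority_index(match_indexes: List[int]) -> int:
    """Highest log index stored on a majority of voters.

    match_indexes holds one entry per voting peer, leader included.
    """
    ordered = sorted(match_indexes, reverse=True)
    return ordered[majority_size(len(ordered)) - 1]


three_node = highest_majority_index([5, 3, 4])        # leader at 5, followers at 3 and 4
five_node = highest_majority_index([7, 2, 2, 7, 1])   # only two peers have reached 7
```

With three voters, index 4 is present on two of them and can be committed. With five voters, index 7 is present on only two peers, so the highest index held by a majority is 2.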

YugaByte also explicitly defines an additional learner role, which has no voting rights.

// A peer in a configuration.
message RaftPeerPB {
  // The possible roles for peers.
  enum Role {
    // Indicates this node is a follower in the configuration, i.e. that it participates
    // in majorities and accepts Consensus::Update() calls.
    FOLLOWER = 0;

    // Indicates this node is the current leader of the configuration, i.e. that it
    // participates in majorities and accepts Consensus::Append() calls.
    LEADER = 1;

    // New peers joining a quorum will be in this role for both PRE_VOTER and PRE_OBSERVER
    // while the tablet data is being remote bootstrapped. The peer does not participate
    // in starting elections or majorities.
    LEARNER = 2;

    // Indicates that this node is not a participant of the configuration, i.e. does
    // not accept Consensus::Update() or Consensus::Update() and cannot
    // participate in elections or majorities. This is usually the role of a node
    // that leaves the configuration.
    NON_PARTICIPANT = 3;

    // This peer is a read (async) replica and gets informed of the quorum write
    // activity and provides time-line consistent reads.
    READ_REPLICA = 4;

    UNKNOWN_ROLE = 7;
  };

  // Remaining RaftPeerPB fields are omitted from this excerpt.
}

The following diagram shows the possible state changes:

                 +------------+
                 |  NON_PART  +---+
                 +-----+------+   |
   Exist. RaftConfig?  |          |
                 +-----v------+   |
                 |  LEARNER   +   | New RaftConfig?
                 +-----+------+   |
                       |          |
                 +-----v------+   |
             +-->+  FOLLOW.   +<--+
             |   +-----+------+
             |         |
             |   +-----v------+
  Step Down  +<--+ CANDIDATE  |
             ^   +-----+------+
             |         |
             |   +-----v------+
             +<--+   LEADER   |
                 +------------+

In addition, every state can transition to NON_PARTICIPANT, on a configuration change and/or when a peer times out or dies.
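The transitions in the diagram can also be written down as a small lookup table. This is only a reading of the diagram expressed in Python. The table and the can_transition helper are hypothetical, and CANDIDATE appears in the diagram even though it is not a value of the Role enum shown earlier.

```python
# Edges read off the state diagram: source state -> allowed target states.
ALLOWED_TRANSITIONS = {
    "NON_PARTICIPANT": {"LEARNER", "FOLLOWER"},  # existing vs. new RaftConfig
    "LEARNER": {"FOLLOWER"},
    "FOLLOWER": {"CANDIDATE"},
    "CANDIDATE": {"LEADER", "FOLLOWER"},         # wins the election or steps down
    "LEADER": {"FOLLOWER"},                      # steps down
}


def can_transition(src: str, dst: str) -> bool:
    """Return True if the diagram allows moving from src to dst."""
    if src not in ALLOWED_TRANSITIONS:
        return False
    if dst == "NON_PARTICIPANT":
        # Every other state may drop out of the configuration.
        return src != dst
    return dst in ALLOWED_TRANSITIONS[src]


learner_to_leader = can_transition("LEARNER", "LEADER")
leader_leaves = can_transition("LEADER", "NON_PARTICIPANT")
```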

The write ahead log (WAL)

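An earlier article, the WriteAheadLog introduction that covered RocksDB, explained that DocDB drops RocksDB's own WAL to avoid duplicated functionality and uses the Raft WAL instead.

In YugaByte, a Quorum is a set of cooperating processes whose purpose is to maintain a consistent, replicated log of operations on a given data set, such as a tablet. This log also serves as the tablet's write-ahead log (WAL).

The main job of the WAL is to make it possible, after RocksDB exits abnormally, to recover the in-memory (memtable) data that existed before the failure.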
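The recovery idea can be illustrated with a toy key-value store: every write is appended to the log before it touches the in-memory table, so replaying the log after a crash rebuilds the table. This is a sketch under simplifying assumptions, not how log.cc works. TinyStore is a hypothetical class, and a plain Python list stands in for the durable log file.

```python
from typing import Dict, List, Tuple

LogEntry = Tuple[str, str]  # (key, value)


class TinyStore:
    """Hypothetical store with a write-ahead log and an in-memory table."""

    def __init__(self, wal: List[LogEntry]) -> None:
        self.wal = wal                       # stand-in for the durable log
        self.memtable: Dict[str, str] = {}   # lost when the process dies

    def put(self, key: str, value: str) -> None:
        self.wal.append((key, value))        # 1. log the write first
        self.memtable[key] = value           # 2. then apply it in memory

    @classmethod
    def recover(cls, wal: List[LogEntry]) -> "TinyStore":
        """Rebuild the in-memory table by replaying the log in order."""
        store = cls(wal)
        for key, value in wal:
            store.memtable[key] = value
        return store


durable_log: List[LogEntry] = []
store = TinyStore(durable_log)
store.put("a", "1")
store.put("a", "2")
store.put("b", "3")
del store                                    # simulate a crash: memtable is gone
recovered = TinyStore.recover(durable_log)
```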

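The WAL is implemented in log.cc.

Protocol communication

Raft communication mainly uses two types of RPC:

  • AppendEntries RPC: replicates log entries and sends heartbeats
  • RequestVote RPC: sent by a Candidate to request votes

YugaByte's implementation replaces AppendEntries with UpdateConsensus and keeps RequestVote, which is exposed as RequestConsensusVote.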

// A Raft implementation.
service ConsensusService {
  // Analogous to AppendEntries in Raft, but only used for followers.
  rpc UpdateConsensus(ConsensusRequestPB) returns (ConsensusResponsePB);

  // RequestVote() from Raft.
  rpc RequestConsensusVote(VoteRequestPB) returns (VoteResponsePB);

  // Implements all of the one-by-one config change operations, including
  // AddServer() and RemoveServer() from the Raft specification, as well as
  // an operation to change the role of a server between VOTER and PRE_VOTER.
  // An OK response means the operation was successful.
  rpc ChangeConfig(ChangeConfigRequestPB) returns (ChangeConfigResponsePB);

  rpc GetNodeInstance(GetNodeInstanceRequestPB) returns (GetNodeInstanceResponsePB);

  // Force this node to run a leader election.
  rpc RunLeaderElection(RunLeaderElectionRequestPB) returns (RunLeaderElectionResponsePB);

  // Notify originator about lost election, so it could reset its timeout.
  rpc LeaderElectionLost(LeaderElectionLostRequestPB) returns (LeaderElectionLostResponsePB);

  // Force this node to step down as leader.
  rpc LeaderStepDown(LeaderStepDownRequestPB) returns (LeaderStepDownResponsePB);

  // Get the latest committed or received opid on the server.
  rpc GetLastOpId(GetLastOpIdRequestPB) returns (GetLastOpIdResponsePB);

  // Returns the committed Consensus state.
  rpc GetConsensusState(GetConsensusStateRequestPB) returns (GetConsensusStateResponsePB);

  // Instruct this server to remotely bootstrap a tablet from another host.
  rpc StartRemoteBootstrap(StartRemoteBootstrapRequestPB) returns (StartRemoteBootstrapResponsePB);
}
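To show what the two core RPCs do on the receiving side, here is a deliberately simplified sketch that follows the rules of the Raft paper rather than YugaByte's ConsensusService. The Follower class and its method signatures are hypothetical. The log-matching check on the previous entry is omitted, and persistence and timers are not modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Entry = Tuple[int, str]  # (term, payload)


@dataclass
class Follower:
    """Hypothetical receiver of AppendEntries and RequestVote."""
    current_term: int = 0
    voted_for: Optional[str] = None
    log: List[Entry] = field(default_factory=list)
    leader_contacts: int = 0   # each accepted call would reset the election timer

    def _observe(self, term: int) -> None:
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None          # a new term means a fresh vote

    def append_entries(self, term: int, entries: List[Entry]) -> bool:
        """An empty entries list is a pure heartbeat."""
        if term < self.current_term:
            return False                   # stale leader
        self._observe(term)
        self.leader_contacts += 1
        self.log.extend(entries)
        return True

    def request_vote(self, term: int, candidate: str,
                     last_log_term: int, last_log_index: int) -> bool:
        if term < self.current_term:
            return False
        self._observe(term)
        my_last_term = self.log[-1][0] if self.log else 0
        # The candidate's log must be at least as up to date as ours.
        up_to_date = (last_log_term, last_log_index) >= (my_last_term, len(self.log))
        if self.voted_for in (None, candidate) and up_to_date:
            self.voted_for = candidate     # at most one vote per term
            return True
        return False


f = Follower()
heartbeat_ok = f.append_entries(1, [])
append_ok = f.append_entries(1, [(1, "a")])
stale_rejected = f.append_entries(0, [(0, "x")])
first_vote = f.request_vote(2, "c1", 1, 1)
second_vote = f.request_vote(2, "c2", 1, 1)
behind_vote = f.request_vote(3, "c3", 0, 0)
```

The second candidate in term 2 is refused because the vote for that term is already taken. The candidate in term 3 is refused because its log is behind the follower's, although the follower still adopts the newer term.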

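Subproblems

Around state management and the log, Raft can essentially be divided into three subproblems:

  • Leader election
  • Log replication
  • Safety

These three problems are covered one by one later on.

For more on Raft, see https://github.com/maemual/raft-zh_cn/blob/master/raft-zh_cn.md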

Copyright © itrunner.cn 2020, all rights reserved. Last revised: 2022-08-28 07:44:16
