Netsy System Design Overview

Terminology

  • Record: a single key-value data entry, also known as an “etcd record”, which has a revision integer, key string, value blob, and other metadata. Unique on revision, as each revision produces exactly one Record.
  • KV Data: the collection of Records, also known as “etcd records”.
  • Node: a single Netsy process. Each node has an identifier or “Node ID”. The Node ID must consist of lowercase alphanumeric characters and hyphens only, with no leading, trailing, or consecutive hyphens, and a maximum of 32 characters.
  • Cluster: a collection of Nodes for a given KV Data store. Each Cluster has a “Cluster ID” following the same naming/validation rules as Node ID.
  • Primary: the Cluster Node which handles all etcd transaction requests (write operations).
  • Replica: all Cluster Nodes except for the Primary. Must proxy any transaction request (write operation) to the Primary.
  • Client: a consumer of the etcd API subset in the Netsy API, also known as an “etcd client” e.g. kube-apiserver or etcdctl.
  • Peer: a consumer of the Netsy API which is a Node.
  • Elector: the Cluster Node responsible for leader election of the Primary. Can be the same Node as the Primary or a Replica.
  • Heartbeat: a message sent by a Node to the Elector and/or Primary containing its current Node State (Health State, Primary State, and latest Revision).
  • Receipt: a message sent by a Replica to the Primary confirming that a Record has been durably committed to the Replica’s local SQLite database. Every Receipt embeds a Heartbeat.
  • Revision: a monotonically increasing integer assigned by the Primary to each Record. Every write operation (create, update, delete) produces a new Record at the next revision number.
  • Committed Revision: the highest Revision that has been confirmed committed by the Primary (quorum met or written to object storage).
  • Initial: a logical message sent by the Primary to a Replica when a Follow stream is first established, carrying the current Committed and Compaction Revision.
  • Commit: a logical message sent by the Primary to Replicas on the replication stream, advancing the Committed Revision and allowing Replicas to serve Records up to that Revision to Clients and Watches.
  • Compact: a logical message sent by the Primary to Replicas on the replication stream, confirming the current Compaction Revision for watch-admission gating and per-Node compaction execution.
  • Watch: a long-lived subscription to key/key-range changes that streams ordered updates (puts/deletes) in real time as they are committed (at or below the committed_revision).
  • Bind Address: a host+port string used for binding a gRPC server to a given IP or hostname and port e.g. 0.0.0.0:2378
  • Advertise Address: a host+port string used for a Client or Peer to connect to the Node or Cluster e.g. 172.16.0.1:2378 or etcd.example.com:2378
  • Service Discovery: how each Node learns about all other Nodes in a Cluster, including their Advertise addresses for Peer connections.
  • Member ID: a stable numeric etcd-compatible member identifier assigned by the Elector and stored in object storage separately from active Node registrations.
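The Node ID and Cluster ID rules above can be sketched as a single validation function (a minimal illustration; the function and pattern names are not part of the Netsy API):

```python
import re

# Lowercase alphanumerics and hyphens only, no leading/trailing/consecutive
# hyphens: one or more alphanumeric runs joined by single hyphens.
_ID_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def is_valid_id(value: str) -> bool:
    # Applies to both Node IDs and Cluster IDs (max 32 characters).
    return len(value) <= 32 and bool(_ID_RE.fullmatch(value))
```

For example, `node-1` and `abc123` pass, while `-node`, `node--1`, and any ID over 32 characters are rejected.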

Cluster State

There are three components, collectively referred to as the “Cluster State”, which are communicated to all Nodes:

  • Current Elector Node

  • Current Primary Node

  • The total Node Count / number of registered Nodes, used for majority quorum calculation
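Assuming “majority quorum” means a strict majority of registered Nodes, the calculation from the Node Count can be sketched as:

```python
def majority_quorum(node_count: int) -> int:
    # Smallest number of nodes that forms a strict majority of the
    # registered Node Count: floor(n / 2) + 1.
    return node_count // 2 + 1
```

So a 3-Node Cluster needs 2 confirmations, a 4-Node Cluster needs 3, and a 5-Node Cluster needs 3.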

Node States

Each Node has three state fields which can be read via the Peer API:

  1. Health State:

    • Loading during its initial startup and database backfill process.

    • Healthy when it has completed Loading and is not Degraded.

    • Degraded when any of the following occurs:

      • Self-degraded: the Node has failed to send any Receipt or Heartbeat after 1 immediate retry.

      • Elector-degraded: the Elector or Primary has detected 2 consecutive missed Heartbeats from the Node.

      • A Replica has received a committed_revision from the Primary that is higher than its own latest revision and has not caught up within 2 seconds.

    A Node should be considered “unhealthy” if it has remained in the Loading or Degraded state for longer than a configured timeout.

  2. Elector State:

    • Leader: the Node has been elected and is currently the Elector.

    • Follower: the Node is not the Elector.

  3. Primary State:

    • Replica: the Node is a Replica and has not been elected Primary by the Elector since it started.

    • Starting: immediately after being elected Primary, while performing “preflight checks” before becoming Active. Must not accept new writes while in this state.

    • Active: able to accept writes (provided its Chunk Buffer is not full).

    • Draining: needing to shutdown or consistently failing to write data (or Chunk Buffer is full). Must not accept new writes while in this state.

    After a Primary finishes Draining, the Node process gives up its Primary leadership with the Elector, and restarts into a Replica Primary State (with a Loading Health State).
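The Primary State lifecycle above can be sketched as a small transition table (the event names are illustrative paraphrases of the triggers in the text, not part of the Netsy API):

```python
# Primary State machine: Replica -> Starting -> Active -> Draining -> Replica.
PRIMARY_TRANSITIONS = {
    ("Replica", "elected"): "Starting",      # elected Primary by the Elector
    ("Starting", "preflight_ok"): "Active",  # preflight checks passed
    ("Active", "drain"): "Draining",         # shutdown, write failures, or full Chunk Buffer
    ("Draining", "restarted"): "Replica",    # gives up leadership, restarts as a Replica
}

def next_state(state: str, event: str) -> str:
    return PRIMARY_TRANSITIONS[(state, event)]

def accepts_writes(state: str) -> bool:
    # Only an Active Primary may accept writes (and only while its Chunk
    # Buffer is not full, which is not modelled here).
    return state == "Active"
```

Note that writes are rejected in every state except Active, matching the Starting and Draining rules above.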

Requirements

  • Every Node stores a copy of all KV Data in a local SQLite database. SQLite must be configured for durability to ensure that a committed transaction is actually persisted to disk before a Receipt is sent (to guarantee quorum transactions). The required configuration is:
    • PRAGMA journal_mode=WAL — WAL (Write-Ahead Logging) for concurrent read/write performance. Must be set once when the database is opened.
    • PRAGMA synchronous=FULL — ensures WAL writes are fsynced to disk before reporting commit success. This is critical: without it, a crash after commit but before fsync could lose data that was already receipted.
  • Replicas can answer range (read) requests directly, but proxy any transaction (write) requests to the Primary.
    • Replicas must only serve records up to the committed_revision received from the Primary. Records above this revision are tentative (from in-progress or rolled-back transactions) and must not be visible to clients.
    • If a Replica’s Health State is Degraded, it must continue to serve range (read) requests, which may return stale data. If stricter read consistency guarantees are required in the future, a Replica in Degraded state may instead reject or proxy read requests.
    • An active Primary cannot be in a Degraded state — if a Primary degrades, it transitions to Draining, gives up leadership, and restarts as a Replica (see Graceful Shutdown).
  • Each Node has an unencrypted HTTP endpoint /health for health-checking the Netsy process, with health determined by the Health State; this can be used by systems like Kubernetes or ASGs.
  • The Primary writes data to its SQLite database, object storage, and all Replicas.
    • In a Cluster without enough Healthy Replicas to meet the configured quorum threshold, writes to the object storage bucket are synchronous, making object storage the canonical system of record.
    • Where there are enough Healthy Replicas to meet the quorum threshold, a transaction can be committed when those Replicas confirm Receipt of the Record, and writes to object storage are asynchronous/buffered. See Quorum Configuration for details.
    • The Primary sends PrimaryMessage values to Replicas on the replication stream, where initial carries the current Committed Revision and current Compaction Revision for a newly established stream, record is treated as a logical Record message, commit advances the current Committed Revision, and compact advances the current Compaction Revision. Records above the Committed Revision are tentative and, if a rollback occurs, will be overwritten by a new transaction from the same Primary.
    • Data is sent to Replicas via gRPC streams, and is stored in object storage using a custom Netsy Data File format.
    • Replicas must only accept data from the Primary when its Primary State is Starting, Active, or Draining (i.e. not Replica).
    • The Primary must accept connections from Replicas when its Primary State is Starting, Active, or Draining.
  • The Elector is the only Node which can perform leader election for the Cluster to determine which Node is the Primary.
    • Determining which Node is the Elector uses a separate leader election process from the one which determines which Node is the Primary. This may be referred to as a two-tier/dual-layer leader election system: one for the etcd writer/Primary role, one for the Elector role.
    • This two-tier approach is used because an audit must be conducted during leader election of the Primary to ensure whichever Replica is elected has the current latest-known etcd revision number, to prevent data loss.
    • The Elector leader election process uses s3lect, which coordinates via object storage, whereas the Primary leader election process is handled by the Elector itself.
  • Mutual TLS (mTLS) is used for authenticating/authorising connections to Node gRPC servers.
    • All TLS certificates in a Cluster share a single CA.
    • Peer certificates must contain OU=peer, CN=node_id, and O=cluster_id. Node IDs are embedded into client certificates to prevent impersonation.
    • Client certificates must contain OU=client, CN=client_id, and O=cluster_id.
    • During the authentication flow, the server verifies O matches its own Cluster ID, preventing cross-cluster connections if a CA is re-used across clusters.
  • Each Node has:
    • A server certificate (used on Client and Peer gRPC servers) and a client certificate (used for outbound Peer connections).
      • The CN (Node ID) is validated during loading/startup to ensure it matches the Node configuration.
      • The server certificate SANs are validated during loading/startup to ensure they cover the configured Client, Peer, and election advertise addresses.
    • A Bind Address and an Advertise Address for Client Node/Cluster connections and Peer Node connections.
  • For Service Discovery, each Node registers itself by writing a file in object storage under the nodes/ prefix containing its Advertise addresses.
    • The Elector also maintains a durable members.json file containing the Cluster ID and stable etcd member_id -> node_id mappings used for cluster topology APIs such as MemberList.
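The SQLite durability requirement above can be sketched with Python’s sqlite3 module (a minimal illustration of the two required PRAGMAs, not Netsy’s actual implementation):

```python
import sqlite3

def open_durable_db(path: str) -> sqlite3.Connection:
    # Open the local SQLite database with the durability settings the
    # design requires before any Receipt may be sent.
    conn = sqlite3.connect(path)
    # WAL: set once when the database is opened; the mode persists in the
    # database file across connections.
    conn.execute("PRAGMA journal_mode=WAL")
    # FULL: fsync the WAL before a commit is reported successful, so a
    # receipted Record survives a crash immediately after commit.
    conn.execute("PRAGMA synchronous=FULL")
    return conn
```

Without `synchronous=FULL`, a commit can be acknowledged before the WAL reaches disk, which is exactly the receipted-but-lost scenario the requirement guards against.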
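The mTLS subject checks for Peer connections can be sketched as follows, assuming the subject fields (O, OU, CN) have already been parsed from a verified certificate; the function name and dict shape are illustrative only:

```python
def authorize_peer(subject: dict, my_cluster_id: str) -> str:
    # O must match our own Cluster ID, preventing cross-cluster
    # connections when a CA is shared between clusters.
    if subject.get("O") != my_cluster_id:
        raise PermissionError("certificate cluster mismatch")
    # OU distinguishes Peer certificates from Client certificates.
    if subject.get("OU") != "peer":
        raise PermissionError("not a peer certificate")
    node_id = subject.get("CN", "")
    if not node_id:
        raise PermissionError("missing node id")
    # The embedded Node ID identifies the Peer and prevents impersonation.
    return node_id
```

A Client connection would follow the same shape with OU=client and CN carrying the client_id instead.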

Further Reading