Netsy System Design Overview
Terminology
- Record: a single key-value data entry, also known as an “etcd record”, which has a revision integer, key string, value blob, and other metadata. Unique on revision, as each revision produces exactly one Record.
- KV Data: the collection of Records, also known as “etcd records”.
- Node: a single Netsy process. Each node has an identifier or “Node ID”. The Node ID must be lowercase alphanumeric characters and hyphens only, with no leading, trailing, or consecutive hyphens, and a maximum of 32 characters.
- Cluster: a collection of Nodes for a given KV Data store. Each Cluster has a “Cluster ID” following the same naming/validation rules as Node ID.
- Primary: the Cluster Node which handles all etcd transaction requests (write operations).
- Replica: all Cluster Nodes except for the Primary. Must proxy any transaction request (write operation) to the Primary.
- Client: a consumer of the etcd API subset in the Netsy API, also known as an “etcd client”, e.g. `kube-apiserver` or `etcdctl`.
- Peer: a consumer of the Netsy API which is a Node.
- Elector: the Cluster Node responsible for leader election of the Primary. Can be the same Node as the Primary or a Replica.
- Heartbeat: a message sent by a Node to the Elector and/or Primary containing its current Node State (Health State, Primary State, and latest Revision).
- Receipt: a message sent by a Replica to the Primary confirming that a Record has been durably committed to the Replica’s local SQLite database. Every Receipt embeds a Heartbeat.
- Revision: a monotonically increasing integer assigned by the Primary to each Record. Every write operation (create, update, delete) produces a new Record at the next revision number.
- Committed Revision: the highest Revision that has been confirmed committed by the Primary (quorum met or written to object storage).
- Initial: a logical message sent by the Primary to a Replica when a `Follow` stream is first established, carrying the current Committed Revision and Compaction Revision.
- Commit: a logical message sent by the Primary to Replicas on the replication stream, advancing the Committed Revision and allowing Replicas to serve Records up to that Revision to Clients and Watches.
- Compact: a logical message sent by the Primary to Replicas on the replication stream, confirming the current Compaction Revision for watch-admission gating and per-Node compaction execution.
- Watch: a long-lived subscription to key/key-range changes that streams ordered updates (puts/deletes) in real time as they are committed (at or below the `committed_revision`).
- Bind Address: a host+port string used for binding a gRPC server to a given IP or hostname and port, e.g. `0.0.0.0:2378`.
- Advertise Address: a host+port string used for a Client or Peer to connect to the Node or Cluster, e.g. `172.16.0.1:2378` or `etcd.example.com:2378`.
- Service Discovery: how each Node learns about all other Nodes in a Cluster, including their Advertise addresses for Peer connections.
- Member ID: a stable numeric etcd-compatible member identifier assigned by the Elector and stored in object storage separately from active Node registrations.
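The Node ID / Cluster ID rules above (lowercase alphanumerics and hyphens, no leading, trailing, or consecutive hyphens, at most 32 characters) collapse into a single regular expression. A minimal sketch of such a validator; the function name is invented for this example and is not Netsy's actual API:

```python
import re

# Lowercase alphanumeric segments separated by single hyphens; the
# 32-character cap is checked separately for clarity.
_ID_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def is_valid_id(value: str) -> bool:
    """Validate a Node ID or Cluster ID per the Terminology rules."""
    return len(value) <= 32 and bool(_ID_RE.match(value))
```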
Cluster State
There are three components referred to as the “Cluster State” communicated to all Nodes:
- Current Elector Node
- Current Primary Node
- The total Node Count / number of registered Nodes, used for majority quorum calculation
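The majority quorum derived from the Node Count is the smallest strict majority of registered Nodes. A one-line helper illustrating the arithmetic (the function name is invented for this example):

```python
def majority_quorum(node_count: int) -> int:
    """Smallest number of Nodes forming a strict majority of node_count."""
    return node_count // 2 + 1
```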
Node States
Each Node has three state fields which can be read via the Peer API:
Health State:
- `Loading` during its initial startup and database backfill process.
- `Healthy` when it has completed `Loading` and is not `Degraded`.
- `Degraded` when it has failed to send any Receipt or Heartbeat after 1 immediate retry (self-degraded), when the Elector or Primary has detected 2 consecutive missed Heartbeats from the Node (Elector-degraded), or when a Replica receives a `committed_revision` from the Primary that is higher than its own latest revision and has not caught up within 2 seconds.

A Node should be considered “unhealthy” if it has been in the `Loading` or `Degraded` state after a timeout.

Elector State:
- `Leader`: the Node has been elected and is currently the Elector.
- `Follower`: the Node is not the Elector.
Primary State:
- `Replica`: the Node is a Replica and has not been elected Primary by the Elector since it started.
- `Starting`: immediately after being elected Primary, while performing “preflight checks” before becoming Active. Must not accept new writes while in this state.
- `Active`: able to accept writes (provided its Chunk Buffer is not full).
- `Draining`: needing to shut down or consistently failing to write data (or its Chunk Buffer is full). Must not accept new writes while in this state.

After a Primary finishes Draining, the Node process gives up its Primary leadership with the Elector, and restarts into the `Replica` Primary State (with a `Loading` Health State).
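The Primary State lifecycle above is a small state machine. The following sketch encodes the transitions this section describes; all names are illustrative, and whether `Starting` may move directly to `Draining` is an assumption:

```python
# Allowed Primary State transitions, per the lifecycle above:
#   Replica -> Starting   (elected Primary by the Elector)
#   Starting -> Active    (preflight checks passed)
#   Starting/Active -> Draining  (shutdown or persistent write failure; the
#                                 Starting->Draining edge is an assumption)
#   Draining -> Replica   (leadership given up, process restarts)
TRANSITIONS = {
    "Replica": {"Starting"},
    "Starting": {"Active", "Draining"},
    "Active": {"Draining"},
    "Draining": {"Replica"},
}

# Starting and Draining must not accept new writes.
ACCEPTS_WRITES = {"Active"}

def can_transition(current: str, target: str) -> bool:
    return target in TRANSITIONS.get(current, set())
```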
Requirements
- Every Node stores a copy of all KV Data in a local SQLite database. SQLite must be configured for durability to ensure that a committed transaction is actually persisted to disk before a Receipt is sent (to guarantee quorum transactions). The required configuration is:
- `PRAGMA journal_mode=WAL` — WAL (Write-Ahead Logging) for concurrent read/write performance. Must be set once when the database is opened.
- `PRAGMA synchronous=FULL` — ensures WAL writes are fsynced to disk before reporting commit success. This is critical: without it, a crash after commit but before fsync could lose data that was already receipted.
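Applied with Python's stdlib `sqlite3`, the required configuration looks roughly like this; the helper name and path handling are illustrative, not Netsy's code:

```python
import sqlite3

def open_durable(path: str) -> sqlite3.Connection:
    """Open the local store with the durability settings Netsy requires."""
    conn = sqlite3.connect(path)
    # WAL for concurrent read/write performance; the pragma returns the
    # resulting journal mode, so we can confirm it took effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    assert mode == "wal", f"unexpected journal mode: {mode}"
    # FULL: fsync the WAL before a commit is reported successful, so a
    # Receipt is only ever sent for data that is actually on disk.
    conn.execute("PRAGMA synchronous=FULL")
    return conn
```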
- Replicas can answer range (read) requests directly, but proxy any transaction (write) requests to the Primary.
- Replicas must only serve records up to the `committed_revision` received from the Primary. Records above this revision are tentative (from in-progress or rolled-back transactions) and must not be visible to clients.
- If a Replica’s Health State is `Degraded`, it must continue to serve range (read) requests, which may return stale data. If stricter read consistency guarantees are required in the future, a Replica in the `Degraded` state may instead reject or proxy read requests.
- An active Primary cannot be in a `Degraded` state — if a Primary degrades, it transitions to `Draining`, gives up leadership, and restarts as a `Replica` (see Graceful Shutdown).
- Each Node has an unencrypted HTTP endpoint `/health`, with health determined by the Health State, which can be used by systems like Kubernetes or ASGs for health-checking the Netsy process.
- The Primary writes data to its SQLite database, object storage, and all Replicas.
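The `/health` verdict follows from the Health State. A sketch of one plausible mapping; the 200/503 status codes and the `past_timeout` flag are assumptions layered on the “unhealthy” definition given under Node States:

```python
def health_status(health_state: str, past_timeout: bool) -> int:
    """HTTP status for /health.

    A Node is "unhealthy" if it has been Loading or Degraded past a
    timeout; the concrete 200/503 codes are an assumption.
    """
    unhealthy = health_state in ("Loading", "Degraded") and past_timeout
    return 503 if unhealthy else 200
```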
- In a Cluster without enough `Healthy` Replicas to meet the configured quorum threshold, writes are synchronous to the object storage bucket, and therefore it is the canonical system-of-record.
- Where there are enough `Healthy` Replicas to meet the quorum threshold, a transaction can be committed when those Replicas confirm Receipt of the Record, and writes to object storage are asynchronous/buffered. See Quorum Configuration for details.
- The Primary sends `PrimaryMessage` values to Replicas on the replication stream, where `initial` carries the current Committed Revision and current Compaction Revision for a newly established stream, `record` is treated as a logical Record message, `commit` advances the current Committed Revision, and `compact` advances the current Compaction Revision. Records above the Committed Revision are tentative and, if a rollback occurs, will be overwritten by a new transaction from the same Primary.
- Data is sent to Replicas via gRPC streams, and is stored in object storage using a custom Netsy Data File format.
- Replicas must not accept data from the Primary unless its Primary State is not `Replica` (must be `Starting`, `Active`, or `Draining`).
- The Primary must accept connections from Replicas when its Primary State is `Starting`, `Active`, or `Draining`.
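The Replica-side handling of the `PrimaryMessage` variants and the commit gating above can be sketched in a few lines. Field and method names here are assumptions for illustration; the real messages are gRPC types:

```python
from dataclasses import dataclass, field

@dataclass
class ReplicaView:
    """Tracks what a Replica may serve, per the replication stream rules."""
    committed_revision: int = 0
    compaction_revision: int = 0
    records: dict = field(default_factory=dict)  # revision -> value

    def on_initial(self, committed: int, compacted: int) -> None:
        # New stream: adopt the Primary's Committed and Compaction Revisions.
        self.committed_revision = committed
        self.compaction_revision = compacted

    def on_record(self, revision: int, value: bytes) -> None:
        # Tentative until a commit covers it; a rollback simply overwrites it
        # with a new transaction from the same Primary.
        self.records[revision] = value

    def on_commit(self, revision: int) -> None:
        self.committed_revision = revision

    def on_compact(self, revision: int) -> None:
        self.compaction_revision = revision

    def visible(self) -> dict:
        """Only Records at or below the Committed Revision may be served."""
        return {r: v for r, v in self.records.items()
                if r <= self.committed_revision}
```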
- The Elector is the only Node which can perform leader election for the Cluster to determine which Node is the Primary.
- Determining which Node is the Elector uses a leader election process separate from the one that determines which Node is the Primary. This may be referred to as a two-tier/dual-layer leader election system: one for the etcd writer/Primary role, one for the Elector role.
- This two-tier approach is used because an audit must be conducted during leader election of the Primary to ensure whichever Replica is elected has the current latest-known etcd revision number, to prevent data loss.
- The Elector leader election process uses s3lect, which coordinates through object storage, whereas the Primary leader election process is handled by the Elector itself.
- Mutual TLS (mTLS) is used for authenticating/authorising connections to Node gRPC servers.
- All TLS certificates in a Cluster share a single CA.
- Peer certificates must contain `OU=peer`, `CN=node_id`, and `O=cluster_id`. Node IDs are embedded into client certificates to prevent impersonation.
- Client certificates must contain `OU=client`, `CN=client_id`, and `O=cluster_id`.
- During the authentication flow, the server verifies `O` matches its own Cluster ID, preventing cross-cluster connections if a CA is re-used across clusters.
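The subject checks above (OU role, CN identity, O cluster scoping) amount to a small predicate over the parsed certificate subject. An illustrative sketch over a plain dict of subject attributes, not Netsy's TLS code:

```python
def authorize_peer(subject: dict, expected_cluster_id: str) -> bool:
    """Accept a peer connection only if OU/CN/O pass the checks above.

    `subject` maps X.509 attribute names ("OU", "CN", "O") to values.
    """
    return (
        subject.get("OU") == "peer"
        and bool(subject.get("CN"))                   # Node ID must be present
        and subject.get("O") == expected_cluster_id   # rejects cross-cluster certs
    )
```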
- Each Node has:
- A server certificate (used on Client and Peer gRPC servers) and a client certificate (used for outbound Peer connections).
- The `CN` (Node ID) is validated during loading/startup to ensure it matches the Node configuration.
- The server certificate SANs are validated during loading/startup to ensure they cover the configured Client, Peer, and election advertise addresses.
- A Bind address and an Advertise address for Client Node/Cluster connections and Peer Node connections.
- For Service Discovery, each Node registers itself by writing a file in object storage under the `nodes/` prefix containing its Advertise addresses.
  - The Elector also maintains a durable `members.json` file containing the Cluster ID and stable etcd `member_id -> node_id` mappings used for cluster topology APIs such as `MemberList`.
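Under these rules, the Elector's registry might look like the following sketch. The field names, member IDs, and node IDs are all invented for illustration; the description above does not publish a schema:

```python
import json

# Hypothetical shape of the Elector-maintained members.json: the Cluster ID
# plus stable etcd member_id -> node_id mappings (member IDs are numeric,
# serialized as JSON object keys here).
members = {
    "cluster_id": "example-cluster",
    "members": {
        "7587862823671970000": "node-a",
        "7587862823671970001": "node-b",
    },
}

doc = json.dumps(members, indent=2)
```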
Further Reading
- Leader Election – Netsy two-tier leader election system design.
- Netsy Data Files – Netsy (.netsy) data file format/specification.
- Storage & Replication – Netsy data storage and replication system design.
- Loading, Startup, & Shutdown – Outline of how Node Loading, Primary Startup, and graceful Node Shutdown work.
- Failure Scenarios – Data integrity and safety analysis across quorum configurations and cluster sizes.
- Watches & Compaction – Watch support & Compaction system design.
- Compatibility With etcd – Supported etcd RPCs, unsupported RPCs, and notes on compatibility differences.
- Observability – Metrics, structured logging, and debugging for Netsy clusters.