Observability
Netsy exposes Prometheus-compatible metrics and structured logs for monitoring cluster health, debugging failures, and alerting on degradation.
The observability design is intended to answer:
- what the Node's current role and state are
- whether the Primary is using `sync` or `quorum` writes
- whether Replicas are healthy enough for quorum
- whether startup, catchup, draining, or election has stalled
- whether object storage, replication, or compaction is the bottleneck
Metrics
Naming and instrumentation rules:
- All metrics use the `netsy_` prefix.
- Use base units in metric names, e.g. durations use `_seconds`, sizes use `_bytes`.
- Use counters for cumulative events, gauges for current state, histograms for latency/size distributions.
- Keep labels bounded and low-cardinality. Avoid labels containing keys, revisions, object names, addresses, or error strings.
- Role-specific metrics use `netsy_primary_`, `netsy_elector_`, or `netsy_replica_` prefixes and are exposed via a custom `prometheus.Collector` that only emits them while the Node is currently in that role.
- When a Node leaves a role, role-specific metrics disappear from scrape output. They are not set to `0`.
- Do not rely on gauges for short-lived workflows that may complete between scrapes. Prefer counters, histograms, and structured logs for loading stages, preflight work, elections, and flushes.
- Standard gRPC server interceptor metrics (e.g. `grpc_server_handled_total`, `grpc_server_handling_seconds_bucket`) should be registered via go-grpc-middleware or equivalent for per-RPC observability across both Client and Peer gRPC servers. These complement the Netsy-specific metrics and are not documented individually here.
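The role-gating rule above can be sketched as follows. This is a minimal illustration of the idea behind the custom collector, not Netsy's actual implementation: the `roleCollector` type and `collect` method are hypothetical, and a real implementation would satisfy `prometheus.Collector` from client_golang instead of returning a map.

```go
package main

import "fmt"

// roleCollector is an illustrative sketch of a role-scoped collector:
// its metrics are emitted only while the Node currently holds the role.
type roleCollector struct {
	role   string             // role this collector belongs to, e.g. "primary"
	values map[string]float64 // metric name -> current value
}

// collect emits metrics only when the Node's current role matches.
// Otherwise it emits nothing at all: the metrics are absent from
// scrape output, not set to 0.
func (c *roleCollector) collect(currentRole string) map[string]float64 {
	if currentRole != c.role {
		return nil
	}
	out := make(map[string]float64, len(c.values))
	for name, v := range c.values {
		out[name] = v
	}
	return out
}

func main() {
	c := &roleCollector{
		role:   "primary",
		values: map[string]float64{"netsy_primary_healthy_replicas": 2},
	}
	fmt.Println(len(c.collect("primary"))) // prints 1: emitted while Primary
	fmt.Println(len(c.collect("replica"))) // prints 0: absent after stepping down
}
```

Emitting nothing (rather than `0`) is what makes Prometheus staleness semantics work for alerts scoped to the current Primary or Elector.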
Node State
Each state metric is a gauge vector with a state label. Exactly one label value is 1 at any time and all others are 0.
| Metric | Type | Labels | Description |
|---|---|---|---|
netsy_state_health | Gauge | state | Current Health State. Values: loading, healthy, degraded. |
netsy_state_elector | Gauge | state | Current Elector State. Values: follower, leader. |
netsy_state_primary | Gauge | state | Current Primary State. Values: replica, starting, active, draining. |
netsy_process_start_time_seconds | Gauge | | Unix timestamp when this Netsy process started. |
netsy_info | Gauge | version, cluster_id, node_id, quorum_config | Build and configuration info. Always 1. Useful for join-enriching dashboards and filtering by cluster or quorum mode. |
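The "exactly one label value is 1" convention can be sketched as below. `setState` is a hypothetical helper, not Netsy's actual API; with client_golang the same effect would be achieved on a `GaugeVec` by zeroing every known state and setting the active one.

```go
package main

import "fmt"

// setState keeps a state gauge vector consistent: every known state is
// reset to 0, then the active state is set to 1, so exactly one label
// value is 1 at any time. Illustrative sketch only.
func setState(gauge map[string]float64, states []string, active string) {
	for _, s := range states {
		gauge[s] = 0
	}
	gauge[active] = 1
}

func main() {
	health := map[string]float64{}
	states := []string{"loading", "healthy", "degraded"}
	setState(health, states, "loading")
	setState(health, states, "healthy") // transition: loading -> healthy
	fmt.Println(health["loading"], health["healthy"], health["degraded"]) // prints "0 1 0"
}
```

Resetting all states before setting the new one is what prevents two label values from being 1 simultaneously during a transition.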
Revisions
| Metric | Type | Description |
|---|---|---|
netsy_latest_revision | Gauge | Highest revision present in this Node’s local SQLite database. |
netsy_committed_revision | Gauge | Current committed_revision on this Node. |
netsy_compaction_revision | Gauge | Latest accepted compaction revision on this Node. |
netsy_primary_object_storage_revision | Gauge | Highest revision known by the current Primary to be durably written to object storage. |
Startup, Catchup, and Draining
These metrics make long-running phases visible beyond the high-level state gauges.
| Metric | Type | Labels | Description |
|---|---|---|---|
netsy_loading_stage_duration_seconds | Histogram | stage, result | Duration of individual loading stages. |
netsy_loading_restarts_total | Counter | reason | Number of times the loading flow restarts. |
netsy_local_db_rebuilds_total | Counter | reason | Number of times a Node discards or rebuilds local database state and starts over from snapshot, chunks, or a fresh schema. |
netsy_primary_preflight_stage_duration_seconds | Histogram | stage, result | Duration of Primary preflight stages. |
netsy_primary_drain_duration_seconds | Histogram | result | Time spent draining before stepping down or exiting. |
netsy_primary_chunk_buffer_flushes_total | Counter | trigger, result | Chunk buffer flush attempts. trigger values: size, age, draining, rollback, manual. |
netsy_primary_chunk_buffer_flush_duration_seconds | Histogram | trigger, result | Chunk buffer flush duration. |
Client API (Read and Write Proxy)
These metrics cover the Client API surface, including range (read) requests served directly by any Node and write requests proxied by Replicas to the Primary.
| Metric | Type | Labels | Description |
|---|---|---|---|
netsy_client_requests_total | Counter | kind, result | Client API requests handled by this Node. kind is range, txn, put, delete, compaction, or lease. result is success or error. |
netsy_client_request_duration_seconds | Histogram | kind | Client API request duration. |
netsy_replica_proxy_requests_total | Counter | kind, result | Write requests proxied by this Replica to the Primary. |
netsy_replica_proxy_request_duration_seconds | Histogram | kind | Duration of proxied write requests, measured from the Replica’s perspective (includes network round-trip to Primary). |
Write Path
The Primary chooses between synchronous object storage writes and quorum writes based on Replica health and quorum configuration.
| Metric | Type | Labels | Description |
|---|---|---|---|
netsy_primary_write_path | Gauge | path | Current write path. Exactly one label value is 1. Values: sync, quorum. |
netsy_primary_quorum_rollbacks_total | Counter | reason | Number of quorum transaction rollbacks. Reasons include receipt_timeout and insufficient_receipts. |
netsy_primary_write_transactions_total | Counter | path, result | Total write transactions attempted by the Primary. |
netsy_primary_write_duration_seconds | Histogram | path, result | End-to-end write transaction duration. |
netsy_primary_required_receipts | Gauge | | Current Replica Receipt threshold required for quorum writes. |
netsy_primary_healthy_replicas | Gauge | | Number of Replicas currently counted as healthy for quorum by the Primary. |
netsy_primary_receipted_replicas | Gauge | | Number of Replicas that have successfully receipted at least once and are therefore eligible to count toward quorum. |
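The write-path decision described above can be sketched as a simple threshold check. `chooseWritePath` is an illustrative helper, not Netsy's actual function; it encodes only what this document states, namely that the write path falls back to `sync` when healthy Replicas drop below the required receipt threshold.

```go
package main

import "fmt"

// chooseWritePath sketches the Primary's write-path decision: quorum
// writes require enough healthy Replicas to reach the receipt
// threshold; otherwise the Primary falls back to synchronous object
// storage writes. Illustrative only.
func chooseWritePath(healthyReplicas, requiredReceipts int) string {
	if healthyReplicas >= requiredReceipts {
		return "quorum"
	}
	return "sync" // netsy_primary_write_path{path="sync"} becomes 1
}

func main() {
	fmt.Println(chooseWritePath(2, 2)) // prints "quorum"
	fmt.Println(chooseWritePath(1, 2)) // prints "sync"
}
```

On dashboards this corresponds to comparing `netsy_primary_healthy_replicas` against `netsy_primary_required_receipts` and watching `netsy_primary_write_path`.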
Replica Health and Replication
These metrics describe the Primary’s view of Replica health and quorum eligibility for write decisions.
| Metric | Type | Labels | Description |
|---|---|---|---|
netsy_replica_receipt_age_seconds | Gauge | | Seconds since this Node last successfully sent a Receipt to the Primary, computed at collect-time. |
netsy_primary_replication_streams | Gauge | | Number of currently connected replication streams. |
Elector Cluster View and Elections
These metrics describe the Elector’s view of registered and healthy Nodes, and the results of Primary election attempts.
| Metric | Type | Labels | Description |
|---|---|---|---|
netsy_elector_registered_nodes | Gauge | | Number of currently registered Nodes in the Elector's in-memory map. |
netsy_elector_healthy_nodes | Gauge | | Number of Nodes currently in Healthy Health State according to the Elector. |
netsy_elector_degraded_nodes | Gauge | | Number of Nodes currently in Degraded Health State according to the Elector. |
netsy_elector_primary_elections_total | Counter | result | Primary elections run by this Node as the Elector. Values: success, failure. |
netsy_elector_primary_election_failures_total | Counter | reason | Failed Primary elections by failure reason. |
netsy_elector_primary_election_duration_seconds | Histogram | result | End-to-end Primary election duration. |
netsy_elector_primary_election_contacts_total | Counter | result | Node contact attempts made by the Elector during Primary elections. Values: success, failure. |
Chunk Buffer
| Metric | Type | Description |
|---|---|---|
netsy_primary_chunk_buffer_records | Gauge | Number of records currently in the Chunk Buffer. |
netsy_primary_chunk_buffer_bytes | Gauge | Size in bytes of all records currently in the Chunk Buffer. |
netsy_primary_chunk_buffer_age_seconds | Gauge | Age in seconds of the oldest unflushed record in the Chunk Buffer, computed at collect-time. 0 when buffer is empty. |
Object Storage
Write metrics carry kind (chunk or snapshot) and mode (sync or async) labels so operators can distinguish client-facing sync writes from background buffer flushes. Read metrics carry only result — reads are off the hot path (bootstrap, preflight, discovery). Only explicitly instrumented write paths record metrics; internal metadata writes (discovery, registration) are excluded.
| Metric | Type | Labels | Description |
|---|---|---|---|
netsy_object_storage_writes_total | Counter | kind, mode, result | Object storage write attempts. kind is chunk or snapshot. mode is sync or async. |
netsy_object_storage_write_duration_seconds | Histogram | kind, mode, result | Object storage write duration. |
netsy_object_storage_write_bytes | Histogram | kind, mode | Payload size written to object storage. |
netsy_object_storage_reads_total | Counter | result | Object storage read attempts. |
netsy_object_storage_read_duration_seconds | Histogram | result | Object storage read duration. |
Snapshots
Snapshot creation is a Primary-only maintenance operation that compacts Chunk files into a single Snapshot file. These metrics are separate from the general object storage write metrics to give visibility into snapshot scheduling and lifecycle.
| Metric | Type | Labels | Description |
|---|---|---|---|
netsy_primary_snapshot_creations_total | Counter | result | Snapshot creation attempts by the Primary. |
netsy_primary_snapshot_creation_duration_seconds | Histogram | result | End-to-end duration of snapshot creation, including reading records from SQLite and uploading to object storage. |
netsy_primary_snapshot_age_seconds | Gauge | | Seconds since the last successful snapshot was created, computed at collect-time. Useful for alerting when snapshots are not being created on schedule. |
Retries
Every retry path has a corresponding counter so operators can see degradation before it becomes a failure.
| Metric | Type | Labels | Description |
|---|---|---|---|
netsy_retries_total | Counter | operation | Retry attempts by operation. Values include object_storage_write, heartbeat_send, receipt_send, compaction_confirmation, election_contact, node_registration. |
Service Discovery and Registration
| Metric | Type | Labels | Description |
|---|---|---|---|
netsy_node_registration_duration_seconds | Histogram | result | Duration of a Node’s registration attempt with the Elector during loading. |
netsy_elector_auto_deregistrations_total | Counter | | Number of Nodes automatically deregistered by the Elector after exceeding `elector.deregistration_timeout`. |
Watches and Compaction
Note: `netsy_watch_min_revision` is emitted per Node. Detecting compaction-blocking skew requires comparing this metric across Prometheus instance labels (e.g. `min(netsy_watch_min_revision) by (instance)`).
| Metric | Type | Labels | Description |
|---|---|---|---|
netsy_watchers | Gauge | | Number of connected Watchers on this Node. |
netsy_watches | Gauge | | Number of active Watches on this Node. |
netsy_watch_min_revision | Gauge | | Minimum revision across active Watches on this Node. If there are no active Watches, this equals netsy_committed_revision. |
netsy_compaction_duration_seconds | Histogram | result | Duration of local compaction work on an individual Node after a compaction revision has been accepted. |
netsy_primary_compactions_total | Counter | result | Compaction coordination runs initiated by the current Primary. |
netsy_primary_compaction_coordination_duration_seconds | Histogram | result | Duration of a cluster-wide compaction coordination run on the Primary. |
netsy_primary_compaction_confirmation_failures_total | Counter | reason | Failed compaction confirmations by reason. |
Structured Logging
Netsy uses structured logs for all operationally significant events. Each log entry includes at minimum:
- `msg`
- `cluster_id`
- `node_id`
- timestamp
Logs should use stable event names and stable key names. Put detailed error text in `error`. Put short, bounded failure categories in `reason` so logs can be aggregated and alerted on safely.
State and Lifecycle Events
Logged whenever a Node changes state or enters or exits a major lifecycle phase.
| Key | Description |
|---|---|
msg | state_transition, loading_stage_started, loading_stage_completed, loading_restarted, primary_preflight_stage_started, primary_preflight_stage_completed, drain_started, drain_completed |
state_type | health, elector, or primary for state_transition |
previous | Previous state value |
new | New state value |
stage | Current loading, preflight, or drain stage |
reason | Bounded trigger or failure reason |
duration_ms | Stage duration when completed |
error | Optional error text |
When local database state is newly created, discarded, or rebuilt, Netsy should emit dedicated lifecycle logs such as local_db_initialized, local_db_rebuild_started, and local_db_rebuild_completed with a bounded reason field.
Election Events
Logged when an election starts, advances, completes, or fails.
| Key | Description |
|---|---|
msg | election_started, election_stage_completed, election_completed, or election_failed |
role | elector or primary |
stage | Election stage name |
elected_node_id | The Node elected on completion |
reason | Failure reason on failure |
registered_nodes | Number of registered Nodes |
contacted_nodes | Number of Nodes successfully contacted |
healthy_candidates | Number of healthy candidate Replicas considered |
duration_ms | Election duration |
Write Path and Transaction Events
Logged when the Primary switches write mode or when a write fails in an operationally interesting way.
| Key | Description |
|---|---|
msg | write_path_switched, write_transaction (debug level), quorum_rollback, write_failed, chunk_buffer_flush_started, chunk_buffer_flush_completed |
path | Current write path |
from | Previous write path on switch |
to | New write path on switch |
reason | Bounded trigger or failure reason |
required_receipts | Quorum threshold at the time |
received_receipts | Number of Receipts received for the transaction |
healthy_replicas | Primary’s healthy Replica count |
revision | Assigned revision for write_transaction |
trigger | Chunk buffer flush trigger |
duration_ms | Flush or write duration |
error | Optional error text |
The write_transaction message is logged at debug level for every completed write transaction on the Primary. It includes path, revision, healthy_replicas, required_receipts, received_receipts (quorum only), and duration_ms. This is intentionally debug-level to avoid noise during normal operation, but is invaluable for diagnosing why individual transactions chose a particular write path.
Registration Events
Logged by the Elector when Node registration changes.
| Key | Description |
|---|---|
msg | node_registered or node_deregistered |
target_node_id | The Node that registered or deregistered |
member_id | The stable etcd member_id assigned or re-used |
trigger | startup, direct, or auto |
reason | Bounded trigger for deregistration (e.g. timeout, shutdown, manual) |
duration_ms | Registration duration for node_registered |
error | Optional error text on registration failure |
Object Storage and Compaction Events
| Key | Description |
|---|---|
msg | object_storage_write, compaction_started, compaction_notice_failed, compaction_completed |
kind | chunk or snapshot |
mode | sync, async, or maintenance |
revision | Relevant revision for the operation |
compaction_revision | Compaction revision when relevant |
reason | Bounded failure reason |
duration_ms | Operation duration |
error | Optional error text |
Debugging
Key Relationships
When diagnosing issues, the following metric relationships are useful:
- Quorum eligibility: compare `netsy_primary_healthy_replicas` and `netsy_primary_required_receipts`. If healthy Replicas fall below the required threshold, `netsy_primary_write_path{path="sync"}` becomes `1`.
- Replica-specific degradation: compare `netsy_primary_healthy_replicas`, `netsy_primary_required_receipts`, `netsy_primary_replication_streams`, and each Replica's own `netsy_replica_receipt_age_seconds`. Use structured logs to identify which specific Replica is timing out, disconnected, or no longer counted toward quorum.
- Replication lag: compare `netsy_latest_revision` and `netsy_committed_revision` across Nodes to identify Replicas that are behind the current committed point.
- Object storage lag: on the Primary, compare `netsy_latest_revision` and `netsy_primary_object_storage_revision` to measure how far async object storage writes are behind quorum-committed data.
- Buffer pressure: rising `netsy_primary_chunk_buffer_bytes` together with rising `netsy_primary_chunk_buffer_age_seconds` suggests async flushes are not keeping up.
- Retry pressure: rising `netsy_retries_total` for `operation="object_storage_write"`, `operation="heartbeat_send"`, or `operation="receipt_send"` indicates degradation in progress. A sustained increase in object storage write retries may precede the Primary entering Draining.
- Loading stalls: if `netsy_state_health{state="loading"}` remains `1`, inspect recent `loading_stage_started` and `loading_stage_completed` logs together with `netsy_loading_stage_duration_seconds` to see which startup step is slow or failing.
- Local DB rebuild churn: increases in `netsy_local_db_rebuilds_total` indicate repeated local-database resets or rebuilds. Correlate with `local_db_rebuild_started` and `local_db_rebuild_completed` logs to see the trigger.
- Election stalls: inspect `netsy_elector_primary_election_duration_seconds`, `netsy_elector_primary_election_failures_total`, `netsy_elector_primary_election_contacts_total`, and election logs to determine whether the cluster is blocked on prior-Primary contact, node contactability, or candidate validation.
- Compaction stalls: rising `netsy_watch_min_revision` skew across Nodes (compare by Prometheus `instance` label), long `netsy_compaction_duration_seconds`, or repeated increments in `netsy_primary_compaction_confirmation_failures_total` indicate watch-admission, confirmation, or local compaction-work issues.
- Snapshot staleness: a rising `netsy_primary_snapshot_age_seconds` or repeated `netsy_primary_snapshot_creations_total{result="error"}` increments indicate snapshots are not being created, which increases loading/recovery time for new Nodes.
- Registration issues: rising `netsy_retries_total{operation="node_registration"}` or long `netsy_node_registration_duration_seconds` during loading indicate Service Discovery or Elector connectivity problems. Correlate with `node_registered`/`node_deregistered` logs.
- Client API health: rising `netsy_client_requests_total{result="error"}` or elevated `netsy_client_request_duration_seconds` indicates client-facing degradation. Compare `netsy_replica_proxy_requests_total` error rates on Replicas to determine whether failures originate from the Replica's proxy path or the Primary itself.
Alerting Notes
- Alerts should be written with Prometheus staleness semantics in mind. Role-specific metrics disappear when a Node is no longer the Primary or Elector.
- Prefer alerting on sustained conditions using `for:` rather than single scrape failures.
- Prefer counters for rate-based alerts and gauges for current-role or current-health dashboards.
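A sustained-condition alert following these notes might look like the sketch below. The alert names, thresholds, and durations are illustrative assumptions, not shipped defaults; only the metric names come from this document.

```yaml
groups:
  - name: netsy
    rules:
      # Primary has fallen back to synchronous object storage writes.
      # `for:` avoids firing on a single scrape; tune per deployment.
      - alert: NetsyPrimaryOnSyncPath
        expr: netsy_primary_write_path{path="sync"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Primary is on the sync write path (quorum unavailable)"

      # Snapshots are not being created on schedule, increasing
      # loading/recovery time for new Nodes.
      - alert: NetsySnapshotsStale
        expr: netsy_primary_snapshot_age_seconds > 86400
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "No successful snapshot in over 24 hours"
```

Note that both expressions use Primary-only metrics, so Prometheus staleness marks them absent (and resolves the alerts) when the Node steps down, as described above.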