Observability

Netsy exposes Prometheus-compatible metrics and structured logs for monitoring cluster health, debugging failures, and alerting on degradation.

The observability design is intended to explain:

  • the Node’s current role and state
  • whether the Primary is using sync or quorum writes
  • whether Replicas are healthy enough for quorum
  • whether startup, catchup, draining, or election has stalled
  • whether object storage, replication, or compaction is the bottleneck

Metrics

Naming and instrumentation rules:

  • All metrics use the netsy_ prefix.
  • Use base units in metric names, e.g. durations use _seconds and sizes use _bytes.
  • Use counters for cumulative events, gauges for current state, histograms for latency/size distributions.
  • Keep labels bounded and low-cardinality. Avoid labels containing keys, revisions, object names, addresses, or error strings.
  • Role-specific metrics use netsy_primary_, netsy_elector_, or netsy_replica_ prefixes and are exposed via a custom prometheus.Collector that only emits them while the Node is currently in that role.
  • When a Node leaves a role, role-specific metrics disappear from scrape output. They are not set to 0.
  • Do not rely on gauges for short-lived workflows that may complete between scrapes. Prefer counters, histograms, and structured logs for loading stages, preflight work, elections, and flushes.
  • Standard gRPC server interceptor metrics (e.g. grpc_server_handled_total, grpc_server_handling_seconds_bucket) should be registered via go-grpc-middleware or equivalent for per-RPC observability across both Client and Peer gRPC servers. These complement the Netsy-specific metrics and are not documented individually here.

Node State

Each state metric is a gauge vector with a state label. Exactly one label value is 1 at any time and all others are 0.

Metric | Type | Labels | Description
------ | ---- | ------ | -----------
netsy_state_health | Gauge | state | Current Health State. Values: loading, healthy, degraded.
netsy_state_elector | Gauge | state | Current Elector State. Values: follower, leader.
netsy_state_primary | Gauge | state | Current Primary State. Values: replica, starting, active, draining.
netsy_process_start_time_seconds | Gauge | | Unix timestamp when this Netsy process started.
netsy_info | Gauge | version, cluster_id, node_id, quorum_config | Build and configuration info. Always 1. Useful for join-enriching dashboards and filtering by cluster or quorum mode.

Revisions

Metric | Type | Description
------ | ---- | -----------
netsy_latest_revision | Gauge | Highest revision present in this Node’s local SQLite database.
netsy_committed_revision | Gauge | Current committed_revision on this Node.
netsy_compaction_revision | Gauge | Latest accepted compaction revision on this Node.
netsy_primary_object_storage_revision | Gauge | Highest revision known by the current Primary to be durably written to object storage.

Startup, Catchup, and Draining

These metrics make long-running phases visible beyond the high-level state gauges.

Metric | Type | Labels | Description
------ | ---- | ------ | -----------
netsy_loading_stage_duration_seconds | Histogram | stage, result | Duration of individual loading stages.
netsy_loading_restarts_total | Counter | reason | Number of times the loading flow restarts.
netsy_local_db_rebuilds_total | Counter | reason | Number of times a Node discards or rebuilds local database state and starts over from snapshot, chunks, or a fresh schema.
netsy_primary_preflight_stage_duration_seconds | Histogram | stage, result | Duration of Primary preflight stages.
netsy_primary_drain_duration_seconds | Histogram | result | Time spent draining before stepping down or exiting.
netsy_primary_chunk_buffer_flushes_total | Counter | trigger, result | Chunk buffer flush attempts. trigger values: size, age, draining, rollback, manual.
netsy_primary_chunk_buffer_flush_duration_seconds | Histogram | trigger, result | Chunk buffer flush duration.

Client API (Read and Write Proxy)

These metrics cover the Client API surface, including range (read) requests served directly by any Node and write requests proxied by Replicas to the Primary.

Metric | Type | Labels | Description
------ | ---- | ------ | -----------
netsy_client_requests_total | Counter | kind, result | Client API requests handled by this Node. kind is range, txn, put, delete, compaction, or lease. result is success or error.
netsy_client_request_duration_seconds | Histogram | kind | Client API request duration.
netsy_replica_proxy_requests_total | Counter | kind, result | Write requests proxied by this Replica to the Primary.
netsy_replica_proxy_request_duration_seconds | Histogram | kind | Duration of proxied write requests, measured from the Replica’s perspective (includes network round-trip to the Primary).

Write Path

The Primary chooses between synchronous object storage writes and quorum writes based on Replica health and quorum configuration.

Metric | Type | Labels | Description
------ | ---- | ------ | -----------
netsy_primary_write_path | Gauge | path | Current write path. Exactly one label value is 1. Values: sync, quorum.
netsy_primary_quorum_rollbacks_total | Counter | reason | Number of quorum transaction rollbacks. Reasons include receipt_timeout and insufficient_receipts.
netsy_primary_write_transactions_total | Counter | path, result | Total write transactions attempted by the Primary.
netsy_primary_write_duration_seconds | Histogram | path, result | End-to-end write transaction duration.
netsy_primary_required_receipts | Gauge | | Current Replica Receipt threshold required for quorum writes.
netsy_primary_healthy_replicas | Gauge | | Number of Replicas currently counted as healthy for quorum by the Primary.
netsy_primary_receipted_replicas | Gauge | | Number of Replicas that have successfully receipted at least once and are therefore eligible to count toward quorum.

Replica Health and Replication

These metrics describe replication health from both sides: each Replica’s own Receipt freshness and the Primary’s view of the replication streams used for quorum decisions.

Metric | Type | Labels | Description
------ | ---- | ------ | -----------
netsy_replica_receipt_age_seconds | Gauge | | Seconds since this Node last successfully sent a Receipt to the Primary, computed at collect-time.
netsy_primary_replication_streams | Gauge | | Number of currently connected replication streams.

Elector Cluster View and Elections

These metrics describe the Elector’s view of registered and healthy Nodes, and the results of Primary election attempts.

Metric | Type | Labels | Description
------ | ---- | ------ | -----------
netsy_elector_registered_nodes | Gauge | | Number of currently registered Nodes in the Elector’s in-memory map.
netsy_elector_healthy_nodes | Gauge | | Number of Nodes currently in Healthy Health State according to the Elector.
netsy_elector_degraded_nodes | Gauge | | Number of Nodes currently in Degraded Health State according to the Elector.
netsy_elector_primary_elections_total | Counter | result | Primary elections run by this Node as the Elector. Values: success, failure.
netsy_elector_primary_election_failures_total | Counter | reason | Failed Primary elections by failure reason.
netsy_elector_primary_election_duration_seconds | Histogram | result | End-to-end Primary election duration.
netsy_elector_primary_election_contacts_total | Counter | result | Node contact attempts made by the Elector during Primary elections. Values: success, failure.

Chunk Buffer

Metric | Type | Description
------ | ---- | -----------
netsy_primary_chunk_buffer_records | Gauge | Number of records currently in the Chunk Buffer.
netsy_primary_chunk_buffer_bytes | Gauge | Size in bytes of all records currently in the Chunk Buffer.
netsy_primary_chunk_buffer_age_seconds | Gauge | Age in seconds of the oldest unflushed record in the Chunk Buffer, computed at collect-time. 0 when the buffer is empty.

Object Storage

Write metrics carry kind (chunk or snapshot) and mode (sync or async) labels so operators can distinguish client-facing sync writes from background buffer flushes. Read metrics carry only result — reads are off the hot path (bootstrap, preflight, discovery). Only explicitly instrumented write paths record metrics; internal metadata writes (discovery, registration) are excluded.

Metric | Type | Labels | Description
------ | ---- | ------ | -----------
netsy_object_storage_writes_total | Counter | kind, mode, result | Object storage write attempts. kind is chunk or snapshot. mode is sync or async.
netsy_object_storage_write_duration_seconds | Histogram | kind, mode, result | Object storage write duration.
netsy_object_storage_write_bytes | Histogram | kind, mode | Payload size written to object storage.
netsy_object_storage_reads_total | Counter | result | Object storage read attempts.
netsy_object_storage_read_duration_seconds | Histogram | result | Object storage read duration.

Snapshots

Snapshot creation is a Primary-only maintenance operation that compacts Chunk files into a single Snapshot file. These metrics are separate from the general object storage write metrics to give visibility into snapshot scheduling and lifecycle.

Metric | Type | Labels | Description
------ | ---- | ------ | -----------
netsy_primary_snapshot_creations_total | Counter | result | Snapshot creation attempts by the Primary.
netsy_primary_snapshot_creation_duration_seconds | Histogram | result | End-to-end duration of snapshot creation, including reading records from SQLite and uploading to object storage.
netsy_primary_snapshot_age_seconds | Gauge | | Seconds since the last successful snapshot was created, computed at collect-time. Useful for alerting when snapshots are not being created on schedule.

Retries

Every retry path has a corresponding counter so operators can see degradation before it becomes a failure.

Metric | Type | Labels | Description
------ | ---- | ------ | -----------
netsy_retries_total | Counter | operation | Retry attempts by operation. Values include object_storage_write, heartbeat_send, receipt_send, compaction_confirmation, election_contact, node_registration.

Service Discovery and Registration

Metric | Type | Labels | Description
------ | ---- | ------ | -----------
netsy_node_registration_duration_seconds | Histogram | result | Duration of a Node’s registration attempt with the Elector during loading.
netsy_elector_auto_deregistrations_total | Counter | | Number of Nodes automatically deregistered by the Elector after exceeding elector.deregistration_timeout.

Watches and Compaction

Note: netsy_watch_min_revision is emitted per Node. Detecting compaction-blocking skew requires comparing this metric across Prometheus instance labels (e.g. min(netsy_watch_min_revision) by (instance)).

Metric | Type | Labels | Description
------ | ---- | ------ | -----------
netsy_watchers | Gauge | | Number of connected Watchers on this Node.
netsy_watches | Gauge | | Number of active Watches on this Node.
netsy_watch_min_revision | Gauge | | Minimum revision across active Watches on this Node. If there are no active Watches, this equals netsy_committed_revision.
netsy_compaction_duration_seconds | Histogram | result | Duration of local compaction work on an individual Node after a compaction revision has been accepted.
netsy_primary_compactions_total | Counter | result | Compaction coordination runs initiated by the current Primary.
netsy_primary_compaction_coordination_duration_seconds | Histogram | result | Duration of a cluster-wide compaction coordination run on the Primary.
netsy_primary_compaction_confirmation_failures_total | Counter | reason | Failed compaction confirmations by reason.

Structured Logging

Netsy uses structured logs for all operationally significant events. Each log entry includes at minimum:

  • msg
  • cluster_id
  • node_id
  • timestamp

Logs should use stable event names and stable key names. Put detailed error text in error. Put short, bounded failure categories in reason so logs can be aggregated and alerted on safely.

State and Lifecycle Events

Logged whenever a Node changes state or enters or exits a major lifecycle phase.

Key | Description
--- | -----------
msg | state_transition, loading_stage_started, loading_stage_completed, loading_restarted, primary_preflight_stage_started, primary_preflight_stage_completed, drain_started, drain_completed
state_type | health, elector, or primary for state_transition
previous | Previous state value
new | New state value
stage | Current loading, preflight, or drain stage
reason | Bounded trigger or failure reason
duration_ms | Stage duration when completed
error | Optional error text

When local database state is newly created, discarded, or rebuilt, Netsy should emit dedicated lifecycle logs such as local_db_initialized, local_db_rebuild_started, and local_db_rebuild_completed with a bounded reason field.

Election Events

Logged when an election starts, advances, completes, or fails.

Key | Description
--- | -----------
msg | election_started, election_stage_completed, election_completed, or election_failed
role | elector or primary
stage | Election stage name
elected_node_id | The Node elected on completion
reason | Failure reason on failure
registered_nodes | Number of registered Nodes
contacted_nodes | Number of Nodes successfully contacted
healthy_candidates | Number of healthy candidate Replicas considered
duration_ms | Election duration

Write Path and Transaction Events

Logged when the Primary switches write mode or when a write fails in an operationally interesting way.

Key | Description
--- | -----------
msg | write_path_switched, write_transaction (debug level), quorum_rollback, write_failed, chunk_buffer_flush_started, chunk_buffer_flush_completed
path | Current write path
from | Previous write path on switch
to | New write path on switch
reason | Bounded trigger or failure reason
required_receipts | Quorum threshold at the time
received_receipts | Number of Receipts received for the transaction
healthy_replicas | Primary’s healthy Replica count
revision | Assigned revision for write_transaction
trigger | Chunk buffer flush trigger
duration_ms | Flush or write duration
error | Optional error text

The write_transaction message is logged at debug level for every completed write transaction on the Primary. It includes path, revision, healthy_replicas, required_receipts, received_receipts (quorum only), and duration_ms. This is intentionally debug-level to avoid noise during normal operation, but is invaluable for diagnosing why individual transactions chose a particular write path.

Registration Events

Logged by the Elector when Node registration changes.

Key | Description
--- | -----------
msg | node_registered or node_deregistered
target_node_id | The Node that registered or deregistered
member_id | The stable etcd member_id assigned or re-used
trigger | startup, direct, or auto
reason | Bounded reason for deregistration (e.g. timeout, shutdown, manual)
duration_ms | Registration duration for node_registered
error | Optional error text on registration failure

Object Storage and Compaction Events

Key | Description
--- | -----------
msg | object_storage_write, compaction_started, compaction_notice_failed, compaction_completed
kind | chunk or snapshot
mode | sync, async, or maintenance
revision | Relevant revision for the operation
compaction_revision | Compaction revision when relevant
reason | Bounded failure reason
duration_ms | Operation duration
error | Optional error text

Debugging

Key Relationships

When diagnosing issues, the following metric relationships are useful:

  • Quorum eligibility: compare netsy_primary_healthy_replicas and netsy_primary_required_receipts. If healthy Replicas fall below the required threshold, netsy_primary_write_path{path="sync"} becomes 1.
  • Replica-specific degradation: compare netsy_primary_healthy_replicas, netsy_primary_required_receipts, netsy_primary_replication_streams, and each Replica’s own netsy_replica_receipt_age_seconds. Use structured logs to identify which specific Replica is timing out, disconnected, or no longer counted toward quorum.
  • Replication lag: compare netsy_latest_revision and netsy_committed_revision across Nodes to identify Replicas that are behind the current committed point.
  • Object storage lag: on the Primary, compare netsy_latest_revision and netsy_primary_object_storage_revision to measure how far async object storage writes are behind quorum-committed data.
  • Buffer pressure: rising netsy_primary_chunk_buffer_bytes together with rising netsy_primary_chunk_buffer_age_seconds suggests async flushes are not keeping up.
  • Retry pressure: rising netsy_retries_total for operation="object_storage_write", operation="heartbeat_send", or operation="receipt_send" indicates degradation in progress. A sustained increase in object storage write retries may precede the Primary entering Draining.
  • Loading stalls: if netsy_state_health{state="loading"} remains 1, inspect recent loading_stage_started and loading_stage_completed logs together with netsy_loading_stage_duration_seconds to see which startup step is slow or failing.
  • Local DB rebuild churn: increases in netsy_local_db_rebuilds_total indicate repeated local-database resets or rebuilds. Correlate with local_db_rebuild_started and local_db_rebuild_completed logs to see the trigger.
  • Election stalls: inspect netsy_elector_primary_election_duration_seconds, netsy_elector_primary_election_failures_total, netsy_elector_primary_election_contacts_total, and election logs to determine whether the cluster is blocked on prior-Primary contact, node contactability, or candidate validation.
  • Compaction stalls: rising netsy_watch_min_revision skew across Nodes (compare by Prometheus instance label), long netsy_compaction_duration_seconds, or repeated increments in netsy_primary_compaction_confirmation_failures_total indicate watch-admission, confirmation, or local compaction-work issues.
  • Snapshot staleness: a rising netsy_primary_snapshot_age_seconds or repeated netsy_primary_snapshot_creations_total{result="error"} increments indicate snapshots are not being created, which increases loading/recovery time for new Nodes.
  • Registration issues: rising netsy_retries_total{operation="node_registration"} or long netsy_node_registration_duration_seconds during loading indicate Service Discovery or Elector connectivity problems. Correlate with node_registered / node_deregistered logs.
  • Client API health: rising netsy_client_requests_total{result="error"} or elevated netsy_client_request_duration_seconds indicates client-facing degradation. Compare netsy_replica_proxy_requests_total error rates on Replicas to determine whether failures originate from the Replica’s proxy path or the Primary itself.

Alerting Notes

  • Alerts should be written with Prometheus staleness semantics in mind. Role-specific metrics disappear when a Node is no longer the Primary or Elector.
  • Prefer alerting on sustained conditions using for: rather than single scrape failures.
  • Prefer counters for rate-based alerts and gauges for current-role or current-health dashboards.
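
Put together, these notes translate into Prometheus rules along these lines. A hedged sketch only: the alert names, thresholds, and durations are illustrative and not shipped with Netsy:

```yaml
groups:
  - name: netsy-example
    rules:
      # Rate-based alert on a counter, sustained via "for:" so a single
      # bad scrape does not fire the alert.
      - alert: NetsyClientErrorRateHigh
        expr: sum by (instance) (rate(netsy_client_requests_total{result="error"}[5m])) > 0.1
        for: 10m
      # netsy_primary_* series disappear when a Node leaves the Primary
      # role, so this expression only evaluates against the current Primary
      # and goes stale (rather than false) during a failover.
      - alert: NetsyPrimaryOnSyncPath
        expr: netsy_primary_write_path{path="sync"} == 1
        for: 5m
```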