Prometheus Node Exporter and Grafana

Hardware Diagnostics with Prometheus

Prometheus is an open-source system monitoring and alerting toolkit that shows which statistics are available for a system, even under failure conditions.

  • Prometheus scrapes metrics from instrumented jobs, running rules over this data to record aggregated time series or to generate alerts. 

  • Grafana and other API consumers can visualize the collected data.

The Prometheus Node Exporter is included with Swarm for monitoring and diagnostics on the machines in a Swarm cluster, to provide a wide variety of hardware and kernel-related metrics.

Configuring the Node Exporter

The Storage setting required for Node Exporter is enabled by default: metrics.enableNodeExporter = True. If the setting has been disabled, a cluster reboot is required to re-enable it.

If needed, change how frequently the export occurs. Make the change through the Swarm UI or SNMP on the running cluster: metrics.nodeExporterFrequency = 120

Adding Grafana Dashboards

DataCore has published public Grafana dashboards for monitoring Swarm products and features to visualize this Prometheus data. Check here for the latest dashboards for the versions of Swarm products being used: 

Customized dashboards are available for the following products:

Swarm System Monitoring (Select the dashboard for the version of Storage)

  • Visualizations include cluster health, capacity, indexing, licensing, temperature, and network and CPU loads:

     

  • Cluster-wide operations:

Swarm Node View (new since v12.0)

  • Detail view of a single Swarm node:

Gateway Monitoring

Note

Some statistics show a value only after S3 operations have run against the Gateway.

  • Visualizations include CPU load, operations, connections, and HTTP status codes:

Swarm Search (new since v15)

  • Visualizations include Elasticsearch 7.5.x metrics.

Video Clipping

Optional

Video Clipping is optional.

  • Gateway / Content UI adds Video Clipping as an optional feature for Partial File Restore.

  • Visualizations include numbers, rates, and error counts for video clipping requests.

  • The errors are counted by stage (preprocessing, processing, postprocessing), to help with troubleshooting:

Importing a Dashboard

  1. Navigate to https://grafana.com/get to obtain a free hosted instance of Grafana (1 user, 5 dashboards).

  2. View the desired dashboard page and select Copy ID to Clipboard to get the ID for the dashboard:

  3. Open the dashboard search on the Grafana instance and then click Import to import a dashboard.

  4. Paste in the ID when prompted:

  5. Verify the name is correct once the dashboard is found.

Important

Set the Folder option to make the dashboard visible. The folder "General" is available by default.

  6. In the remaining import steps, Grafana prompts for the data source and for any metric prefixes (if the dashboard uses any).

Troubleshooting "No Data" Errors

There are multiple points at which things can go wrong in the pipeline from collecting data to displaying charts. The following is the process for troubleshooting No Data errors in graphs.

Checking Endpoints

Services monitored by Prometheus (Swarm nodes and Gateways) expose an endpoint (usually port 9100).

Swarm: In the Swarm UI, under Cluster Settings > Advanced, verify metrics.nodeExporterFrequency=120. If the cluster is not on the latest Swarm release, metrics.enableNodeExporter=True must be set explicitly. Test the endpoint:

curl http://SWARM_NODE:9100/metrics
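The same check can be scripted. The following is a minimal Python sketch, not part of Swarm, that parses the text returned by the metrics endpoint and keeps only the Swarm-specific series. It assumes the endpoint returns the standard Prometheus text exposition format without per-sample timestamps. The function names are illustrative, and SWARM_NODE is the same placeholder used in the curl command.

```python
import urllib.request


def parse_metrics(text, prefix="caringo_swarm_"):
    """Return {series: value} for sample lines whose name starts with prefix."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is "<name>{<labels>} <value>"; split on the last space.
        # Assumption: no trailing timestamp follows the value.
        series, _, value = line.rpartition(" ")
        if series.startswith(prefix):
            samples[series] = float(value)
    return samples


def fetch_metrics(host, port=9100, timeout=5):
    """Download the raw metrics text from a node (port 9100 is the usual port)."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example usage (replace SWARM_NODE with a node name or IP address):
#   swarm_samples = parse_metrics(fetch_metrics("SWARM_NODE"))
#   print(len(swarm_samples), "Swarm series found")
```

An empty result from parse_metrics while the endpoint still answers suggests the exporter is reachable but is not publishing Swarm statistics, which points back to the two settings above.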

Prometheus: Prometheus polls those endpoints as configured in /etc/prometheus/prometheus.yml. Check the status of the targets:

http://PROMETHEUS:9090/targets
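The targets page has a JSON counterpart in the Prometheus HTTP API (/api/v1/targets), which is convenient for scripted checks. The following Python sketch lists the active targets that Prometheus does not report as up. The function names are illustrative, and PROMETHEUS is the same placeholder used in the URL above.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """Return (scrape URL, last error) for every active target that is not up."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (target.get("scrapeUrl", ""), target.get("lastError", ""))
        for target in targets
        if target.get("health") != "up"
    ]


def fetch_targets(base_url, timeout=5):
    """Fetch the decoded JSON body of /api/v1/targets from a Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as r:
        return json.load(r)


# Example usage (replace PROMETHEUS with the server name or IP address):
#   for url, error in unhealthy_targets(fetch_targets("http://PROMETHEUS:9090")):
#       print(url, error)
```

A target that is listed here with a connection error usually means the endpoint check in the previous step fails from the Prometheus host as well.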

Configuring Elasticsearch Exporter

A new Grafana dashboard, Swarm Search v7, has been added to SwarmTelemetry. The dashboard uses a Prometheus exporter called elasticsearch_exporter, which runs as a service on SwarmTelemetry. Set the target Elasticsearch host IP in /usr/lib/systemd/system/elasticsearch_exporter.service.

To change the IP:

  1. Edit /usr/lib/systemd/system/elasticsearch_exporter.service and set the --es.uri parameter to the IP address of one of the Elasticsearch nodes.

  2. Reload systemd, then enable and start the service:

systemctl daemon-reload
systemctl enable elasticsearch_exporter
systemctl start elasticsearch_exporter

By default, this IP points to the internal IP address of the SwarmSearch VM on the Swarm storage network. See https://github.com/prometheus-community/elasticsearch_exporter for the available Elasticsearch metrics.
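To confirm which Elasticsearch address the exporter is pointed at, the --es.uri value can be read back from the service unit file. The following Python sketch extracts it from the unit file text. The function name is hypothetical, and the sample ExecStart line in the comment is illustrative only; the actual binary path and options on SwarmTelemetry may differ.

```python
import re


def find_es_uri(unit_text):
    """Return the value passed to --es.uri in a unit file, or None if absent."""
    # Accept both "--es.uri=VALUE" and "--es.uri VALUE".
    match = re.search(r"--es\.uri[=\s]+(\S+)", unit_text)
    if match is None:
        return None
    # Drop surrounding quotes if the value was quoted in the unit file.
    return match.group(1).strip("\"'")


# Example usage with an illustrative unit file fragment:
#   text = "ExecStart=/path/to/elasticsearch_exporter --es.uri=http://10.10.10.5:9200"
#   print(find_es_uri(text))
```

Reading the value back after an edit is a quick way to catch a typo before restarting the service.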

Checking Grafana

  1. Verify Grafana has a "Prometheus" data source and that it is set as the default.
    The dashboards use this data source automatically when they are imported; edit a panel to confirm.

  2. Verify Grafana is at least version 9.3.2.
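The version check can also be scripted. The following Python sketch compares a Grafana version string against the 9.3.2 minimum. It assumes the Grafana health endpoint (/api/health) is reachable and includes a version field in its JSON response; some installations are configured to hide the version from anonymous requests, in which case the version has to be read from the Grafana UI instead. GRAFANA is a placeholder host name and the function names are illustrative.

```python
import json
import urllib.request


def version_tuple(version):
    """Convert a version string such as "9.3.2" to a tuple of integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            # Stop at the first non-digit so suffixes such as "-beta" are ignored.
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def meets_minimum(version, minimum="9.3.2"):
    """Return True if version is at least the minimum version."""
    return version_tuple(version) >= version_tuple(minimum)


def grafana_version(base_url, timeout=5):
    """Read the version field from the health endpoint (may be absent)."""
    with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as r:
        return json.load(r).get("version")


# Example usage (replace GRAFANA with the Grafana host and port):
#   version = grafana_version("http://GRAFANA:3000")
#   print(version, meets_minimum(version) if version else "version not reported")
```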

Node Exporter Statistics

The following table lists the Swarm statistics exported for Prometheus. Where possible, these statistics are correlated with MIB entries, although the scales may differ.
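Several of the metrics in the table report a state as a numeric code. When building alerts or reports, a small lookup makes those codes readable. The following Python sketch copies the code lists from the cluster, node, feed, and volume state rows of the table; the function name and the fallback text for unlisted codes are illustrative.

```python
# State codes for caringo_swarm_cluster_state and caringo_swarm_node_state.
NODE_STATES = {
    -1: "unknown", 0: "ok", 1: "idle", 2: "mounting", 3: "initializing",
    4: "finalizing", 5: "maintenance", 6: "retiring", 7: "retired",
    8: "error", 9: "unavailable", 10: "offline",
}

# State codes for caringo_swarm_feeds_feed_state.
FEED_STATES = {
    -1: "unknown", 0: "closed", 1: "config-error",
    2: "too many overlapping feeds", 3: "blocked", 4: "paused by request",
    5: "paused for recovery", 6: "priority", 7: "ok",
}

# State codes for caringo_swarm_volume_state.
VOLUME_STATES = {
    -1: "unknown", 0: "ok", 1: "retiring", 2: "retired",
    3: "unavailable", 4: "mounting", 5: "idle",
}

STATE_TABLES = {
    "caringo_swarm_cluster_state": NODE_STATES,
    "caringo_swarm_node_state": NODE_STATES,
    "caringo_swarm_feeds_feed_state": FEED_STATES,
    "caringo_swarm_volume_state": VOLUME_STATES,
}


def describe_state(metric_name, value):
    """Translate a numeric state sample into the name given in the table."""
    table = STATE_TABLES[metric_name]
    code = int(value)
    return table.get(code, f"unrecognized ({code})")


# Example usage:
#   print(describe_state("caringo_swarm_node_state", 5.0))
```

Note that the same number means different things for different metrics: 7 is "retired" for a node but "ok" for a feed, so the lookup is keyed by metric name.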

Metric Name
(Blue indicates cluster-level scope)

Label(s)

Value Meaning

Related SNMP Entry Name(s)

caringo_swarm_cluster_license_capacity_tb

cluster_name

Cluster capacity in terabytes.

totalGBLicensedCapacity

caringo_swarm_cluster_license_days_remaining

cluster_name

Integer number of days remaining on the license.

 

caringo_swarm_cluster_license_enabled

cluster_name

1 for license enabled.  0 for not enabled.

 

caringo_swarm_cluster_state

cluster_name

-1 = unknown; 0 = ok; 1 = idle; 2 = mounting; 3 = initializing; 4 = finalizing; 5 = maintenance; 6 = retiring; 7 = retired; 8 = error; 9 = unavailable; 10 = offline

clusterState

caringo_swarm_index_overlay_state

 

The overlay index status. 2=authoritative; 1=operational; 0 otherwise.

indexOverlayStatus

caringo_swarm_index_overlay_inflating

 

Whether the overlay index is inflating on this node. 1 = true; 0 = false.

indexOverlayInflating

caringo_swarm_index_overlay_attractors

 

The number of desired attractors.

indexOverlayDesiredAttractors

caringo_swarm_feeds_deleted_pending

feed_name, feed_type

The number of deleted object events waiting to be processed.

feedNodeDeletesUnprocessed

caringo_swarm_feeds_deleted_retrying

feed_name, feed_type

The number of deleted object events needing to be retried.

feedNodeDeletesFailing

caringo_swarm_feeds_deleted_successful

feed_name, feed_type

The number of deleted object events successfully processed.

feedNodeDeletesSuccess

caringo_swarm_feeds_deleted_unqualified

feed_name, feed_type

The number of deleted object events potentially requiring processing.

feedNodeDeletesUnqualified

caringo_swarm_feeds_est_backlog_clear_time

feed_name, feed_type

The estimated number of seconds to complete all processing.  -1 for unknown.

feedEstBacklogClearTime

caringo_swarm_feeds_existing_pending

feed_name, feed_type

The number of current object events waiting to be processed.

feedNodeExistsUnprocessed

caringo_swarm_feeds_existing_retrying

feed_name, feed_type

The number of current object events needing to be retried.

feedNodeExistsFailing

caringo_swarm_feeds_existing_successful

feed_name, feed_type

The number of current object events successfully processed.

feedNodeExistsSuccess

caringo_swarm_feeds_existing_unqualified

feed_name, feed_type

The number of current object events potentially requiring processing.

feedNodeExistsUnqualified

caringo_swarm_feeds_feed_id

feed_name, feed_type

The id number of the feed.

feedFeedId

caringo_swarm_feeds_feed_state

feed_name, feed_type

-1 = unknown; 0 = closed; 1 = config-error; 2 = too many overlapping feeds; 3 = blocked; 4 = paused by request; 5 = paused for recovery; 6 = priority (processing contexts after start/restart); 7 = ok

feedState

caringo_swarm_feeds_last_failure

feed_name, feed_type

The time of the last failure event in epoch milliseconds.

feedLastExistFailure, feedLastDeleteFailure, feedLastVersionedFailure

caringo_swarm_feeds_last_success

feed_name, feed_type

The time of the last successful event in epoch milliseconds.

feedLastSuccess

caringo_swarm_feeds_remote_failure

feed_name, feed_type

The number of replication/indexing failures.

feedPluginRemoteFailure

caringo_swarm_feeds_remote_success_duplicate

feed_name, feed_type

The number of duplicate indexing/replication successes.

feedPluginRemoteSuccessDuplicate

caringo_swarm_feeds_remote_success_transfer

feed_name, feed_type

The number of new indexing/replication successes.

feedPluginRemoteSuccessTransfer

caringo_swarm_feeds_versioned_pending

feed_name, feed_type

The number of versioned object events waiting to be processed.

feedNodeVersionedUnprocessed

caringo_swarm_feeds_versioned_retrying

feed_name, feed_type

The number of versioned object events needing to be retried.

feedNodeVersionedFailing

caringo_swarm_feeds_versioned_successful

feed_name, feed_type

The number of versioned object events successfully processed.

feedNodeVersionedSuccess

caringo_swarm_feeds_versioned_unqualified

feed_name, feed_type

The number of versioned object events potentially requiring processing.

feedNodeVersionedUnqualified

caringo_swarm_health_cycle

 

The HP cycle number.

ongoingHPCycleNumber

caringo_swarm_health_examined

 

The number of streams examined so far this HP cycle.

ongoingHPCycleStreamsExamined

caringo_swarm_health_offloaded

 

The number of streams moved to another node this HP cycle.

ongoingHPCycleStreamsOffloaded

caringo_swarm_health_relocated

 

The number of streams relocated on disk this HP cycle.

ongoingHPCycleStreamsRelocated

caringo_swarm_health_total

 

The number of streams processed so far this HP cycle.

ongoingHPCycleStreamsTotal

caringo_swarm_health_verified

 

The number of streams checked for data integrity this HP cycle.

ongoingHPCycleStreamsVerified

caringo_swarm_index_alias_slots

 

The number of memory index slots used for alias objects.

indexSlotsAlias

caringo_swarm_index_deleted_slots

 

The number of memory index slots used for deleted objects.

indexSlotsDeleted

caringo_swarm_index_immutable_slots

 

The number of memory index slots used for immutable objects.

indexSlotsImmutable

caringo_swarm_index_manifest_slots

 

Not useful.  Always 0.

indexSlotsManifest

caringo_swarm_index_mutable_slots

 

The number of memory index slots used for named+alias objects.

indexSlotsMutable

caringo_swarm_index_named_slots

 

The number of memory index slots used for named objects.

indexSlotsNamed

caringo_swarm_index_overlay_slots

 

The number of memory index slots used for the overlay index.

indexSlotsOverlayUsed

caringo_swarm_index_policy_slots

 

The number of memory index slots used for policy attributes.

indexSlotsPolicy

caringo_swarm_index_total_slots

 

The number of memory index slots total.

indexSlotsTotal

caringo_swarm_index_used_slots

 

The number of memory index slots used.

indexSlotsUsed

caringo_swarm_index_versioned_slots

 

The number of memory index slots used for prior object versions.

indexSlotsVersioned

caringo_swarm_memory_cache_memory_allocated

 

The memory allocated to the content cache in bytes.

contentCacheCapacityMB

caringo_swarm_memory_cache_memory_items

 

The number of objects stored in the content cache.

contentCacheItems

caringo_swarm_memory_cache_memory_used

 

The memory used to store objects in the content cache in bytes.

contentCacheUsedMB

caringo_swarm_memory_chassis_arena

 

The bytes of memory available for use by Swarm on the chassis.

chassisArenaM

caringo_swarm_memory_chassis_free

 

The bytes of memory free on the chassis.

chassisFreeMemM

caringo_swarm_memory_chassis_headroom

 

The bytes of memory reserved for emergency use on the chassis.

chassisHeadroomM

caringo_swarm_memory_chassis_shared

 

The bytes of shared memory used on the chassis.

chassisSharedMemM

caringo_swarm_memory_chassis_total

 

The bytes of physical memory on the chassis.

chassisTotalMemM

caringo_swarm_memory_node_accounted

 

Bytes of buffer memory in use in the main process.

accountedMemM

caringo_swarm_memory_node_accounts

 

Number of memory accounts in use in the main process.

memAccountsActual

caringo_swarm_memory_node_accounts_over_budget

 

Number of memory accounts over budget in the main process.

memAccountsOverlimit

caringo_swarm_memory_node_accounts_throttled

 

Number of memory accounts throttled in the main process.

memAccountsQueued

caringo_swarm_memory_node_actual

 

Total bytes in use for the main process.

processActualSizeM

caringo_swarm_memory_node_allowance

 

Total buffer memory allocated in the main process.

accountAllowanceM

caringo_swarm_memory_node_file_descriptors

 

Number of file descriptors in use by the main process.

processFDs

caringo_swarm_memory_node_non_accounted

 

Bytes of non-accounted memory in use in the main process.

nonAccountedMemM

caringo_swarm_memory_node_target

 

Main process target size in bytes.

processTargetSizeM

caringo_swarm_node_errors

 

The number of reported errors on the node.

castorErrTableSize

caringo_swarm_node_examq

 

The number of examination queue entries on the node.

examQueueCount

caringo_swarm_node_state

 

-1 = unknown; 0 = ok; 1 = idle; 2 = mounting; 3 = initializing; 4 = finalizing; 5 = maintenance; 6 = retiring; 7 = retired; 8 = error; 9 = unavailable; 10 = offline

castorState

caringo_swarm_node_swarm_version

version

Value is always 1.

castorVersion

caringo_swarm_node_uptime

 

The uptime of the main process in seconds.

sysUpTimeInstance

caringo_swarm_node_volumes

 

The number of volumes in use on the node.

castorVolumes

caringo_swarm_scsp_appends

 

The delta since last publication or the total number of APPEND requests.

appends

caringo_swarm_scsp_appends_total

 

 

 

caringo_swarm_scsp_client_close_read

 

The delta since last publication or the total number of client premature closes on read-type requests.

clientPrematureCloseRead

caringo_swarm_scsp_client_close_read_total

 

 

 

caringo_swarm_scsp_client_close_write

 

The delta since last publication or the total number of client premature closes on write-type requests.

clientPrematureCloseWrite

caringo_swarm_scsp_client_close_write_total

 

 

 

caringo_swarm_scsp_copies

 

The delta since last publication or the total number of COPY requests.

copies

caringo_swarm_scsp_copies_total

 

 

 

caringo_swarm_scsp_deletes

 

The delta since last publication or the total number of DELETE requests.

deletes

caringo_swarm_scsp_deletes_total

 

 

 

caringo_swarm_scsp_gets

 

The delta since last publication or the total number of GET requests.

reads

caringo_swarm_scsp_gets_total

 

 

 

caringo_swarm_scsp_heads

 

The delta since last publication or the total number of HEAD requests.

infos

caringo_swarm_scsp_heads_total

 

 

 

caringo_swarm_scsp_indirectDeletes

 

The delta since last publication or the total number of deletes performed internally by the health processor.

indirectDeletes

caringo_swarm_scsp_indirectDeletes_total

 

 

 

caringo_swarm_scsp_internode_reads

 

The delta since last publication or the total number of GET requests internally performed.

internodeReads

caringo_swarm_scsp_internode_reads_total

 

 

 

caringo_swarm_scsp_internode_redirects

 

The delta since last publication or the total number of client redirects between nodes in the cluster.

redirects

caringo_swarm_scsp_internode_redirects_total

 

 

 

caringo_swarm_scsp_internode_trims

 

The delta since last publication or the total number of replicas internally removed.

internodeTrims

caringo_swarm_scsp_internode_trims_total

 

 

 

caringo_swarm_scsp_internode_writes

 

The delta since last publication or the total number of POST requests internally performed.

internodeWrites

caringo_swarm_scsp_internode_writes_total

 

 

 

caringo_swarm_scsp_patches

 

The delta since last publication or the total number of PATCH requests.

patches

caringo_swarm_scsp_patches_total

 

 

 

caringo_swarm_scsp_posts

 

The delta since last publication or the total number of POST requests.

writes

caringo_swarm_scsp_posts_total

 

 

 

caringo_swarm_scsp_processes_active

 

The number of SCSP processes that are active.

 

caringo_swarm_scsp_puts

 

The delta since last publication or the total number of PUT requests.

updates

caringo_swarm_scsp_puts_total

 

 

 

caringo_swarm_scsp_response_200

 

The delta since last publication or the total number of SCSP 200 responses.

clientSuccess200

caringo_swarm_scsp_response_200_total

 

 

 

caringo_swarm_scsp_response_201

 

The delta since last publication or the total number of SCSP 201 responses.

clientSuccess201

caringo_swarm_scsp_response_201_total

 

 

 

caringo_swarm_scsp_response_202

 

The delta since last publication or the total number of SCSP 202 responses.

clientSuccess202

caringo_swarm_scsp_response_202_total

 

 

 

caringo_swarm_scsp_response_206

 

The delta since last publication or the total number of SCSP 206 responses.

clientSuccess206

caringo_swarm_scsp_response_206_total

 

 

 

caringo_swarm_scsp_response_301

 

The delta since last publication or the total number of 301 redirect responses.

clientRedir301

caringo_swarm_scsp_response_301_total

 

 

 

caringo_swarm_scsp_response_304

 

The delta since last publication or the total number of 304 redirect responses.

clientRedir304

caringo_swarm_scsp_response_304_total

 

 

 

caringo_swarm_scsp_response_400

 

The delta since last publication or the total number of 400 error responses.

clientError400

caringo_swarm_scsp_response_400_total

 

 

 

caringo_swarm_scsp_response_401

 

The delta since last publication or the total number of 401 error responses.

clientError401

caringo_swarm_scsp_response_401_total

 

 

 

caringo_swarm_scsp_response_404

 

The delta since last publication or the total number of 404 error responses.

clientError404

caringo_swarm_scsp_response_404_total

 

 

 

caringo_swarm_scsp_response_410

 

The delta since last publication or the total number of 410 error responses.

clientError410

caringo_swarm_scsp_response_410_total

 

 

 

caringo_swarm_scsp_response_412

 

The delta since last publication or the total number of 412 error responses.

clientError412

caringo_swarm_scsp_response_412_total

 

 

 

caringo_swarm_scsp_response_4xx

 

The delta since last publication or the total number of other 400-type error responses.

clientError4xx

caringo_swarm_scsp_response_4xx_total

 

 

 

caringo_swarm_scsp_response_500
caringo_swarm_scsp_response_500_total

 

The delta since last publication or the total number of 500 error responses.

clientError500

caringo_swarm_scsp_response_503
caringo_swarm_scsp_response_503_total

 

The delta since last publication or the total number of 503 error responses.

clientError503

caringo_swarm_scsp_response_507
caringo_swarm_scsp_response_507_total

 

The delta since last publication or the total number of 507 error responses.

clientError507

caringo_swarm_scsp_response_5xx
caringo_swarm_scsp_response_5xx_total

 

The delta since last publication or the total number of other 500-type error responses.

clientError5xx

caringo_swarm_scsp_searches
caringo_swarm_scsp_searches_total

 

The delta since last publication or the total number of search requests.

searches

caringo_swarm_volume_capacity

volume_dev, volume_id

The volume capacity in bytes.

volMaxMbytes

caringo_swarm_volume_ecrs

volume_dev, volume_id

The number of EC recoveries ongoing against this volume.

recoveryType, recoveryLocalVolId

caringo_swarm_volume_errors

volume_dev, volume_id

The number of reported IO errors on the volume.

volErrors

caringo_swarm_volume_free

volume_dev, volume_id

The number of free bytes on the volume.

volFreeMbytes

caringo_swarm_volume_fvrs

volume_dev, volume_id

The number of failed volume recoveries ongoing against this volume.

recoveryType, recoveryLocalVolId

caringo_swarm_volume_journal_utilization

volume_dev, volume_id

The portion of the volume journal space in use.

volLastJournalBid

caringo_swarm_volume_logical_objects

volume_dev, volume_id

The contribution to estimated cluster logical objects from this volume.

logicalObjects

caringo_swarm_volume_logical_space

volume_dev, volume_id

The contribution to estimated cluster logical space (in bytes) from this volume.

logicalSpace

caringo_swarm_volume_logical_unprocessed

volume_dev, volume_id

The number of streams on the volume not considered for the logical object/space estimates.

logicalUnprocessed

caringo_swarm_volume_read_bid

volume_dev, volume_id

The last read bid for the volume.

lastRead

caringo_swarm_volume_rep_bid

volume_dev, volume_id

The last replicate bid for the volume.

lastWrite

caringo_swarm_volume_state

volume_dev, volume_id

The status of the given volume name and ID. Statuses: 0 (OK), 1 (retiring), 2 (retired), 3 (unavailable), 4 (mounting), 5 (idle), -1 (unknown).

volState

caringo_swarm_volume_stats_io_queue_count

volume_dev, volume_id

The number of IO queue items on the last sampling.

 

caringo_swarm_volume_stats_io_queue_sec

volume_dev, volume_id

The time in seconds to process items on the IO queue at the last sampling.

 

caringo_swarm_volume_stats_io_utilization

volume_dev, volume_id

The fraction of the time at the last sampling the volume was busy.

 

caringo_swarm_volume_stats_sec_per_io_max

volume_dev, volume_id

The longest IO request time at the last sampling.

 

caringo_swarm_volume_stats_sec_per_io_running

volume_dev, volume_id

The average IO request time at the last sampling. 

 

caringo_swarm_volume_streams

volume_dev, volume_id

The number of streams on the volume.

volUsedstreams

caringo_swarm_volume_trapped

volume_dev, volume_id

The trapped space on the volume in bytes.

volTrappedMbytes

caringo_swarm_volume_uptime

volume_dev, volume_id

The time in seconds the volume has been up.

volUptime

caringo_swarm_volume_used

volume_dev, volume_id

The number of bytes used on the volume.

volUsedMbytes

caringo_swarm_volume_write_bid

volume_dev, volume_id

The last written bid for the volume.

 

caringo_swarm_feeds_remote_disconnects_last_hour

feed_name, feed_type

The number of remote disconnections in the last hour.

feedPluginRemoteDisconnectsLastHour

caringo_gateway_request_count

protocol="s3", scope="MultiDelete", method="POST"

Total number of S3 multidelete requests. Counts requests, not individual deleted objects.

 

caringo_gateway_status_code_count

protocol="s3", scope="MultiDelete", method="POST", status

Total number of S3 multidelete requests, with the HTTP result code reported in the status label.

 


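Metrics from the table can be combined in a PromQL expression and retrieved through the Prometheus HTTP API. The following Python sketch builds an instant query for the used fraction of each volume (caringo_swarm_volume_used divided by caringo_swarm_volume_capacity) and reads the result. It assumes both metrics carry the same label set so that Prometheus can match them; PROMETHEUS is a placeholder host name and the function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

# Used fraction per volume; both metrics are reported in bytes.
QUERY = "caringo_swarm_volume_used / caringo_swarm_volume_capacity"


def query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def vector_values(payload):
    """Map each result's volume_id label to its float value."""
    results = payload.get("data", {}).get("result", [])
    return {
        item["metric"].get("volume_id", ""): float(item["value"][1])
        for item in results
    }


def run_query(base_url, promql, timeout=5):
    """Run an instant query and return the decoded JSON response."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=timeout) as r:
        return json.load(r)


# Example usage (replace PROMETHEUS with the server name or IP address):
#   for volume_id, fraction in vector_values(run_query("http://PROMETHEUS:9090", QUERY)).items():
#       print(volume_id, f"{fraction:.1%}")
```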

 

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.