Hardware Diagnostics with Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit that lets you view which statistics are available for a system, even under failure conditions.

  • Prometheus scrapes metrics from instrumented jobs and runs rules over this data, either to record aggregated time series or to generate alerts.

  • Grafana and other API consumers can visualize the collected data.

The Prometheus Node Exporter is included with Swarm for monitoring and diagnostics on the machines in a Swarm cluster. It provides a wide variety of hardware and kernel-related metrics.
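
As a rough illustration of what scraped data looks like, the following sketch parses a few lines in the Prometheus text exposition format. The metric names are taken from the table on this page; the sample label values and numbers are invented, and the parser is deliberately simplified (it ignores optional timestamps and other details of the format). A real deployment would normally let Prometheus itself, or an established client library, do this work.

```python
import re

# Hypothetical scrape output. Metric names come from the table on this page;
# the label values and numbers are invented for illustration only.
SAMPLE = '''\
# HELP caringo_swarm_node_volumes The number of volumes in use on the node.
caringo_swarm_node_volumes 4
caringo_swarm_volume_free{volume_dev="/dev/sda",volume_id="abc123"} 5.0e+11
caringo_swarm_volume_errors{volume_dev="/dev/sda",volume_id="abc123"} 0
'''

# name, optional {labels}, then the sample value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
)
# key="value" pairs inside the braces (allows backslash escapes in the value)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comment lines such as # HELP
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        labels = dict(LABEL_RE.findall(match.group('labels') or ''))
        samples.append((match.group('name'), labels, float(match.group('value'))))
    return samples


if __name__ == '__main__':
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

Each tuple carries the metric name, a dictionary of labels (empty for node-level metrics without labels), and the numeric value.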

Configuring the Node Exporter

...

Info

Statistics renamed

In Swarm 11, the naming of the statistics has been globalized for clarity: the prefix metrics_ is now caringo_swarm_. (v11.0)
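
Dashboards or alert rules written before Swarm 11 may still refer to the old names. The sketch below shows one way to translate them, assuming the rename is a pure prefix substitution and the rest of each name is unchanged (the note above implies this but does not spell it out). The function name and the example old name `metrics_node_uptime` are hypothetical.

```python
OLD_PREFIX = "metrics_"
NEW_PREFIX = "caringo_swarm_"


def to_v11_name(name):
    """Map a pre-11 metric name to its Swarm 11 form.

    Assumes only the prefix changed; names without the old prefix are
    returned untouched.
    """
    if name.startswith(OLD_PREFIX):
        return NEW_PREFIX + name[len(OLD_PREFIX):]
    return name


if __name__ == "__main__":
    # "metrics_node_uptime" is a hypothetical pre-11 name used for illustration.
    print(to_v11_name("metrics_node_uptime"))
```

Verify each translated name against the table below before relying on it, since this page does not list the old names individually.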

Metric Name
(blue indicates cluster-level scope)

Label(s)

Value Meaning

Related SNMP Entry Name(s)

caringo_swarm_cluster_license_capacity_tb

cluster_name

Cluster capacity in terabytes.

totalGBLicensedCapacity

caringo_swarm_cluster_license_days_remaining

cluster_name

Integer number of days remaining on the license.

 

caringo_swarm_cluster_license_enabled

cluster_name

1 = license enabled; 0 = not enabled.

 

caringo_swarm_cluster_state

cluster_name

-1 = unknown; 0 = ok; 1 = idle; 2 = mounting; 3 = initializing; 4 = finalizing; 5 = maintenance; 6 = retiring; 7 = retired; 8 = error; 9 = unavailable; 10 = offline

clusterState

caringo_swarm_index_overlay_state

The overlay index status. 2=authoritative; 1=operational; 0 otherwise.

indexOverlayStatus

caringo_swarm_index_overlay_inflating

Whether the overlay index is inflating on this node. 1 = true; 0 = false.

indexOverlayInflating

caringo_swarm_index_overlay_attractors

The number of desired attractors.

indexOverlayDesiredAttractors

caringo_swarm_feeds_deleted_pending

feed_name, feed_type

The number of deleted object events waiting to be processed.

feedNodeDeletesUnprocessed

caringo_swarm_feeds_deleted_retrying

feed_name, feed_type

The number of deleted object events needing to be retried.

feedNodeDeletesFailing

caringo_swarm_feeds_deleted_successful

feed_name, feed_type

The number of deleted object events successfully processed.

feedNodeDeletesSuccess

caringo_swarm_feeds_deleted_unqualified

feed_name, feed_type

The number of deleted object events potentially requiring processing.

feedNodeDeletesUnqualified

caringo_swarm_feeds_est_backlog_clear_time

feed_name, feed_type

The estimated number of seconds to complete all processing.  -1 for unknown.

feedEstBacklogClearTime

caringo_swarm_feeds_existing_pending

feed_name, feed_type

The number of current object events waiting to be processed.

feedNodeExistsUnprocessed

caringo_swarm_feeds_existing_retrying

feed_name, feed_type

The number of current object events needing to be retried.

feedNodeExistsFailing

caringo_swarm_feeds_existing_successful

feed_name, feed_type

The number of current object events successfully processed.

feedNodeExistsSuccess

caringo_swarm_feeds_existing_unqualified

feed_name, feed_type

The number of current object events potentially requiring processing.

feedNodeExistsUnqualified

caringo_swarm_feeds_feed_id

feed_name, feed_type

The id number of the feed.

feedFeedId

caringo_swarm_feeds_feed_state

feed_name, feed_type

-1 = unknown; 0 = closed; 1 = config-error; 2 = too many overlapping feeds; 3 = blocked; 4 = paused by request; 5 = paused for recovery; 6 = priority (processing contexts after start/restart); 7 = ok

feedState

caringo_swarm_feeds_last_failure

feed_name, feed_type

The time of the last failure event in epoch milliseconds.

feedLastExistFailure, feedLastDeleteFailure, feedLastVersionedFailure

caringo_swarm_feeds_last_success

feed_name, feed_type

The time of the last successful event in epoch milliseconds.

feedLastSuccess

caringo_swarm_feeds_remote_failure

feed_name, feed_type

The number of replication/indexing failures.

feedPluginRemoteFailure

caringo_swarm_feeds_remote_success_duplicate

feed_name, feed_type

The number of duplicate indexing/replication successes.

feedPluginRemoteSuccessDuplicate

caringo_swarm_feeds_remote_success_transfer

feed_name, feed_type

The number of new indexing/replication successes.

feedPluginRemoteSuccessTransfer

caringo_swarm_feeds_versioned_pending

feed_name, feed_type

The number of versioned object events waiting to be processed.

feedNodeVersionedUnprocessed

caringo_swarm_feeds_versioned_retrying

feed_name, feed_type

The number of versioned object events needing to be retried.

feedNodeVersionedFailing

caringo_swarm_feeds_versioned_successful

feed_name, feed_type

The number of versioned object events successfully processed.

feedNodeVersionedSuccess

caringo_swarm_feeds_versioned_unqualified

feed_name, feed_type

The number of versioned object events potentially requiring processing.

feedNodeVersionedUnqualified

caringo_swarm_health_cycle

 

The HP cycle number.

ongoingHPCycleNumber

caringo_swarm_health_examined

 

The number of streams examined so far this HP cycle.

ongoingHPCycleStreamsExamined

caringo_swarm_health_offloaded

 

The number of streams moved to another node this HP cycle.

ongoingHPCycleStreamsOffloaded

caringo_swarm_health_relocated

 

The number of streams relocated on disk this HP cycle.

ongoingHPCycleStreamsRelocated

caringo_swarm_health_total

 

The number of streams processed so far this HP cycle.

ongoingHPCycleStreamsTotal

caringo_swarm_health_verified

 

The number of streams checked for data integrity this HP cycle.

ongoingHPCycleStreamsVerified

caringo_swarm_index_alias_slots

 

The number of memory index slots used for alias objects.

indexSlotsAlias

caringo_swarm_index_deleted_slots

 

The number of memory index slots used for deleted objects.

indexSlotsDeleted

caringo_swarm_index_immutable_slots

 

The number of memory index slots used for immutable objects.

indexSlotsImmutable

caringo_swarm_index_manifest_slots

 

Not useful.  Always 0.

indexSlotsManifest

caringo_swarm_index_mutable_slots

 

The number of memory index slots used for named+alias objects.

indexSlotsMutable

caringo_swarm_index_named_slots

 

The number of memory index slots used for named objects.

indexSlotsNamed

caringo_swarm_index_overlay_slots

 

The number of memory index slots used for the overlay index.

indexSlotsOverlayUsed

caringo_swarm_index_policy_slots

 

The number of memory index slots used for policy attributes.

indexSlotsPolicy

caringo_swarm_index_total_slots

 

The number of memory index slots total.

indexSlotsTotal

caringo_swarm_index_used_slots

 

The number of memory index slots used.

indexSlotsUsed

caringo_swarm_index_versioned_slots

 

The number of memory index slots used for prior object versions.

indexSlotsVersioned

caringo_swarm_memory_cache_memory_allocated

 

The memory allocated to the content cache in bytes.

contentCacheCapacityMB

caringo_swarm_memory_cache_memory_items

 

The number of objects stored in the content cache.

contentCacheItems

caringo_swarm_memory_cache_memory_used

 

The memory used to store objects in the content cache in bytes.

contentCacheUsedMB

caringo_swarm_memory_chassis_arena

 

The bytes of memory available for use by Swarm on the chassis.

chassisArenaM

caringo_swarm_memory_chassis_free

 

The bytes of memory free on the chassis.

chassisFreeMemM

caringo_swarm_memory_chassis_headroom

 

The bytes of memory reserved for emergency use on the chassis.

chassisHeadroomM

caringo_swarm_memory_chassis_shared

 

The bytes of shared memory used on the chassis.

chassisSharedMemM

caringo_swarm_memory_chassis_total

 

The bytes of physical memory on the chassis.

chassisTotalMemM

caringo_swarm_memory_node_accounted

 

Bytes of buffer memory in use in the main process.

accountedMemM

caringo_swarm_memory_node_accounts

 

Number of memory accounts in use in the main process.

memAccountsActual

caringo_swarm_memory_node_accounts_over_budget

 

Number of memory accounts over budget in the main process.

memAccountsOverlimit

caringo_swarm_memory_node_accounts_throttled

 

Number of memory accounts throttled in the main process.

memAccountsQueued

caringo_swarm_memory_node_actual

 

Total bytes in use for the main process.

processActualSizeM

caringo_swarm_memory_node_allowance

 

Total buffer memory allocated in the main process.

accountAllowanceM

caringo_swarm_memory_node_file_descriptors

 

Number of file descriptors in use by the main process.

processFDs

caringo_swarm_memory_node_non_accounted

 

Bytes of non-accounted memory in use in the main process.

nonAccountedMemM

caringo_swarm_memory_node_target

 

Main process target size in bytes.

processTargetSizeM

caringo_swarm_node_errors

 

The number of reported errors on the node.

castorErrTableSize

caringo_swarm_node_examq

 

The number of examination queue entries on the node.

examQueueCount

caringo_swarm_node_state

 

-1 = unknown; 0 = ok; 1 = idle; 2 = mounting; 3 = initializing; 4 = finalizing; 5 = maintenance; 6 = retiring; 7 = retired; 8 = error; 9 = unavailable; 10 = offline

castorState

caringo_swarm_node_swarm_version

version

Value is always 1.

castorVersion

caringo_swarm_node_uptime

 

The uptime of the main process in seconds.

sysUpTimeInstance

caringo_swarm_node_volumes

 

The number of volumes in use on the node.

castorVolumes

caringo_swarm_scsp_appends

 

The delta since last publication or the total number of APPEND requests.

appends

caringo_swarm_scsp_appends_total

 

 

 

caringo_swarm_scsp_client_close_read

 

The delta since last publication or the total number of client premature closes on read-type requests.

clientPrematureCloseRead

caringo_swarm_scsp_client_close_read_total

 

 

 

caringo_swarm_scsp_client_close_write

 

The delta since last publication or the total number of client premature closes on write-type requests.

clientPrematureCloseWrite

caringo_swarm_scsp_client_close_write_total

 

 

 

caringo_swarm_scsp_copies

 

The delta since last publication or the total number of COPY requests.

copies

caringo_swarm_scsp_copies_total

 

 

 

caringo_swarm_scsp_deletes

 

The delta since last publication or the total number of DELETE requests.

deletes

caringo_swarm_scsp_deletes_total

 

 

 

caringo_swarm_scsp_gets

 

The delta since last publication or the total number of GET requests.

reads

caringo_swarm_scsp_gets_total

 

 

 

caringo_swarm_scsp_heads

 

The delta since last publication or the total number of HEAD requests.

infos

caringo_swarm_scsp_heads_total

 

 

 

caringo_swarm_scsp_indirectDeletes

 

The delta since last publication or the total number of deletes performed internally by the health processor.

indirectDeletes

caringo_swarm_scsp_indirectDeletes_total

 

 

 

caringo_swarm_scsp_internode_reads

 

The delta since last publication or the total number of GET requests internally performed.

internodeReads

caringo_swarm_scsp_internode_reads_total

 

 

 

caringo_swarm_scsp_internode_redirects

 

The delta since last publication or the total number of client redirects between nodes in the cluster.

redirects

caringo_swarm_scsp_internode_redirects_total

 

 

 

caringo_swarm_scsp_internode_trims

 

The delta since last publication or the total number of replicas internally removed.

internodeTrims

caringo_swarm_scsp_internode_trims_total

 

 

 

caringo_swarm_scsp_internode_writes

 

The delta since last publication or the total number of POST requests internally performed.

internodeWrites

caringo_swarm_scsp_internode_writes_total

 

 

 

caringo_swarm_scsp_patches

 

The delta since last publication or the total number of PATCH requests.

patches

caringo_swarm_scsp_patches_total

 

 

 

caringo_swarm_scsp_posts

 

The delta since last publication or the total number of POST requests.

writes

caringo_swarm_scsp_posts_total

 

 

 

caringo_swarm_scsp_processes_active

 

The number of SCSP processes that are active.

 

caringo_swarm_scsp_puts

 

The delta since last publication or the total number of PUT requests.

updates

caringo_swarm_scsp_puts_total

 

 

 

caringo_swarm_scsp_response_200

 

The delta since last publication or the total number of SCSP 200 responses.

clientSuccess200

caringo_swarm_scsp_response_200_total

 

 

 

caringo_swarm_scsp_response_201

 

The delta since last publication or the total number of SCSP 201 responses.

clientSuccess201

caringo_swarm_scsp_response_201_total

 

 

 

caringo_swarm_scsp_response_202

 

The delta since last publication or the total number of SCSP 202 responses.

clientSuccess202

caringo_swarm_scsp_response_202_total

 

 

 

caringo_swarm_scsp_response_206

 

The delta since last publication or the total number of SCSP 206 responses.

clientSuccess206

caringo_swarm_scsp_response_206_total

 

 

 

caringo_swarm_scsp_response_301

 

The delta since last publication or the total number of 301 redirect responses.

clientRedir301

caringo_swarm_scsp_response_301_total

 

 

 

caringo_swarm_scsp_response_304

 

The delta since last publication or the total number of 304 redirect responses.

clientRedir304

caringo_swarm_scsp_response_304_total

 

 

 

caringo_swarm_scsp_response_400

 

The delta since last publication or the total number of 400 error responses.

clientError400

caringo_swarm_scsp_response_400_total

 

 

 

caringo_swarm_scsp_response_401

 

The delta since last publication or the total number of 401 error responses.

clientError401

caringo_swarm_scsp_response_401_total

 

 

 

caringo_swarm_scsp_response_404

 

The delta since last publication or the total number of 404 error responses.

clientError404

caringo_swarm_scsp_response_404_total

 

 

 

caringo_swarm_scsp_response_410

 

The delta since last publication or the total number of 410 error responses.

clientError410

caringo_swarm_scsp_response_410_total

 

 

 

caringo_swarm_scsp_response_412

 

The delta since last publication or the total number of 412 error responses.

clientError412

caringo_swarm_scsp_response_412_total

 

 

 

caringo_swarm_scsp_response_4xx

 

The delta since last publication or the total number of other 400-type error responses.

clientError4xx

caringo_swarm_scsp_response_4xx_total

 

 

 

caringo_swarm_scsp_response_500
caringo_swarm_scsp_response_500_total

The delta since last publication or the total number of 500 error responses.

clientError500

caringo_swarm_scsp_response_503
caringo_swarm_scsp_response_503_total

The delta since last publication or the total number of 503 error responses.

clientError503

caringo_swarm_scsp_response_507
caringo_swarm_scsp_response_507_total

The delta since last publication or the total number of 507 error responses.

clientError507

caringo_swarm_scsp_response_5xx
caringo_swarm_scsp_response_5xx_total

The delta since last publication or the total number of other 500-type error responses.

clientError5xx

caringo_swarm_scsp_searches
caringo_swarm_scsp_searches_total

The delta since last publication or the total number of search requests.

searches

caringo_swarm_volume_capacity

volume_dev, volume_id

The volume capacity in bytes.

volMaxMbytes

caringo_swarm_volume_ecrs

volume_dev, volume_id

The number of EC recoveries ongoing against this volume.

recoveryType, recoveryLocalVolId

caringo_swarm_volume_errors

volume_dev, volume_id

The number of reported IO errors on the volume.

volErrors

caringo_swarm_volume_free

volume_dev, volume_id

The number of free bytes on the volume.

volFreeMbytes

caringo_swarm_volume_fvrs

volume_dev, volume_id

The number of failed volume recoveries ongoing against this volume.

recoveryType, recoveryLocalVolId

caringo_swarm_volume_journal_utilization

volume_dev, volume_id

The portion of the volume journal space in use.

volLastJournalBid

caringo_swarm_volume_logical_objects

volume_dev, volume_id

The contribution to estimated cluster logical objects from this volume.

logicalObjects

caringo_swarm_volume_logical_space

volume_dev, volume_id

The contribution to estimated cluster logical space (in bytes) from this volume.

logicalSpace

caringo_swarm_volume_logical_unprocessed

volume_dev, volume_id

The number of streams on the volume not considered for the logical object/space estimates.

logicalUnprocessed

caringo_swarm_volume_read_bid

volume_dev, volume_id

The last read bid for the volume.

lastRead

caringo_swarm_volume_rep_bid

volume_dev, volume_id

The last replicate bid for the volume.

lastWrite

caringo_swarm_volume_state

volume_dev, volume_id

The status of the given volume name and ID. Statuses: 0 (OK), 1 (retiring), 2 (retired), 3 (unavailable), 4 (mounting), 5 (idle), -1 (unknown).

volState

caringo_swarm_volume_stats_io_queue_count

volume_dev, volume_id

The number of IO queue items on the last sampling.

caringo_swarm_volume_stats_io_queue_sec

volume_dev, volume_id

The time in seconds to process items on the IO queue at the last sampling.

caringo_swarm_volume_stats_io_utilization

volume_dev, volume_id

The fraction of the time at the last sampling the volume was busy.

caringo_swarm_volume_stats_sec_per_io_max

volume_dev, volume_id

The longest IO request time at the last sampling.

caringo_swarm_volume_stats_sec_per_io_running

volume_dev, volume_id

The average IO request time at the last sampling. 

caringo_swarm_volume_streams

volume_dev, volume_id

The number of streams on the volume.

volUsedstreams

caringo_swarm_volume_trapped

volume_dev, volume_id

The trapped space on the volume in bytes.

volTrappedMbytes

caringo_swarm_volume_uptime

volume_dev, volume_id

The time in seconds the volume has been up.

volUptime

caringo_swarm_volume_used

volume_dev, volume_id

The number of bytes used on the volume.

volUsedMbytes

caringo_swarm_volume_write_bid

volume_dev, volume_id

The last written bid for the volume.

caringo_swarm_feeds_remote_disconnects_last_hour

feed_name, feed_type

The number of remote disconnections in the last hour.

feedPluginRemoteDisconnectsLastHour

caringo_gateway_request_count

protocol="s3", scope="MultiDelete", method="POST"

Total number of S3 multidelete requests. Counts requests, not individual deleted objects.

caringo_gateway_status_code_count

protocol="s3", scope="MultiDelete", method="POST", status

Total number of S3 multidelete requests, broken down by the HTTP result code carried in the status label.
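
Several metrics in the table report a numeric state code rather than a count. The following sketch turns those codes back into the descriptions listed in the table. The code-to-text mappings are copied from the rows for caringo_swarm_cluster_state, caringo_swarm_node_state, caringo_swarm_feeds_feed_state, and caringo_swarm_volume_state; the helper name `describe_state` is hypothetical and not part of Swarm or Prometheus.

```python
# Shared by caringo_swarm_cluster_state and caringo_swarm_node_state.
NODE_STATES = {
    -1: "unknown", 0: "ok", 1: "idle", 2: "mounting", 3: "initializing",
    4: "finalizing", 5: "maintenance", 6: "retiring", 7: "retired",
    8: "error", 9: "unavailable", 10: "offline",
}

FEED_STATES = {
    -1: "unknown", 0: "closed", 1: "config-error",
    2: "too many overlapping feeds", 3: "blocked", 4: "paused by request",
    5: "paused for recovery", 6: "priority", 7: "ok",
}

VOLUME_STATES = {
    -1: "unknown", 0: "OK", 1: "retiring", 2: "retired",
    3: "unavailable", 4: "mounting", 5: "idle",
}

STATE_TABLES = {
    "caringo_swarm_cluster_state": NODE_STATES,
    "caringo_swarm_node_state": NODE_STATES,
    "caringo_swarm_feeds_feed_state": FEED_STATES,
    "caringo_swarm_volume_state": VOLUME_STATES,
}


def describe_state(metric_name, value):
    """Return the documented description for a state metric's value."""
    table = STATE_TABLES.get(metric_name)
    if table is None:
        raise KeyError("not a known state metric: " + metric_name)
    code = int(value)  # Prometheus sample values arrive as floats
    return table.get(code, "unrecognized code {}".format(code))


if __name__ == "__main__":
    print(describe_state("caringo_swarm_volume_state", 1.0))
    print(describe_state("caringo_swarm_feeds_feed_state", 7))
```

Note that the same number means different things for different metrics: 1 is "idle" for a node but "retiring" for a volume, so the lookup is keyed by metric name.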