Hardware Diagnostics with Prometheus
Prometheus is an open-source system monitoring and alerting toolkit that shows which statistics are available for a system, even under failure conditions.
Prometheus scrapes metrics from instrumented jobs and runs rules over this data to record aggregated time series or to generate alerts.
Grafana and other API consumers can visualize the collected data.
The Prometheus Node Exporter is included with Swarm for monitoring and diagnostics on the machines in a Swarm cluster; it provides a wide variety of hardware and kernel-related metrics.
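As an illustration of how an API consumer might read collected data, the following is a minimal sketch that uses the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The server address `prometheus.example.com:9090` and the helper names are assumptions made for this example, not part of Swarm; `caringo_swarm_node_state` is one of the metrics listed in the table on this page.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical address; replace with the address of your own Prometheus server.
PROMETHEUS_URL = "http://prometheus.example.com:9090"


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload):
    """Convert an instant-query response into (labels, float value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample looks like {"metric": {labels}, "value": [timestamp, "value as string"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def query(promql, base_url=PROMETHEUS_URL, timeout=10):
    """Run an instant query against the (assumed) Prometheus server."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))


if __name__ == "__main__":
    # Requires a reachable Prometheus server at PROMETHEUS_URL.
    for labels, value in query("caringo_swarm_node_state"):
        print(labels.get("instance", "?"), value)
```

Keeping URL construction and response parsing separate from the network call means both can be checked without a live server.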
Configuring the Node Exporter
...
Info |
---|
Statistics renamed: In Swarm 11, the naming of the statistics has been globalized for clarity: the prefix |
Metric Name | Label(s) | Value Meaning | Related SNMP Entry Name(s) |
---|---|---|---|
caringo_swarm_cluster_license_capacity_tb | cluster_name | Cluster capacity in terabytes. | totalGBLicensedCapacity |
caringo_swarm_cluster_license_days_remaining | cluster_name | Integer number of days remaining on the license. | |
caringo_swarm_cluster_license_enabled | cluster_name | 1 for license enabled. 0 for not enabled. | |
caringo_swarm_cluster_state | cluster_name | -1 = unknown; 0 = ok; 1 = idle; 2 = mounting; 3 = initializing; 4 = finalizing; 5 = maintenance; 6 = retiring; 7 = retired; 8 = error; 9 = unavailable; 10 = offline | clusterState |
caringo_swarm_index_overlay_state | | The overlay index status. 2 = authoritative; 1 = operational; 0 otherwise. | indexOverlayStatus |
caringo_swarm_index_overlay_inflating | | Whether the overlay index is inflating on this node. 1 = true; 0 = false. | indexOverlayInflating |
caringo_swarm_index_overlay_attractors | | The number of desired attractors. | indexOverlayDesiredAttractors |
caringo_swarm_feeds_deleted_pending | feed_name, feed_type | The number of deleted object events waiting to be processed. | feedNodeDeletesUnprocessed |
caringo_swarm_feeds_deleted_retrying | feed_name, feed_type | The number of deleted object events needing to be retried. | feedNodeDeletesFailing |
caringo_swarm_feeds_deleted_successful | feed_name, feed_type | The number of deleted object events successfully processed. | feedNodeDeletesSuccess |
caringo_swarm_feeds_deleted_unqualified | feed_name, feed_type | The number of deleted object events potentially requiring processing. | feedNodeDeletesUnqualified |
caringo_swarm_feeds_est_backlog_clear_time | feed_name, feed_type | The estimated number of seconds to complete all processing. -1 for unknown. | feedEstBacklogClearTime |
caringo_swarm_feeds_existing_pending | feed_name, feed_type | The number of current object events waiting to be processed. | feedNodeExistsUnprocessed |
caringo_swarm_feeds_existing_retrying | feed_name, feed_type | The number of current object events needing to be retried. | feedNodeExistsFailing |
caringo_swarm_feeds_existing_successful | feed_name, feed_type | The number of current object events successfully processed. | feedNodeExistsSuccess |
caringo_swarm_feeds_existing_unqualified | feed_name, feed_type | The number of current object events potentially requiring processing. | feedNodeExistsUnqualified |
caringo_swarm_feeds_feed_id | feed_name, feed_type | The id number of the feed. | feedFeedId |
caringo_swarm_feeds_feed_state | feed_name, feed_type | -1 = unknown; 0 = closed; 1 = config-error; 2 = too many overlapping feeds; 3 = blocked; 4 = paused by request; 5 = paused for recovery; 6 = priority (processing contexts after start/restart); 7 = ok | feedState |
caringo_swarm_feeds_last_failure | feed_name, feed_type | The time of the last failure event in epoch milliseconds. | feedLastExistFailure, feedLastDeleteFailure, feedLastVersionedFailure |
caringo_swarm_feeds_last_success | feed_name, feed_type | The time of the last successful event in epoch milliseconds. | feedLastSuccess |
caringo_swarm_feeds_remote_failure | feed_name, feed_type | The number of replication/indexing failures. | feedPluginRemoteFailure |
caringo_swarm_feeds_remote_success_duplicate | feed_name, feed_type | The number of duplicate indexing/replication successes. | feedPluginRemoteSuccessDuplicate |
caringo_swarm_feeds_remote_success_transfer | feed_name, feed_type | The number of new indexing/replication successes. | feedPluginRemoteSuccessTransfer |
caringo_swarm_feeds_versioned_pending | feed_name, feed_type | The number of versioned object events waiting to be processed. | feedNodeVersionedUnprocessed |
caringo_swarm_feeds_versioned_retrying | feed_name, feed_type | The number of versioned object events needing to be retried. | feedNodeVersionedFailing |
caringo_swarm_feeds_versioned_successful | feed_name, feed_type | The number of versioned object events successfully processed. | feedNodeVersionedSuccess |
caringo_swarm_feeds_versioned_unqualified | feed_name, feed_type | The number of versioned object events potentially requiring processing. | feedNodeVersionedUnqualified |
caringo_swarm_health_cycle | | The HP cycle number. | ongoingHPCycleNumber |
caringo_swarm_health_examined | | The number of streams examined so far this HP cycle. | ongoingHPCycleStreamsExamined |
caringo_swarm_health_offloaded | | The number of streams moved to another node this HP cycle. | ongoingHPCycleStreamsOffloaded |
caringo_swarm_health_relocated | | The number of streams relocated on disk this HP cycle. | ongoingHPCycleStreamsRelocated |
caringo_swarm_health_total | | The number of streams processed so far this HP cycle. | ongoingHPCycleStreamsTotal |
caringo_swarm_health_verified | | The number of streams checked for data integrity this HP cycle. | ongoingHPCycleStreamsVerified |
caringo_swarm_index_alias_slots | | The number of memory index slots used for alias objects. | indexSlotsAlias |
caringo_swarm_index_deleted_slots | | The number of memory index slots used for deleted objects. | indexSlotsDeleted |
caringo_swarm_index_immutable_slots | | The number of memory index slots used for immutable objects. | indexSlotsImmutable |
caringo_swarm_index_manifest_slots | | Not useful. Always 0. | indexSlotsManifest |
caringo_swarm_index_mutable_slots | | The number of memory index slots used for named+alias objects. | indexSlotsMutable |
caringo_swarm_index_named_slots | | The number of memory index slots used for named objects. | indexSlotsNamed |
caringo_swarm_index_overlay_slots | | The number of memory index slots used for the overlay index. | indexSlotsOverlayUsed |
caringo_swarm_index_policy_slots | | The number of memory index slots used for policy attributes. | indexSlotsPolicy |
caringo_swarm_index_total_slots | | The number of memory index slots total. | indexSlotsTotal |
caringo_swarm_index_used_slots | | The number of memory index slots used. | indexSlotsUsed |
caringo_swarm_index_versioned_slots | | The number of memory index slots used for prior object versions. | indexSlotsVersioned |
caringo_swarm_memory_cache_memory_allocated | | The memory allocated to the content cache in bytes. | contentCacheCapacityMB |
caringo_swarm_memory_cache_memory_items | | The number of objects stored in the content cache. | contentCacheItems |
caringo_swarm_memory_cache_memory_used | | The memory used to store objects in the content cache in bytes. | contentCacheUsedMB |
caringo_swarm_memory_chassis_arena | | The bytes of memory available for use by Swarm on the chassis. | chassisArenaM |
caringo_swarm_memory_chassis_free | | The bytes of memory free on the chassis. | chassisFreeMemM |
caringo_swarm_memory_chassis_headroom | | The bytes of memory reserved for emergency use on the chassis. | chassisHeadroomM |
caringo_swarm_memory_chassis_shared | | The bytes of shared memory used on the chassis. | chassisSharedMemM |
caringo_swarm_memory_chassis_total | | The bytes of physical memory on the chassis. | chassisTotalMemM |
caringo_swarm_memory_node_accounted | | Bytes of buffer memory in use in the main process. | accountedMemM |
caringo_swarm_memory_node_accounts | | Number of memory accounts in use in the main process. | memAccountsActual |
caringo_swarm_memory_node_accounts_over_budget | | Number of memory accounts over budget in the main process. | memAccountsOverlimit |
caringo_swarm_memory_node_accounts_throttled | | Number of memory accounts throttled in the main process. | memAccountsQueued |
caringo_swarm_memory_node_actual | | Total bytes in use for the main process. | processActualSizeM |
caringo_swarm_memory_node_allowance | | Total buffer memory allocated in the main process. | accountAllowanceM |
caringo_swarm_memory_node_file_descriptors | | Number of file descriptors in use by the main process. | processFDs |
caringo_swarm_memory_node_non_accounted | | Bytes of non-accounted memory in use in the main process. | nonAccountedMemM |
caringo_swarm_memory_node_target | | Main process target size in bytes. | processTargetSizeM |
caringo_swarm_node_errors | | The number of reported errors on the node. | castorErrTableSize |
caringo_swarm_node_examq | | The number of examination queue entries on the node. | examQueueCount |
caringo_swarm_node_state | | -1 = unknown; 0 = ok; 1 = idle; 2 = mounting; 3 = initializing; 4 = finalizing; 5 = maintenance; 6 = retiring; 7 = retired; 8 = error; 9 = unavailable; 10 = offline | castorState |
caringo_swarm_node_swarm_version | version | Value is always 1. | castorVersion |
caringo_swarm_node_uptime | | The uptime of the main process in seconds. | sysUpTimeInstance |
caringo_swarm_node_volumes | | The number of volumes in use on the node. | castorVolumes |
caringo_swarm_scsp_appends | | The delta since last publication or the total number of APPEND requests. | appends |
caringo_swarm_scsp_appends_total | | | |
caringo_swarm_scsp_client_close_read | | The delta since last publication or the total number of client premature closes on read-type requests. | clientPrematureCloseRead |
caringo_swarm_scsp_client_close_read_total | | | |
caringo_swarm_scsp_client_close_write | | The delta since last publication or the total number of client premature closes on write-type requests. | clientPrematureCloseWrite |
caringo_swarm_scsp_client_close_write_total | | | |
caringo_swarm_scsp_copies | | The delta since last publication or the total number of COPY requests. | copies |
caringo_swarm_scsp_copies_total | | | |
caringo_swarm_scsp_deletes | | The delta since last publication or the total number of DELETE requests. | deletes |
caringo_swarm_scsp_deletes_total | | | |
caringo_swarm_scsp_gets | | The delta since last publication or the total number of GET requests. | reads |
caringo_swarm_scsp_gets_total | | | |
caringo_swarm_scsp_heads | | The delta since last publication or the total number of HEAD requests. | infos |
caringo_swarm_scsp_heads_total | | | |
caringo_swarm_scsp_indirectDeletes | | The delta since last publication or the total number of deletes performed internally by the health processor. | indirectDeletes |
caringo_swarm_scsp_indirectDeletes_total | | | |
caringo_swarm_scsp_internode_reads | | The delta since last publication or the total number of GET requests internally performed. | internodeReads |
caringo_swarm_scsp_internode_reads_total | | | |
caringo_swarm_scsp_internode_redirects | | The delta since last publication or the total number of client redirects between nodes in the cluster. | redirects |
caringo_swarm_scsp_internode_redirects_total | | | |
caringo_swarm_scsp_internode_trims | | The delta since last publication or the total number of replicas internally removed. | internodeTrims |
caringo_swarm_scsp_internode_trims_total | | | |
caringo_swarm_scsp_internode_writes | | The delta since last publication or the total number of POST requests internally performed. | internodeWrites |
caringo_swarm_scsp_internode_writes_total | | | |
caringo_swarm_scsp_patches | | The delta since last publication or the total number of PATCH requests. | patches |
caringo_swarm_scsp_patches_total | | | |
caringo_swarm_scsp_posts | | The delta since last publication or the total number of POST requests. | writes |
caringo_swarm_scsp_posts_total | | | |
caringo_swarm_scsp_processes_active | | The number of SCSP processes that are active. | |
caringo_swarm_scsp_puts | | The delta since last publication or the total number of PUT requests. | updates |
caringo_swarm_scsp_puts_total | | | |
caringo_swarm_scsp_response_200 | | The delta since last publication or the total number of SCSP 200 responses. | clientSuccess200 |
caringo_swarm_scsp_response_200_total | | | |
caringo_swarm_scsp_response_201 | | The delta since last publication or the total number of SCSP 201 responses. | clientSuccess201 |
caringo_swarm_scsp_response_201_total | | | |
caringo_swarm_scsp_response_202 | | The delta since last publication or the total number of SCSP 202 responses. | clientSuccess202 |
caringo_swarm_scsp_response_202_total | | | |
caringo_swarm_scsp_response_206 | | The delta since last publication or the total number of SCSP 206 responses. | clientSuccess206 |
caringo_swarm_scsp_response_206_total | | | |
caringo_swarm_scsp_response_301 | | The delta since last publication or the total number of 301 redirect responses. | clientRedir301 |
caringo_swarm_scsp_response_301_total | | | |
caringo_swarm_scsp_response_304 | | The delta since last publication or the total number of 304 redirect responses. | clientRedir304 |
caringo_swarm_scsp_response_304_total | | | |
caringo_swarm_scsp_response_400 | | The delta since last publication or the total number of 400 error responses. | clientError400 |
caringo_swarm_scsp_response_400_total | | | |
caringo_swarm_scsp_response_401 | | The delta since last publication or the total number of 401 error responses. | clientError401 |
caringo_swarm_scsp_response_401_total | | | |
caringo_swarm_scsp_response_404 | | The delta since last publication or the total number of 404 error responses. | clientError404 |
caringo_swarm_scsp_response_404_total | | | |
caringo_swarm_scsp_response_410 | | The delta since last publication or the total number of 410 error responses. | clientError410 |
caringo_swarm_scsp_response_410_total | | | |
caringo_swarm_scsp_response_412 | | The delta since last publication or the total number of 412 error responses. | clientError412 |
caringo_swarm_scsp_response_412_total | | | |
caringo_swarm_scsp_response_4xx | | The delta since last publication or the total number of other 400-type error responses. | clientError4xx |
caringo_swarm_scsp_response_4xx_total | | | |
caringo_swarm_scsp_response_500 | | The delta since last publication or the total number of 500 error responses. | clientError500 |
caringo_swarm_scsp_response_503 | | The delta since last publication or the total number of 503 error responses. | clientError503 |
caringo_swarm_scsp_response_507 | | The delta since last publication or the total number of 507 error responses. | clientError507 |
caringo_swarm_scsp_response_5xx | | The delta since last publication or the total number of other 500-type error responses. | clientError5xx |
caringo_swarm_scsp_searches | | The delta since last publication or the total number of search requests. | searches |
caringo_swarm_volume_capacity | volume_dev, volume_id | The volume capacity in bytes. | volMaxMbytes |
caringo_swarm_volume_ecrs | volume_dev, volume_id | The number of EC recoveries ongoing against this volume. | recoveryType, recoveryLocalVolId |
caringo_swarm_volume_errors | volume_dev, volume_id | The number of reported IO errors on the volume. | volErrors |
caringo_swarm_volume_free | volume_dev, volume_id | The number of free bytes on the volume. | volFreeMbytes |
caringo_swarm_volume_fvrs | volume_dev, volume_id | The number of failed volume recoveries ongoing against this volume. | recoveryType, recoveryLocalVolId |
caringo_swarm_volume_journal_utilization | volume_dev, volume_id | The portion of the volume journal space in use. | volLastJournalBid |
caringo_swarm_volume_logical_objects | volume_dev, volume_id | The contribution to estimated cluster logical objects from this volume. | logicalObjects |
caringo_swarm_volume_logical_space | volume_dev, volume_id | The contribution to estimated cluster logical space (in bytes) from this volume. | logicalSpace |
caringo_swarm_volume_logical_unprocessed | volume_dev, volume_id | The number of streams on the volume not considered for the logical object/space estimates. | logicalUnprocessed |
caringo_swarm_volume_read_bid | volume_dev, volume_id | The last read bid for the volume. | lastRead |
caringo_swarm_volume_rep_bid | volume_dev, volume_id | The last replicate bid for the volume. | lastWrite |
caringo_swarm_volume_state | volume_dev, volume_id | The status of the given volume name and ID. Statuses: 0 (OK), 1 (retiring), 2 (retired), 3 (unavailable), 4 (mounting), 5 (idle), -1 (unknown). | volState |
caringo_swarm_volume_stats_io_queue_count | volume_dev, volume_id | The number of IO queue items on the last sampling. | |
caringo_swarm_volume_stats_io_queue_sec | volume_dev, volume_id | The time in seconds to process items on the IO queue at the last sampling. | |
caringo_swarm_volume_stats_io_utilization | volume_dev, volume_id | The fraction of the time at the last sampling the volume was busy. | |
caringo_swarm_volume_stats_sec_per_io_max | volume_dev, volume_id | The longest IO request time at the last sampling. | |
caringo_swarm_volume_stats_sec_per_io_running | volume_dev, volume_id | The average IO request time at the last sampling. | |
caringo_swarm_volume_streams | volume_dev, volume_id | The number of streams on the volume. | volUsedstreams |
caringo_swarm_volume_trapped | volume_dev, volume_id | The trapped space on the volume in bytes. | volTrappedMbytes |
caringo_swarm_volume_uptime | volume_dev, volume_id | The time in seconds the volume has been up. | volUptime |
caringo_swarm_volume_used | volume_dev, volume_id | The number of bytes used on the volume. | volUsedMbytes |
caringo_swarm_volume_write_bid | volume_dev, volume_id | The last written bid for the volume. | |
caringo_swarm_feeds_remote_disconnects_last_hour | feed_name, feed_type | The number of remote disconnections in the last hour. | feedPluginRemoteDisconnectsLastHour |
caringo_gateway_request_count | protocol="s3", scope="MultiDelete", method="POST" | Total number of S3 multidelete requests. Counts requests, not individual deleted objects. | |
caringo_gateway_status_code_count | protocol="s3", scope="MultiDelete", method="POST", status | Total number of S3 multidelete requests, with the HTTP result code reported in the status label. | |
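The numeric state codes and byte counts in the table can be turned into readable values on the client side. The following is a minimal sketch, assuming the metrics arrive in the Prometheus text exposition format; the parser handles only simple `name{labels} value` lines (no escaped quotes in label values), and the sample values, the volume ID `vol-a`, and the helper names are made up for illustration.

```python
import re

# Codes copied from the caringo_swarm_node_state and caringo_swarm_cluster_state rows.
NODE_STATES = {
    -1: "unknown", 0: "ok", 1: "idle", 2: "mounting", 3: "initializing",
    4: "finalizing", 5: "maintenance", 6: "retiring", 7: "retired",
    8: "error", 9: "unavailable", 10: "offline",
}
# Codes copied from the caringo_swarm_volume_state row.
VOLUME_STATES = {
    0: "OK", 1: "retiring", 2: "retired", 3: "unavailable",
    4: "mounting", 5: "idle", -1: "unknown",
}

_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Parse simple exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        samples.append((name, dict(_LABEL.findall(raw_labels or "")), float(value)))
    return samples


def volume_report(samples):
    """Group volume metrics by volume_id; derive the state name and used fraction."""
    prefix = "caringo_swarm_volume_"
    volumes = {}
    for name, labels, value in samples:
        if name.startswith(prefix) and "volume_id" in labels:
            volumes.setdefault(labels["volume_id"], {})[name[len(prefix):]] = value
    report = {}
    for vol_id, m in volumes.items():
        capacity = m.get("capacity", 0.0)
        report[vol_id] = {
            "state": VOLUME_STATES.get(int(m.get("state", -1)), "unknown"),
            "used_fraction": m["used"] / capacity if capacity and "used" in m else None,
        }
    return report


# Made-up sample values for illustration only.
EXAMPLE = """\
# Sample scrape output (illustrative)
caringo_swarm_node_state 0
caringo_swarm_volume_state{volume_dev="/dev/sda",volume_id="vol-a"} 1
caringo_swarm_volume_capacity{volume_dev="/dev/sda",volume_id="vol-a"} 4000
caringo_swarm_volume_used{volume_dev="/dev/sda",volume_id="vol-a"} 1000
"""

if __name__ == "__main__":
    parsed = parse_samples(EXAMPLE)
    print(NODE_STATES[int(parsed[0][2])])
    print(volume_report(parsed))
```

For production use, a maintained parser such as the one in the official Prometheus client library is likely a better choice than the regular expressions used here.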