Using Cluster Reports

The Reports section of the Swarm UI provides real-time and historical views into the health and activity of a Swarm cluster.

Health Report

The Health Report provides both summary and detailed information at the level of cluster, subcluster, chassis, and drive.

The sunburst graphic shows an interactive visualization of the cluster, with the cluster represented in the center, subclusters in the next concentric circle out, chassis in the next concentric circle, and drives on the outside. To see only the data for a particular subcluster or chassis, click its wedge in the sunburst. All summary and detail data are updated to show only the selected component. To return to a higher-level view, click the center of the sunburst. Moving the cursor over a wedge displays the component's ID (IP address, drive name, etc.) and status, which makes it easier to identify a component.

The status of each component is represented by the color of its wedge in the sunburst. Statuses include the following:

Component Status

Description

OK

The chassis or drive is working and there are no errors.

Alert
Warning

The chassis or drive has experienced one or more errors. 

Cluster-level alerts often relate to space thresholds or network issues (unable to reach NTP/Gateway/ES/Metrics servers or other nodes).

Initializing

The brief state after a chassis boots, while it is reading cluster-persisted settings and is not yet ready to accept requests.

Maintenance

The chassis has been shut down or rebooted by an administrator from either SNMP or the UI and should not be considered missing for recovery purposes. By default, a chassis can be in a Maintenance state for 3 hours before it transitions to Offline and the cluster starts recovery of its content. Maintenance mode is not initiated when the power is manually cycled on the chassis outside of Swarm (either physically on the hardware or via a remote shutdown mechanism in an out-of-band management platform such as IPMI, Dell iDRAC, or HP iLO) or if there is a drive error; in both of these cases, recovery processes start for the chassis/drive unless recovery is suspended.

Mounting

The chassis is mounting one or more drives, including formatting the drive if it is new and reading all objects on the volume into the RAM index for faster access.

Offline

The chassis or drive was previously present in the cluster but no longer is.

Retiring

The chassis or drive is in the process of retiring, verifying all objects are fully protected elsewhere in the cluster and then removing them locally.

Retired

The chassis or drive has completed the retiring process and may be removed from the cluster.

Idle

The chassis or drive is in power-saving mode after a configurable period of inactivity. (See Configuring Power Management.)

Subcluster and Cluster status are inherited from the chassis or drives contained within.
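The Maintenance window described above can be sketched as a small calculation. This is only an illustration of the documented default (3 hours before the transition to Offline); the function names are hypothetical, and treating the exact 3-hour mark as Offline is an assumption.

```python
from datetime import datetime, timedelta

# Default Maintenance window taken from the status table above.
DEFAULT_MAINTENANCE_WINDOW = timedelta(hours=3)


def expected_status(entered_maintenance_at, now, window=DEFAULT_MAINTENANCE_WINDOW):
    """Return "Maintenance" while the window is open, otherwise "Offline".

    Assumption: the chassis is treated as Offline at exactly the end of the window.
    """
    elapsed = now - entered_maintenance_at
    return "Maintenance" if elapsed < window else "Offline"


def recovery_starts_at(entered_maintenance_at, window=DEFAULT_MAINTENANCE_WINDOW):
    """Return the time at which the chassis would go Offline and recovery would begin."""
    return entered_maintenance_at + window
```

For example, a chassis placed in Maintenance at 09:00 would be expected to show Offline from 12:00 onward if it has not rejoined the cluster by then.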

The data table below the sunburst displays more detailed information about the cluster, including the amount of used and free capacity and how many streams reside on the chassis/drive. Clicking a subcluster row loads the Subcluster page. Clicking a chassis row loads the Chassis Details page unless the chassis status is Maintenance or Offline.

Storage Contents

The Storage Contents chart displays the total amount of used capacity in the cluster over time as well as the total stream count (including replicas and erasure coding segments).

Usage

Note

Historical usage charts may show artificial bumps in usage when adding or removing a large percentage of drives within a single day.

The Usage charts display percentages of Disk Space and Stream Index (memory):

  • Disk Space - The amount of free, trapped, and used drive space as a percent of the total available over time.

  • Stream Index - The amount of free, overlay, and used RAM index space as a percent of the total available over time.
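The percentages in these charts can be illustrated with a short calculation. This is a minimal sketch, assuming that the total is the sum of the listed categories (free, trapped, and used for Disk Space; free, overlay, and used for Stream Index); the function name and the sample values are hypothetical.

```python
def usage_percentages(**amounts):
    """Convert raw amounts per category into percentages of their combined total."""
    total = sum(amounts.values())
    if total <= 0:
        raise ValueError("total must be greater than zero")
    return {name: 100.0 * value / total for name, value in amounts.items()}


# Illustrative Disk Space values, in any consistent unit (for example, GB).
disk_space = usage_percentages(free=600, trapped=100, used=300)

# Illustrative Stream Index values, in any consistent unit.
stream_index = usage_percentages(free=50, overlay=0, used=150)
```

Because each value is divided by the same total, the percentages for a chart always add up to 100.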

Network Traffic

The Network Traffic graphs display Requests, Responses, and Internal Requests (intra-cluster activity).

Requests: The count of each SCSP method type in incoming client requests to the cluster over time. SCSP Method types are: https://perifery.atlassian.net/wiki/spaces/public/pages/2443821480, https://perifery.atlassian.net/wiki/spaces/public/pages/2443821518, writes (sum of https://perifery.atlassian.net/wiki/spaces/public/pages/2443821645, , and ), , and Other (sum of and ). This information is useful in understanding both when and how a cluster is being used by client applications.

Responses: The count of responses returned to clients by the storage cluster over time. This data is helpful in identifying problems in the cluster or client applications, including whether there are particular times during which error responses occur.

Internal Requests: The count of various internal, cluster-initiated activities between nodes in the cluster over time. This information is helpful in understanding how much data movement is happening in the cluster as hardware is added, removed, retired, etc. Spikes in activity within the cluster not correlating with client activity are often associated with either a failed drive recovery or an admin-requested retire.
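The comparison described above (internal activity that spikes without a matching rise in client activity) can be sketched as follows. This is a rough illustration only: it assumes the two series have been read off the graphs as per-interval counts, and the median-based threshold and the `factor` parameter are hypothetical choices, not anything Swarm applies.

```python
from statistics import median


def unexplained_internal_spikes(client_counts, internal_counts, factor=3.0):
    """Return indexes of intervals where internal requests spike but client requests do not.

    A value counts as a spike when it exceeds `factor` times the median of its series.
    """
    if len(client_counts) != len(internal_counts):
        raise ValueError("both series must cover the same intervals")
    if not client_counts:
        return []
    client_limit = factor * median(client_counts)
    internal_limit = factor * median(internal_counts)
    return [
        i
        for i, (client, internal) in enumerate(zip(client_counts, internal_counts))
        if internal > internal_limit and client <= client_limit
    ]
```

Intervals returned by this check are candidates for further investigation, such as looking for a failed drive recovery or an admin-requested retire during that period.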

Elasticsearch Reports

If the Elasticsearch panel on the Dashboard shows a problem, research the ES cluster status on the Elasticsearch Reports page. These reports are generated on demand and allow drilling into details spanning the ES nodes, thread pools, indices, and shards. (v2.0)

Important

Opening the Elasticsearch Reports page requires generation of a lot of status data; allow time for the page to display.

For details on the columns that are reported, see the relevant Elasticsearch Reference: version 2.3 or version 5.6.

Section

Setting

Notes

RESOURCES
Node details

  • name

  • ip

  • uptime

  • master

  • cpu

  • disk avail

  • memory size

  • tripped breaker

  • file desc current

  • heap max

  • heap percent

  • ram percent

  • indexing delete total

  • indexing index total

  • search query total

Shows the ES cluster topology. 

To see where a node resides and to check performance stats, focus on these columns:

  • ip

  • cpu

  • tripped breaker

Important

The tripped breaker field signals trouble. Contact DataCore Support if the status is red.

  • heap percent

  • ram percent

Other columns are more helpful when looking at larger clusters, such as determining how many master-eligible nodes are available:

  • master

  • name

RESOURCES
Thread pool details

  • name

  • ip

  • bulk rejected

  • flush rejected

  • force_merge rejected

  • generic rejected

  • get rejected

  • index rejected

  • refresh rejected

  • search rejected

  • warmer rejected

Shows ES cluster-wide thread pool statistics per node. The rejected statistics are returned for all thread pools.

INDICES

  • index

  • health

  • status

  • docs count

  • docs deleted

  • pri

  • pri store size

  • rep

  • store size

Provides low-level information about the segments in the shards of an index.

docs.count - The number of non-deleted documents stored in this segment. These are Lucene documents, so the count includes hidden documents (such as from nested types).

docs.deleted - The number of deleted documents stored in this segment. The space for these documents is reclaimed when this segment is merged.

SHARDS

  • index

  • node

  • ip

  • docs

  • prirep

  • shard

  • state

  • store

The detailed view of which nodes contain which shards. For each shard, it shows whether it is a primary or a replica, the number of docs, the bytes consumed on disk, and the node where it is located.

prirep - Whether this segment belongs to a primary or replica shard.
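The SHARDS report can be summarized per node with a short script. This is a minimal sketch, assuming the rows have been copied out of the report as dictionaries keyed by the column names above, and that the values follow the Elasticsearch cat shards conventions (`p` or `r` for prirep, `STARTED` for a healthy state). The function names and sample rows are hypothetical.

```python
from collections import defaultdict


def shards_per_node(rows):
    """Count primary and replica shards held by each node."""
    summary = defaultdict(lambda: {"primary": 0, "replica": 0})
    for row in rows:
        if not row.get("node"):
            continue  # unassigned shards are not located on any node
        kind = "primary" if row["prirep"] == "p" else "replica"
        summary[row["node"]][kind] += 1
    return dict(summary)


def unstarted_shards(rows):
    """Return rows whose state is anything other than STARTED."""
    return [row for row in rows if row["state"] != "STARTED"]


# Illustrative rows only; real values come from the SHARDS report.
rows = [
    {"index": "idx", "shard": "0", "prirep": "p", "state": "STARTED", "node": "es1"},
    {"index": "idx", "shard": "0", "prirep": "r", "state": "STARTED", "node": "es2"},
    {"index": "idx", "shard": "1", "prirep": "p", "state": "STARTED", "node": "es2"},
    {"index": "idx", "shard": "1", "prirep": "r", "state": "UNASSIGNED", "node": None},
]
```

A node holding far more shards than its peers, or any shard that is not in the STARTED state, is a reasonable place to begin when the Dashboard reports an Elasticsearch problem.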

Feeds Reports

The Data Feeds Reports show the number of processed events for each configured search or replication feed over time, providing insight into how busy each feed is. Status markers alert to problems with the feed.

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.