Managing Chassis and Drives

Chassis Details

Detailed hardware and status information for each chassis (physical or virtual machine) is displayed on the hardware details page.

 

Tip

Streams are counts of the total number of Swarm-managed data components (such as replicas and segments). Streams are not logical objects (such as video files).

Status States: These are the states reported for hardware in a cluster and how to interpret them:

| Status | Nodes / Chassis | Volumes / Disks |
| --- | --- | --- |
| ok | Nominal | Nominal |
| idle | Nominal, but the node is idle | Nominal, but idle |
| retiring | One or more volumes are offloading streams to the cluster due to retire | Offloading streams to the cluster due to retire |
| retired | All volumes are retired | Empty of objects and not taking new ones |
| unavailable | | In an error state |
| error | Errors are reported on the node (hardware or software) | |
| mounting | One or more volumes are mounting | Mounting at startup/discovery |
| finalizing | Can appear while the node is rebooting or shutting down, as the node finishes sessions in process | |
| maintenance | A 3-hour window during an administrative reboot or shutdown where Failed Volume Recovery does not run | |
| initializing | Volumes have mounted but the node is not yet ready for client activity | |
| offline | Node is known to be offline but not in maintenance | |
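For scripts that process status values, the states listed above can be captured in a small lookup. The following is a minimal sketch: the dictionary contents are transcribed from this page, while the names `STATUS_STATES` and `describe_status` are hypothetical and not part of any Swarm API.

```python
# Interpretations transcribed from the status states above.
# Each entry is (node/chassis meaning, volume/disk meaning);
# None means this page lists no meaning for that hardware type.
STATUS_STATES = {
    "ok": ("Nominal", "Nominal"),
    "idle": ("Nominal, but the node is idle", "Nominal, but idle"),
    "retiring": (
        "One or more volumes are offloading streams to the cluster due to retire",
        "Offloading streams to the cluster due to retire",
    ),
    "retired": ("All volumes are retired", "Empty of objects and not taking new ones"),
    "unavailable": (None, "In an error state"),
    "error": ("Errors are reported on the node (hardware or software)", None),
    "mounting": ("One or more volumes are mounting", "Mounting at startup/discovery"),
    "finalizing": (
        "Can appear while the node is rebooting or shutting down, "
        "as the node finishes sessions in process",
        None,
    ),
    "maintenance": (
        "A 3-hour window during an administrative reboot or shutdown "
        "where Failed Volume Recovery does not run",
        None,
    ),
    "initializing": (
        "Volumes have mounted but the node is not yet ready for client activity",
        None,
    ),
    "offline": ("Node is known to be offline but not in maintenance", None),
}


def describe_status(status, kind):
    """Return the interpretation of a status; kind is 'node' or 'volume'."""
    if kind not in ("node", "volume"):
        raise ValueError("kind must be 'node' or 'volume'")
    node_text, volume_text = STATUS_STATES[status.lower()]
    text = node_text if kind == "node" else volume_text
    return text if text is not None else "(not listed for this hardware type)"
```

For example, `describe_status("retired", "volume")` returns the volume-level meaning, while a state that is only defined for nodes returns the placeholder text when queried for a volume.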

 

Details Tab

Each detail row displays the disk name, status, total capacity, amount of used journal space, largest stream size in MB, model number, serial number, ID, firmware version, and encryption status. The largest stream size displays as 0 if the largest stream on the disk is smaller than 1 MB.

Tip

Watch the Streams count to track the progress when retiring a disk.

Logs Tab

The Logs tab lists the last 10 logged announcements in the cluster as well as the last 10 logged critical alerts. The tab itself includes a count of these messages, and appears red if any are errors:  

  • Use the Clear command to remove from the display any log messages that have been addressed or are not of interest.

  • Click the Log Level (gear) settings command to view and change the log levels set for this machine. 

Hot-Swapping Disks: Messages display on this tab if a disk is removed from or inserted into a running node. This feature, described at https://perifery.atlassian.net/wiki/spaces/public/pages/2443808801, allows removing failed disks for analysis or adding storage capacity to a node at any time.

The following messages appear when a volume is added and then removed:

mounted /dev/sdb, volumeID is 561479FB832DCC526B1D7EDCD06B83E1
removed /dev/sdb, volumeID was 561479FB832DCC526B1D7EDCD06B83E1
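When log text is collected for review, these hot-swap messages can be picked out with a regular expression. The following is a minimal sketch: the pattern is inferred only from the two sample messages above, so other message wordings are not covered, and `parse_hotswap_events` is a hypothetical helper name.

```python
import re

# Pattern inferred from the sample messages on this page (an assumption,
# not a documented log format): an action word, a device path, and a
# hexadecimal volume ID introduced by "is" or "was".
HOTSWAP_RE = re.compile(
    r"(?P<action>mounted|removed) (?P<device>/dev/[^,\s]+), "
    r"volumeID (?:is|was) (?P<volume_id>[0-9A-Fa-f]+)"
)


def parse_hotswap_events(text):
    """Return (action, device, volume_id) tuples in order of appearance."""
    return [
        (m.group("action"), m.group("device"), m.group("volume_id"))
        for m in HOTSWAP_RE.finditer(text)
    ]


sample = (
    "mounted /dev/sdb, volumeID is 561479FB832DCC526B1D7EDCD06B83E1\n"
    "removed /dev/sdb, volumeID was 561479FB832DCC526B1D7EDCD06B83E1"
)
events = parse_hotswap_events(sample)
```

Matching a removal back to an earlier mount by volume ID, rather than by device path, keeps the pairing correct if the same device name is later reused for a different disk.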

Message Levels

These messages appear at the announcement level. Additional debug level messages appear in the syslog.

Driver Message Tab

dmesg (driver message) prints the message buffer of the kernel. These driver messages are useful for diagnosing a Swarm issue when a system panic or error occurs.

Hardware Info Tab

hwinfo (hardware information) is the Linux hardware detection tool output. This tool probes for the hardware present in the system and displays detailed information about various hardware components in human-readable format.

Memory Tab

The usage report on the Memory tab provides detailed information to help with troubleshooting insufficient memory.

Each node uses memory to hold an index of the objects stored in it. If a node runs out of index space, it stops storing new content until space is freed through deletions. A full node continues to respond to client read requests for data already present. Each named or alias object requires two index slots. Erasure coding typically requires more memory than replication; exactly how much depends on the encoding.
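The index slot rule lends itself to a rough back-of-the-envelope estimate. The sketch below uses the two-slots figure for named and alias objects stated above; the default of one slot for every other object is an assumption made for illustration, and the extra memory needed for erasure coding is deliberately left out because it depends on the encoding.

```python
def estimate_index_slots(named_or_alias_objects, other_objects, slots_per_other=1):
    """Rough index slot estimate for one node.

    Named and alias objects use two slots each (from this page).
    slots_per_other is an assumed value for all remaining objects;
    erasure-coding overhead is not modeled.
    """
    if named_or_alias_objects < 0 or other_objects < 0:
        raise ValueError("object counts must be non-negative")
    return 2 * named_or_alias_objects + slots_per_other * other_objects


# Example: 1,000 named objects plus 500 other objects.
slots = estimate_index_slots(1000, 500)
```

Treat the result as a lower bound when comparing it against the usage report on the Memory tab.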

Statistics Tab

The Statistics tab rolls up a detailed, expandable report combining Health Processor (HP), Communications (cluster network), and Memory usage counts and values, to help with analysis and troubleshooting.

The health processor runs on each Swarm node to check the status of streams, performing a wide range of actions:

  • Sends replica checks to the other nodes and adds or trims replicas based on responses

  • Deletes streams requiring deletion according to life points

  • Provides a safety net to remove older alias and named stream versions when a newer version is found in the cluster (which can happen when nodes are restored)

  • Checks each stream for data corruption using comparison with the stored stream hash

  • Moves the stream on disk if defragmentation is needed

  • Verifies the disk index is consistent with the streams found on disk

  • Verifies replicas are distributed properly in the cluster
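The corruption check in the list above, which compares a stream against its stored hash, can be illustrated in a few lines. This is a sketch of the general technique only: the hash algorithm Swarm uses is not stated on this page, so `sha256` is an assumed default, and `is_corrupt` is a hypothetical function name.

```python
import hashlib


def is_corrupt(stream_bytes, stored_hash_hex, algorithm="sha256"):
    """Return True if the recomputed hash differs from the stored one.

    The algorithm name is an assumption for illustration; any name
    accepted by hashlib.new() can be passed instead.
    """
    computed = hashlib.new(algorithm, stream_bytes).hexdigest()
    return computed != stored_hash_hex.lower()


# Example: hash recorded at write time, then rechecked later.
original = b"example stream content"
stored = hashlib.sha256(original).hexdigest()
intact = is_corrupt(original, stored)
damaged = is_corrupt(b"example stream c0ntent", stored)
```

Because the hash is stored when the stream is written, a later mismatch indicates that the bytes on disk have changed since then.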

Advanced Tab

The Advanced tab allows dynamically changing machine-level logging levels and working with Swarm's management API, through both a hands-on HAL browser and a Swagger visualizer.

The Health Data is the raw JSON content of the health report the cluster sends to DataCore Support. See https://perifery.atlassian.net/wiki/spaces/public/pages/2443815867.

The log levels can be reset from this tab as well as from the Logs tab:

Restarting or Shutting Down a Chassis

The gear icon at the top of the page allows restarting or shutting down the chassis. A node shut down or rebooted by an administrator appears with a Maintenance state on other nodes in the cluster.

Retiring a Chassis

Retire the chassis when replacing Swarm storage volumes for regular maintenance or to upgrade the cluster chassis with higher capacity disks. Retiring a chassis copies all objects to other chassis in the cluster, allowing safe removal of the chassis disks without risking any data loss.

To initiate a retire, select the Retire option under the gear icon at the top of the Chassis Details page. When initiating the retire, choose either a minimally disruptive retire, which is limited to the chassis being retired, or an accelerated retire, which uses all nodes in the cluster to replicate the objects on the retiring chassis as quickly as possible.

A retiring chassis accepts no new or updated objects. After all objects on a volume are copied elsewhere, the volume's state changes to Retired and Swarm no longer uses it. The volume can be safely removed at this point.

Rate of the Retire: Swarm calculates the retire rate over the last hour, which it publishes using SNMP as retireRatePerHour. This covers the entire chassis regardless of how many volumes are being retired.
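A rate "over the last hour" is a sliding-window count. The sketch below shows one generic way to compute such a figure from timestamped progress events; it is not Swarm's implementation, the unit being counted (for example, streams moved) is an assumption, and the class name `HourlyRate` is hypothetical.

```python
from collections import deque


class HourlyRate:
    """Sliding-window counter: total recorded within the last window."""

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp_seconds, count) in arrival order

    def record(self, timestamp, count):
        """Record that `count` items were processed at `timestamp`."""
        self._events.append((timestamp, count))

    def rate(self, now):
        """Return the total count recorded in the window ending at `now`."""
        cutoff = now - self.window_seconds
        # Drop events that have aged out of the window.
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()
        return sum(count for _, count in self._events)


# Example with timestamps in seconds.
tracker = HourlyRate()
tracker.record(0, 100)
tracker.record(1800, 50)
tracker.record(3700, 25)
current = tracker.rate(now=3700)
```

In the example, the first event falls outside the one-hour window ending at 3700 seconds, so only the two later events contribute to the result.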

Canceling the Retire: Cancel an in-process retire by selecting the Cancel Retire option under the gear icon at the top of the Chassis Details page. A retire can be canceled while one or more disks in the chassis have a Retiring status.

Retiring a Disk (Volume)

Disk-level retires are useful for targeting bad (slow) disks and for working around having too little capacity to retire an entire chassis. If a disk retires automatically because of I/O errors, check the diagnostic data collected in the logs. (v11.1)

To retire a volume, locate and click the gear icon in the row for the affected disk:

Select the speed of retire. The fastest method incurs maximum effort by the cluster to move the content:

Rate of the Retire: When Swarm completes a retire task on a disk, it generates an announce-level message reporting the overall duration and rate of the retire. (v11.0)
See https://perifery.atlassian.net/wiki/spaces/public/pages/2443811993/Retiring+Hardware#Retire-Rate.

Canceling the Retire: Click the gear icon in the row for the affected disk and select the Cancel retire command:

Identifying a Disk

When attempting to identify a failed or failing disk, it is helpful to enable the LED disk light for that disk. To flash the light for a specific disk, click the disk light toggle in the disk's display row:

 

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.