Chassis Details

Detailed hardware and status information for each chassis (physical or virtual machine) is displayed on the hardware details page.

 

Tip

Streams are counts of the total number of Swarm-managed data components (such as replicas and segments). Streams are not logical objects (such as video files).

Status States: The following states are reported for hardware in a cluster. Interpret them as follows:

| Status | Nodes / Chassis | Volumes / Disks |
| --- | --- | --- |
| ok | Nominal | Nominal |
| idle | Nominal, but the node is idle | Nominal, but idle |
| retiring | One or more volumes are offloading streams to the cluster due to retire | Offloading streams to the cluster due to retire |
| retired | All volumes are retired | Empty of objects and not taking new ones |
| unavailable | | In an error state |
| error | Errors are reported on the node (hardware or software) | |
| mounting | One or more volumes are mounting | Mounting at startup/discovery |
| finalizing | Can appear while the node is rebooting or shutting down, as the node finishes sessions in process | |
| maintenance | A 3-hour window during an administrative reboot or shutdown where Failed Volume Recovery does not run | |
| initializing | Volumes have mounted but the node is not yet ready for client activity | |
| offline | Node is known to be offline but not in maintenance | |
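
When processing these states in a monitoring script, it can help to group them by how much attention they need. The sketch below uses the status names listed above; the grouping itself and the function name `classify_status` are assumptions made for illustration, not behavior defined by Swarm.

```python
# Status names come from the list above. The four groups are an
# illustrative assumption, not a classification defined by Swarm.
NOMINAL = {"ok", "idle"}
IN_PROGRESS = {"retiring", "mounting", "finalizing", "maintenance", "initializing"}
DONE = {"retired"}
NEEDS_ATTENTION = {"unavailable", "error", "offline"}


def classify_status(status: str) -> str:
    """Map a reported status string to a coarse, hypothetical category."""
    s = status.strip().lower()
    if s in NOMINAL:
        return "nominal"
    if s in IN_PROGRESS:
        return "in progress"
    if s in DONE:
        return "done"
    if s in NEEDS_ATTENTION:
        return "needs attention"
    return "unknown"


for state in ("ok", "retiring", "offline"):
    print(state, "->", classify_status(state))
```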

Details Tab

Each detail row displays the disk name, status, total capacity, amount of used journal space, largest stream size (in MB), model number, serial number, ID, firmware version, and encryption status. The largest stream size displays as 0 if the largest stream on the disk is smaller than 1 MB.

Tip

Watch the Streams count to track the progress when retiring a disk.

Logs Tab

The Logs tab lists the last 10 logged announcements in the cluster as well as the last 10 logged critical alerts. The tab itself includes a count of these messages, and appears red if any are errors.

Hot-Swapping Disks: Messages display on this tab if a disk is removed from or inserted into a running node. This feature, referred to as Hot Swapping and Plugging Disks, allows removing failed disks for analysis or adding storage capacity to a node at any time.

The following messages appear if a volume is added and then removed:

mounted /dev/sdb, volumeID is 561479FB832DCC526B1D7EDCD06B83E1
removed /dev/sdb, volumeID was 561479FB832DCC526B1D7EDCD06B83E1
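
If these messages are collected from the syslog, a small parser can pair each mount with its removal. The sketch below assumes the message text has exactly the shape shown above (an action, a device path, and a 32-character hexadecimal volume ID); the function name `parse_hotswap` is hypothetical, and real log lines may carry prefixes such as timestamps, which the search tolerates.

```python
import re

# Pattern derived from the two sample messages above; any other message
# shape is simply not matched.
HOTSWAP_RE = re.compile(
    r"(?P<action>mounted|removed) (?P<device>/dev/[^,\s]+), "
    r"volumeID (?:is|was) (?P<volume_id>[0-9A-Fa-f]{32})"
)


def parse_hotswap(line: str):
    """Return a dict with action, device, and volume_id, or None if no match."""
    match = HOTSWAP_RE.search(line)
    return match.groupdict() if match else None


sample = "mounted /dev/sdb, volumeID is 561479FB832DCC526B1D7EDCD06B83E1"
print(parse_hotswap(sample))
```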

Message Levels

These messages appear at the announcement level. Additional debug level messages appear in the syslog.

Driver Message Tab

dmesg (driver message) prints the message buffer of the kernel. These driver messages are useful for diagnosing a Swarm issue when a system panic or error occurs.

Limited to 1000

dmesg is a circular buffer; it shows the last 1000 kernel messages.
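
To illustrate what a circular buffer implies for troubleshooting, the sketch below models one with Python's `collections.deque`. It is only an analogy for the behavior described above, not how the kernel or Swarm implements the buffer: once the limit is reached, each new message displaces the oldest one.

```python
from collections import deque

BUFFER_SIZE = 1000  # limit stated above

# A deque with maxlen drops the oldest entry when a new one is appended
# to a full buffer, which is the defining behavior of a circular buffer.
ring = deque(maxlen=BUFFER_SIZE)
for i in range(1, 1501):
    ring.append(f"kernel message {i}")

print(len(ring))  # 1000
print(ring[0])    # kernel message 501
print(ring[-1])   # kernel message 1500
```

The practical consequence is that early boot messages may no longer be visible on a node that has logged many kernel messages since startup.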

Hardware Info Tab

hwinfo (hardware information) is the Linux hardware detection tool output. This tool probes for the hardware present in the system and displays detailed information about various hardware components in human-readable format.

Memory Tab

The usage report on the Memory tab provides detailed information to help with troubleshooting insufficient memory.

Each node uses memory to hold an index of the objects stored in it. If a node runs out of index space, it stops storing new content until space is freed through deletions. A full node continues to respond to client read requests for data already present. Each named or alias object requires two index slots. Erasure coding typically requires more memory than replication; exactly how much depends on the encoding.
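
As a rough back-of-the-envelope aid, the two-slots rule can be turned into a slot estimate. The sketch below is an illustration only: the function name is hypothetical, the assumption that every other object uses one slot is not stated in this document, and the additional overhead of erasure coding is deliberately ignored.

```python
def estimate_index_slots(named_or_alias: int, other: int,
                         slots_per_other: int = 1) -> int:
    """Estimate index slots in use on a node.

    Named and alias objects use two slots each (stated above).
    slots_per_other = 1 is an ASSUMPTION for all remaining objects;
    erasure-coding overhead is not modeled.
    """
    return 2 * named_or_alias + slots_per_other * other


# 1,000 named objects and 500 other objects under the assumptions above.
print(estimate_index_slots(1000, 500))  # 2500
```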

Best Practice

If a node runs out of index slots through normal activity, increase the memory in the node.

Statistics Tab

The Statistics tab rolls up a detailed, expandable report combining Health Processor (HP), Communications (cluster network), and Memory usage counts and values, to help with analysis and troubleshooting.

The health processor runs on each Swarm node to check the status of streams, performing a wide range of actions.

Advanced Tab

The Advanced tab allows dynamically changing machine-level logging levels and working with Swarm's management API, both through a hands-on HAL browser and a Swagger visualizer.

The Health Data is the raw JSON content of the health report the cluster sends to DataCore Support. See Health Data to Support.

The log levels can be reset from this tab as well as from the Logs tab.

Restarting or Shutting Down a Chassis

The gear icon at the top of the page allows restarting or shutting down the chassis. A node shut down or rebooted by an Administrator appears with a Maintenance state on other nodes in the cluster.

Retiring a Chassis

Retire the chassis when replacing Swarm storage volumes for regular maintenance or to upgrade the cluster chassis with higher capacity disks. Retiring a chassis copies all objects to other chassis in the cluster, allowing safe removal of the chassis disks without risking any data loss.

Important

Verify the cluster meets the following requirements before retiring a chassis:

  • Has enough capacity for the objects on the retiring chassis to replicate elsewhere.

  • Has enough remaining nodes to replicate the objects with one replica on any given node.
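
The two requirements above can be expressed as a simple pre-check. The sketch below is a hypothetical illustration: the `Chassis` record, its fields, and the `retire_precheck` function are invented for this example and the figures must be gathered by hand or from monitoring data, since no Swarm API is used here. It also treats raw free space as usable, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Chassis:
    # Hypothetical record; all values are supplied by the caller.
    name: str
    nodes: int           # number of Swarm nodes on the chassis
    capacity_bytes: int  # total capacity
    used_bytes: int      # space occupied by stored objects


def retire_precheck(cluster, retiring, replicas):
    """Return a list of problems; an empty list means both checks passed."""
    remaining = [c for c in cluster if c.name != retiring.name]
    problems = []

    # Requirement 1: objects on the retiring chassis must fit elsewhere.
    free_elsewhere = sum(c.capacity_bytes - c.used_bytes for c in remaining)
    if free_elsewhere < retiring.used_bytes:
        problems.append("not enough free capacity on the remaining chassis")

    # Requirement 2: one replica per node needs at least `replicas` nodes.
    remaining_nodes = sum(c.nodes for c in remaining)
    if remaining_nodes < replicas:
        problems.append("not enough remaining nodes for one replica per node")

    return problems


a = Chassis("chassis-a", nodes=2, capacity_bytes=100, used_bytes=60)
b = Chassis("chassis-b", nodes=2, capacity_bytes=100, used_bytes=30)
print(retire_precheck([a, b], retiring=a, replicas=2))  # []
```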

To initiate a retire, select the Retire option under the gear icon at the top of the Chassis Details page. When initiating the retire, choose either a minimally disruptive retire limited to the chassis being retired, or an accelerated retire that uses all nodes in the cluster to replicate objects on the retiring chassis as quickly as possible.

Note

The cluster-wide retire may impact performance as it puts additional load on the cluster.

Replica Protection

Retire succeeds if objects can be replicated elsewhere in the cluster. The Retire action does not remove an object until it can guarantee at least two replicas exist in the cluster or the existing number of replicas matches the policy.replicas min parameter value.
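
The removal rule above can be written as a small predicate. This is one reading of the sentence, offered as a sketch rather than Swarm's actual logic: the function name is hypothetical, and it assumes the replica count refers to copies held outside the retiring chassis.

```python
def may_remove_from_retiring(replicas_elsewhere: int,
                             policy_replicas_min: int) -> bool:
    """Illustrative reading of the replica-protection rule.

    ASSUMPTION: replicas_elsewhere counts copies outside the retiring
    chassis. Removal is allowed once two such copies exist, or once the
    count reaches the policy.replicas min value.
    """
    if policy_replicas_min < 1:
        raise ValueError("policy.replicas min is expected to be at least 1")
    return replicas_elsewhere >= 2 or replicas_elsewhere >= policy_replicas_min


print(may_remove_from_retiring(1, policy_replicas_min=2))  # False
print(may_remove_from_retiring(2, policy_replicas_min=3))  # True
print(may_remove_from_retiring(1, policy_replicas_min=1))  # True
```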

A retiring chassis accepts no new or updated objects. Each chassis volume's state changes to Retired and Swarm no longer uses the volume after all objects are copied elsewhere. The volume can be safely removed at this point.

Rate of the Retire: Swarm calculates the retire rate over the last hour, which it publishes using SNMP as retireRatePerHour. This covers the entire chassis regardless of how many volumes are being retired.
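
Combined with the Streams count shown for the chassis, the retire rate allows a rough estimate of the time remaining. The sketch below assumes that retireRatePerHour is expressed in streams per hour, which this document does not state; the function name is hypothetical, and both inputs are read manually rather than fetched over SNMP.

```python
def estimate_hours_remaining(streams_remaining: int,
                             retire_rate_per_hour: float):
    """Estimate hours left in a retire, or None if the rate is not positive.

    ASSUMPTION: retire_rate_per_hour is in streams per hour. The rate is
    averaged over the last hour, so the estimate shifts as load changes.
    """
    if retire_rate_per_hour <= 0:
        return None
    return streams_remaining / retire_rate_per_hour


print(estimate_hours_remaining(120_000, 40_000))  # 3.0
```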

Canceling the Retire: Cancel an in-process retire by selecting the Cancel Retire option under the gear icon at the top of the Chassis Details page. A retire can be canceled while one or more disks in the chassis have a Retiring status.

Retiring a Disk (Volume)

Disk-level retires are useful for targeting bad (slow) disks and for working around capacity that is too limited to retire an entire chassis. If a disk retires automatically because of I/O errors, check the diagnostic data collected in the logs. (v11.1)

To retire a volume, locate and click the gear icon in the row for the affected disk.

Select the speed of the retire. The fastest method demands maximum effort from the cluster to move the content.

Rate of the Retire: When Swarm completes a retire task on a disk, it generates an announce-level message reporting the overall duration and rate of the retire. (v11.0)
See https://perifery.atlassian.net/wiki/spaces/public/pages/2443811993/Retiring+Hardware#Retire-Rate.

Canceling the Retire: Click the gear icon in the row for the affected disk and select the Cancel retire command.

Identifying a Disk

When attempting to identify a failed or failing disk, it helps to enable the LED disk light for that disk. Click the disk light toggle in the disk's display row to flash the disk light for a specific disk.

Note

Disk lights remain on until manually turned off, so return to the Chassis Details page and set the disk light switch to Off.