Chassis Details

Detailed hardware and status information for each chassis (physical or virtual machine) is displayed on the hardware details page.

Tip

The Streams value counts the total number of Swarm-managed data components (such as replicas and segments), not logical objects (such as video files).

Status States: These are the states reported for hardware in a cluster and how to interpret them:

Status	Nodes / Chassis	Volumes / Disks
ok	Nominal	Nominal
idle	Nominal, but the node is idle	Nominal, but idle
retiring	One or more volumes are offloading streams to the cluster due to retire	Offloading streams to the cluster due to retire
retired	All volumes are retired	Empty of objects and not taking new ones
unavailable		In an error state
error	Errors are reported on the node (hardware or software)
mounting	One or more volumes are mounting	Mounting at startup/discovery
finalizing	Can appear while the node is rebooting or shutting down, as the node finishes sessions in process
maintenance	A 3-hour window during an administrative reboot or shutdown where Failed Volume Recovery does not run
initializing	Volumes have mounted but the node is not yet ready for client activity.
offline	Node is known to be offline but not in maintenance
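For monitoring scripts, the states above can be grouped by how much attention they need. The following is a minimal sketch only: the status names come from the table, but the three attention levels and the grouping are assumptions made for illustration, not something Swarm defines.

```python
from enum import Enum


class Attention(Enum):
    """Hypothetical attention levels (not defined by Swarm)."""
    NONE = "none"
    WATCH = "watch"
    ACT = "act"


# Status names are taken from the table above; the grouping is an assumption.
TRIAGE = {
    "ok": Attention.NONE,
    "idle": Attention.NONE,
    "retiring": Attention.WATCH,
    "retired": Attention.WATCH,
    "mounting": Attention.WATCH,
    "finalizing": Attention.WATCH,
    "maintenance": Attention.WATCH,
    "initializing": Attention.WATCH,
    "unavailable": Attention.ACT,
    "error": Attention.ACT,
    "offline": Attention.ACT,
}

_ORDER = [Attention.NONE, Attention.WATCH, Attention.ACT]


def triage(status: str) -> Attention:
    """Map a reported status to an attention level; unknown states need action."""
    return TRIAGE.get(status.strip().lower(), Attention.ACT)


def worst(statuses) -> Attention:
    """Return the most severe attention level among the given statuses."""
    return max((triage(s) for s in statuses), key=_ORDER.index, default=Attention.NONE)
```

For example, `worst(["ok", "retiring", "error"])` returns `Attention.ACT`, because a single error outweighs the nominal and retiring entries.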

Details Tab

Each detailed row displays a disk name, status, total capacity, amount of used journal space, the largest stream size it contains in MB, model number, serial number, ID, firmware version, and encryption status. The largest stream size displays as 0 if the largest stream on the disk is smaller than 1 MB.

Tip

Watch the Streams count to track the progress when retiring a disk.

Logs Tab

The Logs tab lists the last 10 logged announcements in the cluster as well as the last 10 logged critical alerts. The tab itself includes a count of these messages and appears red if any are errors.

Use the Clear command to remove from the display any log messages that have been addressed or are not of interest.
Click the Log Level (gear) settings command to view and change the log levels set for this machine.

Hot-Swapping Disks: Messages display on this tab if a disk is removed from or inserted into a running node. This feature, referred to as Hot Swapping and Plugging Disks, allows removing failed disks for analysis or adding storage capacity to a node at any time.

The following messages appear when a volume is added and then removed:

mounted /dev/sdb, volumeID is 561479FB832DCC526B1D7EDCD06B83E1
removed /dev/sdb, volumeID was 561479FB832DCC526B1D7EDCD06B83E1
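If these messages are collected from the log for auditing, they can be parsed with a short script. The sketch below assumes the messages always follow the two sample lines above; the pattern, the `HotSwapEvent` type, and the `parse_hot_swap` function are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple, Optional


class HotSwapEvent(NamedTuple):
    action: str     # "mounted" or "removed"
    device: str     # for example "/dev/sdb"
    volume_id: str  # hexadecimal volume ID, upper-cased


# Pattern derived only from the two sample messages; real logs may differ.
_PATTERN = re.compile(
    r"\b(?P<action>mounted|removed) (?P<device>/dev/[^,\s]+), "
    r"volumeID (?:is|was) (?P<volume_id>[0-9A-Fa-f]+)"
)


def parse_hot_swap(line: str) -> Optional[HotSwapEvent]:
    """Return the hot-swap event in a log line, or None if there is none."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    return HotSwapEvent(
        match["action"], match["device"], match["volume_id"].upper()
    )
```

Pairing a `mounted` event with a later `removed` event that carries the same volume ID shows how long a given volume was present in the node.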

Message Levels

These messages appear at the announcement level. Additional debug level messages appear in the syslog.

Driver Message Tab

dmesg (driver message) prints the message buffer of the kernel. These driver messages are useful for diagnosing a Swarm issue when a system panic or error occurs.

Limited to 1000

The kernel message buffer that dmesg prints is circular, so this tab shows only the last 1000 kernel messages.
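The effect of a fixed-size circular buffer can be illustrated with a bounded queue. This is a sketch of the general technique, not of how the kernel or Swarm implements it; the message text is invented for the example.

```python
from collections import deque

MAX_MESSAGES = 1000  # the limit stated above

# A deque with maxlen silently discards the oldest entry once it is full.
buffer = deque(maxlen=MAX_MESSAGES)

for i in range(1500):
    buffer.append(f"kernel message {i}")

oldest_kept = buffer[0]    # "kernel message 500"
newest_kept = buffer[-1]   # "kernel message 1499"
```

After 1500 messages, the first 500 are gone. When diagnosing a panic, capture the driver messages soon after the event, before newer messages displace the relevant ones.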

Hardware Info Tab

This tab shows the output of hwinfo (hardware information), the Linux hardware detection tool. The tool probes for the hardware present in the system and displays detailed information about the various hardware components in a human-readable format.

Memory Tab

The usage report on the Memory tab provides detailed information to help with troubleshooting insufficient memory.

Each node uses memory to hold an index of the objects stored in it. If a node runs out of index space, it stops storing new content until space is freed through deletions. A full node continues to respond to client read requests for data already present. Each named or alias object requires two index slots. Erasure coding typically requires more memory than replication; exactly how much depends on the encoding.
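The slot arithmetic can be sketched as follows. The two-slot cost for named and alias objects comes from the text above; the one-slot cost for all other objects is an assumption made for illustration, and erasure-coded content is left out because its cost depends on the encoding.

```python
SLOTS_PER_NAMED_OR_ALIAS = 2  # stated above
SLOTS_PER_OTHER = 1           # assumption: not stated in this document


def index_slots_needed(named_or_alias: int, other: int = 0) -> int:
    """Estimate the index slots used by replicated objects on one node."""
    if named_or_alias < 0 or other < 0:
        raise ValueError("object counts must be non-negative")
    return named_or_alias * SLOTS_PER_NAMED_OR_ALIAS + other * SLOTS_PER_OTHER


def slots_remaining(total_slots: int, named_or_alias: int, other: int = 0) -> int:
    """Return the free slots; zero or less means the node cannot accept new content."""
    return total_slots - index_slots_needed(named_or_alias, other)
```

Under these assumptions, a node holding 1000 named objects and 500 other objects uses 2500 slots. The Memory tab remains the authoritative source for the actual figures.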

Best Practice

If a node runs out of index slots through normal activity, increase the memory in the node.

Statistics Tab

The Statistics tab rolls up a detailed, expandable report combining Health Processor (HP), Communications (cluster network), and Memory usage counts and values, to help with analysis and troubleshooting.

The health processor runs on each Swarm node to check the status of streams, performing a wide range of actions:

Sends replica checks to the other nodes and adds or trims replicas based on responses
Deletes streams requiring deletion according to life points
Provides a safety net to remove older alias and named stream versions when a newer version is found in the cluster (which can happen when nodes are restored)
Checks each stream for data corruption using comparison with the stored stream hash
Moves the stream on disk if defragmentation is needed
Verifies the disk index is consistent with the streams found on disk
Verifies replicas are distributed properly in the cluster
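The corruption check in the list above follows a common pattern: recompute a digest of the stored bytes and compare it with the digest recorded earlier. The sketch below shows that pattern only. The function name and the choice of SHA-256 are assumptions; this document does not state which hash Swarm stores or how the health processor performs the comparison.

```python
import hashlib


def stream_is_intact(data: bytes, stored_hex_digest: str,
                     algorithm: str = "sha256") -> bool:
    """Return True if the data still matches the digest recorded for it.

    The algorithm default is an assumption chosen for illustration.
    """
    actual = hashlib.new(algorithm, data).hexdigest()
    return actual == stored_hex_digest.strip().lower()


# Usage: record the digest at write time, then verify it later.
original = b"example stream content"
recorded = hashlib.sha256(original).hexdigest()

intact = stream_is_intact(original, recorded)                    # True
corrupted = stream_is_intact(b"exbmple stream content", recorded)  # False
```

A mismatch indicates only that the bytes changed; repairing the stream from another replica is a separate step.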

Advanced Tab

The Advanced tab allows dynamically changing machine-level logging levels and working with Swarm's management API, both through a hands-on HAL browser and through a Swagger visualizer.

The Health Data is the raw JSON content of the health report the cluster sends to DataCore Support. See Health Data to Support.

The log levels can be reset from this tab as well as from the Logs tab.

Restarting or Shutting Down a Chassis

The gear icon at the top of the page allows restarting or shutting down the chassis. A node shut down or rebooted by an Administrator appears with a Maintenance state on other nodes in the cluster.

Retiring a Chassis

Retire the chassis when replacing Swarm storage volumes for regular maintenance or to upgrade the cluster chassis with higher capacity disks. Retiring a chassis copies all objects to other chassis in the cluster, allowing safe removal of the chassis disks without risking any data loss.

Important

Verify the cluster meets the following requirements before retiring a chassis:

Has enough capacity for the objects on the retiring chassis to replicate elsewhere.
Has enough remaining nodes to replicate the objects with one replica on any given node.
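Both requirements can be checked with simple arithmetic before starting. The sketch below is a planning aid under stated assumptions: the function and parameter names are hypothetical, the inputs must be gathered by the administrator, and it ignores factors such as erasure coding overhead and per-node capacity distribution.

```python
from typing import List


def unmet_retire_requirements(used_bytes_on_chassis: int,
                              free_bytes_elsewhere: int,
                              remaining_nodes: int,
                              replicas_per_object: int) -> List[str]:
    """Return the requirements that are not met; an empty list means proceed."""
    problems = []
    # Requirement 1: the rest of the cluster can absorb the chassis content.
    if free_bytes_elsewhere < used_bytes_on_chassis:
        problems.append("not enough capacity elsewhere in the cluster")
    # Requirement 2: each replica of an object can land on a different node.
    if remaining_nodes < replicas_per_object:
        problems.append("not enough remaining nodes for one replica per node")
    return problems
```

For example, with 2 replicas per object and only 1 node remaining after the retire, the second requirement fails regardless of free capacity.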

To initiate a retire, select the Retire option under the gear icon at the top of the Chassis Details page. Then choose either a minimally disruptive retire limited to the chassis being retired, or an accelerated retire that uses all nodes in the cluster to replicate the objects on the retiring chassis as quickly as possible.

Note

The cluster-wide retire may impact performance as it puts additional load on the cluster.

Replica Protection

Retire succeeds if objects can be replicated elsewhere in the cluster. The Retire action does not remove an object until it can guarantee at least two replicas exist in the cluster or the existing number of replicas matches the policy.replicas min parameter value.

A retiring chassis accepts no new or updated objects. After all objects are copied elsewhere, each chassis volume's state changes to Retired and Swarm no longer uses the volume. The volume can be safely removed at this point.

Rate of the Retire: Swarm calculates the retire rate over the last hour, which it publishes using SNMP as retireRatePerHour. This covers the entire chassis regardless of how many volumes are being retired.
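The published rate can be turned into a rough time estimate. This sketch assumes that the amount remaining is expressed in the same unit that retireRatePerHour reports (this document does not state the unit) and that the rate over the last hour stays constant, which it may not. The function name is hypothetical.

```python
from typing import Optional


def estimated_hours_remaining(remaining: float,
                              retire_rate_per_hour: float) -> Optional[float]:
    """Estimate the hours left in a retire, or None if no progress is being made.

    Both arguments must use the same unit (assumed, not documented here).
    """
    if remaining < 0:
        raise ValueError("remaining must be non-negative")
    if retire_rate_per_hour <= 0:
        return None  # a stalled retire has no meaningful estimate
    return remaining / retire_rate_per_hour
```

With 120000 units remaining and a rate of 40000 per hour, the estimate is 3.0 hours. Treat the result as a guide only, since the rate changes with cluster load.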

Canceling the Retire: Cancel an in-process retire by selecting the Cancel Retire option under the gear icon at the top of the Chassis Details page. A retire can be canceled while one or more disks in the chassis have a Retiring status.

Retiring a Disk (Volume)

Disk-level retires are useful for targeting bad (slow) disks and for working around capacity that is too limited to retire an entire chassis. If a disk retires automatically because of I/O errors, check the diagnostic data collected in the logs. (v11.1)

To retire a volume, locate the row for the affected disk and click its gear icon.

Select the speed of the retire. The fastest method demands maximum effort from the cluster to move the content.

Rate of the Retire: When Swarm completes a retire task on a disk, it generates an announce-level message reporting the overall duration and rate of the retire. (v11.0)
See https://perifery.atlassian.net/wiki/spaces/public/pages/2443811993/Retiring+Hardware#Retire-Rate.

Canceling the Retire: Click the gear icon in the row for the affected disk and select the Cancel retire command.

Identifying a Disk

When attempting to identify a failed or failing disk, it helps to enable the LED disk light for that disk. To flash the disk light for a specific disk, click the disk light toggle in the disk's display row.

Note

Disk lights remain on until manually turned off, so return to the Chassis Details page and set the disk light switch to Off when finished.