Cluster Protection Planning

With the density-friendly architecture introduced in Swarm 10, the cluster structure and its protections have changed:

Requirements and Guidelines

Observe the following data protection requirements and guidelines when designing the Swarm cluster:

  • Small Clusters: Verify the following settings if running 10 or fewer Swarm nodes (a minimum of three is required in production).

    • policy.replicas: The min and default values for the number of replicas to keep in the cluster must not exceed the number of nodes. For example, a 3-node cluster may have only min=2 or min=3.

    • EC Encoding: Verify there are enough nodes to support the cluster's encoding (policy.ecEncoding). For EC k:p encoded writes to succeed with fewer than (k+p)/p nodes, use the lower protection level, ec.protectionLevel=volume. (These requirements are re-checked in the sketch at the end of these guidelines.)

Best Practice

Keep at least one physical machine in the cluster beyond the minimum number needed. This allows one machine to be taken down for maintenance without falling below the minimum.

Important

If any of these settings need to change, do so before upgrading to Swarm 10.

  • Cluster in a Box: Swarm supports a "cluster in a box" configuration as long as that box is running a virtual machine host and Swarm instances are running in three or more VMs. Each VM boots separately and has its own IP address. Follow the recommendations for small clusters, substituting VMs for nodes. For two physical machines, use the "cluster in a box" configuration, but with three or more, move to direct booting of Swarm.

  • Subclusters: All nodes remain in the single, default subcluster unless manually grouped into named subclusters by setting node.subcluster across the nodes. Do this to allow Swarm to distribute content according to groupings of machines that share a failure mode, such as being in the same building in a widely distributed cluster.

Caution

Setting ec.protectionLevel=subcluster without creating subclusters causes a critical error and lowers the protection level to 'node'.

  • Replication: For data protection reasons, Swarm does not store multiple replicas of an object on the same node. If using fewer physical machines than are required for the replication scheme, use a virtualization/containerization technology to run multiple Swarm nodes on the same hardware appliance. Do not specify too many replicas: setting the number of replicas equal to the number of storage nodes can lead to uneven loading when responding to volume recoveries.

  • Erasure-coding: Best practice is to use ec.protectionLevel=node, which distributes segments across the cluster's physical/virtual machines. Do not use ec.protectionLevel=subcluster unless subclusters are defined and you are sure enough nodes (machines) exist to support the specified EC encoding. The lowest level, ec.protectionLevel=volume, allows EC writes to succeed in a small cluster with fewer than (k+p)/p nodes. See the next section for details.
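The interplay between node count, policy.replicas, and the EC encoding can be sanity-checked with simple arithmetic. The following Python sketch is a hypothetical helper, not part of Swarm; its parameter names merely mirror the settings discussed above, and it only re-applies the rules from these guidelines:

```python
import math

def check_small_cluster(nodes, replicas_min, replicas_default, k, p,
                        protection_level="node"):
    """Re-apply the small-cluster rules above to a planned configuration.

    Hypothetical helper, not a Swarm API: the parameter names only mirror
    the settings discussed in this section (policy.replicas min/default,
    policy.ecEncoding k:p, ec.protectionLevel).
    """
    problems = []

    # policy.replicas: min and default must not exceed the number of nodes.
    if replicas_min > nodes or replicas_default > nodes:
        problems.append("policy.replicas min/default exceeds the node count")

    # Replicas equal to the node count can lead to uneven loading during
    # volume recoveries.
    if replicas_default == nodes:
        problems.append("replicas == nodes: expect uneven loading on recovery")

    # EC writes at ec.protectionLevel=node need at least ceil((k+p)/p) nodes;
    # with fewer, this section suggests dropping to ec.protectionLevel=volume.
    if protection_level == "node" and nodes < math.ceil((k + p) / p):
        problems.append(
            f"{nodes} nodes is fewer than ceil((k+p)/p) = "
            f"{math.ceil((k + p) / p)} for {k}:{p}; "
            "consider ec.protectionLevel=volume")

    return problems

# Example: a 3-node cluster with 5:2 encoding.
print(check_small_cluster(nodes=3, replicas_min=2, replicas_default=2, k=5, p=2))
```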

Choosing EC Encoding and Sizing

The EC encoding defines the way Swarm divides and stores large objects: 

  • k:p: Defines the encoding, where 

    • k (data segments) drives the footprint: An EC object's data footprint in the cluster approximates size * (k+p)/k (see the sketch at the end of this section).

    • p (parity segments) is protection: Choose the protection level needed, two or higher; p=2 and p=3 are most common.

    • k+p (total segments) is the count of segments: The original object can be reconstructed if any p segments are lost.

  • Manifests: Segments are tracked in a manifest, which is itself protected with p+1 replicas, distributed across the cluster.

  • Sets of Sets: Very large EC objects (or incrementally written objects) are broken up into multiple EC sets because any segment that's over the size limit triggers another level of EC. Each set has its own k:p encoding, and the overall request combines them all in sequence.

See https://perifery.atlassian.net/wiki/spaces/public/pages/2443811171
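To make the footprint and protection arithmetic concrete, the sketch below summarizes a k:p encoding for a single object. It is illustrative only and simply applies the approximations given in this section (footprint of roughly size * (k+p)/k, a manifest kept as p+1 replicas):

```python
def ec_profile(size_bytes, k, p):
    """Summarize a k:p encoding for a single object, per the formulas above."""
    return {
        "total_segments": k + p,                    # k data + p parity segments
        "tolerated_segment_losses": p,              # reconstructable if any p are lost
        "approx_footprint_bytes": size_bytes * (k + p) / k,
        "manifest_replicas": p + 1,
    }

# Compare two encodings for a 1 GB object.
for k, p in [(5, 2), (9, 3)]:
    print(f"{k}:{p} ->", ec_profile(1_000_000_000, k, p))
```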

How Many Nodes are Needed?

The number of nodes required in the cluster depends on both the encoding scheme and the protection profile being targeted:

EC Profile | Formula | Example: 5:2 | Notes
---------- | ------- | ------------ | -----
Manifest minimum | p+1 | 2 + 1 = 3 | Basic requirement for storing manifests.
Segment minimum | ceil((k+p)/p) | ceil((5 + 2) / 2) = 4 | Objects can be read (but not written) if one node is lost or offline. For 5:2, four nodes allow a 2+2+2+1 segment distribution because Swarm allows two segments per node.
Recommended protection | ceil((k+p)/p + p) | ceil((5 + 2) / 2 + 2) = 6 | Objects can be read and written if one node is lost or offline.
High protection | k+p | 5 + 2 = 7 | Objects can be read and written even if two entire nodes are lost or offline.
High performance | (k+p)*2 | (5 + 2) × 2 = 14 | Recommended for best performance and load distribution (load-balancing becomes easier as clusters expand).
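The formulas in the table are easy to recompute for other encodings. The following sketch (illustrative only, not a Swarm tool) evaluates each profile for a given k:p:

```python
import math

def node_requirements(k, p):
    """Node counts for a k:p encoding, following the formulas in the table above."""
    return {
        "manifest_minimum": p + 1,
        "segment_minimum": math.ceil((k + p) / p),
        "recommended_protection": math.ceil((k + p) / p + p),
        "high_protection": k + p,
        "high_performance": (k + p) * 2,
    }

print(node_requirements(5, 2))  # matches the 5:2 column: 3, 4, 6, 7, 14
print(node_requirements(9, 3))  # for comparison: 4, 4, 7, 12, 24
```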

How Many Volumes are Needed?

A minimum of k+p volumes is needed in the cluster, assuming ec.protectionLevel=volume (which is not recommended). For ec.protectionLevel=node, a minimum of p volumes is needed per node. For recommended volume protection, provision at least p+1 volumes per node.
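As an illustration using the 5:2 example above: at ec.protectionLevel=node on a six-node cluster, the recommended p+1 works out to at least 3 volumes per node, or 18 volumes across the cluster, whereas ec.protectionLevel=volume needs only the k+p = 7 volume minimum.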

Optimizing Erasure Coding

What Improves EC Performance?

  1. Good-Enough Encoding: Do not over-protect. The more nodes an EC write involves, the more constraints must be satisfied for it to succeed and the more overhead is created (see the comparison sketch after this list).

    • Keeping k+p small reduces the overhead of EC writes.

    • Keeping k small reduces the overhead of EC reads.

  2. Consistent Scaling: The rule of thumb when scaling erasure coding is to add one additional node for each ceil((k+p)/p)+1 nodes.

  3. Faster Nodes: As a rule, an EC read or write is limited by the slowest node involved, and there is a significant fixed cost to setting up connections.

  4. More Nodes: Having more nodes in the cluster than needed for an encoding allows the cluster to better load-balance.
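As a rough illustration of these trade-offs, the sketch below compares a few encodings, assuming a write produces k+p segments and a read reassembles the object from any k segments, with the footprint growing by a factor of (k+p)/k:

```python
# Illustrative comparison only: keeping k+p small reduces write overhead,
# keeping k small reduces read overhead, and (k+p)/k is the footprint factor.
for k, p in [(4, 2), (5, 2), (9, 3)]:
    print(f"{k}:{p}  segments written: {k + p}, "
          f"segments read: {k}, footprint factor: {(k + p) / k:.2f}")
```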

What Helps Balancing?

  1. Do not Run Full: This is the most important principle, so be ready to scale the cluster proactively. Imbalance typically develops when a cluster is allowed to fill up before additional, empty nodes are provisioned.

  2. More Nodes: Larger clusters have an easier time load balancing because not every node needs to be involved in each EC write. A cluster with exactly k+p nodes fills all of them at the same rate, but if one node loses a volume, that node fills faster and blocks fully-distributed writes, even though ample space may remain on other nodes.

Rebalancing a cluster that is heavy on EC objects takes Swarm several times longer than rebalancing fully replicated content, because inadequately distributed EC segments can only be moved by the health processors on other nodes and their placement is subject to many constraints.

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.