Migrating from Traditional Storage

Advantages of Object Storage

Object storage provides capabilities that make several aspects of traditional file systems obsolete.

  • It works as a unified, self-scaling, self-protecting, and self-healing pool of storage that requires no backup (and may even be too large to back up in a traditional way).

  • It offers enhanced metadata (data about data), which can be customized and leveraged programmatically.

  • It includes large-scale, high-performance searching based on rich metadata.

Never-Ending Storage Systems

In the end, all traditional file systems and storage systems run up against hard limits. Whether at the volume/block layer or at the partition level, there is always an upper limit on how large a LUN (logical unit number) can be made, or a point where a partition becomes unmanageable due to its size. Object storage offers an effectively limitless namespace and storage layer to house growing data.

Large data LUNs are created by aggregating multiple disks using hardware or software RAID technologies and accessing them over a fast interconnect, such as Fibre Channel or iSCSI. These RAID volumes have size limits and durability characteristics that create challenges for LUN sizing. Different SAN manufacturers have different limits on the ideal sizes and distribution of LUNs. IT administrators must weigh data protection level against speed each time they commission new storage. It is rarely a matter of making the largest volume possible and offering it to users to carve up as they like, so dynamically scaling these systems is challenging, if not impossible.

In contrast, Swarm clusters are unified volumes of storage with the ability to share a single protection profile or apply different protection profiles within the cluster. Add new hardware to the cluster and allow Swarm to scale.

Bullet-Proof Protection

Data loss at a small scale, such as one or two disk failures in a single RAID volume in a SAN or local RAID group, is survivable: replace the failed disks and accept decreased performance while parity is rebuilt, a process that can take hours or days depending on the size of the volume and the amount of data on it. Multi-disk failures are common enough, and because hard disk capacities keep increasing, rebuild times, datasets, and backup windows keep growing as well.

Swarm object storage is inherently designed to sustain and heal from multiple disk failures and, depending on the configuration, multiple server failures. In addition to content protection policies allowing precise cluster, domain, and bucket-level controls, Swarm offers additional layers of data protection to meet an organization's requirements.

Rich Metadata

Information about the data is now as important as the data itself for analytics, retrieval, and value-add processes. With a traditional file system, such as NTFS or ext4, metadata for a file is fixed by the file system and limited to system-defined information (access times, owner, attributes). With Swarm, up to 32 KB of custom metadata can be stored with each object, which is a tremendous amount of text-based information.
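For illustration, custom metadata travels as HTTP headers on the write request. The sketch below builds such headers while respecting the 32 KB budget; the field names are hypothetical, and the x-*-meta naming pattern should be verified against the SCSP documentation for the Swarm version in use.

```python
# Sketch: attaching custom metadata to an object via HTTP headers.
# Header names follow the "x-*-meta" pattern; the fields and values
# here are invented examples, not a fixed schema.

def build_metadata_headers(fields, limit=32 * 1024):
    """Turn a dict of custom fields into x-*-meta headers,
    enforcing the 32 KB custom-metadata budget."""
    headers = {f"x-{name}-meta": str(value) for name, value in fields.items()}
    size = sum(len(k) + len(v) for k, v in headers.items())
    if size > limit:
        raise ValueError(f"custom metadata is {size} bytes; limit is {limit}")
    return headers

# Example: the kind of rich metadata a digital camera might supply.
camera_fields = {
    "camera-make": "Nikon",
    "camera-model": "D850",
    "gps": "40.7128,-74.0060",
}
headers = build_metadata_headers(camera_fields)
# These headers would then accompany the HTTP write of the object.
```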

A growing number of specialty file formats have emerged that let a file carry critical information about its contents, much like a passport. Consider the richness of data modern digital cameras store with each photo, capturing the location, camera make and model, resolution, speed, exposure, and more. In most of these cases the file contains the metadata itself, and the application used to view the file restricts what metadata is visible to the user.

In the same way, extended metadata becomes part of each object stored in Swarm and so cannot be lost. In a Swarm cluster, all metadata associated with a file is stored as header information on the file itself. This header information is viewed using an HTTP HEAD request on the file, requiring no special drivers or applications.
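As a sketch of reading metadata back: the header block below is a hypothetical HEAD response, and the parser simply picks out the x-*-meta headers.

```python
# Sketch: extracting custom metadata from an HTTP HEAD response.
# The raw header block is a made-up example; real responses vary.

def parse_custom_metadata(raw_headers):
    """Extract x-*-meta headers from a raw HTTP header block."""
    meta = {}
    for line in raw_headers.splitlines():
        if ":" not in line:
            continue  # skip the status line and blank lines
        name, _, value = line.partition(":")
        name = name.strip().lower()
        if name.startswith("x-") and name.endswith("-meta"):
            meta[name] = value.strip()
    return meta

sample_response = """\
HTTP/1.1 200 OK
Content-Length: 1048576
Content-Type: video/mp4
Last-Modified: Wed, 13 Sep 2017 10:00:00 GMT
x-camera-make-meta: Nikon
x-gps-meta: 40.7128,-74.0060
"""
print(parse_custom_metadata(sample_response))
```

No special client is needed; a HEAD request from any HTTP tool (for example, curl -I) returns the same kind of header block.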

In addition, Swarm allows creating and storing standalone metadata annotations as header-only objects associated with an existing content object. This makes it possible to keep extending custom metadata, and even to add metadata to read-only objects, effectively without limit.

Advantages of Deploying Content Gateway

Implementing Swarm with Content Gateway provides an organization with authentication, a browser UI for end users, S3 protocol access, and enhanced multi-tenancy. Multi-tenancy (discussed below) can be a critical tool for dividing and delegating content access and structure within large organizations.

Below is a basic Swarm deployment leveraging Content Gateway:

  • A 6-chassis Swarm cluster, for hardware resilience

  • Elasticsearch cluster for dynamic searching

  • Content Gateway

The Swarm Storage cluster is protected within a dedicated private network, and all client and application traffic passes through Content Gateway.

Tenants, Domains, and Buckets

Swarm offers multiple levels of access. The following focuses on tenants, domains, and buckets:

  • Tenant: A tenant is a hierarchy owning one or more storage domains. Each tenant scope can define a separate identity management system, so users and groups within it are separated from those in other tenants. Tenant administrators can create and access storage domains on behalf of the tenant, and they can delegate management duties for the storage domains they create. The tenant scope does not store end-user data; it is a meta store for information about the tenant, users, and storage domains.

  • Domain: The domain scope is directly tied to a Swarm storage domain and is where end-user data is kept. The SCSP and S3 storage protocols create and use data within the domain scope. While the domain scope can inherit user and group identity information from the tenant, it also has the ability to define a separate identity management system. The domain administrators can create and access all content within the storage domain. They can optionally delegate control of storage buckets to individual users or groups.

  • Bucket: The bucket scope is directly tied to a bucket existing within the Swarm storage domain. While access control policies can be defined for every bucket, there is no option for an identity management system definition at the bucket scope. All buckets within a domain share the domain's identity management system definition.

In short: A tenant holds multiple domains, and a domain holds multiple buckets.
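The containment model can be sketched as plain data structures; the tenant, domain, and bucket names below are invented for illustration.

```python
# Sketch: tenants own domains; domains hold buckets (and the end-user data).
# All names are illustrative, not a recommended layout.

tenants = {
    "tenant1": {
        "domains": {
            "finance.example.com": {  # domain scope: where end-user data lives
                "buckets": ["fiscalresults2017", "audits"],
            },
            "hr.example.com": {
                "buckets": ["policies"],
            },
        },
    },
}

def buckets_in_tenant(tenant):
    """Walk the hierarchy: a tenant holds domains, a domain holds buckets."""
    return [
        (domain, bucket)
        for domain, info in tenants[tenant]["domains"].items()
        for bucket in info["buckets"]
    ]
```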

Organizing by Tenant

Outside of multi-tenancy environments, tenants are useful for grouping similar storage areas in a cluster.  

Single Tenant and Wildcard DNS

Here is a top-level structure:

Tenant1's auth and protection levels are inherited by the domains lower down. Domain3 has buckets (represented here by folders).

Note

Even though the tenant is a special type of domain specific to Gateway, it is still a domain.

Each of these domains can also be fully qualified within the corporate DNS structure.

Tip

Use wildcards so there is no need to add DNS records for every new domain as they are created. This allows users to create separate domains, and DNS resolution happens automatically as long as the domains are created with a similar naming structure.

One Domain per Department and Employee

Create a wildcard DNS record for the gateway's address: *.cloud.example.com 
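In BIND-style zone file syntax, that wildcard record might look like the following sketch; the TTL and gateway address are placeholders.

```
; Wildcard record: resolves every <domain>.cloud.example.com
; to the Content Gateway front end (address is an example).
*.cloud.example.com.  300  IN  A  10.10.10.50
```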

Each domain created here represents a single department in the organization. A domain can be created for every employee as there is no limit to the number of domains within the storage cluster. The last domain is an employee domain: asmith3.cloud.example.com  

Employees can create as many buckets as they wish within separate domains, to further subdivide content.

One Tenant per Division

It may make sense to have more than one top-level tenant for an organization. Provide each corporate division a separate tenant so it can create and control separate departmental and employee domains. This provides an additional level of organization and authorization to work with.

Create the most readable and shortest path to the information relevant to users.

Verify the division is correct as it appears from a browser. The following URL is easy to interpret and access:

http://accounting.finance.example.com/fiscalresults2017/data.xls

This follows the pattern:

http://<dept>.<division>.<org>.com/<bucket>/<filename>
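As a sketch, the naming pattern can be composed programmatically; the values are the examples from the text, and the helper function itself is hypothetical.

```python
# Sketch: composing the <dept>.<division>.<org> URL pattern.

def content_url(dept, division, org, bucket, filename, scheme="http"):
    """Map the naming pattern to a browsable URL."""
    return f"{scheme}://{dept}.{division}.{org}.com/{bucket}/{filename}"

url = content_url("accounting", "finance", "example",
                  "fiscalresults2017", "data.xls")
```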

Migration Planning

The following are guidelines to facilitate a smooth migration to object storage.

Adapting the Legacy Structure

When migrating data from a traditional block storage or file-sharing solution, carefully evaluate the structure as it already exists and decide how much of the structure to take forward.

Users do not like change. When a file server is added to an organization, tribal knowledge tends to develop and become ingrained about what data goes where in an enterprise. New users are given access to the “P” drive or the “docs” folder, and, bit by bit, they learn where things are and where to put things. When implementing a new structure or a new file server, ask a user how they would prefer the new system to look, and they may insist, “Exactly like the one now!” There is no easy way to counter this, and it is a real challenge.

When performing a migration of any kind, this is the time to start to manage change in the organization, to minimize problems and resistance. It is also important to evaluate the old structure for duplication and dead wood early on, to eliminate it as part of the migration.

Best Practices for Restructuring

The following lessons have emerged from many implementations:

Do not bulk move folders to pseudo folders

  • An object store offers immense flexibility; bulk moving folders of files discards that flexibility and preserves the old structure, which complicates changes going forward.

  • Any 1-to-1 movement needs to use pseudo folders, which are prefixes to an object name. Pseudo folders add challenges to object searches.

  • Permissions and user attributes apply to the object, not the folder. Users may be disappointed if they create a pseudo folder expecting that sharing it also shares all of the files in it.
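Because pseudo folders are only prefixes on object names, a "folder listing" is really a prefix filter, which is why folder-level semantics such as shared permissions do not exist. A minimal sketch, with invented object names:

```python
# Sketch: pseudo folders are name prefixes, so listing a "folder"
# means filtering object names by prefix, not reading a directory.

objects = [
    "reports/2017/q3/summary.pdf",
    "reports/2017/q3/detail.pdf",
    "reports/2017/q4/summary.pdf",
    "media/logo.png",
]

def list_pseudo_folder(names, prefix):
    """Emulate a folder listing with a prefix match."""
    return [n for n in names if n.startswith(prefix)]
```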

Convert pathnames using domains and buckets

  • If a very long pathname exists, such as /year/month/day/filename, think about what looks best in an object context. The shortest path is to make the domain the year, with the bucket providing the month+day context. For example: 2017-hq-videos.example.com/Sep-13/videofile.mp4

  • There is no need to have a date on a bucket name if the date is in the filename.
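The date-path conversion above can be sketched as follows; the domain suffix matches the example in the text, and the helper itself is hypothetical.

```python
# Sketch: map a legacy /year/month/day/filename path to the
# domain/bucket/object layout described above.

def convert_date_path(path, suffix="hq-videos.example.com"):
    """Year becomes the domain, month+day the bucket, the rest the object."""
    _, year, month, day, filename = path.split("/")
    domain = f"{year}-{suffix}"
    bucket = f"{month}-{day}"
    return domain, bucket, filename

print(convert_date_path("/2017/Sep/13/videofile.mp4"))
```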

Use domains for data groups

  • Provide a separate domain for a large amount of similar data, or for data that is always used in the same workflow.

Use tenants/domains for applications

  • If an organization uses a particular application whose data is accessed only through that application, give the application a separate tenant or domain.

Optimize for searches

  • Collections are saved searches where the scope of the search can be the entire domain or a specific bucket.

  • When creating domains and buckets, avoid creating a structure too granular for large searches. For example, if a domain holds a certain type of data, a bucket per day may suffice; a bucket per hour is likely excessive unless each bucket holds a large amount of content.

Planning Areas

Any migration project requires consultation with DataCore and planning around these key areas. This requires that all integration points in an environment be listed and diagrammed:

Namespace

  • Strategy for mapping file systems to objects (discussed above)

  • What FQDN (fully qualified domain name) and DNS setup to use for Gateway (see https://perifery.atlassian.net/wiki/spaces/public/pages/2443810075)

Networking

  • Work out, down to each port (see https://perifery.atlassian.net/wiki/spaces/public/pages/2443808571), how all Swarm components integrate, to surface design issues

  • List required applications and verify they can access storage regardless of network segment

  • Evaluate the need for HTTP versus HTTPS (see https://perifery.atlassian.net/wiki/spaces/public/pages/2443814996)

  • Decide whether to use front-end load balancing or round-robin DNS

Authentication

  • Is LDAP or Active Directory integration being used?

  • How does the current ACL structure map to Gateway ACLs?

Swarm Clients (Optional)

  • Check minimum requirements if deployed client-side

  • Networking implications (Elasticsearch access and IP whitelisting)

FileFly

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.