When it comes to data storage, there are a lot of things to think about beyond just performance, including:
- Sheer numbers. How do you effectively manage data as the number of files in storage moves from millions to billions?
- Location. How do you make sure data is in the locations where it's needed—and make sure that sensitive data isn't stored where it's not supposed to be?
- Durability. How do you make sure data that you store for years remains readable if it is rarely or never accessed?
- Compliance. How do you make sure you're meeting both corporate governance and regulatory requirements?
- Retention. How do you retain data for periods that might span multiple generations of storage hardware?
- Cost. Finally, how do you make sure that data is stored on the most cost-effective media throughout its lifecycle?
Until now, putting together an effective solution for any one of these problems has been difficult, let alone solving all of them. But that's exactly what NetApp has achieved with StorageGRID® Webscale.
StorageGRID Webscale is massively scalable, software-defined object storage designed specifically for large archives, media repositories, and web datastores.
In this article, I'll introduce a few of the concepts behind object storage, discuss the features and capabilities of StorageGRID Webscale, and talk about a few use cases.
Why Object Storage and Why Now?
Object storage is a little different from the familiar block and file storage. It organizes data into flexibly sized data containers called objects. Objects are stored in a flat namespace that may span multiple locations. Each object has both data (an uninterpreted sequence of bytes) and metadata (a unique ID plus an extensible set of attributes that describe the object). A simple way to think about this is that object storage is like valet parking: you give the valet your ticket, and you get your car back without ever needing to know anything about where the car was parked.
Figure 1) Object storage offers flexible containers and extensible metadata that make it possible to efficiently manage billions of files.
The advantage of this approach is that data can be referenced and queried based on any attribute. And unlike with paper valet tickets, which can be far too easy to misplace, there can be multiple ways to find the right "ticket," and you can make using a ticket as hard (secure) or as easy as you wish. Identifier tags allow files to be indexed in quantities several orders of magnitude greater than a file system can handle, making object storage ideal for enterprise storage that is distributed over wide areas and encompasses billions of files.
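The object model described above can be illustrated with a minimal in-memory sketch. It is not how StorageGRID Webscale is implemented; the class names, the attribute names (`patient`, `modality`), and the sample data are all hypothetical. It only shows the idea of a flat namespace in which each object carries opaque bytes, a unique ID, and extensible metadata that can be queried.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """An object: opaque bytes plus a unique ID and extensible metadata."""
    data: bytes
    metadata: dict = field(default_factory=dict)
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class FlatObjectStore:
    """Toy store (illustrative only): one flat namespace keyed by object ID."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        obj = StoredObject(data=data, metadata=dict(metadata))
        self._objects[obj.object_id] = obj
        return obj.object_id  # the "valet ticket"

    def get(self, object_id):
        # The caller never learns where the bytes physically live.
        return self._objects[object_id].data

    def find(self, **criteria):
        """Return IDs of objects whose metadata matches every criterion."""
        return [
            oid for oid, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        ]


store = FlatObjectStore()
ticket = store.put(b"scan-0001", patient="A-17", modality="MRI")
store.put(b"scan-0002", patient="B-42", modality="CT")
```

Here `store.get(ticket)` returns the original bytes, while `store.find(modality="MRI")` locates the same object by attribute rather than by ID.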
Three trends are contributing to increasing interest in object storage:
- Continued growth in the amount of unstructured data necessitates a new approach to storing and protecting data. Object-enabled data management facilitates intelligent data placement to meet a wide range of performance, durability, availability, location, and longevity requirements.
- Enterprises are increasingly coping with decentralized data creation and consumption. The "primary data center plus DR data center" model is being replaced by a multisite approach in which users, workloads, and data are brought closer together.
- The hybrid cloud is providing new options to balance cost and performance, and IT teams are looking at how they can best leverage both on-premises and cloud-based storage.
Introducing StorageGRID Webscale
StorageGRID Webscale is an enterprise-grade storage platform that offers significant advantages over other object storage approaches. Its unique software-defined architecture supports billions of objects and tens of petabytes of storage spanning numerous locations in a single namespace.
Built to support hybrid cloud, StorageGRID Webscale provides always-on data availability and proven native support for cloud applications with S3 and CDMI APIs. A dynamic policy engine lets you optimize availability, performance, and cost for each data object stored, providing much greater granularity than container-level management.
StorageGRID Webscale takes advantage of experience and capabilities gained from over ten years of production object-storage deployments with our original StorageGRID product. As a result, NetApp is able to deliver:
- The industry's most advanced policy framework for data lifecycle management
- True geo-distributed and geo-selective object placement
- An unparalleled level of data durability
- Tape as an active tier (with ability to retrieve single objects from tape)
Table 1) StorageGRID Webscale features.
|StorageGRID Webscale: Key Capabilities||
|100 billion objects per namespace|Integrated data protection|
|70PB per namespace|Objects up to 5TB in size|
|Up to 16 data center locations|Full audit and reporting (Splunk compatible)|
|Nondisruptive upgrades|E-Series for density, performance, availability|
|Location and storage tier chosen based on policy|Scale-out|
|Integrity verification and self-healing|Long-term retention|
|Native S3 and CDMI RESTful APIs||
Dynamic Policy Engine
The level of granularity and flexibility offered by StorageGRID Webscale is unmatched in the industry. Other solutions manage data based on containers, limiting your options. StorageGRID Webscale features a dynamic policy engine that allows you to set policies in terms of a variety of criteria including:
- Resource availability and latency
- Data retention requirements
- Geo-location requirements
- Network cost (factor in the cost of the network link)
The policy engine can evaluate objects based on criteria such as custom user and application metadata, method of ingest, size, or time of last access, and apply policies that define:
- Where an object is placed geographically
- The type of storage used to store an object (SSD, HDD, or tape)
- The number of copies made of an object
- Retention policy, including changes over time to the placement, storage grade, and number of copies, and deletion if applicable
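As a rough illustration of policy-driven placement, the sketch below matches an object's attributes against an ordered list of rules and returns where it should live, on what storage grade, and with how many copies. The first-match evaluation order, the rule contents, the site codes, and the attribute names are assumptions made for the example, not the product's actual policy syntax.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Placement:
    sites: tuple          # where copies live geographically
    storage_grade: str    # "ssd", "hdd", or "tape"
    copies: int           # number of copies to keep


@dataclass
class Rule:
    matches: Callable[[dict], bool]   # predicate over object attributes
    placement: Placement


def evaluate(rules, obj_attrs, default):
    """Return the placement of the first matching rule, else the default."""
    for rule in rules:
        if rule.matches(obj_attrs):
            return rule.placement
    return default


# Hypothetical rules: attributes, sites, and thresholds are illustrative only.
rules = [
    Rule(lambda o: o.get("department") == "legal",
         Placement(sites=("us", "de"), storage_grade="hdd", copies=2)),
    Rule(lambda o: o.get("days_since_access", 0) > 365,
         Placement(sites=("us",), storage_grade="tape", copies=1)),
]
default = Placement(sites=("us", "de", "jp"), storage_grade="hdd", copies=3)

legal_doc = {"department": "legal", "days_since_access": 400}
old_video = {"department": "media", "days_since_access": 900}
```

With these rules, `evaluate(rules, legal_doc, default)` keeps the legal document on disk in two sites even though it is old, because the department rule is listed first, while `old_video` falls through to the tape rule.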
An object's metadata includes the locations where it is stored and the number of copies. Metadata can include custom fields, and new fields can be added as requirements change. Metadata is distributed throughout the StorageGRID Webscale environment to increase scalability and resiliency and provide faster retrieval.
Figure 2) Advantages of the StorageGRID Webscale policy engine and extensible metadata.
Verify policy compliance. The policy engine in StorageGRID Webscale is unique in that it not only executes policy against an object on ingest, it also periodically verifies compliance and takes corrective action. For instance, a policy might mandate that three copies of a particular type of object be maintained at all times. If a failure affects one copy of an object subject to this policy, a new copy is automatically created to bring the object back into compliance.
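The three-copy example can be sketched as a small repair step. This is a simplified model under stated assumptions: the node names are hypothetical, copy health is given as a plain dictionary, and the real system's choice of where to create a replacement copy is certainly more involved than "first available spare."

```python
def restore_compliance(required_copies, copy_health, spare_nodes):
    """Return (locations after repair, newly created copies).

    copy_health maps node name -> True if the copy there is intact.
    spare_nodes lists candidate nodes for replacement copies.
    """
    healthy = [node for node, ok in copy_health.items() if ok]
    missing = max(0, required_copies - len(healthy))
    # Skip nodes that already hold (or held) a copy of this object.
    candidates = [n for n in spare_nodes if n not in copy_health]
    new_copies = candidates[:missing]
    return healthy + new_copies, new_copies


# One of three copies has failed; policy requires three at all times.
copy_health = {"node-a": True, "node-b": False, "node-c": True}
locations, created = restore_compliance(
    3, copy_health, ["node-b", "node-d", "node-e"]
)
```

Running the same function against an object that already has three healthy copies creates nothing, which is what a periodic compliance check should do in the common case.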
Apply policy retroactively. Policy changes can even be applied retroactively. For instance, suppose you have a policy set up so that data is stored with one copy in the US, one copy in Germany, and one copy in Japan, but then the law changes and you are no longer allowed to store that data type in Japan. You simply change the policy and StorageGRID Webscale automatically moves data as needed to ensure compliance. This can turn what might otherwise be a monumental data management task into a matter of a few mouse clicks.
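A retroactive policy change amounts to re-evaluating every stored object against the new rule and working out what has to move. The sketch below assumes a simple catalog of object IDs to site lists; the object IDs and the replacement site `uk` are hypothetical and are not part of the example above.

```python
def plan_moves(current_sites, allowed_sites, required_copies):
    """Work out which copies to keep, add, and remove under a new policy."""
    keep = [s for s in current_sites if s in allowed_sites]
    remove = [s for s in current_sites if s not in allowed_sites]
    candidates = [s for s in allowed_sites if s not in keep]
    add = candidates[:max(0, required_copies - len(keep))]
    return {"keep": keep, "add": add, "remove": remove}


# Existing placements made under the old policy (US, Germany, Japan).
catalog = {
    "obj-1": ["us", "de", "jp"],
    "obj-2": ["us", "de", "uk"],
}

# New policy: Japan is no longer permitted; "uk" is a hypothetical alternative.
new_allowed = ["us", "de", "uk"]
plans = {oid: plan_moves(sites, new_allowed, 3) for oid, sites in catalog.items()}
```

Only `obj-1` needs work: its plan adds a copy in the new site and removes the one in Japan, while `obj-2` is already compliant and is left alone.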
Availability and Data Durability
StorageGRID Webscale features a fault-tolerant architecture that supports nondisruptive operations, upgrades, and infrastructure refreshes. It is designed to respond to the loss of individual nodes and entire sites to provide continuous data access. Load balancing automatically distributes workloads during normal operations and when failures occur to achieve the best possible performance under all conditions. NetApp AutoSupport provides automatic notifications to your administrators and to NetApp when a problem occurs.
Dual commit and multiple copies. On ingest, objects are immediately protected by dual commit (two local copies), and all objects are replicated to several locations. All object copies are active and can be used to satisfy retrieval requests.
Data integrity. Multiple interlocking layers of integrity protection including authentication, hashes, and checksums are used to protect your data. A digital fingerprint is created for each object on ingest and verified on retrieval, replication, migration, and while the object is at rest. Suspect objects are automatically replaced. If you're retrieving an object and it fails a test, it is automatically retrieved from a different location and a new copy is created—transparent to both the user and the administrator.
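The fingerprint-and-fallback behavior can be sketched as follows. SHA-256 is an assumption chosen for the example, since the article does not say which hash is used, and the site names and payload are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Digital fingerprint of an object (SHA-256 is assumed here)."""
    return hashlib.sha256(data).hexdigest()


def retrieve(copies, expected_fingerprint):
    """Return (data, location served from, locations repaired).

    copies maps location -> bytes. Copies are tried in order; any copy that
    fails verification is replaced once an intact copy is found.
    """
    failed = []
    for location, data in copies.items():
        if fingerprint(data) == expected_fingerprint:
            for bad_location in failed:
                copies[bad_location] = data  # replace the suspect copy
            return data, location, failed
        failed.append(location)
    raise IOError("no intact copy of the object was found")


payload = b"quarterly-report"
stored_fingerprint = fingerprint(payload)        # recorded on ingest
copies = {"site-1": b"quarterly-rep0rt",         # silently corrupted
          "site-2": payload}
data, served_from, repaired = retrieve(copies, stored_fingerprint)
```

The caller simply receives the correct bytes; the fact that the first copy failed verification and was rewritten is visible only in the returned bookkeeping.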
Regular health checks. StorageGRID Webscale performs "health checks" at regular intervals (defined in policy) on data that might not otherwise be accessed, to ensure its integrity. This means that you can store data for long periods of time and be confident it will remain readable should it ever be needed.
S3 RESTful Object API Support
The S3 RESTful object API used by Amazon Web Services has become a de facto standard for object storage. By providing compatibility with S3 APIs, StorageGRID Webscale is able to immediately support applications built for S3. You can move applications written for public cloud providers on premises, and you can develop applications that can run in both private and public clouds.
StorageGRID Webscale provides support for S3 content including AccountID, Bucket and key prefix, and S3 metadata, and is also able to provide audit logging, monitoring, and reporting.
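In the S3 REST API, user-defined metadata travels with an object as HTTP headers prefixed with `x-amz-meta-`. The sketch below builds, but does not send, a path-style PUT request carrying such metadata. The endpoint URL, bucket, key, and metadata fields are hypothetical, and a real request would also need S3 authentication (request signing), which is omitted here.

```python
from urllib.request import Request


def user_metadata_headers(metadata):
    """Map custom metadata to S3 user-metadata headers (x-amz-meta-*)."""
    return {f"x-amz-meta-{key.lower()}": str(value)
            for key, value in metadata.items()}


def build_put_request(endpoint, bucket, key, body, metadata):
    """Build an unsigned, path-style S3 PUT Object request (not sent)."""
    url = f"{endpoint.rstrip('/')}/{bucket}/{key}"
    return Request(url, data=body,
                   headers=user_metadata_headers(metadata), method="PUT")


request = build_put_request(
    "https://s3.grid.example.com",     # hypothetical endpoint
    "media-archive",                   # hypothetical bucket
    "2015/clip-001.mp4",
    b"...video bytes...",
    {"Project": "launch", "retention-years": 7},
)
```

Because the request format is the same one that applications already use against a public S3 service, pointing an existing application at a different endpoint is, in principle, a configuration change rather than a rewrite.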
StorageGRID Webscale Architecture and Deployment
The simple, logical architecture of StorageGRID Webscale supports a physical architecture that can both scale up and scale out. The logical architecture is shown in Figure 3. Objects are stored and retrieved using RESTful APIs. As discussed above, much of the power of this architecture comes from policy-driven data placement and a location-transparent distributed object store.
Figure 3) StorageGRID Webscale has a simple logical architecture that can overlay object-level data management on a variety of storage hardware.
The physical architecture utilizes four types of nodes:
- Admin nodes provide management services such as configuration, monitoring, audit, and logging.
- Storage nodes manage object storage, including replication.
- API gateway nodes (optional) provide a load-balancing interface through which applications connect to StorageGRID Webscale using standard APIs.
- Archive nodes (optional) provide an interface to archive media such as tape.
You can scale out with multiple nodes of each type—in each data center—to support massive scale. StorageGRID Webscale nodes run as VMware virtual machines in front of block storage, which can be NetApp E-Series storage or third-party arrays. Each virtual machine utilizes 8 vCPUs and 24GB of RAM. SSDs and 10GbE can optionally be used to enhance VM performance.
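Since every node is a virtual machine with the same footprint, the compute resources a site needs can be estimated with simple arithmetic. The per-node figures below come from the text; the node counts for the example site are hypothetical.

```python
VCPUS_PER_NODE = 8     # per the figures quoted above
RAM_GB_PER_NODE = 24


def vm_footprint(node_counts):
    """Total VM resources for a site, given {node type: count}."""
    total_nodes = sum(node_counts.values())
    return {
        "nodes": total_nodes,
        "vcpus": total_nodes * VCPUS_PER_NODE,
        "ram_gb": total_nodes * RAM_GB_PER_NODE,
    }


# Hypothetical single-site layout.
site = {"admin": 1, "storage": 4, "api_gateway": 2, "archive": 1}
footprint = vm_footprint(site)
```

For this eight-node layout the hypervisors would need to provide 64 vCPUs and 192GB of RAM to the grid, before any allowance for the hypervisor itself.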
Figure 4) The StorageGRID Webscale physical architecture relies on four types of nodes. Nodes serving a single deployment can be distributed across up to 16 data centers.
Running StorageGRID Webscale on E-Series
StorageGRID Webscale is a software-defined product that runs on VMware virtual infrastructure in combination with block storage. We believe you can achieve best results by deploying StorageGRID Webscale on proven enterprise-grade storage such as NetApp E-Series. If you buy your infrastructure from your local discount store, that's who you're depending on for support when something fails in the middle of the night. E-Series storage is not only highly resilient—with over 750,000 deployed systems—it's backed by enterprise-grade support services.
E-Series delivers the performance and resiliency required for StorageGRID Webscale use cases by offering features such as dynamic disk pools (DDP), which deliver node-level erasure coding. DDP distributes data, parity information, and spare capacity evenly across the entire pool of drives, simplifying setup, eliminating hot spots, and maximizing capacity utilization. Free space is distributed across all disks, so there are no dedicated hot spares sitting idle. You get the full performance of all disks in the system. DDP minimizes the performance impact of a drive failure and can return the system to optimal condition up to 8 times more quickly than traditional RAID.
NetApp believes that StorageGRID Webscale is extremely well suited for web data repositories, data archives, and media repositories. Each of these use cases has its own distinct set of requirements, but StorageGRID Webscale adapts to accommodate the wide variety of needs encompassed in this set of use cases.
Web Data Repositories
Web data repositories are characterized by small object sizes, high object counts, and high transaction rates. Because it can handle up to 100 billion objects in a single repository distributed across many locations, and because it supports both the S3 and CDMI APIs, StorageGRID Webscale is extremely well suited for this use case.
Data Archives
Increasingly, enterprises are storing massive amounts of data for extended periods to satisfy both corporate governance and legal requirements. With data archives of this type, cost and management are typically the most important concerns. Long access latency is tolerated in exchange for reduced cost. StorageGRID Webscale satisfies this use case with tape integration, proven data durability, and flexible, policy-based management.
Media Repositories
Media repositories are characterized by large object sizes (250MB+), a need for geographical distribution, a need for data integrity, and a low time-to-first-byte latency. The geographically distributed, durable design of StorageGRID Webscale satisfies these requirements. It also supports "ranged reads," so, for example, a video can be streamed from any point without having to download the entire object.
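A ranged read is normally expressed with the standard HTTP `Range` header, which names an inclusive byte span. The sketch below builds such a header and uses a toy in-memory "server" to show which bytes come back. The helper names are hypothetical, and the toy handles only a single `start-end` or `start-` range.

```python
def range_header(start, end=None):
    """Build an HTTP Range header value for bytes start..end (inclusive)."""
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    return f"bytes={start}-" if end is None else f"bytes={start}-{end}"


def serve_range(obj, header):
    """Toy server side: return the slice of obj named by the header."""
    _unit, _, spec = header.partition("=")
    first, _, last = spec.partition("-")
    start = int(first)
    end = int(last) if last else len(obj) - 1   # open-ended: to last byte
    return obj[start:end + 1]


video = bytes(range(100))                 # stand-in for a large media object
chunk = serve_range(video, range_header(40, 49))
tail = serve_range(video, range_header(90))
```

A player that seeks to the middle of a video would issue a request like the first one, receiving only the ten bytes it asked for rather than the whole object.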
Because StorageGRID Webscale is built on a firm foundation derived from our original StorageGRID solution, it is a mature product ready to satisfy your object storage needs. It offers capabilities that you won't find in other object solutions, including geo-distributed and geo-selective object placement, proven data durability with regular health checks, and retroactive policy compliance.
Taken together, the capabilities of StorageGRID Webscale can greatly simplify the management of web data, archives, and media repositories and allow object storage to be architected for decades of nonstop production use. And you get all that from a proven company that has enterprise-class support.
This is becoming a cloud-dominated world. NetApp is doubling down on object storage and positioning itself to take a leadership role, with much more to come. Keep your eyes on Tech OnTap for future developments.