by Ezra Tingler, Sr. Storage Engineer, NetApp IT
NetApp IT’s global enterprise environment has a large storage footprint. One of our biggest challenges is in capacity management. As a senior storage engineer in our Customer-1 organization, I am responsible for the storage capacity and performance management of our ONTAP systems.
Managing storage capacity is a daily grind. It involves analyzing utilization trends, balancing existing storage pool utilization, and forecasting future storage needs of applications. While all those tasks are certainly important and interesting aspects of my job, I also spend a lot of time addressing everyday issues that hamper my ability to effectively manage our ever-growing environment.
I consistently see alerts for full aggregates, which NetApp IT defines as any aggregate that exceeds 70% capacity. This is a low threshold by most IT organizations' standards. Even so, whenever an aggregate crosses it, a support ticket is opened. These events create a lot of extra work for our operations team as it resolves each incident.
In addition to the workload generated by these alerts, full aggregates are a real problem in themselves: they can cause performance and data accessibility issues. At NetApp IT, all volumes within an aggregate (FlexVols) are thin-provisioned. Thin provisioning lets us provision the full amount of storage at creation time but allocate physical space only as it is consumed. In many cases, the aggregate-full conditions are the result of volumes that do not conform to our thin-provisioning standards; these are easily resolved by simply applying our standard thin-provisioning settings.
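The conformance check described above boils down to comparing each FlexVol's space guarantee against the standard. The production scripts are written in Perl; the following is a simplified Python sketch with hypothetical field names, not our actual code:

```python
# Illustrative sketch of the thin-provisioning conformance check.
# Volume records are plain dicts here; the real scripts read these
# attributes from ONTAP via the NetApp Manageability SDK.

# Our standard: thin-provisioned FlexVols carry no space guarantee.
STANDARD_GUARANTEE = "none"

def find_nonconforming(volumes):
    """Return the volumes whose space guarantee violates the standard."""
    return [v for v in volumes if v["guarantee"] != STANDARD_GUARANTEE]

def apply_standard(volume):
    """Reset a volume's space guarantee to the thin-provisioning standard."""
    volume["guarantee"] = STANDARD_GUARANTEE
    return volume

if __name__ == "__main__":
    vols = [
        {"name": "vol_app1", "guarantee": "none"},
        {"name": "vol_app2", "guarantee": "volume"},  # thick-provisioned
    ]
    for vol in find_nonconforming(vols):
        apply_standard(vol)
```

Resolving a nonconforming volume is just a settings change, which is why these cases are so well suited to automation.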
If the aggregate full condition is not resolved by a simple adjustment of the FlexVol’s properties, then we need to look for a way to lower the consumed space within the aggregate. In these cases, we look for a good destination aggregate for the volume and plan a volume move. We manually perform a series of checks prior to and after the move. This is a time-consuming and repetitive task, with lots of back and forth between the engineering and operations teams.
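The destination-selection step can be sketched as follows. The candidate filter and the 70% threshold mirror the checks described above; the data structures and the "most free space" tie-breaker are illustrative assumptions, not our production Perl logic:

```python
# Hypothetical sketch of choosing a destination aggregate for a volume
# move: the candidate must offer the same service level as the source,
# and the move must not push it past the 70% capacity threshold.

THRESHOLD = 0.70

def pick_destination(aggregates, service_level, volume_used_bytes):
    """Return the best destination aggregate for the move, or None."""
    candidates = [
        a for a in aggregates
        if a["service_level"] == service_level
        and (a["used"] + volume_used_bytes) / a["size"] < THRESHOLD
    ]
    # Prefer the candidate with the most free space remaining.
    return max(candidates, key=lambda a: a["size"] - a["used"], default=None)
```

If no aggregate passes the filter, the move is not planned and the incident still needs human attention, which matches the manual fallback described above.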
My goal was to automate capacity management to reduce the number of capacity-based alerts that require manual intervention. To do this, I needed to automate the process of finding potential capacity issues and resolving them by moving volumes to aggregates with more free space. I'm not a software developer, but once again I found myself writing Perl scripts to automate a few basic steps.
I used a methodology similar to my earlier script for automating clustered ONTAP configuration, built on the NetApp Manageability Software Development Kit (NM SDK). The scripts check for violations by performing several steps to identify and resolve issues and to evaluate possible targets. The redistribution script moves the affected volume to another aggregate at the same service level; NetApp IT delivers storage services at three levels: Extreme, Value, and Performance. Querying the OCUM (OnCommand Unified Manager) database enabled me to minimize the number of direct connections to the clusters.
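The advantage of the OCUM reporting database is that one query can survey every aggregate in the environment without touching the clusters themselves. The table and column names below are hypothetical stand-ins, and the sketch uses an in-memory SQLite database purely to illustrate the query shape; the real database is OCUM's own reporting schema:

```python
# Illustrative sketch: find aggregates over the capacity threshold by
# querying a reporting database instead of polling each cluster.
import sqlite3

# Stand-in for the OCUM reporting database (schema is hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aggregate (name TEXT, cluster TEXT, used_pct REAL)")
conn.executemany(
    "INSERT INTO aggregate VALUES (?, ?, ?)",
    [("aggr1", "clusA", 82.0), ("aggr2", "clusA", 41.0), ("aggr3", "clusB", 73.5)],
)

def full_aggregates(conn, threshold=70.0):
    """Return (cluster, aggregate) pairs above the capacity threshold."""
    cur = conn.execute(
        "SELECT cluster, name FROM aggregate WHERE used_pct > ? "
        "ORDER BY used_pct DESC",
        (threshold,),
    )
    return cur.fetchall()
```

Only the aggregates flagged by this query need a follow-up connection to their cluster, which keeps the per-run load on the storage systems low.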
The process is kicked off by the wrapper script, which runs as an hourly cron job on our administrative hosts. It pulls a list of all clusters in our environment from the OCUM reporting database and calls the configuration enforcement script to check and resolve cluster configuration issues. The wrapper script also calls the capacity redistribution script to evaluate the target cluster and determine a source volume and destination aggregate. The same wrapper is used for other storage automation tasks as well.
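The wrapper's control flow is simple to outline. The function names below are illustrative assumptions (the real wrapper is a Perl script driven by cron), but the per-cluster sequence of enforcement followed by redistribution matches the description above:

```python
# Sketch of the hourly wrapper's control flow: for each cluster,
# enforce configuration standards, then evaluate capacity
# redistribution. The callables are injected so the flow is testable.

def run_hourly(get_clusters, enforce_config, redistribute_capacity):
    """Run the enforcement and redistribution steps for every cluster."""
    results = []
    for cluster in get_clusters():          # e.g. read from the OCUM DB
        enforce_config(cluster)             # fix volume-setting violations
        results.append(redistribute_capacity(cluster))  # plan volume moves
    return results
```

Injecting the three steps as callables is just a convenience for this sketch; the point is that one scheduled entry drives both scripts across every cluster.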
Specific users and roles, along with SSL certificate authentication, provide secure access. The capacity redistribution script writes to the standard syslog facility on the administrative host and generates a service incident for tracking and reporting purposes. Automated capacity management is now running in both our development and production environments.
Following the implementation of this automation, the NetApp IT storage environment has improved greatly. We found that just under 30% of the volumes in our development environment, and about 20% in production, had configuration issues or conflicted with our current volume-setting standards, a common issue in large storage environments. The scripts resolved these conflicts automatically.
This new process also significantly improves storage team productivity. Each time an aggregate exceeded 70% capacity, it generated a storage service incident; we averaged 15 aggregate capacity incidents per month. Due to the complexity of the work and the cross-team coordination required, these incidents took an average of 6.25 days to resolve. The scripts enable us to resolve them in minutes or hours, not days.
Another benefit is that the scripts automatically implemented thin provisioning on all volumes, which immediately yielded space savings of more than 5 PB. During our transition to ONTAP with its storage efficiency features, we have been able to keep our capacity growth almost flat and eliminate the immediate need to invest in additional storage capacity.
Now that version two of the scripts is deployed, I will begin working on other automation tasks. The next natural step is importing the scripts into OnCommand® Workflow Automation (WFA) as an automation script available to all NetApp customers. I should have more time to work on this project now that I don't need to worry about as many capacity issues.
Check back here to see when the scripts are available in GitHub and WFA.
The NetApp-on-NetApp blog series features advice from subject matter experts from NetApp IT who share their real-world experiences using NetApp’s industry-leading storage solutions to support business goals. Want to learn more about the program? Visit www.NetAppIT.com.