For many of us, applying best practices is like exercising—you know you should do more than you currently do. That’s why NetApp has spent so much time developing tools that make it easier to identify where your storage best practices can be enhanced.
Over the last several years, I’ve been the NetApp Support account manager (SAM) for a large financial institution in Australia, where we’ve seen firsthand how much difference these tools can make. Simply by making consistent use of data available from the AutoSupport™ tool, My AutoSupport, and the Remote Support Diagnostics Tool (RSDT), in conjunction with minor process improvements, we’ve eliminated severity 1 incidents (critical failures with major business impact, affecting multiple users or business units). The bank hasn’t had one in over two and a half years. We’ve also consistently met availability SLAs of 99.99% and 99.999%. When issues do arise, we’re able to resolve them faster and with less impact.
In this article I provide a brief introduction to these tools for those who may not be familiar with all of them. I also explain how we use the tools to achieve significantly better adherence to best practices for improved storage system stability, availability, performance, and efficiency.
The AutoSupport Family of Tools
To begin, I briefly describe each of the three tools. If you already know them, you can skip this section, although the links I provide for each tool lead to a lot of valuable information.
Most of you are probably familiar with AutoSupport since it’s been around since the early days of NetApp. When you enable AutoSupport on a NetApp® storage system it sends system alerts and weekly logs to your administrators and to NetApp. At NetApp, this information is analyzed automatically to identify any issues that might impact future storage system stability and performance. You can read more about AutoSupport on the NetApp Support Web site. (Requires NetApp Support login.)
My AutoSupport is a Web-based tool that uses the AutoSupport data from your NetApp storage systems to help you analyze, model, and optimize your storage infrastructure. Storage systems with a valid hardware warranty or support contract have access to all My AutoSupport features, including:
- Risk reporting with proactive checks
- Performance overview reports
- Device visualization (system, disk, RAID, qtree, capacity)
- Storage system configuration comparison
- Storage efficiency profiling
- Data ONTAP® Upgrade Advisor
- Full AutoSupport history and events
- AutoSupport content viewer
You can learn more about My AutoSupport here. (Requires NetApp Support login.) Be sure to check out the various links and videos at the bottom of the page.
The Remote Support Diagnostics Tool helps NetApp Support diagnose storage system issues without intervention from your IT staff. This tool can significantly accelerate problem resolution while reducing the burden on your staff. RSDT provides secure, authenticated communication between your storage systems and NetApp. This allows NetApp Support personnel to upload core files and other diagnostic data in real time, enabling NetApp to diagnose problems without on-site assistance.
Because of the potential security concerns associated with remote access, we pay special attention to security with:
- Outbound 128-bit encrypted HTTPS connections
- A digital certificate that prevents spoofing
- Data collection only during problem triage
- Security policies you control
- Full audit log of NetApp actions
According to an independent assessment, RSDT conforms to all security best practices. You can read more about RSDT, including the third-party assessment of RSDT security, here. (Requires NetApp Support login.)
Taking Full Advantage of the AutoSupport Tools
The financial institution I’ve been the SAM for has over 120 NetApp storage systems. Production systems are all in HA clusters, with secondary HA clusters providing DR and additional standalone systems for backup using NetApp SnapVault® technology. About 3.5PB of data is backed up per month. The storage infrastructure provides storage for the application tier serving various business units, all file serving (CIFS), and Exchange.
The bank already had AutoSupport enabled on many of its storage systems, so for us it was mostly a matter of making sure that all systems were covered and then taking advantage of the features of My AutoSupport as they were rolled out. Because the business is a financial institution, enabling RSDT was another matter; it took some time for RSDT approval to flow through the various checks and balances. However, the bank was really sold on RSDT’s ability to expedite access to core files and other diagnostic data, and the security team ultimately decided that RSDT met all of the bank’s guidelines for data and networking security.
One of the keys to success for this institution was My AutoSupport risk reporting. My AutoSupport looks for previously identified risk signatures and creates a proactive risk report that identifies problems that might reduce storage system availability, performance, or efficiency. These risk signatures are constantly updated by experts at NetApp based on field experience and data, so each report always provides the latest information. My AutoSupport also provides a procedure to remove or mitigate each risk it identifies.
Figure 2) My AutoSupport risk report.
The My AutoSupport risk report is used along with the Supportability Profile report (see sidebar) to identify each risk, document it, judge its importance and risk profile, and create an action plan to address it. Action plans fall into three categories:
- Risks that can be resolved in a nondisruptive fashion
- Risks that can wait until the next planned downtime
- Risks that need mitigation as soon as possible
About every two weeks, my team and I work through each report and create a plan to address each risk that hasn’t been previously identified. All risks are then documented on a “risk register” for the Operations team to resolve. This risk register is referenced each time downtime is scheduled so that outstanding work is completed. For operational reasons, a certain number of risks can’t be corrected in the near term; the bank deems those risks acceptable.
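To make the process concrete, here is a minimal Python sketch of how a risk register like the one described above could be modeled. It is an illustration only, not the tooling the bank or NetApp actually uses; the class names, fields, and system names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class ActionPlan(Enum):
    """The three action-plan categories described in the article."""
    NONDISRUPTIVE = "resolve in a nondisruptive fashion"
    NEXT_DOWNTIME = "wait until the next planned downtime"
    MITIGATE_ASAP = "mitigate as soon as possible"


@dataclass
class Risk:
    system: str              # hypothetical storage system name
    description: str         # risk text taken from the report
    plan: ActionPlan         # which of the three categories applies
    accepted: bool = False   # True if the business deems the risk acceptable
    resolved: bool = False   # True once Operations has fixed it


class RiskRegister:
    """Hypothetical register keyed on (system, description)."""

    def __init__(self):
        self._risks = {}

    def add(self, risk):
        """Record a risk only if it has not been identified before.

        Returns True if the risk was new, False if it was already registered.
        """
        key = (risk.system, risk.description)
        if key in self._risks:
            return False
        self._risks[key] = risk
        return True

    def outstanding(self):
        """Risks that still need work: neither resolved nor accepted."""
        return [r for r in self._risks.values()
                if not r.resolved and not r.accepted]

    def due_at_downtime(self, system):
        """Outstanding risks to review when downtime is scheduled for a system."""
        return [r for r in self.outstanding() if r.system == system]


# Example usage with made-up systems and risks.
register = RiskRegister()
register.add(Risk("filer01", "Disk firmware out of date", ActionPlan.NONDISRUPTIVE))
register.add(Risk("filer01", "Single-attached loop", ActionPlan.NEXT_DOWNTIME))
register.add(Risk("filer02", "Failed FC-AL loop", ActionPlan.MITIGATE_ASAP))
register.add(Risk("filer02", "Legacy configuration", ActionPlan.NEXT_DOWNTIME,
                  accepted=True))

for risk in register.due_at_downtime("filer01"):
    print(f"{risk.system}: {risk.description} ({risk.plan.value})")
```

Keying the register on the system and the risk description mirrors the biweekly review, in which only risks that haven’t been previously identified get a new plan. Accepted risks stay on the register but drop out of the outstanding list.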
My team and I have been able to achieve tremendous improvements in storage system stability simply by using these tools and implementing the process changes I’ve described. The risk reports immediately found a variety of potentially serious issues, such as several failed FC-AL loops, and we put action plans in place to correct them. As Table 1 indicates, between July 2010 and January 2012 we achieved significant improvements in compliance with best practices in a variety of areas. This has contributed directly to greater storage system stability.
Table 1) Improvements in compliance with best practices.
| Best practice | July 2010 | January 2012 |
| --- | --- | --- |
| Systems running recommended version of Data ONTAP | 89% | 100% |
| MB firmware up to date | 32% | 99% |
| Disk firmware up to date | 93% | 98% |
| Shelf FW/version up to date | 93% | 98% |
| Implement dual attach loops | 81% | 99% |
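The figures in Table 1 can be turned into percentage-point gains with a few lines of Python. This is only a small worked example; the numbers are copied from the table, and the variable and function names are my own.

```python
# Compliance percentages copied from Table 1: (July 2010, January 2012).
compliance = {
    "Systems running recommended version of Data ONTAP": (89, 100),
    "MB firmware up to date": (32, 99),
    "Disk firmware up to date": (93, 98),
    "Shelf FW/version up to date": (93, 98),
    "Implement dual attach loops": (81, 99),
}


def point_gains(table):
    """Return {practice: percentage-point gain}, largest gain first."""
    gains = {name: after - before for name, (before, after) in table.items()}
    return dict(sorted(gains.items(), key=lambda item: item[1], reverse=True))


for practice, gain in point_gains(compliance).items():
    print(f"{practice}: +{gain} points")
```

Sorting by gain puts MB firmware at the top, since it moved from 32% to 99%, while disk and shelf firmware, which were already at 93%, show the smallest change.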
Overall, the bank is extremely satisfied with the performance of its NetApp storage since these changes were implemented. NetApp has been held up as the “model vendor” not just for our stability and availability, but also for our reporting and preemptive risk identification.
Additionally, RSDT has allowed the problems that do occur to be resolved much more quickly. Because NetApp Technical Support can immediately download core files and other diagnostic data, disruption to the bank’s operations is minimized.
If you’re not taking advantage of the AutoSupport family of tools in your NetApp storage environment, it’s time to get started. These tools provide a simple way to identify and correct risks before they create problems, thereby enhancing your storage availability, performance, and efficiency.