Tech ONTAP Blogs
You may be asking yourself why we’re talking about a Dell storage platform alongside a NetApp service, but I assure you it’s for very good reasons. If you’re new to the topic, you may have missed our first discussion on PowerMax. For an introduction to managing PowerMax on Data Infrastructure Insights, check out the original post here.
The last post discussing PowerMax arrays covered some challenges around managing array performance in the context of evolving requirements. What it didn’t cover is how to use Data Infrastructure Insights to understand and improve how we manage the end-user experience. We can do this by tracking Service Level attainment, ensuring that workloads are appropriately distributed among service levels, and getting ahead of issues by taking an AIOps approach to managing the estate.
If your business doesn’t run entirely on SaaS, there have undoubtedly been moments when a critical application owner calls you or your colleagues on the infrastructure team in what can only be described as a blind panic. The conversation can go many ways, but one common thread goes something like this:
“My application’s performance is terrible,” they say, absolutely certain that the storage provided to them must be the problem. Someone always thinks it’s the storage...
You calmly jump to their aid, assuring them that their app gets very high priority from both you and the systems on which it runs. You look at the allocated volumes—all with the Platinum service level, no less—and are shocked to find that the latency across them is about five times higher than the target for that service level.
What gives? Why is the system not doing what it should, keeping these folks happy and making my day less stressful? Now, I have a new fire to put out, and I thought this technology was supposed to stop it from ever happening in the first place.
I’ve observed this scenario play out before, both for PowerMax admins and for teams managing other systems. It’s one thing to identify when volume latencies exceed their scoped service levels, but what could we do to prevent that first phone call from derailing the day? Myriad other questions come to mind that will all doubtless have to be answered to put this incident to rest. Was this incident a one-off, or were there warning signs we missed? Is the issue systemic? If so, why hasn’t anyone else complained yet? What caused the problem in the first place?
Data Infrastructure Insights provides the context and deep analysis to answer these questions. It also enables you to manage risks proactively, before they become the next catastrophe.
Note: While the actions we explore below to mitigate incidents like this focus on PowerMax, the capabilities you’ll see apply to any infrastructure supported by Data Infrastructure Insights.
Artificial Intelligence for IT Operations (AIOps) is a game-changer in modern IT infrastructure management but is far too often a nebulous term without clear benefits. Implementing AIOps to provide a well-defined outcome aligned with the services you need to deliver is critical. Data Infrastructure Insights applies artificial intelligence to focus on providing you with actionable outcomes – you don’t have to be a prompt engineer or an expert at data mining, and you shouldn’t have to be!
For example, machine learning algorithms analyze storage duty cycles and predict potential issues in critical applications based on their seasonality week over week. That intelligence could help you resolve the next SEV-1 before it ever happens.
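Data Infrastructure Insights does this modeling for you, but if you want a feel for the underlying idea, here’s a minimal Python sketch of week-over-week seasonality checking. Everything in it (the metric, the z-score threshold, the shape of the history) is an illustrative assumption, not a description of the product’s internals:

```python
from statistics import mean, stdev

# Illustrative only: flag a sample that deviates sharply from the same weekday
# and hour in prior weeks. `history` maps (weekday, hour) -> IOPS seen in past weeks.
def is_anomalous(history, weekday, hour, current_iops, z_threshold=3.0):
    baseline = history.get((weekday, hour), [])
    if len(baseline) < 3:              # not enough seasonal history to judge
        return False
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return current_iops != mu
    return abs(current_iops - mu) / sigma > z_threshold

# Example: Monday 09:00 normally runs ~20k IOPS; today it's 65k.
history = {(0, 9): [19_500, 20_100, 20_800, 19_900]}
print(is_anomalous(history, weekday=0, hour=9, current_iops=65_000))  # True
```

The point is simply that “unusual for a Monday morning” is a far more useful signal than “above a fixed threshold.”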
This forward-looking view saves the time you would otherwise spend on calls doing live troubleshooting with some very stressed application owners. It also makes your storage systems more reliable and efficient, because you’re now empowered to anticipate changes in demand. Did the average block size of that workload shift, causing it to push more IOPS for the same throughput? Data Infrastructure Insights spotted that. Is our most critical ERP database seeing occasional blips of increased latency, but nothing that would have tripped a hard threshold yet? That, too, would be identified before the ERP team’s alarm bells blew up your inbox. And I can enhance that awareness even further with another capability in Data Infrastructure Insights called Expressions.
Taking it further, Data Infrastructure Insights gives the estate’s Ops teams greater flexibility by allowing for more nuanced analysis. If I want to understand an issue with that critical ERP database, measuring IOPS alone won’t tell me much about whether it’s healthy. Latency tells me something, but increased latency when the DB is idle doesn’t help me solve any problems; it just generates unnecessary noise that wastes my time. Understanding a Storage Group’s workload weighted against its latency would be a significant improvement. Monitors and Dashboards in Data Infrastructure Insights provide this crucial capability through expression-based analysis, enabling more advanced analytics and faster responses to potential incidents.
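Expressions are built in the Data Infrastructure Insights UI rather than written as code, but the concept is easy to sketch. Here’s a hypothetical Python illustration of “only treat latency as a signal when the Storage Group is actually busy”; the function name and the 1,000-IOPS activity floor are assumptions chosen for the example, not values the product prescribes:

```python
# Hypothetical sketch: suppress latency noise from idle periods by weighting each
# latency sample by how busy the Storage Group actually is.
def workload_weighted_latency(latency_ms, iops, activity_floor=1_000):
    if iops < activity_floor:   # idle chatter: latency here is noise, not signal
        return 0.0
    return latency_ms * (iops / activity_floor)

print(workload_weighted_latency(2.5, iops=150))     # 0.0  - idle blip, ignore it
print(workload_weighted_latency(2.5, iops=30_000))  # 75.0 - busy and slow, look closer
```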
For example, look at this graph of a Storage Group’s workload history over the last few days.
We see a repeating activity cycle and occasional cases where latency exceeds half a millisecond. When it does, it’s brief and often occurs during a sudden, short burst of activity. I don’t need to spend my time troubleshooting those two latency spikes when they represent only one minute of low activity; what I need is to understand the busy periods when the application I’m supporting could be at risk.
I could feed just IOPS or Throughput into an anomaly detection monitor. However, that will only tell me when the workload behaves differently, not necessarily when an issue needs the Storage team’s attention. Focusing on latency alone will, as noted, generate noise. On top of that, latency is not a good measure of this application’s duty cycle; it can change based on many variables, storage-side or otherwise. Latency is still the best indicator of when I need to act, though, so let’s use it to our advantage and compare this chart to the one above.
Data Infrastructure Insights has enabled me to collapse IOPS and Throughput into a new metric that weighs them against observed latency, which I called ‘blendedWorkloadScore’ (I know, the name needs work). IOPS and Throughput are weighted 2:1, so heavy IOPS under high latency influences the score more than larger block sizes and greater throughput do. You’ll notice a new spike in the second workload cycle that wasn’t evident before. It stands out because latency did increase (though barely enough for me to have detected it before) while the load was high. I can now feed this score into an Anomaly Detection Monitor, giving my team a much better understanding of the applications we provide infrastructure for and the ability to head off issues before those teams are impacted.
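You don’t need my exact expression, but as a rough sketch of the arithmetic behind a score like this (the normalization constants here are assumptions chosen for illustration, not anything Data Infrastructure Insights prescribes):

```python
# Rough sketch of a blended workload score: IOPS and Throughput contribute 2:1,
# and the load term is scaled by observed latency so that busy-and-slow periods
# stand out while idle latency blips and healthy busy periods do not.
def blended_workload_score(iops, throughput_mbps, latency_ms,
                           iops_norm=50_000, tput_norm=1_000):
    load = (2 * (iops / iops_norm) + (throughput_mbps / tput_norm)) / 3
    return load * latency_ms

print(blended_workload_score(40_000, 600, latency_ms=0.3))  # busy but fast -> modest score
print(blended_workload_score(40_000, 600, latency_ms=1.2))  # busy and slower -> score jumps
```

Because the load term is scaled by latency, a busy-but-fast period scores low, while the same load under rising latency stands out immediately.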
The last piece of this puzzle is a better understanding of how assets interrelate. If I see a spike in latency on a database at the Platinum service level, I need to understand how and why that happened. Data Infrastructure Insights provides multiple layers of correlative analysis to help me achieve this.
First, I can investigate any asset (Volume, Host, Port, etc.). This gives me access to the SAN Analyzer, which shows me a map of active paths overlaid with performance data, providing key health context that is easy to understand with minimal effort. More on that topic, including a demo video, can be found here.
Second, as I investigate, Data Infrastructure Insights will correlate observed events to alerts. I can quickly understand if that spike was related to some change on one of the volumes in the Storage Group or if it was perhaps related to some changes on the compute side of the house.
The example here shows a Virtual Machine with abnormal latency due to a migration.
It doesn’t end there, though. Data Infrastructure Insights also cross-checks how metrics interrelate as I investigate, showing me correlative matches and helping me determine whether the problem lies in a different layer from the one I’m looking at. In this case, the latency, while somewhat correlated with the asset in question, traces back to the host rather than to the virtual machine mentioned above.
Returning to the blip we observed, let’s examine the volume in the Storage Group with the highest latency during that peak. Two notable flags are immediately apparent: First, the SRP’s utilization correlates with this Volume's workload.
Second, Data Infrastructure Insights has identified that contention is at least partly caused by another Volume in a different Storage Group. That is suspicious…
This volume, 0032B, is in a Storage Group aligned to an ESX cluster: something completely different from the ERP database I was initially investigating. I can also see that it’s assigned to the Diamond service level, meaning it will get higher priority for responsiveness from the PowerMax Array than anything on Platinum, at least when resources are constrained and the Array has to choose what gets priority.
Without going deeper, my working theory is that the SRP might be under pressure, or that the Directors are having to go to disk more often than they should rather than relying on cache; either case could cause this deprioritization of the Platinum ERP database and the missed SLA target. I can use the purpose-built Service Level Performance Analysis dashboard for PowerMax to validate whether we see downward pressure from Diamond on the Platinum level, and wow, it certainly seems to be the case.
Looking here, I see that a considerable volume of IOPS is run on the Diamond Service Level and that latency on Platinum did indeed spike across the board for a while – not just on the ERP service. I’m also unsure why ALL OF ESX is aligned to that service level in the first place. That means tons of general-purpose workloads are probably getting higher priority than many of my most critical apps. While the ERP team hasn’t noticed this blip yet since the portion hitting ERP was insignificant compared to this broader issue, I now have the time to re-examine the ESX Storage Group’s placement on Diamond and take a second look at everything else scoped to Diamond. It’s almost like allocating everything to the highest priority means nothing gets proper priority.
Ensuring optimal performance and user satisfaction will always be challenging, especially as workloads and user needs become more dynamic. Data Infrastructure Insights provides native integration for Dell PowerMax, offering deep visibility, advanced analytics, and proactive management capabilities that address the challenges of operating this storage platform.
By leveraging Data Infrastructure Insights, IT teams can comprehensively understand their storage environments, ensuring service levels are met and workloads appropriately distributed. The ability to detect anomalies using well-defined AIOps capabilities, expression-based metrics, and correlative analysis empowers teams to anticipate and resolve issues before they impact end-users. This enhances the reliability and efficiency of storage systems and reduces the stress and reactive firefighting that often accompanies performance issues.
The scenarios discussed illustrate how Data Infrastructure Insights can transform the management of PowerMax arrays. It provides actionable insights that lead to better decision-making and more effective resource allocation. Whether identifying potential issues before they escalate, understanding the nuanced behavior of applications, or correlating performance metrics across different layers, Data Infrastructure Insights equips you with the capabilities you need to maintain a seamless end-user experience.
To conclude, onboarding Dell PowerMax into your NetApp Data Infrastructure Insights environment is a strategic move that can significantly improve the end-user experience. By adopting these advanced capabilities, organizations can ensure their critical applications run smoothly, maintain high service levels, and deliver more substantial value to their business and users.
To learn more, check out the Data Infrastructure Insights overview here, join us on the BlueXP channel in the NetApp Discord, or ask the community a question.