Thanks for clarifying a lot of issues for me. One more follow-up: if I implement SnapVault, it can back up snapshots to a remote site, and I could restore uninfected data from an earlier copy on that remote site. Would that be any better than relying on local snapshots? I know this method saves space on the primary site and keeps all the data should DR happen. But purely from the perspective of protecting data from a ransomware attack, I would say there is no difference, because the infected data will be replicated to the remote side by SnapVault as well. Am I right?
Let's say I have an NFS volume with FabricPool and a snapshot policy enabled. The minimum cooling period is 2 days, which means any cold blocks that have not been modified will be tiered to the S3 bucket. The snapshot policy is one snapshot per day. Now assume the volume gets corrupted or infected (encrypted) by ransomware on Friday. Can I then use a snapshot taken Thursday, Wednesday, or any earlier day to restore the data? My answer is probably not, because the metadata or the active data may be corrupted, and not all of the blocks I need for that day will necessarily have been tiered to S3, so I cannot restore all the blocks back. In other words, FabricPool with a snapshot policy is not a backup. Am I right? Can the experts here please confirm or shed some light on this for me?
Thank you for the detailed answers! Since you sound so familiar with these products, I have one more specific question: if I wanted to, for instance, put Oracle into backup mode first, does SnapCenter provide such commands for me? Or do I have to use bash or something like Python to achieve that? Does SnapCenter provide an interface to bash or Python? What about Snap Creator — does Snap Creator have such commands? (I've sketched below the kind of step I mean.)
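To be concrete, this is the kind of pre-snapshot step I would otherwise have to script myself. It is only a rough sketch, not a SnapCenter feature: it assumes sqlplus is on the PATH and that ORACLE_SID/ORACLE_HOME are already set for the target database.

```python
# Hypothetical sketch, NOT a SnapCenter command: putting Oracle into hot backup
# mode around a storage snapshot, assuming sqlplus is available and the Oracle
# environment (ORACLE_SID/ORACLE_HOME) is already configured for this database.
import subprocess

def run_sql(statement: str) -> None:
    """Feed a single SQL statement to sqlplus as SYSDBA."""
    sql = f"{statement};\nexit\n"
    subprocess.run(
        ["sqlplus", "-S", "/", "as", "sysdba"],
        input=sql,
        text=True,
        check=True,
    )

# Put the database into hot backup mode before triggering the snapshot...
run_sql("ALTER DATABASE BEGIN BACKUP")
# ...take the storage snapshot here (however that ends up being driven)...
# ...then take the database out of backup mode afterwards.
run_sql("ALTER DATABASE END BACKUP")
```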
Can somebody please tell me: 1. Can the SnapCenter Oracle plug-in do everything SnapManager for Oracle can? 2. What are the fundamental differences between the two in terms of concepts or implementation? Thank you!
1. My understanding is that SnapMirror/SnapVault will not consume many storage IOPS. Correct? 2. How much CPU could SM/SV consume on the source and destination clusters respectively? Thank you in advance for your advice!
I would probably still want to keep #2 as the other option, because there is no extra cost for running a 3rd-party backup tool to back up NFS on the client, whereas we have to pay for Cloud Backup. Does that make sense?
Appreciate your help! 1. Each NFS mount corresponds to a CID, and for each CID ONTAP allows a maximum of 128 execs. If I have 2 NFS mounts on a client, ONTAP would then allow 2 x 128 = 256 execs. If nconnect is not set, there is one TCP connection per mount, so with 2 NFS mounts there would be 2 TCP connections. ONTAP only supports 128 slot table entries per TCP connection. Are these statements all correct? 2. Clients default to 65535 slot table entries; should we change that to 128 as the maximum, because too many slot table entries from these clients may exhaust exec contexts on the nodes? Does that make sense? 4. A node can have 100K TCP connections, and each TCP connection allows 128 exec contexts, so the node can allow 12,800K exec contexts in total. Correct? 5. Since ONTAP enables NAS flow control, for a cluster that mainly serves NAS protocols we don't need to worry about enabling or disabling flow control on the NIC/switch. Correct? (My back-of-the-envelope arithmetic for #1 and #4 is below.)
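To sanity-check the numbers in #1 and #4, here is the arithmetic as I understand it, as a minimal Python sketch. The limits (128 exec contexts per CID/TCP connection, ~100K TCP connections per node) are just the figures from this thread; please correct me if any of them is wrong.

```python
# Back-of-the-envelope arithmetic only; the limits below come from this thread
# (128 exec contexts per CID/TCP connection, ~100K TCP connections per node)
# and may need correcting.
EXECS_PER_CONNECTION = 128        # exec contexts ONTAP allows per CID/TCP connection
TCP_CONNECTIONS_PER_NODE = 100_000

mounts_on_client = 2              # example: 2 NFS mounts, no nconnect -> 2 TCP connections
client_execs = mounts_on_client * EXECS_PER_CONNECTION
print(f"Execs ONTAP will allow this client: {client_execs}")        # 256

node_execs = TCP_CONNECTIONS_PER_NODE * EXECS_PER_CONNECTION
print(f"Theoretical exec contexts per node: {node_execs:,}")        # 12,800,000
```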
I have some follow-ups, if you can please help me out: 1. Based on the large increases in execs_blocked_on_cid from running that "statistics" command (>10k in a minute), it looks like ONTAP is pushing back on client requests on some nodes in the cluster. Can we then conclude that these nodes have performance issues caused by too many concurrent NFS operation requests? 2. In your document you used a workload that creates lots of files or directories as the example to illustrate the concepts. What about Oracle database files? Could a workload like that also generate a large number of simultaneous NFS requests, or overrun exec contexts? I can see about 300 Oracle .dbf datafiles on one NFS file system, and based on the timestamps they keep changing across all of these files from time to time (I know they are not OLTP-type applications). To tell accurately whether this client is really overrunning exec contexts, we would have to upgrade ONTAP to 9.8 first. Correct? 3. Would it be okay to configure nconnect without knowing whether the client really is overrunning exec contexts? 4. On page 113 you state that 3 ms of latency was added, but Figure 19 shows ~9 ms of latency. Why? 5. Does the overrun issue also apply to an NFS datastore mounted on an ESXi host as the client? (The quick rate calculation behind #1 is below.)
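For #1, this is how I arrived at the ">10k in a minute" figure: just the delta between two samples of the execs_blocked_on_cid counter. A trivial sketch, with made-up sample values for illustration.

```python
# Trivial rate calculation for #1: two samples of the execs_blocked_on_cid
# counter taken a known number of seconds apart (values are made up).
def increase_per_minute(first: int, second: int, interval_seconds: float) -> float:
    return (second - first) * 60.0 / interval_seconds

# e.g. counter went from 1,250,000 to 1,262,500 over a 60-second interval
print(increase_per_minute(1_250_000, 1_262_500, 60))   # 12500.0 -> >10k per minute
```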
I have read the pages starting from p. 111, and there is clearly exec context blocking on some storage nodes, because the counter increases by at least 10k (or much more) over a 1-minute period. We are on 9.7P11, most of the Linux VM servers run Red Hat 7.9, and all of them use the default RPC slot tables. As I said previously, each storage node has 1,200-1,600 NFS mounts, so my questions are: 1. Should we reduce the workload (NFS mounts) on the nodes with more exec context blocking by moving workloads to other nodes, i.e. redistribute the load among nodes? 2. For the server in question, since we are all using the default RPC slot tables, there seems to be no way to identify whether RPC slot tables / exec context blocking is the issue here. What can I do in this situation? Thank you for your valuable messages! (I've noted below how I'm checking the client-side slot table settings.)
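For reference, this is how I'm checking the client-side RPC slot table settings on the RHEL 7.9 VMs: a small sketch that just reads the sunrpc module parameters. I'm assuming the usual /proc/sys/sunrpc paths, so the exact file names may differ on other kernels.

```python
# Read the Linux client's RPC slot table settings (RHEL 7.x paths assumed;
# these are the sunrpc module parameters exposed under /proc/sys/sunrpc).
from pathlib import Path

for name in ("tcp_slot_table_entries", "tcp_max_slot_table_entries"):
    path = Path("/proc/sys/sunrpc") / name
    if path.exists():
        print(f"{name} = {path.read_text().strip()}")
    else:
        print(f"{name}: not found on this kernel")
```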
We have sometimes experienced performance issues with batch jobs running on Linux VMs, yet latency on the datastore or volume is only 1-2 ms, and we couldn't identify the root cause. That made me look down at the ifgrp and ports, where I found packet discards. I ran "node run -node node-name ifstat port-name" and found that discards/total frames is higher than expected. Since you point out that too many NFS mounts won't cause discards, what if I increase the bandwidth from 4x10GbE to 4x40GbE — would that improve things? I ask because one of the HA pairs has 4x40GbE and its discard rate is very low, <0.1%.
@parisi , Thanks! NFS datastores (layer 2) and volumes (layer 3) are heavily used here. NFS mounts of both types do not seem to be evenly distributed across LIFs/nodes; each node carries 1,200-1,600 mounts. I have also noticed packet discards on some ports within the ifgrp groups (4x10GbE or 4x40GbE). Some ports on some nodes have reached as high as 1.x or 2.x%, while my understanding is that 0.1% or below can be treated as acceptable. Should packet discards be used as the measurement to determine whether LIFs/nodes are overloaded with NFS mounts and whether mounts need to be redistributed? Are there any other measurements? And should we upgrade the ifgrp from 4x10GbE to 4x40GbE for the ports that are at 10GbE and showing packet discards? (The simple discard-rate check I'm using is below.)
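For what it's worth, this is the simple check I'm applying to the counters I read from "node run -node node-name ifstat port-name": discards divided by total frames, compared against the 0.1% threshold mentioned above. The counter values here are made up, and I'm pasting the numbers in by hand rather than parsing the ifstat output.

```python
# Discard-rate check: values are copied by hand from the ifstat output
# (discards and total frames for a port); 0.1% is the threshold discussed above.
def discard_rate_percent(discards: int, total_frames: int) -> float:
    return 100.0 * discards / total_frames

ports = {
    "a0a-e0e": (1_234, 5_600_000),     # (discards, total frames) - made-up sample values
    "a0a-e0f": (98_000, 5_400_000),
}
for port, (discards, total) in ports.items():
    rate = discard_rate_percent(discards, total)
    flag = "OK" if rate <= 0.1 else "investigate"
    print(f"{port}: {rate:.2f}% discards -> {flag}")
```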
Well, I can understand your first and last parts, but I'm not sure about the 2nd part, and I am not that familiar with PowerShell. There are about 25 vservers, about 700 export policies, and thousands and thousands of rules across this cluster. Pulling them and saving them on a Linux server on a daily basis should be easy, and they are in ASUP already. However, using those saved files to "restore" or rebuild the exports does not seem straightforward, and I'm not sure it's the right thing to do. If anything goes wrong it could mess up the whole cluster, because this is an NFS-based environment. As storage admins we can write our own script, but it may not be that good. Is anybody out there doing something similar? Does this make sense at all? >> you should be able to pipe yesterday's variable into the new-* or set-* commands. As I said, I am not good at PowerShell. But is it really possible to capture the entire export configuration, including every single policy/rule, into a variable, compare today's with yesterday's, and if there is a difference, pipe yesterday's variable (which contains the entire export configuration) into new-* or set-* commands so that those commands rebuild the entire export configuration back to yesterday's state? Am I understanding you correctly? I am not asking about just taking NAS reports. (A rough sketch of the capture-and-compare part I have in mind is below.)
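Just to make my question concrete, this is roughly the capture-and-compare half I can picture doing myself; the restore half is the part I'm unsure about. It is only a sketch: it assumes I can ssh to the cluster management LIF with a read-only account and that "vserver export-policy rule show -instance" output is a reasonable thing to diff day over day. It does not push anything back to the cluster.

```python
# Rough capture-and-compare sketch (the restore/rebuild half is the open question).
# Assumes passwordless ssh to the cluster management LIF with a read-only account;
# it only saves and diffs text, it does not change anything on the cluster.
import subprocess
import datetime
import difflib
from pathlib import Path

CLUSTER = "admin@cluster-mgmt"          # placeholder management LIF / user
OUTDIR = Path("/var/backups/export-policies")

def capture() -> Path:
    """Save today's full export-policy rule listing to a dated file."""
    OUTDIR.mkdir(parents=True, exist_ok=True)
    out = subprocess.run(
        ["ssh", CLUSTER, "vserver export-policy rule show -instance"],
        capture_output=True, text=True, check=True,
    ).stdout
    path = OUTDIR / f"export-rules-{datetime.date.today()}.txt"
    path.write_text(out)
    return path

def diff_against(previous: Path, current: Path) -> str:
    """Return a unified diff of yesterday's capture vs today's."""
    return "".join(difflib.unified_diff(
        previous.read_text().splitlines(keepends=True),
        current.read_text().splitlines(keepends=True),
        fromfile=previous.name, tofile=current.name,
    ))
```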
We had an incident where a mistake was made and it messed up an export policy. Since then we have been discussing how to recover from incidents like this. One idea is to use scripts (e.g. Ansible) to compare the current policies/rules against the "last known good state" and then restore the good values. Does that make sense to you? Is there any mechanism to restore export policies back to a previous good state? We can modify individual values with the CLI if we know what needs to change. But completely restoring all export policies/rules from our own saved database (the last known good state) sounds dangerous: we would have to develop scripts to save that database, compare, and change things accordingly, which does not sound easy. Are we overthinking this? This is a CDOT cluster with about 700 policies and thousands of rules. Thank you for your input!
Assume I have scheduled a SnapMirror transfer every 4 hours and, for whatever reason, a transfer fails. After that: 1. How many retry attempts will be made for that transfer? 2. Or will it not be started again until the next scheduled time? 3. Do SnapMirror transfers consume a lot of resources on the primary cluster? Thank you!
The detailed analysis and information is very helpful. With all that being said, are you saying that the QoS public clouds usually put on customers' storage is really on bandwidth, not IOPS? Sorry for being slow.
Thanks for your time and messages. I can understand that cloud providers would put QoS on my disks, because they are essentially shared with other customers. What I missed is: why do you call it bandwidth rather than IOPS? Do they impose IOPS limits on I/O access as well? And if they have strict bandwidth controls, why are they fine for DB workloads but not for DB backups?
Flash Cache is helpful for database workloads and random reads, as you pointed out. But after I sent my previous message, I started thinking it might not help RMAN backup or restore, because those are sequential reads/writes. Right? It is interesting to know that NetApp native snapshots can be used for Oracle backup; my understanding was different. I will look into the TR document you are referring to. Thanks.
3) Good to know; that should explain why I see high latency on restore/reads but low latency on backup/writes. I would like to add: to improve read performance, one can choose an instance type that supports NVMe, which CVO uses as Flash Cache. Also, in Azure, with the Premium storage type larger disks deliver better performance (IOPS), but that is not the case with the Standard storage type. About burst credits: is there any way to track or monitor burst credits in CVO?
What is your recommendation for a tool that uses snapshots to back up or restore an Oracle database? Would that be SnapCenter for Oracle? Also, what is your recommendation for increasing the IOPS capability of the backend disks? Upgrading from HDD to SSD is expensive. And again, my question was why there was such high latency when RMAN was doing the restore/read but such low latency when doing the backup/write on the same aggregate and CVO. Thanks for your input.
It is actually CVO in Azure, on a 10 x 1 TB HDD aggregate. Yes, there is a noticeable performance impact. I didn't mention these details because my question was simply about read (restore) versus write (backup) on the same environment/aggregate/filer (CVO): why was there very high latency for reads but very low latency for writes, and how do you explain it?
We have a process that uses RMAN to restore a DB to a VM server by reading a backup image stored on an NFS share, and after a successful restore it backs the DB up to the same NFS share. During the restore, both latency and IOPS on this NFS share were very high: 25 ms/op and 1,400 IOPS. During the backup, latency was only about 1-2 ms/op with about the same IOPS. Throughput was also about the same in both cases. I thought RMAN reads and writes are both sequential I/O, so I cannot explain why the reads would cause such high latency. Can somebody please explain this to me? Thank you!
In Cloud Manager, for CVO in AWS, there is a "Cost" tab between the "Volume" and "Replications" tabs that shows the AWS monthly cost. However, how do I find the same monthly cost for CVO in Azure? There doesn't seem to be such a tab or option. Thanks for your advice!
On a Linux NFS file system you can run "du .snapshot", but that does not show the real amount of space the snapshots are using. I can tell on the NetApp filers; is there any way I can tell from Linux? (A rough idea of what I mean is below.)
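To be clear about what I'm after, below is a rough sketch of the kind of check I could do from the Linux box itself: ssh to the cluster and ask ONTAP for the snapshot sizes of the volume behind the mount, instead of trusting "du .snapshot". It assumes I have ssh access to the cluster management LIF and that "volume snapshot show" with a size field is the right command; the exact field name may differ between ONTAP versions, so please correct me.

```python
# Rough idea only: ask ONTAP (over ssh from the Linux client) how much space the
# snapshots of a given volume really use. Assumes ssh access to the cluster
# management LIF; the exact -fields name for snapshot size may differ.
import subprocess

CLUSTER = "admin@cluster-mgmt"     # placeholder management LIF / user
VOLUME = "my_nfs_volume"           # placeholder volume name behind the NFS mount

cmd = f"volume snapshot show -volume {VOLUME} -fields size"
result = subprocess.run(["ssh", CLUSTER, cmd], capture_output=True, text=True, check=True)
print(result.stdout)
```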