NetApp iSCSI LUN suddenly full and goes offline (FAS2520 with Data ONTAP v9.0 and ESXi 6.7 U2)

Didi7 · ‎2020-08-12

Hi guys,

We are seeing strange behavior in two different VMware cluster environments. Here are the details:

NetApp FAS2520 with Data ONTAP Version 9.0 (iSCSI is used for the LUNs) and an additional DS2246 shelf

2 aggregates with 5.48TB (24x 900GB SAS HDDs only)

3 thick volumes per controller for a total of 6 thick volumes (without svm_iscsi_root)

HPE DL360 G9 ESXi-hosts with vSphere 6.7.0 Build 15160138 (2 of them in a VMware cluster)

A total of 6 datastores (VMFS 5) based on the 6 NetApp iSCSI LUNs mentioned above + 2 local datastores (VMFS 5)

One of the volumes can autogrow up to 1.01TB (Autogrow Mode: grow) / Storage Efficiency activated / Snapshot Reserve: 0% / Thick provisioned.

Inside this volume is a 1TB LUN with space reservation disabled.

The VMware datastore based on this NetApp LUN (and volume) has a size of 1TB (matching the size of the LUN). This datastore holds 3 VMware VMDK data disks from one VM. The OS disk of this VM is located in another datastore (different LUN and volume) on the same controller.

Inside this VMware datastore there is an .sdd.sf directory and one directory named after the sole VM. The VM directory contains 9 files (3 for each VM data disk): a VMDK descriptor file, a VMDK flat file and a VMDK CBT file.

The 3 VMDK data disks are all THIN but could grow to a total size of up to 750GB, so that 250GB would still be available. Currently approximately 577GB of 1TB are used, so there is enough free space available and the datastore or the NetApp LUN shouldn't go offline.

This particular LUN goes offline every 2 to 3 weeks and makes this production VM unavailable. According to System Manager, the LUN is full and the volume behind the LUN is nearly full, because the volume can grow only 0.01TB beyond the full LUN size.

Switching the LUN online makes the datastore available again within the VMware cluster. The datastore reports that it still has around 450GB available. The LUN reports that it is full.

Because this happened a few times, I created a new volume with a new LUN inside it, presented this iSCSI LUN as a new datastore and used VMware Storage vMotion to move the VMDK data disks to the new datastore, but without any luck. The LUN went offline again, reporting that it was full, and after bringing the LUN online again, the datastore had enough free space again.

The NetApp volume behind that LUN and datastore does not create any NetApp snapshots (checked within System Manager => SVMs => Volumes => Snapshot Copies), and the Snapshot Reserve is 0%.

We have this behavior in 2 different VMware clusters, each in its own location, and it only happens on one of the six datastores (LUN or volume) presented to the ESXi hosts.

I never had this kind of behavior before on this particular FAS2520 (nor on a FAS2040 or FAS2240-2 or FAS2556 or FAS8020).

This behavior first occurred when VMware was upgraded from ESXi 6.0 U2 to ESXi 6.7 U2 a couple of months ago.

Has anyone seen similar behavior, or might Data ONTAP v9.0 not be fully compatible with ESXi 6.7 U2?

Any useful comment is much appreciated.

Best regards,

Didi7

adimitropoulos · ‎2020-08-12

cluster::> set -privilege diag

Answer y and press Enter.

cluster::*> event log show -event *offline*

Please provide the output.

Didi7 · ‎2020-08-12

There you go ...

login as: admin
Keyboard-interactive authentication prompts from server:
| Password:
End of keyboard-interactive prompts from server
deis-alsb0st220::> set -privilege diag

Warning: These diagnostic commands are for use by NetApp personnel only.
Do you want to continue? {y|n}: y

deis-alsb0st220::*> event log show -event *offline*
Time                Node             Severity      Event
------------------- ---------------- ------------- ---------------------------
8/9/2020 00:25:23   deis-alsb0st230  NOTICE        LUN.offline: LUN at path /vol/alsb0_na230_vol_data135/alsb0_na230_lun_data135 in volume alsb0_na230_vol_data135 (DSID 1053) has been brought offline.
8/9/2020 00:25:23   deis-alsb0st230  ERROR         scsiblade.lun.offline.system: LUN 80Adg+JMx0rB in volume with MSID 2159262173 on Vserver svm_iscsi has been brought offline due to lack of space in the volume.
2 entries were displayed.

deis-alsb0st220::*>
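
Since the LUN goes offline every 2 to 3 weeks, it may help to pick these events out of saved `event log show` output automatically. The following is only a minimal sketch: the function names `parse_event_line` and `is_space_offline` are made up for illustration, the regular expression is derived solely from the two lines shown above, and the timestamp is assumed to be month/day/year. Other ONTAP versions or locales may format the log differently.

```python
import re
from datetime import datetime

# Pattern derived from the two sample lines in this thread (an assumption,
# not an official description of the ONTAP event log format).
LINE_RE = re.compile(
    r"^(?P<ts>\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<node>\S+)\s+(?P<severity>[A-Z]+)\s+"
    r"(?P<event>[\w.]+):\s+(?P<message>.*)$"
)

SAMPLE = (
    "8/9/2020 00:25:23 deis-alsb0st230 ERROR scsiblade.lun.offline.system: "
    "LUN 80Adg+JMx0rB in volume with MSID 2159262173 on Vserver svm_iscsi "
    "has been brought offline due to lack of space in the volume."
)


def parse_event_line(line):
    """Split one log line into timestamp, node, severity, event and message.

    Returns None for lines that do not look like an event (headers, prompts).
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Assumption: the date is month/day/year.
    record["ts"] = datetime.strptime(record["ts"], "%m/%d/%Y %H:%M:%S")
    return record


def is_space_offline(record):
    """True if the record is the 'offline due to lack of space' event."""
    return (
        record is not None
        and record["event"] == "scsiblade.lun.offline.system"
        and "lack of space" in record["message"]
    )


if __name__ == "__main__":
    rec = parse_event_line(SAMPLE)
    print(rec["ts"], rec["node"], rec["severity"], rec["event"])
    print("space-related offline:", is_space_offline(rec))
```

Feeding each line of a saved log through `parse_event_line` and filtering with `is_space_offline` would give the dates on which the LUN ran out of space, which can then be compared with backup or deduplication schedules.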

adimitropoulos · ‎2020-08-12

It's a space availability issue, as you can see from the output.
Please send the output of the following:

cluster::*> volume show -volume alsb0_na230_vol_data135 -fields fractional-reserve,snapshot-space-used,snapshot-count,aggregate;df -gigabyte -aggregates;snapshot show alsb0_na230_vol_data135;df -g alsb0_na230_vol_data135

Didi7 · ‎2020-08-12

Please be informed that the volume has been re-configured to auto-grow to a maximum of 1.31TB and the LUN has been re-configured to be 1.3TB in size. Now more than 750GB are free in the datastore and around 550GB are used.

Here is the output ...

login as: admin
Keyboard-interactive authentication prompts from server:
| Password:
End of keyboard-interactive prompts from server
deis-alsb0st220::> set -privilege diag

Warning: These diagnostic commands are for use by NetApp personnel only.
Do you want to continue? {y|n}: y

deis-alsb0st220::*> volume show -volume alsb0_na230_vol_data135 -fields fractional-reserve,snapshot-space-used,snapshot-count,aggregate;df -gigabyte -aggregates;snapshot show alsb0_na230_vol_data135;df -g alsb0_na230_vol_data135
vserver   volume                  aggregate   fractional-reserve snapshot-space-used snapshot-count
--------- ----------------------- ----------- ------------------ ------------------- --------------
svm_iscsi alsb0_na230_vol_data135 st230_aggr1 100%               0%                  0

Aggregate                 total     used    avail capacity
st225_aggr0               368GB    350GB     17GB      95%
st225_aggr0/.snapshot      19GB      0GB     19GB       0%
st225_aggr1              5614GB   1687GB   3927GB      30%
st225_aggr1/.snapshot       0GB      0GB      0GB       0%
st230_aggr0               368GB    350GB     17GB      95%
st230_aggr0/.snapshot      19GB      0GB     19GB       0%
st230_aggr1              5614GB   2631GB   2983GB      47%
st230_aggr1/.snapshot       0GB      0GB      0GB       0%
8 entries were displayed.

There are no entries matching your query.

Filesystem                               total     used    avail capacity Mounted on Vserver
/vol/alsb0_na230_vol_data135/           1116GB   1017GB     99GB      91% ---        svm_iscsi
/vol/alsb0_na230_vol_data135/.snapshot     0GB      0GB      0GB       0% ---        svm_iscsi
2 entries were displayed.

deis-alsb0st220::*>
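
To make the numbers above easier to compare, here is a small arithmetic sketch. It only re-computes figures quoted in this post (the `df -g` row for the volume and the roughly 550GB the datastore reports as used); the variable and function names are made up for illustration, and the datastore figure is approximate.

```python
# Figures copied from the df -g output for alsb0_na230_vol_data135.
VOLUME_TOTAL_GB = 1116
VOLUME_USED_GB = 1017
VOLUME_AVAIL_GB = 99

# Approximate figure reported by the VMware datastore (from this post).
DATASTORE_USED_GB = 550


def capacity_percent(used_gb, total_gb):
    """Used space as a whole-number percentage of the total."""
    return round(100 * used_gb / total_gb)


def unexplained_gb(volume_used_gb, datastore_used_gb):
    """Space the volume counts as used that the datastore does not."""
    return volume_used_gb - datastore_used_gb


if __name__ == "__main__":
    print("capacity:", capacity_percent(VOLUME_USED_GB, VOLUME_TOTAL_GB), "%")
    print("total - used:", VOLUME_TOTAL_GB - VOLUME_USED_GB, "GB")
    print("used in volume but not in datastore:",
          unexplained_gb(VOLUME_USED_GB, DATASTORE_USED_GB), "GB")
```

So the volume accounts for roughly 467GB more used space than the datastore does, even with no Snapshot copies present. That difference is the space the rest of this thread tries to explain.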

adimitropoulos · ‎2020-08-12

It's because the volume is thick and the fractional reserve is set to 100% by default.
Fractional reserve is generally used for volumes that hold LUNs with a small percentage of data overwrite.

If you want to ensure that the LUN stays online, and as long as you have no Snapshot copies, Snapshot reserve, SnapMirror relationships, clones or other "snapshot" features configured, set the fractional reserve from 100 to 0, and set the volume size to be at least 1 byte more than the size of the LUN if the LUN is thin (non-space-reserved).

volume modify -vserver vservername -volume volname -fractional-reserve 0

Hope it helps!

Didi7 · ‎2020-08-14

Hello adimitropoulos ,

First of all, thanks for replying so fast. Our thick volumes are always bigger than the thin LUNs inside. We tend to create a separate volume for each single LUN.

I noticed that Fractional Reserve = 100% is not the default when creating volumes on a FAS2040 with System Manager, but it is the default on a FAS2520. Unfortunately, I did not pay much attention to that parameter when those volumes were created.

After reading different articles about the fractional reserve, I now know how critical this setting can be for a LUN.

As we do not use Snapshot Copies on our volumes at all, I wonder if it's possible to disable the FAS2520 default fractional reserve = 100% on existing volumes via System Manager, or do I need to change the fractional reserve to 0 from the command line first?

Let's assume we have a 1.01TB thick volume and a 1.00TB thin LUN inside of this volume, fractional reserve = 100%, and the LUN is filled with 400GB. Can we disable the fractional reserve = 100% via System Manager, and will we then definitely have another 600GB of available space in the LUN, no matter how much of that fractional reserve was used?

As far as I understand the fractional reserve = 100% ...

If you have a 1.01TB thick volume and a 1.00TB thin LUN and fractional reserve=100%, it means that you can only use half of the 1.00TB in the LUN and if you configure the fractional reserve to be 50, it means that you can only use 75% of the 1.00TB LUN space (the remaining 25% is reserved for Snapshot Copies), right?

Regards,

Didi7

adimitropoulos · ‎2020-08-14

7-Mode
What is the proper configuration for a volume containing a LUN?

https://mysupport.netapp.com/site/article?lang=en&type=question&page=%2FAdvice_and_Troubleshooting%2FData_Storage_Software%2FONTAP_OS%2FWhat_is_the_pr...

For Clustered Data ONTAP
https://docs.netapp.com/ontap-9/topic/com.netapp.doc.dot-cm-sanag/GUID-2F5C9474-FFE9-4E59-84DB-1B9D6D134688.html

Editing Flexible volumes (System Manager)
https://docs.netapp.com/ontap-9/topic/com.netapp.doc.onc-sm-help-900/GUID-31A3B5C7-5D7D-4F52-8978-6D354F8C8399.html

If there are still questions after reading these KBs, try this:
https://netapp.sabacloud.com/Saba/Web_spf/NA1PRD0047/common/ledetail/cours000000000022668

Didi7 · ‎2020-08-17

Hello adimitropoulos,

Thanks for the documentation links. If I find time, I may read some of the documentation. I have noticed in several threads that the Fractional Reserve topic causes a lot of misinterpretation among those who start dealing with it.

The main question from my previous post went unanswered. I wanted to know whether disabling the 100% Fractional Reserve by editing a volume via System Manager will also completely disable the Fractional Reserve and release the space it occupies in the volume.

On Friday I did some tests on a newly created test volume and LUN, which I filled up by overwriting files again and again. The VMware datastore still had a lot of free space left, but the LUN filled up and went offline. I switched the LUN online and disabled the 100% Fractional Reserve via System Manager, but not much changed at all. This morning I noticed that a lot of occupied space in the volume had been released, but given the Deduplication Savings of 40% and Deduplication running on Sunday night, the released space is just Deduplication Savings. The LUN still has 0 bytes free. Now I will start overwriting the same files again and see what happens.

Right now, I am still asking myself, what happens with the occupied Fractional Reserve space, when it was in use and gets disabled via the System Manager.

Regards,

Didi7

Ontapforrum · ‎2020-08-12

It's a very long question; I don't have the patience to read it fully, so I am just glancing over it.

I checked the compatibility, even though ONTAP 9.0 has reached EOVS (End-Of-Version-Support) it is compatible with the ESX build you have, so I rule this out.

Coming to LUN going offline:
A LUN cannot go offline unless the containing volume is full. Even though your host file system is showing enough space, that has no bearing on the NetApp WAFL side, because blocks that have already been written stay allocated [unless you are doing hole punching (freeing of blocks)]. Further, anything you delete from the host side is going to keep occupying space in the volume (as part of snapshots), and don't forget that new writes will go to a new location (i.e. on the volume). Therefore, if the volume is full, it will take the LUN offline to prevent any further writes from the host to the storage.

Suggestion: Either set 'auto-grow' to a large size so that you don't have to worry about the LUN going offline, and/or select 'auto-delete' for snapshots as part of the 'auto-grow' settings, or perform hole punching (using VAAI). Also, I think a thick volume has no benefits as far as NetApp storage is concerned (it has no bearing on performance). Why not make it thin on the go and just monitor the aggregate growth?
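
To illustrate the point about written blocks and hole punching, here is a toy model. It is a deliberately simplified sketch, not how WAFL or VMFS actually allocate space: the class names `ThinLun` and `HostFs` are made up, and the host is assumed to reuse freed blocks last (a FIFO free list). It shows how repeated create/delete cycles on the host can leave the host file system empty while the storage side still counts every block as allocated, unless the host tells the storage which blocks are free.

```python
class ThinLun:
    """Toy thin LUN: remembers which blocks the storage has allocated."""

    def __init__(self, size_blocks):
        self.size_blocks = size_blocks
        self.allocated = set()  # blocks currently backed by volume space

    def write(self, block):
        self.allocated.add(block)

    def unmap(self, block):
        # Hole punching: the host tells the storage the block is free.
        self.allocated.discard(block)


class HostFs:
    """Toy host file system on top of the LUN (assumption: FIFO free list)."""

    def __init__(self, lun, reclaim=False):
        self.lun = lun
        self.reclaim = reclaim
        self.files = {}  # file name -> list of block numbers
        self.free = list(range(lun.size_blocks))

    def create(self, name, nblocks):
        blocks = [self.free.pop(0) for _ in range(nblocks)]
        for block in blocks:
            self.lun.write(block)
        self.files[name] = blocks

    def delete(self, name):
        blocks = self.files.pop(name)
        self.free.extend(blocks)  # freed blocks are reused last
        if self.reclaim:
            for block in blocks:
                self.lun.unmap(block)

    def used_blocks(self):
        return sum(len(blocks) for blocks in self.files.values())


def churn(reclaim, rounds=4):
    """Create and delete a 30-block file several times on a 100-block LUN."""
    lun = ThinLun(100)
    fs = HostFs(lun, reclaim=reclaim)
    for _ in range(rounds):
        fs.create("tmp", 30)
        fs.delete("tmp")
    return fs.used_blocks(), len(lun.allocated)


if __name__ == "__main__":
    print("without reclaim (host used, LUN allocated):", churn(False))
    print("with reclaim    (host used, LUN allocated):", churn(True))
```

In this model the run without reclaim ends with the host using 0 blocks while all 100 LUN blocks are allocated; with reclaim both are 0. Whether a real host reuses freed blocks first or last depends on the file system, so the speed at which a real thin LUN fills up can differ a lot from this sketch.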

Didi7 · ‎2020-08-12

Good to know that Data ONTAP 9.0 is still compatible with ESXi 6.7 U2.

To be honest, it's still one question, but I tried to describe our environment as precisely as possible.

If you had read my post completely, you would have known that our thick volumes can always auto-grow to a size that is bigger than the LUN they contain. All volumes on one aggregate together cannot be as big as the complete size of the aggregate itself. And we do not even use Snapshot Copies on our NetApp volumes. Therefore the LUN should only go offline when the datastore is full, which is not what happens in our current situation, and I have never seen such behavior in the 10-11 years I have worked with NetApp storage.

Why do we use thick volumes? Because we started with thick volumes on our first FAS2040 filers, which used thin LUNs inside of every volume, and we never changed that configuration on more recent filers. Therefore I can say that volumes do not fill up our aggregates. It's a matter of control.

Of course, we have different environments, specifically in our datacenter, where things are different but in all our small locations, we never used thin volumes up to now.

Anyway, thanks for giving your input.