Need some help!!

ARMYofONE · ‎2019-01-02

I am not a storage administrator; I work on Red Hat servers and systems. However, since I am "IT", it is now in my lane to work the issues. I was experiencing issues with snapshots building up and sitting at 600-800% on all my /vol .snapshot entries. I have since turned autodelete on for the volumes that had it set to off, and I deleted ALL snapshots.

On SC1, aggr0 is at 95% and aggr1 is at 100%.

On SC2, aggr0 is at 95%, aggr1 is at 98% and aggr2 is at 97%.

Every aggregate's /.snapshot is at 0%.

My version is NetApp Release 8.2.2 7-Mode, with DS4243 and DS4246 shelves. Everything is done from the command line, since I cannot seem to get the GUI to work via Firefox.

Any help is greatly appreciated.

I do see SIS errors where SIS is requesting 4 KB of space and only 1 KB is available. Like I said, I am new to NetApp. The case number for this is 2007726187, but since I am DOD it is assigned to a secured team for support. I do not think I need onsite engineer support from NetApp, though. I never needed it for HP, Windows, VMware, Red Hat, Pure Storage, etc.

Thanks

radek_kubka · ‎2019-01-02

Hi,

Your aggregates are full / almost full and this is never good on a NetApp system.

Aggregate-level snapshots are at 0%, but that doesn't mean your volume-level snapshots are not using any space. You can look at the df output to see space utilisation broken down by volume and its snapshots (you can post it here if it isn't very long).

SIS is a deduplication process, which most likely can't complete due to lack of space within the aggregate.

JGPSHNTAP · ‎2019-01-02

Ok, let me see if I can try to assist you.

100% aggr is very bad..

Are all your vols' snapshot reserves set to zero, or do you have a reserve? Are all your volumes thin provisioned?

Why do you have so many snaps?

Sis is your dedup, not snaps.

snap reserve

snap sched

vol status (look for guarantee = none)

You're going to need to SSH into it to see some data

ARMYofONE · ‎2019-01-02

df shows

All volume .snapshots are at 0%. These used to be over 400, 500 and 600 percent, but I did delete them all. Could they still be stored on the aggregates?

sc1 /vol/vol0/ 0% capacity

sc1 /vol/vol0/.snapshot is 0%

sc1 /vol/data/ 73%

sc1 /vol/data/.snapshot 0%

sc2 /vol/vol0 0%

sc2 /vol/vol0/.snapshot 0%

sc2 /vol/events1/ 80%

sc2 /vol/events1/.snapshot 0%

sc2 /vol/events2/ 65%

sc2 /vol/events2/.snapshot 1%

sc2 /vol/temp_storage/ 19%

sc2 /vol/temp_storage/.snapshot 0%

Hope this helps

ARMYofONE · ‎2019-01-02

Are all your vols' snapshot reserves set to zero, or do you have a reserve? Are all your volumes thin provisioned? Not sure on that at all.

Why do you have so many snaps? Unknown. I could not tell you at all.

Sis is your dedup, not snaps.

snap reserve set to 5% on all

snap sched set to 0 2 6@8, 12, 16, 20

vol status (look for guarantee = none) shows

sc1 vol0 raid_dp, flex, 64 bit root, create_ucode=on

sc1 data raid_dp, flex, sis, 64 bit create_ucode=on, convert_ucode=on

sc2 vol0 raid_dp, flex, 64 bit root, create_ucode=on

sc2 events1 raid_dp, flex, degraded, sis 64 bit create_ucode=on, convert_ucode=on

sc2 events2 raid_dp, flex, sis 64 bit create_ucode=on, convert_ucode=on

sc2 temp_storage raid_dp, flex, sis 64 bit create_ucode=on, convert_ucode=on

All of them say online.
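For anyone reading along, the schedule string quoted above follows the 7-Mode `snap sched` layout of weekly, nightly and hourly retention counts, with an optional `@` list of hours for the hourly snapshots. A minimal sketch of how to read it (the function name and the returned dictionary keys are made up for illustration):

```python
def parse_snap_sched(spec):
    """Split a 'weekly nightly hourly[@hours]' schedule string into its parts."""
    # The thread shows the hour list with spaces after the commas; normalise that first.
    fields = spec.replace(", ", ",").split()
    weekly, nightly, hourly = fields
    count, _, hours = hourly.partition("@")
    return {
        "weekly": int(weekly),
        "nightly": int(nightly),
        "hourly": int(count),
        "hourly_at": [int(h) for h in hours.split(",")] if hours else [],
    }


sched = parse_snap_sched("0 2 6@8, 12, 16, 20")
print(sched)
# Scheduled snapshots retained per volume: nightly + hourly (no weeklies here).
print(sched["weekly"] + sched["nightly"] + sched["hourly"])
```

Read that way, the posted schedule keeps no weekly snapshots, 2 nightly ones and 6 hourly ones taken at 08:00, 12:00, 16:00 and 20:00.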

JGPSHNTAP · ‎2019-01-02

Deleting all your snaps is not a good idea, especially if you don't know exactly what you are doing on NetApp.

I would only delete the snapshots to save production, which it sounds like you might have done inadvertently.

After deleting your snaps, are your aggrs still at 100%?

Also do

priv set adv

sis stat

ARMYofONE · ‎2019-01-02

Yeah I was told to run snap delete -a I think and delete all snapshots.

Aggregates are still at 95-100%. I believe the aggregates were not that high prior to deleting the volume snapshots.

priv set adv did not work but priv set diag did

sis stat shows the following

sc1 /vol/data 40TB allocated Saving 143GB %Saved is 0%

sc2 /vol/events1 45TB allocated Saving 361GB %Saved is 0%

sc2 /vol/events2 27TB allocated Saving 2848GB %Saved is 9%

sc2 /vol/temp_storage 911GB allocated Saving 1566GB %Saved is 63%
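As a rough sanity check on those numbers: the %Saved column looks consistent with saved / (allocated + saved), truncated to a whole percent. That formula is an assumption that happens to match the figures posted above, not something taken from NetApp documentation, and the function name is only illustrative.

```python
GB_PER_TB = 1024  # assumed binary units


def percent_saved(allocated_gb, saved_gb):
    """Assumed formula: saved / (allocated + saved), truncated to a whole percent."""
    return (100 * saved_gb) // (allocated_gb + saved_gb)


# (allocated GB, saved GB) as posted in the thread
volumes = {
    "sc1 /vol/data": (40 * GB_PER_TB, 143),
    "sc2 /vol/events1": (45 * GB_PER_TB, 361),
    "sc2 /vol/events2": (27 * GB_PER_TB, 2848),
    "sc2 /vol/temp_storage": (911, 1566),
}

for name, (allocated, saved) in volumes.items():
    print(f"{name}: {percent_saved(allocated, saved)}% saved")
```

Under that assumption the three large volumes are getting almost nothing back from dedup, which matches the 0%, 0% and 9% shown.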

JGPSHNTAP · ‎2019-01-03

Make sure your snapshots are off at the aggr level

snap sched -A

ARMYofONE · ‎2019-01-03

They are currently on. But snap list -A shows 2% and 3% max. Is it possible when I deleted the volume snapshots that were over 500, 600 and 700% that they are somehow still being stored at the aggregate level?
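For scale, a rough sketch of what those percentages mean. In df, the .snapshot line is reported as a percentage of the snapshot reserve rather than of the whole volume, so with the 5% snap reserve reported earlier in the thread, 600% works out to about 30% of the volume size. The function name is illustrative only.

```python
def snapshot_share_of_volume(snap_pct_of_reserve, reserve_pct):
    """Convert df's .snapshot capacity (percent of reserve) into percent of volume size."""
    return snap_pct_of_reserve * reserve_pct / 100


for pct in (500, 600, 700):
    share = snapshot_share_of_volume(pct, reserve_pct=5)
    print(f"{pct}% of a 5% reserve = {share:.0f}% of the volume size")
```

Anything above 100% on the .snapshot line is snapshot data that has spilled out of the reserve into the space otherwise available to the active file system.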

ARMYofONE · ‎2019-01-03

I am unsure how to turn off the snapshots on the Aggregates.

Also, SIS keeps getting an error, but NetApp tech support said I should never use disk compression or inline deduplication since I have spinning HDDs and not SSDs.

I do believe inline compression is on though, not sure about inline deduplication.

JamesIlderton · ‎2019-01-03

BTW, ONTAP 8.2.2 didn't support any inline efficiencies, so you likely only have post-process deduplication (SIS).

JGPSHNTAP · ‎2019-01-03

Forget about SIS. You don't have enough headroom in aggr or vol for metadata. That's not your issue.

So what is your current status? How much room is in your aggr?

Turn off aggr snaps with the command below

snap sched -A 0

They are usually useless.

But what is your current space status, that's more important.

Also, is your system SATA/SAS or flash, b/c inline dedupe and inline compression aren't really good on the first two.

Also, you are on 7-mode so things are different anyways and your SIS commands show you aren't saving much anyways.

Are these LUNS or Files?

ARMYofONE · ‎2019-01-03

So what is your current status? How much room is in your aggr?

SC1 my aggr0 is at 95% and aggr1 100%

and on my SC2 aggr0 is at 95%, aggr1 98% and aggr2 97%.

All the aggregates /.snapshot is 0%.

Turn off aggr snaps with the command below

snap sched -A 0

Done

They are usually useless.

But what is your current space status, that's more important.

df -A shows the following:

sc1

Aggregate             kbytes         used       avail  capacity
aggr0             3509265040   3345887780   163377260       95%
aggr0/.snapshot            0            0           0        0%
aggr1            63166770432  63166770432           0      100%
aggr1/.snapshot            0            0           0        0%

sc2

Aggregate             kbytes         used       avail  capacity
aggr0             3509265040  33458240384   163366964       95%
aggr0/.snapshot            0            0           0        0%
aggr1            66676035480  65041293188  1634742292       98%
aggr1/.snapshot            0            0           0        0%
aggr2            56148240384  54437240556  1710999828       97%
aggr2/.snapshot            0            0           0        0%
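A quick cross-check of the df -A figures above, as a sketch only. It assumes that used + avail should equal the kbytes column when no reserve is set, and that the capacity column is used/kbytes rounded to the nearest percent; both assumptions fit the posted rows but are not taken from NetApp documentation. Function names are illustrative.

```python
def capacity_pct(total_kb, used_kb):
    """Used space as a whole percent of the total, rounded to nearest."""
    return (100 * used_kb + total_kb // 2) // total_kb


def is_consistent(total_kb, used_kb, avail_kb):
    """True when used + avail adds up to the total (assumes no reserve)."""
    return used_kb + avail_kb == total_kb


# (label, kbytes, used, avail) exactly as posted
rows = [
    ("sc1 aggr0", 3509265040, 3345887780, 163377260),
    ("sc1 aggr1", 63166770432, 63166770432, 0),
    ("sc2 aggr0", 3509265040, 33458240384, 163366964),
    ("sc2 aggr1", 66676035480, 65041293188, 1634742292),
    ("sc2 aggr2", 56148240384, 54437240556, 1710999828),
]

for name, total, used, avail in rows:
    if is_consistent(total, used, avail):
        print(f"{name}: {capacity_pct(total, used)}% used")
    else:
        # Posted figures do not add up; show what kbytes - avail would imply.
        print(f"{name}: used + avail != kbytes, implied used = {total - avail}")
```

Four of the five rows add up. The used figure posted for sc2 aggr0 is larger than the aggregate itself, so it looks like a typo made when copying the output; kbytes minus avail would give 3345898076, which lines up with the 95% shown.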

Also, is your system SATA/SAS or flash, b/c inline dedupe and inline compression aren't really good on the first two.

Definitely not flash.

Are these LUNS or Files?

Unsure how to check for this. I am all Red Hat systems. I want to say Files but do not know how to check for the answer.

JamesIlderton · ‎2019-01-03

For the GUI issue, do you have OnCommand System Manager installed on your admin station? That version of ONTAP did not have an embedded GUI, but if you install it locally you can add the filers and manage them via your browser.

Along the same lines as the other replies, please look at the output of these commands:

'snap sched -A' (this will show the scheduled snapshots for all aggregates, along with the retention)

'snap reserve -A' (this will show any space reservation for snapshots at the aggregate level)

'snap list -A' (this will show any existing snapshots at the aggregate level)

'snap sched -V' (this will show the scheduled snapshots for all volumes, along with the retention)

'snap reserve -V' (this will show any space reservation for snapshots at the volume level)

snap list -V (this will show any existing snapshots at the volume level)

What you should have for maximum available space would be 0% snap reserve across the board, no snapshots scheduled/existing on the aggregates and hopefully minimal snapshots on the volumes. Also, when you list the volume snapshots you should be able to identify snapshots created by the schedules by their names (hourly, daily and weekly with an ordinal number after to indicate the generation). Additional snapshots you see may be manually created or created by other tools (such as SnapManager).

Also, if you have block LUNs on these systems, they may also be thick provisioned. You can run 'lun show -v' to list all of your LUNs and look for the attribute Space Reservation and make sure they show Disabled so they are thin.

If everything is already configured properly and you did remove all the excess snapshots (or all of them, period), then you may be in a situation where you have to migrate some data off quickly. You can likely expand your aggregates with some additional disk shelves, but that's not typically a quick decision unless you happen to have some lying around unused.

Once you get past this initial emergency space issue, do make sure you set up some monitoring and alerting to ensure you don't get in this situation again. NetApp has some good tools, and you should be able to use OnCommand Core (previously named DFM) to monitor these 7-mode systems.

Hopefully these help, and good luck!

ARMYofONE · ‎2019-01-03

For the GUI issue, do you have OnCommand System Manager installed on your admin station? That version of ONTAP did not have an embedded GUI, but if you install it locally you can add the filers and manage them via your browser.

Along the same lines as the other replies, please look at the output of these commands:

'snap sched -A' (this will show the scheduled snapshots for all aggregates, along with the retention)

set to 0

'snap reserve -A' (this will show any space reservation for snapshots at the aggregate level)

all set to 0

'snap list -A' (this will show any existing snapshots at the aggregate level)

All are deleted

'snap sched -V' (this will show the scheduled snapshots for all volumes, along with the retention)

snap sched set to 0 2 6@8, 12, 16, 20

'snap reserve -V' (this will show any space reservation for snapshots at the volume level)

snap reserve set to 5% on all Volumes

snap list -V (this will show any existing snapshots at the volume level)

shows all from the schedule above for each volume

What you should have for maximum available space would be 0% snap reserve across the board, no snapshots scheduled/existing on the aggregates and hopefully minimal snapshots on the volumes. Also, when you list the volume snapshots you should be able to identify snapshots created by the schedules by their names (hourly, daily and weekly with an ordinal number after to indicate the generation). Additional snapshots you see may be manually created or created by other tools (such as SnapManager).

So the volumes should not be set to 5% snap reserve?

Also, if you have block LUNs on these systems, they may also be thick provisioned. You can run 'lun show -v' to list all of your LUNs and look for the attribute Space Reservation and make sure they show Disabled so they are thin.

lun show -v returns a blank command line

If everything is already configured properly and you did remove all the excess snapshots (or all of them, period), then you may be in a situation where you have to migrate some data off quickly. You can likely expand your aggregates with some additional disk shelves, but that's not typically a quick decision unless you happen to have some lying around unused.

Well, since my CPUs are spiking high when I run sysstat -M, I do think that adding an additional shelf may crash this large NetApp. We have moved all mission systems and data over to a VERY old NetApp shelf that is much smaller, and things are running excellently there, with no issues and no aggregates filling up.

The data currently on there is nothing near what the capacity is. We also Zip and Compress data monthly off to another NetApp disk shelf.

Once you get past this initial emergency space issue, do make sure you set up some monitoring and alerting to ensure you don't get in this situation again. NetApp has some good tools, and you should be able to use OnCommand Core (previously named DFM) to monitor these 7-mode systems.

Does this install on a Red Hat system?

JGPSHNTAP · ‎2019-01-03

aggr status -s (show spares)

disk show -n (show unowned)

df -Ag shows aggr in gb

ARMYofONE · ‎2019-01-03

@JGPSHNTAP wrote:

aggr status -s (show spares)

sc1 has an FSAS and a SAS spare (I just recently replaced those two drives because they were bad)

sc2 has no spares

disk show -n (show unowned)

no unassigned disks

df -Ag shows aggr in gb

sc1

Aggregate            total      used     avail  capacity
aggr0               3346GB    3190GB     155GB       95%
aggr0/.snapshot          0         0         0        0%
aggr1              60240GB   60240GB         0      100%
aggr1/.snapshot          0         0         0        0%

sc2

Aggregate            total      used     avail  capacity
aggr0               3346GB    3190GB     155GB       95%
aggr0/.snapshot          0         0         0        0%
aggr1              63587GB   62028GB    1559GB       98%
aggr1/.snapshot          0         0         0        0%
aggr2              53547GB   51915GB    1631GB       97%
aggr2/.snapshot          0         0         0        0%
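The GB figures above tie back to the earlier df -A output if the kbytes are divided by 1024 × 1024 and truncated. A minimal sketch of that conversion; the truncation is an assumption that matches the posted numbers, and the function name is illustrative.

```python
KB_PER_GB = 1024 * 1024


def kb_to_gb(kbytes):
    """Convert a df kbytes figure to whole GB, truncating like the df -Ag output appears to."""
    return kbytes // KB_PER_GB


# kbytes totals as posted from df -A
totals_kb = {
    "sc1 aggr0": 3509265040,
    "sc1 aggr1": 63166770432,
    "sc2 aggr1": 66676035480,
    "sc2 aggr2": 56148240384,
}

for name, kb in totals_kb.items():
    print(f"{name}: {kb_to_gb(kb)}GB")
```

So roughly 3.2 TB of combined free space is left on sc2's two data aggregates, and none at all on sc1's aggr1.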

JGPSHNTAP · ‎2019-01-04

Ok bad news.. Since you don't have spares.. and your aggrs are 100%, you are up **** creek. You have no outs here, especially if all your vols are already thin provisioned.

If your VOLS aren't thin provisioned you might be able to squeeze enough time to get out of this mess, other than that, that system will be full in no time and you will be out of luck... Sorry

ARMYofONE · ‎2019-01-04

Thanks. I am unsure how all those TBs are gone. Three months ago, when it started becoming laggy (before I deleted the massive amount of snapshots that were maxing out the storage system), we were only at 50% max storage usage, and nothing has changed since then. The exact same data is replicated to a much older and smaller single NetApp shelf. That one is fine; it is just this newer one. And since we are looking to buy another one for our other network, the engineers are looking into other vendors, and although I have learned a lot troubleshooting on the CLI (no expert at all), I do not want to try to learn a whole other system.

It is just odd that, after I deleted the massive amount of snapshots that were not set to autodelete, the aggregates filled up within a week of deleting the volume .snapshots. I will call NetApp tech support again. I have like 20 hours of phone time with them already, and with this being a DOD mission system, our NetApp is getting a black eye lol.

ARMYofONE · ‎2019-01-04

Found the bug detail for BUG ID 211569, which covers container block reclamation after snap delete: snapshot entries will be removed from .snapshot and snap list.

HOWEVER, the space used by the snapshots will not be reclaimed, and df will reflect this. (Since the snapshots on my volumes were above 500, 600 and 700%, I believe they spilled over into the aggrs.)

Aggregate space usage may remain abnormally high after flexvol snap deletion or volume snaprestore.

Now trying to figure out how to reclaim the space that the snapshots hogged up on my aggrs.

Trying to work this and see if it helps.

bsnyder27 · ‎2019-01-04

If your VOLS aren't thin provisioned you might be able to squeeze enough time to get out of this mess, other than that, that system will be full in no time and you will be out of luck... Sorry

Definitely something to check on again, I agree. I don't see any mention of checking the space guarantee option on the volumes in the data aggrs. This could potentially give you a lot of aggr space back if your volumes are not thin provisioned.

As JGPSHNTAP noted early on...you'll want to run SSH commands to obtain this information. Probably something with 'vol status' and grep for the word 'guarantee'. If you see a lot of guarantee=volume then you have the ability to thin provision them.

And to explain the option if you're unfamiliar - Volume guarantee grants all of the vol size space to the volume up front. So a 1TB volume will use 1TB from the aggr it's contained in. Guarantee None only takes the used space of the volume from the aggr so you may only have 50GB of the same 1TB volume being used and that is all that it will take from the aggr.
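To put numbers on that, here is a simplified sketch of the 1TB example. It models only the two guarantee settings described above, ignores metadata and any other overhead, and the function name is made up for illustration.

```python
def aggr_space_taken(vol_size_gb, vol_used_gb, guarantee):
    """Space a volume takes from its aggregate under a simplified model."""
    if guarantee == "volume":
        # Thick: the full volume size is reserved up front.
        return vol_size_gb
    if guarantee == "none":
        # Thin: only the space actually used is taken.
        return vol_used_gb
    raise ValueError(f"unsupported guarantee in this sketch: {guarantee}")


size_gb, used_gb = 1024, 50  # the 1TB volume with 50GB used from the example
thick = aggr_space_taken(size_gb, used_gb, "volume")
thin = aggr_space_taken(size_gb, used_gb, "none")
print(f"guarantee=volume takes {thick}GB, guarantee=none takes {thin}GB")
print(f"aggr space handed back by thin provisioning: {thick - thin}GB")
```

In this model, switching that one volume to thin provisioning would hand 974GB back to the aggregate, which is why it is worth checking before giving up on the system.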

Just be sure to keep a close eye on your aggregates when doing this for obvious reasons.
