ONTAP Discussions

snapvault lag keeps growing

HOAGLALJ1
16,566 Views

Using SnapVault to back up 20 Windows 2008 R2 servers nightly.

snapvault status shows the lag time growing each day for one server (see the attached screenshot).

- it had been working fine for months

- there is plenty of space on the NetApp 2040

- the lag time is now over 300 hours. Normally, the nightly backups complete in an hour or less.

- pinging the server indicates "servername is alive"

That's the depth of my knowledge - can someone suggest what else I can check?

Thank you in advance.

15 REPLIES

haenggi
16,500 Views

Is the SnapVault service running on the Windows host?

What do you see in the /etc/messages file on the storage controller at the time the snapvault update should start?

HOAGLALJ1
16,499 Views

Thank you for your quick reply.

Yes, the SnapVault service is running on the Windows host. I also successfully restarted it when I noticed it was missing the backups.

To see the /etc/messages folder on a regular Linux box, I would use the cd command. It appears several "standard" Linux commands are not available on the storage controller, so I am not sure how to look at the files.

I apologize for my minimal knowledge, still coming up to speed on this system.

Lou Hoagland


haenggi
16,499 Views

Easy, no problem. That's what the communities are for... There are several ways you can check the messages log:

CLI: use the command "rdfile /etc/messages"; it will print the messages file to the shell.

CIFS: access the file in the root vol using the following path: \\storagesystemname\etc$\messages

NFS: access the file in the root vol using the following path: storagesystemname:/vol/vol0/etc/messages

Or via System Manager, where you can read the log in the diagnostics section (not sure about the exact section, but it is accessible in System Manager).
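If the file turns out to be long, one quick way to pull out just the SnapVault lines is something like this (only a rough sketch; "storagesystemname" is a placeholder, and type/findstr are standard Windows commands, nothing NetApp-specific):

On the controller CLI:    rdfile /etc/messages
From a Windows admin box: type \\storagesystemname\etc$\messages | findstr /i snapvault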

Cheers Michael

HOAGLALJ1
16,499 Views

Ah, that was my error. I thought messages was a folder that contained the log files.

After reviewing the messages file, it appears the issue started with "could not read from socket."

(this entry was the first entry associated with the Windows Host in question, landoversvr)

Tue Apr 24 21:25:29 EDT : SnapVault: destination transfer from landoversvr.obg.corp:D:\Data to /vol/winbackups2/landoversvr : could not read from socket.

Tue Apr 24 21:26:05 EDT : SnapVault: destination transfer from landoversvr.obg.corp:D:\Data to syrna01:/vol/winbackups2/landoversvr : base Snapshot copy for transfer no longer exists on the source.

I couldn't find anything in the Community KB specific to this "could not read from socket" error.

Once again, thanks in advance!

haenggi
16,501 Views

It looks like it lost the status on the Windows host. But to be sure, here you can find some KB articles regarding this error on kb.netapp.com. This is the official NetApp Knowledgebase, managed by our Global Support, accessible via the Support Portal at http://support.netapp.com or directly at http://kb.netapp.com.

https://kb.netapp.com/support/index?page=answers&startover=y&question_box_status=changed&question_box=ossv+could+not+read+from+socket&ichbox[]=en-US

You should be able to log on with the account you use for the communities.

Hope this helps

stuartwatkins
16,500 Views

I assume this is an OSSV backup?

I have seen the "could not read from socket" error before.  From memory I changed the the port that the client was using, restarted the OSSV service and that seemed to fix it.

Hope that works for you too

HOAGLALJ1
16,501 Views

Thank you for your reply. Yes, it is an OSSV backup.

Using https://kb.netapp.com, I could not find anything about changing the port. Do you know of a reason why the port would require changing?

Five days before this failure started, the Windows host was physically moved, but all the IP settings remained the same. The OSSV backup was successful in its new location for four nights before this error occurred.

haenggi
16,500 Views

I doubt this is a port issue; that would surprise me, especially as you did not change anything in the port settings. Have you checked the link I posted in my previous post? There you should find information about the message you have in the log, not about changing the port.

stuartwatkins
16,500 Views

Just a thought, that's all. I have seen it before; in my case another piece of software was installed that used the same port (10000, I think, by default). Just something else to try.
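If you want to rule a port conflict out, a quick generic check on the Windows host looks roughly like this (standard Windows commands, nothing OSSV-specific; 10000 is just the default I remember, so adjust if your listener uses a different port):

netstat -ano | findstr :10000
tasklist /fi "PID eq <pid_from_netstat>"

If some other process already owns the port, the OSSV listener can't bind to it.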

GVM666GVM
13,872 Views

Most backup products use port 10000 for NDMP, IIRC (Syncsort BackupExpress, BackupExec, ...).

GVM666GVM
13,870 Views

Hi Lou,

I just passed by this thread by accident.

I have seen different causes for lag (note that I am replicating LUNs across data centers, but that shouldn't matter here).

It seems quite obvious that there is no connectivity issue.

This log line actually says it all!

Tue Apr 24 21:26:05 EDT syrna01: replication.dst.noSrcSnap:error: SnapVault: destination transfer from landoversvr.obg.corp:D:\Data to syrna01:/vol/winbackups2/landoversvr : base Snapshot copy for transfer no longer exists on the source.

You might have to read it twice.

What the filer does is create a snapshot on the source filer (primary), and once that is done it copies over all the data as it was at the moment of the snapshot.

This gives the primary and the secondary a common state from which they can do "diffs" (so actually just another snapshot that is copied over).

Without a common starting point, consistency between the two sides would never be possible.

In your case the base snapshot on the source actually got deleted (you can check with snap list <vol>).

The base snapshot should be marked with "snapvault". If you can't find any snapshot marked this way, you no longer have your base snapshot.
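To illustrate, on the source you would run snap list and look for the soft-lock tag next to a snapshot name. The output below is only a mock-up of the 7-Mode format from memory, with a made-up volume and snapshot name:

primary> snap list srcvol
Volume srcvol
  %/used       %/total  date          name
----------  ----------  ------------  --------
  0% ( 0%)    0% ( 0%)  Apr 24 21:25  sv_nightly.0 (snapvault)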

Now that we know this, the question arises: why did the base snapshot get deleted?

This can be:

- user-initiated (someone without enough knowledge, or simply an accident)

- volume space management

NetApp has different features to help you keep enough space on a volume, for example reservations or autogrow.

If for some reason the volume can't autogrow, then depending on your settings the filer will start deleting snapshots to try to free up space.

In the end this might actually delete ALL snapshots for that volume (I have had this happen a few times).
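If you want to see whether space management is what bit you, the relevant 7-Mode settings can be checked roughly like this (a sketch from memory; srcvol is a placeholder for your own source volume, and the exact option names may differ by ONTAP release):

primary> vol autosize srcvol
primary> snap autodelete srcvol show
primary> snap reserve srcvol

Between those three you can see whether autogrow was available, whether autodelete was allowed to prune snapshots, and how much snap reserve the volume had.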


You can resolve this by restarting the sync. I don't think there is another way.

You can prevent this from happening in the future by using snapshot reservations, autogrowing the volume, ...

I hope this helps.

I am aware that this is a rather old thread; however, it may help you understand what went wrong and might help others as well.

keithsanderson
13,870 Views

Hi GVM666GVM

This is the exact thing that I am seeing.  The volume ran out of space and the filer has deleted the base snapshot.  You say you can resolve this by restarting the sync - how do you do that?  I'm not too familiar with snapvault.

Thanks in advance

Keith.

GVM666GVM
13,870 Views

By restarting I mean breaking the old SnapVault relationship (snapvault stop ...) and creating it again (snapvault start -S ...).

This will create a new snapshot on the primary filer and then copy all data over to your secondary.

Once the data is copied over, the SnapVault relationship will work again as scheduled.

Make sure you have enough free space in your volume. If you just stop SnapVault on the secondary, all snapshots are kept.

So you either delete all those snapshots or you need enough space in the volume.

Before starting the SnapVault again, make sure all space has been freed up on the volume; check with df -Vh <volume_name>.

This might take a long time; however, I am not sure whether that has to do with sis (deduplication) or not.
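Pulling that together, the whole sequence looks roughly like this (a sketch only, reusing the source and destination paths from Lou's log entries earlier in the thread as placeholders; check your own relationship with snapvault status before running anything):

secondary> snapvault status
secondary> snapvault stop /vol/winbackups2/landoversvr
secondary> df -Vh winbackups2
secondary> snapvault start -S landoversvr.obg.corp:D:\Data /vol/winbackups2/landoversvr
secondary> snapvault status

The stop breaks the old relationship, df -Vh confirms the space actually came back, and the start kicks off a new baseline transfer.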

keithsanderson
13,870 Views

Thanks for the quick response, I'll give that a go.

nagunixharbor
13,869 Views

I had a similar but not identical problem. The SnapVault lag had grown to 26 hours. I also saw port errors and thought the issue was network- or DNS-related because we had previously changed some IPs. Those errors turned out to be unrelated. In the end the problem was time zones: the source filer was set to EST and the destination filer was set to GMT. After the timezone mismatch was resolved, there were no more problems. Hope that helps.
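For anyone who wants to check for the same mismatch, comparing the clocks on both controllers is quick (7-Mode commands, from memory; the zone name below is only an example):

source> date
source> timezone
destination> date
destination> timezone

If they disagree, set both to the same zone, e.g. timezone America/New_York.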
