ONTAP Discussions

Does a snapshot represent a crash-consistent image?

aborzenkov
4,817 Views

There are two considerations.

1) write ordering

Consider an application that must ensure that B is always written only after A (any application that uses journaling, such as a file system or database). On NetApp, both A and B actually end up in (NV)RAM and are flushed to disk only some time later. Is it guaranteed that if block A came in before block B, the next CP will include either A alone or both A and B, but never B without A?

2) write splitting

Consider an application that writes in blocks larger than 4K and relies on either the full block or nothing being written (any database with a block size > 4K). Say a 64K block is passed to NetApp as a single IO. Is it guaranteed that a CP includes either the full 64K or none of it?

This does not actually depend on the protocol in use.

Thank you!


alapati

Hi,

In short, the answer is yes. Take Oracle as an example. Oracle does dependent writes, meaning Oracle will not issue write B until it has received an ack for write A. Therefore the snapshot you create will be one of three possibilities: a) both A and B are present; b) A is present but not B; c) neither A nor B is present.
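The three possibilities can be enumerated with a toy model. This is only a sketch: it assumes that a consistency point captures a prefix of the acknowledged writes in arrival order, which is the very property being asked about, so it illustrates the claim rather than proving it. The function name `snapshot_states` is made up for this example.

```python
def snapshot_states(journal):
    """Return every state a snapshot could capture, ASSUMING a CP
    always contains a prefix of the journal in arrival order."""
    return [tuple(journal[:n]) for n in range(len(journal) + 1)]

# Dependent writes: B is issued only after A has been acknowledged.
states = snapshot_states(["A", "B"])
print(states)
```

Under that assumption the state "B without A" never appears in the list.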

If your database is spread across multiple volumes or controllers, you can use the NetApp Data ONTAP CG APIs (cg-start and cg-commit) to achieve a crash-consistent image. Hope this helps.

Srinath

aborzenkov

Take Oracle as an example. Oracle does dependent writes, meaning Oracle will not issue write B until it has received an ack for write A. Therefore the snapshot you create will be one of three possibilities: a) both A and B are present; b) A is present but not B; c) neither A nor B is present.

Sorry for the stupid question, but I am missing the logical connection between the two parts (the "therefore" above).

Yes, Oracle does dependent writes. But a write IO is acknowledged when the data is placed in (NV)RAM, while my concern is about the data on disk.

And you did not mention the split write problem. Consider an application that issues three (dependent) writes: A (4K), B (8K), C (4K). How can I actually be sure that they do not end up on disk as

CP1: A (4K), B' (first 4K)

CP2: B" (second 4K), C (4K)

Such a case makes the content of CP1 effectively corrupted from the application's point of view, but the application has no way to know it because it already got an ACK from NetApp for all three operations.

And to take an extreme example, consider the case of
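The concern can be made concrete with a small model. This is a hypothetical sketch, not a description of how WAFL builds a CP: it assumes writes are laid out as consecutive 4K blocks in arrival order and that a CP could cut the sequence at an arbitrary block boundary. The names `torn_writes` and `blocks_in_cp` are invented for illustration.

```python
BLOCK = 4096  # 4K block size used in the example above

def torn_writes(writes, blocks_in_cp):
    """writes: list of (name, size_in_bytes) in arrival order.
    blocks_in_cp: number of leading 4K blocks included in the CP
    (hypothetical cut point). Returns the names of writes that are
    only partially included."""
    torn = []
    start = 0
    for name, size in writes:
        nblocks = -(-size // BLOCK)  # ceiling division
        end = start + nblocks
        if start < blocks_in_cp < end:  # cut falls inside this write
            torn.append(name)
        start = end
    return torn

writes = [("A", 4096), ("B", 8192), ("C", 4096)]
cut_inside_b = torn_writes(writes, 2)   # CP1 = A + first half of B
cut_after_b = torn_writes(writes, 3)    # CP1 = A + all of B
print(cut_inside_b, cut_after_b)
```

A cut after two blocks reproduces the CP1/CP2 split shown above and reports B as torn; a cut that falls on a write boundary reports nothing.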

fd = open("/file/on/netapp", O_DIRECT|O_SYNC,...);

write(fd, buffer, 10 * 1024 * 1024);  /* buffer holds 10 MB of data */

The operating system ensures that this write will not return until the full 10MB of data has been transferred to the underlying device (NetApp in this case). So the application has every right to assume that when "write" returns, the data is safe on stable storage.

But 10MB is a fair amount of data, which could easily be split into multiple IOs between the system and NetApp. And when "write" returns, some of these IOs are still sitting in NetApp memory and not yet flushed to disk. Again, the on-disk state is inconsistent with what the application expects.
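For reference, here is a minimal sketch of a synchronous large write on a POSIX system. It is illustrative only: it writes to a local temporary file rather than a NetApp-backed path, it leaves out O_DIRECT (which needs aligned buffers), and `write_all_sync` is a made-up helper name. It shows that even from the caller's side one logical write may complete as several smaller ones, so the return value of each call has to be checked.

```python
import os
import tempfile

def write_all_sync(path, data):
    """Write all of data to path with O_SYNC, looping over short writes."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
    finally:
        os.close(fd)

# Usage: a 1 MiB payload written to a temporary file (path is hypothetical).
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
payload = b"x" * (1024 * 1024)
write_all_sync(path, payload)
```

What the storage system does with the data after acknowledging it is outside the scope of this snippet.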

Maybe this is due to my misunderstanding of how snapshots work. If NetApp flushes the current NVRAM state to disk when a snapshot is initiated, this does seem to solve the partial write issue. And actually both issues ... but I have never seen this mentioned anywhere.

radek_kubka

Bear in mind NVRAM is not a write cache - it is a journal.

This is pretty old, but still good reading about how NVRAM works:

http://www.netapp.com/us/library/technical-reports/wp-3001.html

Page 10 says:

Using NVRAM to store a log of uncommitted requests is very different from using NVRAM as a disk cache, as some UNIX products do [Lyon89]. When NVRAM is used at the disk layer, it may contain data that is critical to file system consistency. If the NVRAM fails, the file system may become inconsistent in ways that fsck cannot correct.
WAFL uses NVRAM as a file system journal, not as a cache of disk blocks that need be changed on the drives. As such, WAFL use of NVRAM space is extremely efficient. For example, a request for a file system to create a file can be described in just a few hundred bytes of information, whereas the actual operation of creating a file on disks might involve changing a dozen blocks of information or more. Because WAFL uses NVRAM as a journal of operations that need to be performed on the drives, rather than the result of the operations themselves, thousands of operations can be journaled in a typical storage appliance NVRAM log.

Hope it helps.

Regards,
Radek

aborzenkov

radek.kubka wrote:

Bear in mind NVRAM is not a write cache - it is a journal.

I know. But this is actually unrelated to my question ...

radek_kubka

This particular bit I believe is very relevant to your question:

Because WAFL uses NVRAM as a journal of operations that need to be performed on the drives, rather than the result of the operations themselves, thousands of operations can be journaled in a typical storage appliance NVRAM log.


aborzenkov

Let me ask a very simple question. When NetApp creates a snapshot, does it take the existing CP on disk, or does it create a new CP by flushing the current NVRAM content to disk?

radek_kubka

When a snapshot is taken, a new CP is always created.

You can try this by scheduling a snapshot and then issuing the following command:

sysstat -x 1

Look for "S" in the "CP ty" column.

Regards,
Radek
