Hi experts,
As per https://docs.netapp.com/us-en/ontap-sanhost/nvme_rhel_95.html, ONTAP supports NVMe multipath with the round-robin policy. Is it safe to use? Can data corruption occur when the host detects an error on one path and resubmits I/Os on another path?
Here is what I found while reading the Linux kernel code. When the NVMe host driver detects a command timeout (on either an admin or an I/O command), it triggers error recovery and attempts to resubmit the I/Os on another path.
Taking the nvme_tcp driver as an example, the nvme_tcp_timeout function is called when any command times out:
nvme_tcp_timeout
-> nvme_tcp_error_recovery
-> nvme_tcp_error_recovery_work
-> nvme_tcp_teardown_io_queues
-> nvme_cancel_tagset
nvme_cancel_tagset completes the in-flight requests on the failed path; their completion then goes through nvme_failover_req, which requeues them on a different path. There is no wait before the I/O is resubmitted, so the controller on the old path may not have fully cleaned up the outstanding commands, potentially leading to data corruption on the NVMe namespace.
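For reference, here is a condensed paraphrase of the two functions involved, based on my reading of the upstream source (abridged, not compilable as-is, and the details vary by kernel version). The cancelled request is completed locally with a host-side status, and its bios are pushed onto the multipath requeue list right away; nothing in this path tells the old controller to stop executing the command:

bool nvme_cancel_request(struct request *req, void *data)
{
        /* complete the in-flight request locally with a host-side status;
         * the old controller is never asked to stop executing it */
        if (blk_mq_rq_state(req) != MQ_RQ_IN_FLIGHT)
                return true;
        nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD;
        nvme_req(req)->flags |= NVME_REQ_CANCELLED;
        blk_mq_complete_request(req);   /* -> nvme_failover_req() for multipath I/O */
        return true;
}

void nvme_failover_req(struct request *req)
{
        struct nvme_ns *ns = req->q->queuedata;

        nvme_mpath_clear_current_path(ns);
        /* move the bios straight onto the shared requeue list ... */
        blk_steal_bios(&ns->head->requeue_list, req);
        nvme_end_req(req);
        /* ... and kick the requeue work immediately: no quiesce delay */
        kblockd_schedule_work(&ns->head->requeue_work);
}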
For example, consider the following scenario:
1. The host sends IO1 to path1, but a command on that path times out (IO1 itself, or an earlier command such as a keep-alive). This triggers error recovery, and IO1 is retried on path2, where it succeeds.
2. After that, the host sends IO2 to the same LBA via path2, which also succeeds.
3. Meanwhile, IO1 on path1 has never been aborted and is still executing on the controller.
Ultimately, IO2's data is overwritten by the residual IO1, resulting in data corruption, as illustrated below.
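To make the ordering concrete, here is a small user-space illustration of that timeline. It is not kernel code; the "paths" and the namespace are just modeled with plain function calls and an array, purely to show how the stale write lands last:

#include <stdio.h>

#define NS_BLOCKS 8

static char ns_media[NS_BLOCKS][16];   /* the shared namespace both paths write to */

/* A write arriving through either path ends up on the same media. */
static void controller_write(const char *path, int lba, const char *data)
{
        snprintf(ns_media[lba], sizeof(ns_media[lba]), "%s", data);
        printf("%-5s wrote LBA %d = \"%s\"\n", path, lba, data);
}

int main(void)
{
        int lba = 3;

        /* IO1 stalls on path1; the host times out and, with no quiesce delay,
         * retries IO1 on path2, which succeeds. */
        controller_write("path2", lba, "IO1-data");

        /* The host then writes IO2 to the same LBA via path2. */
        controller_write("path2", lba, "IO2-data");

        /* The original IO1 was never aborted on path1 and finally reaches the
         * media after IO2: the residual, stale write lands last. */
        controller_write("path1", lba, "IO1-data");

        printf("final LBA %d content: \"%s\", but the host believes it holds IO2\n",
               lba, ns_media[lba]);
        return 0;
}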
I noticed that the NVMe Base Specification 2.1, section 9.6 "Communication Loss Handling," describes this scenario well. It introduces the Command Quiesce Time (CQT), which provides a cleanup period for outstanding commands on the controller. Honoring CQT in the host before retrying on another path could potentially resolve this issue.
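As a rough illustration of the idea (not a proposed kernel patch), the sketch below simply delays the retry by a CQT value before resubmitting the write elsewhere; cqt_ms, failover_write, and the 500 ms figure are assumptions made up for the example:

#include <stdio.h>
#include <time.h>

/* Assume the host has read a Command Quiesce Time from the controller; the
 * value and unit here are invented for illustration only. */
static const unsigned int cqt_ms = 500;

static void sleep_ms(unsigned int ms)
{
        struct timespec ts = {
                .tv_sec  = ms / 1000,
                .tv_nsec = (long)(ms % 1000) * 1000000L,
        };
        nanosleep(&ts, NULL);
}

/* Hypothetical failover step: give the old controller CQT to quiesce the
 * outstanding command before the write is resubmitted on another path. */
static void failover_write(int lba, const char *data)
{
        printf("path1 failed; waiting %u ms (CQT) before retrying LBA %d\n",
               cqt_ms, lba);
        sleep_ms(cqt_ms);
        printf("resubmitting \"%s\" to LBA %d on path2\n", data, lba);
}

int main(void)
{
        failover_write(3, "IO1-data");
        return 0;
}

The point is only that the retry is held back until the old controller should have quiesced the outstanding command; on a real host this would presumably have to be integrated into the failover/requeue path rather than done per write like this.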