<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Possible data corruption when NVMe host is using multipath? in ONTAP Discussions</title>
    <link>https://community.netapp.com/t5/ONTAP-Discussions/Possible-data-corruption-when-NVMe-host-is-using-multipath/m-p/458240#M44613</link>
    <description>&lt;P&gt;Hi experts,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As per&amp;nbsp;&lt;A href="https://docs.netapp.com/us-en/ontap-sanhost/nvme_rhel_95.html#validate-software-versions," target="_blank"&gt;https://docs.netapp.com/us-en/ontap-sanhost/nvme_rhel_95.html&lt;/A&gt;, ONTAP supports NVMe multipath with the round-robin policy. Is it safe to use? Can data corruption occur when the host detects an error on one path and resubmits I/Os on another path?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here is what I found while reading the Linux kernel code. When the NVMe host driver detects a command timeout (for either an admin or an I/O command), it triggers error recovery and attempts to resubmit the I/Os on another path.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Taking the nvme_tcp driver as an example, the nvme_tcp_timeout function is called when any command times out:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;nvme_tcp_timeout&lt;/P&gt;&lt;P&gt;&amp;nbsp; -&amp;gt; nvme_tcp_error_recovery&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; -&amp;gt; nvme_tcp_error_recovery_work&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; -&amp;gt; nvme_tcp_teardown_io_queues&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; -&amp;gt; nvme_cancel_tagset&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;nvme_cancel_tagset completes the in-flight requests on the failed path, and nvme_failover_req is then called to resubmit them on a different path. The host does not wait before resubmitting the I/O, so the controller on the old path may not have finished cleaning up its pending requests, which could lead to data corruption on the NVMe namespace.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For example, consider the following scenario:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1. The host sends IO1 on path1, then hits a timeout on path1, either for IO1 itself or for an earlier command (e.g., a keep-alive or another I/O). This triggers error recovery, and IO1 is retried on path2, where it succeeds.&lt;/P&gt;&lt;P&gt;2. The host then sends IO2 to the same LBA on path2, and it also succeeds.&lt;/P&gt;&lt;P&gt;3. Meanwhile, IO1 on path1 has not been aborted and continues to execute.&lt;/P&gt;&lt;P&gt;Ultimately, the stale IO1 from path1 overwrites the data written by IO2, potentially corrupting data.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I noticed that section 9.6, "Communication Loss Handling," of the NVMe Base Specification 2.1 describes this scenario well. It introduces Command Quiesce Time (CQT), which gives the controller a cleanup period for outstanding commands. Implementing CQT could potentially resolve this issue.&lt;/P&gt;</description>
    <pubDate>Tue, 04 Feb 2025 13:38:53 GMT</pubDate>
    <dc:creator>JavenKe</dc:creator>
    <dc:date>2025-02-04T13:38:53Z</dc:date>
    <item>
      <title>Possible data corruption when NVMe host is using multipath?</title>
      <link>https://community.netapp.com/t5/ONTAP-Discussions/Possible-data-corruption-when-NVMe-host-is-using-multipath/m-p/458240#M44613</link>
      <description>&lt;P&gt;Hi experts,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As per&amp;nbsp;&lt;A href="https://docs.netapp.com/us-en/ontap-sanhost/nvme_rhel_95.html#validate-software-versions," target="_blank"&gt;https://docs.netapp.com/us-en/ontap-sanhost/nvme_rhel_95.html&lt;/A&gt;, ONTAP supports NVMe multipath with the round-robin policy. Is it safe to use? Can data corruption occur when the host detects an error on one path and resubmits I/Os on another path?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here is what I found while reading the Linux kernel code. When the NVMe host driver detects a command timeout (for either an admin or an I/O command), it triggers error recovery and attempts to resubmit the I/Os on another path.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Taking the nvme_tcp driver as an example, the nvme_tcp_timeout function is called when any command times out:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;nvme_tcp_timeout&lt;/P&gt;&lt;P&gt;&amp;nbsp; -&amp;gt; nvme_tcp_error_recovery&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; -&amp;gt; nvme_tcp_error_recovery_work&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; -&amp;gt; nvme_tcp_teardown_io_queues&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; -&amp;gt; nvme_cancel_tagset&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;nvme_cancel_tagset completes the in-flight requests on the failed path, and nvme_failover_req is then called to resubmit them on a different path. The host does not wait before resubmitting the I/O, so the controller on the old path may not have finished cleaning up its pending requests, which could lead to data corruption on the NVMe namespace.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For example, consider the following scenario:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1. The host sends IO1 on path1, then hits a timeout on path1, either for IO1 itself or for an earlier command (e.g., a keep-alive or another I/O). This triggers error recovery, and IO1 is retried on path2, where it succeeds.&lt;/P&gt;&lt;P&gt;2. The host then sends IO2 to the same LBA on path2, and it also succeeds.&lt;/P&gt;&lt;P&gt;3. Meanwhile, IO1 on path1 has not been aborted and continues to execute.&lt;/P&gt;&lt;P&gt;Ultimately, the stale IO1 from path1 overwrites the data written by IO2, potentially corrupting data.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I noticed that section 9.6, "Communication Loss Handling," of the NVMe Base Specification 2.1 describes this scenario well. It introduces Command Quiesce Time (CQT), which gives the controller a cleanup period for outstanding commands. Implementing CQT could potentially resolve this issue.&lt;/P&gt;</description>
      <pubDate>Tue, 04 Feb 2025 13:38:53 GMT</pubDate>
      <guid>https://community.netapp.com/t5/ONTAP-Discussions/Possible-data-corruption-when-NVMe-host-is-using-multipath/m-p/458240#M44613</guid>
      <dc:creator>JavenKe</dc:creator>
      <dc:date>2025-02-04T13:38:53Z</dc:date>
    </item>
  </channel>
</rss>

