Solved: REST API returns links to nonexistent jobs

mbuchanan · 2022-04-19

We have an application built on the ONTAP REST API that (among other things) snapshots a parent volume and creates a FlexClone volume from the snapshot. As you would expect, the API responds with an HTTP 202 code and a link to a job (/api/cluster/jobs/{uuid}) for each POST request to create a snapshot or volume. Our client polls the job until it completes.
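A minimal sketch of that polling loop, for readers who want something concrete. It is not our production code. The `fetch` callable is a hypothetical stand-in for whatever HTTP client you use and is assumed to return an `(http_status, json_body)` pair. The set of terminal states is also an assumption (only "success" is confirmed by the job output in this thread), as are the interval and timeout defaults. Note that it treats any non-200 response as an error.

```python
import time
from typing import Any, Callable, Dict, Tuple

# Hypothetical transport: takes a job href, returns (HTTP status, decoded JSON body).
Fetch = Callable[[str], Tuple[int, Dict[str, Any]]]

# Assumption: a job is finished once it reaches one of these states.
# Check the job states documented for your ONTAP release.
TERMINAL_STATES = {"success", "failure"}


def wait_for_job(
    fetch: Fetch,
    job_href: str,
    interval: float = 1.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll job_href until the job reaches a terminal state, then return its body."""
    deadline = time.monotonic() + timeout
    while True:
        status, body = fetch(job_href)
        if status != 200:
            # Any non-200 response is treated as fatal in this sketch.
            raise RuntimeError(f"GET {job_href} returned HTTP {status}: {body}")
        if body.get("state") in TERMINAL_STATES:
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_href} did not finish within {timeout}s")
        sleep(interval)
```

The `sleep` parameter exists only so the loop can be exercised without real delays.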

Our problem is that occasionally the API responds to a GET on /api/cluster/jobs/{uuid} with an HTTP error 404 and an "entry doesn't exist" message (with target set to "uuid"). Obviously the job link/UUID was provided by the API in response to a prior asynchronous request. This appears to be a race condition, because a subsequent GET on the job link succeeds.

This is the specific JSON returned when the job does not exist yet:

{
  "error": {
    "message": "entry doesn't exist",
    "code": "4",
    "target": "uuid"
  }
}

In the latest failure, a POST to create a FlexClone returned HTTP 202 and /api/cluster/jobs/97385175-bc14-11ec-a7a2-00a098ecaa08. Our client immediately attempted to GET the job link, and the server returned HTTP 404. While troubleshooting the next day, I fetched the job link manually:

{
  "uuid": "97385175-bc14-11ec-a7a2-00a098ecaa08",
  "description": "POST /api/storage/volumes/9737fdf3-bc14-11ec-a7a2-00a098ecaa08",
  "state": "success",
  "message": "success",
  "code": 0,
  "start_time": "2022-04-14T12:01:56-05:00",
  "end_time": "2022-04-14T12:02:00-05:00",
  "svm": {
    "name": "[redacted]",
    "uuid": "cc908475-3718-11e6-9a24-00a0984ab4b6",
    "_links": {
      "self": {
        "href": "/api/svm/svms/cc908475-3718-11e6-9a24-00a0984ab4b6"
      }
    }
  },
  "_links": {
    "self": {
      "href": "/api/cluster/jobs/97385175-bc14-11ec-a7a2-00a098ecaa08"
    }
  }
}

The GET on the job link failed at 12:01:56, and the job's start_time is 12:01:56, so my guess is that there is a race between responding to the POST and spawning the background job. Is this a bug, or is the REST API designed to return a reference to a job that may not exist yet?

RobertBlackhart · 2022-04-20

This is a bug. The likely internal cause is that although you are (probably) sending requests to the cluster management address, each request can be distributed round-robin to any eligible node in the cluster. The node that handles the POST creates the job record (including its UUID) and returns the response. The subsequent GET may then be handled by a different node where the job record doesn't exist quite yet, hence the 404.

This bug has been seen internally in testing, but does not yet have a fix. It tends to be somewhat uncommon and has not yet been prioritized for a fix. The workaround is indeed to retry the GET on the job after a very modest wait time.

If this is affecting your scripts (and I think it's a valid complaint), you can open a ticket with support and refer to bug #1424779, which is already filed for this same issue. That could help raise the priority of getting it fixed.

mbuchanan · 2022-04-20

Thank you, Robert. I appreciate all of this information. I did open a ticket, but support referred me to the forums here since it is an API problem. I will update it with the bug ID you provided.

I migrated this code from ZAPI to REST about 35 days ago, and I'd estimate this script has run about 800 times since then. I have had three of these failures, two when creating snapshots and one creating a clone volume. Earlier this week I added code to try the GET up to 3 times and emit a warning message if the server returns a 404 on the job link. It's good to have a bug ID to pin to that workaround.
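For anyone who lands here with the same problem, the workaround looks roughly like the sketch below. It is an illustration, not the actual script. The `fetch` callable is a hypothetical wrapper around your HTTP client that returns an `(http_status, json_body)` pair, and the logger name and the half-second delay are assumptions (the only guidance in this thread is "a very modest wait time").

```python
import logging
import time
from typing import Any, Callable, Dict, Tuple

log = logging.getLogger("ontap_jobs")  # hypothetical logger name

# Hypothetical transport: takes a job href, returns (HTTP status, decoded JSON body).
Fetch = Callable[[str], Tuple[int, Dict[str, Any]]]


def get_job_with_retry(
    fetch: Fetch,
    job_href: str,
    attempts: int = 3,
    delay: float = 0.5,  # assumed value for a "modest" wait
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Dict[str, Any]]:
    """GET a job link, retrying only when the server answers 404."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        status, body = fetch(job_href)
        if status != 404:
            # Success or some other error: hand it straight back to the caller.
            return status, body
        log.warning(
            "job link %s returned 404 (attempt %d of %d)", job_href, attempt, attempts
        )
        if attempt < attempts:
            sleep(delay)
    # Every attempt returned 404; return the last response unchanged.
    return status, body
```

Only a 404 is retried, so real failures still surface immediately, and a 404 that persists through every attempt is returned to the caller to handle as a genuine error.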

mbuchanan · 2022-04-21

We hit this bug again last night, but fortunately the workaround caught it. That was the fourth time in 16 days, so this bug is turning out to be a common occurrence with this particular script and this cluster.

RobertBlackhart · 2022-04-21

Thanks for the data point about how often you see this issue. I've noted that in the bug report and will try to push to get this fixed sooner rather than later.