Request restarts

Or: how to efficiently cache 403 responses

AWS S3 has a peculiar design choice that makes caching "not found" responses slightly trickier in our case. By default, Fastly caches all 404s, which is great for our use case. Unfortunately, when an object is not found in our S3 bucket, AWS returns a 403 rather than a 404.

This discrepancy in response code comes down to the fact that our S3 bucket allows public GET requests for objects but does not allow public List Object calls. Listing objects in our bucket would be extremely expensive (a flat hierarchy with very few subdirectories and millions of files). When a bucket is configured this way, with object listings restricted, S3 returns a 403 for a missing object instead of a 404. Fastly does not cache 403s by default, so every request for the same nonexistent object goes straight to the origin.

We can fix this by telling Varnish that 403 responses should be cached. This behaves as you would expect and stops the origin requests, at least until the TTL expires. But that alone is not enough, because Nix expects a 404.

We can fix this entirely in the Fastly configuration, with some careful VCL programming and without any S3 reconfiguration. In short, Varnish has a feature called restarting a request, which does what it sounds like: when asked, it "reboots" the above state diagram and starts request processing again at vcl_recv. You can use this to do things like follow redirects transparently at the edge (get a 302, restart, retry the request at the new URL, return the result) without redirecting the client.
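To make the restart mechanism concrete before applying it to 403s, here is a minimal sketch of the redirect-following case mentioned above. It is an illustration only, not part of our configuration: the header name `X-Redirect-Location` is hypothetical, and the sketch assumes the origin sends same-host, path-only `Location` values and that following a single hop is enough.

```vcl
sub vcl_recv {
  if (req.restarts == 0) {
    # Never trust a client-supplied copy of our internal header
    # (X-Redirect-Location is a hypothetical name).
    unset req.http.X-Redirect-Location;
  } else if (req.http.X-Redirect-Location) {
    # Second pass: retry the request at the redirect target.
    set req.url = req.http.X-Redirect-Location;
    unset req.http.X-Redirect-Location;
  }
}

sub vcl_deliver {
  # Assumption: only path-only redirects ("/foo", not "//host/foo"
  # or absolute URLs) are followed, and only once.
  if (req.restarts == 0
      && (resp.status == 301 || resp.status == 302)
      && resp.http.Location ~ "^/[^/]") {
    set req.http.X-Redirect-Location = resp.http.Location;
    restart;
  }
}
```

The shape is the same one used for the 403 handling in this section: inspect the response late (in vcl_deliver), record a decision on the request, restart, and let vcl_recv act on that decision.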

In our case, we map 403s to 404s using restarts. First off, we assume that all responses, including 403s and 2xxs, are cached in a POP. In vcl_deliver, which runs only after a response has been cached, we check the status code. If we see a 403, we set a flag on the request and restart it. When vcl_recv sees this flag on a request that has been restarted at least once, it short-circuits and returns a synthetic, out-of-thin-air 404 to the client instead.

Again, note that this logic applies in all cases, whether the 403 is "fresh" or cached. If a client asks for an object and the origin returns a 403: set flag, restart, serve 404 instead. If a client asks for an object whose 403 is already cached, the cached object is found, but the same logic runs: set flag, restart, serve 404 instead. We achieve this by delaying the check for the 403 until as late as possible before delivering a response to the client.

This careful restart ensures that the original 403 from the origin is acknowledged and cached in the POP, so it never needs to be re-fetched (until expiry). At the same time, we serve our own "synthetic" 404 to the client. Like any cached object, the 403 lives in the POP, but its only purpose is to soak up requests that would otherwise reach the origin until it expires. It never has to be served to a user.

<aside> 💡 Brainstorming: Will making the S3 bucket private change how 404s/403s are returned from S3? It probably depends on the role configuration used to access the bucket. If we keep denying object listings, it will probably be the same, regardless of credentials. But it might be best to assume the worst — and make the VCL more robust and handle both 404s and 403s from S3 consistently. It is possible this is mostly the case already, but this should be carefully reviewed in the configuration before adding any authentication layer.

</aside>
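If the review suggested in the note above concludes that S3 may return either status code, one possible way to handle both consistently is sketched here. This is an assumption about how the hardening could look, not reviewed configuration. It changes only the vcl_deliver check from the example in this section and relies on that example's vcl_recv and vcl_error staying as they are; the header stripping from the example is omitted for brevity.

```vcl
sub vcl_deliver {
  # Sketch: treat a 404 from S3 the same way as a 403, so the
  # client always receives the same synthetic 404 body regardless
  # of which status code the origin chose.
  if ((resp.status == 403 || resp.status == 404) && req.restarts == 0) {
    set req.http.Cache-Object-Not-Found = "true";
    restart;
  }
}
```

Because the synthetic 404 is only delivered after a restart, the `req.restarts == 0` guard still prevents it from triggering a second restart.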

This strategy can be summarized completely in a Fiddle. The following example demonstrates the whole approach, and you can run it directly to see how a 403 from Amazon gets turned into a 404 for the user.

https://fiddle.fastlydemo.net/fiddle/ba7d1d20

sub vcl_recv {
  if (req.http.Cache-Object-Not-Found && req.restarts > 0) {
    # Set in vcl_deliver, after the 403 response got cached
    error 404;
  }
}

sub vcl_fetch {
  if (beresp.status == 403) {
    # 404s are cached by default, but not AWS 403s, naturally...
    set beresp.cacheable = true;
  }
}

sub vcl_deliver {
  unset resp.http.server;
  unset resp.http.x-amz-request-id;
  unset resp.http.x-amz-id-2;

  if (resp.status == 403 && req.restarts == 0) {
    # By this point, the 403 response from AWS is
    # cached in the POP. To return a synthetic 404,
    # however, we need to restart and deliver it from
    # vcl_recv via 'error'. Note that the error path
    # (vcl_recv -> vcl_error) comes back through
    # vcl_deliver, so only restart when req.restarts == 0.
    set req.http.Cache-Object-Not-Found = "true";
    restart;
  }
}

sub vcl_error {
  if (obj.status == 404) {
    # deliver a tiny 404 response to users, instead
    set obj.response = "Cache Object Not Found";
    synthetic {"404"};
    return(deliver);
  }
}

We can see the request gets to the DELIVER phase after the 403 is returned from S3, before restarting:

Finally, after the restart, we can see the new code path simply jumps directly to the error phase, and immediately delivers a 404: