Large file uploads to S3 with progress indicators - Part 1

December 12, 2020

I won’t judge you based on your background. One way or another, like me, you were at some point required to support large file uploads. Videos, 3D model files, 3D videos, Ableton Live Projects, VR, 3D Ableton Live Projects, VR games where you use Ableton Live — and the list goes on. You’ve been through the initial line of questioning: “Are you sure you want to support that?”, “Have you tried gzipping it?”, “What about using Dropbox? I hear it might do what you want, and I can get you some free space if you use my invite link.” We both know that if any of these were suitable options, neither of us would be here right now.

File uploads are such a common feature that this must be a fully solved problem, right? You wouldn’t know it if you spent an hour googling this. There is good advice out there on supporting large file uploads, but it all suffers from at least one of these problems:

  • Focuses on “large” files up to ten megabytes in size
  • Assumes you’re fine jacking up your server’s maximum request size to startling heights
  • Works fine, but you have no way of showing UI progress indicators, so your users cannot distinguish between the app crashing and just beginning a fortnight-long process. They could be patiently waiting for decades for an upload that will never complete in their lifetime. Your code will literally kill people. (This one hits close to home, as I still have a browser tab open waiting for an upload of Lost Boys that I started years ago)
  • Assumes you’re fine using your web server as a temporary cache and storing the whole file in memory or on disk
  • Shames you for not building an entirely new service with a background worker your client polls to receive progress updates (this is actually a valid solution, but I know you don’t have time for that right now)

Uggggh, is it too much to ask for all of the upsides, none of the downsides, faster performance, and a straightforward implementation? That’s what we’ll aim for in this post.

The goal

On the backend, we’ll store the files themselves in S3. Our web server won’t ever touch them on the way there, eliminating the need to adjust our max request size or store temporary files. Instead, it will generate pre-signed URLs to perform a multipart upload to S3. It will function as the gatekeeper to S3.

On the frontend, we’ll write a React Hook that handles splitting a file into chunks, sending each one to S3 using these URLs, and dynamically updating the progress for a progress bar component to consume.

Bypassing your server with pre-signed URLs

The first big win to be had is to reduce the amount of work asked of your web server.

Doing so eliminates a host of potential problems. We want to end up with a file stored in S3 that our application knows about and can retrieve. You can achieve this without sending a massive file to your server. Instead, I’d recommend going straight to S3 from the browser.

To do that, we’ll enlist the help of pre-signed URLs. Your server will generate short-lived URLs that grant anyone the ability to perform sanctioned actions in S3.
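
Just to illustrate the idea, here’s what generating one of these short-lived URLs looks like with boto3, in this case for downloading an object (bucket and key are placeholders):

import boto3

# Anyone holding this URL can GET the object until it expires.
url = boto3.client('s3').generate_presigned_url(
    'get_object',
    Params={'Bucket': 'your bucket', 'Key': 'some object key'},
    ExpiresIn=3600,  # seconds the URL stays valid
)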

PUT and POST URLs

There are two methods to get data into S3 that we care about here. A pre-signed POST will upload the whole thing in one shot. If your files only get up to about 15 MB in size, it’s probably best to stick with this.
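
For that simple case, boto3 can generate the pre-signed POST for you; the browser then submits a form with the file directly to S3. A minimal sketch, with the bucket, key, and expiry as placeholders:

import boto3

# Returns a dict with the S3 'url' to POST to and the form 'fields'
# the browser must include alongside the file.
post = boto3.client('s3').generate_presigned_post(
    Bucket='your bucket',
    Key='object key to associate this upload with',
    ExpiresIn=3600,  # seconds the URL stays valid
)
print(post['url'])
print(post['fields'])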

To deal with more dangerous game, we’re going to break the file up into chunks, send each one to S3 with its own pre-signed PUT URL, then have S3 reassemble them into the full file.

The plan

Here’s what the process looks like:

  • Chop the file up into N pieces
  • Tell the server we need to upload a file in N chunks
  • Server tells S3 to initiate a Multipart Upload and receives an UploadId
  • Server uses this important UploadId to generate N pre-signed PUT URLs, each of which can upload one piece of the file
  • Server passes this information back to the client
  • Client sends N separate requests to S3, each using one of the pre-signed PUT URLs with a piece of the file
  • After every request resolves, our client tells the server that everything’s done and includes some information needed to piece everything back together
  • Server sends a final POST to S3 with the data from above, completing the upload

We’ll start with the server and then explore the frontend in the next post. Since this is an involved flow, I won’t bore you in this post with all of the code details, save for some pitfalls that aren’t well documented. More involved code could come in a later post.

Initiating an upload

We first need to tell S3 that we intend to begin a multipart upload. The result is an UploadId string that we’ll use throughout the process. If you’re using something like the boto3 S3 client in Python, it could look as simple as this:

import boto3

response = boto3.client('s3').create_multipart_upload(
    Bucket='your bucket',
    Key='object key to associate this upload with',
)
print(response['UploadId'])

Generating multipart upload URLs

To generate the multipart upload URLs, you’ll need to provide the UploadId from above and a PartNumber parameter specifying each piece’s index. To do this, we need to know how many pieces in total we plan to upload. That information will come from the client, which we’ll get to later.

Creating these URLs might look like this:

presigned_url_responses = [
    boto3.client('s3').generate_presigned_url(
        'upload_part',
        Params={
            'Bucket': 'your bucket',
            'Key': 'object key to associate this upload with',
            'PartNumber': i,
            'UploadId': 'upload id from above',
        },
    )
    # PartNumber is 1-indexed, so we need parts 1 through number_of_chunks
    for i in range(1, number_of_chunks + 1)
]

The server will need to send all of this information back to the client. At this point, the client can begin processing the actual file uploads.
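
The exact shape of that response is up to you. As one hypothetical layout (the field names here are just an illustration), the server might reply with the UploadId and the ordered list of pre-signed part URLs:

{
    "upload_id": "upload id returned by S3",
    "part_urls": [
        "https://your-bucket.s3.amazonaws.com/your-key?partNumber=1&...",
        "https://your-bucket.s3.amazonaws.com/your-key?partNumber=2&..."
    ]
}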

Completing the upload

We’re still not done. These chunks will sit in the ether of S3 until we explicitly tell S3 to merge everything into the final file. The aptly named CompleteMultipartUpload S3 method accomplishes this last piece.

This somewhat needy method requires both the UploadId and an additional list of checksums for each upload part. We’ll talk about where this list comes from in the next post. For now, just know that it will look like this when represented as JSON:

[
    {
        "ETag": "d8c2eafd90c266e19ab9dcacc479f8af",
        "PartNumber": 1
    },
    {
        "ETag": "d8ceeifd90c255e19ab9dcacc479fiaf",
        "PartNumber": 2
    }
]

(The ETag is a checksum of the contents uploaded in each part.)

With this data, we can finally complete the process:

# parts is the list of dictionaries with ETag and PartNumber shown above
response = boto3.client('s3').complete_multipart_upload(
    Bucket='your bucket',
    Key='object key to associate this upload with',
    MultipartUpload={'Parts': parts},
    UploadId='upload id from earlier',
)

API endpoints

These methods cover the basics required to perform the multipart upload. You’ll need to expose them to your client. One trivial example setup would look like this:

  • POST /api-url/create-s3-upload/

    • Receives file name, desired s3 key, and number of chunks
    • Creates multipart upload, generates and responds with all required pre-signed URLs
  • POST /api-url/complete-s3-upload/

    • Receives upload id, desired s3 key, and parts JSON data (see above)
    • Completes the upload
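
If it helps to see those two endpoints in one place, here’s a minimal sketch using Flask. The framework choice, bucket name, and request/response field names are all assumptions to adapt to your own stack:

import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)
s3 = boto3.client('s3')
BUCKET = 'your bucket'  # assumption: configure for your environment


@app.route('/api-url/create-s3-upload/', methods=['POST'])
def create_s3_upload():
    body = request.get_json()
    key = body['key']
    number_of_chunks = body['number_of_chunks']

    # Start the multipart upload and hold on to the UploadId.
    upload = s3.create_multipart_upload(Bucket=BUCKET, Key=key)
    upload_id = upload['UploadId']

    # One pre-signed PUT URL per part, PartNumber starting at 1.
    part_urls = [
        s3.generate_presigned_url(
            'upload_part',
            Params={
                'Bucket': BUCKET,
                'Key': key,
                'PartNumber': part_number,
                'UploadId': upload_id,
            },
        )
        for part_number in range(1, number_of_chunks + 1)
    ]

    return jsonify({'upload_id': upload_id, 'part_urls': part_urls})


@app.route('/api-url/complete-s3-upload/', methods=['POST'])
def complete_s3_upload():
    body = request.get_json()

    # 'parts' is the list of {"ETag": ..., "PartNumber": ...} dicts shown earlier.
    s3.complete_multipart_upload(
        Bucket=BUCKET,
        Key=body['key'],
        UploadId=body['upload_id'],
        MultipartUpload={'Parts': body['parts']},
    )
    return jsonify({'status': 'complete'})

The client calls the first endpoint before touching S3 and the second only after every chunk has been uploaded.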

Next

So far, we’ve managed to check all of the boxes we originally wanted:

  • No files are stored on the server
  • We can handle any file size that S3 can
  • No need to drastically change our server’s max request size
  • Not too much shame involved
  • Uploads happen in chunks managed by the client, allowing us to update progress easily

This approach comes with plenty of other benefits. If you want to get fancy, you could implement retry logic. Then, if our connection drops before the process finishes, our app can resume later and re-send only the chunks that didn’t go through.
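
One server-side building block for that kind of resume logic is S3’s ListParts call, which reports the parts that have already landed for an in-progress multipart upload. A small sketch, with the bucket, key, and upload id as placeholders:

import boto3

# Parts that already made it to S3 for this UploadId; a resuming client
# would only need to re-send the part numbers missing from this set.
response = boto3.client('s3').list_parts(
    Bucket='your bucket',
    Key='object key to associate this upload with',
    UploadId='upload id from earlier',
)
uploaded_part_numbers = {part['PartNumber'] for part in response.get('Parts', [])}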

We’ve gotten the dry stuff out of the way. The next post will explore how to encapsulate the client responsibilities into a React Hook that manages file upload progress for your UI components.


© 2020, Castles of Code