S3 as a Storage Back-End
With their Simple Storage System (S3), Amazon Web Services has built one of the major providers of cloud storage for applications ranging from small side projects to enterprise systems. Since the introduction of flexible storage back-ends for the official tusd server, an integration with S3 has been a much desired feature by our users. We are happy to announce that we are now able to deliver on this request. During the time it took to create this, we had to deal with various peculiarities of Amazon's service and were able to gain a lot of experience. In this post, we want to focus on the downsides of building a tus server on top of S3 and share some of our recently acquired knowledge with you.
We, as the designers of tus, have to admit that the protocol uses a data model which is mostly incompatible with AWS S3. In order to understand this sentence, we need to make a small comparison: In tus, when you want to move a file to a remote location, you first create a new upload resource without pushing any of the file's data to the server. It is even possible to make this operation before you know the length or size of the entire object that you want to transfer. After this step, you are free to upload the data in chunks of any size. The first chunk could be a few MBs, followed by one that is just 100 bytes and a final upload then contains the remaining GB. While this freedom introduces the need for a flexible server implementation, which is capable of handling chunks of any size, it also lays the foundation for tus' core feature: resumability of an upload at any given time.
S3, however, does not offer this flexibility: once an object - the length of which must also be known beforehand – has been uploaded to a specific location, you are unable to modify its content without transmitting the entire new file. It is simply not possible to add a chunk to an existing object without having to perform additional operations. It may sound, then, as if the main requirement of the tus protocol is not met by Amazon's service, but that is not the case. You are certainly able to build a proper server implementation for tus, as long as you are willing to accept certain restrictions. This can, for instance, be seen in the S3 storage back-end for the tusd server.
Amazon has been aware of this limitation and therefore supports an alternative approach called Multipart Uploads:
Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object.
This approach is very similar to tus' data model described above and it provides a solid foundation to build an implementation upon. However, development would not be called development if it were as easy as mapping a tus upload one-to-one to a multipart upload. The issue is that Amazon sets certain restrictions, the most notable of which is that the minimum size of a single part is limited to 5MB. The only exception to this rule is the last part, which can be smaller. It should be mentioned here that S3 will not complain when you upload a part that is smaller than 5MB, but only when you attempt to finish the multipart upload that does the actual assembly (it will then present you with the
EntityTooSmall error message).
The solution - if you want to call it one - is to only upload parts to S3 that match or exceed the minimum size. The storage back-end for tusd achieves this by writing the body of an incoming
PATCH request to a temporary file. Once the upload from the user to our tus server reaches a size of 5MB, we are sure that we have enough data for a single part on S3 and can start moving this chunk to Amazon's service. If the tus server does not receive enough data - ensuring, of course, that it is not the last part, which is allowed to be smaller - it will simply drop the temporarily stored file and require the user to attempt a resume, in the hope that the connection is then more reliable. A look at the code that powers the implementation described above, may help to understand this.
Regrettably, this approach comes with one noticeable downside for the end user: if an upload or resume is interrupted before at least 5MB has reached the tus server, the sent data will be lost and must be retransmitted. Some may ask why we don't simply locally store the received chunk of data on the tus server, wait for the user to resume the upload and then, once we have enough data, push it to S3. This is certainly a good question, but that solution only works when you can ensure that the resumed request reaches the same tus server as the previously interrupted request. If you are running more than a single tus instance, a special routing mechanism may be required to achieve this. Another option would be to use a second storage medium, such as a shared volume, but that would also need to handle concurrent access correctly.
If this workaround is not acceptable for your application because you do not want to limit the chunks to 5MB, you may want to reconsider using AWS S3 as a storage back-end, since it simply does not offer the required functionality. However, if you are using an alternative back-end that just exposes an S3-compatible API, it may offer a configuration option to change the minimum size of a single part. Riak CS (Cloud Storage), for example, accepts the
enforce_multipart_part_size flag, which can entirely remove this constraint.
S3's eventual consistency model
Amazon's engineers wanted to provide a highly available service and were therefore unable to offer guaranteed consistency for every operation. They nevertheless do not hide this important property of S3 and instead describe it extensively in their documentation. The most interesting sentence for us, the implementers of tus servers, is the following one:
Amazon S3 does not currently support object locking. If two PUT requests are simultaneously made to the same key, the request with the latest time stamp wins. If this is an issue, you will need to build an object-locking mechanism into your application.
Locking uploads is an important mechanism to prevent data corruption and tus is not immune to it. Imagine a situation where two clients attempt to resume the same upload at the same offset. If the server simply accepts both requests, the latter one may override the data from the first request, resulting in file corruption or loss. In order to prevent this issue, the server needs to acquire an exclusive lock, e.g. a simple semaphore, on the upload resource before it starts transferring the data and then only release that lock once the data is saved. In this scenario, the server will reject the second request from the client, because a lock cannot be obtained when one is already held.
Implementing a proper locking mechanism is, however, difficult and gets even more complicated if you are working in an environment with multiple distributed servers. In this case, a service should be used that manages distributed locks while at the same time guaranteeing consistency. For example, proven technologies include ZooKeeper or Consul, but not AWS S3 as it does not offer absolute consistency. Since they do promise "read-after-write consistency for PUTS of new objects in your S3 bucket [but only] eventual consistency for overwrite PUTS and DELETES", this cannot be used to build a distributed lock upon. Therefore, you are recommended to use a third-party system for doing so.
Another option for preventing concurrent uploading is to put the responsibility on the client's side by saying it is their task to prevent multiple accesses to the same upload resource. While this may work, this approach is not able to guarantee corruption-free uploads since a client still might send two or even more requests at the same time by accident and the server does not prevent that.
With S3, engineers have an incredibly useful tool for storing data with high availability and scalability. However, it does not present the perfect storage back-end for the tus protocol and requires some workarounds. In the future, we will have a look at other storage system and cloud providers.