Worst Practice

How to get the last modified file from an S3 Bucket?

Posted on April  6, 2021 @ 16:15

Posted under the Backend category

Level: beginner

Posted with the following tags: #PHP, #PHP 7.4, #AWS, #S3 Bucket

Recently I had to get my hands dirty with AWS S3, and I faced the problem of getting the latest (newest) file from a bucket with PHP.

TL;DR

If you don’t want to waste your time reading this tutorial, and you only need a working code sample, please check the source code on GitHub.

Requirement

First, let's be clear: if you need this regularly, it is probably better to maintain an RDS table where you can run such queries easily and in a cost-, CPU- and time-effective way.

This method is the opposite. It gets the full list from an S3 bucket, then sorts and filters it on the local backend. Far from optimal.

I assume those looking for this code snippet already have some kind of access to Amazon S3, as well as the keys and credentials the AWS SDK S3 Client needs to reach it from their application.

The Adapter

I’m a PHP developer, so I will show how to do this in PHP. First I define some constants that we will need later:

<?php

declare(strict_types=1);

namespace WorstPractice\Component\Aws\S3;

class Adapter
{
    public const AWS_DEFAULT_LIST_LIMIT = 1000;

    public const OBJECT_SORT_BY_NAME = '^Key';
    public const OBJECT_SORT_BY_NAME_DESC = 'vKey';
    public const OBJECT_SORT_BY_DATE = '^LastModified';
    public const OBJECT_SORT_BY_DATE_DESC = 'vLastModified';
}

AWS_DEFAULT_LIST_LIMIT is the default value of the MaxKeys limiter of the list request. If there are more objects in the given S3 bucket, the list is returned in chunks. I believe the developers at AWS know why this value is the best, so I didn't change it. If I make it smaller, I have to request more chunks, and making it bigger does not help because S3 returns at most 1,000 keys per request. So the default limit is just fine.

Then I defined four constants to control the sorting. By default the objects are returned in ascending order of their key names, and currently there's no official AWS way to change this order. So we have to do it locally.

I don't like to add overly complex logic to determine the key and the sort direction, so I combined the two using some semi-visual markers. Before the key name, I use either ^ or v to tell whether it's an ascending (^ or "up") or a descending (v or "down") order.
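
To make the marker idea tangible, here is a minimal sketch of how such a value can be split back into a direction and a field name. The function name splitSortMarker is my own invention for this illustration and is not part of the adapter; it also assumes the value is never an empty string.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: splits a marker such as '^Key' or 'vLastModified'
 * into a sort direction and the name of the field to sort by.
 *
 * @return array{0: string, 1: string} [direction, field]
 */
function splitSortMarker(string $sortBy): array
{
    // The first character is the direction marker, the rest is the field name.
    $direction = $sortBy[0] === '^' ? 'asc' : 'desc';
    $field = substr($sortBy, 1);

    return [$direction, $field];
}

[$direction, $field] = splitSortMarker('vLastModified');
echo $direction . ' ' . $field . PHP_EOL; // prints "desc LastModified"
```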

The constructor

To instantiate the adapter, we need to pass the AWS S3 Client object, so it can communicate with AWS when needed.

//...

use Aws\S3\S3Client;

class Adapter
{
    // ...
    
    public function __construct(private S3Client $s3Client)
    {
    }
}

I love the new features of PHP 8. Constructor property promotion, for example, simplifies the code a lot.

Specify the bucket

To use the different S3 Client actions, in most cases we need to specify the bucket we want to work with. To avoid unnecessary parameters, and assuming that one wants to work on the same bucket, we decouple this setting from the constructor into a separate public method:

// ...

class Adapter
{
    // ...
    
    private string $bucket;
    
    // ...
    
    public function setBucket(string $bucket): void
    {
        $this->bucket = $bucket;
    }
}
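
One consequence of moving the bucket into a setter is worth a small sketch: a typed property without a default value cannot be read before it is set. The BucketHolder class and its getBucket method below are made up for this demonstration only and are not part of the adapter.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for the adapter, reduced to the bucket property.
final class BucketHolder
{
    private string $bucket;

    public function setBucket(string $bucket): void
    {
        $this->bucket = $bucket;
    }

    public function getBucket(): string
    {
        return $this->bucket;
    }
}

$holder = new BucketHolder();

try {
    // Reading the typed property before setBucket() was called throws an Error.
    $holder->getBucket();
    $failedBeforeSet = false;
} catch (Error $error) {
    $failedBeforeSet = true;
}

$holder->setBucket('my-example-bucket');
echo $holder->getBucket() . PHP_EOL; // prints "my-example-bucket"
```

So with this design, setBucket has to be called before any method that uses the bucket.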

Get the bucket’s object list

In an S3 bucket we don't talk about files, we talk about objects. An object holds various metadata like ID, key, date of modification, file size, etc. The listing itself is only ordered by key, which is why sorting by anything else is so difficult. If you need such a sort frequently, I again recommend creating a table in a relational database and solving it there.

To get the last modified file from an S3 Bucket we need to do four things:

  1. set up the search options and additionally change the sort and limit arguments
  2. get the full bucket object list (filtered by prefix)
  3. apply the sorting on the full list (sort by “date modified” in descending order)
  4. get the first element and return the object’s key that we need to get the file.

It looks like this:

// ...

class Adapter
{
    // ...
    
    public function getObjectListByPrefix(string $keyPrefix, ?string $sortBy = null, int $limit = 0): array
    {
        $options = $this->getSearchOptions($keyPrefix, $sortBy, $limit);

        $results = $this->fetchFullFileList($options);
        // Avoid sort if not needed.
        if ($sortBy !== self::OBJECT_SORT_BY_NAME) {
            $this->sortFileList($results, $sortBy);
        }
        // Avoid limit if not needed.
        if ($limit > 0) {
            $this->limitFileList($results, $limit);
        }

        return $results;
    }
}

First we set up the basic options array for the request. If we use the default sort-by value, we can skip the expensive process of custom sorting on the PHP side. Also, if the limit is equal to zero, we can skip the additional method call.

Now let's see these methods.

The search options

Here we set up the basic options array, and if necessary change the $sortBy and $limit parameters:

  • If the sortBy was not set, set the default one. I could have added a constant for it, but it didn't feel necessary.
  • If the limit is a negative number, we consider it a soft mistake and use its absolute value. I could have used a negative limit to control the direction of the sort, but it would have added unnecessary complexity.

Then we check if the given sort-by parameter is the default AWS S3 sorting. If it is, and a limit is set that is lower than the default AWS list limit, we can pass the limit to AWS as MaxKeys and let it limit the result on its side. It's good for performance.

// ...

class Adapter
{
    // ...
    private function getSearchOptions(string $keyPrefix, ?string &$sortBy, int &$limit): array
    {
        $options = [
            'Bucket' => $this->bucket,
            'EncodingType' => 'url',
            'Prefix' => $keyPrefix,
            'RequestPayer' => 'requester'
        ];

        if (empty($sortBy)) {
            $sortBy = self::OBJECT_SORT_BY_NAME;
        }

        $limit = (int) abs($limit);

        // We can add a query limit here only when we don't want any special sorting.
        // A limit of 0 means "no limit", so it must not be sent as MaxKeys.
        if ($sortBy === self::OBJECT_SORT_BY_NAME && $limit > 0 && $limit < self::AWS_DEFAULT_LIST_LIMIT) {
            $options['MaxKeys'] = $limit;
            // Set the parameter to 0 to avoid the unnecessary array_chunk later.
            $limit = 0;
        }

        return $options;
    }   
}

Note that we added a Prefix index to the options. In an S3 bucket the prefix is something like a path on the filesystem. Generally it can be anything that matches the beginning of the object's key, but when the key contains slashes (/), the S3 console on the AWS website displays them as "folders". This helps a lot when we request a file list under a specific "sub-folder".
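
Since the prefix is a plain "starts with" match on the key, its behaviour is easy to imitate locally. The sketch below only mimics the filter on an in-memory list; the function name filterKeysByPrefix and the sample keys are made up for the illustration.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical local stand-in for the server-side Prefix filter:
 * keeps the keys that start with the given prefix.
 *
 * @param string[] $keys
 * @return string[]
 */
function filterKeysByPrefix(array $keys, string $prefix): array
{
    return array_values(array_filter(
        $keys,
        static fn (string $key): bool => strncmp($key, $prefix, strlen($prefix)) === 0
    ));
}

$keys = [
    'invoices/2021/04/a.pdf',
    'invoices/2021/05/b.pdf',
    'invoices-archive/old.pdf',
    'reports/summary.csv',
];

print_r(filterKeysByPrefix($keys, 'invoices/')); // the two keys under "invoices/"
print_r(filterKeysByPrefix($keys, 'invoices'));  // also matches "invoices-archive/old.pdf"
```

This is why I usually end a "folder" prefix with a slash: without it, sibling "folders" with a similar name are matched too.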

The requester

Here we communicate with AWS through the S3 Client provided by the AWS SDK. In this method we have to rely heavily on the SDK documentation, so we have to believe what is written there:

  • If there is no result, then the Contents index is empty in the response array.
  • Otherwise all the required indexes exist, except NextContinuationToken, which is only present while the list is truncated.

Getting a full bucket list is a little bit tricky. We need to keep sending requests to AWS until we get all the objects, then merge the results into one array.

To achieve this, the best option is the do ... while loop.

// ...

class Adapter
{
    // ...
    
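    private function fetchFullFileList(array $options): array
    {
        $results = [];
        $continuationToken = null;

        do {
            // Send the token only from the second request on.
            if ($continuationToken !== null) {
                $options['ContinuationToken'] = $continuationToken;
            }

            $response = $this->s3Client->listObjectsV2($options);

            if (empty($response['Contents'])) {
                break;
            }

            $results[] = $response['Contents'];
            // The next token is only present while the list is truncated.
            $continuationToken = $response['NextContinuationToken'] ?? null;
            // With an explicit MaxKeys limit the first page is all we need.
            $isTruncated = $response['IsTruncated'] && !isset($options['MaxKeys']);
            usleep(50000); // 50 ms pause between requests
        } while ($isTruncated);

        return array_merge([], ...$results);
    }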
}

I'm always happy when I can use a do ... while, it's a rare occasion.

In the loop we get the current portion of the list. The ContinuationToken tells AWS where it should continue the listing. On the first request there is no token yet, so AWS starts at the beginning. In the response we get the NextContinuationToken, which points to the next portion. We call AWS again with this token as long as the IsTruncated flag is TRUE. Once it is FALSE, we have reached the end of the list.

A general rule is to avoid array_merge within loops. Then how do we collect all the data into one list without it, and without adding another loop like foreach? Here is an optimization tip:

Collect the result arrays into an array, and after the loop simply merge them with the help of the splat operator.
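
To try the paging logic without an AWS account, here is a small simulation. The fakeListPage function is a made-up, in-memory stand-in for one list request; it only mimics the three response fields used by the loop, and its "token" is simply the next offset. The real tokens are opaque strings.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical in-memory stand-in for one ListObjectsV2 call. It only mimics
 * the three response fields used here; the token is simply the next offset.
 */
function fakeListPage(array $allObjects, ?string $token, int $maxKeys): array
{
    $offset = $token === null ? 0 : (int) $token;
    $page = array_slice($allObjects, $offset, $maxKeys);
    $nextOffset = $offset + $maxKeys;
    $isTruncated = $nextOffset < count($allObjects);

    $response = ['Contents' => $page, 'IsTruncated' => $isTruncated];
    if ($isTruncated) {
        // The next token exists only on truncated pages.
        $response['NextContinuationToken'] = (string) $nextOffset;
    }

    return $response;
}

function fetchAllPages(array $allObjects, int $maxKeys): array
{
    $pages = [];
    $token = null;

    do {
        $response = fakeListPage($allObjects, $token, $maxKeys);
        if (empty($response['Contents'])) {
            break;
        }

        $pages[] = $response['Contents'];
        $token = $response['NextContinuationToken'] ?? null;
    } while ($response['IsTruncated']);

    // Merge the collected pages once, after the loop.
    return array_merge([], ...$pages);
}

$objects = [['Key' => 'a'], ['Key' => 'b'], ['Key' => 'c'], ['Key' => 'd'], ['Key' => 'e']];
echo count(fetchAllPages($objects, 2)) . PHP_EOL; // prints 5 (pages of 2, 2 and 1)
```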

Actually this part:

array_merge([], ...$results);

Here we use the splat operator (...) for "unpacking the argument". Since we are sure that every element of the $results array is an array too, we can safely unpack it and pass all its items (arrays) to array_merge. The empty array as the first argument is a safety net: if $results is empty, array_merge still receives one argument, which older PHP versions (before 7.4) require. The array_merge then merges all the arrays within $results with this empty array, and what we get is the full list of the objects in an AWS S3 bucket whose keys start with a specific prefix.
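
A tiny stand-alone example with made-up sample data shows what the unpacking does, including the edge case when nothing was collected:

```php
<?php

declare(strict_types=1);

// Three "pages" collected in a loop, each one a list of objects.
$pages = [
    [['Key' => 'a.txt'], ['Key' => 'b.txt']],
    [['Key' => 'c.txt']],
    [['Key' => 'd.txt'], ['Key' => 'e.txt']],
];

// One merge after the loop: the pages are unpacked into separate arguments.
$flat = array_merge([], ...$pages);
echo implode(',', array_column($flat, 'Key')) . PHP_EOL; // prints "a.txt,b.txt,c.txt,d.txt,e.txt"

// With no pages at all, the leading empty array is the only argument.
$none = array_merge([], ...[]);
var_dump($none === []); // prints "bool(true)"
```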

The sorter

The next method is sortFileList. We call it only when we want something other than the default sort. This method gives us a great opportunity to practice the custom sorting ability of PHP. First we need to check whether we need an ascending or a descending sort. As I wrote earlier, the first character tells us this. To avoid mistakes, we can add a simple validator for the available sorting values too.

// ...

class Adapter
{
    // ...
    
    private array $validSortByKeys = [
        self::OBJECT_SORT_BY_NAME,
        self::OBJECT_SORT_BY_NAME_DESC,
        self::OBJECT_SORT_BY_DATE,
        self::OBJECT_SORT_BY_DATE_DESC,
    ];
    
    // ...
    
    private function sortFileList(array &$fileList, ?string $sortBy): bool
    {
        if (empty($fileList) || empty($sortBy) || !in_array($sortBy, $this->validSortByKeys, true)) {
            return false;
        }

        $direction = $sortBy[0] === '^' ? 'asc' : 'desc';
        $sortByKey = substr($sortBy, 1);

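        return usort($fileList, static function ($a, $b) use ($direction, $sortByKey) {
            // Keys are strings; LastModified is typically a date object in the SDK
            // response, which strcmp() rejects under strict_types.
            $cmp = is_string($a[$sortByKey])
                ? strcmp($a[$sortByKey], (string) $b[$sortByKey])
                : $a[$sortByKey] <=> $b[$sortByKey];
            return $direction === 'asc' ? $cmp : -$cmp;
        });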
    }    
}

If you are not familiar with custom sorting in PHP, this is how it works. The usort function gets the array that needs to be sorted as a reference parameter. This means the function changes the parameter itself and doesn't return a new version of it, as other array functions like array_replace do.

The second parameter is a callback function that gets two elements from the array. We don't need to know where these are placed in the original array; usort calls the callback, not us. We only need to define the logic that decides the relation between the two items. Return a negative integer if the first argument is considered less than the second, 0 if they are equal, or a positive integer if the first is greater.

With the use statement, we can "inject" variables into the function's scope. This way we can control whether "greater" should be positive or negative, and therefore apply the ascending or descending order without an extra array_reverse call.
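
Here is a stand-alone sketch of the same idea that runs without the SDK. It is not the adapter's method: the function name sortObjects and the sample data are made up, and instead of strcmp it uses the spaceship operator (<=>), which can compare date objects directly.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-alone version of the comparison: sorts a list of
 * objects by one field, ascending or descending, using the spaceship operator.
 */
function sortObjects(array $objects, string $field, bool $descending): array
{
    usort($objects, static function (array $a, array $b) use ($field, $descending): int {
        $cmp = $a[$field] <=> $b[$field];

        // Flip the sign to reverse the order.
        return $descending ? -$cmp : $cmp;
    });

    return $objects;
}

$objects = [
    ['Key' => 'report-a.csv', 'LastModified' => new DateTimeImmutable('2021-04-01 10:00:00')],
    ['Key' => 'report-b.csv', 'LastModified' => new DateTimeImmutable('2021-04-06 16:15:00')],
    ['Key' => 'report-c.csv', 'LastModified' => new DateTimeImmutable('2021-03-15 08:30:00')],
];

$newestFirst = sortObjects($objects, 'LastModified', true);
echo $newestFirst[0]['Key'] . PHP_EOL; // prints "report-b.csv"
```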

The result limiter

This is the simplest: the result must be an array. If it’s not empty, then just chunk the array into pieces with the size of the limit and return the first chunk.

// ...

class Adapter
{
    // ...
    
    private function limitFileList(array &$fileList, int $limit): bool
    {
        if (empty($fileList) || $limit <= 0) {
            return false;
        }

        $fileList = array_chunk($fileList, $limit)[0];

        return true;
    }
}
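
As a side note, array_slice should give the same result as taking the first chunk, without building the chunks we throw away. A quick comparison on made-up data:

```php
<?php

declare(strict_types=1);

$fileList = ['a', 'b', 'c', 'd', 'e'];
$limit = 2;

$viaChunk = array_chunk($fileList, $limit)[0]; // builds every chunk, keeps the first
$viaSlice = array_slice($fileList, 0, $limit); // copies only the first $limit items

var_dump($viaChunk === $viaSlice); // prints "bool(true)"
```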

Get the last modified file’s key

Now that we have a method that returns the full list sorted and limited, the original problem of this post becomes simple: we only have to call that method with the right parameters. Or we can create a method just for this special case.

// ...

class Adapter
{
    // ...
    
    public function getLastUploadedKeyByPrefix(string $keyPrefix): ?string
    {
        $object = $this->getObjectListByPrefix($keyPrefix, self::OBJECT_SORT_BY_DATE_DESC, 1);

        return $object[0]['Key'] ?? null;
    }
}

This will return a string with the file's key in the S3 bucket, or NULL if no object in the bucket matches the given prefix. You can also create a method that downloads the file from the S3 bucket, but I leave that as homework.
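
If only the newest object is needed, a full sort is more work than necessary. As an alternative sketch (not what the adapter does), a single pass over the list can remember the newest item. The function name findNewestKey and the sample data are made up, and I assume every item has a Key and a comparable LastModified value.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical alternative: finds the key of the newest object in one pass.
 * Assumes every item has a 'Key' and a comparable 'LastModified' value.
 */
function findNewestKey(array $objects): ?string
{
    $newest = null;

    foreach ($objects as $object) {
        if ($newest === null || $object['LastModified'] > $newest['LastModified']) {
            $newest = $object;
        }
    }

    // An empty list leaves $newest as null, so the result is null too.
    return $newest['Key'] ?? null;
}

$sample = [
    ['Key' => 'backup-1.zip', 'LastModified' => new DateTimeImmutable('2021-04-01 10:00:00')],
    ['Key' => 'backup-2.zip', 'LastModified' => new DateTimeImmutable('2021-04-06 16:15:00')],
    ['Key' => 'backup-3.zip', 'LastModified' => new DateTimeImmutable('2021-03-15 08:30:00')],
];

echo findNewestKey($sample) . PHP_EOL; // prints "backup-2.zip"
```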

Gábor Iván