A blast from my (AWS) past

It’s shouldn’t be a secret to anyone that over the past 6-ish years, I’ve predominantly been focussed on using the GCP stack to build software solutions. However, it might surprise some folks to know that I actually started my cloud adventure on AWS way before GCP was even a thing. The first cloud project I actually worked on was helping build and support a fleet of EC2 instances running ¾ of a LA[M]P stack. To this day it still powers one of Australia’s largest digital publishing websites, handling 5K request p/s at peak and tapping into other AWS services like ELB, RDS, Elasticache, S3, Lambda and Redshift. Yes, it’s also a surprise to me that it hasn’t fallen in a heap, considering I helped build it with my dodgy coding skills. Anyways, working for a technology agnostic company means that staying abreast of the main clouds and tech/tools is a crucial part of my job. Besides keeping me frosty and continuously learning, keeping an eye on the broader technology landscape means I’m in a better position to help our customers and be a trusted advisor to them.

Recently there was a question posted by someone in our internal AWS chat room asking for help on how to quickly implement a solution (or hack) that could list and count all objects per ”folder” in an AWS S3 bucket. It went like this:

”Has anyone come up with some ingenious way to list the file count of every “folder” in an s3 bucket?”

Now, before you get up on your soapbox and begin launching into a tirade about how there’s no such thing as “folders” in S3 and how everything is an object, take a deep breath and relax. I’m well aware of that, as is the person who asked the question. But, for brevity’s sake, I’m going to use “folders” as the noun to help describe the problem. It’s easier. As a compromise, I’ve put it in quotes to keep the elitists at bay.

Shell tools are you best friend(s)

I personally like solving little challenges like these. It’s fun, and also a welcome distraction from my usual day-to-day shenanigans, like pestering my boss or causing trouble because I like to play devil’s advocate. So, I promptly started to Google if I could copy and paste a solution and claim all the credit for myself. It became quickly apparent however, that it wasn’t as straightforward like I had originally assumed. There’s a lot of talk about how to do it, and many of the solutions proposed dropping into either Python or Java and using mad things like arrays and loops. What’s a loop? Ewww! See here if you don’t believe me.

Surely, there must be an easier way to do this, I thought to myself as I lay back in my rocking chair, smoking my pipe and sipping my neat scotch. My first idea was to go back to basics and try doing it all from the CLI using Bash (awk, grep etc.) and maybe some SQL too? That approach has never failed me in the past. While I pondered that idea, I thought about using Athena. I figured it might be able to read a bucket’s metadata, but alas I discovered this question on Stack Overflow and also this on Reddit that states Athena can’t read bucket metadata. So, I dropped that idea quicker than hot snot down a greasy rasher. It actually turns out that I’m a little stupid and Athena can indeed be used, but I’ll come back to that later.

Bash & SQL - is there anything they can’t do?

9 times out of 10, a tech problem can be solved using some good ‘auld Bash and SQL. That said, just because you can do it with Bash and SQL, doesn’t mean you should do it with Bash and SQL. However, the initial ask was for something quick and easy and it was only going to be used in ad-hoc manner and run once or twice. I had my free pass to get down, dirty and hacky. High five!

The idea was pretty easy. Make one API call to get the initial dump of everything in the bucket, format it, and then pipe the result into sqlite3, which is conveniently pre installed on macOS. Then it’s just a case of hitting it with some good ‘auld fashioned SQL. It’s based on the premise that “folders” are zero byte objects, so they can be filtered/identified using their size. Neat!

Step 1: recursively list all objects in the bucket

$ aws s3 ls s3://my-funky-monkey-bucket --recursive --output text | awk '{printf "%s,%s\n",$3,$4}' > objects.csv

Hitting it with a little bit of awk magic, I formatted the output to include the object size and name only. Here’s what the output of that command looks like on the CLI:

$ cat objects.csv 
0,bar/
58937,bar/1.png
0,foo/
58937,foo/2.png
0,foo/subfolder/
18562,foo/subfolder/3.png
58937,foo/subfolder/4.png
0,foo/subfolder/subfolder2/
18562,foo/subfolder/subfolder2/5.png

So, now I had all my objects and their sizes listed for the bucket. You can see that “folders” are indicated by 0 bytes in size. Let’s move on now.

Step 2: SQL joins the party (it “sequel”, not “S.Q.L”)

The next step is taking the output from step 1, and loading it into sqllite, which is conveniently pre-installed on MacOS. I’m going to assume readers don’t need me to explain the SQL below. It’s so trivial that even I was able to write it. But, in a nutshell, what’s happening here is that the results from step 1 get loaded into sqlite db/table on the fly (all in memory of course) and then hit with a simple filter and aggregation.

sqlite3 test.db <<EOF
create table objects(size, name);
.mode csv
.import objects.csv objects
.output results.csv
.headers on
select rtrim(name, replace(name, '/', '')) as folder, count(*) as num_objects from objects where cast(size as integer) > 0 group by folder;
EOF

Step 3: Tying it all together

All I needed to do next was combine step 1 & 2 into a crude Bash script, et voilà!

#!/bin/bash
aws s3 ls s3://my-funky-monkey-bucket --recursive --output text | awk '{printf "%s,%s\n",$3,$4}' > objects.csv
rm test.db 2> /dev/null
sqlite3 test.db <<EOF
create table objects(size, name);
.mode csv
.import objects.csv objects
.output results.csv
.headers on
select rtrim(name, replace(name, '/', '')) as folder, count(*) as num_objects from objects where cast(size as integer) > 0 group by folder;
EOF

And running it produces:

$ ./bucket_folder_count.sh 
$ cat results.csv 
folder,num_objects
bar/,1
foo/,1
foo/subfolder/,2
foo/subfolder/subfolder2/,1

And that’s it. One ‘elluva gnarly, hacky, brittle little solution for counting objects in S3 “folders”..sorry, objects! I’m not sure how well this is going to scale, to say, millions of objects and you’ll need to think about working around the max record count of 1000 and pagination. Otherwise, I think it’s a rather neat little script for solving the original problem.

But, the fun doesn’t stop there my friends. Remember that Athena comment I made at the start of this drivel? Well, yes, it turns out someone much smarter than me was able to figure it out using Athena. Oh dear! Maybe I should just retire early.

I am so smart, I am so smart..SMRT, SMRT!

While I was basking in my triumphant glory and thinking I was smart, someone much clever-er than me chimed in with another option. Michael Paterson was able to come up with the following solution using Athena. If you remember, I had thought about using Athena but had concluded it wouldn’t be possible. Using the “$path” function in Athena, which lists keys for all records in an athena table, he created a generic table to read from the root folder. He was able to get it working by ‘reading’ a mix of csv and parquet files though only extracting the file path, not the contents, in the query. This solution will probably scale a lot better than my gnarly hack, but it’s worth noting you’ll be slugged with Athena charges because the file contents will still be scanned. Here’s what Michael came up with in the end:

-- ------------------------
--  In AWS ATHENA
CREATE EXTERNAL TABLE nai_kingscross_zulu_raw.folder_count -- database(schema).table_name
(
  `first_field` string  -- not needed except for table compile
 )
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '|' -- not needed except for table compile 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION

 's3://bucket/subfolder/'  -- bucket and subfolder root path

TBLPROPERTIES (
  'has_encrypted_data'='false', -- not needed except for table compile
  'skip.header.line.count'='1' , -- not needed except for table compile
  'serialization.null.format'='' -- not needed except for table compile
)
;
-- RUN QUERY (separate execution to create table)
-- tests
-- 133722 Objects -- 940.7 GB -- 2 mins 4 secs
-- 2962 objects -- 2.5 GB     -- 0 mins 4 secs
WITH s1_distinct_paths AS
(select DISTINCT replace("$path", 's3://','') as full_path  
 from nai_kingscross_zulu_raw.folder_count 
)
,s2_split_path_to_array AS 
(
select split(full_path, '/') as path_elements_array , full_path 
  from s1_distinct_paths
 )
,s3_remove_last_element  AS -- ie remove filename
(
SELECT slice(path_elements_array, 1, cardinality(path_elements_array)-1) AS folder_path_array , full_path 
  from  s2_split_path_to_array
 )
,s4_count_files_for_folder_path AS 
(SELECT  folder_path_array , count(*) as file_count
 FROM s3_remove_last_element
GROUP BY folder_path_array
)
,s5_count_folder_levels AS
(SELECT array_join(folder_path_array, '/') as folder_path_string, folder_path_array, file_count, cardinality(folder_path_array) as folder_level_count  
 from s4_count_files_for_folder_path
)
SELECT * FROM s5_count_folder_levels ORDER BY folder_path_string

Granted, it’s a fair bit of SQL to wield and you you’ll also have to pay for scanning all the files in your bucket(s), but there’s no doubt that it will scale, and it is a lot less brittle than my crude hack. Good work Michael! Hat tip.

Final thoughts

And there you have it. Two solutions..cough..hacks to counting objects per “folder” in an S3 bucket. Like most software engineers, I enjoy solving little problems like these. It’s fun and keeps me somewhat frosty on the tools even though I rarely code much these days. Finally, if you’ve got another (better?) way to do this then feel free to drop me a line and let me know.

Now, I really must dash. I need to prepare for my Microsoft Azure AZ-900 exam. I keep hearing good things about Azure. It’s time to dip my toe in the big MS-DOS machine. “Developers! Developers! DEVELOPERS!”