Batch processing options on GCP through a practitioner's lens // Graham Polley

Intro

I’ve yet to work on a project that didn’t require some sort of async batch processing as part of the overall solution. Sure, working on real-time event and data processing applications is where all the cool kids want to spend their time and hang out. But every real-time system that I’ve come across could not function without batch processing workloads as part of its core design. From mapping or reference data that needs to be ingested daily, to running database schema management tools like Liquibase, there’s always a use case for some long-running compute executing in the background.

Batch processes can run for different lengths of time, but I generally like to think of them as processes that need to execute asynchronously for N minutes or hours. They can have different CPU, IO, and RAM requirements, and they normally execute some custom code, shell scripts and the like. Finally, they may be one-off processes, but most of the time they need to be automated and run on a schedule, e.g. hourly, intra-day or daily.

In this post, I cover a few different tools and technologies you can use on GCP to run batch processing workloads. There are several options available to users, and I’ve heard a few different folks mention that they are somewhat confused as to which service they should use for their use case. So, I’ve attempted to break down each one and provide some clarity. I’m not going to give detailed examples or tutorials of each one, but rather share my opinion on them. For tutorials, check out the official docs for each one.

Pro tip - Some people mistake Cloud Workflows for a way to run batch processes. Cloud Workflows is an orchestration service that allows you to call HTTP endpoints, e.g. other GCP services. Currently, there is no way to inject your own code into the service and have it run on some sort of compute. In other words, Cloud Workflows is not an execution engine in its own right. It just calls other services where the work actually happens.

1. GKE

First up is the venerable GKE option. It’s been a staple of every cloud engineer’s diet over the last 7 years or so. Using GKE to run batch workloads is a perfectly good option. It’s even got a whole framework/paradigm designed specifically to cater for jobs, otherwise known as finite tasks. The GKE docs on Jobs here state:

“Jobs are useful for large computation and batch-oriented tasks. Jobs can be used to support parallel execution of Pods. You can use a Job to run independent but related work items in parallel: sending emails, rendering frames, transcoding files, scanning database keys, etc. However, Jobs are not designed for closely-communicating parallel processes such as continuous streams of background processes."
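To give a feel for what a Job looks like in practice, here is a minimal sketch in Python that builds a Kubernetes batch/v1 Job manifest as a plain dict and prints it as JSON. It is only an illustration, not something taken from the GKE docs: the job name, the container image and the command are hypothetical placeholders, and the backoffLimit of 3 is an arbitrary choice.

```python
import json


def build_job_manifest(name, image, command, parallelism=1, completions=1):
    """Return a Kubernetes batch/v1 Job manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # How many Pods may run at the same time.
            "parallelism": parallelism,
            # How many Pods must finish successfully before the Job is done.
            "completions": completions,
            # Retries before the Job is marked failed (arbitrary value here).
            "backoffLimit": 3,
            "template": {
                "spec": {
                    # Job Pods must use Never or OnFailure, not Always.
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": list(command)}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = build_job_manifest(
        name="render-frames",                       # hypothetical job name
        image="gcr.io/my-project/renderer:latest",  # hypothetical image
        command=["python", "render.py"],            # hypothetical command
        parallelism=4,
        completions=20,
    )
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output could be piped into kubectl apply -f - against a cluster. Here parallelism and completions are what give you the "independent but related work items in parallel" behaviour described in the quote.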

Using GKE for batch processing makes perfect sense if you’re already running on GKE and have experienced engineers on your team that are familiar with running and operating Kubernetes clusters. However, if you’re not currently running GKE as part of your platform or architecture, then spinning it up to handle your batch workloads might not be prudent. That’s because Kubernetes/GKE comes with added complexity and a steep learning curve. As such it might be better to move up the stack to something that is more fully managed, or dare I say it, serverless.

Note: I’m going to give GCE a quick mention at this point. I felt it didn’t need its own section because spinning up VMs and configuring some cron jobs is something we’ve all been doing for decades. Firing up GCE instances is not something I like to do anymore, as I then need to manage and babysit them. For me at least, there are better solutions.

2. Cloud Build

You’ve probably just read “Cloud Build” and then asked yourself either one of the following two questions: “Is this a joke?" or “Is Graham taking crazy pills?". Well, the answer to both is no.

Whenever I tell people that you can use Cloud Build to run batch processing workloads, they have a hard time wrapping their heads around the concept. That’s because Cloud Build is marketed as a CI/CD tool. Google’s tagline for Cloud Build is literally: “Build, test, and deploy on our serverless CI/CD platform.” So, it’s no wonder that users don’t consider this service as anything other than a place to build container images or to push a new release out.

In essence though, you can run anything you want on Cloud Build.

As I’ve been proclaiming dogmatically over the last few years, Cloud Build is much more than just a CI/CD tool. Many people have scolded me for “abusing the service” when I tell them how it can actually be used for other things like batch processing. But then I show them an example of using it for something other than CI/CD and the penny drops.

You see, when you take a step back and look at Cloud Build, it’s nothing more than cheap, ephemeral compute running containers that can execute for up to 24 hours. You can choose VM types, use prebuilt container images that have everything you need or build your own custom ones, and the best part of all is that it’s incredibly easy to use.

Over the years, I’ve worked with many customers out in the real world that use Cloud Build extensively in their solutions for use cases such as running performance load tests, batch file pre-processing, hosting database schema management tools like Flyway or Liquibase, running PoCs, and also automating gcloud commands. Granted, if you need massive scale parallel processing using hefty GPUs and running for days with lots of interdependent steps, then Cloud Build isn’t going to cut the mustard. You’ll need something else. But, for simpler use cases that aren’t overly complex, then Cloud Build can make sense.

So, the next time you need some cheap compute to download that big gzipped CSV file using wget, inflate it, then hit it with some grep or sed commands, and finally upload it to GCS, have a look at using Cloud Build. You might be pleasantly surprised how well it can handle use cases like that.
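If grep and sed start to get unwieldy, the inflate-and-filter part of that pipeline can just as easily be a small script running in a build step. The following is a minimal sketch in Python using only the standard library. It assumes the file has already been downloaded (e.g. by wget in an earlier step) and that the upload to GCS happens in a later step. The function name and the header-preserving behaviour are illustrative choices, not anything Cloud Build prescribes.

```python
import gzip
import re


def filter_gzipped_csv(src_path, dest_path, pattern):
    """Stream a gzipped CSV and keep the header plus rows matching `pattern`.

    Returns the number of data rows written (the header is not counted).
    """
    regex = re.compile(pattern)
    kept = 0
    # Read line by line so a large file never has to fit in memory.
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            open(dest_path, "w", encoding="utf-8") as dest:
        # Always carry the header row across to the output.
        dest.write(src.readline())
        for line in src:
            if regex.search(line):
                dest.write(line)
                kept += 1
    return kept
```

For example, filter_gzipped_csv("big.csv.gz", "filtered.csv", r",AU$") would keep only the rows that end in ",AU", which is roughly what a grep over the inflated file would do.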

3. Cloud Run Jobs

Don’t confuse this with the standard Cloud Run service. Although it’s all the same tech under the hood, Cloud Run Jobs is specifically designed for running batch workloads. Unlike Cloud Run, which mandates that an HTTP request/response framework be part of your container (think Flask, Spring Boot etc.), Cloud Run Jobs does not require this. When you think about it, this makes sense. If you’re running batch processes, you’re not interested in handling and responding to HTTP requests. You just want to execute some code or gnarly shell commands, maintenance scripts etc.
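To illustrate the difference, here is a minimal sketch of what a container entrypoint for a job could look like in Python: no web server, just a script that does its work and exits. It assumes the CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT environment variables that Cloud Run Jobs sets for each task, and falls back to a single task so it also runs locally. The work list and the round-robin split are hypothetical and purely for illustration.

```python
import os


def items_for_task(items, task_index, task_count):
    """Return the items this task should handle, split round-robin."""
    return [item for i, item in enumerate(items) if i % task_count == task_index]


def main():
    # Set by Cloud Run Jobs for each task; the defaults allow a local run.
    task_index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", "0"))
    task_count = int(os.environ.get("CLOUD_RUN_TASK_COUNT", "1"))

    # Hypothetical work list; a real job might list objects in a bucket instead.
    work = [f"file-{n:03d}.csv" for n in range(10)]

    for item in items_for_task(work, task_index, task_count):
        # Replace this print with the actual processing.
        print(f"task {task_index} of {task_count}: processing {item}")


if __name__ == "__main__":
    main()
```

An uncaught exception makes the process exit with a non-zero code, which is how the task gets reported as failed. There is nothing to listen on and no request to respond to.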

Cloud Run Jobs is a relatively new service on GCP. At the time of writing this article, I see two main limitations with this service:

Firstly, it is in Preview, not GA. This means it does not currently have any SLAs wrapped around it and has only limited support. That might be acceptable for some users, but for enterprise customers a service that is in Preview is normally a showstopper. The official Google docs here state:

“At Preview, products or features are ready for testing by customers. Preview offerings are often publicly announced, but are not necessarily feature-complete, and no SLAs or technical support commitments are provided for these. Unless stated otherwise by Google, Preview offerings are intended for use in test environments only. The average Preview stage lasts about six months."

Ultimately, it’s your call on that one.

The second major barrier to adoption right now with Cloud Run Jobs is that each task is currently limited to 60 minutes of execution time. That will undoubtedly get raised soon (possibly when it goes GA?), but as it stands right now this is a limitation you’ll need to consider. The docs here state:

“By default, each task runs for a maximum of 10 minutes: you can change this to a shorter time or a longer time up to 1 hour, by changing the task timeout setting as described in this page. There is no explicit timeout on a job execution: when all tasks are done, the job execution is done."

In summary, if you’re happy to accept the risk of using a Preview service and the 60-minute maximum per task is not a constraint for your workloads, then Cloud Run Jobs could be a good candidate. If not, then you’ll need another solution.

It’s worth keeping an eye on Cloud Run Jobs as it matures as a service. It sits high up the technology stack, and the price of admission is simply a container. So, there are no VMs or infra to set up. I like that.

4. Cloud Batch

Cloud Batch is the latest contender to enter the ring for running batch workloads. As the name suggests, it’s been designed specifically for this use case, in the same way as Cloud Run Jobs. This service was only released a few weeks ago (see here). It’s still in Preview, so the same warning I gave for Cloud Run Jobs above applies. It’s also worth noting that, even for a Preview service, the documentation is very light.

As yet, I have not had a chance to try it. But from reading the documentation, it basically spins up GCE instances in a MIG and then manages your jobs/tasks on them for you. Yanked from the docs:

“Each Batch job runs on a regional managed instance group (MIG) of Compute Engine VMs based on the job’s specified requirements and location. If specified, a job might also use additional compute resources, like GPUs, or additional read/write storage resources, like local SSDs or a Cloud Storage bucket. Some of the factors that determine the number of VMs provisioned for a job include the compute resources required for each task and the job’s parallelism: whether you want tasks to run sequentially on one VM or simultaneously on multiple VMs."

Like Cloud Run Jobs, I think it’s worth keeping an eye on Cloud Batch as it matures. That said, it feels like it’s dropping down another level to GCE/MIGs, which is something some users might not be comfortable with. I guess it all depends on the use case. I’m going to take Cloud Batch for a test drive soon and put it through its paces to see what it’s like. Stay tuned for that upcoming post.