Scheduling BigQuery using Cloud Scheduler & Cloud Build
A ramble
This has probably been one of the hardest blog posts that I’ve ever tried to write. Not because of the content. Nor the fact that it’s the first time I’m posting my drivel on a new platform and I’m still learning how to use Hugo. No siree Bob! None of the above.
The reality is that between work and two young children, I can barely find time to scratch my arse, let alone write a technical blog post. But, somehow, between the nappy changes, the cleaning up of puke and the endless interruptions, I finally managed to put pen to paper - although it took about thirty attempts to get there.
Now, I know what you’re thinking: _“You promised us that your first post on your shiny new blog would be pictures of your dog!"_. Indeed I did, and for this I’m truly sorry. However, if you continue reading to the end you’ll find a wee little surprise waiting for you.
Ok, I’m rambling now. Sorry.
Less rambling
Recently, our team needed a simple way to schedule a bunch of BigQuery SQL queries to run. Nothing fancy, just a good auld-fashioned SQL pipeline really. Currently, there are a few ways (hacks) to do this.
There’s also BigQuery’s scheduled queries feature, but we see a few limitations with that. Most notably, it’s tied to a user account, and this ain’t gonna fly with your boss when staff offboard. Also, it only allows you to schedule one query at a time. But, what if you need to run N queries sequentially and they are dependent on each other? Or if you want to run other commands like `gsutil`? Or even just some plain `bash`? Oh, the humanity!
Finally, there’s the Cloud Composer option. Sure, that will work fine as long as you have deep pockets. It’s enterprise grade, which will also keep those snazzy architects up in their ivory towers real happy. However, it’s a heavyweight solution to a very simple problem. It’s got a huge infra footprint, and you need to pay for it twenty-four-seven, i.e. it’s not serverless - and we LOVE serverless in our team.
So, we got to thinking that there must be a nicer, easier, and cheaper way of doing this. We also like solving these types of problems and coming up with new ways of doing things. It keeps us frosty.
Ta-da!
Let’s quickly recap. We wanted to schedule a few queries to run sequentially in BigQuery. Kind of like a serverless pipeline for SQL that we could easily build and deploy. Hmm..
_Pipeline._
_Scheduler._
_Something serverless._
_Pipeline._
_Scheduler._
_Something serverless._
_Pipeline.._
Are you getting it?
The Steve Jobs fans go wild. The rest of y’all are asking yourselves once again _“I don’t know why I bother reading Graham’s nonsense"_, and I’d have to agree with you.
The solution was to use a combination of three serverless tools on GCP: Cloud Scheduler, Cloud Build, and Cloud Source Repositories.
I’m going to assume you’re somewhat familiar with all these products and move on quickly. Try to keep up please.
The architecture was straightforward. Firstly, we set up a Git repo using Cloud Source Repositories. We then pushed a `cloudbuild.yaml` config file so Cloud Build could work its magic. That YAML file essentially defined our SQL pipeline, i.e. a bunch of queries to execute in BigQuery.
Next up, we created a Trigger in Cloud Build, and linked it to our newly created Git repo in Cloud Source Repositories. We then promptly disabled that Trigger, because we didn’t want it triggering when someone pushed to the repo. Smart, eh? Finally, we fired up Cloud Scheduler and created an HTTP job, which pinged the REST API endpoint for that Trigger.
Et voilà! We had a simple, automated, serverless and cheap solution for scheduling BigQuery queries. It took us less than an hour to wire it up and get it working. No kidding.
Putting it all together
The concept sounds easy, and truth be told, it is. However, there were a few little things that tripped us up, so it’s worth walking y’all through the steps. That way you don’t have to make the same mistakes we did should you be foolish enough to use this pattern.
First up, here’s what our `cloudbuild.yaml` file in Git looked like (obviously we’ve simplified this and created a demo repo for the purposes of this post):
```yaml
steps:
- name: 'gcr.io/cloud-builders/gcloud'
  entrypoint: 'bq'
  args:
  - query
  - 'SELECT 1'
# Don't get hung up on this just being about running BigQuery commands.
# Because you're working with containers here, you can run *anything*
# you darn well please. It doesn't even have to be part of the GCP
# ecosystem. Want to run some FORTRAN? Sure, go for it.
- name: 'gcr.io/cloud-builders/gcloud'
  entrypoint: 'bq'
  args:
  - query
  - 'SELECT 2'
```
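Since Cloud Build runs the steps sequentially, later steps can happily depend on the output of earlier ones. To make that concrete, here’s a minimal sketch of a slightly more realistic pipeline; the dataset, table and bucket names are entirely hypothetical, so swap in your own:

```yaml
steps:
# Step 1: run a query and materialise the results into a table
# (my_dataset.daily_summary is a made-up name for illustration).
- name: 'gcr.io/cloud-builders/gcloud'
  entrypoint: 'bq'
  args:
  - query
  - '--use_legacy_sql=false'
  - '--destination_table=my_dataset.daily_summary'
  - '--replace'
  - 'SELECT CURRENT_DATE() AS run_date, COUNT(*) AS row_count FROM my_dataset.events'
# Step 2: export that table to a (hypothetical) GCS bucket. This only
# makes sense because step 1 has already run and populated the table.
- name: 'gcr.io/cloud-builders/gcloud'
  entrypoint: 'bq'
  args:
  - extract
  - 'my_dataset.daily_summary'
  - 'gs://my-hypothetical-bucket/daily_summary.csv'
```

Same pattern as the demo above, just with the sequential dependency between steps doing some real work.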
Then we created a Trigger and linked it to that Git repo. Then we disabled it so it wouldn’t be triggered on every push to the repo. I think I already mentioned that earlier, didn’t I? That’s the senile symptoms kicking in. It was my 40th birthday yesterday.
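As an aside, if you’d rather not click around the console, gcloud can create and list triggers too. Here’s a rough sketch, assuming the repo is the demo `foobarred` one; the exact command group (beta or otherwise) depends on your gcloud version, so treat this as a starting point rather than gospel:

```bash
# Create a build trigger against a Cloud Source Repositories repo.
gcloud beta builds triggers create cloud-source-repositories \
  --repo=foobarred \
  --branch-pattern='^master$' \
  --build-config=cloudbuild.yaml

# List triggers to grab the trigger ID, which you'll need for the
# Cloud Scheduler HTTP endpoint shortly.
gcloud beta builds triggers list --format="value(id,name)"
```

Flipping the trigger to disabled still seemed to be a console job (or a PATCH of the trigger’s `disabled` field via the REST API) when we did this.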
The last step was to wire this Trigger up to Cloud Scheduler and invoke it over HTTP. Now, for those of you who are already familiar with Cloud Build, you’re probably wondering why we didn’t just go straight to its API here. There were a few reasons.
Firstly, I had overlooked that possibility and only thought of it later on when my much smarter colleague mentioned it to me. Let us never speak of that again. Thanks Chris if you’re reading this! Secondly, using Cloud Source Repositories forced us to have everything under version control in Git. Our config, our SQL - everything. We liked that.
Right, back to Cloud Scheduler. We needed to give it an HTTP endpoint. Here’s what that looked like (you need to insert your own project and trigger ID):
```
https://cloudbuild.googleapis.com/v1/projects/cloud-scheduler-test-240911/triggers/f6553fe1-bfb7-4a39-b61f-2e47ada51a8d:run
```
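Before handing that endpoint to Cloud Scheduler, you can sanity-check it yourself with curl and your own credentials. A quick sketch (the project and trigger ID are the demo ones from above):

```bash
# Manually fire the trigger via its :run endpoint. The request body is a
# RepoSource object telling Cloud Build which branch to build.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"branchName": "master"}' \
  "https://cloudbuild.googleapis.com/v1/projects/cloud-scheduler-test-240911/triggers/f6553fe1-bfb7-4a39-b61f-2e47ada51a8d:run"
```

If that kicks off a build, you know the endpoint itself is fine. Which makes what happened next all the more annoying.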
Plugging that into Cloud Scheduler via the console seemed like it would work. However, it did not, my friends.
The dreaded “UNAUTHENTICATED” error
We soon realised that, by default, Cloud Scheduler will not do any type of auth when the job is configured to hit an HTTP endpoint. That’s not to say it can’t do auth, however. It can, but you just can’t set it up via the Cloud Scheduler UI in the GCP console. If you try to, you’ll get this lovely error unceremoniously spat in your face:
```
{
  httpRequest: {
    status: 401
  }
  insertId: "102sg7zf6is66o"
  jsonPayload: {
    @type: "type.googleapis.com/google.cloud.scheduler.logging.AttemptFinished"
    jobName: "projects/cloud-scheduler-test-240911/locations/us-central1/jobs/foobar"
    status: "UNAUTHENTICATED"
    targetType: "HTTP"
    url: "https://cloudbuild.googleapis.com/v1/projects/cloud-scheduler-test-240911/triggers/f6553fe1-bfb7-4a39-b61f-2e47ada51a8d:run"
  }
  logName: "projects/cloud-scheduler-test-240911/logs/cloudscheduler.googleapis.com%2Fexecutions"
  receiveTimestamp: "2019-05-17T12:08:00.933182268Z"
  resource: {
    labels: {
      job_id: "foobar"
      location: "us-central1"
      project_id: "cloud-scheduler-test-240911"
    }
    type: "cloud_scheduler_job"
  }
  severity: "ERROR"
  timestamp: "2019-05-17T12:08:00.933182268Z"
}
```
However, the docs clearly state that you can configure the service account email address via the API when creating the Cloud Scheduler job:
_**--oauth-service-account-email=OAUTH_SERVICE_ACCOUNT_EMAIL**
The service account email to be used for generating an OAuth2 access token to be included in the request sent to the target when executing the job. The service account must be within the same project as the job. The caller must have iam.serviceAccounts.actAs permission for the service account. This flag must be specified if any of the other arguments in this group are specified._
Alas, we needed to create the Cloud Scheduler job from the command line and pass it a service account email address. We simply used the default App Engine one. It seemed like a reasonable one to try first as Cloud Scheduler is essentially running on top of App Engine anyway (AFAIR, Cloud Scheduler used to be “App Engine Tasks”). Our guess was right and it worked first time. For realz!
```bash
gcloud beta scheduler jobs create http foobar \
  --schedule="* * * * *" \
  --uri=https://cloudbuild.googleapis.com/v1/projects/cloud-scheduler-test-240911/triggers/f6553fe1-bfb7-4a39-b61f-2e47ada51a8d:run \
  --message-body='{"branchName": "master"}' \
  --oauth-service-account-email=cloud-scheduler-test-240911@appspot.gserviceaccount.com
```
Warning: the cron expression in that scheduler command is for every minute. Or is it every hour? Bah! I can never remember my cron expressions or multiplication tables. (For the record: `* * * * *` is every minute; every two hours would be `0 */2 * * *`.)
And just like that, it started humming along. Cloud Build got triggered every minute (our prod system was scheduled to run every two hours), and it ran our BigQuery queries without breaking a sweat. Frickin’ sweet.
Here’s Cloud Build getting triggered every minute and running some BigQuery magic:
```
$ gcloud builds list --project=cloud-scheduler-test-240911
ID                                    CREATE_TIME          DURATION  STATUS
23a06b68-2ee1-42da-a3a8-8524c756c727  2019-05-20T04:43:00  11S       SUCCESS
7a4f77ae-4e8e-461b-a4ed-62184862f11f  2019-05-20T04:42:00  10S       SUCCESS
299fc077-2bdc-4948-acb4-84cfd6ad9d64  2019-05-20T04:41:00  12S       SUCCESS
2b6dd3f0-c150-41cb-8c6f-80749f7b3547  2019-05-20T04:40:00  12S       SUCCESS
b48141be-596a-41d6-91fc-bb07c025947a  2019-05-20T04:39:00  12S       SUCCESS
e3fdc735-c6f0-4c40-aac4-72ac10e89b18  2019-05-20T04:38:01  12S       SUCCESS
9182b41f-670c-4028-946f-3ce6aee340a4  2019-05-20T04:37:01  10S       SUCCESS
418664f3-3a0c-4187-8099-2b2fed349143  2019-05-20T04:36:00  11S       SUCCESS
..
```
And here's the log output from one of the builds that was triggered by Cloud Scheduler:
```
starting build "efc2091f-6dcf-48aa-a6aa-59433d13a9fe"

FETCHSOURCE
Initialized empty Git repository in /workspace/.git/
From https://source.developers.google.com/p/cloud-scheduler-test-240911/r/foobarred
 * branch            1bad7771c84ea44d66dc08cd974f864b117ae25c -> FETCH_HEAD
HEAD is now at 1bad777 foo
BUILD
Starting Step #0
Step #0: Already have image (with digest): gcr.io/cloud-builders/gcloud
Step #0:
Step #0: +-----+
Step #0: | f0_ |
Step #0: +-----+
Step #0: |  1  |
Step #0: +-----+
Finished Step #0
Starting Step #1
Step #1: Already have image (with digest): gcr.io/cloud-builders/gcloud
Step #1:
Step #1: +-----+
Step #1: | f0_ |
Step #1: +-----+
Step #1: |  2  |
Step #1: +-----+
Finished Step #1
PUSH
DONE
Step #0:
Waiting on bqjob_r7981644b429c16a0_0000016ada803ad7_1 ... (0s) Current status: RUNNING
Waiting on bqjob_r7981644b429c16a0_0000016ada803ad7_1 ... (0s) Current status: DONE
Step #1:
Waiting on bqjob_r6c3fc876b5e8f_0000016ada8048c4_1 ... (0s) Current status: RUNNING
Waiting on bqjob_r6c3fc876b5e8f_0000016ada8048c4_1 ... (0s) Current status: DONE
```
Wrapping up
As promised, here’s the lovely Nikko looking as cute as ever. She’s such a sweetheart. Dogs are the best.
Alright, so recapping this post: I think it’s safe to say we came up with a great little solution for easily scheduling BigQuery queries. However, because it’s Cloud Build and everything is just an ephemeral container in your pipeline, you could really schedule and run anything you want. It doesn’t even have to be calling `gcloud` commands and GCP services. It can be any arbitrary code/software you want to run. Remember that.
A few of us in the office tend to agree on two things. The first is that I’m a charlatan. The second is that Cloud Build is an incredibly powerful, yet frequently overlooked tool on the GCP stack. It’s so much more than just CI/CD. You can use this little secret gem to get creative and solve lots of problems with it. It’s also cheap as chips and easy to use.
“Anthos? GKE? Nah, I’ll just use Cloud Build like a boss thanks."
That’s it. Must dash. I think I hear a faint cry of a baby wanting to be fed. No problem. It’s only 3:09am. The night is still young.