Pro tips for Google Cloud Dataflow & BigQuery // Graham Polley

More drivel

‘Tis the season to be kind and generous, or so I’ve been told. With that festive spirit in mind, I thought it would be a good idea to share my pro tips (and also some random fun facts) for Google Cloud Dataflow and BigQuery. These are the two tools on the Google Cloud stack that I’ve worked with the most, so I’ve accumulated quite a few of them along the way. I also felt the need to capture them in one central place, and make them easily accessible to myself and others. I’m getting old and forgetful, so this post will come in handy for me when I can’t remember if I should use Legacy or Standard SQL in BigQuery (spoiler: use Standard). It’s not an exhaustive list by any stretch of the imagination, and there’s no specific order or categorisation to them. However, you may notice a running theme of cost optimization throughout the post. That’s because I’m a miser. But, I also like saving our customers money, and helping them sidestep the pitfalls I’ve fallen into over the years.

If you’ve got a few tips of your own and would like them added, then please feel free to ping me on Twitter. Likewise if you spot something that’s wrong.

Finally, apologies in advance for the wall of text coming up. Finding suitable images for black themes/background is beyond my lacklustre computering skillz.

Cloud Dataflow:

If you’re new to Cloud Dataflow, I suggest starting here and reading the official docs first.

Develop locally using the DirectRunner, not on Google Cloud using the DataflowRunner. The Direct Runner allows you to run your pipeline locally, without the need to pay for worker pools on GCP.
When you want to shake-out a pipeline on a Google Cloud using the DataflowRunner, use a subset of data and just one small instance to begin with. There’s no need to spin up massive worker pools. That’s just a waste of money silly.
Assess the new-ish Dataflow Streaming Engine and Dataflow Shuffle services to see if reduced costs and performance gains can be made in your pipelines. Check region availability first though as not all are supported.
Dataflow has three SDKS. In order of maturity & feature parity: Java > Python > Go. Personally, I recommend to use the Java SDK whenever possible. Java also has strict type safety, so there’s that too y’all. 🤷
Beam SQL looks promising, but don’t use it in production just yet. It’s not ready, and it’s lacking some SQL features. As a side note, Cloud Dataflow SQL (which is in alpha at time of writing) is based on Beam SQL. And if you want to go even deeper, Beam SQL is based on Apache Calcite. It’s turtles all the way down folks.
This one still catches a lot of people out. Dataflow is available in Sydney. Don’t confuse it with the Regional Endpoint, which is different and not available in Sydney. The Regional Endpoint location is where your pipeline is orchestrated and controlled from, not where the actual worker VMs spin up to process your data. Got it? Great, let’s move on.
Keep your security team happy by turning off public IPs if you don’t need them. Simply set the -usePublicIps=false flag/parameter. Easy-peasy-lemon-squeezy.
Assess FlexRS for batch jobs. This feature uses a mix of regular and preemptible VMs, and might work out to be cheaper for you to use. Again, check region availability first though.
If left unspecified, Dataflow will pick a default instance type for your pipeline. For example, if it’s a streaming pipeline, it picks an n1-standard-4 worker type. For most use cases you don’t need such big nodes. This will save you quite a bit of coin. Experiment with the instance size during shake-out and testing. While we’re on the topics of instances, consider using the N2 generation of GCE instances in your pipelines for a quick performance boost. You’ll pay a but more for them, but we’ve seen good results switching from the N1 family to the newer N2 class. Ignore the official Dataflow documentation that states to use N1 for best results. It’s probably out of date.
Cap the max number of instances for auto scaling. Experiment with the best number for your pipeline. Be cautious if you allow auto scaling with no cap on the max workers. Dataflow has been known to over-provision the worker pool for no apparent reason. I’ve seen this happen quite a bit. However, I do know that the Dataflow team have been working hard recently to make the auto scaling algorithm a lot sharper. Another option is to simply turn off auto scaling altogether if you know you won’t need it.
There was a bug with older versions of Dataflow where it would leave all its files behind in a GCS bucket. Err on the side of caution, and have a post processing step that always checks and deletes any temp buckets that were created during pipeline execution. Why? Because GCS costs can rack up quickly my friends.
When loading data/files from GCS, be sure to set lifecycle policies on your buckets (e.g. Coldline after N period), or delete the files altogether. Again, this reduces costs, which can creep up quickly on GCS. See #11 too.
Be wary when updating Dataflow streaming pipelines. You can do it, but Dataflow performs a capability check first. If the change is too complex, it will be rejected and the update will not happen.
Jobs can hang. It’s just software after all. Set up a monitor (e.g. Stackdriver) to alert your ops team if a job/pipeline that should only take predetermined period of time goes e.g. 5x over normal/expected execution time.
I’m yet to see Apache Beam run well on other runners like Flink and Spark in production at scale. If you’re doing this, and it’s working well for you, then please ping me. I’d love to hear about it. Seriously!
You don’t have to use templates to pass runtime/dynamic parameters to your pipeline. Sometimes the ValueProvider framework that allows you to do this can be a little too opinionated and limiting. There’s nothing stopping you - in the case of the Java - running it as a fat JAR or even compiling and running from scratch. You can then pass whatever parameters you like. See this Tweet from Robert Sahlin for some more deets. He knows what he’s doing.
Cloud Dataflow currently only runs in one zone. Some of the customers we work with require hot failover in the case of a zone going down. Until multi-zone is supported, you’ll need to engineer a workaround yourself e.g. set up Stackdriver monitoring to detect zonal problems, and automatically redeploy the pipeline to a healthy zone.
Test, test, and test some more. Make sure you write robust unit tests for your pipeline. There’s plenty of examples in the docs about how to test your Dataflow pipeline. You can use stubs/mocks in Java world. I don’t know how it works in Python. 🐍
Use the _Dead Letter_ pattern to siphon off bad data. We like to dump our bad data into BigQuery and perform analysis on them as they come in. It could be bad data or just your dodgy code. Either way, catch it, log it, and then dead letter it. Simples y’all.
Remember, a ParDo can execute any arbitrary code you like. You just need to be smart about it e.g. don’t go creating a JDBC connection inside one of your ParDos! You can call out to other GCP services or web services from inside your ParDo. This is something that we do commonly, and it works well.
Familiarise yourself with the @Setup, @ProcessBundle, @FinishBundle and @TearDown methods/annotations, and don’t get them mixed up. You use them to solve different problems e.g. use @Setup to establish resource heavy network connections that can be reused. See here.
Fun fact: Cloud Dataflow is based on Apache Beam. We all know that. But, did you know that it’s all based on two internal products that Google built for themselves called FlumeJava (batch) and Millwheel (streaming)? Probably not, because you don’t need to know this stuff. I just find it interesting. Oh, and ever wondered what “Beam” stood for? Well, folklore has it that the “B” is for Batch and “eam” is for Streaming - alas a framework that can do both batch and streaming. 🤯

BigQuery:

If you’re new to BigQuery, I suggest starting here and reading the official docs first.

Export all your audit and billing logs back to BigQuery for analysis. I don’t know how many times this pattern has saved my butt.
Don’t be lazy with your SQL. Avoid SELECT * on big tables unless you absolutely have to. BigQuery charges on data scanned. The less columns you reference, the cheaper the query.
BigQuery likes wide, denormalized tables with nested and repeated data. So, denormalise whenever possible (also see here). However, that doesn’t mean BigQuery can’t handle normalised data and joins. It absolutely can. It just performs better on denormalized stuff because BigQuery is essentially an OLAP engine.
LIMIT does not reduce costs. It’s an anti-pattern. You still pay for the table scan. It will return your results quicker though. There is one exception to this - see #7 for more info (clustering).
Use custom quotas to control costs when using on-demand. Also, use max-bytes-billed to control individual query costs. Gotta watch those greens!
Try to avoid ingestion-based partitioning if you can. It can get cumbersome when wrangling the data and working across multiple time zones. Instead, partition by a column instead. It’s more intuitive and easier to maintain. You can currently partition by date/timestamp or integer range.
Once you’ve partitioned your data, then cluster it for a free turbo boost on your queries and some cost savings. In addition, and as my good friend Felipe Hoffa pointed out on Twitter here, a clustered table will stop counting bytes early if a LIMIT statement is used for simple statements. This is another good reason to cluster your tables!
For partitioned tables, enforce users specify the partitioned column in their WHERE clause by using setting the require-partition-filter to true. Reduces cost and speeds up query time.
Contrary to popular belief, BigQuery’s storage layer is not GCS. It’s Colossus. GCS and BigQuery both use Colossus under the hood. Oh, and again, you don’t need to know this stuff, but it’s a fun fact: BigQuery uses Bigtable for its streaming engine, and Spanner for its metadata and query result preview. Ta-da!
dry-run is your friend. Use it! It checks that your query is syntactically correct and also estimates the cost of the query.
Keep an eye on the materialized views. It’s something that the BigQuery community have been waiting a long time for.
Set TTLs on datasets/tables when you know data won’t be needed after N period. If you want to archive even further, move it out to GCS coldline. It’s much cheaper.
Don’t SELECT * to preview data. Use bq head or tabledata.list instead.
BigQuery Machine Learning (BQML) is cool ‘n all, but dang is it expensive! At time of writing, it’s ~$470(AUD) p/TB when creating/training the model if you’re using the on-demand pricing model (it’s included as part of the flat-rate model). Be warned.
Don’t use streaming ingestion if you don’t have to. You need to pay for it. Batch loading is free on the other hand.
Enable the cache (24hrs). Remember that it’s not shared though. It’s per user, per project. Use something with a service/account in the middle if you want to share the cache i.e. a proxy.
Editing - not querying - a table resets the long-term storage discount counter. You’ll have to wait 90 days again if you do edit it. Bummer.
Be wary of 3rd party chrome plugins that promise to save you lots of money or improve the performance of your queries. They require elevated permissions on your GCP project(s), which might not fly with your security/privacy teams - especially at an enterprise level. I’m also dubious about their lofty claims.
Use the Public Issue Tracker to raise feature requests and get your friends to star them. The BigQuery engineers and PMs hang out a lot there. Don’t comment with “+1”. Instead, actually star it to give it a proper vote.
Don’t use legacy SQL for new queries. No excuses - just don’t! If you’re running legacy SQL from the old days, get a plan together to migrate off it ASAP. It’s no longer maintained and nothing new is backported.
Put your SQL in source control. Don’t treat it as a 2nd class citizen. Integrate it into your CI/CD pipelines. That should be a no-brainer.
Use SESSION_USER() as a workaround to BigQuery not having row level permissions yet. Until it’s available you’ll need to do this as recommended by Google.
Currently, BigQuery supports customer managed keys (CMEK), not customer supplied keys (CSEK). See here. Don’t confuse ’client side encryption’ with CSEK. They are not the same thing.
Avoid using the native scheduled queries feature in BigQuery if you can. I don’t like how they’ve implemented on top of the BigQuery Data Transfer Service (BQ-DTS). It raises too many concerns and questions by security teams - and rightly so. It also confuses people. Finally, they are tied to user accounts and very hard to untangle should the user/employee offboard the company, and not scalable. Instead use something like Cloud Scheduler + Cloud Build or Apache Airflow.
Instead of using Dataflow for ETL, look at BigQuery as perfectly good ETL tool in its own right. It’s also more performant. But, there are trade offs e.g. it’s easier to test code (Dataflow) than SQL etc.
BigQuery is not limitless. Data skews, too many joins, ORDER BY etc. will hurt performance and queries will fail. Keep your SQL lean. Watch this great video from Jordan Tigani for a deep dive on advanced techniques in BigQuery. It’s a classic.
Another fun fact is that all queries in BigQuery are performed in memory. Bam! That said, a spill to disk is still possible on BigQuery when a slot is overwhelmed ;)
Assess the new Storage API for quicker data retrieval. You need to pay for it though. Yikes!
BI Engine is currently very immature, but keep an eye on it as it grows up. Get ready to see Looker support, APIs etc. as time goes on, I guess.
Use batch queries when your queries are not time sensitive. They don’t count toward your 100 concurrent query limit.
Use approximate functions (within 1% of exact number) when you don’t need exact results e.g. approx_count_distinct. If you’re dealing with numbers in the tens of millions upwards, do you really need the answer to be exact? Who really cares if you’re stock trading system is out by a few cents. Oh wait..
BigQuery has two pricing models: on-demand and flat-rate. Learn the difference between the two of them. If your compute/analysis monthly bill is pushing north of $10K USD, then it’s time to look at flat-rate.
The limits and quotas page are important to stay abreast of. However, some of them are soft limits. If you’re a big enough customer then they can be raised on per-case basis. Talk to your local Google rep.
Someone that I work with, and who’s a lot smarter than me, wrote a nifty little open source tool for analysing all your BigQuery views. It’s very handy indeed.
The bq command line tool is incredibly powerful. You can solve a lot of problems with it, quickly and easily. You hook in all your favourite Bash commands/tools using pipe. See here for an example. #shamelessplug
Google recently open sourced ZetaSQL, which is the SQL parser and analyzer that drives BigQuery (and others e.g. Spanner). ZetaSQL can be found here. This is very useful if you want to build, for example, a BigQuery test harness/framework ;)
You can use FOR SYSTEM_TIME AS restore previously deleted tables/rows. It supports up to 7 days in the past. Awww, snap! See here.
BigQuery now has scripting, which allows you to do all sorts of funky stuff like send multiple statements to BigQuery in one request, use variables, tap into control flow statements such as IF and WHILE, and loops. Neat.
Just a few weeks ago, the BigQuery team announced that on-demand queries can burst through the default of 2000 concurrent slots “when it can”. I’d love to see more details released around how exactly BigQuery dertermines when it can burst. Very cool, nonetheless.