COVID, leaving Australia, and a Spanner in the works // Graham Polley

“Being Irish, he had an abiding sense of tragedy, which sustained him through temporary periods of joy.” - WB Yeats

I’ve decided to leave Australia after an incredible 10 years. I gained my Australian citizenship in 2018 and both my children were born here. I have lots of fond memories of the land down under. However, the pandemic that bulldozed the normalcy of our lives in 2020 was hard. Really hard. The grueling lockdowns in Melbourne aside, raising two young kids without any family support - my family are in Ireland and my wife’s are in Germany - over the last 2 years has finally taken its toll. COVID shone a light on just how isolated we are here, and it made us reevaluate our plans and what’s best for us as a family.

After agonizing over the decision for several months, we made the call to hang up our Australian boots and relocate back to Germany. Alas, after not leaving the state of Victoria for over 2.5 years now, we mustered up the courage to sell our house and belongings, and do what we think is right for us as a family. The initial plan was to take a nice long holiday in northern Queensland (Great Barrier Reef territory for my non-Auss’y readers) before bundling onto an airplane and making the long trip back to Europe.

A Spanner in the works

But instead of sitting on a nice idyllic beach in Palm Cove sipping on Piña coladas right now, I’m actually stuck in mandatory isolation with my family, and the first leg of the holiday has been canceled. A day before we were supposed to leave Melbourne, my son tested positive to COVID, followed by myself shortly afterwards. Physically, we’re all fine and recovering, but obviously it’s scuppered our travel plans.

Being in isolation gives you time to do other things when the rug-rats are happily entertaining themselves with drawing all over the walls or stuffing non-toxic materials into each other’s ears. Yaaayy, I love parenting! Anyway, so apart from catching up on my correspondence, I also managed to get on the keyboard and write up a few lessons learned with using Cloud Spanner on a big honking enterprise platform that’s running at some pretty serious scale to boot. We’ve been using Spanner on this project for a while now, and the team have faced into a few interesting challenges. See here for more deets on the use case itself. Hopefully this will help others who plan or are using Cloud Spanner in their stacks.

Keeping it real

I’m not one to mince my words, so here’s a few real-world considerations when working with Spanner. I posted a couple of midnight threads on Twitter lately about our experiences with the tool (she what I did there?!), and a few people DM’d me asking if I could put it all together in a concise blog post. This is basically a quick - yet insightful - list that the team has accumulated over the last 18-ish months and that we wanted to share in the hope of bettering the GCP community and any other customers that may be using Spanner.

If you’re hyper sensitive to low latency writes with a high concurrency architecture, consider using an external distributed lock (e.g. Redis) instead of Spanner’s built in transactions. Not for the faint-hearted, but we ended up having to implement it for our use case. This is the way.
Spanner has two APIs for modifying data - Mutations and DML. We initially ran with DML, but pivoted to the Mutations API due to performance reasons (see #3 coming up next). My theory for this difference in performance is because Spanner initially didn’t have a SQL engine, and it was bolted on after its initial (internal) design. See here. It’s always had the Mutations API (running on gRPC) and some folks suggest that DML even gets converted on the fly to the Mutations API under the hood. See this insightful thread on Stack Overflow for more on that!
Using a Java stack for your application layer? You have a choice of two Spanner client libs. One is “good”, the other one not so much. Avoid the Hibernate/ORM one. Use the Spring Data one instead. It uses the mutations API (see #2). Or, just don’t use Java. Your call.
Do yourself a favor and use FORCE_INDEX on critical workload queries in Spanner. Shhh, now. Don’t ask questions. Just do it. Word of warning though if using a Java stack (see #3) - the FORCE_INDEX hint is not supported by Spring Data. You’ll need to drop down to Spanner templates. Bleh! FORCE_INDEX also needs all referenced columns to be set as stored columns in the index, and as such, it introduces a tight coupling of your app and schema/DB layers. See here.
We use Liquibase for schema management, and for some reason it’s slow when talking to Spanner e.g. adding new indexes or columns. Maybe we’re doing something wrong, but DDL seems to crawl when used. We’re investigating this one with the Spanner team at the moment.
Spanner has a rather inconvenient warm-up period. It caught us out in the beginning. Make sure you plan your release with this in mind. It’s REALLY important. Docs have now been updated to reflect. See here.
Federated queries are not very performant when running at serious scale. You take a hit. They can also add significant load to Spanner (CPU), so watch out if you have strict NFRs/SLAs for your platform. Schedule them accordingly, plan etc. Keep your federated queries as simple/lean as possible. Don’t push any type of analytical workload down to Spanner. It really doesn’t like it. Instead, materialize data to BigQuery native tables, and then hit it with some sweet ‘auld OLAP directly in BigQuery. Avoid cross-regional querying between BigQuery and Spanner at all costs. You’ll thank me later. Finally, enable “read data in parallel”. It makes a massive difference. For example, a query I ran to suck across 400gb from Spanner to BigQuery took 25mins without this flag. With it enabled, it was 5 mins. Our use case for federated queries is data/table replication to BigQuery. It’s working well so far.
Take note of this one. It caught us out several times. Mutations per commit (including indexes) is 20K. You need to design around this. Oh, and it included columns too, not just rows. Uff-da! See here.
Native TTLs are fine for simplistic deletion logic. But, if you need something more complex for your business rules, then it’s too narrow. You can use generated columns to get a little more creative, but they don’t support joins. So, you need joins in your business logic, you’ll need to engineer another solution.
Basic OLAP workloads e.g. count(*) and aggregations are slow. See #7 above. Spanner is a OLTP workhorse, not an OLAP beast like BigQuery. Even with an index on the table, doing basic aggregations and counts seem to cripple performance. We’d love to see Spanner be able to do both OLAP (at a basic level at least) and OLTP. Wink, wink, nudge, nudge.
Read the docs. Then read the docs again. Then recite the docs on a blackboard 100 times whilst chanting “Spanner will be our savior” and listening to death metal music. In all seriousness RTFM before building anything. They are good - and an important input to getting your solution designed correctly.
You’ll need more nodes than you initially estimated. Trust me. Keep CPU around the 40%-50% mark, not the 65% target that’s mentioned in the docs.