Tales from the trenches: 20 ways to shoot yourself in the foot using GCP
This is not a rant, I promise
There’s a lot to like about GCP. The developer experience and tooling, the free tier, competitive pricing, the sweet network it sits atop, their helpful and kind Developer Relations teams, and the serverless/PaaS offerings that most of us love to use. But, just like all the other public clouds, there are some pitfalls that you can unwittingly stumble into.
So, for my first post of 2020, I’ve fulfilled my promise to some nice people that asked me to turn my recent conference talk (“Tales from the trenches: 10 ways to shoot yourself in the foot using GCP”) into a blog post. Except the thing is, I’ve thought of a few more since I last gave that talk. 10 more to be exact. So, now it’s 20 ways, otherwise known as “lots of ways”.
Hopefully, this list will help folks not make the same mistakes I did, or that I’ve seen teams make. I’ll try to keep it updated with new pitfalls, or remove ones that become redundant as Google roll more and more stuff out. It’s an ever-moving target. I’ve deliberately left BigQuery & Dataflow off this list because I already covered both of them in my last post here.
If you’ve got a few GCP pitfalls of your own and would like to share them, then please feel free to ping me on Twitter. Likewise, if you spot something that’s wrong. Otherwise, get off my lawn!
Don’t do what Donny Don’t does
1. Clicking “accept” without reading the T&Cs of Dataprep. It’s a 3rd party tool from Trifacta (Wrangler) that’s been reskinned for the GCP console. It’s actually not a bad bit of kit TBH. But, it does mean that you need to give Trifacta elevated permissions on your project(s) to make it work. Chat with your security team before deciding to use it, and get their blessing. It will save you a lot of headaches later on.
2. Selecting the wrong region for App Engine. Before selecting your region, be absolutely sure that you won’t need to change it later on down the road. Why? Because you simply can’t! Well, you can, but you’re in for a world of pain: you need to delete the entire project and start again. Yikes! (There’s a sketch at the end of this post.)
3. Enabling VPC Service Controls before reading about its limitations and without properly testing it. I can’t stress this one enough folks. Seriously! It’s notoriously hard to work with, quite immature, and on a few occasions, I’ve seen teams recoil in horror as their solutions simply fall over when they enable it in production. Be warned. Mwahahaha!
4. Pushing service account keys to your repos, especially public repos. Just ask Mark Edmondson about this one. This actually happens more often than you’d think. You shouldn’t need a service account key for development anyway. Instead, use `gcloud auth` on the CLI and sign in with your Google account, or use the Cloud Shell (sketch at the end of this post).
5. Thinking k8s/GKE is a silver bullet and will solve all your software problems. It won’t. Start off by reading this. It’s a great piece of tech, especially when you need to run at massive scale. But, there are plenty of other options available out there. You don’t need k8s/GKE to be successful.
6. Using Deployment Manager. Remember: “friends don’t let friends use Deployment Manager!”. Use Terraform instead. It’s that easy folks. Many of us in the community can see the writing on the wall for Deployment Manager. I find that it just gets in your way, and its scope is very narrow. It doesn’t receive frequent updates (see here), and you’ll also struggle to hire engineers that have Deployment Manager experience. Terraform gives you the ability to move across clouds, and is ubiquitous in the IaC space. Also, from speaking with Googlers, and watching the presentations at Next etc., it’s clear - to me at least - that Terraform stands atop the number 1 podium for IaC.
7. Unintentionally deploying public endpoints/APIs (and leaving them unsecured for extra bonus points) when using serverless tools like Cloud Functions, Cloud Run, and App Engine. It may sound obvious, but I still see lots of people get caught out by this one. Before deploying, double check you’re not making it public and unsecured. If you are, ask yourself “why?”, and whether you really need to. Engaging your security teams for advice will make them feel warm and fuzzy too. Believe it or not, this security stuff is kind of important folks. (Sketch at the end of this post.)
8. Leaving GCS buckets publicly accessible. This is a no-brainer, and you should have policies in place at the org level to prevent anyone making a GCS bucket public. Prevention is better than cure! Use a tool like this (disclaimer: I haven’t personally used this tool, but it looks good) to check for vulnerable buckets in your projects, or see the sketch at the end of this post. This blog post on the Google Cloud site is also a very good read.
9. Using Cloud Composer when you don’t really need it. Airflow is hard and Composer is très expensive, with a very large footprint to boot. Familiarise yourself with its gnarly/complex architecture before getting into bed with it. If you don’t have to use Composer, and something more lightweight tickles your fancy, then take a look at using Cloud Scheduler + Cloud Build instead (sketch at the end of this post). It’s serverless, cheap as chips, and simple to use. You may not get DAGs and fancy UIs (that bug out), but you’ll save a lot of coin and hassle. Side note: Cloud Build is my 2nd favorite tool on the stack.
10. Not capturing all your audit and data activity logs from Stackdriver and pumping them into BigQuery for archival and analysis (sketch at the end of this post). You may not need those logs now, but some time in the future, and when you least expect it, your security team will be standing over your shoulder asking why the company is being held to ransom because of a public bucket misconfiguration. You’ll want logs to show them. See #8.
11. Going all in on Cloud Source Repositories. It’s very bare bones, and doesn’t come close to other offerings like GitHub and GitLab. As such, it doesn’t scale well across big teams or the enterprise. I’m sure this tool will get better with time. But, as it stands now, it’s still quite limited, and doesn’t cut the mustard when compared to the other commercial offerings out there.
12. Not knowing the difference between the different App Engine environments, and ultimately choosing the wrong one. There’s Standard 1st generation (don’t use this one), Standard 2nd generation, and Flex. What, 3 isn’t enough I hear you say?! I agree. See here and here for more info.
13. Not checking the release status of services, and deploying a solution to production that uses something that isn’t GA, or at the very least in public beta. This is a particularly bad booboo to make if you use a service that’s in alpha status! If you’re using an alpha service, you’re essentially a tester for Google. Public beta might be OK for you, but be sure you’re happy before clicking the deploy button. If you’re not, dupe your manager into deploying it for you so they can take the heat instead if it all goes belly up. Just how you dupe them will depend on your levels of cunning.
14. Not checking the quotas and limits of a service. Like a few of the others listed above, this is generally good advice for all the clouds. For the serverless options on GCP in particular, make sure you go through the “Quotas and limits” page for each one with a fine tooth comb (there’s a sketch at the end of this post). You don’t want to get caught with your pants down later on when you need to scale that puppy up. Also, check that the service is available in the region that you intend to use it in. BigQuery quotas and limits is a good example of what I’m talking about. That said, some of them are soft limits, and can be raised on request.
15. Using Datastore for any new projects. It’s grandfathered. Use Firestore from now on instead. Firestore addresses a lot of the limitations and shortcomings that Datastore had. If you use Datastore, it’s very hard to untangle it from your solution. Also, if you choose to run Firestore in Datastore mode you can never change back to native mode Firestore. That ship will have sailed right on by you!
16. Not setting budget alerts on your projects (sketch at the end of this post). Forgetting to do this is a very bad idea, especially when you’ve got people like myself on your team who don’t really know what the hell we’re doing, and rack up massive bills. Just ask my old colleague Gareth Jones from Shine. Also, if someone compromises your project(s), then unless you’ve got alerts set up, you may not notice the bitcoin miners spinning up hundreds of VMs with sneaky names to camouflage them until it’s too late.
17. Trying to subscribe a Cloud Function to a Pub/Sub topic in another project. This currently won’t work. See my other blog post about that here. One trick is to use a Dataflow pipeline in streaming mode to act as a proxy between the 2 projects. What are you still doing here? Go read it!
18. Thinking Binary Authorization is part of GKE, and getting excited at the prospect of using it out of the box with Google’s managed k8s offering. It’s no longer part of GKE. You’ll need a shiny Anthos subscription to use it from now on. And that’s not the only thing moving in under the Anthos marketing umbrella. Another one is Cloud Run. Sigh. Tyler Treat wrote a great blog post about all this here, and it sparked some great conversation in the community e.g. here.
19. Not knowing that default backups for single-region Cloud SQL instances can end up on another continent, e.g. from Sydney. Watch out if you have data sovereignty constraints (sketch at the end of this post). Here’s the quote directly from the docs: “By default, Cloud SQL stores backup data in two regions for redundancy. If there are two regions in a continent, the backup data remains on the same continent. Because there is only one region in Australia, backup data from the Sydney region is stored in a location in Asia. For the São Paulo region, backup data is stored in a US-based location.”
20. Not doing your due diligence on Data Fusion. Data Fusion is positioned as the answer to ETL using a GUI, so losers like myself don’t have to code. But, it’s got some serious shortcomings, I’m afraid. It’s expensive because it runs a GKE cluster 24/7 under the hood. It takes 30 minutes to spin up, the UI feels clunky, and the API is confusing (there are 2 APIs). Also, as Tyler Treat (this dude really knows his shiznit) has pointed out here, it runs ephemeral Dataproc clusters, which drops you into Hadoop world when you need to debug/troubleshoot your pipelines. That’s not very good news for users who only want to use a clicky-clicky-pointy-pointy GUI. Oh dear.
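
The sketches I promised
A few quick command-line sketches to go with the list above. All the project, bucket, and service names are made up, and the flags are from memory, so double check everything against the docs before copy/pasting.

For #2: check what’s available before you run `gcloud app create`, because there’s no undo.

```
# List the regions App Engine supports first.
gcloud app regions list

# This choice is permanent for the project, so choose wisely.
gcloud app create --region=australia-southeast1
```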
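For #4: develop with your own credentials instead of a service account key file.

```
# Sign in with your Google account for gcloud itself.
gcloud auth login

# Give the client libraries Application Default Credentials,
# again with no key file involved.
gcloud auth application-default login
```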
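For #7: on Cloud Run, for example, be explicit that your service requires authentication, and check the IAM policy afterwards. Service, image, and region here are hypothetical.

```
# Deploy with authentication required (i.e. not public).
gcloud run deploy my-service \
  --image=gcr.io/my-project/my-image \
  --platform=managed \
  --region=us-central1 \
  --no-allow-unauthenticated

# If allUsers shows up in here, the service is public.
gcloud run services get-iam-policy my-service \
  --platform=managed --region=us-central1
```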
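For #8: a crude sweep for public buckets. The tool I linked above does this properly; this just shows the idea.

```
# allUsers or allAuthenticatedUsers in a bucket's IAM policy = public.
for b in $(gsutil ls); do
  echo "$b"
  gsutil iam get "$b" | grep -E 'allUsers|allAuthenticatedUsers'
done
```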
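For #9: the Cloud Scheduler + Cloud Build combo, roughly. Cloud Scheduler calls the Cloud Build API on a cron schedule, and the service account you give it needs permission to kick off builds. The names and the build config are illustrative.

```
# Trigger a build of my-repo's master branch at 2am every night.
gcloud scheduler jobs create http nightly-pipeline \
  --schedule="0 2 * * *" \
  --uri="https://cloudbuild.googleapis.com/v1/projects/my-project/builds" \
  --http-method=POST \
  --message-body='{"source": {"repoSource": {"repoName": "my-repo", "branchName": "master"}}}' \
  --oauth-service-account-email=scheduler@my-project.iam.gserviceaccount.com
```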
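For #10: route your audit logs into BigQuery with a logging sink. Dataset and sink names are made up.

```
# Somewhere to put them.
bq mk --dataset my-project:audit_logs

# Ship everything from the audit logs into BigQuery.
gcloud logging sinks create audit-to-bq \
  bigquery.googleapis.com/projects/my-project/datasets/audit_logs \
  --log-filter='logName:"cloudaudit.googleapis.com"'

# The command prints a writer identity; grant it BigQuery Data Editor
# on the dataset or nothing will arrive.
```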
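For #14: quotas for Compute-based services are easy to eyeball from the CLI. Serverless quotas live on each service’s “Quotas and limits” page.

```
# Project-wide quotas and current usage.
gcloud compute project-info describe --project=my-project

# Quotas for a specific region.
gcloud compute regions describe us-central1
```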
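For #16: budgets hang off the billing account, not the project. At the time of writing this is a beta gcloud surface, so it may have moved since; the console works fine too.

```
# Alert at 50% and 90% of a 500 (billing account currency) budget.
gcloud beta billing budgets create \
  --billing-account=XXXXXX-XXXXXX-XXXXXX \
  --display-name="my-project budget" \
  --budget-amount=500 \
  --threshold-rule=percent=0.5 \
  --threshold-rule=percent=0.9
```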
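For #19: check where your Cloud SQL backups actually live. The backupConfiguration field shows the settings; newer instances also let you set a custom backup location.

```
# Inspect the instance's backup configuration.
gcloud sql instances describe my-instance \
  --format="value(settings.backupConfiguration)"
```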