Shaving some savings from our BigQuery bill

If you have a product as popular as LiveScore, with 50M monthly users, you soon accumulate a LOT of data about those users. BigQuery is practically limitless in terms of what you can do with the serverless warehouse — but it’s not a charity. Here are some missteps we made along the way, and how we eventually optimised things

As a huge GCP user, it will come as no surprise that I’ve spent my fair share of time working with BigQuery and dealing with large datasets. But when livescore began recording so much data every month, our costs started to creep, then climb! We had data flowing in from loads of sources, like google analytics 360, exporting events from firebase, and pulling in data from salesforce and apple ads using FiveTran.

One day, our eagle-eyed cost controller noticed that we’d had a recent spike in BigQuery costs. Looking in our GCP billing console, it seemed that we’d spent thousands of dellars in bigquery analysis costs, over the course of just a few days!

Cost spikes, for just one project!

We immediately go into full investigation mode, to identify the… uhh… *remembers blameless post-mortems…* cause of the issue. It looked like a few queries had been run to find a specific record from our firebase events that we were exporting. Now if we had been storing this data in mysql and using an index, the query would have run in a matter of milliseconds, and cost next to nothing. Unfortunately, this query was not using some well-known best practices of bigquery. The result was that the user unknowingly asked bigquery to scan over a Petabyte of data, in order to return about 200Kb of results — just filtered, not aggregated.

Now in situations like this, it’s easy to start pointing fingers, and some organisations at this point like to implement draconian measures:

“only queries submitted through the ticketing system 2 weeks in advance with CFO approval can be run”.

But we’re better than that. Instead, we asked, how can we take steps to ensure this problem doesn't happen again in a similar manner.

Reducing total data stored is key

It’s easy to jump to some obvious but imperfect ideas:

  • Fire this employee (We’d never do that, we’re not a soulless org!) — but we’d simply end up hiring a replacement with no guarantee they won’t turn round and do the same thing.
  • Just remove the user’s access — they’ll get someone else to run the query. Possibly better, but problem persists.
  • Educate the user in best practice — certainly a worthwhile idea, which we’ll do, BUT — we’ll eventually hire more people, people forget things, and people make mistakes. Education is great, but it doesn’t wholly solve the problem: the problem isn’t this user, the problem is that anyone is capable of making this mistake.
  • Put guardrails in place to prevent this kind of abuse — I suggested using custom quotas, but unfortunately this was viewed as unnecessary and too restrictive.

We realised that a lot of the data we had stored, simply wasn’t of much value to us any more. This was highly granular data — unaggregated events of individual user interactions, dating back years! There’s no way we’re looking at that. Data becomes less valuable as it ages — we can make it more valuable by aggregating it into something more useful, and discarding that granular data. This shifts the data elsewhere and reduces the size enormously, by at least an order of magnitude.

Here are some of the steps we took:

Deleting duplicate GA360 export data

We had data from our various clients being sent to different GA properties. After we got all these exports each day, we moved them to a “unified” table. Basically a direct copy, but instead of sharded tables, it’s a date partitioned table, but also clustered on a few fields. This makes it easier to query the table in a way that is both easier for the analyst, and also more cost effective, since bigquery can skip sections of each partition not containing the relevant data.

But now our data is duplicated. Here are some approximate numbers we calculated for cost savings in this area:

  • size of GA exports: 58TB
  • growing at a rate of ~0.12TB/day
  • For the last 90 days, that’s ~11TB of data, stored in short-term storage, or $220/month
  • That leaves ~47TB in long-term storage, or $470/month.
  • Total monthly bill for storage of this data is $690
  • We decided we’d keep 30 days worth of data for reconciliation purposes — 0.12TB * 30 days = 3.6TB in short-term storage, or $72.
  • With some effort, we can save ~ $600 per month.

In order for our eagle-eyed analyst to be happy with this, we’d have to introduce reconciliation processes, but this makes our ETL more reliable, so we welcomed it. We lose nothing deleting this data, as we already have a copy.

Deleting data exported from Firebase

These are the events as viewed from firebase, and contains very low level aggregate data. Never been cleaned out, and this is the cause of our spike in costs. Some numbers:

  • size of firebase exports: 224TB
  • growing at a rate of 0.35TB/day
  • for the last 90 days, that means we’ve got 31.5TB in short term storage, at ~ $630/month.
  • leaving ~192.5TB in long-term storage, costing $1,925/month
  • Total monthly bill for storage is $2,555
  • We decided again to keep 30 days worth of data for reconciliation resolution — 0.35TB * 30 days = 10.5TB stored, or $210.
  • With some effort, we can save $2,300 / month.

We couldn’t straight up delete this data, as we had no copy, but instead we decided it’s aggregated form was as useful as it would be. We simply had to add some recs to this aggregation stage to be more confident

Deleting staging tables

Our dag looked like this:

We had a pretty hefty query at over 1000 lines, to transform the default GA export format, to something much more suited to our business model — extracting custom dimensions of each hit into useful info, and aggregating more appropriately. We broke this into multiple steps, as 1000 lines is a bit long for anyone’s liking, and we also ran into issues with BigQuery being able to process it all at once. But these staging tables were simply left lying around. There was no value in these tables other than in the immediate aftermath of something going wrong. So we added some reconciliation steps, and if everything was good, we deleted these staging tables. A little effort, but we saved ourselves another $300/month.

Deleting stackdriver exports

Once upon a time, we had set up exports of the stackdriver logs from our load balancer for analysis. However, this wasn’t used regularly for analysis. We had 150TB of data, growing at 0.38TB/day. We decided to just keep the last 7 days, for a cost of $53. Meaning we saved around $1800/month!

We found a few other dusty areas of our warehouse to empty, totalling around $880.

If you’ve been following along, our total savings came to $6000/month. This was enough money to hire at least 1 new developer! It sounds like a no-brainer, but we still need to ask the question: should we actually do this work? We need to know how long it will take, how much that will cost, and what could we reasonably be doing otherwise, what’s the cost of opportunity of doing this work? Some estimates put this at around 6 man days of effort. We’ve got no other time-sensitive work pressuring us at the point this comes to light. Even using our most expensive developer, it’s a no-brainer, but it doesn’t become a no-brainer until you’ve done the calculations to prove it.

Why didn’t you do this in the first place?

A very fair question for any financial controller to ask, but for each of these cases:

  • they weren't a problem initially
  • it either wasn’t evident that it would become a problem, or the solution required more focus on the implementation, and it wasn't the right time to put that effort in
  • these costs grew quite slowly over the course of a few years — you could’ve look at the numbers month to month and never noticed a significant increase. Only when looking at the total and asking whether this was money well spent, did this become worthy of investigation.

What was the outcome of this?

2 things — cost reduction, and impact of our next mistake.

A few weeks later, we got a lovely sight in our GCP billing dashboard — a massive visible slump in costs! In fact, since we execute the work at the start of the month, google’s not-so-fancy little line of extrapolation estimating your billing costs for the month, was projecting that by the end of the month, we’d have spent -1,230 dollars (Unfortunately we couldn't hold them to this estimate).

Remember that mistake we wanted to prevent happening earlier? I’m sad to say it happened again, in an almost identical fashion! Unfortunately this time, our new uninformed user was even more aggressive — they ran around 50 unoptimised queries! My heart sank a little on seeing this, and my eagle-eyed cost observer almost shed a tear. We had managed to rack up another expensive bill, for what should have cost cents. I almost lost it, until I decided to do a quick calculation — what would this have cost us, if we had NOT done the optimisation work? The user was running queries akin to select * from firebase.analytics_export_* where id = ? , so this would have queried the exact data we had deleted.

Turns out, we had reduced the data by 80%, saved our company from what would have been a bill in the tens of thousands of dollars! Now our actual bill didn’t sound too bad, and the effort spent on optimising this was clearly worthwhile. Needless to say though, I was finally able to convince someone to add those custom quotas.