A cautionary tale about wasting money on Google Cloud

I was saved by rate limits. The bill still came out to around $13k, but it could easily have been $4.2MM.

tl;dr: Understanding cloud pricing - and cloud architecture - is critical to avoiding nasty surprises. I made some unsubstantiated assumptions and got bitten. Also: BigQuery is not like a normal database.

The scenario was surprisingly simple. I wanted to do a big "group by" on about 9 billion rows of data. I had loaded the data into an unpartitioned BigQuery table on Google Cloud, about 1.2 TB in total size. After the group-by step I would be doing some other reformatting and analysis, so I had decided to use Google Cloud Dataflow for the grouping. There were about 700,000 evenly sized groups in the data. The group-by on Dataflow kept freezing up, so I decided to split the work in two: one pipeline would find the unique group keys first, and then I would make one query per group manually against BigQuery (first red flag: this feels like a hack). BigQuery scales linearly with dataset size, right? So no big deal?
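To make the mistake concrete, here's a minimal sketch of what that per-group approach amounts to, using the BigQuery Python client (table and column names are hypothetical; the real version ran inside a Dataflow pipeline):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Step 1: find the distinct group keys (~700,000 of them).
groups = client.query(
    "SELECT DISTINCT group_key FROM `my_project.my_dataset.events`"
).result()

# Step 2: one query per group. This is the red flag -- each of these
# queries scans the table independently.
for row in groups:
    client.query(
        "SELECT * FROM `my_project.my_dataset.events` "
        "WHERE group_key = @key",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("key", "STRING", row.group_key)
            ]
        ),
    ).result()
```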

Here’s the problem. Under on-demand pricing, BigQuery charges per query, and the charge scales with the amount of data the query scans. BigQuery has no indexes, and my table was unpartitioned, so a query filtering on a single group key still scans the entire 1.2 TB table. That meant 700,000 queries, each scanning 1.2 TB. Had BigQuery not rate limited me, I would have spent $5/TB * 1.2 TB/query * 700,000 queries = $4.2MM.
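One way to catch this before it happens: BigQuery supports dry-run queries, which report how many bytes a query *would* scan without actually running it or costing anything. A minimal sketch with the Python client (hypothetical table name again):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Dry run: BigQuery estimates bytes scanned without executing the query.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT * FROM `my_project.my_dataset.events` WHERE group_key = 'abc'",
    job_config=job_config,
)

tb_scanned = job.total_bytes_processed / 1e12
print(f"Would scan {tb_scanned:.2f} TB, ~${tb_scanned * 5:.2f} at $5/TB")
# On an unpartitioned table this reports the full ~1.2 TB -- the WHERE
# clause does not reduce the scan, because there is no index to use.
```

The dry run makes the trap visible: the filtered query scans the same 1.2 TB as a full-table query. A single GROUP BY query over the whole table would have scanned that 1.2 TB exactly once, for about $6.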

Yikes! Google forgave the $13k, which I appreciated.