BigQuery: date sharding vs. date partitioning
- Why split tables
- date sharded tables
- partitioned tables
- when to use which
- what about clustering?
When database tables become very large, querying them eventually leads to one of 2 problems — it becomes quite difficult, as your database has a lot of data to scan and aggregate, or if you have a very scalable database, it becomes very expensive!
In order to combat this, databases offer an array of useful features —typically, an RDBMS will allow indexes to quickly find data, partitions to physically split data into multiple tables internally, and multi-dimensional clustering to allow sorting the data in a sorted manner, so that similar data is stored together. This allows us to eliminate large sections of data that no longer need to be scanned. This makes our query a lot faster on smaller systems, or more importantly on the highly scalable BigQuery system, a lot cheaper!
Bigquery offers a few features that overlap in this area, so which to use?
Date Sharded tables
These sound more intimidating than they are. You can created a date shared table, by creating 2 or more tables that have the following characteristics:
- exist in the same dataset
- have the exact same table schema
- the same prefix
- have a suffix of the form _YYYYMMDD (eg. 20210130)
When tables like these exists, BigQuery now sees these tables as a collective:
- The UI shows them grouped together, with an indicator for the number of tables, rather than a very long list filling up your dataset panel
- You can view each individual table, one at a time, with a nifty little drop down. Unfortunately, this has it’s downsides — you’re treating the whole table as a collective, but you can’t view collective information, like total size, and deleting them all can be a tedious task, or require some automation.
- You can query the tables using a wildcard to conveniently perform unions of these tables:
select * from example_dataset.data_sharded_*
. But note, this is simply syntactic sugar for the following:
select * from `example_dataset.date_sharded_20210129`
union all
select * from `example_dataset.date_sharded_20210130`
union all
select * from `example_dataset.date_sharded_20210131`
...
You can also see estimates how how much data will be processed by a query, as usual. this can give you a guide to how much data is in your partitioned tables
You can have each individual table expire after some timeframe, typically by setting the default expiry time on the dataset. This will affect all tables in the dataset.
Why do date sharded tables exist?
These are something of a legacy feature, they were more commonly used with bigquery’s legacy SQL. However they’re still somewhat common, the google analytics 360 bigquery export for example, provides a new table shard each day, for the new data from the prior day.
You can limit the amount of data you query by only using a single fully qualified table, or using a filter to the table suffix:
select *
from `example_dataset.date_sharded_*`
where _table_suffix between '20210129' and '20200130'
Partitioned Tables
A partitioned table is just a single table, but it is physically broken into chunks to make querying simpler. You can do this in a few ways:
- split by ingestion time — get a new table partition each day, and any rows you insert during that day get added to the new partition
- split by some feature of your data — this is very commonly done for a date field (or the date of some timestamp field), with one partition per day, but can also now be done based on some integer if you tell bigquery how those ranges should be defined. (eg you could have a table of people, with a column for age, and define the partitions as 0–10, 10–20, 20–30, etc).
These have advantages for a few reasons:
- They’re a more intuitive and natural format for a large collection of data — the same thing in a date sharded table but in one table. This makes it easier to manage through the UI.
- There’s just a single table for metadata and permissions, which also reduces query overhead and improves performance.
- The data-based partitioning allows for features that might be impossible to implement with sharded tables.
- Can have up to 4000 partitions, whereas a query using date sharded tables can only query up to 1000 tables at once.
- You can have single partitions in the table expire, without needing to set the option to all tables in the dataset.
The bigquery docs quite clearly advise to use partitioned tables, citing performance, quotas, and user semantics:
The recommended best practice is to use date/timestamp/datetime partitioned tables instead of date-sharded tables
Are there legitimate reasons to use date shared tables instead?
Why does GA360 export tables in date sharded tables if google also recommends using partitioned tables?
If you’re designing a data mart for analysts to use, certainly aim to use a partitioned table — they’re more fully featured than the options of sharding, easy to set up, and easier for the end user of your warehouse.
However, if you’re a data engineer and you’re moving data around, date sharded tables can often be an easy alternative that allows breakpoints in your flow, and simpler management for import/export.
Dealing with these table shards can be simpler — if you just want to export a single partition of your table with a partitioned table, you’d have to run a query to select just the relevant partition, and export the query result, which costs time and money and adds complexity. However if the table is already date sharded, you can simply export it.
Also, if you’re running a large ETL system, a common pattern might be to write some data into a table like so:
insert into `example_dataset.partitioned_table`
select
distinct *
from `example_dataset.table_with_duplicate_rows`
where date = date_sub(current_date, interval 1 day)
However this step isn’t idempotent — if this task were to accidentally be run twice, you’d end up with duplicate data in your partitioned table. However if your partitioned table were instead date sharded, you can overwrite the entire shard — even if this runs more than once, the effect is idempotent. This same mechanism also applies to importing tables from Cloud Storage or other sources. It would look like so:
create or replace table `example_dataset.date_sharded_20210130` as
select
distinct *
from `example_dataset.table_with_duplicate_rows`
where date = '2021-01-30'
This can become important when you have a large number of tasks in your ETL, possibly orchestrated by airflow. An interesting failure scenario I’ve seen in the past, is that an airflow worker will start the task, which sends the job request to bigquery, and while the bigquery job is running, the airflow worker can crash. In this scenario, the crashed worker cannot make a call to bigquery to cancel the job.
The scheduler doesn’t know about the bigquery job — it only knows that a worker was charged with executing a task, and that task wasn’t able to be execute. So if you’ve set your scheduler to automatically retry, the scheduler re-schedules the task to run again. The original job might have completed, so the exact same task runs again, which can cause havoc if you haven’t planned your ETL with idempotency in mind.
It can be wise to plan steps in your ETL processes to ensure data integrity though — this allows early detection and correction of errors. Here’s an example setup that checks each operation after execution:
What about clustering?
Clustering is a neat feature of a table, allowing us to specify fields we’ll often use as filters or group by operations, to keep that data stored in roughly that order for each table or partition. This means after bigquery has excluded all irrelevant partitions of a table, it can then also exclude large chunks from each partition when filtering, or make work easier to perform when aggregating by these fields.
A good example might be a partitioned table with user activity for each day, where the partition is based on the date, and then have the partitions clustered by the user’s geographic locations, like country, or device type.
This allows you to effectively elminate large sections of each partition from scanning. Bigquery doesn’t store metadata about the size of the clustered blocks in each partition, so when your write a query that makes use of these clustered columns, it will show the estimated amount of data to be queried based solely on the amount of data in the partitions to be queried, but looking at the query results of the job, the metadata will show a reduced amount of data queried if you filtered it out.
These options don’t affect your clustering method though, you can clster a single partitioned table, or separately cluster each table in a date sharded table, the effect will be the same.
Conclusion
Although the BigQuery docs recommend using partitioned tables when designing a warehouse structure, the date sharded option can be more useful for temporary data engineering purposes when building out more permanent or visible tables.
Enjoyed this? My next article will be an opinion piece about my work habits, and will be published on or before 1st Feb — I’ve been getting through a lot of studying and learning outside of work lately, and want to share some of my patterns. Follow along to get a little notification or something for the next thing. After that will be a technical piece about networking fundamentals I found interesting.
Other BigQuery articles: