What is Google BigQuery?

Google BigQuery is a serverless, scalable data warehouse available on Google Cloud. Its greatest advantage is that it lets companies store and analyze big data without having to set up and manage infrastructure.

Companies can analyze and transform data using familiar ANSI-standard SQL. Google BigQuery can be accessed through a REST API, the Google BigQuery web interface (browser) and a command-line tool.
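As a rough illustration of the REST route, the sketch below only builds the HTTP request for a SQL query; it does not send anything. It assumes the jobs.query endpoint of the BigQuery v2 REST API. The project ID, table name and SQL text are hypothetical placeholders, and a real call would also need an OAuth 2.0 access token in the request headers.

```python
import json

# Base URL of the BigQuery v2 REST API.
BASE_URL = "https://bigquery.googleapis.com/bigquery/v2"


def build_query_request(project_id, sql):
    """Return (url, json_body) for a synchronous query request. Nothing is sent."""
    url = f"{BASE_URL}/projects/{project_id}/queries"
    # useLegacySql=False asks BigQuery to interpret the text as standard SQL.
    body = {"query": sql, "useLegacySql": False}
    return url, json.dumps(body)


# Hypothetical project and table, used only for illustration.
url, body = build_query_request(
    "my-project",
    "SELECT COUNT(*) AS n FROM `my-project.sales.orders`",
)
print(url)
print(body)
```

The web interface and the command-line tool submit the same kind of SQL text, so a query written for one access path can usually be reused in the others.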

Because Google BigQuery decouples storage from compute, the two layers of the architecture can run and scale independently.

Google BigQuery integrates with other Google Cloud Platform services (Bigtable, Notebooks, Cloud SQL, Compute Engine, machine learning services, etc.).

Components of Google BigQuery

  • Projects: Google BigQuery projects are like folders that contain the other components, such as users and datasets.

  • Datasets: containers that hold tables.

  • Tables: structured data organized in rows and columns.

  • Jobs: processes that run in the background. Jobs are used to execute queries and to import or export data.

  • ACLs: Access Control Lists, which define who can access a resource.
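The project, dataset and table levels show up directly in how a table is addressed in standard SQL, as project.dataset.table. The following is a minimal sketch of splitting such an identifier into its parts. The TableRef type, the parse_table_id helper and the example names are hypothetical, and the sketch ignores older project IDs that contain a domain prefix.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    """The three levels of the hierarchy described above."""
    project: str
    dataset: str
    table: str


def parse_table_id(table_id: str) -> TableRef:
    """Split 'project.dataset.table' into its three components."""
    parts = table_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected project.dataset.table, got {table_id!r}")
    return TableRef(*parts)


ref = parse_table_id("my-project.sales.orders")
print(ref.dataset)  # sales
```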

Google BigQuery Workflow

1- Load the data (ETL) into the Google Cloud Platform (GCP). Data can come from several sources, such as a CRM, server logs, data already available on the Google Cloud Platform or another big data provider. A first step in the ETL process is to load the data into Google Cloud Storage for preprocessing. This step can be skipped if the data is already preprocessed.

2- Preprocess the data using the Google Cloud Platform: once the data is loaded into GCP, there are several ways to preprocess or transform it for further analysis, such as Hadoop or MapReduce.

3- Load the data into Google BigQuery. Preprocessing is optional: in some cases the data does not need it and can be loaded directly into Google BigQuery.

4- Once the data is loaded into Google BigQuery, it can be used for dashboards, log analysis, data analysis and data science.
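What the preprocessing step looks like depends entirely on the data. As one hedged example, the sketch below cleans a small CSV export and rewrites it as newline-delimited JSON, which is one of the formats BigQuery can load. The column names and sample rows are made up, and the sketch assumes every row has the same number of fields as the header.

```python
import csv
import io
import json


def csv_to_ndjson(csv_text: str) -> str:
    """Trim whitespace, turn empty values into nulls, emit one JSON object per line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        cleaned = {
            key.strip(): ((value or "").strip() or None)
            for key, value in row.items()
        }
        lines.append(json.dumps(cleaned))
    return "\n".join(lines)


# Hypothetical export with stray spaces and a missing value.
raw = "user_id, country\n1, US\n2, \n"
print(csv_to_ndjson(raw))
```

In a real pipeline the resulting file would typically be written to Google Cloud Storage and loaded from there, as described in steps 1 and 3.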

What is Google BigQuery used for?

  • Data Dashboards.

  • Log Analysis.

  • Data Analysis.

  • Data Science: produce predictions for business decisions, machine learning.

What are the main competitors of Google BigQuery?

Google BigQuery vs AWS Athena

Amazon Web Services (AWS) offers two competitors to Google BigQuery: Athena and Redshift. Amazon Athena has a lot in common with Google BigQuery. Like Google BigQuery, Athena uses ANSI-standard SQL. Athena queries data stored in Amazon Simple Storage Service (Amazon S3) and lets users manage and analyze data without having to set up infrastructure.

If a company already uses S3 for data storage, it could make sense to use Athena.

Performance

In the published tests cited here, Google BigQuery was faster than AWS Athena:

  • A simple count on a 130 GB dataset with 1+ billion rows took 11.7 seconds in Athena vs 1.8 seconds in BigQuery [1].
  • A simple sum of two columns (same dataset as above) took 15.85 seconds in Athena vs 4.5 seconds in BigQuery.
  • A more complex query (same dataset as above) took 22.76 seconds in Athena vs 7.8 seconds in Google BigQuery.
  • Another query took 26 minutes on Athena and 1 minute 29 seconds on Google BigQuery.
  • Other tests showed that BigQuery outperformed Athena in different types of queries.

Keep in mind that BigQuery and Athena both rely on pooled resources, which means they do not guarantee consistent performance. The same query could take one second on the first run and three seconds on the next.
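Because of this variability, a single timing says little. A minimal sketch of a fairer measurement is to run the same query several times and report the spread. The time_runs helper is hypothetical, and fake_query is a local stand-in for whatever function actually submits the query to BigQuery or Athena.

```python
import statistics
import time


def time_runs(run_query, repeats=5):
    """Call run_query several times and summarize the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


def fake_query():
    # Stand-in workload; replace with a real query call.
    sum(range(100_000))


stats = time_runs(fake_query)
print(stats)
```

Comparing medians rather than single runs makes a benchmark between two services less sensitive to one unusually slow or fast execution.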

Costs

At the time of writing, BigQuery and Athena both charge $5 per TB of data scanned by queries. For most common data formats, Google BigQuery has lower costs than Athena [2].
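To make the per-TB rate concrete, here is a small worked example under stated assumptions: the $5/TB figure quoted above, 1 TB taken as 10^12 bytes, and a query that scans the whole 130 GB dataset from the performance tests. Actual bills depend on how each service measures the bytes scanned, so treat the result as an order-of-magnitude estimate.

```python
PRICE_PER_TB_USD = 5.0   # on-demand rate quoted in the text
BYTES_PER_TB = 10**12    # assumption: decimal terabytes


def on_demand_cost(bytes_scanned: int) -> float:
    """Estimated query cost in USD for a given number of bytes scanned."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


full_scan = on_demand_cost(130 * 10**9)  # 130 GB scanned once
print(f"{full_scan:.2f}")  # 0.65
```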

Google BigQuery vs Amazon Redshift

Redshift is another Amazon product for big data analysis. Redshift does not use S3 as its storage layer, so data has to be preprocessed and loaded into it first. For traditional relational datasets, Redshift is a better option than Athena.

In the cited comparisons between Google BigQuery and Amazon Redshift, Redshift shows better performance [2][3].

Regarding cost, BigQuery has a more complex pricing model that can discourage data usage: it charges for storage and for usage, so every additional query adds to the bill. Redshift charges by the amount of data you store and by the number of nodes.

BigQuery is simpler to use because it abstracts away the details of the hardware, database and configuration. Redshift, by comparison, can involve more setup and maintenance work.
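The difference between the two pricing shapes can be sketched as two small functions. This is only an illustration of the structure described above, not a price calculator: every rate is a caller-supplied placeholder, the function names are hypothetical, and the Redshift side is simplified to a per-node hourly charge.

```python
def bigquery_monthly_cost(stored_tb, scanned_tb, storage_rate, query_rate):
    """Usage-based shape: pay for what is stored plus what queries scan."""
    return stored_tb * storage_rate + scanned_tb * query_rate


def redshift_monthly_cost(nodes, node_hourly_rate, hours=730):
    """Capacity-based shape: pay for provisioned nodes, whether queried or not."""
    return nodes * node_hourly_rate * hours


# Placeholder rates chosen only to show the arithmetic; they are not real prices.
print(bigquery_monthly_cost(stored_tb=2, scanned_tb=10, storage_rate=20.0, query_rate=5.0))
print(redshift_monthly_cost(nodes=2, node_hourly_rate=0.25, hours=100))
```

With the usage-based shape, the bill grows with every extra terabyte scanned; with the capacity-based shape, it stays flat until more nodes are added.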

Microsoft Azure

Snowflake

Redshift

Redshift charges hourly

References


  1. https://medium.com/cloudwithmore/aws-athena-vs-google-bigquery-81a5e885d5c6

  2. https://www.gab.lc/articles/bigquery-vs-redshift-vs-athena/

  3. https://blog.panoply.io/a-full-comparison-of-redshift-and-bigquery