Loading complex CSV files into BigQuery using Google Sheets

 


Building an ELT pipeline using Google Sheets as an intermediary

BigQuery offers the ability to quickly import a CSV file, both from the web user interface and from the command line:


  bq load --source_format CSV --autodetect \
    mydataset.mytable ./myfile.csv


Limitations of autodetect and import

This works for plain-vanilla CSV files, but it can fail on complex ones. As an example of a file it fails on, let’s take a dataset of New York City Airbnb rentals from Kaggle. The dataset has 16 columns, and one of them is pretty much free-form text, which means it can contain emojis, newline characters, and the like.


Indeed, try to load this file into BigQuery:

  bq load --source_format CSV --skip_leading_rows 1 \
    advdata.airbnb_nyc_2019 AB_NYC_2019.csv \
    id:string,name:string,host_id:string,host_name:string,neighbourhood_group:string,neighbourhood:string,latitude:float,longitude:float,room_type:string,price:float,minimum_nights:integer,number_of_reviews:integer,last_review:string,reviews_per_month:float,calculated_host_listings_count:integer,availability_365:integer

and we get errors like:


  Error while reading data, error message: Error detected while
  parsing row starting at position: 104942. Error: Missing close
  double quote (") character.


This is because a row is spread across multiple lines, and so the starting quote on one line is never closed. This is not an easy problem to solve — lots of tools struggle with CSV files that have new lines inside cells.
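To see why this trips up line-oriented parsing, here is a minimal Python sketch. The sample data is made up for illustration and is not taken from the Airbnb file. It compares naive line splitting with Python's standard csv module, which keeps a quoted newline inside the cell.

```python
import csv
import io

# Hypothetical sample: a header plus two records; the second record
# has a newline inside a quoted cell.
sample = 'id,name\n1,"Cozy loft"\n2,"Sunny room\nnear the park"\n'

# Naive approach: treat every physical line as one row.
physical_lines = sample.splitlines()

# Quote-aware approach: csv.reader keeps the quoted newline inside the cell.
records = list(csv.reader(io.StringIO(sample, newline="")))

print(len(physical_lines))  # number of physical lines in the text
print(len(records))         # number of logical CSV records
print(records[-1])          # the multi-line cell stays in one record
```

The physical line count is higher than the record count, which is the mismatch that a parser reading one line at a time runs into.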


Sheets to the rescue

Google Sheets, on the other hand, has a much better CSV import mechanism. Open up a Google Sheet, import the CSV file and voila …




The cool thing is that by using a Google Sheet, you can do interactive data preparation in the Sheet before loading it into BigQuery.

First, delete the first row (the header) from the sheet. We don’t want that in our data.

ELT from a Google Sheet

Once it is in Google Sheets, we can use a handy little trick — BigQuery can directly query Google Sheets! To do that, we define the Google Sheet as a table in BigQuery:




Steps from the BigQuery UI

  • Select a dataset and click on Create Table
  • Select Drive as the source, specify the Drive URL to the Google Sheet
  • Set Google Sheet as the file format
  • Give the table a name. I named it airbnb_raw_googlesheet
  • Specify the schema:
    id:string,name:string,host_id:string,host_name:string,neighbourhood_group:string,neighbourhood:string,latitude:float,longitude:float,room_type:string,price:float,minimum_nights:integer,number_of_reviews:integer,last_review:string,reviews_per_month:float,calculated_host_listings_count:integer,availability_365:integer


This table does not copy the data from the sheet — it queries the sheet live.
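If you would rather script the table definition than click through the UI, one possible approach is to generate a CREATE EXTERNAL TABLE statement from the same schema string. The sketch below is an illustration only: the helper name external_sheet_ddl and the SHEET_URL placeholder are made up here, and it assumes BigQuery's external table DDL with format = 'GOOGLE_SHEETS'. It only builds the SQL text; running it in BigQuery is left to you.

```python
# Map the lowercase type names used in the schema string to BigQuery SQL types.
TYPE_MAP = {"string": "STRING", "float": "FLOAT64", "integer": "INT64"}


def external_sheet_ddl(table, sheet_url, schema):
    """Build a CREATE EXTERNAL TABLE statement for a Google Sheet (hypothetical helper)."""
    columns = []
    for field in schema.split(","):
        name, type_name = field.strip().split(":")
        columns.append(f"  {name} {TYPE_MAP[type_name.lower()]}")
    column_sql = ",\n".join(columns)
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE {table} (\n{column_sql}\n)\n"
        f"OPTIONS (\n  format = 'GOOGLE_SHEETS',\n  uris = ['{sheet_url}']\n)"
    )


# SHEET_URL is a placeholder; substitute the Drive URL of your own sheet.
SHEET_URL = "https://docs.google.com/spreadsheets/d/YOUR_SHEET_ID"

ddl = external_sheet_ddl(
    "advdata.airbnb_raw_googlesheet",
    SHEET_URL,
    "id:string,latitude:float,minimum_nights:integer",  # shortened schema for brevity
)
print(ddl)
```

Pass the full 16-column schema string from the steps above to get a definition that matches the table created in the UI.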

So, let’s copy the data as-is into BigQuery (of course, we could do some transformation here as well):


  CREATE OR REPLACE TABLE advdata.airbnb_nyc_2019 AS
  SELECT * FROM advdata.airbnb_raw_googlesheet


How to automate

You can automate these steps:

  1. Here’s an article on how to read a CSV file into Sheets using Python
  2. From then on, use dataform.co or BigQuery scripts to define the BigQuery table and do the ELT.
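As a rough sketch of the first step, the CSV can be parsed with Python's csv module into the list-of-rows shape that the Sheets API accepts as "values". The function name csv_to_sheet_values and the sample text are hypothetical, and authentication and the upload itself are deliberately left out.

```python
import csv
import io


def csv_to_sheet_values(csv_text, drop_header=True):
    """Parse CSV text into a list of rows, optionally dropping the header row."""
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    return rows[1:] if drop_header else rows


# Made-up sample standing in for the contents of AB_NYC_2019.csv.
sample = 'id,name\n1,"Sunny room\nnear the park"\n'

# Request body in the {"values": [[...], ...]} shape used by the Sheets API.
body = {"values": csv_to_sheet_values(sample)}
print(body)
```

A body like this could then be sent with the Sheets API spreadsheets.values.update method, for example with valueInputOption set to "RAW". Dropping the header here mirrors the manual step of deleting the first row from the sheet.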

To import complex CSV files into BigQuery, build an ELT pipeline using Google Sheets as an intermediary. This allows you to handle CSV files with new lines and other special characters in the columns. Enjoy!








