Skip to main content

Loading complex CSV files into BigQuery using Google Sheets

 


Building an ELT pipeline using Google Sheets as an intermediary

BigQuery offers the ability to quickly import a CSV file, both from the web user interface and from the command line:


  bq load --source_format CSV --autodetect \
mydataset.mytable ./myfile.csv0


Limitations of autodetect and import

This works for your plain-vanilla CSV files, but can fail on complex CSV files. As an example of a file it fails on, let’s take a dataset of New York City Airbnb rentals data from Kaggle. This dataset has 16 columns, but one of the columns consists of pretty much free-form text. This means that it can contain emojis, new line characters,


new characters


Indeed, try to open this file up with BigQuery:

  01 bq load --source_format CSV --skip_leading_rows 1 \
02     advdata.airbnb_nyc_2019 AB_NYC_2019.csv \
03 id:string,name:string,host_id:string,host_name:string,neighbourhood_group:string,neighbourhood:string,latitude:float,longitude:float,room_type:string,price:float,minimum_nights:integer,number_of_reviews:integer,last_review:string,reviews_per_month:float,calculated_host_listings_count:integer,availability_365:integer

and we get the errors like:


  01 Error while reading data, error message: Error detected while
02 parsing row starting at position: 104942. Error: Missing close
03 double quote (") character.


This is because a row is spread across multiple lines, and so the starting quote on one line is never closed. This is not an easy problem to solve — lots of tools struggle with CSV files that have new lines inside cells.


Sheets to the rescue

Google Sheets, on the other hand, has a much better CSV import mechanism. Open up a Google Sheet, import the CSV file and voila …


sheets


The cool thing is that by using a Google Sheet, you can do interactive data preparation in the Sheet before loading it into BigQuery.

First, delete the first row (the header) from the sheet. We don’t want that in our data.

ELT from a Google Sheet

Once it is in Google Sheets, we can use a handy little trick — BigQuery can directly query Google Sheets! To do that, we define the Google Sheet as a table in BigQuery:


create table


Steps from the BigQuery UI

  • Select a dataset and click on Create Table
  • Select Drive as the source, specify the Drive URL to the Google Sheet
  • Set Google Sheet as the file format
  • Give the table a name. I named it airbnb_raw_googlesheet
  • Specify the schema:
  01 id:string,name:string,host_id:string,host_name:string,neighbourhood_group:string,neighbourhood:string,latitude:float,longitude:float,room_type:string,price:float,minimum_nights:integer,number_of_reviews:integer,last_review:string,reviews_per_month:float,calculated_host_listings_count:integer,availability_365:integer


This table does not copy the data from the sheet — it queries the sheet live.

So, let’s copy the data as-is into BigQuery (of course, we could do some transformation here as well):


  01 CREATE OR REPLACE TABLE advdata.airbnb_nyc_2019 AS
02 SELECT * FROM advdata.airbnb_raw_googlesheet


How to automate

You can automate these steps:

  1. Here’s an article on how to read a CSV file into Sheets using Python
  2. From then on, use dataform.co or BigQuery scripts to define the BigQuery table and do the ELT.

To import complex CSV files into BigQuery, build an ELT pipeline using Google Sheets as an intermediary. This allows you to handle CSV files with new lines and other special characters in the columns. Enjoy!









Comments

Popular posts from this blog

Use Vault for Gmail Confidential Messages and Jamboard Files

Google vault will be supporting two new formats in the future, Gmail confidential mode emails & Jamboard files stored in Google Drive. Google Vault gives you a chance to retain, hold, search, and export data to support your organization’s retention and eDiscovery needs. This dispatch includes support for new information types with the goal that you can thoroughly oversee your association's information. What happens when individuals in your association sends confidential messages? Vault can hold, retain, search, and export all confidential mode messages sent by users in your association. Messages are constantly accessible to Vault, notwithstanding when the sender sets a termination date or denies access to private messages. Here’s an example of what admin@ink-42.com will see in Vault when they search for sam@ink-42.com and preview this email sent by lisa@ink-42.com . But It’ll not work vise versa. Admins can hold, retain, search and export message headers and s

Zoom’s Work Transformation Summit on Jan. 19: Fresh Approaches for Moving Forward

These past two years have undoubtedly reshaped work. More specifically, these past two years — shuffling between remote, in-person, and hybrid work scenarios — reshaped what employees expect out of their jobs, how they want to work, and what the office means to them.  Organizations are challenged with making big decisions to meet those expectations, and those decisions will dramatically alter how they hire, manage their facilities, buy technology, and maintain productivity. Simply adjusting policies and retooling previous work models won’t do. It takes a comprehensive reimagining. To help organizations navigate this next phase of work, Zoom is hosting our  Work Transformation Summit  on Jan. 19, a free, half-day virtual event designed to provide you and your organization with meaningful strategies, creative approaches, and innovative solutions for redefining work.  Summit attendees will have the opportunity to hear from peers and industry experts on the importance of embracing technolo

Access well-known educational technology tools straight from Google Classroom.

  We're making it simpler for instructors to use popular EdTech products that are most effective for their class right in Google Classroom with a new seamless integration of single sign-on, assigning, and grading. With the help of this feature, teachers can find, assign, and grade interesting content for their classes, and both teachers and students can access their EdTech tools without needing to navigate to other websites or apps or go through a cumbersome login process that requires remembering numerous usernames and passwords. This offers a more simplified experience when using technology to affect learning, in addition to saving instructors and students time. We partnered with 15+ EdTech companies to build custom add-ons, including Kahoot!, Pear Deck, IXL, and Nearpod.  Admins :  In order for educators to use add-ons, district administrators must provide access to them. For further information on how to install the add-ons functionality and specific add-ons for a domain, OU, o