Why should I learn how to use the =QUERY function?
I know what you’re thinking - why bother learning just another Google Sheets function?
On the surface, you’re right, it is just another Google Sheets function but dig deeper and you’ll learn that =QUERY is more like the gateway to big data.
The query function operates very, very similarly to Structured Query Language (aka SQL). SQL is often referred to as foundational in learning how to make use of, manipulate and access big data.
Learning how to use =QUERY in a familiar setting like Google Sheets is a pretty good way to start your journey towards working with big data. The best way? Learn SQL. Why? Well SQL is widely used - it’s a lot more likely that SQL is applicable in your home grown and managed database (or even via services like Microsoft’s Azure SQL or Google’s BigQuery).
In addition to learning a foundational big data language, query is powerful in its own right within Google Sheets. It not only makes finding data within spreadsheets easier, opens a world of possibilities that simply aren’t possible or easy to do using functions like IF, SUMIF, FIND, VLOOKUP and others.
If you’ve been introduced to =QUERY before, feel free to skip ahead to the Google Sheets’s tutorial that will help you apply what I’m laying out below. For those unfamiliar, continue reading for a breakdown of the syntax and function.
Introduction to =QUERY
QUERY Function Syntax
Let’s break down the syntax or parts of the function. Via Google’s documentation we can see that the function accepts three parameters or arguments:
- QUERY(data, query, headers)
Defining the parameters in plain english we get:
- data - the reference for where our data lives within the Google Sheets
- query - where we’ll specify what we’re looking for aka the query
- Note that this parameter is the SQL part, and will be the biggest area of learning
- headers - optional, the number of rows that are headers in your data
In addition to their definitions there are a few rules to keep in mind related to each parameter:
- In data:
- The values of a column within data have to be of the same type and can only be a boolean (true/false), numeric (including date/time stamps) and strings.
- If values aren’t all of the same type, the majority type will define the type for that column and minority values will be considered null
- In plain english: the data within a column needs to be the same flavor of data and if not, Google will look at all the values, determine what type has the most appearances, make that the type for column and any data that isn’t of that type will become an empty or null value, more or less meaning it will act as if it’s not there
- In query:
- This value needs to be within quotes and it can be a reference to another cell as long as that one has quotes wrapping it
- In plain english: we’ll need to put our query within quotes or Google won’t know how to interpret it
- In headers:
- Optional value if left blank, or set to -1, Google will make a guess at the headers within your data
- In plain english: this tells Google to not consider x number of rows as data, but instead recognize them as labels we gave to our columns. If we leave this out or set it to -1 Google will scan the data and estimate the number of rows that are header.
Now that we’ve learned about the basic syntax let’s break down an actual query to give you a feel for what this looks like in action. In the below query we’re not going to do anything special - return a few columns of data from a different tab (called “data”) in our spreadsheet.
=query(data!A1:Z1000, “SELECT A, B, D, I”, 1)
Breaking this down parameter by parameter we get:
- data = data!A1:Z1000
- In plain english: our data lives in the tab called data, in column A - Z and row 1 - 1000
- query = “SELECT A, B, D, I”
- In plain english: Grab full columns of A, B, D, and I from our data. Note that we surrounded this in quotes
- headers = 1
- In plain english: Hey Google, the first row of our data is a label for our data, don’t include it with the rest of our data and instead use it as labels for our returned set of data
See that was pretty easy, right? But now you’re probably thinking why would I use this function instead of just a reference to the data tab, a pivot table or a vlookup -- great question, query really shines when you start layering in clauses and data manipulation functions. We’ll cover clauses first, then data manipulation functions, and finally jump into a Google Sheets that will help you apply all this stuff.
QUERY Function Clauses
Clauses are essential to every query (literally, it won’t run without them!) and in essence tell our query what to do. It’s where specify the data that we’re aiming to get returned to us.
We’ve actually already used one, the SELECT clause. This clause specifies the columns to return, and in what order. But there are multiple other, arguably more exciting, clauses we have at our disposal. Here’s a table via Google’s documentation that covers all of the available clauses:
|select||Selects which columns to return, and in what order. If omitted, all of the table's columns are returned, in their default order.|
|where||Returns only rows that match a condition. If omitted, all rows are returned.|
|group by||Aggregates values across rows.|
|pivot||Transforms distinct values in columns into new columns.|
|order by||Sorts rows by values in columns.|
|limit||Limits the number of returned rows.|
|offset||Skips a given number of first rows.|
|label||Sets column labels.|
|format||Formats the values in certain columns using given formatting patterns.|
|options||Sets additional options.|
Note that these clauses have to be used in the order of the above table or you’ll throw an error, and nobody likes to throw errors.
On the surface these may seem pretty foreign but these are things we do all the time, just not explicitly. For instance, we’ve all sorted a table of data by a specific column -- this is no different than using the ORDER BY clause. For example:
=query(data!A1:Z1000, “SELECT A, B, D, I ORDER BY I”, 1)
This would return the same as the above query we covered, but rank everything from lowest to highest based on column I. Note that by default things are ordered ascending order. If we wanted our data from highest to lowest, we’d add DESC after the column we want to order things by, like this:
=query(data!A1:Z1000, “SELECT A, B, D, I ORDER BY I DESC”, 1)
The WHERE clause is a common concept too -- in plain english it’s saying return to me the rows that meet this condition, where the condition can be mathematical (e.g.: <, >, >=, <=, <>, etc) or string comparison operators (e.g.: contains, starts with, ends with, etc) For example:
=query(data!A1:Z1000, “SELECT A, B, D, I WHERE B CONTAINS ‘some criteria we want to match’ ORDER BY I DESC”, 1)
The above query would filter the data at the row level of the prior query, and only allow rows that matched our criteria get returned.
The LIMIT clause is relatively straightforward -- simply limiting the number of rows of data returned. For example:
=query(data!A1:Z1000, “SELECT A, B, D, I WHERE B CONTAINS ‘some criteria we want to match’ ORDER BY I DESC LIMIT 10”, 1)
The above query would return the first ten rows of data that it found vs. the prior query.
The GROUP BY clause aggregates values for each unique combination of values in the group by clause. In plain english let’s pretend you’ve got a bunch of data where the rows represent individual sessions on your website. The first column lists the date the session happened, the second column lists the landing page the user entered on, and the third column has the number of sessions that occurred on that date and started on that landing page. We could use a GROUP BY clause here to aggregate the number of sessions by date. To do this, though, we’ll need to use a data manipulation function. For example:
QUERY Function Types of Data Manipulation
Data manipulation functions fall into three categories, defined via Google as:
- Aggregation functions - intake a single column of values and perform an action across all values in each group (where groups are defined by the group by or pivot clauses, or all rows if not specified)
- In plain english: aggregation functions combine values into groups, e.g. returning the number of sessions in each month for two years worth of data
- Scalar functions - operate over zero or more parameters to produce to produce another value
- In plain english: that definition isn’t very helpful is it? Here’s what you need to know -- scalar functions take a value and turn it into something else related to it, e.g. take a date in this format: YYYY-MM-DD and return to me only the year, return the difference between date x and date y, and return this string but in all upper or lower case
- Arithmetic operators - perform mathematical operations upon anything that evaluates to a single number (i.e. the output of appropriate aggregate functions, operators, or constants)
- In plain english: you can do the math in your query
I’m not going to spend a lot of time going through aggregation functions in this post, mainly because it’s a rabbit hole we could spend a lot of time in and Google’s documentation does a good job at providing context and examples for each.
Having said that, aggregation functions can be used to, well, aggregate or combine values. This is done by wrapping the column we’re calling with the function. For instance, if we wanted the average of values in column D, we’d do that by using this syntax: avg(D). Similarly you can call the max, min, count and sum of values.
Scalar functions are most easily conceptualized by thinking of them as reformatting a value from one state to another. Just like aggregation functions we wrap the value we’re aiming to alter with a scalar function. For instance, if we wanted to return the month that a date falls in we could use this syntax: month(date “2019-06-04”).
Note that values are zero-based, which means they start at an index of 0 so this would return a value of 5 because January = 0, February = 1, etc.
Arithmetic operators are relatively straightforward and have four operators defined: +, -, *, and /. You probably already guessed but the + returns the sum of two numeric values, the - returns the difference between two values, the * returns the product of two numbers and the / returns the quotient of two numbers.
Google Sheets QUERY Tutorial
Before we dive in to our tutorial, I want to note one thing related to authoring queries that isn’t exactly intuitive - when SELECTing multiple columns, string them together using a comma between each. You don’t need to use commas when going from one clause to another though.
Let’s hop over to the Google Sheet I’ve setup to put what we’ve learned here into action.
Start on the instructions tab for a quick overview, the review and get comfortable with the data in the “data” tab and then head over to the “query - easy” tab to test your new query skills.
Lastly, I’m working on adding a more advanced section which will be in the “query - advanced” tab, but you’ll have to check back for that so be sure to subscribe to our newsletter if you’re interested. Also - don't forget to read another one of our posts on Essential Formulas for Google Analytics Data!