Guess what? We’re only a few months away from losing Universal Analytics. If you’re reading this, I’d imagine this isn’t news to you.
I’ll skip the history lesson on why this transition is so monumental for any and all organizations relying on Google Analytics - a.k.a. the data platform we’ve all come to know, love, and trust for over a decade now.
In this post, I'll cover what to expect from native GA4 reports and how that data compares to what you’ll get from Looker Studio (formerly known as "Google Data Studio").
In preparation for the change, I compared reports from Universal Analytics and Google Analytics 4 against the same data in Looker Studio and noticed something peculiar about how the numbers line up between the platforms.
Spoiler alert: UA matched, but GA4 didn’t.
UA and Looker Studio
I said I’d skip the history lesson, but we’re going to need a little bit of context before we dive in. For some time now, Looker Studio has provided a great native connection with UA that allows you to pipe your data into beautiful custom dashboards and reports.
The majority of metrics had a 1:1 match between what you saw in the UA interface and what was displayed in Looker Studio charts - as you would probably expect.
Example 1, Property A
As a quick example, here is a screenshot of Scorecard charts totaling Pageviews, Sessions, and Users over a 15-day period compared to the same period in Universal Analytics. As you can see, these metrics directly match.
You might be asking yourself now - why even bother showing this? Because many organizations use Looker Studio for their KPIs and general performance reports and then jump into Universal Analytics for more detailed analysis and explorations.
Since you’re jumping between tools, it's important to understand whether or not the data is a 1:1 match so you can use either one for ad hoc reports and requests.
Before we jump right into GA4 and Looker, we need to understand the different reporting models featured in GA4. There are currently three different Reporting Identities available - properties use Blended by default.
Blended
- #1 - User ID
- #2 - Google Signals
- #3 - Device ID
- #4 - Modeling
Observed
- #1 - User ID
- #2 - Google Signals
- #3 - Device ID
Device-based
- #1 - Device ID (i.e. Client ID for websites and App Instance ID for apps)
The Blended model uses everything available in GA4’s arsenal to match sessions to users that previously visited the site. The user matching goes in order of the methods listed. For example, if User ID cannot match the session to a user, GA4 will use Google Signals to find a matching visitor, and so on. You will, of course, need to collect User IDs and enable Google Signals in order to fully receive the benefits of the Blended model.
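To make the fallback order concrete, here is a minimal Python sketch of how I picture it. This is an illustration only, not GA4's actual implementation: the `resolve_identity` function and the field names `user_id`, `google_signals`, and `device_id` are hypothetical, and the Modeling step is left out because it estimates behavior rather than looking up an identifier.

```python
from typing import Optional

# Hypothetical orderings that mirror the lists above. These names are
# illustrative and are not GA4 API fields.
BLENDED_ORDER = ["user_id", "google_signals", "device_id"]
DEVICE_ONLY = ["device_id"]


def resolve_identity(session: dict, methods: list) -> Optional[tuple]:
    """Return (method, identifier) for the first method that has a value."""
    for method in methods:
        value = session.get(method)
        if value:
            return method, value
    return None  # no identifier available for this session


# A made-up session with no User ID collected.
session = {"user_id": None, "google_signals": "gs-123", "device_id": "cid-456"}

print(resolve_identity(session, BLENDED_ORDER))  # ('google_signals', 'gs-123')
print(resolve_identity(session, DEVICE_ONLY))    # ('device_id', 'cid-456')
```

The same session ends up attributed through a different identifier depending on the order in use, which is one way user counts can shift when you switch settings.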
While the models all rely on Device ID at some point, they can vary greatly in reports and explorations depending on which is selected. While I would expect to see different figures for users, sessions also change dramatically. As soon as we have a better understanding of why, we’ll let you know.
Now, how these different reporting models change data requested from the API and in Looker Studio remains to be seen. Switching between them changes the data presented within the GA4 interface, but it does not appear to alter what’s sent to Looker Studio.
Sampling and Data Thresholds
Two other features in Google Analytics 4 that you can expect to influence your data in the interface are data thresholds and sampling.
Thresholds are added to low-volume user-focused reports in order to maintain privacy. In Universal Analytics, it was possible to isolate sessions in a way to track individuals and leverage additional demographic, interest data and other signals to identify the user. To prevent this from happening in GA4, Google now applies thresholds to data whenever a minimum number of users is not met for a particular report.
In the opposite direction, high-volume reports are subject to sampling. While this isn’t new in GA4, it is something to keep in mind when exploring your data. To keep requests and overall loads down, Google will apply sampling to reports that exceed 10 million events in free accounts and 1 billion events in 360 accounts.
Sampling is used in everything from estimating election results to testing soil composition in a particular region - so this isn’t a GA4 exclusive.
You may see differences by a few thousand in cases where you have millions of events reported, but the overall proportions and ratios should be consistent.
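If you want to see this for yourself, the short Python simulation below draws a 10% random sample from a made-up event stream and compares event shares. It is only a sketch: the event names, weights, and sample size are invented, and I am assuming simple random sampling, which may not be how Google samples GA4 data.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

# Made-up event stream: roughly 70% page_view, 20% scroll, 10% purchase.
events = random.choices(
    ["page_view", "scroll", "purchase"], weights=[70, 20, 10], k=200_000
)

# Draw a 10% simple random sample (an assumption, see above).
sample = random.sample(events, k=20_000)
scale = len(events) / len(sample)


def share(rows, name):
    """Fraction of rows that are the given event."""
    return rows.count(name) / len(rows)


for name in ("page_view", "scroll", "purchase"):
    actual = events.count(name)
    estimate = round(sample.count(name) * scale)
    print(
        f"{name}: full share={share(events, name):.3f} "
        f"sample share={share(sample, name):.3f} "
        f"actual={actual} estimated={estimate}"
    )
```

The estimated counts drift from the actual ones, but the shares stay close, which is the behavior described above.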
Differences in Native GA4 and Looker Studio Reports
Now onto what got me started on this project in the first place - the differences in data between Looker Studio and GA4.
First off, Google may not be applying the same kind of models to requested data as in GA4 itself - data aggregation differs depending on the time frame it is requested in. One clear example of this is in time series charts compared to scorecards. In the graphs below, you can see the daily totals as exported into Excel compared to the scorecards for the same data source, during the same time period.
Example 2, Property A
For the properties tested, views always matched but sessions and active users always varied. Sessions were reported higher in the scorecard than in the chart totals, while users were underreported across the board in the scorecard.
This leads us to believe that Google measures totals for a given time period differently than day-to-day. None of the differences, however, were to a degree that would make the reports unusable. But this isn’t where the discrepancies are the most alarming.
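One general reason a period total can come in lower than the sum of daily figures is that unique counts are not additive: a user who visits on three days counts three times in the daily sum but only once for the period. The Python sketch below shows this with an invented visit log. It is a plausible contributor to the lower user figure in the scorecard, not a confirmed explanation, and it does not account for the higher session figure.

```python
# Hypothetical visit log of (day, user) pairs. All values are made up.
visits = [
    ("2023-03-01", "ann"), ("2023-03-01", "bob"),
    ("2023-03-02", "ann"), ("2023-03-02", "cy"),
    ("2023-03-03", "ann"), ("2023-03-03", "bob"),
]

# Unique users per day, as a time series chart would report them.
daily_users = {}
for day, user in visits:
    daily_users.setdefault(day, set()).add(user)

# Adding up the daily figures counts returning users more than once.
sum_of_daily = sum(len(users) for users in daily_users.values())

# Unique users across the whole period, as a scorecard would report them.
period_users = len({user for _, user in visits})

print(sum_of_daily)  # 6
print(period_users)  # 3
```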
What really caught my attention is just how far off the data can be between these two tools.
For the following examples, all reports use the same setup as above - exported time series data and scorecards from Looker Studio over a 15-day period compared to the same period in Google Analytics 4 explorations grouped by day. Also, none of these examples were subject to thresholds or sampling.
Example 3, Property A
Example 4, Property B
Example 5, Property C
The last one is what truly caught me off guard. Considering just how far off it was, I wanted to know how close Device-based reporting would get since it should result in the highest number of sessions and users compared to the other models.
Example 6, Property C
Even changing the Reporting ID to the most generous setting in terms of volume, the figures are still drastically different.
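If you want to run the same kind of comparison on your own properties, a few lines of Python are enough to put a number on the gap. This is a minimal sketch: the `pct_diff` helper is my own, and the figures are placeholders rather than values from the properties shown here.

```python
def pct_diff(reference: float, other: float) -> float:
    """Percent difference of `other` relative to `reference`."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return (other - reference) / reference * 100


# Placeholder totals for the same date range. Replace with your own exports.
ga4_exploration = {"sessions": 1000, "active_users": 800}
looker_scorecard = {"sessions": 1100, "active_users": 760}

for metric, ga4_value in ga4_exploration.items():
    diff = pct_diff(ga4_value, looker_scorecard[metric])
    print(f"{metric}: {diff:+.1f}%")
```

Tracking the percent difference rather than the raw gap makes it easier to compare properties of very different sizes.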
What Do We Do About Reporting?
Google admits that they do not benchmark data between tools, but this is somewhat expected when comparing systems as different as Google Ads and Google Analytics.
But I would personally expect figures to at least get close when pulling from the same data source. It seems, however, that metrics in GA4 receive other treatments aside from sampling and reporting identity calculations compared to the data you pull into Looker Studio.
We recommend being consistent with where you pull your data. If you start exporting reports and explorations from the GA4 interface, keep that as your source of truth. If you build detailed reports in Looker Studio for monthly review, stick to that.
Do not expect the exact same figures in Looker Studio as in GA4 - doing so will cause quite a bit of confusion for you and your organization.