Sampling in Google Analytics can be a real challenge. Read this post to enhance your knowledge on sampling, hit limits, thresholding and cardinality in Google Analytics 4.
I have been using Google Analytics for 10+ years now and one of the biggest pains in Universal Analytics is sampling. There are certainly ways to deal with it, but for the regular (non-paying) user there isn’t a very good solution.
Today, I will dive into the concept of data sampling in Google Analytics 4 and related topics you need to understand to remain confident in your data.
Table of Contents
- Sampling in Google Analytics 4
- Standard vs Advanced Reports in GA4
- Google Analytics 4 vs Universal Analytics
- Hit Limits in GA4
- Thresholds in GA4
- Cardinality in GA4
- BigQuery and GA4
- Concluding Thoughts
Let’s start with an introduction to sampling in Google Analytics 4.
Sampling in Google Analytics 4
One of the questions I often get from users starting with Google Analytics 4 is related to data sampling.
“Is the data in Google Analytics 4 still sampled as it is very challenging for me in Universal Analytics?”
Sometimes it is. We will explore this in more detail below, with some examples from Google Analytics 4.
Sampling in GA4 might occur if you use the advanced features in Google Analytics and exceed a certain event threshold.
Standard vs Advanced Reports in GA4
In Google Analytics 4, the standard reports are always unsampled. This is true even if you apply secondary dimensions, filters or other report modifications.
You can find the standard reports under “Reports” in the main navigation:
The green icon above indicates that the reports are unsampled.
Google Analytics shows an orange icon when data sampling occurs or as a warning regarding a certain threshold that might be applied.
In the example above, the report is based on 100% of available data, but there is a warning about a threshold.
The standard, default report section is a good start, but very limiting if you want to get the most out of GA4. A wide range of advanced features need to be mastered to get the most out of this new version of Google Analytics.
Sampling might occur when you create an advanced analysis in GA4.
This is where you can really dig through your data and derive greater insights. This section currently contains the following report templates:
- (Blank)
- Free-Form
- Funnel exploration
- Path exploration
- Segment overlap
- Cohort exploration
- User lifetime
I will share more details on these reporting options in a future blog post. For now, you need to understand that sampling might occur if you use these advanced analysis features in Google Analytics 4.
In this example, the event count is over 11 million and 90% of the available data is used to generate the report. In this case I would still trust the data, but in general I would be careful if this percentage is lower than 70 or 80%.
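To make the rule of thumb concrete, here is a minimal Python sketch. It assumes that a sampled count is scaled up proportionally to estimate the full total. That is my simplification for illustration, not a description of Google's actual algorithm, and the function names and numbers are hypothetical.

```python
def estimate_total(sampled_count: int, sample_rate: float) -> int:
    """Scale a count observed in the sample up to an estimate for 100% of the data."""
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    return round(sampled_count / sample_rate)


def is_trustworthy(sample_rate: float, min_rate: float = 0.8) -> bool:
    """Rule of thumb from this post (not exact science): be careful below 70-80%."""
    return sample_rate >= min_rate


# Hypothetical numbers: 9.9 million events observed in a 90% sample.
print(estimate_total(9_900_000, 0.90))
print(is_trustworthy(0.90))
print(is_trustworthy(0.60))
```

The default of 0.8 is simply the upper end of the range I mention above. Lower it to 0.7 if you are comfortable with a bit more uncertainty.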
Google Analytics 4 vs Universal Analytics
You might wonder, how does sampling in Google Analytics 4 exactly differ from Universal Analytics?
In Universal Analytics, the default or standard reports are always unsampled. However, sampling occurs if you apply secondary dimensions, segments or other ad-hoc queries to your dataset. Data sampling occurs at a certain threshold and it depends on whether you are a paying customer or not:
- Analytics Standard: 500k sessions at the property level for the date range you are using
- Analytics 360: 100M sessions at the view level for the date range you are using
Read this post about Universal Analytics and sampling if you want to learn how you can best deal with sampling.
Back to Google Analytics 4. The default or standard reports are always unsampled (you can’t apply segments here). This is true even if you apply ad-hoc queries to your dataset. You might have noticed that the number and variety of default reports is greatly reduced in Google Analytics 4 compared to Universal Analytics.
The advanced reports in the Explore/Analysis section are usually sampled if your query covers more than 10 million events and the report you create is not a pre-existing standard report.
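The differences between the two versions can be summarized in a small Python sketch. The limits are the ones mentioned in this section. The function and constant names are hypothetical, and the checks only tell you when sampling might kick in, not when it definitely will.

```python
UA_STANDARD_SESSION_LIMIT = 500_000    # property level, for the selected date range
UA_360_SESSION_LIMIT = 100_000_000     # view level, for the selected date range
GA4_EXPLORE_EVENT_LIMIT = 10_000_000   # events in the selected date range


def ua_may_sample(sessions: int, is_360: bool, is_adhoc_query: bool) -> bool:
    """UA: default reports are unsampled; ad-hoc queries may sample above the session limit."""
    limit = UA_360_SESSION_LIMIT if is_360 else UA_STANDARD_SESSION_LIMIT
    return is_adhoc_query and sessions > limit


def ga4_may_sample(events: int, is_standard_report: bool) -> bool:
    """GA4: standard reports are unsampled; explorations may sample above the event limit."""
    return (not is_standard_report) and events > GA4_EXPLORE_EVENT_LIMIT


# Hypothetical property: 750k sessions and 11M events in the date range.
print(ua_may_sample(750_000, is_360=False, is_adhoc_query=True))
print(ga4_may_sample(11_000_000, is_standard_report=True))
print(ga4_may_sample(11_000_000, is_standard_report=False))
```

Note that the unit changes between the two versions: Universal Analytics counts sessions, while Google Analytics 4 counts events.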
Hit Limits in GA4
Here comes the great thing…
In Universal Analytics (free) there is a limit of 10 million hits per property per month.
Google Analytics 4 is also free (there will be a paid version as well) and has no hit/event limits. This is really great if your company has a high number of daily users on the site and/or app and triggers loads of events.
Thresholds in GA4
Data thresholds in GA4 are system-defined and cannot be adjusted. Google Analytics 4 applies them to certain dimensions to protect users’ privacy.
Demographic and affinity dimensions are mostly affected. Here is what Google says:
If a report or exploration includes demographic information, such as Age, Gender, or Interest Category, and the reporting identity relies on the device ID, the row containing that data may be withheld if there aren’t enough total users to prevent individual users from being identified.
Let me show you an example of the standard “Demographic details” report:
The dimension value “unknown” is applied to the Country in the majority of cases and has a strong impact on the gender data visible in GA4.
Here is a more detailed look at United Kingdom users:
Over 95% of the Gender dimension values are not visible for users from the UK. Quite a challenge to work with this data!
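To illustrate the idea behind thresholding, here is a simplified Python sketch. The real thresholds are system-defined and not published, so the `min_users` value, the function names and the user counts below are purely hypothetical.

```python
def apply_threshold(rows: dict[str, int], min_users: int) -> dict[str, int]:
    """Keep only the rows with at least min_users users; the rest are withheld."""
    return {value: users for value, users in rows.items() if users >= min_users}


def withheld_share(rows: dict[str, int], min_users: int) -> float:
    """Share of all users that sit in withheld rows."""
    total = sum(rows.values())
    if total == 0:
        return 0.0
    visible = sum(apply_threshold(rows, min_users).values())
    return (total - visible) / total


# Hypothetical gender rows for a single country.
gender_rows = {"female": 40, "male": 35, "unknown": 925}
print(apply_threshold(gender_rows, min_users=50))
print(withheld_share(gender_rows, min_users=50))
```

In this made-up example only the “unknown” row survives, which is similar to what you see in the report: the rows you actually care about disappear.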
Cardinality in GA4
Each report dimension (e.g., User source, User medium, User campaign, Gender etc.) has a number of values that can be assigned to it. The total number of unique values for a dimension is known as its cardinality.
Gender is an example of a low-cardinality dimension. On the other hand, Page Path is a high-cardinality dimension as it usually contains many different unique values.
Analytics queries different tables before showing a table in a report. Be aware of potential discrepancies when a query of the aggregated-data or event-level tables returns more rows than Analytics can render.
The result is that part of the dataset is being aggregated as (other).
In most cases this only occurs if a dimension has around 20,000 unique values per day or more. However, I have seen exceptions to the rule:
In this example there are only 317 unique values, but an (other) row still appears in the report.
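The concept is easy to reproduce in a few lines of Python. This is a simplified illustration of cardinality and of rolling the long tail up into (other). It is not how Google Analytics decides which rows to keep, and the page paths and row limit are hypothetical.

```python
from collections import Counter


def cardinality(values: list[str]) -> int:
    """Number of unique values of a dimension."""
    return len(set(values))


def roll_up_other(counts: Counter, row_limit: int) -> dict[str, int]:
    """Keep the row_limit most frequent values and aggregate the rest as (other)."""
    kept = dict(counts.most_common(row_limit))
    remainder = sum(counts.values()) - sum(kept.values())
    if remainder:
        kept["(other)"] = remainder
    return kept


# Hypothetical page paths from seven page views.
page_paths = ["/"] * 3 + ["/pricing"] * 2 + ["/blog/a", "/blog/b"]
print(cardinality(page_paths))
print(roll_up_other(Counter(page_paths), row_limit=2))
```

The totals still add up after the roll-up, but you lose the detail of every value that ends up in (other).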
BigQuery and GA4
Integrating BigQuery with Google Analytics 4 gives you access to the raw data (almost) for free.
BigQuery lets you export raw, unsampled data, so you can conduct much more granular analysis with confidence in your data. The main benefits are:
- Pay only for the data that is collected and processed (minimal costs)
- A scalable solution
- Export custom event parameters and dimensions
- Connect GA4 data with third-party APIs
- Connect (GA4) data from BigQuery with popular data visualisation tools such as Data Studio and Tableau
If you are seeing excess data aggregated as (other) on a regular basis, you can use BigQuery Export to export your Analytics data to BigQuery and query the entire dataset.
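As a starting point, here is a minimal Python sketch that builds a BigQuery query counting page views per hostname across the daily export tables. The project and dataset names are hypothetical placeholders, and the sketch only generates the SQL text. Running it against BigQuery is left out on purpose.

```python
from datetime import date


def build_page_view_query(table_prefix: str, start: date, end: date) -> str:
    """Build a query over the daily export tables between start and end (inclusive)."""
    if start > end:
        raise ValueError("start must not be after end")
    return (
        "SELECT\n"
        "  device.web_info.hostname,\n"
        "  event_name,\n"
        "  COUNT(*) AS event_count\n"
        f"FROM `{table_prefix}*`\n"
        "WHERE event_name = 'page_view'\n"
        # The daily tables are suffixed with the date as YYYYMMDD.
        f"  AND _TABLE_SUFFIX BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'\n"
        "GROUP BY 1, 2\n"
        "ORDER BY event_count DESC"
    )


# Hypothetical project and dataset; replace with your own.
query = build_page_view_query(
    "my-project.analytics_123456789.events_",
    date(2022, 10, 15),
    date(2023, 1, 15),
)
print(query)
```

Because the query runs on the raw events, no rows are rolled up into (other) and no sampling is applied.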
Concluding Thoughts
Sampling in Google Analytics 4 is still present and can be challenging, but you have a great opportunity to mitigate any impact it has on your data.
Think about integrating GA4 with BigQuery to stay or become fully confident in your data. In the free version of Universal Analytics this isn’t an option!
Invest in SQL and BigQuery and add both skills to your profile if you haven’t yet! And if you are working at a smaller company and don’t collect much data, you should be all good (starting out) without this BigQuery/GA4 integration. See it as a future opportunity.
This is it from my side! Happy to hear your comments on sampling and Google Analytics 4.
João says
Hi! I have a couple of questions:
1 – What should we do if, for example, a Funnel report has a data sample of 60%? Should we trust the values? Should we do anything to “amplify” that data sample?
2 – When you say the threshold for data sampling in Explore reports in GA4 is 10M events, do these events occur in the funnel we are viewing/creating, or is it the sum of all events that occurred in that date range? I think it’s a bit too strange for my website (a national postal company) to have 10M events a day.
Paul Koks says
Hi, please see my answers below:
1 – Statistically, the lower the data sample the less trustworthy your data. In general – as a rule of thumb, not exact science – I would try to get the data sample to 80% as a minimum. You can increase the “60%” by shortening the time period of your analysis.
2 – It’s based on the sum of all events that occurred in that date range.
In general, this only happens when you create or use a report that is not a pre-existing standard report (e.g. a funnel).
ryan says
Can’t you pull the data in via the API using R and set anti-sampling to true?
https://cran.r-project.org/web/packages/googleAnalyticsR/googleAnalyticsR.pdf
Paul Koks says
Hi Ryan,
I haven’t tried this, so I can’t advise here.
Best,
Paul
Ryan says
Hey – me again (ryan) w/an update:
I’ve found button tracking difficult in GA4. If I try to use the report explorer and drill down to a CTA click by page, button and date, GA4 won’t show any data (it is a low-traffic page relative to the rest of the site).
I have been able to pull the info via googleAnalyticsR. I have not tested BigQuery yet.
I’m still slightly concerned about the reporting functionality of GA4 for very specific tasks. The next step is to compare the R output against BQ.
Paul Koks says
Hi Ryan,
I agree that in some cases you can run into challenges the more granular you get within the GA4 UI. Let me know how BigQuery works out for you.
You might also want to ask yourself how granular you want to get at the user level (instead of creating an aggregated or partly segmented sequential analysis).
Ryan says
So I’ve noticed a few discrepancies between BigQuery and the R package. I’m not sure if this is because my SQL command is incorrect (most likely) or due to a limitation of R or the API version that’s being used.
Example: total number of pageviews
R: the 7th row is “(other)” with approx. 150k pageviews (not good)
BQ: same time frame, but I don’t see an (other) row and the total number of pageviews is off (by a pretty decent amount)
Right now my plan is to use R for ad hoc requests while I continue to set up BQ. For me (no BQ experience) it’s easier to use R to pull the data, and quicker than the GA4 UI.
Here was the BQ query I used
SELECT
  device.web_info.hostname,
  event_name,
  COUNT(*) AS event_count
FROM `[table]*`
WHERE
  event_name IN ('page_view')
  -- Replace date range.
  AND _TABLE_SUFFIX BETWEEN '20221015' AND '20230115'
GROUP BY 1, 2
ORDER BY event_count DESC
Paul Koks says
Not sure Ryan as I have limited knowledge of R in this respect. Maybe you want to send a quick note to Mark Edmondson as he might be able to troubleshoot here.
M says
Hi,
I have a question regarding the sample exploration report you provided (the 4th screen shot with the event count and sample rate of 90%).
Where the event count total is listed, is that the total for the 90% of the data that was used? If so, how do you get the real total for all the data? Or do I assume the real total is 10% higher?
Paul Koks says
The +/- 11 million events are an estimate of the total (100%) event count which is based on 90% of all data.
John Contreras says
Hi Paul, is the limit of 25 event parameters per event the result of the sum of the number of event parameters plus the number of user properties of the same event?
Paul Koks says
Hi John, this is purely for event parameters. 25 per property is the limit for user properties in GA. This is for GA4 free version. If you are on GA360, both limits are currently set to 100.
Mahdi says
Hi Paul,
You mention this in this section: ”Google Analytics 4 is also free (there will be a paid version as well) and has no hit/events limits. This is really great if your company has a high number of daily users on the site and/or app and triggers loads of events”.
Why, then, is the data in the event reports shown as sampled, with a sampling rate?
Paul Koks says
Hi Mahdi, in the free version sampling occurs in advanced reporting when the data exceeds 10 million in event counts (and the report is not a standard report). It doesn’t mean there is a hit/event limit, it is that Google applies sampling in this case.