
Find and Prevent PII in GA URLs using GTM

Google Analytics (GA) is a great tool for collecting large amounts of data and extracting valuable insights from it. However, some information that makes its way into your reports can be less useful for your analysis or even harmful to user privacy. One such example is Personally Identifiable Information (PII).

💡 This article explains how to modify your existing setup in Google Tag Manager (GTM) to mitigate PII in both GA4 and Universal Analytics (UA).

Many of you will be reading this for GA4, as by now most investment in UA has tapered off. But if PII is being sent to UA, this remains a critical hotfix until 2023 H1*, or until you no longer rely on UA for production reporting.

* Or at the latest Q3 for 360 properties, based on Google's UA deprecation timeline.

 

What is PII?

PII, or Personally Identifiable Information, is information about the user that can directly or indirectly (when tied with other data) identify them as an individual. It is against Google’s Terms of Service (TOS) to collect PII about your visitors/users in your Google Analytics account.

All data from all days where any PII was tracked is at risk of deletion.

The Office of Privacy and Open Government defines PII as:

“Information which can be used to distinguish or trace an individual's identity, such as their name, social security number, biometric records, etc. alone, or when combined with other personal or identifying information which is linked or linkable to a specific individual, such as date and place of birth, mother’s maiden name, etc.”

PII Examples

  • Names
  • Emails
  • Physical addresses
  • Phone numbers
  • Internet protocol (IP) addresses
  • Passport numbers
  • Social security numbers
  • Driver’s license numbers
  • Financial account information
  • Fingerprints
  • Medical records


 

How Does PII Get into Google Analytics?

PII usually gets into GA accidentally through URL strings.

When a user submits a form on your site or utilizes a search feature, the submitted data might be sent to the server by appending the submitted data in query parameters on the form submission request URL.

If that submission URL is not redirected to a confirmation page without the parameters, then by default everything the user typed into the form fields will get tracked in GA as part of the pageview URL, for example /contact?email=rex@example.com&success=1.

 

Checking Your Data for PII

We recommend checking your UA data (in an unfiltered view) to inform which URL parameters will need to be screened.

💡 This PII report link generator for UA will help you identify potential offending parameters from your UA data to seed your blocklist for both GA4 and UA.

For your reference, the following is a list* of regexes used by that template:

* This list is non-comprehensive / may produce false positives -- you should review matches to confirm validity.

PII Category Regular Expression
Phone Numbers - USA [=,;]\s*(\+\s*\d{1,3})?[-,.+\s(]*\d{3}[-,.+\s)]*\d{3}[-,.+\s]*\d{4}($|[,;:/?&#])
Phone Numbers - International \?.*([=:\,!]|%2[1C])(([\s+.\,)(-]|%2[0B1C89])*\d){11\,15}($|[&#:\,!%])
Physical Address (\d+\ )(([^\s]*)|([^\s]*\ ([^\s]*)))\ ((st(reet)?|ave(nue)?|dr(ive)?|(high)?way|la?ne?|r(oa)?d|b(ou)?le?v(ar)?d))
Zip Code [=,]\d{5}(-\d{4})?($|[&#,])
Email Address [^&?#/](@|%40)([^&?#/]+)\.
CC - Visa/MC [=,;]\s*(\d{4}[-\s+]*){3}\d{4}($|[,;:/?&#])
CC - Amex [=,;]\s*\d{4}[-\s+]*\d{6}[-\s+]*\d{5}[-\s+]*($|[,;:/?&#])
Common parameters - password [?&,;](pwd?|password)=[^&#]
Common parameters - name [?&,;](f|l|u|s|full|first|last|user|screen)?name=[^&#]
Common Names - 1 \b(J(im(my)?|ohn|ames)|Robert|Bob(by)?|Michael|(B|W)illy?(iam)?|Dav(id|e)|(D|R)ic(k|hard)|Ch(arl(es|ie)|uck))\b
Common Names - 2 \b(Mary|Pat(ty|ricia)|Linda|Barb(ara)?|E?liz(zy|abeth)|Jenn?(ifer)?|Maria|Su(e|san))\b
Social Security # (SSN) [=,;]\s*\d{3}[-\s+]*\d{2}[-\s+]*\d{4}($|[,;:/?&#])
IP Address - IPv4 [^vn][=,;]\s*((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)($|[-,;:/?&#])
IP Address - IPv6 [=,;]\s*(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))($|[-,;/?&#])
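As a rough illustration of how patterns like these can be applied, the sketch below tests a few URLs against three of the expressions from the table. The `PII_PATTERNS` list and the `findPii` helper are hypothetical names introduced here, not part of the report link generator; forward slashes are escaped so the patterns are valid JavaScript literals, and the sample URLs are made up.

```javascript
// Three patterns copied from the table above (forward slashes escaped).
var PII_PATTERNS = [
  { label: 'Email Address', regex: /[^&?#\/](@|%40)([^&?#\/]+)\./ },
  { label: 'Phone Numbers - USA', regex: /[=,;]\s*(\+\s*\d{1,3})?[-,.+\s(]*\d{3}[-,.+\s)]*\d{3}[-,.+\s]*\d{4}($|[,;:\/?&#])/ },
  { label: 'Zip Code', regex: /[=,]\d{5}(-\d{4})?($|[&#,])/ }
];

// Return the labels of every pattern that matches the given URL.
function findPii(url) {
  return PII_PATTERNS
    .filter(function (p) { return p.regex.test(url); })
    .map(function (p) { return p.label; });
}

findPii('/thank-you?submit=true');                    // []
findPii('/contact?email=rex@example.com&success=1');  // ['Email Address']
findPii('/signup?phone=215-555-0100&zip=19123');      // ['Phone Numbers - USA', 'Zip Code']
```

As with the full list, a match here is only a candidate: review it before adding the parameter to your blocklist.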

 

Fixing PII at the Source

If you see PII in your Google Analytics reports, you should talk to the IT or web development team who is responsible for maintenance and functionality of the website or app that is submitting the PII.

The best way to prevent this is to encrypt the request values that are coming through with PII or remove them from URLs entirely. This is a preferable solution as it also safeguards the data and protects user privacy the most.

💡 In many cases, one of the leanest and most effective solutions – if possible – is to change the form method from GET to POST.

This will send submitted data as a payload in the body of the request, instead of in the request URL. This is a general privacy/security best practice for many reasons beyond just web analytics integrity, as there are multiple ways URLs can be "leaked" in transport, on servers, and on the web.
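To make the difference concrete, here is a minimal sketch of where the submitted data ends up under each method. `serialize` and `describeSubmission` are hypothetical helpers written for this illustration, and the encoding is simplified (real browsers encode spaces as `+` in form submissions).

```javascript
// Serialize form fields as URL-encoded key=value pairs.
function serialize(fields) {
  return Object.keys(fields).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(fields[key]);
  }).join('&');
}

// Describe the request a form would produce for a given method.
function describeSubmission(method, action, fields) {
  var data = serialize(fields);
  if (method.toUpperCase() === 'GET') {
    return { url: action + '?' + data, body: null }; // data lands in the URL
  }
  return { url: action, body: data };                // data lands in the request body
}

var fields = { email: 'rex@example.com', success: '1' };

describeSubmission('GET', '/contact', fields).url;   // '/contact?email=rex%40example.com&success=1'
describeSubmission('POST', '/contact', fields).url;  // '/contact'
describeSubmission('POST', '/contact', fields).body; // 'email=rex%40example.com&success=1'
```

Only the URL is picked up as the pageview location, so with POST the email never reaches the tracked URL in the first place.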

But chances are, you need a fix as quickly as possible. So unless you're fortunate enough to have the resources to implement very quickly, we advise submitting that request and simultaneously deploying a tracking update to screen PII from data collection. And if encrypting or removing the sensitive data at the source is not possible, the following approach will be a suitable permanent solution for GA (and other PII-allergic platforms tracked via GTM).

For property/processing configuration such as filtering, GA4 looks quite different from Universal Analytics – but none of that is suitable for screening PII anyway.

Even if it's filtered out in your reports, the moment PII has reached Google's servers it has violated GA's TOS. This is also likely to violate the offending website's privacy policy. So PII screening must be done tracking-side, and we're going to show you how to use Google Tag Manager to prevent this data from passing to Google Analytics.

 

How to Remove PII From URLs Using GTM

While GA4 and UA tags look different, the solution we provide will work for both (and any other platform tracked via GTM).

Steps to Scrub PII from URLs

  1. Choose a parameter screening method: Allow vs. Block
  2. Sanitize URLs with JavaScript
  3. Send Clean Data to GA
  4. Test & Publish

The core logic is in a custom JavaScript variable that we can use to strip parameters from any given URL. This JavaScript variable will become a part of your Google Analytics configuration tags for both Universal Analytics and GA4.

Allowing vs. Blocking Parameters

First, you'll need to decide whether you'll use an allowlist or a blocklist method.

Like many, we've retired our use of the terms "whitelist" and "blacklist" – we're done associating white with good and black with bad. In their place we've adopted more inclusive terms: "allowlist" and "blocklist" (matching GTM's convention).

The method you choose will govern how you handle query parameters by default. With a blocklist, you would allow all query parameters to come through except those on the list. The allowlist would work the opposite way, removing all the parameters by default except those you would need to keep and put on the list.
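The difference in defaults can be sketched in a few lines. `keepParam` and `screen` are hypothetical helpers for illustration only, and the parameter names are made up; note what happens to `page`, which appears on neither list.

```javascript
// Decide whether a parameter key survives under the chosen method.
function keepParam(key, list, method) {
  var listed = list.indexOf(key) !== -1;
  return method === 'allowlist' ? listed : !listed;
}

// Apply the decision to a set of parameter keys.
function screen(keys, list, method) {
  return keys.filter(function (key) { return keepParam(key, list, method); });
}

var keys = ['utm_source', 'email', 'page'];

screen(keys, ['utm_source'], 'allowlist'); // ['utm_source']          (unlisted keys are dropped)
screen(keys, ['email'], 'blocklist');      // ['utm_source', 'page']  (unlisted keys are kept)
```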

Allowlist Method

Seriously consider stripping all URL parameters by default. This negates a world of potential noise and focuses you on determining which parameters provide data valuable to your measurement strategy.

With this approach, we make use of an "allowlist" to designate specific parameters as the only parameters we want to be picked up by GA. In this case, we just want to make sure we're still allowing key platform parameters such as utms, gclid, and gbraid.

💡 To help you identify all your parameters in your UA data, you can use this template from Google.

Implementing an allowlist is not something you should do hastily, as this will negate all non-specified parameters. So before publishing, consult with all teams that rely on web analytics, and ensure any parameters important to them are included in the allowlist.

Additionally, they should be aware of the publish date, because it is possible that page-level metrics may increase on pages that had previously been splintered across many report rows.

Blocklist Method

If time constraints don't allow for the coordination required to adopt an allowlist, or you are concerned about not having a backup for valuable parameters you might miss using an allowlist, then you can use a blocklist to remove only parameters you specify.

💡 Refer to Checking Your Data for PII for tips on using your analytics data to identify PII parameters.

Sanitizing URLs with Custom JavaScript

In Google Tag Manager, create a new user-defined variable. Name it `Function - Strip Parameters` and set the type to Custom JavaScript.

The code to use depends on the method you chose. Copy the matching version below and paste it into the variable.

Allowlist Code

function(){
  // Parameter keys to keep. The parentheses make sure the whole
  // concatenated string is split into an array of keys.
  var allowlist = ('FIVE,EXAMPLE,ALLOWED,CUSTOM,PARAMS' +
        ',utm_campaign,utm_content,utm_medium,utm_source,utm_term' +
        ',utm_creative_format,utm_marketing_tactic,gbraid,wbraid,gclid,dclid').split(','),
      replaceWith = ''; // If empty, blocked parameters will be dropped entirely,
                        //   otherwise overridden with this value.

  return function sanitizeUrl( url ){
    return url.replace( /((\?)|&)([^#&=]+)(?:=([^#&]*))?/g, function(input,delim,qmark,key,val){
      if( -1 !== allowlist.indexOf(key) )
        return input;
      else return replaceWith ? delim+key+'='+replaceWith : qmark||'';
    }).replace(/\?&*$|(\?)&+/,'$1');
  }
}

 

Blocklist Code

function(){
  // Parameter keys to remove.
  var blocklist = 'FIVE,EXAMPLE,BLOCKED,CUSTOM,PARAMS'.split(','),
      replaceWith = ''; // If empty, blocked parameters will be dropped entirely,
                        //   otherwise overridden with this value.

  return function sanitizeUrl( url ){
    return url.replace( /((\?)|&)([^#&=]+)(?:=([^#&]*))?/g, function(input,delim,qmark,key,val){
      if( -1 === blocklist.indexOf(key) )
        return input;
      else return replaceWith ? delim+key+'='+replaceWith : qmark||'';
    }).replace(/\?&*$|(\?)&+/,'$1');
  }
}

 

This variable holds the list of parameters that you would like to keep (allowlist) or remove (blocklist), and returns a function that takes a URL and drops or overrides the query parameters accordingly.

Replace the values in the allowlist or blocklist variables inside the code with your own parameter keys, using commas as delimiters, and making sure there are no spaces around the commas. For example, if you are using the blocklist method and want to block the email, phone, and address parameters, the first line of your list can look like this:

var blocklist = 'email,phone,address'.split(','),

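Remember that the parameter keys must be written exactly as they appear in the URL, because the lookup is case-sensitive. To make it a case-insensitive match, write the keys in your list in lowercase and lowercase each URL key before the lookup, so the indexOf call (shown here for the blocklist) looks like this:

blocklist.indexOf(key.toLowerCase())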

If you want to keep the parameters and only redact their values, you can set replaceWith to the replacement value. For example, this configuration: replaceWith = '[REDACTED]'; would result in a sanitized URL that looks like this:

https://www.seerinteractive.com/thank-you?submit=true&email=[REDACTED] 
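To try the behavior outside GTM, the sketch below wraps the same replace logic as the blocklist code in a small factory so it can be run in a browser console or Node. `makeSanitizer` is a hypothetical name used only for this illustration, and the email address in the sample URLs is made up.

```javascript
// Same replace logic as the blocklist variable, with the list and the
// replacement value passed in as arguments instead of being hard-coded.
function makeSanitizer(blocklist, replaceWith) {
  return function sanitizeUrl(url) {
    return url.replace(/((\?)|&)([^#&=]+)(?:=([^#&]*))?/g, function (input, delim, qmark, key) {
      if (blocklist.indexOf(key) === -1) return input;  // not listed: keep as-is
      return replaceWith
        ? delim + key + '=' + replaceWith               // keep the key, override the value
        : qmark || '';                                  // drop the parameter entirely
    }).replace(/\?&*$|(\?)&+/, '$1');                   // tidy a leftover "?" or "?&"
  };
}

var drop = makeSanitizer(['email', 'phone', 'address'], '');
var redact = makeSanitizer(['email', 'phone', 'address'], '[REDACTED]');
var url = 'https://www.seerinteractive.com/thank-you?submit=true&email=rex@example.com';

drop(url);   // 'https://www.seerinteractive.com/thank-you?submit=true'
redact(url); // 'https://www.seerinteractive.com/thank-you?submit=true&email=[REDACTED]'
drop('/contact?email=rex@example.com&success=1'); // '/contact?success=1'
drop('/contact?email=rex@example.com');           // '/contact'
```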

Sending the Sanitized Data to GA

Now we'll create a couple of other variables that will utilize our function to produce a sanitized URL string and update GA tracking configuration to use sanitized values.

Create a Custom JavaScript variable called “Page URL (Sanitized)” and paste the following code into it:

function(){ return {{Function - Strip Parameters}}(location.href); }

In this code snippet, we are calling the function we defined first and passing in location.href as an argument to get a sanitized version of the full page URL. We will use this variable in both Universal Analytics and GA4 configuration settings.

If you have custom tracking pulling in other URLs that need to be screened as well, you can use the same logic. For example, to sanitize Click URL:

function(){ return {{Function - Strip Parameters}}( {{Click URL}} ); }

Now, it’s time to put these new variables to use in your tracking configuration.

For GA4

Go to your GA4 configuration tag. In the “Fields to Set” section, add a new row.

Set the ‘Field Name’ to ‘page_location’, and the value to {{Page URL (Sanitized)}}. (If this field is already defined, you'll instead want to combine the new parameter stripping logic with your existing customizations.)

If page URLs are passed anywhere else (e.g. in custom parameters), you should follow the same pattern to sanitize those fields as well.

For Universal Analytics

Create another Custom JavaScript variable called “UA Pageview URL (Sanitized)”. This variable will be used with the Universal Analytics configuration only:

function(){ return {{Function - Strip Parameters}}(location.pathname + location.search); }

If you have set up your Universal Analytics tag correctly, you should have a GA Settings variable that contains your Universal Analytics configuration settings. Open this variable's settings, and add two “Fields to Set”:

  • page = {{UA Pageview URL (Sanitized)}}
  • location = {{Page URL (Sanitized)}}

Test & Publish

Test your solution in GTM Debug Mode.

Load a page with the URL parameters that you included in your allowlist/blocklist code and check how your new variables populate. You should see the unaltered URL as the value of the default Page URL variable, and the sanitized version as the value of your custom variable.

You should also check your tags and make sure they fired with the right data.

As due diligence, before publishing you may also want to confirm that any configurations based on URL parameters (e.g. goal configurations) will not break if some parameters disappear.

For example, a form might always have appended a few parameters:

/contact?email=rex@example.com&success=1 

If the goal configuration looked for "&success=1" and we're blocking 'email', this goal configuration will break. This is because when the email is removed from the URL, the URL will contain “?success=1”.

Never rely on parameter order when matching URLs! Instead, always place the character class [?&] right before the parameter key to match that parameter at any position. There is no need to escape the ? character here: inside the character class brackets [ ], it is treated as a literal question mark. For example: [?&]success=1(&|#|$).
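A quick way to convince yourself is to run the pattern against the URL before and after sanitizing. The URLs below are illustrative only:

```javascript
// Matches success=1 whether it is the first parameter or a later one.
var successParam = /[?&]success=1(&|#|$)/;

successParam.test('/contact?email=rex@example.com&success=1'); // true  (before sanitizing)
successParam.test('/contact?success=1');                       // true  (after 'email' is stripped)
successParam.test('/contact?success=1#form');                  // true  (followed by a fragment)
successParam.test('/contact?success=10');                      // false (different value)
successParam.test('/contact?nosuccess=1');                     // false (different key)
```

The trailing (&|#|$) group is what keeps success=10 from matching, and the leading [?&] is what keeps nosuccess=1 from matching.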

If everything is working as expected, you're free to publish your changes. Verify your changes using GA's Realtime report(s).

Finally, whether you go with allowlist or blocklist, it's wise to maintain documentation on how you're manipulating query parameters.

UA Alternative: Sanitize All Fields via Custom Task

Depending on your setup, PII may be passed in other fields besides the tracked URL.

The above approach can be applied to any field, but it requires that each such field be identified and explicitly set with its own dedicated GTM variable. While this is the most accessible approach for GTM users since it leverages only GTM features, there is a more flexible option for UA that can strip PII from all fields using one block of code, and doesn't require identifying all offending fields upfront.

This method relies on using the customTask API that is available in Universal Analytics.

In simple terms, customTask is one of the functions that run in between the tracking process when the data is collected and the HTTP request that sends the data to the analytics server. It's designed to give you a way to access and modify the request data and also allows you to modify other standard tasks/processes that occur after it.

💡 While customTask can be utilized in a lot of creative ways, we will use it to strip the sensitive data from the analytics data collection request. Brian Clifton has a great implementation guide for this that can be found here.
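For orientation, here is a minimal sketch of the general shape such a customTask can take. It is not the implementation from the linked guide. It assumes the analytics.js task model exposes `sendHitTask` and `hitPayload` through `model.get` / `model.set`, uses a single simplified email pattern (`EMAIL_PATTERN`, a hypothetical stand-in for a fuller pattern list), and `redactPayload` and `piiCustomTask` are names made up for this example.

```javascript
// Simplified, hypothetical pattern; a real setup would use a fuller list.
var EMAIL_PATTERN = /[^\s\/?&=#]+@[^\s\/?&=#]+\.[^\s\/?&=#]+/g;

// Redact matches in every field of a "key=value&key=value" hit payload.
function redactPayload(payload) {
  return payload.split('&').map(function (pair) {
    var i = pair.indexOf('=');
    if (i === -1) return pair;
    var key = pair.slice(0, i), value = pair.slice(i + 1);
    try {
      var decoded = decodeURIComponent(value);
      var cleaned = decoded.replace(EMAIL_PATTERN, '[REDACTED]');
      // Re-encode only the values that actually changed.
      if (cleaned !== decoded) value = encodeURIComponent(cleaned);
    } catch (e) { /* leave values that cannot be decoded untouched */ }
    return key + '=' + value;
  }).join('&');
}

// customTask: wrap sendHitTask so the payload is cleaned just before it is sent.
function piiCustomTask(model) {
  var originalSendHitTask = model.get('sendHitTask');
  model.set('sendHitTask', function (sendModel) {
    sendModel.set('hitPayload', redactPayload(sendModel.get('hitPayload')), true);
    originalSendHitTask(sendModel);
  });
}
```

In GTM, a function like `piiCustomTask` would typically be returned from a Custom JavaScript variable and mapped to the `customTask` field under “Fields to Set” in the GA Settings variable. Because it works on the whole payload, it covers every field in the hit rather than only the page URL.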

Unfortunately, there is no equivalent to Custom Task for GA4.

 

Get Started

Step 1: Download Template

To get started, download this GTM container containing both blocklist and allowlist logic, and the sanitized URL variables for GA4.


Step 2: Import into Your GTM

Import this into your GTM container, and plug into the relevant template tags:

  • GA4 - Under 'Fields To Set' in the GA4 config tag, map the sanitized page URL to `page_location` (as shown above).
  • UA - Under 'Fields To Set' in the GA Settings variable(s), map the sanitized page URL to `location` and map a sanitized UA Pageview URL variable to the `page` field (as shown above).
  • Other - Determine which fields track URLs and their formatting requirements. If needed, create the sanitized URL variable for the required format(s), then override the offending fields in the platform's tracking tag/configuration with the sanitized URL variable(s).

Step 3: QA and Publish

Then modify the 'Function - Strip Parameters' variable to strip the offending parameters, QA, and publish!

And remember to make sure you're not losing any valuable parameters before you opt for an allowlist; if you're unsure, it's safer to default to a blocklist.

Whichever method you choose, make sure to regularly monitor your data for privacy compliance by setting up custom GA alerts.

 

What Now?

PII Auditing Support

Seer can help! We bring our analytics expertise to the table by enabling you and your team to act upon your data. Regardless of if you’re looking for a solid foundation or a long-term partnership, we have a team to fit your needs.


Stephen Harris
Team Lead, Analytics
Sasha Helms
Sr. Associate, Development
