SEO Power BI Data Cleaning & Preparation Checklist for URLs
Getting into data visualization for finding new opportunities and accelerating analysis? Me too! But there’s a few best practices I’ve learned along the way when connecting datasets by URLs.
When might you want to connect by URL?
Here’s a few outputs I find I’m generally connecting on the URL level where I need to really clean the URLs to make sure I’m getting a big picture view of my data sets:
- Google Analytics
- Google Search Console
- Google Ads
- Screaming Frog
When you’re handling BIG data it isn’t worth taking a shortcut and potentially missing connections from messy URLs. Do I even need to keep going?
URL Cleaning Checklist for Power BI (or any Data Viz Tool)
Hop into the Power BI query editor and let’s get going!
New to Power BI? Watch Lesson 1 with Wil Reynolds
MAKE A DUPLICATE URL COLUMN - OPTIONAL
I prefer to play it safe and make a duplicate column of my original URL column and then transform the duplicate.
Why? So that if I make a mistake or change my mind about the way I’m transforming my data I can just make another duplicate and start again instead of deleting my applied steps on the original and potentially affecting another column or chart.
Step 1: Right click on your URL column and select Duplicate Column
LOWERCASE LOWERCASE LOWERCASE
Are you working with mixed-case URLs? I hope not, but either way, your first step should be to transform your URL column into lowercase just in case.
Step 1: Right click your URL column and select Transform > lowercase from the dropdown.
CLEAN AND TRIM
- The Clean Query returns the original text value with non-printable characters removed. Right click your URL column and select Transform > clean from the dropdown.
- The Trim Query removes whitespace values from the beginning and end of the text value.Right click your URL column and select Transform > trim from the dropdown.
I like to separate protocols by splitting the column with a delimiter - that way I “split” the column into the protocol and the rest of the URL so I can reference http vs https later with a slicer or a pie chart if I want to know if there’s a major discrepancy between the two.
Step 1: Right click on your column, then click Split Column > By Delimiter.
- You can also click Split Column in the top navigation under the Home tab.
Step 2: Enter the custom delimiter :// and split at Each occurrence of the delimiter (technically any option should work here).
Your URL column should now be split into 2 columns -- one containing the protocol and one containing the URL.
WEIRD STUFF APPENDING TO YOUR URLS?
You're not alone.
Ahhh, the dreaded “what is this thing at the end of my URLs?”. Obviously you can’t fully identify when one data set is adding something weird on the end of the URLs, or when some URLs feature a string at the end and some don’t.
Some examples of what this could look like:
You can get rid of this one of two ways:
- Right click on your column and select Replace Values to replace the appending value with a blank.
- Use Split Column by delimiter to split it into another column, using the value as the delimiter.
Boo. Parameters. We can remove parameters by canonicalizing the way search bots would, but this method won’t take into account the difference between parameters for unique URLs vs tracking URLs.
Step 1: Use the Split by Delimiter function and split using a “?” to create a column holding your parameters and a canonical version of the URL.
TRIM THOSE TRAILING SLASHES
Trimming trailing slashes is a must but you can’t use split by delimiter here - if there isn’t a trailing slash on the URL you’ll just lop off the last path.
To remove trailing slashes, we’ll use a formula that will trim any URLs with a trailing slash and leave any URLs without a trailing slash alone.
Step 1: Under the Add Column tab, select Format > Trim
- This will automatically create a new column named “Trim”. To enter our formula, select the gear to the right of Inserted Trimmed Text in your Applied Steps to edit the formula.
Step 2: Enter the following formula to create your Custom Column
NEXT STEPS: MAKING A BRIDGE
- We should now have our original URL column, the protocol column, the “Clean” URL column, the parameter column, and the Trimmed URL column.
🎬 Catch up with us on YouTube for more Power BI goodness!
Now that you’ve cleaned your URLs into canonical versions you may have duplicates in your Trimmed column which could prevent you from creating a relationship through this column. To learn how to properly clean a list and de-duplicate - subscribe below to read our upcoming blog post on cleaning & de-duping your data!