Scraping for Dummies with Outwit (a Marketer’s Best Friend)
Pulling data sucks. It’s mind-numbing. Outwit is a fantastic tool that’s saved me hours of manually pulling data and helped build links, create assets, identify architecture issues and so much more.
Outwit is meant for non-programmers.
It is great for:
- harvesting infographic data that might not be easily accessible
- pulling prices from websites in bulk
- pulling Google Suggest for scalable keyword research
- extracting data from xml (think outputs from APIs)
- creating content your clients don’t have the time or resources to make
- so much more
I’ve used it on:
- Site Architecture
- Alchemy API
- Google Suggest, Shopping, Images
Recently, I used Outwit to find sold out products on a client website which would have taken me hours, but ended up taking me only 30 mins with Outwit. We 301 redirected the sold out pages that had the most links so link juice could pass to relevant, priority product pages.
Here is a map of newspapers in Pennsylvania I harvested from a directory using Outwit:
If I am doing outreach for a marketing campaign in Pennsylvania, and I want to find the contact info for newspapers in a 20 mile radius, it might be a bit difficult to just look at a list.
With the above map, all I have to do is zoom in on a town to see all the newspapers around it.
Let’s create an asset with Outwit
At a high level, we’re finding the URLs of Pennsylvania newspapers on http://www.usnpl.com/, pulling their address info and plugging it into Google Fusion tables to map for us. It’s simple! Again, you do not need to know programming to pull info from the web.
You’ll see a lot of steps and images below. Don’t be intimidated. There are a lot of steps because I go in detail with each one.
A HUGE thank you to Steve over at http://www.usnpl.com/ for letting us use his directory as an example.
1. Let’s start with the PA page: http://www.usnpl.com/panews.php
2. We need to find the link on the page that leads to addresses for each newspaper:
Here is an example of an address page: http://www.usnpl.com/addr/aaddressresult.php?id=2613. We need to pull the address info here:
3. Download Scrape Similar from the Chrome Store: https://chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd?hl=en.
4. Once, it has been downloaded, just right click on the “A” link on the PA page:
and you should get something like this:
Export to Google Docs. Here is my Google Doc: http://bit.ly/Z7GfCv.
5. We have a lot of useless links going to the newspaper home page or Twitter feed. We need only the address link:
Let’s use the filter function to only select cells in Col A that contain “A”:
6. We have our pages, we need to make URLs, because addr/aaddressresult.php?id=2613 is not a web address. We need to append the domain name to the page to get the URL:
7. Test out a few pages to make sure that it worked. Next step is to build our scraper in Outwit. Go here to buy Outwit for $59.99 on your Mac or PC: http://www.outwit.com/products/hub/.
8. Open Outwit, paste a URL from our Google Doc into the browser bar:
9. Highlight and copy what we want to scrape:
Go to the scraper section and paste your address in the search field:
10. This is the tricky part… You need to select code before what you want to scrape and after. For the name of the newspaper, I don’t want to choose
because when I apply this scraper to other URLs, mcall.com will not appear in the HTML.
11. Here is what it looks like:
12. Execute your scraper to make sure that it’s working properly:
This is exactly what I needed:
Now I just need to apply it to other URLs from my Google Doc to make sure it works.
13. Copy and paste all of your newspaper URLs into Text Edit for Mac or Notepad for PC:
Save the file.
14. Open the txt file in Outwit:
and right click to find your scraper:
Now Outwit will go to each newspaper URL to find the information you’re looking for.
15. All of your scraped data should be in the “scraped” section and look like:
Export your data:
16. We need to separate the address, city, state and phone number so let’s use text to columns in Excel to separate everything out:
Don’t worry about overwriting your other columns.
17. Let’s map our addresses! Copy the newspaper name, address, city / state and phone columns into Google Spreadsheets.
18. Go to http://drive.google.com, select Fusion Tables (will not work in Google Apps):
Select the spreadsheet with your data:
Select next, then finish.
19. Go to the Map of City / State tab:
Google Maps will automatically find the addresses and begin geocoding them:
20. Voila! The map is complete:
Now that you’ve created a map from scraped data, you can pretty much do any simple web scraping. Outwit is an invaluable tool for finding information and Excel for aggregating it.
I’d love to get your feedback on the above and any projects you’ve worked on using Outwit.
Last word: be respectful and DO NOT use Outwit carelessly. You can get blocked, banned and take down websites.