You’ve heard of web crawlers, spiders, and Googlebot, but did you know that these crawlers have limits to what they can and can’t crawl on a website? Keep reading to learn about this important budget within SEO (search engine optimization).
Crawl budget is the number of pages that Googlebot (and other search engine crawlers) can crawl in a given amount of time. Managing your crawl budget can support the overall indexation of your site.
💡 Remember! In order for Googlebot to crawl your website, you need to make sure your robots.txt file allows it to crawl.
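As a quick illustration, here’s a minimal sketch of that check using Python’s standard-library `urllib.robotparser` and a hypothetical robots.txt (the site, paths, and rules are made up for the example):

```python
from urllib import robotparser

# Hypothetical robots.txt contents -- substitute your own site's rules.
robots_txt = """\
User-agent: Googlebot
Disallow: /thank-you/

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Googlebot may fetch the product page, but not the thank-you page.
print(parser.can_fetch("Googlebot", "https://example.com/plants/ferns"))  # True
print(parser.can_fetch("Googlebot", "https://example.com/thank-you/"))    # False
```

Against a live site you would point `RobotFileParser` at your real robots.txt with `set_url()` and `read()` instead of parsing a string.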
While it is uncommon for Google to crawl and index every page on your site, we want to make sure that all of our important pages are indexed and able to appear in the SERPs. Unfortunately, we are not in 100% control of which pages Google crawls. Google identifies which pages are the most important and prioritizes them accordingly. Some of the factors at play are internal linking structure, XML sitemaps, and site authority.
An easy way to understand SEO crawl budget is with these two examples:
- Small Business: You own a small business that sells plants. Your website has 100 pages and a crawl budget of 1,000 pages (meaning you could add 900 new pages and still stay within your crawl budget!). You can optimize your crawl budget for increased efficiency and be prepared if the total number of pages ever surpasses your current budget.
- eCommerce: You own an international eCommerce business with 100,000 pages and a crawl budget of 90,000 pages. Here the crawl budget is a problem: 10,000 pages will not be crawled or indexed. While some of those pages might carry a noindex tag, you could be losing visibility in the search engine results pages (SERPs), as well as customers, because those pages are not indexed.
Don’t remember the difference between crawling vs indexing vs ranking? Don’t worry, we’ve got you covered!
Google Search Console defines crawl rate as “how many requests per second Googlebot makes to your site when it is crawling it: for example, 5 requests/second.”
While you cannot increase the number of requests per second Googlebot makes when crawling your site, you can limit it if needed. You can also request that Google recrawl a page. A few reasons why you might want to recrawl a page are:
- The page is new and has not been crawled yet
- Content or metadata on the page has been updated
- The page was not properly indexed during the last crawl
To check when your page was last crawled, head over to Google Search Console. After navigating to your property, enter your URL in the search bar at the top of the page. You will then be directed to the URL Inspection tool, which helps you understand when your page was crawled, what the referring URL was, any issues that arose during indexing, and more!
Within URL Inspection, Google Search Console will tell you if your URL is in Google’s index. If it is not indexed, there could be a variety of problems that need to be investigated. It might be as simple as the page not being crawled/indexed yet, or as serious as an issue with the robots.txt file or a manual action. You can also view how Googlebot sees your page by utilizing the “Test Live URL” feature.
Don’t Forget! While you can ask Google to re-crawl a page, requesting indexing multiple times does not prioritize your crawl.
To learn more about your page and crawl details, open the Coverage tab. This is where you identify whether your page was indexed, whether it was submitted in a sitemap, whether crawling and indexing are allowed by your robots.txt file, and which user agent crawled the page.
Make sure to review the referring URL because this is the page that led Google to crawl your page. Your page might be found through a variety of sources like internal/external links, or a crawl request.
To see more of the nitty-gritty details, like the crawler type and the time of the last crawl, focus on the Crawl section. While there are two types of Googlebot crawlers (mobile and desktop), as we continue to move toward mobile-first indexing and mobile-friendliness, your website will more than likely be crawled exclusively by Googlebot Smartphone, if it is not already.
One important thing to note within the crawl section is whether a page can be crawled and indexed. Moz has identified that there are also cases when a page is crawled, but not indexed, meaning that the page has not been included in the index library (yet) and therefore is not eligible to be shown in search results.
If your page is not allowed to be crawled or indexed, often indicated by a “Disallow” rule under a user-agent in robots.txt, double-check your source code or connect with a web developer.
You want to make sure that if your page is blocking a crawler, it’s intentional, and not an accident in the code.
There are a few pages or areas on your site that Google has no need to index. Some common examples of pages you might want to keep out of the index (often via a noindex tag) are:
- Login pages
- Internal search results
- Thank you pages
- Form submission pages
There are also a few methods you can utilize in order to prevent pages from being added to the index:
- noindex tag
- robots.txt disallow (if the page hasn’t been crawled/indexed yet)
- GSC removals tool
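For reference, a noindex directive typically lives in the page’s `<head>`. A minimal sketch, assuming a hypothetical thank-you page you want kept out of the index:

```html
<!-- In the <head> of the page you want excluded from the index -->
<meta name="robots" content="noindex">
```

The same directive can also be sent server-side as an `X-Robots-Tag: noindex` HTTP response header, which is handy for non-HTML files like PDFs.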
There are a few helpful tools you can use to learn more about your site’s crawl stats or see how many pages of your site Google crawls per day.
Within Google Search Console, you can navigate to your domain property > settings > crawl stats and this will show you the number of crawl requests, download time, and average page response times. This crawl stats report can be helpful when working to optimize your crawl budget, which we will cover a little later.
We can also review server logs to see EXACTLY what Googlebot is crawling. Many SEO platforms and dedicated log analyzers offer log file analysis solutions.
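To get a feel for what log file analysis involves, here’s a minimal sketch in Python: it filters a few hypothetical access-log lines (combined log format, made up for the example) for the Googlebot user agent and counts which URLs were requested. Real log analyzers do much more, such as verifying that requests genuinely come from Google’s IP ranges.

```python
import re
from collections import Counter

# A few hypothetical lines in combined log format.
log_lines = [
    '66.249.66.1 - - [10/Mar/2024:06:15:02 +0000] "GET /plants/ferns HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [10/Mar/2024:06:15:07 +0000] "GET /blog HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    '66.249.66.1 - - [10/Mar/2024:06:16:44 +0000] "GET /plants/ferns HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

request_path = re.compile(r'"GET (\S+) HTTP')

# Count how often Googlebot requested each URL path.
googlebot_hits = Counter(
    request_path.search(line).group(1)
    for line in log_lines
    if "Googlebot" in line
)
print(googlebot_hits.most_common())  # [('/plants/ferns', 2)]
```

Pointing the same loop at a real access log quickly shows which sections of your site eat the most crawl activity.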
So we have identified the basics, and outlined where to check for crawl statuses – but you might be wondering, why should I care, and is it really important for SEO?
When we create a new page or update an old one, we want individuals to see it! Whether the user is someone planning on buying a custom bike, or an individual looking for a degree program to enroll in, we want those pages to be accessible to users, preferably on page one of their search engine.
If our crawl budget only covers 50% of our website (100,000 pages, 50,000 allotted in crawl budget), 50% of our website will not be discoverable in the search results. And yes, someone might be able to find your URL by typing it in word for word, but that’s not always the case – and quite frankly, that’s not a risk SEOs are willing to take when we can work to optimize our crawl budget!
Now, optimizing your crawl budget is not a one-day task. You might get frustrated along the way, but we are here to help!
To begin, let’s review what we can do in order to help improve your crawl budget:
Site speed is important for a variety of reasons. We want pages to load timely for users so they engage with our site, but we also want it to be fast so Googlebot can crawl our content as quickly as possible.
Don’t you love how you feel waiting for a minute to pass when doing a plank … ? Yeah, we don’t like that feeling either!
We want to avoid that long wait for Googlebot, too, because the quicker our pages load, the quicker Googlebot can crawl and index our pages.
While we aren’t increasing the crawl budget itself, if we can get 10 pages to load in one minute compared to one page loading in one minute, we are going to see visible improvements.
Internal and external links are a key part of any SEO strategy. Internal links, which are links pointing to different pages within the same domain, are incredibly important for both user experience and site structure.
For starters, if blog A links to blog B with a DO-FOLLOW link, Googlebot CAN access the internal link and will navigate to and crawl blog B.
If blog A has a NO-FOLLOW tag enabled in the source code for that link, Googlebot can see the link exists, but WILL NOT navigate to or crawl blog B. Don’t fret, we can learn about no-follow links another time.
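In the page source, the difference comes down to the link’s `rel` attribute. A sketch, with hypothetical URLs:

```html
<!-- Do-follow (the default): Googlebot may follow and crawl blog B -->
<a href="/blog-b">Read blog B</a>

<!-- No-follow: Googlebot sees the link but is asked not to follow it -->
<a href="/thank-you" rel="nofollow">Thank you page</a>
```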
You might be wondering, why do I need to know about internal links for my crawl budget? Because enabling no-follow or do-follow links is another way to help optimize the crawl budget! If you are internally linking to a page that provides no value to Google, and you don’t need it ranking in the SERPs, like a thank you page, why would you waste your valuable budget that could be dedicated to crawling pages that help drive ROI?
It is also important to identify any orphan pages that might be lingering on your site. An orphan page is a page that has no internal links pointing to it. The only way it can be crawled is by manually requesting indexing, since Google will not be able to find it naturally.
💡 Remember! If you are in the process of building a new website or redoing your site structure, make sure to avoid creating orphan pages. If you notice too late that those pages are floating around with no links to keep them afloat, create an internal link to make it easier for Googlebot to reach them the next time it crawls your site.
If you have duplicate content that must stay live on your site, utilize a canonical tag to make sure Googlebot only crawls the priority page.
Canonical Tip! Say you have a pair of tennis shoes that come in blue, red, and yellow. While you want users to be able to find the shoes in blue, size 12, or yellow, size 4, you only need Googlebot to crawl the main product page. By cutting out all of the different variations (size, color, etc.) and utilizing a canonical tag, you can decrease the unnecessary fluff that needs to be crawled and indexed.
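In practice, each variation page points a canonical tag at the main product page from its `<head>`. A sketch with a hypothetical URL:

```html
<!-- On /tennis-shoes?color=blue&size=12 and every other variation -->
<link rel="canonical" href="https://example.com/tennis-shoes">
```

Google treats the canonical URL as the preferred version of the page, which helps consolidate crawling and ranking signals in one place.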
Google Search Central identified what can negatively affect the crawl budget:
“According to our analysis, having many low-value-add URLs can negatively affect a site’s crawling and indexing. We found that the low-value-add URLs fall into these categories, in order of significance:
- Faceted navigation and session identifiers
- On-site duplicate content
- Soft error pages
- Hacked pages
- Infinite spaces and proxies
- Low quality and spam content
Wasting server resources on pages like these will drain crawl activity from pages that do actually have value, which may cause a significant delay in discovering great content on a site.”
While some issues, like duplicate content, can be addressed with a 301 redirect or an audit, others, like hacked pages, require a deeper dive to solve the root issue at hand. In addition to optimizing your crawl budget, you want to make sure to address any low-value-add URLs identified by Google.
Need additional help optimizing your crawl budget? Need to know how to fix crawl errors? Want to identify other areas that could use further optimization? Contact us to learn how the Technical SEO Team at Seer can help!