
Host Resources On Different Hostname To Save Crawl Budget

Google Search Central has introduced a new series called “Crawling December” to delve into the workings of Googlebot as it crawls and indexes webpages.

Throughout the month, Google will publish a new article each week covering aspects of the crawling process that are often overlooked but can significantly affect how a site is crawled.

The first post in the series focuses on the fundamentals of crawling and sheds light on crucial yet lesser-known details about how Googlebot manages page resources and crawl budgets.

Crawling Basics

Modern websites rely heavily on JavaScript and CSS, which makes them more complex to crawl than the HTML-only pages of the past. Googlebot works much like a web browser, but on a different schedule.

When Googlebot visits a webpage, it first downloads the HTML from the main URL, which may link to JavaScript, CSS, images, and videos. Google’s Web Rendering Service (WRS) then uses Googlebot to download these resources and build the final page view.

The sequence of steps includes:

  1. Initial HTML download
  2. Processing by the Web Rendering Service
  3. Resource fetching
  4. Final page construction
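
As a rough illustration of this sequence (a sketch, not how Googlebot is actually implemented), the Python snippet below downloads a page’s HTML and collects the script, stylesheet, and media URLs a renderer would then need to fetch. The page URL is hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Hypothetical page URL, used only for illustration.
PAGE_URL = "https://www.example.com/"

class ResourceCollector(HTMLParser):
    """Collects the sub-resource URLs a renderer would have to fetch."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower()
        if tag == "script" and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and "stylesheet" in rel and attrs.get("href"):
            self.resources.append(attrs["href"])
        elif tag in ("img", "video", "source") and attrs.get("src"):
            self.resources.append(attrs["src"])

# Step 1: initial HTML download.
html = urlopen(PAGE_URL).read().decode("utf-8", errors="replace")

# Steps 2-3: parse the HTML and identify the referenced resources,
# roughly what the Web Rendering Service triggers before final rendering.
collector = ResourceCollector()
collector.feed(html)
for resource in collector.resources:
    print("Would fetch:", urljoin(PAGE_URL, resource))
```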

Crawl Budget Management

Fetching these additional resources consumes the main website’s crawl budget. To reduce that cost, Google notes that “WRS attempts to cache every resource (JavaScript and CSS) used in the pages it renders.”

It’s crucial to understand that the WRS cache lasts up to 30 days and is not influenced by the HTTP caching rules established by developers.

This caching strategy aids in conserving a site’s crawl budget.
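
If you want to see which caching rules your own resources declare, a quick HEAD request will show them. The snippet below is a minimal sketch with a hypothetical resource URL; the point is that these headers steer browsers and CDNs, while the WRS cache described above is applied independently of them.

```python
from urllib.request import Request, urlopen

# Hypothetical resource URL, used only for illustration.
RESOURCE_URL = "https://www.example.com/assets/app.js"

# A HEAD request is enough to see the caching headers the server sends.
response = urlopen(Request(RESOURCE_URL, method="HEAD"))

# These headers control browser and CDN caching. According to Google,
# the WRS cache (up to 30 days) operates independently of them.
for header in ("Cache-Control", "Expires", "ETag", "Last-Modified"):
    print(f"{header}: {response.headers.get(header)}")
```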

Recommendations

The post gives site owners three recommendations for optimizing their crawl budget:

  1. Reduce Resource Use: Build pages with fewer resources so they deliver a good user experience while consuming less crawl budget during rendering.
  2. Host Resources Separately: Serve resources from a distinct hostname, such as a CDN or subdomain, to shift crawl-budget load away from your main site (see the sketch after this list).
  3. Use Cache-Busting Parameters Wisely: Changing resource URLs can prompt Google to re-fetch them even when the underlying content hasn’t changed, which wastes crawl budget.
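
A minimal sketch of recommendations 2 and 3: build asset URLs on a separate hostname and embed a release version in the path rather than appending an ever-changing query string, so a resource’s URL only changes when its content does. The CDN hostname, path layout, and version string below are all hypothetical.

```python
# Hypothetical separate hostname (CDN or subdomain) and release label.
CDN_HOST = "https://static.example-cdn.com"
RELEASE = "2024.12.01"  # bumped only when the assets actually change

def asset_url(path: str) -> str:
    """Build a stable, versioned URL for a static asset.

    Embedding the release in the path (instead of appending a frequently
    changing ?v=... query string) avoids prompting Google to re-fetch
    resources whose content hasn't changed.
    """
    return f"{CDN_HOST}/{RELEASE}/{path.lstrip('/')}"

print(asset_url("/css/site.css"))  # https://static.example-cdn.com/2024.12.01/css/site.css
print(asset_url("js/app.js"))      # https://static.example-cdn.com/2024.12.01/js/app.js
```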

Google also warns against blocking resource crawling with robots.txt, as doing so can prevent Google from rendering pages properly, which in turn can hurt how they rank.
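
One way to sanity-check this is to run your critical resource URLs through Python’s standard urllib.robotparser against your robots.txt. The domain and resource paths below are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical site; replace with your own domain to test.
robots = RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")
robots.read()

# Resources Google needs in order to render the page properly.
critical_resources = [
    "https://www.example.com/assets/app.js",
    "https://www.example.com/assets/site.css",
]

for url in critical_resources:
    if robots.can_fetch("Googlebot", url):
        print(f"OK: Googlebot may fetch {url}")
    else:
        print(f"WARNING: robots.txt blocks Googlebot from {url}")
```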

Monitoring Tools

The Search Central team suggests that the most effective way to monitor the resources crawled by Googlebot is by examining a site’s raw access logs.

You can identify Googlebot by its IP address using the ranges outlined in Google’s developer documentation.
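
As a starting point, the sketch below scans an access log for requests whose user agent contains “Googlebot” and verifies each client IP against the ranges Google publishes. It assumes a combined-format log at a typical Nginx path and the googlebot.json file referenced in Google’s documentation; adjust both for your setup.

```python
import ipaddress
import json
import re
from urllib.request import urlopen

# Published Googlebot IP ranges, as referenced in Google's developer docs
# at the time of writing; verify the URL and JSON shape for your use.
RANGES_URL = "https://developers.google.com/search/apis/ipranges/googlebot.json"
# Assumed log location and combined log format; adjust for your server.
LOG_PATH = "/var/log/nginx/access.log"

prefixes = json.load(urlopen(RANGES_URL))["prefixes"]
networks = [
    ipaddress.ip_network(p.get("ipv4Prefix") or p.get("ipv6Prefix"))
    for p in prefixes
]

def is_googlebot_ip(ip: str) -> bool:
    """True if the IP falls inside one of Google's published ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

# Combined log format: IP is the first field, user agent the last quoted field.
line_re = re.compile(r'^(\S+) .* "([^"]*)"$')

with open(LOG_PATH) as log:
    for line in log:
        match = line_re.match(line.strip())
        if not match:
            continue
        ip, user_agent = match.groups()
        if "Googlebot" in user_agent:
            status = "verified" if is_googlebot_ip(ip) else "UNVERIFIED (possible spoof)"
            print(f"{ip}\t{status}\t{user_agent}")
```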

Significance of Understanding Googlebot Processes

Google’s post highlights three key factors that influence how Google discovers and processes a site’s content:

  • Resource fetching directly impacts crawl budget, which is why hosting scripts and styles on a separate hostname such as a CDN can help.
  • Google caches resources for 30 days irrespective of HTTP cache settings, aiding in preserving crawl budget.
  • Blocking essential resources in robots.txt can impede Google’s ability to properly render pages.

Comprehending these mechanisms enables SEOs and developers to make informed decisions regarding resource hosting and accessibility, directly influencing Google’s ability to crawl and index their sites effectively.

Frequently Asked Questions

Q: How does Googlebot handle page resources?

A: Googlebot downloads HTML from the main URL and utilizes the Web Rendering Service to fetch additional resources like JavaScript and CSS.

Q: What is crawl budget management?

A: Crawl budget management involves optimizing resource consumption to prevent depletion of the primary website’s crawl budget.

Q: How long does the WRS cache last?

A: The WRS cache lasts up to 30 days and is independent of HTTP cache settings.

Q: How can site owners optimize their crawl budget?

A: Site owners can optimize their crawl budget by reducing resource usage, hosting resources separately, and using cache-busting parameters judiciously.

Q: Why is blocking resource crawling with robots.txt risky?

A: Blocking resource crawling with robots.txt can hinder Google’s ability to properly render pages, impacting content indexing and ranking.


Featured Image: ArtemisDiana/Shutterstock
