SerpGem
Crawlers & indexing tools

The plumbing that decides whether Google crawls, indexes, and consolidates your pages correctly. These 9 tools generate, test, and validate: robots.txt (per RFC 9309), XML sitemaps (per sitemaps.org), canonical tags, noindex directives, hreflang matrices, and redirect rule syntax for Apache/Nginx/Next.js/Netlify.

Crawlers & indexing questions

What is RFC 9309 and why does it matter for robots.txt?
RFC 9309 (published September 2022) put the Robots Exclusion Protocol on the IETF standards track as a Proposed Standard, after 28 years as a de facto convention. Key points it codifies: (1) the file lives at /robots.txt, is served as text/plain, and is UTF-8 encoded; (2) crawlers must be able to parse at least 500 KiB of the file, and content beyond a crawler's limit may be ignored; (3) Allow/Disallow precedence: the longest (most specific) matching rule wins, and Allow should win a tie; (4) User-agent matching is case-insensitive. Our Robots.txt Tester and Generator both honor RFC 9309 strictly.
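The precedence rule in point (3) is easy to get wrong, so here is a minimal Python sketch of it. It is an illustration only, not the code behind our tools: the function names are hypothetical, rule length is counted in characters rather than octets, and only the `*` and `$` special characters are handled.

```python
import re


def _pattern_to_regex(pattern: str) -> re.Pattern:
    # '*' matches any run of characters; a trailing '$' anchors the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Decide whether `path` may be crawled.

    `rules` holds ("allow" | "disallow", pattern) pairs from the single
    user-agent group that applies to the crawler (group selection is
    assumed to have happened already).
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for kind, pattern in rules:
        if not pattern:  # an empty pattern matches nothing
            continue
        if _pattern_to_regex(pattern).match(path):
            is_allow = kind.lower() == "allow"
            length = len(pattern)
            # The longest (most specific) match wins; on a tie, allow wins.
            if length > best_len or (length == best_len and is_allow):
                best_len = length
                allowed = is_allow
    return allowed


rules = [("disallow", "/private/"), ("allow", "/private/press/")]
print(is_allowed(rules, "/private/press/kit.html"))  # longer Allow rule applies
print(is_allowed(rules, "/private/notes.html"))      # only the Disallow rule matches
```

Rule order in the file does not matter under this scheme, which is the main practical difference from first-match parsers.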
Should I use robots.txt or noindex to block a page?
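Use noindex (a robots meta tag or an X-Robots-Tag header) whenever the goal is 'not in the index.' Robots.txt only blocks crawling, not indexing: if a blocked page has inbound links, Google can still index its URL from anchor text alone, with no content. The two do not combine: a crawler has to fetch the page to see a noindex, so a URL that is disallowed in robots.txt never has its noindex read. Use robots.txt for: conserving crawl budget on low-value URLs and keeping crawlers away from endpoints they have no reason to request (it is not access control, so protect genuinely sensitive endpoints with authentication). Use noindex for: filter pages, staging, low-quality content.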
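To check which signal a page actually sends, a small Python sketch like the one below can help. It is a simplified illustration with hypothetical function names: it looks only for the literal `noindex` directive, ignores user-agent prefixes inside X-Robots-Tag values, and takes the robots.txt verdict as an input rather than computing it.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(headers: dict[str, str], html: str) -> bool:
    """True if the response carries noindex in a header or a robots meta tag."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            if "noindex" in [d.strip().lower() for d in value.split(",")]:
                return True
    parser = _RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


def noindex_is_effective(blocked_by_robots: bool, headers: dict[str, str], html: str) -> bool:
    # A crawler that is not allowed to fetch the page never sees its noindex.
    return (not blocked_by_robots) and has_noindex(headers, html)


print(has_noindex({"X-Robots-Tag": "noindex, nofollow"}, ""))
print(noindex_is_effective(True, {}, '<meta name="robots" content="noindex">'))
```

The second call is the case to watch for: the page carries a noindex, but the robots.txt block keeps it from ever being read.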
How do I implement hreflang for 20+ regional variants?
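Every page should reference every regional variant, including itself, and the references must be reciprocal; an x-default fallback is recommended for users who match none of them. For 20 variants, that's 21 <link> tags per page (20 regions + x-default). At scale, that much markup is easier to manage outside the HTML, for example as HTTP Link headers (served from Cloudflare Workers or Next.js middleware). Our hreflang Generator outputs both formats for any number of URL+region pairs. Each code is an ISO 639-1 language code, optionally followed by an ISO 3166-1 alpha-2 region (for example en or en-GB), in line with IETF BCP 47; a region code on its own is not valid.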
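As a rough illustration of the 20 + 1 arithmetic, the Python sketch below builds both output formats from one mapping. The function names and the example.com URLs are hypothetical, and this is not the generator's actual implementation.

```python
from html import escape


def hreflang_link_tags(variants: dict[str, str], x_default: str) -> list[str]:
    """One <link> per variant plus x-default; emit the same set on every variant page."""
    tags = [
        f'<link rel="alternate" hreflang="{escape(code)}" href="{escape(url)}" />'
        for code, url in variants.items()
    ]
    tags.append(f'<link rel="alternate" hreflang="x-default" href="{escape(x_default)}" />')
    return tags


def hreflang_link_header(variants: dict[str, str], x_default: str) -> str:
    """The same annotations as the value of a single HTTP Link header."""
    parts = [f'<{url}>; rel="alternate"; hreflang="{code}"' for code, url in variants.items()]
    parts.append(f'<{x_default}>; rel="alternate"; hreflang="x-default"')
    return ", ".join(parts)


# Hypothetical three-variant site; a 20-variant site works the same way.
variants = {
    "en-US": "https://example.com/us/",
    "en-GB": "https://example.com/uk/",
    "de-DE": "https://example.com/de/",
}
for tag in hreflang_link_tags(variants, "https://example.com/"):
    print(tag)
print(hreflang_link_header(variants, "https://example.com/"))
```

Because the full set includes the current page, the same output can be reused unchanged on every variant, which keeps the annotations reciprocal by construction.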
What's the right canonical setup for pagination?
Self-referencing: each of pages 2, 3, and so on has rel=canonical pointing to itself (NOT to page 1). Google confirmed in 2019 that it no longer uses rel=prev/next and handles pagination through other signals. Pointing every page's canonical at page 1 tells Google that pages 2+ are duplicates of page 1, so they can drop out of the index, and items linked only from those pages may be harder to discover. Exception: if pages 2+ are near-duplicate thin content (which shouldn't exist), then consolidate.
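A self-referencing canonical for a paginated listing might be generated along these lines. This Python sketch makes two assumptions that are not stated above: pages are addressed with a `page` query parameter, and page 1 lives at the bare listing URL. The function name is hypothetical.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def paginated_canonical(listing_url: str, page: int, param: str = "page") -> str:
    """Return a canonical tag that points at the requested page itself."""
    if page < 1:
        raise ValueError("page must be 1 or greater")
    parts = urlsplit(listing_url)
    # Keep other query parameters, drop any existing page parameter.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param]
    if page > 1:
        query.append((param, str(page)))  # assumption: page 1 has no page parameter
    url = urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))
    return f'<link rel="canonical" href="{escape(url)}" />'


print(paginated_canonical("https://example.com/blog", 3))
```

Whether sort or filter parameters belong in the canonical URL is a separate decision; this sketch simply preserves whatever the listing URL already carries.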