SerpGem
XML Sitemap Parser

Paste sitemap XML — get URL count, lastmod gaps, priority issues, and domain breakdown.

How to use this tool: 3 quick steps
  1. Find your sitemap

    Most live at /sitemap.xml. For multi-sitemap setups (over 50,000 URLs), paste any sub-sitemap.
  2. Copy the full XML

    Open the sitemap URL in your browser, right-click the page → View Page Source, then select all and copy (Ctrl+A → Ctrl+C in Chrome). Paste the XML below.
  3. Read the breakdown

    We count URLs, group by directory, surface lastmod recency, and flag oversized sitemaps (over 50,000 URLs or 50MB raw).
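
The checks described in step 3 can be approximated with a few lines of standard-library Python. This is a minimal sketch of that kind of analysis, not this tool's actual implementation; the function name `analyze_sitemap`, the result keys, and the "group by first path segment" rule are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlparse

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50MB, measured on the uncompressed XML


def analyze_sitemap(xml_text: str) -> dict:
    """Summarize a <urlset> sitemap: URL count, directory groups, lastmod gaps, size flag."""
    root = ET.fromstring(xml_text)
    by_directory = Counter()
    missing_lastmod = 0
    url_count = 0

    for node in root.findall("sm:url", NS):
        url_count += 1
        loc = (node.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        segments = [s for s in urlparse(loc).path.split("/") if s]
        # Assumed grouping rule: first path segment; root and single-segment URLs go under "/".
        key = f"/{segments[0]}/" if len(segments) > 1 else "/"
        by_directory[key] += 1
        if not (node.findtext("sm:lastmod", default="", namespaces=NS) or "").strip():
            missing_lastmod += 1

    size_bytes = len(xml_text.encode("utf-8"))
    return {
        "url_count": url_count,
        "by_directory": dict(by_directory),
        "missing_lastmod": missing_lastmod,
        "oversized": url_count > MAX_URLS or size_bytes > MAX_BYTES,
    }
```

Calling `analyze_sitemap(pasted_xml)` on a sub-sitemap returns a dictionary you can print or feed into a report. A sitemap index (`<sitemapindex>`) contains no `<url>` entries, so this sketch would report zero URLs for it.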

Sitemap Health & Crawl Budget Reference

How to analyze and fix your XML sitemap for maximum crawl efficiency

Google introduced the Sitemaps protocol in 2005, and in November 2006 Google, Yahoo, and Microsoft announced joint support and published it at sitemaps.org. It is maintained there as an open protocol, not as an IETF RFC (RFC 8288 covers web linking, not sitemaps). Google Search Console allows up to 500 sitemap files per verified property and processes each independently, reporting crawl and index status per file. A well-maintained sitemap accelerates discovery of new content, communicates freshness via lastmod dates, and helps Google spend its crawl budget (the amount of crawling it allocates to your site) on URLs that matter. Many sites have silent sitemap errors: URLs returning non-200 codes, missing lastmod values, or noindex pages that waste crawl requests.

Finding your sitemap and the Sitemap directive

Most sitemaps live at yoursite.com/sitemap.xml. If absent, check robots.txt for a `Sitemap:` directive, which crawlers that support the protocol read whenever they fetch robots.txt. WordPress with Yoast SEO generates a sitemap index automatically at /sitemap_index.xml. Next.js App Router generates a sitemap via app/sitemap.ts (Next.js 13.3+). Shopify generates one at /sitemap.xml covering products, collections, pages, and blogs. Astro, Nuxt, and Gatsby all have sitemap integrations. On large e-commerce sites, platform-generated sitemaps (for example on Shopify or Magento) can include out-of-stock or noindex products; check the output and filter these out via sitemap configuration before submission.
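
To illustrate the robots.txt fallback, here is a minimal sketch that pulls `Sitemap:` directives out of robots.txt text you have already fetched. The function name `sitemap_urls_from_robots` is an assumption for this example, and the comment handling is simplified (it treats everything after `#` as a comment, which is fine for sitemap URLs because they do not carry fragments).

```python
def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs listed in Sitemap: directives, in file order."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "Key: value" on the first colon only,
        # so the colon inside "https://" stays part of the value.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls
```

The directive name is matched case-insensitively, and a robots.txt file may list several sitemaps, which is why the function returns a list.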

lastmod: the freshness signal Google uses for recrawl scheduling

The `<lastmod>` element specifies when the page content was last modified, using W3C Datetime format (a profile of ISO 8601: YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS+00:00). Google's Gary Illyes reportedly said in 2022 that Googlebot uses lastmod when scheduling recrawls, so pages with recent lastmod dates tend to be recrawled sooner. The critical caveat: Google's documentation says it uses lastmod only when the value is consistently and verifiably accurate. If your dates are always set to today, or never updated, Google learns to ignore them. Set lastmod programmatically from your CMS's actual `updated_at` field or your build system's last-modified timestamp, not the server response date. Pages without lastmod give Google no freshness hint, so recrawl timing falls back to its own scheduling.
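
As a small sketch of setting lastmod from a real modification timestamp, the helper below converts a `datetime` (for example a CMS `updated_at` value) into the YYYY-MM-DDTHH:MM:SS+00:00 form. The name `format_lastmod` is hypothetical, and treating timezone-naive values as UTC is an assumption you should adjust to match how your system stores timestamps.

```python
from datetime import datetime, timezone


def format_lastmod(updated_at: datetime) -> str:
    """Format a modification time as a W3C Datetime string in UTC, to the second."""
    if updated_at.tzinfo is None:
        # Assumption: naive timestamps are already in UTC.
        updated_at = updated_at.replace(tzinfo=timezone.utc)
    return updated_at.astimezone(timezone.utc).isoformat(timespec="seconds")
```

For example, `format_lastmod(datetime(2024, 3, 5, 14, 30))` yields `2024-03-05T14:30:00+00:00`. If you only track dates, `some_date.isoformat()` gives the shorter YYYY-MM-DD form, which is also valid.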

Priority: relative crawl guidance (not a ranking signal)

The priority field accepts values from 0.0 to 1.0 and hints to crawlers about the relative importance of pages within your sitemap. The default value is 0.5 when unspecified. It is a hint, not a directive, and it is not a ranking signal. Google's sitemap documentation goes further and states that Google ignores `<priority>` (and `<changefreq>`) values, so any effect on crawl order is limited to other crawlers that choose to honor the hint. If you set it anyway, a common allocation is: homepage = 1.0, key landing pages and pillar content = 0.8–0.9, blog posts and category pages = 0.6–0.7, archived or low-traffic content = 0.3–0.4. Because priority is relative, setting all pages to 0.5 or 1.0 is equivalent to providing no signal.

Sitemap index: the 50,000 URL protocol limit

The sitemaps.org protocol specifies a hard limit of 50,000 URLs and 50MB uncompressed per individual sitemap file. Sites exceeding this use a sitemap index file (sitemapindex XML format): a master document linking to multiple child sitemap files, each within the limit. Google Search Console reports crawl and indexing status per individual sitemap file, making content-type-separated sitemaps (posts, products, categories, images, videos) easier to diagnose. Gzip compression is supported and reduces transfer size, but the 50MB limit applies to the uncompressed file, so compressing does not raise the URL ceiling. The index lists sitemap files rather than page URLs, and it has its own cap of 50,000 child sitemaps.
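
The split-and-index pattern can be sketched as follows. This is an illustrative outline rather than a production generator: `chunk_urls` and `build_sitemap_index` are hypothetical names, the child sitemap URLs are whatever you choose to publish, and the sketch checks only the 50,000-URL limit, not the 50MB size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000


def chunk_urls(urls: list[str], size: int = MAX_URLS_PER_FILE) -> list[list[str]]:
    """Split a URL list into groups that each fit in one sitemap file."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(child_sitemap_urls: list[str]) -> str:
    """Return a <sitemapindex> document pointing at the given child sitemaps."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_XMLNS)
    for child in child_sitemap_urls:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = child
    return ET.tostring(root, encoding="unicode")
```

With 120,001 URLs, `chunk_urls` produces three groups (50,000, 50,000, and 20,001), each written to its own file and then listed in the index.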

What to exclude: noindex, canonicals, and redirects

Including non-indexable URLs wastes crawl budget and sends contradictory signals to Google. Exclude: pages with `<meta name="robots" content="noindex">` or `X-Robots-Tag: noindex` headers, pages with canonical tags pointing to a different URL (submit only the canonical), URLs returning 301 or 302 redirects (submit only the destination), paginated pages (pages 2+) if the primary content is on page 1, and parameter-based URL variations (?sort=, ?filter=) if the canonical is the clean URL. A Screaming Frog crawl of your sitemap URLs filtered for non-200 status or noindex will surface all violations.
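
If you export crawl results yourself, the exclusion rules above reduce to a short filter. The sketch below assumes a hypothetical `CrawledPage` record with a status code, a noindex flag, and the canonical URL found on the page; these field names are illustrative and do not correspond to any crawler's export format. Pagination and parameter rules are left out because they depend on your URL scheme.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawledPage:
    url: str
    status: int
    noindex: bool = False            # meta robots or X-Robots-Tag said noindex
    canonical: Optional[str] = None  # canonical URL declared by the page, if any


def exclusion_reason(page: CrawledPage) -> Optional[str]:
    """Return why a URL should be dropped from the sitemap, or None to keep it."""
    if page.status in (301, 302, 307, 308):
        return "redirect"
    if page.status != 200:
        return "non-200"
    if page.noindex:
        return "noindex"
    # Compare ignoring a trailing slash; stricter normalization may be needed.
    if page.canonical and page.canonical.rstrip("/") != page.url.rstrip("/"):
        return "canonical points elsewhere"
    return None


def keep_for_sitemap(pages: list[CrawledPage]) -> list[str]:
    """URLs that pass every check."""
    return [p.url for p in pages if exclusion_reason(p) is None]
```

Returning a reason string rather than a boolean makes it easy to report how many URLs were dropped for each cause.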

Image and video sitemaps: extended protocol for rich media

Google supports two sitemap extensions beyond standard HTML pages. Image sitemaps use the `<image:image>` namespace (xmlns:image="http://www.google.com/schemas/sitemap-image/1.1") to declare images embedded in pages, helping Google discover images loaded via JavaScript or lazy-loading that its crawler may not execute. Video sitemaps use `<video:video>` (xmlns:video="http://www.google.com/schemas/sitemap-video/1.1") to declare video metadata including title, description, thumbnail, duration, and publication date. Both extensions are embedded within standard `<url>` entries. For e-commerce sites, an image sitemap can help product images get discovered for image search, though the size of any traffic gain varies by site.
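
A minimal sketch of generating image entries inside standard `<url>` elements is shown below, using the two namespaces quoted above. The function name `build_image_sitemap` and the page-to-images mapping are assumptions for illustration; only `<image:loc>` is emitted.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMG = "http://www.google.com/schemas/sitemap-image/1.1"

# Serialize the sitemap namespace as the default and the image one as "image:".
ET.register_namespace("", SM)
ET.register_namespace("image", IMG)


def build_image_sitemap(pages: dict[str, list[str]]) -> str:
    """Map each page URL to its image URLs and return a <urlset> document."""
    urlset = ET.Element(f"{{{SM}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SM}}}url")
        ET.SubElement(url, f"{{{SM}}}loc").text = page_url
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMG}}}image")
            ET.SubElement(image, f"{{{IMG}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")
```

Note that `ET.register_namespace` changes module-wide serialization prefixes, so in a larger program it is worth calling it once at startup.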

Pro Tips

Submit to Search Console after every major content update

Submit your sitemap at Google Search Console → Sitemaps → Add a new sitemap. Google reports crawl errors and indexing status per file, so a multi-file sitemap structure with one file per content type makes diagnosing indexing issues much faster. Consider resubmitting when you publish more than 10–20 new URLs or restructure significant sections. Google often processes resubmitted sitemaps within 24–48 hours, though timing is not guaranteed, which can speed up discovery compared to passive crawl scheduling.

Audit sitemap URLs with a crawler before submitting

Before submitting to Search Console, verify that every URL in your sitemap returns HTTP 200 and is indexable. Run your sitemap through Screaming Frog (free up to 500 URLs) or Sitebulb: Load Sitemap → crawl all URLs → filter for non-200 status, noindex directives, and canonical mismatches. Any URL failing these checks should be removed from the sitemap. Submitting a sitemap with 15% non-indexable URLs teaches Google your sitemap has low signal quality and may reduce the crawl priority it assigns to your new content.

Automate sitemap regeneration in your publish pipeline

A static sitemap updated manually gets stale within weeks on any active site. Configure automatic regeneration: Next.js app/sitemap.ts rebuilds on every deployment, WordPress Yoast updates on every publish event, Shopify regenerates on product/collection changes. For headless or custom architectures, trigger a sitemap rebuild as a post-deploy CI/CD step — GitHub Actions, Vercel Build Hooks, or Netlify Build Plugins all support this. The sitemap should always reflect the exact live page inventory at deploy time.
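
For a custom pipeline, the post-deploy step can be as small as a script that rebuilds the file from your page inventory. The sketch below assumes you can produce a list of `(url, last_modified_date)` pairs from your own data source; `build_urlset` is a hypothetical name, and writing the result to disk or uploading it is left to your CI job.

```python
import xml.etree.ElementTree as ET
from datetime import date


def build_urlset(pages: list[tuple[str, date]]) -> str:
    """Return a complete sitemap document for (url, last_modified) pairs."""
    root = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, modified in pages:
        url = ET.SubElement(root, "url")
        # ElementTree escapes characters such as "&" in the URL on output.
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = modified.isoformat()
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Building the XML through a library rather than string concatenation avoids a common failure mode: unescaped ampersands in query strings, which make the whole sitemap unparseable.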

Frequently Asked Questions

Does Google require a sitemap to index a site?
No. Google discovers pages through links and does not require a sitemap. However, a sitemap can speed up discovery of new pages that have few or no inbound links, helps Google learn about important pages even if internal linking is sparse, and lets you communicate lastmod dates. Google's documentation says a sitemap is most likely to help for: new sites with few external links, large sites, sites with rich media content (video, images), and sites that update frequently. For small static sites (Google describes roughly 500 pages or fewer as small) with good internal linking, a sitemap provides marginal benefit.
How many URLs can be in a sitemap?
The sitemaps.org protocol (supported by Google, Bing, and Yahoo/Verizon) specifies a maximum of 50,000 URLs and 50MB uncompressed per sitemap file. Google Search Console allows up to 500 sitemap files per property and processes each independently. For sites with more than 50,000 indexable pages, use a sitemap index file that links to multiple child sitemaps — each staying within the 50,000/50MB limits. Gzip compression is supported by all major search engines, typically reducing sitemap file size by 70–85%.
Why are some of my sitemap URLs not being indexed?
The most common reasons (in order of frequency): (1) The page has a noindex directive (check both meta robots and X-Robots-Tag header), (2) The page has a canonical tag pointing to a different URL — only the canonical gets indexed, (3) The URL returns a redirect or error code — Googlebot does not index non-200 responses listed in sitemaps, (4) The content is thin, duplicate, or low-quality — Google chooses not to index it regardless of sitemap inclusion, (5) The page was recently added and Googlebot has not yet processed the sitemap update. Use Google Search Console's URL Inspection tool for the specific exclusion reason for any URL.
Should I include image URLs in my sitemap?
Yes, if image search is a relevant traffic source for your site. Image sitemaps use the `<image:image>` extension within standard `<url>` entries to declare image URLs via `<image:loc>`. The title, caption, and geographic location tags were part of the extension originally, but Google deprecated them in 2022. Submitting an image sitemap helps Googlebot discover images rendered by JavaScript or loaded lazily that its crawler may not fully execute. E-commerce (product images), photography, recipe, and news sites see the most benefit. Image sitemaps do not require a separate file: they are extensions of your existing HTML page sitemap entries.