Content Similarity Checker
Compare two pieces of text and measure how similar they are. Uses Jaccard similarity to calculate shared vocabulary. Shows shared words, unique words per text, and common phrases.
Similarity Guide
When duplicate content actually hurts you
Google filters pages that are substantially similar to other indexed pages: it typically shows one version and leaves the others out of results or ranks them lower. There is no published threshold. As a rough rule of thumb, word-level similarity above 40-50% is where problems tend to start, especially if the competing page is on another domain.
The Jaccard similarity method
This tool uses Jaccard similarity: shared unique words ÷ total unique words across both texts. It measures vocabulary overlap, not sentence structure. 10% = mostly different topics. 50% = significant overlap. 80%+ = essentially the same content.
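The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function names and the word-splitting rule (lowercase letters, digits, and apostrophes) are assumptions, and stop words are not filtered here.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of unique words (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Shared unique words divided by total unique words across both texts."""
    words_a, words_b = tokenize(text_a), tokenize(text_b)
    union = words_a | words_b
    if not union:
        return 0.0  # both texts empty: report 0 rather than divide by zero
    return len(words_a & words_b) / len(union)

score = jaccard_similarity("cheap red running shoes", "cheap blue running shoes")
print(f"{score:.0%}")  # 3 shared words / 5 total unique words -> 60%
```

Because the score is built from sets, repeating a word many times does not change it, and word order is ignored entirely.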
Same-site duplicate content
Google can usually resolve same-site duplicates when you mark the preferred version with a canonical tag. What it handles less well is thin content: pages that have a low word count and are also highly similar to other pages. Check similarity and word count together.
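One way to combine the two checks is sketched below. The cutoffs (300 words, 60% similarity) and the function names are illustrative assumptions, not values Google or this tool publishes; adjust them to your own content.

```python
import re

def words(text: str) -> list[str]:
    """Return every word in the text, lowercased (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def is_thin_duplicate(page: str, other: str,
                      min_words: int = 300,
                      max_similarity: float = 0.6) -> bool:
    """True only when the page is BOTH short and highly similar to the other page.

    min_words and max_similarity are hypothetical cutoffs chosen for illustration.
    """
    page_words, other_words = words(page), words(other)
    set_a, set_b = set(page_words), set(other_words)
    union = set_a | set_b
    similarity = len(set_a & set_b) / len(union) if union else 0.0
    return len(page_words) < min_words and similarity > max_similarity

print(is_thin_duplicate("blue widget with free shipping",
                        "blue widget with fast shipping"))  # True: 5 words, 4/6 overlap
```

A long page that overlaps heavily with another, or a short page with distinct vocabulary, is not flagged by this check; only the combination is.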
Cross-domain plagiarism check
Paste your content and a competitor's to see if they've copied you (or you've inadvertently mirrored them). Above 40% similarity on topical content warrants checking with a full plagiarism tool like Copyscape.
URL parameter duplicates
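E-commerce sites often create duplicates via sort/filter parameters (?sort=price, ?color=red). These pages can score 90%+ similarity. Use canonical tags that point each variant at the main URL to consolidate them. Google retired the URL Parameters tool in Search Console in 2022, so parameter handling there is no longer an option.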
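To find which parameterized URLs collapse to the same page, you can strip the presentation-only parameters and group by what remains. The sketch below uses Python's standard urllib.parse; the PRESENTATION_PARAMS set and the canonical_url name are assumptions for illustration, and your site's parameter list will differ.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters that change presentation but not the content.
PRESENTATION_PARAMS = {"sort", "color", "order", "view"}

def canonical_url(url: str) -> str:
    """Drop presentation-only query parameters and any fragment from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in PRESENTATION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(canonical_url("https://example.com/shoes?sort=price&page=2"))
# https://example.com/shoes?page=2
print(canonical_url("https://example.com/shoes?color=red"))
# https://example.com/shoes
```

URLs that map to the same result are candidates for sharing one canonical tag. Parameters that genuinely change the content, such as page in this example, are kept.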
Intentional vs accidental similarity
Product spec pages are legitimately similar because they cover the same product attributes. What Google flags is page-level intent overlap — two pages trying to rank for the same keyword with near-identical content.
How to differentiate similar content
Check the "unique to A" and "unique to B" word lists. Add more of those unique terms to whichever page you want to rank. Different vocabulary signals different intent, which is what Google looks for.
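The three word lists are plain set operations. A minimal sketch, assuming a simple lowercase tokenizer and hypothetical key names (the tool's own output format may differ):

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of unique words (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def vocabulary_diff(text_a: str, text_b: str) -> dict[str, list[str]]:
    """Split the combined vocabulary into shared words and words unique to each text."""
    words_a, words_b = tokenize(text_a), tokenize(text_b)
    return {
        "shared": sorted(words_a & words_b),       # intersection
        "unique_to_a": sorted(words_a - words_b),  # in A only
        "unique_to_b": sorted(words_b - words_a),  # in B only
    }

diff = vocabulary_diff("trail running shoes", "road running shoes")
print(diff["unique_to_a"])  # ['trail']
print(diff["unique_to_b"])  # ['road']
```

Here "trail" and "road" are the terms that distinguish the two pages, so those are the themes to expand on each one.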
Pro Tips
Before worrying about competitors, paste two similar pages from your own site. If they score above 60%, consolidate them or make them more distinct. Thin duplication hurts your whole domain.
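If you have more than two pages to check, the same comparison can be run over every pair. This is a sketch under assumptions: the page texts are already extracted into a dictionary, the function names are hypothetical, and the 60% cutoff simply mirrors the tip above.

```python
import re
from itertools import combinations

def jaccard(text_a: str, text_b: str) -> float:
    """Shared unique words divided by total unique words (assumed tokenizer)."""
    words_a = set(re.findall(r"[a-z0-9']+", text_a.lower()))
    words_b = set(re.findall(r"[a-z0-9']+", text_b.lower()))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def flag_similar_pages(pages: dict[str, str],
                       threshold: float = 0.6) -> list[tuple[str, str, float]]:
    """Return every pair of pages scoring above the threshold, highest score first."""
    flagged = []
    for (name_a, text_a), (name_b, text_b) in combinations(pages.items(), 2):
        score = jaccard(text_a, text_b)
        if score > threshold:
            flagged.append((name_a, name_b, score))
    return sorted(flagged, key=lambda item: item[2], reverse=True)

pages = {
    "/widgets-a": "blue widget free shipping today",
    "/widgets-b": "blue widget free shipping now",
    "/contact": "contact our support team",
}
for name_a, name_b, score in flag_similar_pages(pages):
    print(f"{name_a} vs {name_b}: {score:.0%}")  # /widgets-a vs /widgets-b: 67%
```

The number of pairs grows quadratically with the number of pages, so this approach suits a section of a site rather than a full crawl.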
Two articles about "SEO basics" will share vocabulary — SEO, ranking, keywords, Google — without being duplicates. Context similarity matters more than pure word overlap. Use your judgment above 40%.
Common words like "the", "and", "is" are excluded from the comparison. The score reflects meaningful content vocabulary only, not filler words.
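The effect of excluding filler words is easy to see in a small example. The STOP_WORDS set below is a short hypothetical list for illustration; the tool's actual list is not published here and is likely longer.

```python
import re

# Hypothetical stop-word list; a real one would contain many more entries.
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "to", "was", "with",
})

def content_words(text: str) -> set[str]:
    """Unique lowercase words in the text, with stop words removed."""
    return {w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS}

def jaccard_without_stop_words(text_a: str, text_b: str) -> float:
    """Jaccard similarity computed over content words only."""
    words_a, words_b = content_words(text_a), content_words(text_b)
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

text_a = "The cat is on the mat"
text_b = "The dog is on the rug"
print(sorted(content_words(text_a)))                           # ['cat', 'mat']
print(f"{jaccard_without_stop_words(text_a, text_b):.0%}")     # 0%
```

With filler words left in, these two sentences would share "the", "is", and "on" and score 3 out of 7 unique words, about 43%, despite having no content words in common.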
Frequently Asked Questions
- Is this the same as Copyscape?
- No — Copyscape searches the entire web for matches. This tool compares two specific texts you paste in. Use this for: checking your own pages against each other, verifying a freelancer's submission, comparing against a known competitor page.
- What similarity percentage triggers a Google penalty?
- Google doesn't have a published threshold. The risk starts meaningfully above 40% for page-level similarity (whole pages targeting the same keyword). Identical boilerplate across pages is handled differently — Google just picks the canonical.
- Why does my score seem high even for different topics?
- Industry-specific vocabulary creates natural overlap. Two healthcare articles will both use "patient", "treatment", "diagnosis". This is normal and expected. Check the "shared words" list — if they're all generic, the pages are actually distinct in content.
- How is this different from a plagiarism checker?
- Plagiarism checkers compare against a database of known sources. This tool compares two specific texts. For web-scale duplicate detection, use Copyscape or Siteliner. For comparing specific pages or drafts, use this.