Word Frequency Counter

Rank every word in your content by how often it appears. Surfaces overused terms, keyword stuffing, and which topics your content actually covers.

Word Frequency, TF-IDF & Content Analysis Guide

What word frequency reveals about your content and SEO alignment

Word frequency analysis is a foundational technique in computational linguistics and information retrieval, the academic disciplines behind modern search engines. TF-IDF (Term Frequency-Inverse Document Frequency) builds on the inverse document frequency weighting introduced by Karen Spärck Jones in 1972, and search engines have long used weighting of this kind as one signal of which words are semantically central to a document. The formula is TF-IDF(t, d) = TF(t, d) × log(N / df(t)), where TF(t, d) is how often term t appears in document d, N is the total number of documents in the corpus, and df(t) is the number of documents containing the term. A word that appears frequently in your article but rarely across the web carries a high TF-IDF weight, signaling that it is a defining characteristic of your content's topic.
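As an illustration of the formula above, here is a minimal Python sketch. The function name `tf_idf`, the three-document toy corpus, and the use of raw counts for TF with a natural logarithm are assumptions made for this example; real search engines use their own undisclosed variants.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF = TF(t, d) * log(N / df(t)), using raw counts for TF."""
    tf = Counter(doc_tokens)[term]                      # occurrences in this document
    df = sum(1 for tokens in corpus if term in tokens)  # documents containing the term
    if df == 0:
        return 0.0                                      # term never seen in the corpus
    return tf * math.log(len(corpus) / df)

# Hypothetical toy corpus: each document is a list of lowercase tokens.
corpus = [
    "creatine supports muscle energy".split(),
    "the muscle needs energy".split(),
    "the cat sat on the mat".split(),
]
doc = corpus[0]

for term in ("creatine", "muscle"):
    print(term, round(tf_idf(term, doc, corpus), 3))
```

In this toy corpus "creatine" appears in one document and "muscle" in two, so "creatine" receives the higher weight even though both occur once in the document being scored.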

Keyword stuffing: the 5% threshold and Panda history

Google's Panda algorithm, announced on February 24, 2011, targeted thin, low-quality content, a category that often included pages with unnatural keyword density. Before Panda, SEOs routinely pushed keyword density to 5–10% for primary terms. Post-Panda, a common rule of thumb is that content above 4–5% frequency for any single non-stop word risks triggering quality filters. The more precise modern indicator is naturalness: human-written English text on a topic has a characteristic vocabulary distribution that approximately follows Zipf's Law. Stuffed content produces an abnormal spike (one term at 8% while typical co-occurring terms sit at 0.1–0.5%) that can be detected algorithmically.

Zipf's Law: the natural distribution of language

Zipf's Law (George Kingsley Zipf, 1935) states that in natural language, the most frequent word appears approximately 2× as often as the second most frequent, 3× as often as the third, and so on — a power law distribution. In English text: 'the' appears ~7% of the time, 'of' ~3.5%, 'and' ~2.8%, and so on. Well-written content on any topic follows a similar power law distribution for its content words. Content that violates this distribution — with one word appearing 3–4× more than expected by the power law — exhibits the statistical fingerprint of keyword stuffing that modern NLP-based spam detectors identify.
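One rough way to inspect a text against this pattern is to compare each word's observed count with the count Zipf's Law would predict from the top-ranked word. The sketch below is an illustration under stated assumptions: the function name `zipf_ratios` is hypothetical, and using the top word's count divided by rank as the expected value is a simplification, not how any particular spam detector works.

```python
from collections import Counter

def zipf_ratios(tokens, top_n=10):
    """Return (rank, word, count, observed/expected) for the top_n words.

    Expected count at rank r is top_count / r, the idealized Zipf curve.
    """
    ranked = Counter(tokens).most_common(top_n)
    if not ranked:
        return []
    top_count = ranked[0][1]
    rows = []
    for rank, (word, count) in enumerate(ranked, start=1):
        expected = top_count / rank
        rows.append((rank, word, count, count / expected))
    return rows

# Hypothetical token list that follows the idealized curve exactly.
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
for row in zipf_ratios(tokens):
    print(row)
```

Ratios far below 1.0 for every word after the first suggest that the top word is unusually dominant, which is the spike pattern described above. Short texts deviate from the curve for innocent reasons, so treat the ratios as a prompt for review rather than a verdict.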

Topic gap detection via frequency distribution comparison

The practical application of TF-IDF analysis: export a top-ranking competitor's article frequency distribution and compare it against yours. Words with high TF-IDF in their article but low or zero frequency in yours are semantic gaps — concepts the topic demands that your article does not cover. A 2019 Searchmetrics study found that the #1 ranking article covers an average of 45% more topic-related terms than the #10 ranking article on the same query. These "missing terms" are the concepts that make the difference between a comprehensive authoritative article and a partial one.
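The comparison described above can be sketched in a few lines of Python. This is a simplified illustration: it uses each word's share of the competitor's text as a stand-in for TF-IDF (no corpus is available here), and the names `frequency_table`, `semantic_gaps`, and the `min_share` cutoff are assumptions made for the example.

```python
from collections import Counter

def frequency_table(tokens):
    """Map each word to its share of all tokens (0.0 to 1.0)."""
    total = len(tokens)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(tokens).items()}

def semantic_gaps(own_tokens, competitor_tokens, min_share=0.01):
    """Words prominent in the competitor text that never appear in yours."""
    own_words = set(own_tokens)
    competitor = frequency_table(competitor_tokens)
    gaps = [
        (word, share)
        for word, share in competitor.items()
        if share >= min_share and word not in own_words
    ]
    # Most prominent gaps first; alphabetical order breaks ties.
    return sorted(gaps, key=lambda item: (-item[1], item[0]))

# Hypothetical inputs, already lowercased and stripped of stop words.
mine = "protein shake recipe".split()
theirs = "protein creatine creatine shake dosage".split()
print(semantic_gaps(mine, theirs))
```

With real articles, raising `min_share` trims the list to the terms the competitor leans on most heavily.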

TF-IDF vs. keyword density: the right metric

Keyword density (keyword count ÷ total words × 100) is a simple ratio that ignores the surrounding document corpus. TF-IDF weights frequency against rarity across all documents: the same word at 2% density means very different things if it appears in 90% of all web pages ("the", "is") versus 0.1% of web pages ("creatine phosphokinase"). IDF-style weighting is how a search engine can separate words distinctively associated with your topic from universal filler. Optimizing for raw keyword density is less effective than ensuring your article covers the full semantic vocabulary of the topic, which naturally produces appropriate TF-IDF weights.
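To make the contrast concrete, the following sketch works through the 2% example with the two document shares mentioned above (90% and 0.1%). The helper names are hypothetical, and rewriting IDF as log(1 / share) assumes share = df / N, which follows from the formula but hides the corpus size.

```python
import math

def keyword_density(count, total_words):
    """Keyword density as a percentage of total words."""
    return count / total_words * 100

def idf_from_share(doc_share):
    """IDF = log(N / df), written as log(1 / share) where share = df / N."""
    return math.log(1 / doc_share)

# Same density for both words: 20 occurrences in a 1,000-word article.
density = keyword_density(20, 1000)

common_weight = density * idf_from_share(0.90)   # word found in 90% of pages
rare_weight = density * idf_from_share(0.001)    # word found in 0.1% of pages

print(f"density: {density:.1f}%")
print(f"common word weight: {common_weight:.2f}")
print(f"rare word weight:   {rare_weight:.2f}")
```

Both words have identical density, yet the rare word's weight is dozens of times larger, which is the distinction keyword density alone cannot express.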

Stop words: linguistic noise vs. semantic signal

Stop words are function words ("the", "a", "of", "and", "is") that carry grammatical structure but no topical meaning. In computational linguistics, stop words are typically removed before frequency analysis for topic detection purposes. The standard English stop word list used by Lucene (the search engine library that underlies Apache Solr and Elasticsearch) contains 33 words; the NLTK (Natural Language Toolkit) stop word list contains 179. Disabling stop word filtering in this tool reveals prose rhythm and formality register, which is useful for comparing writing style between two documents rather than topic alignment.
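A minimal sketch of a frequency counter with optional stop word filtering might look like the following. The ten-word `STOP_WORDS` set is an illustrative subset chosen for this example (not the Lucene or NLTK list), and the tokenizer regex and function name are likewise assumptions; this is not the tool's actual implementation.

```python
import re
from collections import Counter

# Illustrative subset only; real lists (Lucene, NLTK) are longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "that"}

def word_frequencies(text, remove_stop_words=True):
    """Return (word, count, percent) rows sorted by descending count."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    total = len(tokens)
    return [
        (word, count, round(100 * count / total, 1))
        for word, count in Counter(tokens).most_common()
    ]

sample = "The cat and the hat. The cat is here."
print(word_frequencies(sample))                           # content words only
print(word_frequencies(sample, remove_stop_words=False))  # style view
```

Note that percentages are computed against the filtered token count when filtering is on, so the same word shows a higher percentage than it would against the full text.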

Semantic diversity: the vocabulary breadth signal

Google's Natural Language API (publicly accessible) identifies entities, their categories, and salience scores from text. High-quality content about a topic uses that topic's full canonical vocabulary: core terms, related concepts, common co-occurring entities, and domain-specific jargon where appropriate. A narrow frequency distribution — one or two terms at 3–4% with everything else below 0.5% — signals shallow coverage. A broad, rich distribution where 20–30 topically-relevant terms appear at 0.5–2% signals comprehensive expertise. Semrush found that #1 ranking content uses 3.2× more semantically related words than #10 ranking content for the same query.
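The narrow-versus-broad distinction can be checked mechanically from a frequency table. The sketch below is one possible reading of the ranges in the paragraph above: the function name `breadth_profile`, the 0.5–2% band, and the 3% cutoff for a dominant term are assumptions for illustration, not thresholds published by any search engine.

```python
def breadth_profile(percent_by_word, band=(0.5, 2.0), dominant_at=3.0):
    """Split words into the mid-frequency band and the dominant group.

    percent_by_word maps each content word to its percentage of the text.
    """
    low, high = band
    in_band = sorted(w for w, p in percent_by_word.items() if low <= p <= high)
    dominant = sorted(w for w, p in percent_by_word.items() if p >= dominant_at)
    return {"in_band": in_band, "dominant": dominant}

# Hypothetical frequency tables (percentages after stop word filtering).
narrow = {"protein": 3.8, "shake": 0.3, "recipe": 0.2}
broad = {"protein": 1.8, "whey": 1.1, "leucine": 0.6, "blender": 0.2}

print(breadth_profile(narrow))
print(breadth_profile(broad))
```

A long `in_band` list with an empty `dominant` list corresponds to the broad profile described above; the reverse corresponds to the narrow one.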

Pro Tips

Export to CSV for comparative spreadsheet analysis

Hit 'Copy CSV', then paste the full frequency table into Google Sheets or Excel. Create two columns: your article's top 30 words and a competitor's top 30 words (run their article through this tool separately). Use VLOOKUP or COUNTIF to find words in column B that are absent from column A; these are your semantic gaps. This workflow approximates professional TF-IDF gap analysis at zero cost. For sites with 10+ competing articles, this comparison becomes a systematic content upgrade checklist.

Compare against the top-ranking competitor article

Copy the full text from the #1 ranking article for your target keyword (use a reader-mode view or browser extension to extract clean text, or right-click → View Page Source and find the article text). Paste it here and export the top 50 content words. Then paste your article and export the same. The words prominent in their distribution but absent from yours are candidates for what TF-IDF-style weighting rewards in that article. Adding those terms naturally to your content can improve your topical relevance.

Use color-coded frequency bars to triage rewrites

Red bars flag content words above 5% frequency (the Panda-era stuffing threshold). Amber bars flag 3–5% (elevated; monitor and diversify through synonyms). Blue/primary bars indicate healthy density (0.5–3%). Focus rewrites on red-flagged terms first: substitute 30–40% of their occurrences with synonyms, pronouns, or conceptual variants. This produces a more natural Zipfian distribution while preserving topical relevance: the same concept at roughly two-thirds of the raw density, with richer vocabulary breadth.
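The color bands above translate into a small classification function. This sketch is an assumption about how such a rule could be written, not the tool's source: the function name `bar_color` is hypothetical, the text does not say which band exactly 3% or exactly 5% falls into (both are treated as amber here), and words below 0.5% are assumed to keep the default blue bar.

```python
def bar_color(percent):
    """Classify a content word's frequency percentage into a bar color."""
    if percent > 5:
        return "red"     # above the 5% stuffing threshold
    if percent >= 3:
        return "amber"   # elevated: 3-5%
    return "blue"        # healthy or low density

# Hypothetical frequency rows: (word, percent of content words).
rows = [("protein", 8.0), ("shake", 3.4), ("recipe", 1.2)]
for word, percent in rows:
    print(f"{word:10} {percent:4.1f}%  {bar_color(percent)}")
```

Sorting the rows by color before rewriting gives the triage order suggested above: red first, then amber.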

Frequently Asked Questions

How is Word Frequency different from Keyword Density?
Keyword Density is a targeted check: you specify 1–3 keywords and it reports their percentage of total word count. Word Frequency is a full distribution audit: it lists every unique word with count and percentage, revealing the entire topical profile of your content. Use Keyword Density when you have a specific target phrase and want to verify that it appears at the right level. Use Word Frequency when you want to audit overall topical alignment, identify unexpected high-frequency terms, compare against competitor content, or export data for spreadsheet-level analysis.
Why exclude stop words from frequency analysis?
Stop words ("the", "and", "of", "is", "a") sit at the top of the Zipf distribution and dominate every text sample, so removing them reveals the content words that carry actual topical meaning. The 10 most common function words alone account for roughly a fifth to a quarter of the words in typical English text, and a full list such as NLTK's (179 words) covers a considerably larger share. Without filtering, the frequency list is dominated by grammatical structure words that carry zero SEO or topical signal. Enable stop words only when analyzing writing style, formality register, or comparing sentence structure between two documents rather than topic coverage.
Does keyword frequency directly affect rankings?
Raw keyword frequency (density) is not a direct ranking factor: Google representatives have said so publicly, and Panda was designed to reward quality rather than keyword density. What matters more is how distinctive a term is relative to the corpus (the idea behind TF-IDF), not its raw percentage. However, a word's presence or absence has binary importance: if a term central to your topic never appears in your content, it cannot contribute to your topical relevance for that term. Natural frequency in a well-written article on the topic is usually sufficient; forced density additions are detectable and counterproductive.
What is a healthy word frequency distribution for SEO?
A healthy distribution follows Zipf's Law adapted for content words: your primary topic terms appear 1.5–3% of the time (after stop word filtering), supporting semantic terms at 0.5–1.5%, related concept terms at 0.2–0.8%, and specific details at 0.1–0.3%. No single non-stop word should exceed 4–5% frequency. Vocabulary breadth matters more than any single term's density: Semrush found #1 ranking content uses 3.2× more semantically related words than #10 content. Richness and diversity within your topic domain are the target profile.