TF-IDF Keyword Extractor
Paste any text to see which keywords are most significant. TF-IDF measures how important a word is to a document, not just how frequent it is, so common words like 'the' are not treated as keywords.
TF-IDF Guide
TF-IDF: the smarter way to find what a document is actually about
TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a classic term-weighting scheme used in search ranking, document clustering, and content summarization. Unlike simple word frequency, TF-IDF down-weights common words (like 'the', 'is', 'and') and up-weights terms that are both frequent in this document and rare in general language, which reveals what the content is genuinely about.
What TF means
Term Frequency is how often a word appears in the document, normalized by document length. A word appearing 10 times in a 100-word document has a TF of 0.1 — regardless of whether it's 'the' or 'algorithm'. By itself, TF is just a word count ratio.
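The TF definition above can be sketched in a few lines of Python. This is a minimal illustration, not this tool's actual code: the function name `term_frequency` and the simple regex tokenizer are assumptions made for the example.

```python
import re
from collections import Counter


def term_frequency(text: str) -> dict[str, float]:
    """Return each word's count divided by the total number of words."""
    # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    return {word: count / total for word, count in counts.items()}


# A word appearing 10 times in a 100-word document gets TF = 10 / 100.
sample = "algorithm " * 10 + "filler " * 90
tf = term_frequency(sample)
print(tf["algorithm"])
```

Note that the function scores 'the' and 'algorithm' identically when their counts match, which is why TF alone is not enough.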
What IDF means
Inverse Document Frequency measures how rare a word is across all documents. Common words like 'the' appear in virtually every document — high document frequency, low IDF. Rare technical terms appear in few documents — low document frequency, high IDF. Multiplying TF × IDF rewards words that are both frequent in this document AND rare in general.
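To make the TF × IDF product concrete, here is a small sketch that computes IDF from a reference corpus. It assumes the plain textbook form idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t; real libraries often use smoothed variants. The three-sentence corpus and all function names are invented for illustration.

```python
import math
import re


def tokenize(text: str) -> list[str]:
    # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def inverse_document_frequency(corpus: list[str]) -> dict[str, float]:
    """idf(t) = log(N / df(t)), unsmoothed."""
    n_docs = len(corpus)
    doc_freq: dict[str, int] = {}
    for doc in corpus:
        for word in set(tokenize(doc)):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def tf_idf(doc: str, idf: dict[str, float]) -> dict[str, float]:
    """Multiply each word's term frequency in doc by its IDF."""
    words = tokenize(doc)
    return {w: (words.count(w) / len(words)) * idf.get(w, 0.0) for w in set(words)}


# Made-up toy corpus; a real reference corpus would be far larger.
corpus = [
    "the gradient descent algorithm updates the weights",
    "the cat sat on the mat",
    "the weather is nice today",
]
idf = inverse_document_frequency(corpus)
scores = tf_idf(corpus[0], idf)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Because 'the' occurs in every document, its IDF is log(3/3) = 0 and its TF-IDF score is zero, even though it is the most frequent word in the first document.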
How this tool approximates IDF
True IDF requires a reference corpus, ideally a large one, to count how many documents contain each word. This tool has no corpus, so it uses a word-length proxy: longer words tend to be rarer in natural language. Combined with stop word removal (which eliminates the most common words), this gives a rough but useful approximation of keyword significance. It is a heuristic, so short but rare terms can be underweighted.
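The general idea of a corpus-free approximation can be sketched as follows. The exact formula this tool uses is not stated here, so the weight `1 + log(word length)`, the abbreviated stop word list, and the name `approximate_tf_idf` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Abbreviated, illustrative stop word list (an assumption, not the tool's list).
STOP_WORDS = {"the", "a", "an", "is", "and", "but", "of", "to", "in", "on"}


def approximate_tf_idf(text: str, top_n: int = 10) -> list[tuple[str, float]]:
    """Score words by TF times a word-length rarity proxy, highest first."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    words = [w for w in tokens if w not in STOP_WORDS]
    if not words:
        return []
    counts = Counter(words)
    total = len(words)  # TF is normalized over the non-stop words here
    scores = {
        # Hypothetical proxy: longer words get a larger "IDF" weight.
        word: (count / total) * (1.0 + math.log(len(word)))
        for word, count in counts.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


result = approximate_tf_idf("the algorithm is fast and the algorithm is simple")
print(result)
```

In this sample, 'algorithm' ranks first because it is both the most frequent remaining word and the longest, while 'the', 'is', and 'and' never reach the scoring step.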
Using TF-IDF for content optimization
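Compare the top TF-IDF terms from a high-ranking competitor page against your own page for the same keyword. Terms that appear in their top 20 but not in yours represent possible topical gaps. SEO guides often call these 'LSI keywords', but Latent Semantic Indexing is a separate technique, and Google's own representatives have said there is no such thing as 'LSI keywords'. Treat them simply as related terms that indicate how thoroughly a page covers its topic.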
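The comparison described above amounts to a set difference over two ranked term lists. A minimal sketch, assuming you have already extracted the top terms for each page; the function name `topical_gaps` and both term lists are made up for the example.

```python
def topical_gaps(competitor_terms: list[str], own_terms: list[str]) -> list[str]:
    """Return competitor terms missing from your page, in the competitor's rank order."""
    own = set(own_terms)
    return [term for term in competitor_terms if term not in own]


# Invented example data: top terms for a competitor page and for your page.
competitor_top = ["espresso", "grinder", "portafilter", "crema", "tamping"]
own_top = ["espresso", "grinder", "beans", "machine"]

gaps = topical_gaps(competitor_top, own_top)
print(gaps)  # ['portafilter', 'crema', 'tamping']
```

Keeping the competitor's rank order puts the most significant missing terms first, which helps when deciding what to cover.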
TF-IDF vs keyword density
Keyword density tells you how often one specific keyword appears. TF-IDF tells you which words define the document — the full topic vocabulary. For SEO, you want both: your target keyword at appropriate density, plus high TF-IDF coverage of related terms that signal topical depth.
Entity extraction and topic modeling
TF-IDF analysis often surfaces named entities (proper nouns, brand names, technical terms) that a raw frequency count buries under common words. Distinctive, high-IDF terms like these are a strong signal of what a document is about. If your top TF-IDF terms match the topic you want to rank for, your content's vocabulary is aligned with that topic.
Pro Tips
Run TF-IDF on a competitor's page that ranks #1 for your target keyword. Their top 10-20 TF-IDF terms are the semantic vocabulary that likely signals topical authority. Cross-reference with your own content.
Strip HTML before pasting for cleaner results. Use the HTML Tag Stripper tool to convert a page's HTML to plain text, then paste that text here to avoid TF-IDF picking up CSS class names and HTML attributes.
TF-IDF becomes more meaningful with longer texts. For very short content (under 100 words), the results are dominated by whatever unique words appear and may not be representative. For best results, use with articles of 500+ words.
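If you prefer to strip HTML yourself before pasting, the tip above can be approximated with Python's standard-library `html.parser`. This is a sketch of the general technique, not the implementation of the HTML Tag Stripper tool; the class name `TextExtractor` and the helper `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Attributes such as class names are ignored, so they never reach the output.
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.parts)


page = '<div class="hero"><style>.hero{color:red}</style><p>Neural networks</p><p>learn features</p></div>'
print(html_to_text(page))  # Neural networks learn features
```

The class name `hero` and the CSS rule are dropped, so they cannot show up as spurious keywords.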
Frequently Asked Questions
- How is this different from a word frequency counter?
- A word frequency counter treats every word equally. This tool weights terms by estimated importance: in a technical article about neural networks, 'network' scores higher than a short filler word that appears just as often. Stop words such as 'is' are removed entirely, and longer words, which tend to be rarer, are weighted higher.
- What are stop words and why are they removed?
- Stop words are common function words that carry no topical meaning: 'the', 'a', 'is', 'and', 'but'. They appear in almost every document and would dominate TF scores without removal. Stripping them reveals the substantive vocabulary — the words that actually describe what a document is about.
- Can I use this to find keywords to add to my content?
- Yes — compare TF-IDF results between your content and a competitor's top-ranking page. Terms with high TF-IDF on their page but absent from yours are topical gaps. Adding them naturally (not forcibly) can improve semantic relevance. Don't add terms just because they appear — they should fit contextually.
- Why might important keywords score lower than I expect?
- Short words get lower IDF approximations (since our IDF proxy uses word length as a rarity signal). A one-word brand name or acronym may score lower than a descriptive phrase even if it's thematically central. For short/specific terms, cross-check with the raw count column.