Not every page on a site should be in Google. Thank-you pages, internal search results, faceted filter URLs, staging environments, PPC landing pages — these are useful to humans in the right moment but pollute the index if they show up in organic results. The robots meta tag is the per-page instruction that controls this. It's one line, it's easy to write, and it's also one of the easiest ways to accidentally wipe a site out of search.
This guide covers what the robots meta tag is, every directive it accepts, the critical and frequently-misunderstood difference between noindex and robots.txt, the real-world cases where you actually reach for it, and how to verify it's doing what you think.
What the Robots Meta Tag Looks Like
A single line in the <head> of a page:
<meta name="robots" content="noindex, follow">
The name="robots" targets all compliant crawlers. The content attribute holds one or more comma-separated directives. The tag must live in the <head> — search engines do not reliably honour robots directives placed in the body.
If there is no robots meta tag at all, the default behaviour is index, follow: the page is eligible for indexing and its links are eligible to be crawled. This matters — you do not need to add <meta name="robots" content="index, follow"> to get pages indexed. That tag is a no-op. You only add a robots tag when you want to change the default.
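The default can be made concrete with a small check. The following is a minimal sketch, not a crawler implementation: it assumes the page has explicit <head> tags, and the names RobotsMetaParser and index_follow are hypothetical. It reads only robots tags inside the head and falls back to index, follow when none is present.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the content of every <meta name="robots"> inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attr = {key: (value or "") for key, value in attrs}
            if attr.get("name", "").lower() == "robots":
                self.contents.append(attr.get("content", ""))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def index_follow(html):
    """Return (indexable, links_followed); with no tag this is (True, True)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = {
        token.strip().lower()
        for content in parser.contents
        for token in content.split(",")
        if token.strip()
    }
    noindex = bool(directives & {"noindex", "none"})
    nofollow = bool(directives & {"nofollow", "none"})
    return (not noindex, not nofollow)
```

Ignoring tags found in the body is a simplification chosen for this sketch; real crawlers may behave differently, which is exactly why the tag belongs in the head.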
The Directives
The two that matter most are the indexing and link-following pair:
- index / noindex: whether the page may appear in search results. noindex removes it from the index.
- follow / nofollow: whether crawlers may follow the links on the page and pass signal through them. nofollow tells them not to.
Beyond those, Google supports a set of finer-grained directives:
- noarchive: don't show a cached copy of the page in results.
- nosnippet: don't show any text snippet or video preview for the page.
- max-snippet:[number]: limit the text snippet to N characters. max-snippet:0 means no snippet; max-snippet:-1 means no limit.
- max-image-preview:[setting]: cap the size of any image preview. Values are none, standard, or large.
- max-video-preview:[number]: limit video preview to N seconds. 0 means a static image only; -1 means no limit.
- noimageindex: don't index images on the page.
- none: shorthand for noindex, nofollow.
- all: shorthand for index, follow (the default; rarely needed explicitly).
- unavailable_after:[date]: stop showing the page in results after a given date/time. Useful for time-limited content like event or promotion pages. Use an RFC 822 or ISO 8601 date.
You combine them as needed:
<meta name="robots" content="noindex, nofollow">
<meta name="robots" content="index, follow, max-image-preview:large">
<meta name="robots" content="noarchive, nosnippet">
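To see how a combined content value breaks down, here is a rough sketch of a parser for it. The function name parse_directives is hypothetical, and the sketch makes one known simplification: it splits on commas, so an unavailable_after value written as an RFC 822 date containing a comma would not survive intact.

```python
def parse_directives(content):
    """Turn a robots content string into a dict of directive -> value.

    Flag directives map to True; valued ones (max-snippet:50) map to
    the text after the first colon.
    """
    rules = {}
    for raw in content.split(","):
        token = raw.strip()
        if not token:
            continue
        name, sep, value = token.partition(":")
        rules[name.strip().lower()] = value.strip() if sep else True

    # Expand the two shorthands described in the list above.
    if rules.pop("none", False):
        rules["noindex"] = True
        rules["nofollow"] = True
    if rules.pop("all", False):
        rules.setdefault("index", True)
        rules.setdefault("follow", True)
    return rules
```

For example, parse_directives("index, follow, max-image-preview:large") yields a dict with index and follow set to True and max-image-preview set to "large".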
Targeting Specific Crawlers
The name attribute selects which crawler the rule applies to. robots targets all of them; a specific user-agent token targets just that one:
<meta name="robots" content="nosnippet">
<meta name="googlebot" content="noindex">
<meta name="bingbot" content="noarchive">
If both a generic robots rule and a crawler-specific rule exist, that crawler combines the two sets of directives. Where they conflict, the most restrictive directive wins: if robots says index but googlebot says noindex, Googlebot does not index the page.
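The "most restrictive wins" rule can be sketched in a few lines. This is an illustration of the logic described above, not Google's actual implementation; the function name is_indexable and the dict-of-tags input shape are assumptions made for the example.

```python
def is_indexable(tags, crawler):
    """Decide indexability for one crawler.

    tags maps a meta name ("robots", "googlebot", ...) to its content
    string. Rules addressed to other crawlers are ignored.
    """
    applicable = set()
    for name, content in tags.items():
        if name.lower() in ("robots", crawler.lower()):
            applicable.update(
                token.strip().lower() for token in content.split(",")
            )
    # Any applicable noindex (or its shorthand, none) overrides an index.
    return not (applicable & {"noindex", "none"})
```

With tags = {"robots": "index", "googlebot": "noindex"}, the page is not indexable for googlebot but remains indexable for bingbot, which sees only the generic rule.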
The Critical Distinction: noindex vs robots.txt vs X-Robots-Tag
This is the part that trips up even experienced developers, so it's worth being precise. These three things sound similar and do completely different jobs.
robots.txt controls crawling, not indexing
A Disallow rule in robots.txt tells crawlers not to fetch a URL. It does not tell them not to index it. A page blocked in robots.txt can still appear in search results — typically as a bare URL with no description ("No information is available for this page") — because Google learned about it from links even though it was never allowed to read the content.
noindex controls indexing, and requires crawling
The noindex directive removes a page from the index. But here's the catch: Google has to crawl the page to see the noindex tag. The directive lives inside the page's HTML. If the crawler can't fetch the page, it never sees the instruction.
The classic mistake
This is the single most common robots error, and it's worth stating loudly:
Blocking a page in robots.txt AND adding a noindex tag does not de-index the page.
The logic seems sound — "I'll block it in robots.txt so crawlers stay away, and add noindex to be safe" — but the two cancel each other out. The robots.txt Disallow stops Google from ever fetching the page, which means Google never reads the noindex tag, which means the page can remain indexed indefinitely as a bare URL.
If you want a page out of the index, do the opposite of what feels safe: allow it to be crawled, and add noindex. Let Google fetch the page, read the directive, and drop it. Only once it has been de-indexed can you later consider blocking it in robots.txt to save crawl budget — though there's rarely a reason to.
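A quick way to catch this conflict is to test whether robots.txt blocks the URL that carries the noindex. The sketch below uses Python's standard urllib.robotparser; the function name noindex_can_be_read is hypothetical, and the standard-library matcher may not interpret wildcard patterns the same way Google does, so treat the result as a first check rather than a verdict.

```python
from urllib.robotparser import RobotFileParser


def noindex_can_be_read(robots_txt, url, user_agent="Googlebot"):
    """Return True if the crawler may fetch url, and so can see its noindex."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots_txt = """\
User-agent: *
Disallow: /thank-you/
"""

# Blocked: a noindex tag on this page would never be read.
blocked = noindex_can_be_read(robots_txt, "https://example.com/thank-you/")

# Crawlable: a noindex tag here can take effect.
readable = noindex_can_be_read(robots_txt, "https://example.com/pricing/")
```

If the function returns False for a page you are trying to de-index, remove the Disallow rule first.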
X-Robots-Tag for Non-HTML Files
The robots meta tag only works in HTML, because it's an HTML element. You can't put a <meta> tag inside a PDF, an image, or a text file. For those, the same directives are delivered through the X-Robots-Tag HTTP response header:
X-Robots-Tag: noindex
To stop a PDF from being indexed, configure the server to send the header for that file type. In Apache:
# Requires mod_headers to be enabled.
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
The header accepts every directive the meta tag does, and you can target crawlers the same way: X-Robots-Tag: googlebot: noindex. It's also a clean way to apply noindex across a whole directory or pattern without editing every file — useful for staging environments and bulk content. As with the meta tag, the page still has to be crawlable for Google to read the header.
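When auditing responses, the optional crawler prefix has to be separated from the directives. Here is one possible sketch of that split; parse_x_robots_tag and VALUED are hypothetical names, and the approach (treating a leading "name:" as a crawler unless it is a known valued directive) is an assumption of this example rather than a documented parsing algorithm.

```python
# Directives that legitimately contain a colon of their own.
VALUED = {
    "max-snippet",
    "max-image-preview",
    "max-video-preview",
    "unavailable_after",
}


def parse_x_robots_tag(value):
    """Split a header value into (crawler or None, list of directives)."""
    head, sep, rest = value.partition(":")
    is_crawler_prefix = (
        bool(sep)
        and "," not in head
        and head.strip().lower() not in VALUED
    )
    if is_crawler_prefix:
        crawler, body = head.strip().lower(), rest
    else:
        crawler, body = None, value
    directives = [d.strip().lower() for d in body.split(",") if d.strip()]
    return crawler, directives
```

For the header shown above, parse_x_robots_tag("googlebot: noindex") returns ("googlebot", ["noindex"]), while a plain "noindex, nofollow" returns (None, ["noindex", "nofollow"]).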
Real-World Use Cases
Where you actually reach for noindex in production:
- Thank-you and confirmation pages. Post-checkout, post-signup, post-contact-form pages. No search value, and you don't want them appearing out of context. Use noindex, follow: drop the page but still let link signal flow.
- Internal search results. /search?q=... pages generate infinite low-quality URLs. Google explicitly recommends keeping these out of the index.
- Staging and development sites. A staging environment should be entirely noindex (ideally also behind HTTP auth). Never let a clone of production compete with production in search.
- Paginated archives, carefully. Modern guidance is usually to let page 2, 3, 4 of an archive index normally (self-canonical, indexable) so their links are discovered. noindex on deep pagination is a deliberate choice, not a default.
- Tag, filter, and faceted pages. Filter combinations can produce thousands of thin, near-duplicate URLs. noindex the low-value combinations while keeping the valuable category pages indexable.
- PPC and email landing pages. Pages built for a paid campaign often duplicate organic content or are intentionally thin. noindex keeps them out of organic search while remaining live for the ad traffic they're built for.
Note the recurring pairing: noindex, follow. You usually want the page gone from search but still want crawlers to follow its links so the rest of the site stays discoverable.
Common Mistakes
1. The catastrophic sitewide noindex
The single most damaging robots error. A staging site is built with a global <meta name="robots" content="noindex"> in the template, which is entirely correct for staging. Then the site is pushed to production and the noindex comes along with it. As Google recrawls the site, it drops the pages from the index, and organic traffic falls off a cliff.
This is the dev-to-prod leak, and it has taken down real businesses. Whatever mechanism adds noindex to staging must be guaranteed to remove it on production — and the production deploy checklist should include verifying the homepage is index, follow. Run the live URL through Meta Tag Checker immediately after every launch.
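One way to make the removal automatic is to derive the tag from the environment instead of hard-coding it in the template. The sketch below is an assumption-heavy illustration: the variable name APP_ENV, the value "production", and the function robots_meta_tag are all hypothetical and would need to match your own deploy setup.

```python
import os


def robots_meta_tag(environ=None):
    """Return the robots tag for the current environment.

    Production gets no tag at all, which means the default index, follow.
    Every other environment gets noindex, nofollow.
    """
    environ = os.environ if environ is None else environ
    env = environ.get("APP_ENV")  # hypothetical variable name
    if env is None:
        # Fail loudly rather than guess: a silent default in either
        # direction is how staging rules end up in production.
        raise RuntimeError("APP_ENV is not set; refusing to guess indexability")
    if env.lower() == "production":
        return ""
    return '<meta name="robots" content="noindex, nofollow">'
```

Raising on a missing variable is a deliberate choice in this sketch: a broken build is far cheaper than a de-indexed site or an indexed staging clone.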
2. noindex + canonical on the same page
These two directives contradict each other. canonical says "consolidate this URL with another page"; noindex says "drop this URL entirely". Google can't do both, treats the signals as conflicting, and the outcome is unpredictable. Pick one: canonical for genuine duplicates you want to consolidate, noindex for pages that shouldn't be in search at all.
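This combination is easy to detect in an audit script. The following is a minimal sketch using the standard-library HTML parser; HeadSignals and has_noindex_canonical_conflict are hypothetical names, and the check flags any page that carries both signals, including a self-referencing canonical.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Records robots directives and the canonical URL found in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = {key: (value or "") for key, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                if token.strip():
                    self.directives.add(token.strip().lower())
        elif tag == "link" and "canonical" in attr.get("rel", "").lower().split():
            self.canonical = attr.get("href")


def has_noindex_canonical_conflict(html):
    """Return True when a page carries both noindex and a canonical link."""
    signals = HeadSignals()
    signals.feed(html)
    noindex = bool(signals.directives & {"noindex", "none"})
    return noindex and signals.canonical is not None
```

Running this across a crawl export gives a list of pages where one of the two signals needs to be removed.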
3. Blocking in robots.txt and expecting de-indexing
Covered above, but it's the most common misconception so it bears repeating: Disallow in robots.txt does not remove a page from the index. To de-index, the page must be crawlable and carry noindex. If it's already blocked, unblock it first.
4. noindex left on after launch
The reverse of the staging leak: a page was temporarily set to noindex during development of a feature, and nobody removed it when the page went live. The page silently never ranks. Audit for unexpected noindex tags whenever a page underperforms.
5. Conflicting directives across tags
A CMS template adds one robots tag and an SEO plugin adds another, with different directives. Google reads both and applies the most restrictive combination — which is often not what either author intended. Audit for duplicate robots tags and consolidate to one.
How to Verify
- Run the URL through Meta Tag Checker. It surfaces the robots directive along with the rest of the page's meta tags, so you can confirm at a glance whether a page is indexable.
- View Source in the browser and search for "robots" for visual confirmation of the meta tag.
- curl the headers to catch an X-Robots-Tag that View Source won't show: curl -sI https://example.com/page/ | grep -i robots.
- Google Search Console → URL Inspection. This is the authoritative check. It reports whether Google considers the URL indexable and names the exact directive blocking it if not.
The Search Console check is the one that matters most, because it tells you what Google actually sees and decided — not just what your template emits. If a page you expect to rank shows "Excluded by 'noindex' tag", you've found your problem.
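The header and source checks can also be scripted together, so that neither place a noindex can hide is missed. This is a rough sketch under stated assumptions: noindex_sources is a hypothetical helper, it takes already-fetched headers and HTML, and its regular expression only matches meta tags where name appears before content. It reports what the response emits, not what Google has decided.

```python
import re

# Assumes name comes before content inside the tag (a simplification).
META_ROBOTS = re.compile(
    r'<meta[^>]+name=["\'](robots|googlebot)["\'][^>]*content=["\']([^"\']*)["\']',
    re.IGNORECASE,
)


def _blocks_indexing(directives):
    """True if a directive string contains noindex or its shorthand none."""
    for token in directives.lower().split(","):
        token = token.strip()
        # Drop an optional "googlebot:" style prefix, but leave valued
        # directives such as max-image-preview:none untouched.
        if ":" in token and not token.startswith(("max-", "unavailable_after")):
            token = token.split(":", 1)[1].strip()
        if token in ("noindex", "none"):
            return True
    return False


def noindex_sources(headers, html):
    """List every place in a response that tells crawlers not to index."""
    found = []
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and _blocks_indexing(value):
            found.append(f"X-Robots-Tag header: {value}")
    for name, content in META_ROBOTS.findall(html):
        if _blocks_indexing(content):
            found.append(f"meta {name.lower()}: {content}")
    return found
```

The headers and HTML could come from urllib.request.urlopen(url), using the response's headers and decoded body; an empty list means nothing in the response itself blocks indexing.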
The Production Habit
For every launch and template change:
- Confirm the homepage and key landing pages are index, follow. Never assume the deploy kept them that way.
- Guarantee staging noindex cannot survive into production. Tie it to an environment variable, not a hard-coded tag.
- To de-index a page, make it crawlable and add noindex. Do not block it in robots.txt.
- Never put noindex and canonical on the same page.
- Use X-Robots-Tag for PDFs and other non-HTML files.
- Spot-check the live URL after every deploy.
The robots meta tag is small, but the blast radius when it's wrong is the whole site. A thirty-second check after each launch is the cheapest insurance in SEO — and Meta Tag Checker makes it a single paste-and-go for any URL.