WooCommerce filter URLs are a crawler trap: the fix
A 2026 postmortem on WooCommerce filter URLs as a crawler trap: 41,000 unique URLs/hour from Facebook and MJ12bot, the diagnostic, and a three-layer fix.
The page arrived a little after midnight on a Wednesday. A small
WooCommerce store on cpanel-host was timing out on the
shop archive, then recovering for a minute, then timing out again.
The owner had checked the obvious things by the time we logged in:
no promo running, no campaign live, no abnormal order volume in
the admin. The cPanel resource graph told a different story. The
account's per-user PHP-FPM pool had been at 100% utilisation in
two-minute waves for the past six hours, and the Apache access log
for harborbeansco.com was the size of a feature film.
We pulled the access log, ran a one-line count on unique request
paths, and stopped reading at line two. In a single hour the site
had served 41,000+ unique URLs, every one of them a permutation
of the same six product categories with three to seven attribute
filters appended. A real human shopper had visited maybe forty of
those URLs across the night. The other forty-thousand-nine-hundred-
and-sixty had been requested by two bots: Facebook's external-hit
crawler from a known public subnet, and MJ12bot from one of its
usual addresses. Neither of them was malicious. Both of them were
treating the site's faceted navigation the way a brute-force script
treats wp-login.
This post is the postmortem for that incident, and for a sibling
incident on the same server three weeks earlier. If you have found
your way here from Googling woocommerce facebook crawler or
mj12bot wordpress block or cpanel block bots .htaccess, you are
in the right place. We will cover the combinatorics that make
WooCommerce filters uniquely dangerous, the diagnostic flow we ran
on the box, the three-layer fix that actually closed the ticket,
and an honest description of what ServerGuard does and does not
do for this scenario today.
The symptom: minute-scale FPM exhaustion on a single account
The first thing you see is not a crawler error. There is no crawler
error. What you see is a per-user PHP-FPM pool that climbs from
60% to 100% in roughly twenty seconds, holds there for two minutes,
drains for forty seconds, and climbs again. The shop archive times
out during the spikes and serves in 800ms in between. The error log
has nothing useful in it, because the requests that are filling the
pool are technically legitimate. They are GETs to
/shop/?color=red&size=L, just with thirty thousand variations on
which color, which size, and which material.
The graph that gives the game away is the request-rate panel by
user agent, not the FPM panel. When we filtered the Apache log by
the User-Agent field for the previous six hours, two strings
accounted for 94% of all requests against this single account:
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
The rest of the requests were ordinary browser traffic. Forty-six humans had visited that night. The remaining several tens of thousands of requests were two crawlers walking the same product catalogue in slow motion, exhaustively, one filter combination at a time.
The math: twelve attributes by eight values, billions of URLs per category
WooCommerce attribute filters look harmless in the admin. You add
"color" with six values, "size" with five, "material" with four,
and "brand" with eight. That's four filters. From a shopper's point
of view the catalogue offers 6 + 5 + 4 + 8 = 23 checkboxes. From
a crawler's point of view the catalogue offers
6 × 5 × 4 × 8 = 960 distinct URL combinations per category, plus
every subset, plus every ordering of those parameters in the query
string. A reasonable WooCommerce store with twelve attribute types
each having eight values produces, very approximately, this many
unique filtered URLs per category page:
For 12 attributes with 8 values each, the count of non-empty
parameter subsets (ignoring order) is:
sum over k=1..12 of C(12,k) * 8^k
= 9^12 - 1
= 282,429,536,480
URLs per category. There are usually six categories.
Most of those URLs return empty result sets. WooCommerce still
renders them. Each render is a full PHP-FPM request, a session
write, a database query against wp_posts, a join against
wp_term_relationships, and a render of the (empty) results
template. Every one of those URLs is a unique cache key in any
sensible caching plugin, which means caching does not save you
either: the bot is requesting URLs no human has ever requested,
and no human ever will.
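The subset count is easy to sanity-check. The awk below is a standalone arithmetic check, not part of the diagnostic flow: it sums C(12,k) × 8^k for k = 1..12 and confirms the binomial-theorem shortcut 9^12 − 1:

```shell
# Sanity-check the filter-URL count: sum over k=1..12 of C(12,k) * 8^k.
# By the binomial theorem this equals (1 + 8)^12 - 1 = 9^12 - 1.
awk 'BEGIN {
    total = 0; c = 1;
    for (k = 1; k <= 12; k++) {
        c = c * (12 - k + 1) / k;   # running binomial coefficient C(12, k)
        total += c * 8 ^ k;
    }
    printf "%.0f\n", total;         # prints 282429536480
}'
```

Every term stays well under 2^53, so the double-precision arithmetic inside awk is exact here.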
If you have one customer browsing the store and one Facebook crawler walking the filter combinations, the crawler is doing roughly eleven billion times the work the customer is. That ratio is not hyperbole. It is the geometry of faceted navigation, and it is the single biggest reason a small WooCommerce store with no notable traffic can run a cPanel server out of FPM workers.
Who's hitting you, and why
The two crawlers that surfaced in our incidents are not unusual. They are the two we caught at scale. The category is wider.
Facebook external hit
Facebook fetches a URL when a user shares it, when a user hovers
over it in the news feed, when the Open Graph cache expires, and
when their internal index decides to re-validate. The crawler
identifies itself as facebookexternalhit/1.1 and originates from
a known public subnet. The relevant facts about it, for our
purposes, are these:
- It hits in bursts. A single product URL shared on a Facebook page with 40,000 followers produces a 600 to 2000 req/sec burst as the news feed pre-fetches Open Graph metadata for each view.
- It does not honour robots.txt for Open Graph fetches. Facebook treats those as a user-initiated action even though the user has done nothing except scroll past the link.
- It rotates IPs within the subnet. Blocking a single IP within the crawler's subnet accomplishes nothing.
- The subnet cannot be blocked outright without breaking link previews for that store's marketing posts. For an e-commerce store that depends on social, that is a non-starter.
MJ12bot
MJ12bot is the crawler for Majestic, the SEO backlink index. It
identifies itself as MJ12bot/v1.4.8 and runs from a cluster of
IPs of which one is typically the noisier member. Its behaviour is
different from Facebook's:
- It crawls at a low rate per IP, typically one request every few seconds.
- It does honour robots.txt, but only when it next refreshes its copy of your robots.txt, which can be days.
- It exhaustively explores filter URLs because it is looking for unique pages to back-link-score. Faceted navigation looks like a goldmine of unique pages to a backlink crawler.
A single MJ12bot IP at one request every three seconds is 1,200 requests an hour. A cluster of ten such IPs walking the filter space of a store with six categories and twelve attributes contributes roughly 12,000 requests an hour on its own; combined with the Facebook external-hit bursts, that is how the 41,000-unique-URLs-per-hour rate we observed in the incident adds up.
Bingbot, Yandex, Baidu, AhrefsBot
These hit the same trap at lower volume. The mechanism is identical: a search-engine or backlink crawler treats every unique URL as a unique page, and faceted navigation manufactures unique URLs faster than any crawler can sensibly index. The fix below covers them by category, not by IP.
The diagnostic flow
When the page comes in and you have an FPM pool at 100% on a single account, the diagnostic is four commands. We run them in this order and we have not yet found an incident of this type that needs a fifth.
The first command counts unique request paths per hour from the Apache access log. This is the one that tells you whether you are looking at a crawler trap or at organic traffic that just happens to be heavy:
# Count unique URL paths per hour over the last 24 hours
awk '{print $4, $7}' /etc/apache2/logs/domlogs/harborbeansco.com \
| sed 's/\[//; s/:[0-9][0-9]:[0-9][0-9] / /' \
| sort -u \
| awk '{print $1}' \
| sort | uniq -c | sort -rn | head -24
The output for a legitimate-traffic site looks like a few hundred unique paths per hour. The output for a crawler trap looks like this:
41827 10/May/2026:01
39204 10/May/2026:00
37892 09/May/2026:23
38110 09/May/2026:22
39554 09/May/2026:21
Forty thousand unique URLs per hour, every hour, on a store with roughly a hundred actual products. That is not a traffic spike. That is the geometry of faceted navigation interacting with a crawler that does not know to skip query strings.
The second command identifies which user agents are responsible:
# Top user agents in the last hour
awk -F'"' '{print $6}' /etc/apache2/logs/domlogs/harborbeansco.com \
| tail -100000 \
| sort | uniq -c | sort -rn | head -10
We expect the answer here to be a long list of browser strings. When the top two lines are crawlers and they account for more than half of the requests, the diagnosis is confirmed.
The third command is the one that distinguishes a filter-trap from other crawler behaviour. We count the request paths grouped by their path without the query string:
# Group requests by path, ignoring query string
awk '{print $7}' /etc/apache2/logs/domlogs/harborbeansco.com \
| awk -F'?' '{print $1}' \
| sort | uniq -c | sort -rn | head
A crawler trap shows a tiny number of unique paths (usually the six or seven WooCommerce category archives) with very high hit counts. The actual product URLs barely show up because the crawler is hammering the archives with filter parameters and not following through to product detail pages.
The fourth command costs nothing and is worth its weight in diagnosis time. It pulls the same access log into GoAccess for a visual breakdown of request volume by URL, user agent, and origin subnet:
# 30-second visual breakdown of the access log
goaccess /etc/apache2/logs/domlogs/harborbeansco.com -a \
--log-format=COMBINED \
--output=/tmp/client-b-report.html
For more reusable awk one-liners on Apache access logs, including the per-subnet aggregation we use to confirm a Facebook subnet hit versus a single-IP MJ12bot hit, the WP-Cron stacking on cPanel postmortem has a section on log triage that pairs with this one. The diagnostic instincts overlap even though the underlying cause is unrelated.
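The per-subnet aggregation mentioned above reduces to one more one-liner. This is a sketch against the same domlog path (swap in your own), assuming IPv4 client addresses in the first log field; it collapses each client IP to its /24 and counts requests per subnet, which is what separates a Facebook-subnet burst from a single noisy MJ12bot IP:

```shell
# Requests per /24 subnet, busiest first (sketch; adjust the log path,
# assumes IPv4 client addresses in field 1 of a combined-format log)
awk '{ sub(/\.[0-9]+$/, ".0/24", $1); print $1 }' \
    /etc/apache2/logs/domlogs/harborbeansco.com \
    | sort | uniq -c | sort -rn | head -10
```

A Facebook hit shows one subnet with tens of thousands of requests; an MJ12bot hit shows a handful of scattered /24s with a few thousand each.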
Why your existing protections do not help
The instinct on an incident like this is to reach for CSF or Imunify360 or the caching plugin. None of them solve this problem for the reasons listed below, and the time you spend tuning them is time the FPM pool stays exhausted.
CSF
CSF blocks IPs you have denied. You can deny single IPs and small CIDRs. You cannot sensibly deny the Facebook crawler subnet because doing so will break Facebook share previews for every link the store posts to its own marketing pages. You can deny the MJ12bot IP, and you probably should, but the cluster will move to another IP within hours and you are back where you started.
Imunify360
Imunify360 is a WAF for known attack patterns. Faceted-navigation URLs are not an attack pattern. They are technically valid HTTP requests against a publicly accessible endpoint, returning HTTP 200, with a user agent that identifies itself honestly. There is nothing for the WAF to match against, and tuning the WAF to match will produce false positives on legitimate shoppers using filters.
WP Super Cache and W3 Total Cache
Caching plugins generate a cache file per unique query string. A crawler trap that produces 40,000 unique URLs per hour produces 40,000 cache files per hour. We have seen this fill a 50GB cPanel account in eighteen hours. The cache file is not the bottleneck; generating it is. The disk-fill problem turns this from an FPM incident into a full-stop site outage.
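A quick way to see whether the cache layer is absorbing the trap or amplifying it is to count cache files created in the last hour and check the total footprint. This is a sketch assuming a WP Super Cache layout under the account's home directory; the path differs per plugin:

```shell
# Cache files written in the last hour (hypothetical WP Super Cache path)
find ~/public_html/wp-content/cache -type f -mmin -60 | wc -l

# Total cache footprint, to see how fast the disk quota is burning
du -sh ~/public_html/wp-content/cache
```

If the first number tracks the unique-URL rate from the access log, the caching plugin is writing one file per bot request and the disk-fill clock is ticking.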
robots.txt
robots.txt does the right thing on the bots that obey it,
eventually. MJ12bot will obey, on its next refresh cycle, which
can be days. Facebook external-hit will not obey for Open Graph
fetches. Misbehaving SEO crawlers ignore it entirely. We add a
robots.txt directive as part of the fix below, but it is not
the fix on its own.
The shape of the right fix is: change what the bots see in the HTML, change what the server lets them ask for, and change what the canonical URL says when they do ask.
Our fix, in three parts
The fix is three changes that take roughly forty minutes to deploy on a typical WooCommerce store and that eliminate the entire incident class. Each part addresses a different layer.
Part 1: stop showing filter URLs to crawlers
The most effective single change is to add rel="nofollow" to
every filter link in the WooCommerce template. Well-behaved
crawlers (Googlebot, Bingbot, Yandex, Ahrefs, MJ12bot when it
refreshes) will skip URLs marked nofollow. Facebook external-hit
ignores nofollow for Open Graph fetches, but for everything else
this single attribute removes the crawler trap from the indexable
surface of the site.
For stores using a standard WooCommerce theme with the built-in
"Filter Products by Attribute" widget, the change is a filter on
woocommerce_layered_nav_term_html. Add this to the active
theme's functions.php:
<?php
// Add rel="nofollow" to every layered-nav filter link.
// Place in your child theme's functions.php. Never in
// the parent theme, which is overwritten on updates.
add_filter(
'woocommerce_layered_nav_term_html',
function ($term_html, $term, $link, $count) {
return preg_replace(
'/<a /',
'<a rel="nofollow" ',
$term_html,
1
);
},
10,
4
);
For stores using a custom-built filter UI or one of the popular
filter plugins (YITH WooCommerce Ajax Product Filter, WOOF), the
filter hook name varies but the pattern is identical: intercept
the output that renders filter <a> tags and add rel="nofollow".
The plugin documentation calls this out, usually under a heading
like "SEO settings".
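To verify the change took effect, fetch the rendered archive and count filter links that still lack the attribute. The domain below is a placeholder for your own shop URL, and the grep assumes filter links carry a filter_ query parameter as in the default WooCommerce widget; zero is the answer you want:

```shell
# Count filter links missing rel="nofollow"
# (replace the placeholder URL with your shop archive)
curl -s 'https://example.com/shop/' \
    | grep -o '<a [^>]*filter_[^>]*>' \
    | grep -vc 'rel="nofollow"'
```

Run it before and after deploying the functions.php change; the count should drop to zero after.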
A robots.txt directive complements this for the bots that read
it:
# /robots.txt
User-agent: *
Disallow: /shop/?
Disallow: /product-category/*?
Disallow: /*?filter_
Disallow: /*?attribute_
Disallow: /*?orderby=
The path patterns above match every URL with a filter or attribute
query string. Combined with the rel="nofollow" change, this
removes filter URLs from the indexable surface for every crawler
that obeys either signal (which is the long tail of crawlers).
Part 2: block the misbehaving bots at the edge
Some bots ignore rel="nofollow", ignore robots.txt, or hit
hard enough to matter even when they obey the rules an hour later.
For those, we block at the edge. There are two reasonable places
to do this. We use both, in different scenarios.
The Apache-level rule is appropriate when the store does not sit
behind a CDN. It goes in the site's .htaccess or, better, in a
per-vhost include managed through WHM's "Apache Configuration →
Include Editor":
# Rate-limit filter URLs by user agent.
# Place in the vhost include for the WooCommerce site.
<IfModule mod_security2.c>
SecRule REQUEST_URI "@rx ^/shop/\?|^/product-category/.*\?" \
"id:9000401,phase:1,chain,deny,status:429,\
msg:'WooCommerce filter URL from suspect UA'"
SecRule REQUEST_HEADERS:User-Agent \
"@rx (MJ12bot|AhrefsBot|SemrushBot|DotBot|BLEXBot)" \
"t:none"
</IfModule>
The rule reads as: when the request URI matches the shop or product-category endpoint with a query string, and the user agent matches one of these known aggressive crawlers, return 429. We return 429 rather than 403 so that well-behaved crawlers back off on the next request rather than retrying immediately.
We deliberately omit facebookexternalhit from this rule. Open
Graph previews matter for the store's Facebook marketing, and the
volume from a single shared URL is a burst we want to absorb at
the cache layer rather than reject at the edge.
When the store sits behind Cloudflare, the same intent expresses as a firewall rule or a Worker. The Cloudflare Bot Fight Mode feature catches the long tail of misbehaving crawlers without hand-rolled rules. For the specific case of MJ12bot and AhrefsBot, a Cloudflare firewall rule like the following is sufficient:
(http.request.uri.path matches "^/shop/" and
http.request.uri.query contains "filter_" and
cf.client.bot eq false and
http.user_agent contains "MJ12bot")
The expression above is the rule's filter; the rule's action is set to Block in the dashboard. Cloudflare's bot detection is better than anything we can build
in .htaccess. We use it whenever the client is willing to put
Cloudflare in front of their cPanel. We do not push Cloudflare on
clients who have considered it and declined, usually because of
DNS-management workflow, SSL workflow, or compliance considerations
that are valid for their business.
Part 3: normalise URLs at the WooCommerce level
The third change is the durable one. It collapses the URL space that the previous two parts have hidden from crawlers, so that even if some crawler sneaks through both, every filter URL canonicalises back to the unfiltered category page:
<?php
// Force canonical to the unfiltered category URL on
// filter-applied archive pages.
add_action('wp_head', function () {
if (! is_product_category() && ! is_shop()) {
return;
}
// Build the canonical URL without query parameters.
$canonical = strtok(
home_url(add_query_arg(null, null)),
'?'
);
printf(
'<link rel="canonical" href="%s" />' . "\n",
esc_url($canonical)
);
// Tell crawlers not to index filtered views explicitly.
if (! empty($_GET)) {
echo '<meta name="robots" content="noindex, follow" />' . "\n";
}
}, 1);
The effect of this code is that
/shop/?color=red&size=L&material=cotton emits a <link rel="canonical" href="/shop/" /> and a <meta name="robots" content="noindex, follow" />. Crawlers that respect canonical signals
(Google, Bing, and most reputable backlink crawlers)
consolidate all of those filter URLs into a single indexable
page. The crawler trap effectively disappears from their queue.
A more complete solution exists in the form of plugins like
WooCommerce Permalinks Manager, which rewrite filter URLs into
clean canonical paths. We have tested the plugin route and it works,
but it pulls in a dependency that the store owner has to keep
updated. The short wp_head hook above is usually enough
and adds no plugin surface.
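Verification is the same shape as the nofollow check: fetch a filtered URL and inspect the emitted tags. The domain is again a placeholder; you want the canonical pointing at the bare archive and a noindex robots meta:

```shell
# Inspect canonical and robots meta on a filtered URL
# (replace the placeholder domain with your own store)
curl -s 'https://example.com/shop/?filter_color=red' \
    | grep -Eo '<(link rel="canonical"|meta name="robots")[^>]*>'
```

Two lines of output, one canonical pointing at /shop/ and one robots meta containing noindex, means the trap is closed at this layer.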
When a CDN is the right answer and when Apache rules are
We are asked some version of this question on every WooCommerce
crawler-trap incident. The honest answer is that Cloudflare's bot
detection is significantly better than anything you can build in
.htaccess, and on stores that already use Cloudflare or are
willing to we always prefer the Cloudflare route.
Apache-level rules are the right answer in three situations:
- The store sits behind cPanel and the client is not willing to move DNS to Cloudflare. This is common with stores whose mail is on the same domain and whose email-deliverability story depends on DNS records they manage in cPanel.
- The store is on a managed hosting product that proxies in front of Apache but does not offer bot management features. Some regional managed-hosting providers fall into this bucket.
- The client has a compliance constraint that disallows non-EU proxying. Cloudflare's free tier does not let you pin to EU POPs; the Apache approach keeps traffic on the cPanel host.
For everyone else, Cloudflare's bot-fight mode plus the short
firewall rule above will do more than the four hours of
mod_security tuning that the Apache approach demands.
How ServerGuard handles this
ServerGuard's use case for the WooCommerce filter URL crawler
trap covers detection and in-flight remediation today; the
structural remediation is on the roadmap. We are deliberate about
the line between the two: modifying a WooCommerce theme on a
client site is exactly the kind of change that demands a human
approval, and pushing mod_security rules across an Apache vhost
is the kind of change that demands a paired set of human eyes on
the diff.
Today, ServerGuard's safe-action use case handles the in-flight incident:
- Detect. When per-account PHP-FPM utilisation crosses the 90% threshold for a WooCommerce account on a cPanel host, SGuard correlates the spike with the rolling Apache access log for that account. If the unique-URL count in the last fifteen minutes exceeds the rolling baseline by a factor of twenty, and the top user-agent strings include a known crawler from the built-in list, the incident is classified as woocommerce_crawler_trap and the runbook activates.
- Act, automatically. SGuard adds the offending source subnet to a soft rate-limit (mod_evasive-style, 10 req/sec per source IP) scoped to the WooCommerce vhost. This is a Safe action under the spec. Rate-limiting is reversible, scoped to a single vhost, and does not modify application code. The limit is logged to the audit trail with the source subnet, the matched user-agent pattern, and the inferred crawler family.
- Diagnose. SGuard pulls the top fifty filter URLs from the Apache log, the per-attribute hit distribution, and the FPM slow log for the same window. Those land in the incident ticket so the on-call engineer has a single timeline and a single diff-ready set of facts.
The structural remediation (Part 1 and Part 3 of the fix above)
sits behind a Moderate-tier approval gate and is on the upcoming
roadmap. Editing a WooCommerce theme on a client site is the kind
of change that needs a human signing off on the diff before it
ships. When this ships, SGuard will present the proposed
functions.php change as a unified diff, the corresponding
robots.txt update, and the rollback plan as a single approval
prompt in Telegram or the web dashboard.
Part 2 (the mod_security rules) is a Moderate action
because pushing rules into an Apache vhost include can break sites
in subtle ways and the rollback needs to be one click. We will
not ship a Moderate-tier action until that rollback is tested.
We are honest about one thing the use case deliberately does not do: SGuard does not install WordPress plugins on client sites and does not modify WooCommerce templates without an approval. The boundary of automation is the server, not the application running on it. WooCommerce theme-level fixes are outside our intervention surface by design.
For the related "what happens when the bot traffic also stacks WP-Cron" failure mode, see the WP-Cron stacking on cPanel postmortem. The same crawler trap can trigger that cascade on a WooCommerce site that runs heavy scheduled tasks. And for the adjacent failure mode where the crawler trap shows up alongside a firewall flap, see CSF, lfd, and Imunify360: why your firewall is killing itself. When CSF deny lists fill up faster than LFD can rotate them, the two incidents collide in a way that is harder to diagnose than either on its own.
The ten-minute audit
Four checks to run on any cPanel and WooCommerce site, before the page arrives instead of after.
- Is the unique-URL rate per hour wildly out of proportion to the product count? Run the first awk command above. If the hourly unique-URL count is more than ten times the number of products in the store, you have a crawler trap brewing.
- Is the top user agent on the access log a crawler? Run the second awk command. The top user agent should be a browser string. If it is facebookexternalhit, MJ12bot, AhrefsBot, SemrushBot, DotBot, or BLEXBot, and that user agent accounts for more than 30% of requests, you are in the trap.
- Are filter URLs canonicalised to the unfiltered archive? Open /shop/?color=red in a browser, view source, and find the <link rel="canonical"> tag. If it points back to /shop/?color=red instead of /shop/, the site is feeding every crawler the trap as an indexable page.
- Does the layered-nav filter widget emit rel="nofollow" on its <a> tags? View source on the shop archive, search for the filter widget output, and check whether the filter links include rel="nofollow". If they do not, the trap is open to every crawler that follows links honestly.
The honest version of this audit is that most WooCommerce stores in the wild answer "no" to at least three of the four. The fix on each is cheap. The fix together prevents the entire incident class, including the variant where the crawler is one we have not named in this post yet.
If you operate a WooCommerce store on cPanel and any part of this post made you wince, join the ServerGuard waitlist. We are onboarding agencies in cohorts, and the WooCommerce crawler-trap use case is one of the runbooks we ship today.