
WooCommerce filter URLs are a crawler trap: the fix

A 2026 postmortem on WooCommerce filter URLs as a crawler trap: 41,000 unique URLs/hour from Facebook and MJ12bot, the diagnostic, and a three-layer fix.

The page arrived a little after midnight on a Wednesday. A small WooCommerce store on a cPanel host was timing out on the shop archive, then recovering for a minute, then timing out again. The owner had checked the obvious things by the time we logged in: no promo running, no campaign live, no abnormal order volume in the admin. The cPanel resource graph told a different story. The account's per-user PHP-FPM pool had been at 100% utilisation in two-minute waves for the past six hours, and the Apache access log for harborbeansco.com was the size of a feature film.

We pulled the access log, ran a one-line count on unique request paths, and stopped reading at line two. In a single hour the site had served 41,000+ unique URLs, every one of them a permutation of the same six product categories with three to seven attribute filters appended. A real human shopper had visited maybe forty of those URLs across the night. The other forty thousand nine hundred and sixty had been requested by two bots: Facebook's external-hit crawler from a known public subnet, and MJ12bot from one of its usual addresses. Neither of them was malicious. Both of them were treating the site's faceted navigation the way a brute-force script treats wp-login.

This post is the postmortem for that incident, and for a sibling incident on the same server three weeks earlier. If you have found your way here from Googling "woocommerce facebook crawler" or "mj12bot wordpress block" or "cpanel block bots .htaccess", you are in the right place. We will cover the combinatorics that make WooCommerce filters uniquely dangerous, the diagnostic flow we ran on the box, the three-layer fix that actually closed the ticket, and an honest description of what ServerGuard does and does not do for this scenario today.

The symptom: minute-scale FPM exhaustion on a single account

The first thing you see is not a crawler error. There is no crawler error. What you see is a per-user PHP-FPM pool that climbs from 60% to 100% in roughly twenty seconds, holds there for two minutes, drains for forty seconds, and climbs again. The shop archive times out during the spikes and serves in 800ms in between. The error log has nothing useful in it, because the requests that are filling the pool are technically legitimate. They are GETs to /shop/?color=red&size=L, just with thirty thousand variations on which color, which size, and which material.
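
If you want to watch that saturation live rather than infer it from the resource graph afterwards, one line from a root shell on the cPanel host is enough. The username below is a stand-in for the account's cPanel user:

# Count active PHP-FPM workers for the account, refreshed every second.
# "harborbean" is a placeholder for the cPanel username.
watch -n1 'ps -u harborbean -o comm= | grep -c php-fpm'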

The graph that gives the game away is the request-rate panel by user agent, not the FPM panel. When we filtered the Apache log by the User-Agent field for the previous six hours, two strings accounted for 94% of all requests against this single account:

facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)

The rest of the requests were ordinary browser traffic. Forty-six humans had visited that night. The remaining several tens of thousands of requests were two crawlers walking the same product catalogue in slow motion, exhaustively, one filter combination at a time.

The math: twelve attributes by eight values, 282 billion URLs per category

WooCommerce attribute filters look harmless in the admin. You add "color" with six values, "size" with five, "material" with four, and "brand" with eight. That's four filters. From a shopper's point of view the catalogue offers 6 + 5 + 4 + 8 = 23 checkboxes. From a crawler's point of view the catalogue offers 6 × 5 × 4 × 8 = 960 distinct URL combinations per category, plus every subset, plus every ordering of those parameters in the query string. A reasonable WooCommerce store with twelve attribute types each having eight values produces, very approximately, this many unique filtered URLs per category page:

For 12 attributes with 8 values each, the count of non-empty
parameter subsets (ignoring order) is:
 
  sum over k=1..12 of C(12,k) * 8^k
 
  = 9^12 - 1
  = 282,429,536,480
 
URLs per category. There are usually six categories.

Most of those URLs return empty result sets. WooCommerce still renders them. Each render is a full PHP-FPM request, a session write, a database query against wp_posts, a join against wp_term_relationships, and a render of the (empty) results template. Every one of those URLs is a unique cache key in any sensible caching plugin, which means caching does not save you either: the bot is requesting URLs no human has ever requested, and no human ever will.

If you have one customer browsing the store and one Facebook crawler walking the filter combinations, the crawler is doing roughly eleven billion times the work the customer is. That ratio is not hyperbole. It is the geometry of faceted navigation, and it is the single biggest reason a small WooCommerce store with no notable traffic can run a cPanel server out of FPM workers.
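
You can put a number on your own catalogue without doing the algebra by hand. The sketch below assumes WP-CLI is installed and run as the site user from the store's docroot, and it mirrors the formula above: the product over attributes of (values + 1), minus one.

# Estimate the filter-URL space per category for this store.
# WooCommerce attribute taxonomies are registered with a "pa_" prefix.
total=1
for tax in $(wp taxonomy list --field=name | grep '^pa_'); do
  values=$(wp term list "$tax" --format=count)
  total=$(( total * (values + 1) ))   # each attribute: absent, or one of its values
done
echo "Filter-URL combinations per category (ignoring order): $(( total - 1 ))"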

Who's hitting you, and why

The two crawlers that surfaced in our incidents are not unusual. They are the two we caught at scale. The category is wider.

Facebook external hit

Facebook fetches a URL when a user shares it, when a user hovers over it in the news feed, when the Open Graph cache expires, and when their internal index decides to re-validate. The crawler identifies itself as facebookexternalhit/1.1 and originates from a known public subnet. The relevant facts about it, for our purposes, are these:

  • It hits in bursts. A single product URL shared on a Facebook page with 40,000 followers produces a 600 to 2000 req/sec burst as the news feed pre-fetches Open Graph metadata for each view.
  • It does not honour robots.txt for Open Graph fetches. Facebook treats those as a user-initiated action even though the user has done nothing except scroll past the link.
  • It rotates IPs within the subnet. Blocking a single IP within the crawler's subnet accomplishes nothing (the lookup sketch after this list shows how wide the subnet really is).
  • The subnet cannot be blocked outright without breaking link previews for that store's marketing posts. For an e-commerce store that depends on social, that is a non-starter.
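
For scoping purposes it helps to see how wide that subnet actually is. Meta registers its crawler ranges under AS32934, and the route objects can be pulled from the RADB whois server; this is the lookup we reach for when we need to rate-limit rather than block:

# List the network ranges registered to Meta's crawler ASN (AS32934).
# Rate-limit against these if you must; blocking them outright breaks link previews.
whois -h whois.radb.net -- '-i origin AS32934' | grep '^route' | sort -u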

MJ12bot

MJ12bot is the crawler for Majestic, the SEO backlink index. It identifies itself as MJ12bot/v1.4.8 and runs from a cluster of IPs of which one is typically the noisier member. Its behaviour is different from Facebook's:

  • It crawls at a low rate per IP, typically one request every few seconds.
  • It does honour robots.txt, but only when it next refreshes its copy of your robots.txt, which can be days.
  • It exhaustively explores filter URLs because it is looking for unique pages to back-link-score. Faceted navigation looks like a goldmine of unique pages to a backlink crawler.

A single MJ12bot IP at one request every three seconds is 1,200 requests an hour. A cluster of such IPs walking the filter space of a store with six categories and twelve attributes, stacked on top of the Facebook bursts, will produce roughly the 41,000-unique-URLs-per-hour rate we observed in the incident.
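
If you want to confirm that arithmetic against a live log rather than take it on trust, counting the bot's requests per hour is one grep away (the log path matches the diagnostic section below):

# Requests per hour from MJ12bot alone
grep -F 'MJ12bot' /etc/apache2/logs/domlogs/harborbeansco.com \
  | awk '{print substr($4, 2, 14)}' \
  | sort | uniq -c | sort -rn | head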

Bingbot, Yandex, Baidu, AhrefsBot

These hit the same trap at lower volume. The mechanism is identical: a search-engine or backlink crawler treats every unique URL as a unique page, and faceted navigation manufactures unique URLs faster than any crawler can sensibly index. The fix below covers them by category, not by IP.

The diagnostic flow

When the page comes in and you have an FPM pool at 100% on a single account, the diagnostic is four commands. We run them in this order and we have not yet found an incident of this type that needs a fifth.

The first command counts unique request paths per hour from the Apache access log. This is the one that tells you whether you are looking at a crawler trap or at organic traffic that just happens to be heavy:

# Count unique URL paths per hour over the last 24 hours.
# substr($4, 2, 14) trims the timestamp to the hour, e.g. 10/May/2026:01
awk '{print substr($4, 2, 14), $7}' /etc/apache2/logs/domlogs/harborbeansco.com \
  | sort -u \
  | awk '{print $1}' \
  | sort | uniq -c | sort -rn | head -24

The output for a legitimate-traffic site looks like a few hundred unique paths per hour. The output for a crawler trap looks like this:

  41827 10/May/2026:01
  39204 10/May/2026:00
  37892 09/May/2026:23
  38110 09/May/2026:22
  39554 09/May/2026:21

Forty thousand unique URLs per hour, every hour, on a store with roughly a hundred actual products. That is not a traffic spike. That is the geometry of faceted navigation interacting with a crawler that does not know to skip query strings.

The second command identifies which user agents are responsible:

# Top user agents over the last 100,000 requests
tail -n 100000 /etc/apache2/logs/domlogs/harborbeansco.com \
  | awk -F'"' '{print $6}' \
  | sort | uniq -c | sort -rn | head -10

We expect the answer here to be a long list of browser strings. When the top two lines are crawlers and they account for more than half of the requests, the diagnosis is confirmed.

The third command is the one that distinguishes a filter-trap from other crawler behaviour. We count the request paths grouped by their path without the query string:

# Group requests by path, ignoring query string
awk '{print $7}' /etc/apache2/logs/domlogs/harborbeansco.com \
  | awk -F'?' '{print $1}' \
  | sort | uniq -c | sort -rn | head

A crawler trap shows a tiny number of unique paths (usually the six or seven WooCommerce category archives) with very high hit counts. The actual product URLs barely show up because the crawler is hammering the archives with filter parameters and not following through to product detail pages.

The fourth command costs nothing and is worth its weight in diagnosis time. It pulls the same access log into GoAccess for a visual breakdown of request volume by URL, user agent, and origin subnet:

# 30-second visual breakdown of the access log
goaccess /etc/apache2/logs/domlogs/harborbeansco.com -a \
  --log-format=COMBINED \
  --output=/tmp/client-b-report.html

For more reusable awk one-liners on Apache access logs, including the per-subnet aggregation we use to confirm a Facebook subnet hit versus a single-IP MJ12bot hit, the WP-Cron stacking on cPanel postmortem has a section on log triage that pairs with this one. The diagnostic instincts overlap even though the underlying cause is unrelated.

Why your existing protections do not help

The instinct on an incident like this is to reach for CSF or Imunify360 or the caching plugin. None of them solve this problem for the reasons listed below, and the time you spend tuning them is time the FPM pool stays exhausted.

CSF

CSF blocks IPs you have denied. You can deny single IPs and small CIDRs. You cannot sensibly deny the Facebook crawler subnet because doing so will break Facebook share previews for every link the store posts to its own marketing pages. You can deny the MJ12bot IP, and you probably should, but the cluster will move to another IP within hours and you are back where you started.
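
When you do take that probably-worth-it swing at the current MJ12bot address, a temporary deny is the right shape, because the block will be obsolete within hours anyway. The IP below is from the documentation range and purely illustrative:

# Temporary CSF deny for 24 hours; buys breathing room, not a fix.
csf -td 203.0.113.45 86400 "MJ12bot walking WooCommerce filter URLs"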

Imunify360

Imunify360 is a WAF for known attack patterns. Faceted-navigation URLs are not an attack pattern. They are technically valid HTTP requests against a publicly accessible endpoint, returning HTTP 200, with a user agent that identifies itself honestly. There is nothing for the WAF to match against, and tuning the WAF to match will produce false positives on legitimate shoppers using filters.

WP Super Cache and W3 Total Cache

Caching plugins generate a cache file per unique query string. A crawler trap that produces 40,000 unique URLs per hour produces 40,000 cache files per hour. We have seen this fill a 50GB cPanel account in eighteen hours. The cache file is not the bottleneck; generating it is. The disk-fill problem turns this from an FPM incident into a full-stop site outage.
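
A quick check for whether the cache layer is amplifying the problem on disk, assuming WP Super Cache's default cache directory under the account's docroot (the username is a placeholder):

# Total cache size, and how many cache files were written in the last hour
du -sh /home/harborbean/public_html/wp-content/cache
find /home/harborbean/public_html/wp-content/cache -type f -mmin -60 | wc -l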

robots.txt

robots.txt does the right thing on the bots that obey it, eventually. MJ12bot will obey, on its next refresh cycle, which can be days. Facebook external-hit will not obey for Open Graph fetches. Misbehaving SEO crawlers ignore it entirely. We add a robots.txt directive as part of the fix below, but it is not the fix on its own.

The shape of the right fix is: change what the bots see in the HTML, change what the server lets them ask for, and change what the canonical URL says when they do ask.

Our fix, in three parts

The fix is three changes that take roughly forty minutes to deploy on a typical WooCommerce store and that eliminate the entire incident class. Each part addresses a different layer.

Part 1: stop showing filter URLs to crawlers

The most effective single change is to add rel="nofollow" to every filter link in the WooCommerce template. Well-behaved crawlers (Googlebot, Bingbot, Yandex, Ahrefs, MJ12bot when it refreshes) will skip URLs marked nofollow. Facebook external-hit ignores nofollow for Open Graph fetches, but for everything else this single attribute removes the crawler trap from the indexable surface of the site.

For stores using a standard WooCommerce theme with the built-in "Filter Products by Attribute" widget, the change is a filter on woocommerce_layered_nav_term_html. Add this to the active theme's functions.php:

<?php
// Add rel="nofollow" to every layered-nav filter link.
// Place in your child theme's functions.php. Never in
// the parent theme, which is overwritten on updates.
add_filter(
    'woocommerce_layered_nav_term_html',
    function ($term_html, $term, $link, $count) {
        return preg_replace(
            '/<a /',
            '<a rel="nofollow" ',
            $term_html,
            1
        );
    },
    10,
    4
);

For stores using a custom-built filter UI or one of the popular filter plugins (YITH WooCommerce Ajax Product Filter, WOOF), the filter hook name varies but the pattern is identical: intercept the output that renders filter <a> tags and add rel="nofollow". The plugin documentation calls this out, usually under a heading like "SEO settings".

A robots.txt directive complements this for the bots that read it:

# /robots.txt
User-agent: *
Disallow: /shop/?
Disallow: /product-category/*?
Disallow: /*?filter_
Disallow: /*?attribute_
Disallow: /*?orderby=

The path patterns above match every URL with a filter or attribute query string. Combined with the rel="nofollow" change, this removes filter URLs from the indexable surface for every crawler that obeys either signal (which is the long tail of crawlers).
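
Before relying on it, it is worth a quick sanity check that those patterns actually cover the URLs the bots have been requesting. Counting matching requests in the existing access log does that in one line:

# How many logged requests the robots.txt patterns above would have covered
awk '{print $7}' /etc/apache2/logs/domlogs/harborbeansco.com \
  | grep -Ec '^/(shop|product-category)/[^ ]*\?|[?&](filter_|attribute_|orderby=)'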

Part 2: block the misbehaving bots at the edge

Some bots ignore rel="nofollow", ignore robots.txt, or hit hard enough to matter even when they obey the rules an hour later. For those, we block at the edge. There are two reasonable places to do this. We use both, in different scenarios.

The Apache-level rule is appropriate when the store does not sit behind a CDN. It belongs in a per-vhost include managed through WHM's "Apache Configuration → Include Editor" rather than in the site's .htaccess: stock mod_security2 builds do not read SecRule directives from .htaccess files.

# Return 429 for filter URLs requested by known aggressive crawlers.
# Place in the vhost include for the WooCommerce site.
<IfModule mod_security2.c>
    SecRule REQUEST_URI "@rx ^/shop/\?|^/product-category/.*\?" \
        "id:9000401,phase:1,chain,deny,status:429,\
         msg:'WooCommerce filter URL from suspect UA'"
    SecRule REQUEST_HEADERS:User-Agent \
        "@rx (MJ12bot|AhrefsBot|SemrushBot|DotBot|BLEXBot)" \
        "t:none"
</IfModule>

The rule reads as: when the request URI matches the shop or product-category endpoint with a query string, and the user agent matches one of these known aggressive crawlers, return 429. We return 429 rather than 403 so that well-behaved crawlers back off on the next request rather than retrying immediately.

We deliberately omit facebookexternalhit from this rule. Open Graph previews matter for the store's Facebook marketing, and the volume from a single shared URL is a burst we want to absorb at the cache layer rather than reject at the edge.
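
A smoke test from the shell confirms the rule behaves as intended once it is live. The filter parameter is illustrative; the point is that the crawler user agent gets a 429 while a browser user agent still gets a 200:

# Expect 429 for the crawler UA, 200 for the browser UA
curl -s -o /dev/null -w '%{http_code}\n' -A 'MJ12bot/v1.4.8' \
  'https://harborbeansco.com/shop/?filter_color=red'
curl -s -o /dev/null -w '%{http_code}\n' -A 'Mozilla/5.0 (X11; Linux x86_64)' \
  'https://harborbeansco.com/shop/?filter_color=red'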

When the store sits behind Cloudflare, the same intent expresses as a single firewall rule or a Worker. The Cloudflare Bot Fight Mode feature catches the long tail of misbehaving crawlers without hand-rolled rules. For the specific case of MJ12bot and AhrefsBot, a Cloudflare firewall rule like the following is sufficient:

(starts_with(http.request.uri.path, "/shop/") and
 http.request.uri.query contains "filter_" and
 not cf.client.bot and
 http.user_agent contains "MJ12bot")

The expression is the match condition; the rule's action is set to Block in the Cloudflare dashboard.

Cloudflare's bot detection is better than anything we can build in .htaccess. We use it whenever the client is willing to put Cloudflare in front of their cPanel. We do not push Cloudflare on clients who have considered it and declined, usually because of DNS-management workflow, SSL workflow, or compliance considerations that are valid for their business.

Part 3: normalise URLs at the WooCommerce level

The third change is the durable one. It collapses the URL space that the previous two parts have hidden from crawlers, so that even if some crawler sneaks through both, every filter URL canonicalises back to the unfiltered category page:

<?php
// Force canonical to the unfiltered category URL on
// filter-applied archive pages.
add_action('wp_head', function () {
    if (! is_product_category() && ! is_shop()) {
        return;
    }
    // Build the canonical URL without query parameters.
    $canonical = strtok(
        home_url(add_query_arg(null, null)),
        '?'
    );
    printf(
        '<link rel="canonical" href="%s" />' . "\n",
        esc_url($canonical)
    );
    // Tell crawlers not to index filtered views explicitly.
    if (! empty($_GET)) {
        echo '<meta name="robots" content="noindex, follow" />' . "\n";
    }
}, 1);

The effect of this code is that /shop/?color=red&size=L&material=cotton emits a <link rel="canonical"> pointing at the unfiltered /shop/ URL and a <meta name="robots" content="noindex, follow" />. Crawlers that respect canonical signals (Google, Bing, and most reputable backlink crawlers) consolidate all of those filter URLs into a single indexable page. The crawler trap effectively disappears from their queue.
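
Verifying the change takes one request from the shell. The filter parameters are illustrative; what matters is that a filtered URL now returns the unfiltered canonical and the noindex meta:

# Expect a canonical pointing at /shop/ and a "noindex, follow" robots meta
curl -s 'https://harborbeansco.com/shop/?filter_color=red&filter_size=l' \
  | grep -Ei 'rel="canonical"|name="robots"'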

A more complete solution exists in the form of plugins like WooCommerce Permalinks Manager, which rewrite filter URLs into clean canonical paths. We have tested the plugin route and it works, but it pulls in a dependency that the store owner has to keep updated. The short wp_head callback above is usually enough and adds no plugin surface.

When a CDN is the right answer and when Apache rules are

We are asked some version of this question on every WooCommerce crawler-trap incident. The honest answer is that Cloudflare's bot detection is significantly better than anything you can build in .htaccess, and on stores that already use Cloudflare or are willing to we always prefer the Cloudflare route.

Apache-level rules are the right answer in three situations:

  • The store sits behind cPanel and the client is not willing to move DNS to Cloudflare. This is common with stores whose mail is on the same domain and whose email-deliverability story depends on DNS records they manage in cPanel.
  • The store is on a managed hosting product that proxies in front of Apache but does not offer bot management features. Some regional managed-hosting providers fall into this bucket.
  • The client has a compliance constraint that disallows non-EU proxying. Cloudflare's free tier does not let you pin to EU POPs; the Apache approach keeps traffic on the cPanel host.

For everyone else, Cloudflare's Bot Fight Mode plus the short firewall rule above will do more than the four hours of mod_security tuning that the Apache approach demands.

How ServerGuard handles this

ServerGuard's coverage of the WooCommerce filter-URL crawler trap spans detection and in-flight remediation today, with the structural remediation on the roadmap. We are deliberate about the line between the two: modifying a WooCommerce theme on a client site is exactly the kind of change that demands human approval, and pushing mod_security rules across an Apache vhost is the kind of change that demands a paired set of human eyes on the diff.

Today, ServerGuard's safe-action use case handles the in-flight incident:

  • Detect. When per-account PHP-FPM utilisation crosses the 90% threshold for a WooCommerce account on a cPanel host, SGuard correlates the spike with the rolling Apache access log for that account. If the unique-URL count in the last fifteen minutes exceeds the rolling baseline by a factor of twenty, and the top user-agent strings include a known crawler from the built-in list, the incident is classified as woocommerce_crawler_trap and the runbook activates.
  • Act, automatically. SGuard adds the offending source subnet to a soft rate-limit (mod_evasive-style, 10 req/sec per source IP) scoped to the WooCommerce vhost. This is a Safe action under the spec. Rate-limiting is reversible, scoped to a single vhost, and does not modify application code. The limit is logged to the audit trail with the source subnet, the matched user-agent pattern, and the inferred crawler family.
  • Diagnose. SGuard pulls the top fifty filter URLs from the Apache log, the per-attribute hit distribution, and the FPM slow log for the same window. Those land in the incident ticket so the on-call engineer has a single timeline and a single diff-ready set of facts.

The structural remediation (Part 1 and Part 3 of the fix above) sits behind a Moderate-tier approval gate and is on the upcoming roadmap. Editing a WooCommerce theme on a client site is the kind of change that needs a human signing off on the diff before it ships. When this ships, SGuard will present the proposed functions.php change as a unified diff, the corresponding robots.txt update, and the rollback plan as a single approval prompt in Telegram or the web dashboard.

Part 2 (the mod_security rules) is a Moderate action because pushing rules into an Apache vhost include can break sites in subtle ways and the rollback needs to be one click. We will not ship a Moderate-tier action until that rollback is tested.

We are honest about one thing the use case deliberately does not do: SGuard does not install WordPress plugins on client sites and does not modify WooCommerce templates without an approval. The boundary of automation is the server, not the application running on it. WooCommerce theme-level fixes are outside our intervention surface by design.

For the related "what happens when the bot traffic also stacks WP-Cron" failure mode, see the WP-Cron stacking on cPanel postmortem. The same crawler trap can trigger that cascade on a WooCommerce site that runs heavy scheduled tasks. And for the adjacent failure mode where the crawler trap shows up alongside a firewall flap, see CSF, lfd, and Imunify360: why your firewall is killing itself. When CSF deny lists fill up faster than LFD can rotate them, the two incidents collide in a way that is harder to diagnose than either on its own.

The ten-minute audit

Four checks to detect this on any WooCommerce site on cPanel, before the page arrives instead of after.

  1. Is the unique-URL rate per hour wildly out of proportion to the product count? Run the first awk command above. If the hourly unique-URL count is more than ten times the number of products in the store, you have a crawler trap brewing.
  2. Is the top user agent on the access log a crawler? Run the second awk command. The top user agent should be a browser string. If it is facebookexternalhit, MJ12bot, AhrefsBot, SemrushBot, DotBot, or BLEXBot, and that user agent accounts for more than 30% of requests, you are in the trap.
  3. Are filter URLs canonicalised to the unfiltered archive? Open /shop/?color=red in a browser, view source, find the <link rel="canonical"> tag. If it points back to /shop/?color=red instead of /shop/, the site is feeding every crawler the trap as an indexable page.
  4. Does the layered-nav filter widget emit rel="nofollow" on its <a> tags? View source on the shop archive, search for the filter widget output, check whether the filter links include rel="nofollow". If they do not, the trap is open to every crawler that follows links honestly. (This check can be scripted; see the curl sketch after this list.)
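
Check 4 does not need a browser. A curl-and-grep pass over the shop archive counts filter links that are still missing rel="nofollow"; anything above zero means the trap is still open. The domain is a stand-in, and the grep assumes the stock layered-nav behaviour of carrying filter_ parameters in the hrefs:

# Count filter links on the shop archive that lack rel="nofollow"
curl -s 'https://harborbeansco.com/shop/' \
  | grep -o '<a [^>]*filter_[^>]*>' \
  | grep -cv 'rel="nofollow"'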

The honest version of this audit is that most WooCommerce stores in the wild answer "no" to at least three of the four. The fix for each is cheap. Together they prevent the entire incident class, including the variant where the crawler is one we have not named in this post.

If you operate a WooCommerce store on cPanel and any part of this post made you wince, join the ServerGuard waitlist. We are onboarding agencies in cohorts, and the WooCommerce crawler-trap use case is one of the runbooks we ship today.
