
AutoSSL fails on Microsoft 365 autodiscover subdomains: the fix

Why cPanel AutoSSL emails nightly that autodiscover.client failed HTTP-01 when the client uses Microsoft 365, and the two-step WHM exclusion fix.


The email arrives every night at the same time. [cpanel-host] AutoSSL Failed for User 'bayareade'. The body lists three or four subdomains that failed Domain Control Validation, the same ones every night, week after week. Nothing on the server has changed. The cert on the primary domain renewed cleanly. The site loads. WHMCS is happy. And yet at 03:00 every morning AutoSSL writes a new line in /var/log/autossl.log saying it could not validate autodiscover.bayareadesignco.com, and the admin inbox gets another copy of the same failure email it got yesterday, and the day before, and the day before that.

This is the postmortem of a slow-burn AutoSSL failure that ran on one of our cPanel servers for six weeks before anyone correlated the email with its root cause. The pattern is unambiguous once you know what to look for: a client moves their mail to Microsoft 365, the M365 setup guide tells them to CNAME autodiscover and a couple of other names to Microsoft's endpoints, and from that moment cPanel's AutoSSL can no longer prove domain control over those subdomains because the HTTP-01 challenge file lives on our server while DNS now points the subdomain somewhere else. The fix is two clicks in WHM. The interesting part is how the failure hides in plain sight for so long.

This post covers the failure mode, the DNS pattern that produces it, the two-step fix, the audit command we now run across the fleet, and an honest description of how ServerGuard's use case handles the same class of incident.

The repeating failure email

The first signal is always the email, and the email always looks the same. Subject:

[cpanel-host] AutoSSL Failed for User 'bayareade'

Body (trimmed to the load-bearing lines):

The AutoSSL system failed to renew the SSL certificate for the
"bayareade" cPanel user account.

The following domains failed DCV (Domain Control Validation):

  - autodiscover.bayareadesignco.com
  - autodiscover.bayareadesignco.com (IPv6)
  - mail.bayareadesignco.com

Reason: "DNS DCV" failed: The DNS query for
"_acme-challenge.autodiscover.bayareadesignco.com" returned a result that
does not match the expected challenge response.

The AutoSSL system will continue to attempt renewal.

That last line is what makes this incident class so corrosive. AutoSSL will keep trying. It will keep emailing. The failures are partial. The primary domain bayareadesignco.com renewed fine, only the mail-adjacent subdomains failed, so nothing user-visible breaks. There is no broken site to investigate, no angry client ticket, just a nightly email that joins the queue of other AutoSSL notices, slowly trains the on-call engineer to mark anything starting with [cpanel-host] AutoSSL as background noise, and quietly establishes alert fatigue as the default response to certificate emails.

The cost is not the failing cert. The cost is the day a real renewal fails (a domain transfer that did not propagate, a misconfigured DNS record on a billing-critical site) and the alert for that failure arrives in the same shape, in the same inbox, and gets handled with the same shrug as the six weeks of M365 autodiscover noise that preceded it.

Why it fails

cPanel's AutoSSL validates domain control by serving a challenge file over HTTP. For each domain on a cPanel account, AutoSSL writes a randomised token to /home/<user>/public_html/.well-known/acme-challenge/<token>, asks Sectigo or Let's Encrypt to fetch http://<domain>/.well-known/acme-challenge/<token>, and proves control by matching the response. This is the HTTP-01 challenge in the ACME spec, and on a normal cPanel-hosted domain it works because the A record for <domain> points at the cPanel server's IP, the request hits Apache, Apache serves the file from the user's docroot, the CA verifies the token, and the cert issues.
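The round trip is easy to hold in your head as a pair of strings. A minimal sketch (the helper name is ours, not cPanel's) that, given a cPanel user, a domain, and a token, prints where AutoSSL writes the file and the URL the CA will fetch:

```shell
# Hypothetical helper illustrating the HTTP-01 layout AutoSSL relies on:
# the token file lives in the user's docroot, and the CA fetches the
# matching path over plain HTTP on the domain itself.
challenge_pair() {
  user=$1; domain=$2; token=$3
  printf '%s\n' "/home/${user}/public_html/.well-known/acme-challenge/${token}"
  printf '%s\n' "http://${domain}/.well-known/acme-challenge/${token}"
}

# Example: challenge_pair bayareade bayareadesignco.com q7nM
```

The validation only works if a request for the second string actually lands on the server holding the first — which is exactly the assumption the M365 CNAMEs break.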

The flow breaks the moment a subdomain points anywhere other than the cPanel server. For Microsoft 365, the standard tenant setup instructions tell the customer to add CNAME records for at least three names:

Name                                 CNAME target               Purpose
autodiscover.bayareadesignco.com     autodiscover.outlook.com   Outlook client autodiscovery
lyncdiscover.bayareadesignco.com     webdir.online.lync.com     Teams/Skype federation
sip.bayareadesignco.com              sipdir.online.lync.com     SIP signalling for Teams

Once those CNAMEs are in place, a request for http://autodiscover.bayareadesignco.com/.well-known/acme-challenge/<token> no longer reaches our server. It reaches autodiscover.outlook.com, which does not know what the AutoSSL challenge is, has never heard of the token, and responds with a 404 (or worse, an HTTP redirect to an HTTPS endpoint with an unrelated cert, which AutoSSL also treats as failure). cPanel's AutoSSL reads the failure, logs it, emails the admin, and queues another attempt for tomorrow night.

The same thing happens for mail.bayareadesignco.com if the client has CNAMEd that name to Microsoft's mail endpoint, which several M365 deployment guides recommend even though Microsoft's own documentation suggests using outlook.office365.com only via Outlook's autoconfig. We have seen the mail.* variant in roughly half of M365 migrations on our fleet.

The full chain looks like this in the AutoSSL log:

[2026-04-02 03:14:08 +0000] info [autossl] Domain "autodiscover.bayareadesignco.com": HTTP DCV check started
[2026-04-02 03:14:08 +0000] info [autossl] Domain "autodiscover.bayareadesignco.com": Fetching "http://autodiscover.bayareadesignco.com/.well-known/acme-challenge/q7nM..."
[2026-04-02 03:14:09 +0000] warn [autossl] Domain "autodiscover.bayareadesignco.com": HTTP DCV failed: 404 Not Found
[2026-04-02 03:14:09 +0000] info [autossl] Domain "autodiscover.bayareadesignco.com": Trying DNS DCV fallback
[2026-04-02 03:14:10 +0000] warn [autossl] Domain "autodiscover.bayareadesignco.com": DNS DCV failed: no _acme-challenge TXT record found
[2026-04-02 03:14:10 +0000] error [autossl] Domain "autodiscover.bayareadesignco.com": all DCV methods exhausted, skipping for this issuance

The fallback to DNS DCV is interesting and worth understanding. cPanel will try a DNS-01 challenge if the HTTP-01 fails, but only if there is a _acme-challenge.<domain> TXT record on the authoritative DNS server that AutoSSL can write to. For a domain whose DNS is managed externally (Cloudflare, Route 53, the registrar's own panel), AutoSSL cannot place that TXT record, and the DNS DCV fallback fails immediately. So the two-stage failure is the normal case: HTTP-01 fails because the CNAME points elsewhere, DNS-01 fails because AutoSSL cannot write to external DNS.
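A quick way to see how long a given subdomain has been stuck in this loop is to count the exhausted-DCV lines per domain. A minimal sketch against the log format shown above (the function reads log text on stdin):

```shell
# Count "all DCV methods exhausted" occurrences per domain in AutoSSL log
# text, most-failed first. One line of output per failing domain.
count_exhausted() {
  grep 'all DCV methods exhausted' \
    | sed 's/.*Domain "\([^"]*\)".*/\1/' \
    | sort | uniq -c | sort -rn
}

# Usage: count_exhausted < /var/log/autossl.log
```

A domain with a count in the double digits has been failing nightly for weeks — which is the tell that the failure is structural, not transient.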

Why this only happens with M365 (and similar)

This failure mode is specific to mail-as-a-service providers that expect customers to point named subdomains at provider endpoints via CNAME. The three common cases on our fleet:

Microsoft 365. The case in this post. Requires CNAMEs for autodiscover, lyncdiscover, sip, and frequently enterpriseregistration and enterpriseenrollment for Intune deployments. All five fail AutoSSL the same way.

Google Workspace. Does not produce this failure. Google Workspace uses MX records to route mail and the autoconfig flow is keyed off MX plus a single _dmarc TXT record. There are no provider-pointing CNAMEs on the mail subdomain, so AutoSSL still sees the cPanel server's A record for mail.clientdomain.com and the HTTP-01 challenge succeeds naturally.

Zoho Mail and ProtonMail business. Both can produce this failure in some configurations. Zoho recommends a CNAME for mail to business.zoho.com for webmail access. That one breaks AutoSSL the same way. ProtonMail business with custom domain plus their bridge feature can produce the same pattern on mail.* if the customer follows the optional setup steps.

On-server mail (Dovecot/Exim on cPanel itself). No failure. The mail subdomains resolve to the cPanel server's IP because the mail.clientdomain.com A record is auto-managed by cPanel and points to the same server. HTTP-01 succeeds for the same reason it succeeds for the primary domain.

The pattern that matters for diagnosis: if dig <subdomain> CNAME returns a target outside the cPanel server (especially anything ending in .outlook.com, .lync.com, .protection.outlook.com, or .onmicrosoft.com), this incident class is the explanation.
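That rule of thumb can be written down as a tiny classifier. A sketch (the suffix list mirrors the patterns named above, with the trailing dot exactly as dig prints it):

```shell
# Classify a `dig +short <subdomain> CNAME` answer by mail provider.
# Empty input means no CNAME at all (an A record or NXDOMAIN), which
# rules this incident class out.
classify_cname() {
  case "$1" in
    *.outlook.com.|*.lync.com.|*.protection.outlook.com.|*.onmicrosoft.com.)
      echo "M365" ;;
    *.zoho.com.) echo "ZOHO" ;;
    "")          echo "NO_CNAME" ;;
    *)           echo "OTHER" ;;
  esac
}
```

Anything the classifier calls OTHER deserves a human look before touching AutoSSL settings.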

The DNS verification flow

Before excluding anything, confirm the diagnosis. From the cPanel server (or anywhere with dig):

$ dig +short autodiscover.bayareadesignco.com CNAME
autodiscover.outlook.com.
 
$ dig +short autodiscover.outlook.com
autodiscover-emeawest.office.com.
52.97.146.226
52.97.147.32
# ... rotating Microsoft endpoint IPs ...

The first line is the smoking gun. Any CNAME target outside our cPanel server's domain confirms the diagnosis. If the CNAME target is autodiscover.outlook.com specifically, the client is on M365. If the target is autodiscover.{tenant}.onmicrosoft.com, the client uses a direct-to-tenant CNAME (less common but valid). For the mail.* variant:

$ dig +short mail.bayareadesignco.com CNAME
bayareadesignco-com.mail.protection.outlook.com.

mail.protection.outlook.com is the M365 Exchange Online Protection endpoint. Same explanation, same fix.
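To sweep all the usual labels for one domain in a single pass, a small wrapper helps. A sketch (the resolver is injectable so the function can be exercised without network access; by default it shells out to dig):

```shell
# Print each common mail-related subdomain alongside its CNAME target.
# $2 optionally names a resolver function, so tests can substitute a fake
# in place of the real dig lookup.
check_m365_subs() {
  domain=$1; lookup=${2:-dig_short}
  for sub in autodiscover lyncdiscover sip mail; do
    printf '%s -> %s\n' "${sub}.${domain}" "$($lookup "${sub}.${domain}")"
  done
}
dig_short() { dig +short "$1" CNAME; }

# Usage: check_m365_subs bayareadesignco.com
```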

A quick sanity check on what the cPanel server itself thinks the A record should be:

$ whmapi1 --output=jsonpretty dumpzone domain=bayareadesignco.com \
    | jq '.data.zone[].record[] | select(.name == "autodiscover.bayareadesignco.com.")'
{
  "name": "autodiscover.bayareadesignco.com.",
  "class": "IN",
  "ttl": 14400,
  "type": "CNAME",
  "cname": "autodiscover.outlook.com."

The local DNS zone agrees with what dig returned externally. This is the case in every M365 migration we have seen, because the client edits the zone in WHM's DNS Zone Editor before pointing Outlook at the new endpoint. The CNAME is intentional and correct. Do not change the CNAME. The client needs it for Outlook autoconfig to work. The fix is on the AutoSSL side, not the DNS side.

The fix in two steps

Step 1: exclude the mail subdomains from AutoSSL

cPanel has a per-user "Excluded Domains" list that AutoSSL skips entirely. The list is editable in WHM (Home » SSL/TLS » Manage AutoSSL » Manage Users » Excluded Domains) and via the WHM API. For a single client:

$ whmapi1 --output=jsonpretty set_autossl_user_excluded_domains \
    user=bayareade \
    excluded_domains='autodiscover.bayareadesignco.com,lyncdiscover.bayareadesignco.com,sip.bayareadesignco.com,mail.bayareadesignco.com'
{
   "metadata" : {
      "command" : "set_autossl_user_excluded_domains",
      "version" : 1,
      "reason" : "OK",
      "result" : 1
   },
   "data" : {
      "excluded_domains" : [
         "autodiscover.bayareadesignco.com",
         "lyncdiscover.bayareadesignco.com",
         "sip.bayareadesignco.com",
         "mail.bayareadesignco.com"
      ]
   }
}

The first three are always safe to exclude in an M365 deployment. They point at Microsoft endpoints and cPanel will never be able to serve a cert for them. The fourth (mail.*) is the case-by-case one: exclude it only if the client has CNAMEd it to Microsoft, leave it included if the client uses on-server webmail.

To verify the exclusion took effect, trigger a manual AutoSSL run for that user and watch the log:

$ /usr/local/cpanel/bin/autossl_check --user=bayareade
$ tail -f /var/log/autossl.log
[2026-04-15 14:22:01 +0000] info [autossl] Starting check for user "bayareade"
[2026-04-15 14:22:01 +0000] info [autossl] Skipping "autodiscover.bayareadesignco.com": user-excluded
[2026-04-15 14:22:01 +0000] info [autossl] Skipping "lyncdiscover.bayareadesignco.com": user-excluded
[2026-04-15 14:22:01 +0000] info [autossl] Skipping "sip.bayareadesignco.com": user-excluded
[2026-04-15 14:22:01 +0000] info [autossl] Skipping "mail.bayareadesignco.com": user-excluded
[2026-04-15 14:22:02 +0000] info [autossl] Issued certificate for "bayareadesignco.com, www.bayareadesignco.com"
[2026-04-15 14:22:02 +0000] info [autossl] Check complete for user "bayareade"

The nightly emails stop the following day. The cert on the primary domain renews on its normal cycle. The mail subdomains continue to function as M365 endpoints because they never needed a cert from our server in the first place. The certs they need are Microsoft's, and Microsoft serves those from autodiscover.outlook.com directly to the Outlook client.

A note on wildcards: cPanel does not currently support an "exclude autodiscover.* across all users" pattern at the WHM level. The per-user exclusion is the only mechanism. We have asked cPanel for a global pattern-based exclusion list and were told it is on the roadmap with no committed date. Until then, every M365 migration on every cPanel account requires its own exclusion entry.
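Since every account needs its own entry, we generate the exclusion commands from the runbook rather than typing them. A dry-run sketch (input format assumed: one line per client, `<user> <primary-domain> <comma-separated subdomain labels>`; it prints the whmapi1 commands instead of running them, so you can review before piping to sh):

```shell
# Generate per-user AutoSSL exclusion commands from runbook-style input
# on stdin. Dry-run: output is the commands themselves, not their effects.
gen_exclusion_cmds() {
  while read -r user domain labels; do
    # turn "autodiscover,sip" into "autodiscover.<domain>,sip.<domain>"
    excl=$(printf '%s' "$labels" | tr ',' '\n' | sed "s/\$/.${domain}/" | paste -sd, -)
    printf "whmapi1 set_autossl_user_excluded_domains user=%s excluded_domains='%s'\n" \
      "$user" "$excl"
  done
}

# Usage: gen_exclusion_cmds < m365-clients.txt   # then review, then pipe to sh
```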

Step 2: document it for next time

This is the step that prevents the slow-burn recurrence. After excluding the subdomains, write an internal note that links the cPanel user to the M365 deployment. We use a single Markdown file in our agency's runbook repo with one line per client:

goldenvi        Google Workspace          (no AutoSSL exclusions)
northwood       on-server Dovecot         (no AutoSSL exclusions)
bayareade       Microsoft 365             autodiscover, lyncdiscover, sip, mail
tallpine        Microsoft 365             autodiscover, lyncdiscover, sip
aspenroo        Zoho Mail                 mail

The reason this note matters is the failure mode's other recurrence trigger: domain transfers. When a new client transfers a domain to our server and we run cPanel's transfer tool, the DNS zone comes with it, including the M365 CNAMEs. AutoSSL on the new server starts trying to validate those subdomains the same night and the failure email cycle begins again from zero. Having the per-client M365 list as a checklist item in the transfer-in runbook collapses a six-week recurrence into a three-minute fix at transfer time.

Edge case: when the client wants the cert anyway

Some compliance regimes (PCI-DSS scoped environments, certain healthcare contexts) require every subdomain under a regulated domain to have a valid TLS cert, even when the subdomain is delegated to a third-party endpoint. This is rare in practice (most auditors accept "the subdomain delegates to Microsoft and Microsoft serves a valid cert" as compliance) but when it does come up the fix is to issue the cert via DNS-01 challenge out-of-band.

The short version: install acme.sh on the cPanel server, configure it with API credentials for the client's external DNS provider (Cloudflare's CF_Token, Route 53's IAM keys, etc.), issue the cert with the DNS-01 challenge against the third-party DNS, and install it manually into the cPanel user's SSL store via whmapi1 installssl. The cert is then renewed by a cron job that runs acme.sh --renew rather than by AutoSSL.

cPanel's AutoSSL does not natively support DNS-01 against external DNS providers, which is why this has to live as an out-of-band cron. We treat this path as the exception, not the default: it adds a renewal mechanism the cPanel admin has to remember exists, and the first time it silently fails (Cloudflare API token expires, IAM key rotates) the failure mode is invisible. No email, no log line in /var/log/autossl.log, just a cert that quietly expires. If you walk this path, monitor the cert directly via openssl s_client rather than trusting the renewal script.
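For that direct monitoring, the check reduces to "how many days until notAfter". A sketch (GNU date assumed for the date parsing; the s_client pipeline in the comment is the live, network-touching check):

```shell
# Days until a certificate expires, given the notAfter line that
# `openssl x509 -noout -enddate` prints. Negative means already expired.
# Live usage (network):
#   days_left "$(echo | openssl s_client -connect mail.example.com:443 \
#       -servername mail.example.com 2>/dev/null | openssl x509 -noout -enddate)"
days_left() {
  end=$(date -d "${1#notAfter=}" +%s)   # GNU date; strip the notAfter= prefix
  echo $(( (end - $(date +%s)) / 86400 ))
}
```

Alert on anything under, say, 14 days and the silent-expiry failure mode of the out-of-band path goes away.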

A 5-minute audit

This is the script we now run quarterly across our cPanel fleet to catch the M365 pattern on servers we have not touched in a while. From a workstation with SSH access to each server in ~/.ssh/config:

# Step 1: list every domain on every cPanel server, flag M365-pattern CNAMEs.
for srv in $(awk '/^Host server[0-9]/ {print $2}' ~/.ssh/config); do
  ssh "$srv" '
    whmapi1 --output=jsonpretty list_users \
      | jq -r ".data.users[]" \
      | while read -r u; do
          uapi --user="$u" --output=jsonpretty DomainInfo list_domains \
            | jq -r ".result.data.main_domain, (.result.data.sub_domains // [])[]"
        done
  ' | while read -r d; do
    cname=$(dig +short "$d" CNAME)
    case "$cname" in
      *.outlook.com.|*.lync.com.|*.protection.outlook.com.|*.onmicrosoft.com.)
        echo "M365_PATTERN $srv $d -> $cname"
        ;;
    esac
  done
done > /tmp/m365-autossl-audit.txt

# Step 2: for each match, check whether it is already excluded.
while read -r line; do
  set -- $line                    # fields: M365_PATTERN <server> <domain> -> <cname>
  srv=$2; domain=$3
  user_domain=$(echo "$domain" | sed 's/^[^.]*\.//')   # strip the subdomain label
  ssh "$srv" "whmapi1 --output=jsonpretty get_autossl_user_excluded_domains user=\$(whmapi1 --output=jsonpretty domainuserdata domain=$user_domain | jq -r '.data.userdata.user')" \
    | jq -r --arg d "$domain" '.data.excluded_domains // [] | if index($d) then "OK_EXCLUDED \($d)" else "NEEDS_EXCLUSION \($d)" end'
done < /tmp/m365-autossl-audit.txt

The output is two lists: domains matching the M365 CNAME pattern, and which of those are already excluded vs which are still in the failure loop. The first time we ran this across our fleet we found seventeen domains across eleven cPanel users that had been emailing failure notices nightly for periods ranging from three weeks to four months. None of them had ever been reported by the client because none of them caused a user-visible problem. They just chewed through admin attention every morning.

Two cPanel administration patterns deserve their own posts and are covered separately: the layered SSH brute-force defence on cPanel servers in SSH brute force on cPanel: the 8,127-attempt night and the fix, and the AutoSSL-adjacent issue of certificate expiry hiding behind a CSF, lfd, and Imunify360 conflict that masks renewal alerts behind firewall noise. Both share the same core failure shape as this incident: a quiet background subsystem producing alerts that look identical day after day, until they stop being read.

How ServerGuard handles this

The AutoSSL M365 pattern is one of the SSL-renewal use cases we cover. It is one of the highest-frequency and lowest-severity incidents in our fleet, which makes it a clean fit for ServerGuard's Safe-action tier. The failure is unambiguous when the CNAME points at a Microsoft endpoint, the fix is reversible (an exclusion can be removed in a single API call), and the cost of a wrong auto-exclusion is small (one renewal cycle of a cert we did not need anyway).

Detection. ServerGuard ingests AutoSSL signals from two sources in parallel. The first is the same nightly email the admin gets, parsed via the IMAP integration on the agency's central admin mailbox, which is enough to catch the failure within hours of the first occurrence. The second is the AutoSSL log itself, scraped via the same SSH executor that handles every other diagnostic, looking for the error [autossl] ... all DCV methods exhausted line and correlating repeats across nights. Both sources land in the same incident record so a single failure that arrived only via log (the admin's email filter sent the AutoSSL notice to a folder) does not fall through the cracks. The detection layer is implemented today.

Diagnosis. When the incident record fires, ServerGuard runs dig <failing-subdomain> CNAME on the failing subdomain plus the sibling subdomains that share its mail pattern (autodiscover, lyncdiscover, sip, mail, enterpriseregistration, enterpriseenrollment). It classifies the CNAME target against a fixed pattern list: anything ending in .outlook.com, .lync.com, .protection.outlook.com, or .onmicrosoft.com classifies as unambiguous M365. Anything ending in .zoho.com classifies as Zoho Mail. Anything ending in .protonmail.ch classifies as ProtonMail. Anything else (a CNAME pointing at a non-mail third-party, or no CNAME at all) classifies as ambiguous and triggers the manual escalation path instead. The classifier is implemented today for the M365 case (the highest-frequency one); the Zoho and ProtonMail classifiers are upcoming and will follow the same pattern.

Action (Safe, auto). When the classifier returns unambiguous M365 AND the failure has repeated for at least three consecutive AutoSSL runs, ServerGuard calls whmapi1 set_autossl_user_excluded_domains to add the failing subdomain to the per-user exclusion list. The action is logged into the audit log with the dig output and the AutoSSL log lines that triggered it, and the next AutoSSL run is monitored to confirm the exclusion took effect. The Safe-action auto-exclude is implemented today with the unambiguous-pattern + 3-failure guardrail.

Action (manual escalation). For ambiguous cases (a CNAME we have not seen before, a mixed pattern where autodiscover points at Microsoft but mail points at the cPanel server, or a failure on a subdomain without any CNAME) ServerGuard does not auto-exclude. It alerts the on-call engineer with the full diagnosis (dig output, AutoSSL log excerpt, the proposed exclusion list, and the reason the case did not clear the unambiguous bar), waits for a human to approve or reject, and only then applies the exclusion. This is the standard human-in-the-loop pattern and applies to every action ServerGuard is not confident enough to take alone. The manual escalation path is implemented today.

What ServerGuard does not do. It does not modify DNS records, ever. Even when the root cause of an AutoSSL failure is a misconfigured CNAME (and we have seen this exact pattern, where a client adds a CNAME to a Microsoft endpoint they should have pointed to a different tenant), fixing it requires coordinating with the client, and the change has to happen in the client's DNS provider, not on our server. ServerGuard's involvement in those cases ends at the alert: it identifies the misconfiguration, includes the suggested correction in the alert body, and waits for a human to talk to the client. Modifying DNS on the customer's behalf is not coming.

The honest version of the ServerGuard story on this incident class: we collapse a six-week tail-end-of-the-inbox problem into an action that fires within the same nightly cycle the failure occurs in, with a guardrail that prevents auto-excluding a cert the customer actually needs. The action is small. The compounding value is in how many of these small actions happen automatically across the fleet every night without an engineer reading another lookalike email.
