cPanel disk full at 96 percent: the backup retention trap
A postmortem on a cPanel /home at 96 percent, why 7 backup sets ate 331GB, the safe-deletion order, and the WHM retention that stops the cascade.
/home was at 96 percent. The exact numbers were 931GB used out of
970GB, which left 39GB of headroom on a server that wrote roughly
2GB an hour into mail spools and InnoDB tablespaces alone. ChkServd
had started its disk warnings about an hour earlier and was now
firing every five minutes. MySQL had not crashed yet, but it would
within the next few hours. Exim had not stopped flushing yet, but
the spool was already 4GB and growing. The next scheduled cPanel
backup run was due at 02:00 and would fail halfway through, which
would leave the previous incomplete backup set on disk and consume
another 30 to 40GB before it errored out.
This is the postmortem of how we got /home from 96 percent back
to 58 percent on a production cPanel box without losing a single
file a client cared about, and the change to the WHM backup
configuration that prevented the same cascade from happening again
the following month. The root cause was not what we expected.
The cascade you are trying to avoid
A full /home partition does not fail one thing at a time. The
failures stack, and they stack fast, because most of the things
you rely on a cPanel server to do also need to write to disk.
- At 90 percent, MySQL starts to throw Can't create/write to file errors on temp tables. Reads continue to work, writes do not. Anything that runs a sort that spills to disk fails. WordPress sites still load until someone tries to save a post.
- At 95 percent, Exim's mail spool stops flushing cleanly. The spool keeps accepting new mail but cannot fully write delivery records. Mail backs up under /var/spool/exim/input and queue size goes vertical on the next cPanel queue check.
- At 96 percent, the next cPanel backup attempt fails halfway. The incomplete backup is not always cleaned up. This is the trap that often makes a disk-full incident worse before anyone intervenes: you discover the alert because backups failed, investigate, and find that the failed backup itself consumed another 30 to 40GB on the way down.
- At 98 percent, anything that uses /tmp breaks. cPanel session writes fail, file uploads fail, ImageMagick conversions fail, PHP opcache writes fail, and Composer or yum operations all start to error.
The order matters because if you are reading the alert at 96 and the cascade reaches 98 before you intervene, you have introduced new failure modes that obscure the original problem. Half of an incident response is keeping the symptom set stable so you can reason about it.
The wrong instinct and the right one
The wrong instinct on a disk-full cPanel server is to delete log
files. Logs are visible, you know what they are, and rm feels
productive. In practice, logs are almost never the dominant
consumer of /home on a cPanel box because cPanel rotates them by
default. The big consumers are almost always one of six things,
and the relative ordering is consistent across our incidents:
- cPanel backup sets under /backup (when local backups are on)
- Mail spools (cur, new, and tmp under each Maildir)
- WordPress backup plugin output, especially UpdraftPlus and BackWPup writing to wp-content/uploads
- MySQL binary logs when binlog_expire_logs_seconds is too high or unset
- Apache log archives when log rotation is broken or disabled
- Trash and Spam folders that nobody empties
The right instinct is to find the largest directories first and
work from the top of the list down. du is your only friend here:
# top-level breakdown of /home: slow but precise
du --max-depth=1 -h /home 2>/dev/null | sort -h | tail -20
# same for /backup and /var
du --max-depth=1 -h /backup 2>/dev/null | sort -h | tail -10
du --max-depth=1 -h /var 2>/dev/null | sort -h | tail -10
# fast per-filesystem overview
df -h
df -i   # inode exhaustion masquerades as disk full

Run df -i even when you are sure the problem is bytes, not
inodes. An inode-exhausted disk reports as full to most cPanel
tooling but does not show up in du. The fix is different and
involves finding directories with millions of tiny files, usually
Drupal sessions or PHP-FPM opcache fragments.
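To locate those directories, GNU du can count inodes instead of bytes. A minimal sketch, assuming GNU coreutils (the depth and paths are tunable):

```shell
# count inodes instead of bytes; the directories with the largest
# counts are where the millions of tiny files live (GNU du only)
du --inodes --max-depth=2 /home 2>/dev/null | sort -n | tail -15
```

The same command works against /var or /tmp when the tiny files are session or cache fragments outside /home.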
The cPanel backup retention trap
In our 96 percent incident, du --max-depth=1 -h /backup came back
with this:
47G /backup/2026-04-19
47G /backup/2026-04-26
47G /backup/2026-05-03
47G /backup/2026-05-10
47G /backup/2026-05-17
47G /backup/2026-05-24
47G /backup/2026-05-31
331G /backup

Seven cPanel backup sets, 47GB each, 331GB total. The retention had been set to 7 during the original server setup, at a time when the server held about 30GB of client data. By the time of the incident, the server held about 100GB of live data, so each weekly backup set was now 47GB, and seven of those eat 331GB of a 970GB disk before anyone notices.
The trap is that nothing in the cPanel UI re-evaluates whether the retention number you picked at install time still makes sense as the server grows. WHM happily kept producing weekly snapshots forever. The disk grew with client data, the backups grew with the disk, and the retention multiplier was never revisited. The default in modern WHM is more conservative now, but on any server that was originally provisioned a few years ago, the old setting is still in force.
For a rough rule, multiply your live /home size by your
retention count. If the result is more than 30 percent of your
total disk, you are running a retention setting that will bite
you. Most agencies need 2 or 3 sets, not 7.
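The rule of thumb is easy to script. A minimal sketch, where retention_risk is a hypothetical helper name and the numbers plugged in are this incident's own:

```shell
#!/bin/sh
# Flag a risky retention multiplier: live data size times the number of
# retained sets, checked against 30 percent of total disk.
retention_risk() {
  home_gb=$1; sets=$2; disk_gb=$3
  projected=$((home_gb * sets))
  threshold=$((disk_gb * 30 / 100))
  if [ "$projected" -gt "$threshold" ]; then
    echo "RISK: ${projected}GB of backups vs ${threshold}GB budget"
  else
    echo "OK: ${projected}GB of backups vs ${threshold}GB budget"
  fi
}

# the incident's numbers: 47GB per set, 7 sets, 970GB disk
retention_risk 47 7 970
```

With this server's numbers the call prints a RISK line; dropping the second argument to 3 puts the same server back inside the budget.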
The safe-deletion order
Once du has told you where the bytes are, the work is to
reclaim them in an order that is safe to abort partway through.
The cardinal rule on a cPanel server is to use cPanel's own
deletion paths wherever they exist, because cPanel keeps internal
state (quotas, backup metadata, account indexes) that out-of-band
rm commands will not update.
Step 1: snapshot before you touch anything
Before any deletion, write down the state. If something goes sideways during the cleanup, you want a precise before picture to compare against.
# capture the before state to a file you'll keep
{
date -u +"%Y-%m-%dT%H:%M:%SZ"
echo "--- df -h ---"; df -h
echo "--- df -i ---"; df -i
echo "--- du /home ---"; du --max-depth=1 -h /home 2>/dev/null | sort -h
echo "--- du /backup ---"; du --max-depth=1 -h /backup 2>/dev/null | sort -h
echo "--- /backup listing ---"; ls -lah /backup
} > /root/disk-incident-before.txt

The file lives outside /home and /backup. If the disk fills
further during cleanup, you still have the snapshot.
Step 2: reduce backup retention through WHM
This is the biggest single win in almost every disk-full cPanel
incident and the one place where the temptation to rm -rf is
strongest. Resist it. Manually deleting files in /backup leaves
cPanel's backup manifest pointing at files that no longer exist,
which will cause the next backup run to log warnings and, in some
WHM versions, to abort early.
The blessed path:
- WHM → Backup → Backup Configuration → Retention
- Lower the number of retained backups (we use 3 for most boxes)
- Save the configuration
- Trigger the cleanup explicitly, do not wait for the next run
# cPanel's own cleaner: respects the new retention setting and
# updates backup manifest state correctly
/usr/local/cpanel/bin/backup_cleaner

On the 96 percent incident, the retention change from 7 to 3 sets
plus a backup_cleaner run reclaimed 188GB. That alone moved the
server from 96 percent to 77 percent and stopped the cascade. The
remaining cleanup was no longer time-pressured.
If you want to do this through the API rather than the GUI, the WHM API equivalent is:
# read current backup config (writes to stdout, formatted JSON)
whmapi1 backup_config_get --output=jsonpretty
# update retention to 3
whmapi1 backup_config_set BACKUP_DAILY_RETENTION=3 \
BACKUP_WEEKLY_RETENTION=3 BACKUP_MONTHLY_RETENTION=3

Whether you change daily, weekly, or monthly retention depends on which schedules are enabled in your environment. Read the existing config first; do not assume.
Step 3: plugin backups under wp-content
WordPress backup plugins are the second-largest hidden consumer.
UpdraftPlus writes to wp-content/updraft, BackWPup writes to
wp-content/uploads/backwpup-*-backups, and there are a dozen
similar variants. Many sites run a plugin backup on top of the
cPanel backup, doubling the storage cost.
Find them first, do not delete on sight:
# any file inside wp-content/uploads that looks like a backup
# and is bigger than 100MB
find /home/*/public_html -path '*wp-content/uploads/*backup*' \
-size +100M -type f 2>/dev/null
# UpdraftPlus stores under wp-content/updraft, not uploads
find /home/*/public_html -path '*wp-content/updraft/*' \
-type f -size +50M 2>/dev/null
# BackWPup uses a hashed directory name
find /home/*/public_html -path '*wp-content/uploads/backwpup-*' \
-type f 2>/dev/null

The reason not to delete these immediately is that some clients do rely on plugin backups for their own disaster recovery, even when you are also taking cPanel-level backups for them. The agency policy we settled on is rename rather than remove, then notify the client:
# rename, don't delete. Gives the client 30 days to object
find /home/<user>/public_html -path '*wp-content/updraft/*' \
-type f -name '*.gz' -exec mv {} {}.pending \;

The .pending suffix is searchable. A monthly cron deletes
anything older than 30 days that still has the suffix. The client
gets one notification at the time of rename and one final notice
before the actual deletion.
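The deletion half of that monthly cron can be sketched as a small function. prune_pending is our name for it, not a cPanel tool, and the root-directory argument exists so the logic can be rehearsed against a scratch tree before it is ever pointed at /home:

```shell
#!/bin/sh
# Hypothetical pruning pass for the monthly cron. Deletes .pending
# files untouched for 30+ days under the given root, printing each
# path as it goes so the cron's mail shows exactly what was removed.
prune_pending() {
  find "${1:-/home}" -type f -name '*.pending' -mtime +30 \
    -print -delete 2>/dev/null
}

# illustrative crontab entry, 03:00 on the 1st of each month:
# 0 3 1 * * /root/bin/prune-pending.sh
```

Keeping -print next to -delete means the cron output doubles as the audit trail for the final notice the client receives.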
Step 4: mail spool cleanup
Mail spools accumulate quietly under each Maildir. The big
offenders are Junk and Trash folders that no one empties, and
old cur folders for forwarders that point to nothing.
# size by mail account on a single cPanel user
du --max-depth=2 -h /home/<user>/mail 2>/dev/null | sort -h | tail -20
# all Junk folders older than 90 days, across all users
find /home/*/mail/*/.Junk/cur -type f -mtime +90 2>/dev/null | wc -l

cPanel ships an Email Disk Usage tool per account that handles this through the GUI, but for bulk cleanup across dozens of accounts on a single server, a script that iterates accounts and prompts for confirmation per account is more practical. The critical rule is that mail deletion is per-account, with explicit consent. Every agency eventually has a client who deliberately kept seven-year-old emails in Trash.
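A sketch of that iterate-and-confirm script, assuming the standard /home/&lt;user&gt;/mail Maildir layout; junk_bytes is our helper name and MAILROOT is overridable so the logic can be rehearsed against a scratch tree:

```shell
#!/bin/sh
# Hypothetical per-account sweep: report reclaimable Junk bytes per
# account and delete only on an explicit "y" from the operator.
MAILROOT="${MAILROOT:-/home}"

junk_bytes() {
  # total size of Junk-folder mail older than 90 days in one mail tree
  find "$1" -path '*/.Junk/cur/*' -type f -mtime +90 \
    -printf '%s\n' 2>/dev/null | awk '{ s += $1 } END { print s + 0 }'
}

for maildir in "$MAILROOT"/*/mail; do
  [ -d "$maildir" ] || continue
  bytes=$(junk_bytes "$maildir")
  [ "$bytes" -eq 0 ] && continue
  user=$(basename "$(dirname "$maildir")")
  printf '%s: %s bytes of stale Junk. Delete? [y/N] ' "$user" "$bytes"
  read -r answer && [ "$answer" = "y" ] && \
    find "$maildir" -path '*/.Junk/cur/*' -type f -mtime +90 -delete
done
```

The per-account prompt is the consent mechanism; nothing is removed on a blank or "n" answer.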
Step 5: MySQL binary logs
MySQL binary logs accumulate when binlog_expire_logs_seconds is
set high or left at the engine default. On a busy MariaDB 10.6 or
10.11 instance, binlogs can consume 20 to 40GB of /var/lib/mysql
before anyone notices.
-- enumerate
SHOW BINARY LOGS;
-- purge anything older than 7 days
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 7 DAY;

For ongoing retention, set the value in /etc/my.cnf rather than
purging by hand. Seven days is a reasonable default for most
servers that are not running replication; if you are running a
replica, the value needs to be at least as long as the replica's
worst-case lag.
# /etc/my.cnf, under [mysqld]
binlog_expire_logs_seconds = 604800  # 7 days

Restart MariaDB through the cPanel-aware path so cPanel's service monitor does not flag the restart as a failure:

/scripts/restartsrv_mysql

Step 6: Apache log rotation
Apache logs only become a disk-full contributor when log rotation
is broken. cPanel rotates by default through /usr/local/cpanel/etc/logrotate.d/
and through cpanellogd. If logs are accumulating, the problem is
usually that a recent configuration change disabled rotation, or
that a custom log target was added without rotation rules.
# check that the rotation service is alive
systemctl status cpanellogd
# look for accumulated archives
ls -lah /usr/local/apache/domlogs/*/*.gz 2>/dev/null | tail -20
ls -lah /etc/apache2/logs/ 2>/dev/null

Resist the temptation to rm the logs without fixing rotation.
They will simply accumulate again, and you will be back at 96
percent on the same date next month.
Things to never delete on a cPanel server
There are directories on a cPanel server that look like cleanup candidates but are not. Touching them produces failures that are much worse than the original disk-full event, because they break cPanel's internal state machine and the recovery paths are not documented anywhere outside cPanel support.
- /usr/local/cpanel/*: anything under here. This is cPanel itself. It is small (single-digit GB) and not a real reclaim target.
- /var/cpanel/*: account configuration, quota state, package definitions, license state. Deleting anything here can detach accounts from their settings or invalidate the cPanel license.
- Any directory you do not recognize inside /home/<user>. It might be a client-managed Git checkout, a Composer cache, a Node .next build directory, or a deliberate file dump the client uses for their own work. Ask before you remove.
- /tmp during business hours. PHP sessions live here, in-flight uploads live here, and PHP-FPM opcache files live here on some configurations. Clearing /tmp mid-day will log everyone out of every WordPress admin on the box at the same time.
- /root/*. The previous on-call engineer might be keeping their one-off diagnostic outputs here, and you do not want to nuke another engineer's working notes from a different incident.
Setting up sensible retention from now on
The reason this incident recurs across cPanel servers is that the initial-setup retention values almost never get revisited. The fix is to set defaults at provision time and audit them at every disk-grow event.
- Backup retention: 3 sets. This is enough for the standard agency recovery cases (last week's data, two weeks ago, three weeks ago) and avoids the 7-set trap.
- Mail spool quotas: enforced per account. cPanel supports per-account mail quotas; set them. The most generous default we use is 5GB, and the most common is 2GB.
- MySQL binlog retention: 7 days for non-replicated servers, replica lag plus a safety margin for replicated ones. Configured in /etc/my.cnf, not by hand.
- Plugin backup pruning: a monthly cron that finds plugin backups older than 60 days and renames them to .pending, then another cron that deletes .pending files older than 30 days.
- Disk audit cron: a weekly script that emails the team if any partition is above 80 percent. We use 80 as the early warning and 90 as the page-the-on-call threshold.
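The core check behind that audit cron can be sketched as a filter over df -P output. check_disks is a hypothetical name and the mail delivery step is left out:

```shell
#!/bin/sh
# Hypothetical audit core: reads `df -P` style output on stdin and
# prints every partition above the given usage threshold.
check_disks() {
  awk -v t="$1" 'NR > 1 {
    gsub(/%/, "", $5)                      # "96%" -> "96"
    if ($5 + 0 > t) print $6 " at " $5 "%"
  }'
}

# warn at 80; a second invocation at 90 would drive the paging path
df -P | check_disks 80
```

Piping the non-empty output to the team's mail alias (or a webhook) is the only remaining plumbing; an empty result means no mail.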
A related read on what happens when /home fills while MySQL is
also under memory pressure is in
MySQL OOM on cPanel: diagnosing innodb_buffer_pool_size.
The two failure modes interact: a disk-full event can present as
a MySQL crash because MySQL is the first service to fail loudly,
and a MySQL crash on a near-full disk can prevent the recovery
writes that would otherwise log the real cause.
If your disk-full incident traces back to a single WordPress site
hammering itself, the wp-cron overlap pattern in
WordPress wp-cron stacking on cPanel
is worth checking. Stacked cron runs can generate hundreds of
megabytes of error logs per day on a misconfigured site, which
contributes to category 5 above.
The 60-second disk audit
When you suspect a disk problem and need to know in under a minute whether you have a backup, mail, plugin, binlog, or log problem, this is the audit:
# 1. is it bytes or inodes?
df -h && df -i
# 2. top-level reclaim candidates, sorted
du --max-depth=1 -h /home /backup /var/lib/mysql /var/log 2>/dev/null \
| sort -h | tail -20
# 3. backup sets count and total size
ls -lah /backup 2>/dev/null
du -sh /backup 2>/dev/null

Three commands, one answer. If /backup is more than 20 percent
of total disk, you are in the retention trap and Step 2 above is
the priority. If mail is the biggest item, you are in spool
cleanup. If /var/lib/mysql dominates, you have a binlog problem.
How ServerGuard handles this
ServerGuard's disk-full use case maps to the same six-step order, with one important rule: SGuard will never auto-delete files outside a small set of whitelisted categories. The "AI sees free space, AI deletes" pattern is a non-starter for our risk model. Every dangerous action runs through Telegram approval.
Detection. SGuard subscribes to ChkServd disk
alerts and to its own df-based disk poller. The poller fires at
80 percent (early warning, no action) and 90 percent (diagnostic
flow runs within 60 seconds). The thresholds are configurable per
server.
Diagnosis. When the diagnostic flow runs, SGuard
collects df -h, df -i, du --max-depth=1 -h against /home,
/backup, /var/lib/mysql, and /var/log, plus ls -lah /backup.
The output is summarised into the six reclaim categories above and
posted to the team's Telegram channel as a single message with
per-category byte counts.
Action 1, Safe Apache log cleanup. When the
diagnostic flow finds Apache log archives older than 14 days,
SGuard removes them automatically. This is the only auto-delete
action and it is scoped to a hardcoded path glob:
/usr/local/apache/domlogs/**/*.gz and /etc/apache2/logs/*.gz
with an mtime filter. The path glob lives in code, not config,
so it cannot be widened by a misconfigured environment.
Action 2, Moderate backup retention reduction.
When /backup is more than 25 percent of total disk and weekly
retention is more than 3, SGuard proposes a retention reduction
plus a backup_cleaner run. The proposal goes to Telegram with
the before-state byte counts, the expected reclaim, and a single
approve/reject pair of buttons. On approval, SGuard runs the
WHM API call to update retention and then invokes
/usr/local/cpanel/bin/backup_cleaner. This action is dangerous
even through the cPanel-blessed path, so approval is required
every time. There is no auto-mode for it.
Action 3, Dangerous plugin backup mass-rename.
When the diagnostic flow finds plugin backups over 100MB across
multiple sites, SGuard proposes a .pending-suffix rename per
backup class. The approval message lists the affected paths and
total reclaimable bytes. Approval is per-class, not per-file: the
team approves "rename all UpdraftPlus archives older than 60 days"
as one decision, not 200 separate decisions. Auto-deletion of
renamed files after 30 days is also approval-gated on first run
and remembered per server thereafter.
Honest limits. SGuard does not auto-delete mail spool contents,
does not auto-purge MySQL binary logs, and does not auto-touch any
file under a user's /home/<user>/ tree that is not a recognised
backup pattern. Mail and binlog cleanup are diagnostic-only today.
SGuard tells you they are the bottleneck and links to the
use case, but the deletion is yours to run. The reason is the
same as the rest of this post: the recovery cost of an over-eager
delete on a cPanel server is much higher than the cost of waiting
for a human to approve.
The 96 percent incident that opened this post would have gone like this with SGuard installed: an 80-percent early warning would have landed in Telegram about a day earlier; at 90 percent, the diagnostic flow would have posted "backup sets are 34 percent of total disk, retention is 7, propose reducing to 3 and running backup_cleaner"; one tap from the on-call engineer would have reclaimed 188GB before ChkServd ever started its five-minute alert cycle. The cascade would not have started, MySQL would not have been near a write failure, and the next scheduled backup would have succeeded instead of failing halfway.
That is the shape of every SGuard use case. The dangerous work is still a human's call. The diagnostic work, the proposal, and the path to the approval button is the part we automate.