cPanel disk full at 96 percent: the backup retention trap
A postmortem on a cPanel /home at 96 percent, why 7 backup sets ate 331GB, the safe-deletion order, and the WHM retention that stops the cascade.
/home was at 96 percent. The exact numbers were 931GB used out of
970GB, which left 39GB of headroom on a server that wrote roughly
2GB an hour into mail spools and InnoDB tablespaces alone. ChkServd
had started its disk warnings about an hour earlier and was now
firing every five minutes. MySQL had not crashed yet, but it would
within the next few hours. Exim had not stopped flushing yet, but
the spool was already 4GB and growing. The next scheduled cPanel
backup run was due at 02:00 and would fail halfway through, which
would leave the previous incomplete backup set on disk and consume
another 30 to 40GB before it errored out.
This is the postmortem of how we got /home from 96 percent back
to 58 percent on a production cPanel box without losing a single
file a client cared about, and the change to the WHM backup
configuration that prevented the same cascade from happening again
the following month. The root cause was not what we expected.
The cascade you are trying to avoid
A full /home partition does not fail one thing at a time. The
failures stack, and they stack fast, because most of the things
you rely on a cPanel server to do also need to write to disk.
- At 90 percent, MySQL starts to throw Can't create/write to file errors on temp tables. Reads continue to work, writes do not. Anything that runs a sort that spills to disk fails. WordPress sites still load until someone tries to save a post.
- At 95 percent, Exim's mail spool stops flushing cleanly. The spool keeps accepting new mail but cannot fully write delivery records. Mail backs up under /var/spool/exim/input and queue size goes vertical on the next cPanel queue check.
- At 96 percent, the next cPanel backup attempt fails halfway. The incomplete backup is not always cleaned up. This is the trap that often makes a disk-full incident worse before anyone intervenes: you discover the alert because backups failed, investigate, and find that the failed backup itself consumed another 30 to 40GB on the way down.
- At 98 percent, anything that uses /tmp breaks. cPanel session writes fail, file uploads fail, ImageMagick conversions fail, PHP opcache writes fail, and Composer or yum operations all start to error.
The order matters because if you are reading the alert at 96 and the cascade reaches 98 before you intervene, you have introduced new failure modes that obscure the original problem. Half of an incident response is keeping the symptom set stable so you can reason about it.
The wrong instinct and the right one
The wrong instinct on a disk-full cPanel server is to delete log
files. Logs are visible, you know what they are, and rm feels
productive. In practice, logs are almost never the dominant
consumer of /home on a cPanel box because cPanel rotates them by
default. The big consumers are almost always one of six things,
and the relative ordering is consistent across our incidents:
- cPanel backup sets under /backup (when local backups are on)
- Mail spools (cur, new, and tmp under each Maildir)
- WordPress backup plugin output, especially UpdraftPlus and BackWPup writing to wp-content/uploads
- MySQL binary logs when binlog_expire_logs_seconds is too high or unset
- Apache log archives when log rotation is broken or disabled
- Trash and Spam folders that nobody empties
The right instinct is to find the largest directories first and
work from the top of the list down. du is your only friend here:
# top-level breakdown of /home: slow but precise
du --max-depth=1 -h /home 2>/dev/null | sort -h | tail -20
# same for /backup and /var
du --max-depth=1 -h /backup 2>/dev/null | sort -h | tail -10
du --max-depth=1 -h /var 2>/dev/null | sort -h | tail -10
# fast per-filesystem overview
df -h
df -i   # inode exhaustion masquerades as disk full

Run df -i even when you are sure the problem is bytes, not
inodes. An inode-exhausted disk reports as full to most cPanel
tooling but does not show up in du. The fix is different and
involves finding directories with millions of tiny files, usually
Drupal sessions or PHP-FPM opcache fragments.
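To locate those directories, GNU du can count inodes instead of bytes. A minimal sketch, assuming GNU coreutils (the depth and paths are tunable):

```shell
# count inodes instead of bytes; the directories with the largest
# counts are where the millions of tiny files live (GNU du only)
du --inodes --max-depth=2 /home 2>/dev/null | sort -n | tail -15
```

The same command works against /var or /tmp when the tiny files are session or cache fragments outside /home.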
The cPanel backup retention trap
In our 96 percent incident, du --max-depth=1 -h /backup came back
with this:
47G /backup/2026-04-19
47G /backup/2026-04-26
47G /backup/2026-05-03
47G /backup/2026-05-10
47G /backup/2026-05-17
47G /backup/2026-05-24
47G /backup/2026-05-31
331G /backup

Seven cPanel backup sets, 47GB each, 331GB total. The retention had been set to 7 during the original server setup, at a time when the server held about 30GB of client data. By the time of the incident, the server held about 100GB of live data, so each weekly backup set was now 47GB, and seven of those eat 331GB of a 970GB disk before anyone notices.
The trap is that nothing in the cPanel UI re-evaluates whether the retention number you picked at install time still makes sense as the server grows. WHM happily kept producing weekly snapshots forever. The disk grew with client data, the backups grew with the disk, and the retention multiplier was never revisited. The default in modern WHM is more conservative now, but on any server that was originally provisioned a few years ago, the old setting is still in force.
For a rough rule, multiply your live /home size by your
retention count. If the result is more than 30 percent of your
total disk, you are running a retention setting that will bite
you. Most agencies need 2 or 3 sets, not 7.
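The rule of thumb is easy to script. A minimal sketch, where retention_risk is a hypothetical helper name and the numbers plugged in are this incident's own:

```shell
#!/bin/sh
# Flag a risky retention multiplier: live data size times the number of
# retained sets, checked against 30 percent of total disk.
retention_risk() {
  home_gb=$1; sets=$2; disk_gb=$3
  projected=$((home_gb * sets))
  threshold=$((disk_gb * 30 / 100))
  if [ "$projected" -gt "$threshold" ]; then
    echo "RISK: ${projected}GB of backups vs ${threshold}GB budget"
  else
    echo "OK: ${projected}GB of backups vs ${threshold}GB budget"
  fi
}

# the incident's numbers: 47GB per set, 7 sets, 970GB disk
retention_risk 47 7 970
```

With this server's numbers the call prints a RISK line; dropping the second argument to 3 puts the same server back inside the budget.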
The safe-deletion order
Once du has told you where the bytes are, the work is to
reclaim them in an order that is safe to abort partway through.
The cardinal rule on a cPanel server is to use cPanel's own
deletion paths wherever they exist, because cPanel keeps internal
state (quotas, backup metadata, account indexes) that out-of-band
rm commands will not update.
Step 1: snapshot before you touch anything
Before any deletion, write down the state. If something goes sideways during the cleanup, you want a precise before picture to compare against.
# capture the before state to a file you'll keep
{
date -u +"%Y-%m-%dT%H:%M:%SZ"
echo "--- df -h ---"; df -h
echo "--- df -i ---"; df -i
echo "--- du /home ---"; du --max-depth=1 -h /home 2>/dev/null | sort -h
echo "--- du /backup ---"; du --max-depth=1 -h /backup 2>/dev/null | sort -h
echo "--- /backup listing ---"; ls -lah /backup
} > /root/disk-incident-before.txt

The file lives outside /home and /backup. If the disk fills
further during cleanup, you still have the snapshot.
Step 2: reduce backup retention through WHM
This is the biggest single win in almost every disk-full cPanel
incident and the one place where the temptation to rm -rf is
strongest. Resist it. Manually deleting files in /backup leaves
cPanel's backup manifest pointing at files that no longer exist,
which will cause the next backup run to log warnings and, in some
WHM versions, to abort early.
The blessed path:
- WHM → Backup → Backup Configuration → Retention
- Lower the number of retained backups (we use 3 for most boxes)
- Save the configuration
- Trigger the cleanup explicitly, do not wait for the next run
# cPanel's own cleaner: respects the new retention setting and
# updates backup manifest state correctly
/usr/local/cpanel/bin/backup_cleaner

On the 96 percent incident, the retention change from 7 to 3 sets
plus a backup_cleaner run reclaimed 188GB. That alone moved the
server from 96 percent to 77 percent and stopped the cascade. The
remaining cleanup was no longer time-pressured.
If you want to do this through the API rather than the GUI, the WHM API equivalent is:
# read current backup config (writes to stdout, formatted JSON)
whmapi1 backup_config_get --output=jsonpretty
# update retention to 3
whmapi1 backup_config_set BACKUP_DAILY_RETENTION=3 \
BACKUP_WEEKLY_RETENTION=3 BACKUP_MONTHLY_RETENTION=3

Whether you change daily, weekly, or monthly retention depends on which schedules are enabled in your environment. Read the existing config first; do not assume.
Step 3: plugin backups under wp-content
WordPress backup plugins are the second-largest hidden consumer.
UpdraftPlus writes to wp-content/updraft, BackWPup writes to
wp-content/uploads/backwpup-*-backups, and there are a dozen
similar variants. Many sites run a plugin backup on top of the
cPanel backup, doubling the storage cost.
Find them first, do not delete on sight:
# any file inside wp-content/uploads that looks like a backup
# and is bigger than 100MB
find /home/*/public_html -path '*wp-content/uploads/*backup*' \
-size +100M -type f 2>/dev/null
# UpdraftPlus stores under wp-content/updraft, not uploads
find /home/*/public_html -path '*wp-content/updraft/*' \
-type f -size +50M 2>/dev/null
# BackWPup uses a hashed directory name
find /home/*/public_html -path '*wp-content/uploads/backwpup-*' \
-type f 2>/dev/null

The reason not to delete these immediately is that some clients do rely on plugin backups for their own disaster recovery, even when you are also taking cPanel-level backups for them. The agency policy we settled on is rename rather than remove, then notify the client:
# rename, don't delete. Gives the client 30 days to object
find /home/<user>/public_html -path '*wp-content/updraft/*' \
-type f -name '*.gz' -exec mv {} {}.pending \;

The .pending suffix is searchable. A monthly cron deletes
anything older than 30 days that still has the suffix. The client
gets one notification at the time of rename and one final notice
before the actual deletion.
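The deletion half of that monthly cron can be sketched as a small function. prune_pending is our name for it, not a cPanel tool, and the root-directory argument exists so the logic can be rehearsed against a scratch tree before it is ever pointed at /home:

```shell
#!/bin/sh
# Hypothetical pruning pass for the monthly cron. Deletes .pending
# files untouched for 30+ days under the given root, printing each
# path as it goes so the cron's mail shows exactly what was removed.
prune_pending() {
  find "${1:-/home}" -type f -name '*.pending' -mtime +30 \
    -print -delete 2>/dev/null
}

# illustrative crontab entry, 03:00 on the 1st of each month:
# 0 3 1 * * /root/bin/prune-pending.sh
```

Keeping -print next to -delete means the cron output doubles as the audit trail for the final notice the client receives.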
Step 4: mail spool cleanup
Mail spools accumulate quietly under each Maildir. The big
offenders are Junk and Trash folders that no one empties, and
old cur folders for forwarders that point to nothing.
# size by mail account on a single cPanel user
du --max-depth=2 -h /home/<user>/mail 2>/dev/null | sort -h | tail -20
# all Junk folders older than 90 days, across all users
find /home/*/mail/*/.Junk/cur -type f -mtime +90 2>/dev/null | wc -l

cPanel ships an Email Disk Usage tool per account that handles this through the GUI, but for bulk cleanup across dozens of accounts on a single server, a script that iterates accounts and prompts for confirmation per account is more practical. The critical rule is that mail deletion is per-account, with explicit consent. Every agency eventually has a client who deliberately kept seven-year-old emails in Trash.
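A sketch of that iterate-and-confirm script, assuming the standard /home/&lt;user&gt;/mail Maildir layout; junk_bytes is our helper name and MAILROOT is overridable so the logic can be rehearsed against a scratch tree:

```shell
#!/bin/sh
# Hypothetical per-account sweep: report reclaimable Junk bytes per
# account and delete only on an explicit "y" from the operator.
MAILROOT="${MAILROOT:-/home}"

junk_bytes() {
  # total size of Junk-folder mail older than 90 days in one mail tree
  find "$1" -path '*/.Junk/cur/*' -type f -mtime +90 \
    -printf '%s\n' 2>/dev/null | awk '{ s += $1 } END { print s + 0 }'
}

for maildir in "$MAILROOT"/*/mail; do
  [ -d "$maildir" ] || continue
  bytes=$(junk_bytes "$maildir")
  [ "$bytes" -eq 0 ] && continue
  user=$(basename "$(dirname "$maildir")")
  printf '%s: %s bytes of stale Junk. Delete? [y/N] ' "$user" "$bytes"
  read -r answer && [ "$answer" = "y" ] && \
    find "$maildir" -path '*/.Junk/cur/*' -type f -mtime +90 -delete
done
```

The per-account prompt is the consent mechanism; nothing is removed on a blank or "n" answer.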
Step 5: MySQL binary logs
MySQL binary logs accumulate when binlog_expire_logs_seconds is
set high or left at the engine default. On a busy MariaDB 10.6 or
10.11 instance, binlogs can consume 20 to 40GB of /var/lib/mysql
before anyone notices.
-- enumerate
SHOW BINARY LOGS;
-- purge anything older than 7 days
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 7 DAY;

For ongoing retention, set the value in /etc/my.cnf rather than
purging by hand. Seven days is a reasonable default for most
servers that are not running replication; if you are running a
replica, the value needs to be at least as long as the replica's
worst-case lag.
# /etc/my.cnf, under [mysqld]
binlog_expire_logs_seconds = 604800  # 7 days

Restart MariaDB through the cPanel-aware path so cPanel's service monitor does not flag the restart as a failure:

/scripts/restartsrv_mysql

Step 6: Apache log rotation
Apache logs only become a disk-full contributor when log rotation
is broken. cPanel rotates by default through /usr/local/cpanel/etc/logrotate.d/
and through cpanellogd. If logs are accumulating, the problem is
usually that a recent configuration change disabled rotation, or
that a custom log target was added without rotation rules.
# check that the rotation service is alive
systemctl status cpanellogd
# look for accumulated archives
ls -lah /usr/local/apache/domlogs/*/*.gz 2>/dev/null | tail -20
ls -lah /etc/apache2/logs/ 2>/dev/null

Resist the temptation to rm the logs without fixing rotation.
They will simply accumulate again, and you will be back at 96
percent on the same date next month.
Things to never delete on a cPanel server
There are directories on a cPanel server that look like cleanup candidates but are not. Touching them produces failures that are much worse than the original disk-full event, because they break cPanel's internal state machine and the recovery paths are not documented anywhere outside cPanel support.
- /usr/local/cpanel/*: anything under here. This is cPanel itself. It is small (single-digit GB) and not a real reclaim target.
- /var/cpanel/*: account configuration, quota state, package definitions, license state. Deleting anything here can detach accounts from their settings or invalidate the cPanel license.
- Any directory you do not recognize inside /home/<user>. It might be a client-managed Git checkout, a Composer cache, a Node .next build directory, or a deliberate file dump the client uses for their own work. Ask before you remove.
- /tmp during business hours. PHP sessions live here, in-flight uploads live here, and PHP-FPM opcache files live here on some configurations. Clearing /tmp mid-day will log everyone out of every WordPress admin on the box at the same time.
- /root/*. The previous on-call engineer might be keeping their one-off diagnostic outputs here, and you do not want to nuke another engineer's working notes from a different incident.
Setting up sensible retention from now on
The reason this incident recurs across cPanel servers is that the initial-setup retention values almost never get revisited. The fix is to set defaults at provision time and audit them at every disk-grow event.
- Backup retention: 3 sets. This is enough for the standard agency recovery cases (last week's data, two weeks ago, three weeks ago) and avoids the 7-set trap.
- Mail spool quotas: enforced per account. cPanel supports per-account mail quotas; set them. The most generous default we use is 5GB, and the most common is 2GB.
- MySQL binlog retention: 7 days for non-replicated servers, replica lag plus a safety margin for replicated ones. Configured in /etc/my.cnf, not by hand.
- Plugin backup pruning: a monthly cron that finds plugin backups older than 60 days and renames them to .pending, then another cron that deletes .pending files older than 30 days.
- Disk audit cron: a weekly script that emails the team if any partition is above 80 percent. We use 80 as the early warning and 90 as the page-the-on-call threshold.
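The core check behind that audit cron can be sketched as a filter over df -P output. check_disks is a hypothetical name and the mail delivery step is left out:

```shell
#!/bin/sh
# Hypothetical audit core: reads `df -P` style output on stdin and
# prints every partition above the given usage threshold.
check_disks() {
  awk -v t="$1" 'NR > 1 {
    gsub(/%/, "", $5)                      # "96%" -> "96"
    if ($5 + 0 > t) print $6 " at " $5 "%"
  }'
}

# warn at 80; a second invocation at 90 would drive the paging path
df -P | check_disks 80
```

Piping the non-empty output to the team's mail alias (or a webhook) is the only remaining plumbing; an empty result means no mail.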
A related read on what happens when /home fills while MySQL is
also under memory pressure is in
MySQL OOM on cPanel: diagnosing innodb_buffer_pool_size.
The two failure modes interact: a disk-full event can present as
a MySQL crash because MySQL is the first service to fail loudly,
and a MySQL crash on a near-full disk can prevent the recovery
writes that would otherwise log the real cause.
If your disk-full incident traces back to a single WordPress site
hammering itself, the wp-cron overlap pattern in
WordPress wp-cron stacking on cPanel
is worth checking. Stacked cron runs can generate hundreds of
megabytes of error logs per day on a misconfigured site, which
contributes to category 5 above.
The 60-second disk audit
When you suspect a disk problem and need to know in under a minute whether you have a backup, mail, plugin, binlog, or log problem, this is the audit:
# 1. is it bytes or inodes?
df -h && df -i
# 2. top-level reclaim candidates, sorted
du --max-depth=1 -h /home /backup /var/lib/mysql /var/log 2>/dev/null \
| sort -h | tail -20
# 3. backup sets count and total size
ls -lah /backup 2>/dev/null
du -sh /backup 2>/dev/null

Three commands, one answer. If /backup is more than 20 percent
of total disk, you are in the retention trap and Step 2 above is
the priority. If mail is the biggest item, you are in spool
cleanup. If /var/lib/mysql dominates, you have a binlog problem.
How ServerGuard handles this
ServerGuard's disk-full use case maps to the same six-step order, with one important rule: SGuard will never auto-delete files outside a small set of whitelisted categories. The "AI sees free space, AI deletes" pattern is a non-starter for our risk model. Every dangerous action runs through Telegram approval.
Detection. SGuard subscribes to ChkServd disk
alerts and to its own df-based disk poller. The poller fires at
80 percent (early warning, no action) and 90 percent (diagnostic
flow runs within 60 seconds). The thresholds are configurable per
server.
Diagnosis. When the diagnostic flow runs, SGuard
collects df -h, df -i, du --max-depth=1 -h against /home,
/backup, /var/lib/mysql, and /var/log, plus ls -lah /backup.
The output is summarised into the six reclaim categories above and
posted to the team's Telegram channel as a single message with
per-category byte counts.
Action 1, Safe Apache log cleanup. When the
diagnostic flow finds Apache log archives older than 14 days,
SGuard removes them automatically. This is the only auto-delete
action and it is scoped to a hardcoded path glob:
/usr/local/apache/domlogs/**/*.gz and /etc/apache2/logs/*.gz
with an mtime filter. The path glob lives in code, not config,
so it cannot be widened by a misconfigured environment.
Action 2, Moderate backup retention reduction.
When /backup is more than 25 percent of total disk and weekly
retention is more than 3, SGuard proposes a retention reduction
plus a backup_cleaner run. The proposal goes to Telegram with
the before-state byte counts, the expected reclaim, and a single
approve/reject pair of buttons. On approval, SGuard runs the
WHM API call to update retention and then invokes
/usr/local/cpanel/bin/backup_cleaner. This action is dangerous
even through the cPanel-blessed path, so approval is required
every time. There is no auto-mode for it.
Action 3, Dangerous plugin backup mass-rename.
When the diagnostic flow finds plugin backups over 100MB across
multiple sites, SGuard proposes a .pending-suffix rename per
backup class. The approval message lists the affected paths and
total reclaimable bytes. Approval is per-class, not per-file: the
team approves "rename all UpdraftPlus archives older than 60 days"
as one decision, not 200 separate decisions. Auto-deletion of
renamed files after 30 days is also approval-gated on first run
and remembered per server thereafter.
Honest limits. SGuard does not auto-delete mail spool contents,
does not auto-purge MySQL binary logs, and does not auto-touch any
file under a user's /home/<user>/ tree that is not a recognised
backup pattern. Mail and binlog cleanup are diagnostic-only today.
SGuard tells you they are the bottleneck and links to the
use case, but the deletion is yours to run. The reason is the
same as the rest of this post: the recovery cost of an over-eager
delete on a cPanel server is much higher than the cost of waiting
for a human to approve.
The 96 percent incident that opened this post would have gone like this with SGuard installed: an 80-percent early warning would have landed in Telegram about a day earlier; at 90 percent, the diagnostic flow would have posted "backup sets are 34 percent of total disk, retention is 7, propose reducing to 3 and running backup_cleaner"; one tap from the on-call engineer would have reclaimed 188GB before ChkServd ever started its five-minute alert cycle. The cascade would not have started, MySQL would not have been near a write failure, and the next scheduled backup would have succeeded instead of failing halfway.
That is the shape of every SGuard use case. The dangerous work is still a human's call. The diagnostic work, the proposal, and the path to the approval button is the part we automate.