ChkServd alert field guide: how to read cPanel's service alerts

ChkServd is the part of cPanel's tailwatchd that watches services and emails you when one looks unhappy. Most teams treat its alerts as noise. They are not noise: they have a stable grammar, and once you can read them at a glance you can sort signal from flap in two seconds.

What ChkServd is

ChkServd runs inside tailwatchd and polls the registered service list every ~5 minutes. The list lives in /etc/chkserv.d/chkservd.conf and each service has its own driver file under /etc/chkserv.d/<service>. A driver tells ChkServd what port to probe, what banner to expect, and what command to run if the banner is wrong.

ls /etc/chkserv.d/
# apache_php_fpm  cpsrvd  exim     ftpd       imap     mailman  mysql
# named           nscd    pop      queueprocd spamd    sshd

cat /etc/chkserv.d/mysql
# service[mysql]=x,x,x,connect,/etc/init.d/mysql restart,mysql,root
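
To make the comma-delimited driver line easier to eyeball, here is a minimal sketch that splits it into a service name and numbered fields. The function name split_driver_line and the fieldN labels are illustrative assumptions, not cPanel terminology; what each position means is left to the cPanel documentation.

```shell
# Split a chkservd driver line into its service name and numbered fields.
# Hypothetical helper; runs in a subshell so IFS and set -f do not leak out.
split_driver_line() (
  line=$1
  name=${line#*\[}     # drop everything up to and including the first "["
  name=${name%%\]*}    # drop everything from the first "]" onward
  fields=${line#*=}    # keep only the part after the first "="
  printf 'service=%s\n' "$name"
  set -f               # disable globbing while the fields are split
  IFS=,
  i=1
  for field in $fields; do
    printf 'field%d=%s\n' "$i" "$field"
    i=$((i + 1))
  done
)

split_driver_line 'service[mysql]=x,x,x,connect,/etc/init.d/mysql restart,mysql,root'
```

Splitting only on commas keeps a restart command that contains spaces in a single field.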

The alert format

Every alert has the same anatomy. Subject:

[chkservd] Service check on cpanel-host -- FAILED: <service> ([reason])

Body, in order:

The service name and the failure timestamp.
The probe ChkServd ran (port, expected banner).
What it actually got (banner mismatch, refused connection, timeout).
The recovery action, if any (Notice: TailWatchd has restarted <service>).
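
Because the subject follows a fixed template, it can be split mechanically. The sketch below pulls the service and the bracketed reason out of a FAILED subject. The function name parse_failed_subject and the "service|reason" output shape are assumptions made for illustration, and it only handles the bracketed-reason form shown above.

```shell
# Print "<service>|<reason>" for a FAILED subject with a bracketed reason.
# Prints nothing when the subject does not match the template.
parse_failed_subject() {
  printf '%s\n' "$1" |
    sed -n 's/^\[chkservd] Service check.* -- FAILED: \([^ ]*\) (\[\(.*\)])$/\1|\2/p'
}

parse_failed_subject '[chkservd] Service check on cpanel-host -- FAILED: mysql ([connect failed])'
```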

The five alerts you will see most often

[chkservd] Service check -- FAILED: mysql ([connect failed])

MariaDB or MySQL is not accepting connections on 127.0.0.1:3306. Either the daemon is dead (check /var/log/mariadb/mariadb.log for a crash) or it is alive but stuck (check mysqladmin processlist).

[chkservd] Service check -- FAILED: <service> is unable to detect a connection on port <N>

The service is running according to systemctl, but the TCP probe times out. Three usual causes: a firewall rule blocking 127.0.0.1, a daemon stuck on a long-running query, or a full iptables state table. Tail /var/log/messages and run ss -ltnp first.

Notice: TailWatchd has restarted <service>

Informational, not a failure. ChkServd ran the recovery action from the driver file and the service came back. If you see this every hour for the same service, the underlying cause is unresolved and you need to investigate, not silence the alert.
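
One way to spot a service that keeps getting restarted is to count notices per service. This is a rough sketch that assumes you have collected the notice lines, one per line, in a file of your own (restart-notices.txt here is a hypothetical name). count_restarts is likewise an illustrative helper, not a cPanel tool.

```shell
# Read notice lines on stdin and print "<service>=<count>" per service.
count_restarts() {
  sed -n 's/.*TailWatchd has restarted \([^ ]*\).*/\1/p' |
    sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Usage, assuming notices were saved to a hypothetical file:
#   count_restarts < restart-notices.txt
printf '%s\n' \
  'Notice: TailWatchd has restarted mysql' \
  'Notice: TailWatchd has restarted exim' \
  'Notice: TailWatchd has restarted mysql' | count_restarts
```

A service whose count grows in step with the clock is the one to investigate.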

[chkservd] /var/lib/mysql is over <threshold> full

Disk-full alarm scoped to the MySQL data directory. The threshold is configurable in WHM > Server Configuration > Tweak Settings > "Maximum percentage of space used by MySQL". The default is 95%. When this fires, MariaDB stops accepting writes long before the partition itself runs out.
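
To check the same number by hand, the following sketch reads the used percentage from df and compares it with a threshold. The helper names used_pct and over_threshold are assumptions for illustration. Note that df reports the filesystem that holds the path, which is only the data directory itself when /var/lib/mysql is its own mount.

```shell
# Read `df -P <path>` output on stdin and print the used percentage as digits.
used_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when the percentage in $1 is at or above the threshold in $2 (default 95).
over_threshold() {
  [ "$1" -ge "${2:-95}" ]
}

# Only run the live check where the data directory exists.
if [ -d /var/lib/mysql ]; then
  pct=$(df -P /var/lib/mysql | used_pct)
  if over_threshold "$pct"; then
    echo "/var/lib/mysql filesystem is at ${pct}%"
  fi
fi
```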

[chkservd] SSL certificate is expiring on <domain>

This covers the cPanel certificate for the cPanel hostname, not a Let's Encrypt certificate. The default warning window is two weeks. If AutoSSL is failing for the cPanel host, this is the only alert that will tell you before things break.

When ChkServd is right vs wrong

It is almost always right when the alert is a connect failed on a real port: the service is dead or the listener is bound wrong. It is often wrong when:

The probe times out at exactly the same time cron.daily is running. The host is alive; ChkServd just lost the race.
The service runs on an odd port and the driver still expects the default. (Same shape of bug as the Imunify360 custom SSH port issue.)
The service accepts the connection but takes >5s to send a banner. ChkServd's TCP probe is short-fused.

Tuning ChkServd

WHM > Service Manager lets you toggle which services ChkServd monitors at all and whether it auto-restarts on failure. For per-service probe overrides, edit the driver file under /etc/chkserv.d/. The format is comma-delimited; the cPanel docs linked from the WHM page are the only reliable reference.

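# Disable auto-restart for a specific service while keeping monitoring.
# Using | as the sed delimiter avoids escaping every slash in the paths.
sed -i 's|connect,/etc/init.d/mysql restart|connect,/bin/true|' \
  /etc/chkserv.d/mysql
# Restart tailwatchd so ChkServd picks up the edited driver file.
/scripts/restartsrv_tailwatchd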

Related reading

ChkServd alerts are the entry point for several incident types we write up in detail:

For the slow log permission flap that often follows a mysql connect failed alert, see the MariaDB slow log permissions quickref.

How ServerGuard uses this

We parse the ChkServd alert subject line into a structured event (service, kind, port, threshold) and route it into the matching use case before paging a human. Most ChkServd alerts resolve themselves in our triage pass before anyone wakes up.
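
As a rough illustration of that first routing step (not ServerGuard's actual implementation), the sketch below maps a subject line to one of the alert kinds described in this guide. The function name classify_alert and the kind labels are hypothetical.

```shell
# Map a ChkServd subject or notice line to a coarse alert kind.
# The more specific FAILED pattern must come before the generic one.
classify_alert() {
  case $1 in
    *'FAILED: '*'unable to detect a connection on port'*) echo port_timeout ;;
    *'FAILED: '*)                                         echo service_failed ;;
    *'TailWatchd has restarted'*)                         echo restarted ;;
    *'is over '*' full'*)                                 echo disk_threshold ;;
    *'SSL certificate is expiring'*)                      echo ssl_expiry ;;
    *)                                                    echo unknown ;;
  esac
}

classify_alert '[chkservd] Service check -- FAILED: mysql ([connect failed])'
```

Anything that falls through to unknown is a reasonable candidate for a human to look at, since it does not fit the known grammar.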
