Now that my monitoring system is sending reliable notifications again, being woken up by them at 6.30AM on the first day of my Christmas break was a problem that needed fixing. My Nagios has been sending out lots of "spurious" alerts for some time, and my hope was that getting woken up by them would motivate me to fix the cause. I'm pleased to say that it has!
In addition to checking that various machines respond to ping, my system also logs into them using check_by_ssh and checks various things by running commands on the machine itself. Sometimes the entire block of SSH-based checks for a server would flip over to CRITICAL with "connection timed out", even though the machine remained up and running. There was no evidence of a high load average to explain the timeouts, and a bit of checking with netcat revealed that connections to port 22 on the machine in question really did time out from the monitoring machine (but not from anywhere else).
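For anyone wanting to repeat that test without netcat, here is a minimal Python sketch of the same kind of probe. The function name `probe_tcp` and the five second default timeout are my own choices for illustration, not anything Nagios or check_by_ssh provides. The useful part is that it tells a silent timeout (typical of a firewall dropping packets) apart from an outright refusal.

```python
import socket


def probe_tcp(host: str, port: int = 22, timeout: float = 5.0) -> str:
    """Try one TCP connection and report 'open', 'refused' or 'timeout'.

    'timeout' usually means packets are being silently dropped somewhere,
    whereas 'refused' means the host answered but nothing is listening.
    Any other network error (e.g. no route to host) is left to propagate.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
```

Running something like `probe_tcp("server.example.org")` from the monitoring machine and again from another host should show the difference, where `server.example.org` is a placeholder for the affected machine.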
At this point, I had a lightbulb moment and remembered that our firewalls automatically block SSH connections from any IP address which attempts more than 10 in a 60 second period. This crude rate limiting is one of many lines of defence against brute-force attacks, but of course, some hosts have more than 10 checks run over SSH. And the way Nagios runs means that quite often, it hits the rate limit, then continues to do so as it re-tries the checks one after another. The backoff it performs doesn't help, because it backs off the retry interval in lock-step for all of the checks, so the retries still arrive together.
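To make the failure mode concrete, here is a toy model of that rate limit in Python. It is only a sketch under assumptions: the 10-in-60-seconds figures come from above, but the class, the one second spacing between checks, the 30 second retry delay, and the choice to count blocked attempts towards the limit are all invented for illustration and may not match how the real firewall behaves. The IP address is from the documentation range.

```python
from collections import deque

LIMIT = 10     # connection attempts allowed per window
WINDOW = 60.0  # window length in seconds


class SshRateLimiter:
    """Toy sliding-window limiter keyed by source IP (illustrative only)."""

    def __init__(self, limit=LIMIT, window=WINDOW, exempt=()):
        self.limit = limit
        self.window = window
        self.exempt = set(exempt)  # IPs that bypass the limit entirely
        self.recent = {}           # ip -> deque of recent attempt times

    def allow(self, ip, now):
        if ip in self.exempt:
            return True
        attempts = self.recent.setdefault(ip, deque())
        # Forget attempts that have aged out of the window.
        while attempts and now - attempts[0] >= self.window:
            attempts.popleft()
        # Assumption: blocked attempts still count towards the limit.
        attempts.append(now)
        return len(attempts) <= self.limit


def run_checks(limiter, ip, count, start, spacing=1.0):
    """Make `count` attempts from `ip`, `spacing` seconds apart.

    Returns how many of them were blocked.
    """
    return sum(not limiter.allow(ip, start + i * spacing) for i in range(count))


MONITOR = "192.0.2.10"  # placeholder address for the monitoring server

firewall = SshRateLimiter()
blocked_first = run_checks(firewall, MONITOR, count=12, start=0.0)   # 2
blocked_retry = run_checks(firewall, MONITOR, count=2, start=30.0)   # 2

exempting = SshRateLimiter(exempt=[MONITOR])
blocked_exempt = run_checks(exempting, MONITOR, count=12, start=0.0)  # 0
```

In this model, a host with 12 SSH checks loses the last two to the limit, and retrying those two 30 seconds later fails again because the earlier attempts are still inside the window. With the monitoring address exempted, nothing is blocked.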
Having added an exemption to the firewall rate limiting for our monitoring server's IPs, all is now well in Nagios, and hopefully the only rude awakenings from the alerts will now be genuine outages.
(And it only took me five years to find time to get to the bottom of this intermittent problem!)