Fail2ban: Automated Firewall Response
Key-based SSH authentication makes brute-force success essentially impossible. But the attempts still happen. Every failed login attempt consumes CPU cycles, network bandwidth, and log space. At scale, that noise is more than a nuisance: it can bury real security events.
Fail2ban solves this by watching authentication logs and automatically banning offending IPs at the firewall level. After a small number of failed attempts within a short window, the source IP is blocked for several hours. Because the ban is enforced at the firewall, a banned IP can't even complete a TCP handshake, so the server doesn't waste resources processing its requests.
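The ban policy described above can be modeled in a few lines of Python. This is a toy sketch of the counting logic only, not fail2ban's actual implementation. The names max_retry, find_time, and ban_time mirror fail2ban's maxretry, findtime, and bantime options, but the values are hypothetical because the exact thresholds are not stated here.

```python
from collections import defaultdict, deque

# Hypothetical values for illustration only.
MAX_RETRY = 5          # failures tolerated inside the window
FIND_TIME = 600        # window length in seconds
BAN_TIME = 4 * 3600    # ban duration in seconds


class BanTracker:
    """Toy model of a sliding-window ban policy (not fail2ban's real code)."""

    def __init__(self, max_retry=MAX_RETRY, find_time=FIND_TIME, ban_time=BAN_TIME):
        self.max_retry = max_retry
        self.find_time = find_time
        self.ban_time = ban_time
        self.failures = defaultdict(deque)  # ip -> timestamps of recent failures
        self.banned_until = {}              # ip -> time at which the ban expires

    def is_banned(self, ip, now):
        return self.banned_until.get(ip, 0) > now

    def record_failure(self, ip, now):
        """Record one failed login; return True if this failure triggers a ban."""
        if self.is_banned(ip, now):
            return False
        window = self.failures[ip]
        window.append(now)
        # Forget failures that have aged out of the window.
        while window and window[0] <= now - self.find_time:
            window.popleft()
        if len(window) >= self.max_retry:
            self.banned_until[ip] = now + self.ban_time
            window.clear()
            return True
        return False
```

In the real tool, the equivalent of banned_until is a firewall rule, which is why a banned client never reaches the SSH daemon at all.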
I configured fail2ban for the SSH service on my non-default port, using systemd's journal as its backend instead of parsing log files directly. The journal-based approach is more reliable on modern systems because log rotation doesn't cause missed entries.
Custom Monitoring
Fail2ban handles one specific threat. But servers can fail in dozens of other ways: a service crashes, disk fills up, a certificate expires, memory runs out. I wrote a Python script that checks all of these.
It runs every five minutes via crontab and monitors several categories. Service health checks verify that critical services are running: the web server, the SSH daemon, fail2ban itself, and the firewall. Resource checks compare disk usage, memory consumption, and CPU load against configurable thresholds. Security checks count failed SSH login attempts in the last hour and report fail2ban's current ban count. The script also checks the TLS certificate's expiration date, so that if renewal fails silently I am warned well before the certificate expires.
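To make the threshold idea concrete, here is a minimal sketch of the resource and certificate checks using only the standard library. It is an illustration under assumptions, not the actual script: the THRESHOLDS values and all function names are hypothetical, and the memory and service checks are left out for brevity (they would follow the same compare-against-threshold pattern).

```python
import os
import shutil
import ssl
import time

# Hypothetical thresholds; the real values are configurable and not given here.
THRESHOLDS = {
    "disk_percent": 90.0,    # alert at or above this disk usage
    "load_per_cpu": 2.0,     # alert at or above this 1-minute load per core
    "cert_days_left": 14.0,  # alert at or below this many days to expiry
}


def disk_percent_used(path="/"):
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def load_per_cpu():
    # os.getloadavg() is available on Unix-like systems only.
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def cert_days_left(not_after, now=None):
    """Days until expiry, given a notAfter string such as 'Jan  5 09:34:43 2018 GMT'."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def evaluate(readings, thresholds=THRESHOLDS):
    """Return a description of every reading that crosses its threshold."""
    problems = []
    if readings["disk_percent"] >= thresholds["disk_percent"]:
        problems.append(f"disk {readings['disk_percent']:.0f}% full")
    if readings["load_per_cpu"] >= thresholds["load_per_cpu"]:
        problems.append(f"load {readings['load_per_cpu']:.1f} per CPU")
    if readings["cert_days_left"] <= thresholds["cert_days_left"]:
        problems.append(f"certificate expires in {readings['cert_days_left']:.0f} days")
    return problems
```

Keeping the measurement functions separate from evaluate means the decision logic can be exercised with made-up readings, without filling a disk or waiting for a certificate to expire.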
When something is wrong, the script sends a push notification to my phone. A per-issue cooldown means I get one alert per problem, not one every five minutes. A down service triggers an automatic restart attempt, and successful and failed restarts generate alerts of different severity.
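A per-issue cooldown like the one described above might look roughly like this. The class and method names are hypothetical, and the one-hour default is an assumption. Note that this sketch keeps its state in memory; a script launched fresh by cron every five minutes would have to persist last_sent between runs (for example in a small JSON file), which is omitted here.

```python
import time


class AlertCooldown:
    """Suppress repeat alerts for the same issue until a cooldown has elapsed."""

    def __init__(self, cooldown_seconds=3600, clock=time.time):
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock     # injectable so the logic can be tested without waiting
        self.last_sent = {}    # issue key -> timestamp of the last alert sent

    def should_send(self, issue_key):
        """Return True (and start the cooldown) if an alert may be sent now."""
        now = self.clock()
        last = self.last_sent.get(issue_key)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        self.last_sent[issue_key] = now
        return True

    def clear(self, issue_key):
        """Forget an issue once it is resolved, so a recurrence alerts immediately."""
        self.last_sent.pop(issue_key, None)
```

Keying the cooldown on the issue type rather than using one global timer means a full disk does not silence an unrelated alert about a crashed service.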
The Blind Spot
There's one failure mode this setup cannot cover: if the server itself goes down, the monitoring script goes down with it. No alert is sent. I wouldn't know until I tried to access the site or checked manually.
The monitoring lives on the system it monitors. It can tell me about sick services, full disks, and suspicious activity. It cannot tell me about its own death.
For complete coverage, external monitoring from a separate system is necessary — a service that pings the server from outside and alerts when it stops responding. That's a future improvement. For now, I accept this limitation and check the server's public-facing page periodically as a manual fallback.
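As a rough idea of what that external check could look like, here is a minimal sketch meant to run on a different machine. Nothing like this exists in the current setup; the function and class names, the timeout, and the three-failure rule are all assumptions for illustration.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx or 3xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts, and HTTP error statuses.
        return False


class DowntimeDetector:
    """Fire an alert only after several consecutive failed probes."""

    def __init__(self, failures_needed=3):
        self.failures_needed = failures_needed
        self.consecutive_failures = 0

    def observe(self, reachable):
        """Feed one probe result; return True exactly when the alert should fire."""
        if reachable:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures == self.failures_needed
```

Requiring several failures in a row trades a few minutes of detection delay for fewer false alarms from a single dropped packet or a brief network blip.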
Why Not a Commercial Solution?
Tools like Datadog, New Relic, and UptimeRobot exist for exactly this purpose. For a production business, they're the right choice. But this project is about understanding, not convenience. Writing my own monitoring taught me what to watch, how to detect failure, and how alerting systems work. Using a commercial tool would have given me a dashboard without the knowledge behind it.