Engineering practice Multi-vertical

~70 incidents over 24 months: the bot notices before people do

~70 incidents over 24 months: the bot notices first, the engineer is in within 1-9 minutes, recovery in 7-50. Incident response without a formal SLA.

Delivered

A site goes down at night. That’s not a hypothesis, it’s a schedule. Over two years on this fleet, disks filled up, the kernel killed databases, the host’s storage system failed, domains expired, and once the monitoring server itself hung. The question was never “will something fall.” The question was different: who finds out first, the on-call engineer or the customer whose cart wouldn’t open.

This case is about turning a stream of outages into a managed routine. The bot writes to the chat, the engineer is in the system within 1–9 minutes, a typical incident closes in 7–50 minutes, and every episode lands as a line of tracked effort. There is no formal response-time guarantee in the contract, but there are numbers.

Snapshot


End-client sector	digital agency; client online stores, mostly on Bitrix
End client	Media Studio (code-name), Russia
Engagement	retainer support: chat for the operational work, Redmine for the record and the hours
Project type	monitoring and incident response as an ongoing process
Scope of work	~14 servers in Zabbix; ~70 incidents over 24 months, ~30 with exact timing
Project date	4 June 2024 – 1 June 2026 (727 days); the process continues
Effort	by monthly “monitoring response” tasks, a typical episode 0.25–2h; a mixed stream does not reduce to one figure
Team	3 specialists (sysadmin-engineer · second engineer when needed · project manager)
Tech stack	Zabbix 6.4 · Telegram bot · ISPmanager · nginx · Apache httpd · MySQL/MariaDB · CentOS 7 / AlmaLinux / Ubuntu
Delivered	the bot notices first in most incidents; response 1–9 minutes; typical recovery 7–50 minutes; monitoring itself moved to a new server in 2 hours, exactly on estimate

The problem

Before the contract, the fleet looked like this: 12 VDS, the previous administrator gone, “once or twice a week something reliably falls over.” The studio already had Zabbix, dutifully sending 1–12 alerts a day into the void. There was no one to watch them, let alone act. Monitoring existed, response did not.

The client’s request was simple and hard: find out about outages before their own customers do, and for every alert see who did what with it and how long it took. Not “set us up some nice graphs,” but “make outages stop being a surprise.”

What’s hard here from the buyer’s point of view: monitoring is a tool that dies of its own noise. The hundredth ignored alert is no different from a switched-off Zabbix, and almost every studio has it set up exactly that way: installed, configured, ignored. Customers still report the outages. The task was not “deploy monitoring” but to build a discipline of response around it. And to hold that discipline for 24 months straight: on someone else’s fleet, with no staff on-call, no budget for a round-the-clock shift.

How we did it

1. Critical alerts go to the shared working chat, not a separate console. In June 2024 the Zabbix bot was connected straight into the chat with the client, and the set of alerts was agreed right away: “ok, let’s keep it to critical for now.” Both sides see the outage and the response to it at the same moment, in one feed. Hiding an incident is impossible, and pointless: response speed became a public metric the client checks every day, just by reading the chat.

2. A monthly “monitoring response” task puts the whole outage stream on the books. From July 2024 to June 2026 one container task lives in the tracker each month. Every restart, disk cleanup, night-alert postmortem is written into it as a separate effort line with a comment: a typical episode costs 0.25–0.5h. The first such task closed with 0.75h for two episodes. Outages stopped being “free panic” and became a unit of account: the client pays the retainer and sees what it’s made of, down to the line.

3. Alert hygiene: an alert has to mean an action. When a disk chronically full at 95% threw dozens of “less than 5% free” messages, we lowered the threshold to 3%: the notice arrives when it’s time to act, not when it’s “the usual.” When the message stream started getting in the way of the work itself, the studio asked the client to trim the set: “right now zabbix is getting in our way here more than it helps” — “Left only the critical ones, turned the rest off.” Noisy monitoring dies first. This one is in its third year.

4. We monitor the monitoring. On 8 October 2024 every server in the fleet “lost” its agents at once: the Zabbix server had updated itself to 6.4.19 on its own. The agents came back in 13 minutes, and auto-update was switched off that same night: monitoring has no right to update itself at a random moment. December 2025 was more serious: the old ARM monitoring server began hanging for 4–8 hours, leaving the whole fleet blind, and the host refused to migrate the VM. The studio’s estimate: “about two hours.” On the evening of 29 December the database and config moved to a new server, exactly 2 hours on the tracker, carrying the IP across so the agents would not notice the move. The next morning from the client: “first checks look like everything works, thanks!”

5. We admit gaps in writing, and close them. On 30 December 2024 a site that wasn’t on monitoring stayed down for 4.5 hours until the client’s manager wrote in by hand. The engineer brought it up in 10 minutes from the moment of contact, after which the site went onto monitoring and the missing syslog was added to the server. On 18 August 2025 the backup server’s agent went quiet for 20.5 hours: the alert was there, it was missed. In the chat, an honest “I think I missed an alert,” fixed in 7 minutes: firewalld had started up on its own on the server with ssh as the only allowed port, and we tore the firewall back out. A miss admitted and closed in 7 minutes does more for trust than clean statistics with no admission at all.

Incidents and response

Three telling episodes out of ~30 documented:

Date	What fell	Who noticed	Recovery
20.02.2025	shared server hung, not pinging	bot, 20 min before the client	~22 min (host-side fault)
26.11.2024	3 TB storage: disk filled to ~100%	bot, instantly	3–4 min
27.01.2025	online-store database went down	client’s manager, picked up in 1 min	~28 min

The pattern is one. The bot catches almost everything first. The client’s people bring in the night episodes and what hasn’t made it onto monitoring yet. After that the same loop runs: engineer in the system within minutes, diagnosis out loud in the chat, the fix logged as effort against the monthly task. And every “why did it fall” turns into a setting so it won’t fall again: process limits, watchdogs, thresholds, explicit timeouts instead of defaults.

In fairness, the format has a ceiling. Without a 24/7 shift, a night alert sometimes waits for morning: the longest outages of the period (4.5–8 hours) fell on night runs and on sites not yet on monitoring. The studio’s own infrastructure was never down for longer than ~3 hours.

Results

Metric	Value
Incidents over 24 months	~70, ~30 of them with exact timing
Who notices first	the bot, in most cases
Engineer response	1–9 minutes from the alert or the contact
Recovery of a typical incident	7–50 minutes
Migrating Zabbix itself to a new server	2 hours, exactly on the prior estimate
False mass alert (Zabbix auto-update)	resolved in 13 minutes, cause removed that same night
Missed alert (the only one admitted)	closed in 7 minutes after escalation
Baseline before the start	“once or twice a week something reliably falls over”

In plain terms: a studio with a fleet of ~14 servers has lived two years without surprises from its own infrastructure. Outages still happen: on cheap VPS with Bitrix stores they are inevitable. What changed is that the bot reports them first, not an angry customer, and that every outage has a response time, a cause, and a takeaway, written into the tracker.

Team

sysadmin-engineer (studio) — first line: alert response, diagnosis, watchdogs, the Zabbix migration
engineer (studio, when needed) — pulled in for consults and parallel incidents
Anton Hersun, Xaver Pro — project manager: process, escalations, review of disputed episodes

If your Zabbix sends alerts into the void and customers report outages first, send us a brief or your current technical documentation. We’ll look, name the bottlenecks, and come back with a fixed estimate in hours. The initial review is free.

Set up monitoring →

~70 incidents over 24 months: the bot notices before people do

Snapshot

The problem

How we did it

Incidents and response

Results

Team

Related case studies

Peak-season crisis: two days, four root causes

GitLab: from 15.4 to 18.11 over two years, zero data loss

A two-byte diagnosis: a seven-month bug closed in one evening