Peak-season crisis: two days, four root causes

A Bitrix store goes down in peak season. Four root causes: panel antivirus, a one-digit DB-config typo, disk degradation, a slow DDoS. Closed in two days.

11h delivered

Outages rarely pick a convenient moment, and this one picked the worst: the end client, a Bitrix workwear store, was in its sales season. The hard part of incidents like this is not the load or the work running late into the night. The hard part is when there is more than one cause. You fix the obvious thing, the site comes back for an hour, then dies again: that means a second problem sits under the first. Here there were four.

Snapshot


End-client sector	retail and wholesale of workwear and PPE
End client	Bitrix store (anonymized), VDS at a third-party host
Engagement	white-label support for a web studio, on retainer
Project type	emergency diagnosis and stabilization of a production server in peak season
Scope of work	production VDS: 18 CPU / 24 GB RAM at the start of the crisis, order exchange with 1C
Project date	25 Feb 2026 – 27 Feb 2026 (3 days)
Effort	11.25h on this site for February; 6.5h of that across the two peak days
Team	1 sysadmin-engineer (studio), project manager on escalation
Tech stack	1C-Bitrix · MySQL · nginx + Apache · CentOS 7.8 · ispmanager
Delivered	site back to normal, auto-ban script on standby, and our own recommendation to cut the upgraded memory back by half

The problem

The studio runs its clients’ sites; we run its servers: monitoring, monthly updates, incidents. The workwear store is one of the heaviest sites in that fleet: a heavy Bitrix install, a large database, regular order exchange with 1C from the client’s cloud. The server is someone else’s VDS on CentOS 7.8 with the ispmanager panel, and February is the store’s season.

The request itself was simple. On 25 February at 16:36 monitoring flagged that the homepage had not answered for more than 60 seconds. The studio manager wrote in the chat: “check this please, the site’s dead.” What followed was two days during which the site either lay flat or crawled: a catalog page took 20–30 seconds to open, and adding an item to the cart took 6–8 seconds instead of 1.5. By the evening of the second day the studio said plainly what it was costing: “6 hours of site problems today, this hurts the business as much as it can. And the client is calling me every half hour.”

There is one danger here: several causes at once, all of them on someone else’s hardware. The host answers complaints with “we see no problems on our side,” and when resources are added the virtual machine’s statistics reset to zero, so there are simply no “before” graphs. Each cause fixed buys an hour of quiet, and if you stop at the first one, tomorrow it all repeats while the client’s trust in the studio is already dented. So we dug to the bottom of the list.

By the end of the first day we closed two of the four causes. By the end of the second we found the last one and beat the attack back.

Four root causes

1. An antivirus nobody asked for. In the freshly updated panel, “some kind soul switched on ImunifyAV.” The antivirus was grinding through a scan queue with dozens of processes, each holding up to half a CPU core. Working out exactly where it was choking, mid-season, was not an option: we killed the processes and stopped the panel so it would not restart the queue. The postmortem was left for later. First the site.

2. A one-digit typo. The MySQL config held sort_buffer_size = 2M instead of 32M: “probably a mistake, someone just dropped the 3 at the start.” For a large database under a heavy engine, that meant sorts spilling to disk for no reason. We put the number back: “seems a bit easier.” A bit easier, but not good: the list of causes did not end there.

3. The disks. The database kept hammering the disk even after the config fix, and after the VM was migrated to a different hypervisor it got worse: less load, more lag. The host would not admit the problem, so we argued it with a comparison: on the studio’s own hypervisor, read IOPS are comparable, write IOPS are 100 times higher, and that with 107 live VMs and disk usage at 10–20%. The recommendation to the client is on record: NVMe and/or more memory, so the database lives entirely in cache.

4. A DDoS that’s easy to miss. Not thousands of requests per second, the kind of attack you can see from space, but 500–600 open, silent connections to port 443. Not a byte of data was sent; the connections simply kept the web-server slots occupied. The engineer’s candid admission in the chat: “in my defense, this is the second time I’ve seen this in my life: usually it’s more obvious, not 500–600 connections but thousands.” This one cause explained the central riddle of the two days: why the site fell again after every fix.
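The typo in cause 2 is easy to miss by eye, because 2M is a perfectly valid value. A minimal sketch of a check that could flag it, assuming a hand-maintained table of expected minimums. The function names are hypothetical, and the 32M floor comes only from this case, not from any general recommendation:

```python
import re

# Multipliers for MySQL-style size suffixes (K, M, G); no suffix means bytes.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}

def parse_size(value: str) -> int:
    """Convert a my.cnf size such as '32M' into bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?)\s*", value, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit.upper()]

def find_undersized(config_text: str, expected_min: dict[str, str]) -> list[str]:
    """Return the names of settings whose value is below the expected minimum."""
    flagged = []
    for line in config_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if "=" not in line:
            continue                            # skip section headers and blanks
        name, value = (part.strip() for part in line.split("=", 1))
        name = name.replace("-", "_")           # my.cnf accepts both spellings
        if name in expected_min and parse_size(value) < parse_size(expected_min[name]):
            flagged.append(name)
    return flagged

# Illustrative input only: the value from this case against the intended one.
sample = "[mysqld]\nsort_buffer_size = 2M\n"
print(find_undersized(sample, {"sort_buffer_size": "32M"}))  # ['sort_buffer_size']
```

A check like this only catches values below a floor someone has already written down; it does not replace the tuning pass on a real workload.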

An auto-ban by evening

An off-the-shelf anti-DDoS service would have needed approvals, a DNS change, and hours of waiting that a sales season does not have. Instead the engineer wrote a script for the specific signature of the attack: any address holding more than 5 simultaneous connections to port 443 with not a byte transferred over them gets a 300-second ban. The whitelist: the end client’s office and the cloud running 1C.
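The rule is small enough to sketch as a pure decision function. This is not the engineer’s script: it assumes the connection list has already been collected by some other means, the `Conn` record and function names are hypothetical, and applying the ban in the firewall is left out. The addresses are documentation ranges, not the real ones.

```python
from collections import Counter
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

MAX_SILENT_CONNS = 5   # "more than 5 simultaneous connections" from the text
BAN_SECONDS = 300      # ban duration from the text

@dataclass(frozen=True)
class Conn:
    """One established connection to the server (hypothetical record)."""
    remote_ip: str
    local_port: int
    bytes_received: int
    bytes_sent: int

def ips_to_ban(conns, whitelist):
    """Return remote IPs holding too many silent connections on port 443."""
    allowed = [ip_network(net) for net in whitelist]
    # Count only connections to 443 over which nothing was transferred.
    silent = Counter(
        c.remote_ip
        for c in conns
        if c.local_port == 443 and c.bytes_received == 0 and c.bytes_sent == 0
    )
    return sorted(
        ip
        for ip, count in silent.items()
        if count > MAX_SILENT_CONNS
        and not any(ip_address(ip) in net for net in allowed)
    )

def ban_expiry(now: float) -> float:
    """Unix time at which a ban issued at `now` should be lifted."""
    return now + BAN_SECONDS

# Illustrative data: one silent attacker, a whitelisted office, normal traffic.
attacker = [Conn("203.0.113.7", 443, 0, 0)] * 6
office = [Conn("198.51.100.10", 443, 0, 0)] * 8
normal = [Conn("203.0.113.9", 443, 512, 2048)] * 20
print(ips_to_ban(attacker + office + normal, ["198.51.100.0/24"]))  # ['203.0.113.7']
```

Keeping the whitelist check inside the decision, rather than in the firewall rules, means the exceptions are applied before any ban is issued.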

The first version was cruder, and that’s an honest part of the story. A manual ban, everyone at once, and our own people got caught in it: the studio manager checking the site across several browsers at once, and the IP addresses of the cloud the order exchange runs through. Half an hour of “now it’s blocked, now it’s unblocked,” and only then the script with exceptions. It worked. By 19:45 the ban list was empty: the attacker figured out the connections die after five minutes and dropped off.

On 27 February the last complaint came in: the 1C exchange from the cloud kept dropping. We checked it by eliminating layers one at a time: ban list empty → server pings the cloud → access log shows 200 responses to the 1C requests → firewall stopped for a clean experiment → a control reboot, with 2–3 minutes of downtime agreed in the chat in under a minute. After the reboot the drops stopped. Across two crisis days of settings edited on the fly, something stray could have stayed behind, and the reboot put the question to rest.
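The access-log step in that chain can be sketched as a tally of status codes for the exchange requests. This assumes the common/combined log format and assumes the exchange handler lives at `/bitrix/admin/1c_exchange.php`; the case does not state the actual path, and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed path of the 1C exchange handler; adjust to the real install.
EXCHANGE_PATH = "/bitrix/admin/1c_exchange.php"

# Matches the request and status fields of a common/combined log line.
_LINE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')

def exchange_statuses(log_lines, path_prefix=EXCHANGE_PATH):
    """Count HTTP status codes for requests whose path starts with path_prefix."""
    statuses = Counter()
    for line in log_lines:
        match = _LINE.search(line)
        if match and match.group("path").startswith(path_prefix):
            statuses[match.group("status")] += 1
    return statuses

# Made-up sample lines: two exchange requests and one unrelated page view.
sample = [
    '198.51.100.10 - - [27/Feb/2026:10:15:02 +0300] "POST /bitrix/admin/1c_exchange.php?type=sale&mode=query HTTP/1.1" 200 512 "-" "-"',
    '198.51.100.10 - - [27/Feb/2026:10:15:09 +0300] "GET /bitrix/admin/1c_exchange.php?type=sale&mode=success HTTP/1.1" 200 7 "-" "-"',
    '203.0.113.9 - - [27/Feb/2026:10:15:11 +0300] "GET /catalog/ HTTP/1.1" 200 18342 "-" "-"',
]
print(exchange_statuses(sample))  # Counter({'200': 2})
```

If the tally shows only 200s while the 1C side still reports drops, the web server layer is likely answering correctly and the search moves on to the next layer.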

After the peak

The most telling line of the crisis came once everything was already working. The end client bought 6 more CPU cores and 40 GB of memory, and that same evening the studio engineer wrote: “let them cut the memory back by half, down to 32G. The cores I’d leave for now, watch it a day or two, and if it’s all steady let them put it back to 18 as it was.” We earn on hours of work, not on someone else’s hardware: leaving the client carrying extra cost would have been easier, but dishonest. The studio decided to watch it for a week and passed the recommendation to the client.

What else was left after the crisis:

targeted MySQL tuning from Bitrix’s own diagnostics, on the parameters it flags as bottlenecks itself;
working log rotation, double-checked the next day: the first run “didn’t work, though the log claimed otherwise”;
debts recorded in writing: CentOS 7.8 is badly out of date, the disks need to move to NVMe, and MySQL wants “a thoughtful tune, not a quick fix of the obvious errors.”

Results

Metric	Value
Active crisis	2 days (25–26 February), the 1C-exchange tail on 27 February
Causes found and closed	4: panel antivirus · DB-config typo · disk degradation · slow DDoS
Server resources	client bought more CPU and memory; once the crisis passed, we ourselves recommended cutting the memory back by half
February effort on this site	11.25h tracked, 6.5h of that across the two peak days

In plain terms: the store was back to its old speed while still in season, the auto-ban script stayed on standby on the server, and we ourselves recommended cutting the newly bought memory back by half. The server’s debt list (disks, OS, deep DB tuning) was handed to the client in writing, without alarm and without “let’s rebuild everything.”

Team

sysadmin-engineer (studio) — diagnosis, recovery, the auto-ban script, all the work on the server
Anton Hersun, Xaver Pro — project manager

If your store goes down at the height of the season and there looks to be more than one cause, send us a brief or your current technical documentation. We’ll look, name the bottlenecks, and come back with a fixed estimate in hours. The urgent review is free.

Diagnose the outage →

Related case studies

An inherited fleet of 12 VDS: audited in 6 hours, in order within a month

Migrating an infected NextCloud: 400 GB with no docker in 7 hours

GitLab: from 15.4 to 18.11 over two years, zero data loss