The most dangerous thing in someone else’s server fleet is not what’s broken. Broken things are visible. The danger is what nobody remembers: a firewall rule added on the fly and never saved, a swap entry written into fstab with a typo, a hand-rolled nginx config that lives only in the head of an admin who has left. So we start a fleet handover not with repairs but with an inventory, and only then touch anything in production.
This case is about the first month of that work: how 12 unfamiliar VDS with a steady background of outages became a fleet where every server has a wiki page, every job has an agreed window, and monitoring has a person who answers it.
Snapshot
| Field | Details |
|---|---|
| End-client sector | digital agency, building online stores on Bitrix |
| End client | web studio Media Studio, Russia |
| Engagement | direct sole-proprietor↔sole-proprietor contract: 3 months with renewal, signed via electronic document management (EDM) |
| Project type | fleet handover to support: audit, first wave of cleanup, process setup |
| Work done | 12 VDS, 4 of them for the studio’s internal services (GitLab, cloud, backups, monitoring) |
| Project date | 4 Jun 2024 – 26 Jul 2024 (53 days) |
| Effort | 11.25h across five tracker tasks |
| Team | 3 specialists (project manager · sysadmin-engineer · engineer) |
| Tech stack | CentOS 7 / Ubuntu · ISPmanager · MariaDB 10.4 · nginx · Apache httpd · Zabbix 6.4 · fail2ban |
| Delivered | audit of all 12 VDS with wiki descriptions, updated fleet, monitoring in the shared chat, a regular update cycle as a process |
The problem
A web studio came to us: Bitrix, a little Laravel, dozens of client sites, and 12 VDS at one host. The previous admin had left, no documentation existed. The tech lead framed the request honestly: “Roughly once or twice a week something reliably falls over or breaks. Either the database decides to restart itself on some timer of its own, or it runs out of RAM, or bots show up on a site.” The studio had its own Zabbix, and it produced anywhere from 1 to 12 alerts a day.
They wanted four things, in the client’s own wording: make the fleet more stable, react to problems faster, install security updates regularly, and run server work on request. They were choosing between three companies. The decision matured before any contract: the studio lost network on a bare-metal server, and we responded within minutes. The client managed to fix it themselves by correcting a MAC address, then wrote in the chat: “can I switch to you instead?” A week later the contract went out via EDM.
A tech lead handing over a fleet like this carries two fears, and both are well-founded. The first: the contractor starts “cleaning up” with broad strokes and brings down production stores that ran for years on undocumented hacks. The second, the opposite: the contractor drowns in the audit and ships nothing tangible for months. We had to balance on a live fleet, where every reboot is a lottery by definition: nobody yet knows what on this particular server won’t survive a restart.
How we did it
1. An audit with a clock running, not a “study.” We went through every server on the list: about 30 minutes per machine, 6 hours for the fleet. We corrected the estimate in the open while the work was still running: once six servers were documented, it was clear the original 4 hours would not be enough. We opened a wiki page for each server and finished with a recommendations document: swap, a unified log, fail2ban, backups. The client then turned that document into their own tracker tasks, each one referencing the audit, so nothing was done “from memory.”
2. Their monitoring instead of ours. The studio already had Zabbix: building a parallel one would have cost hours and produced two sources of truth. Instead we connected the client’s bot into the shared work chat and agreed on alert hygiene right away: only major and critical triggers, and only per server, not per site. Otherwise one downed server with fifty sites buries the chat under fifty messages. “OK, let’s keep it to critical for now.” From that day both sides see incidents and the response to them in one place, with timestamps.
3. Updating the fleet as a procedure, not a feat. The client assumed half an hour per server and proposed slicing the fleet into groups: otherwise 6 hours, the whole monthly budget. We proposed something else: an agreed window, an on-call engineer with access to the host’s console (“on any update there’s a chance the server doesn’t come back from the reboot”), and the whole fleet in one pass. We updated ten servers from the agreed list in 1.75 hours. We deliberately left the cloud untouched until its migration, and GitLab was updated by another engineer under a separate task. The task included a dump of the updated packages per server: MariaDB, zabbix-agent, the kernel. The update also surfaced PHP 7.2 on the central server, end-of-life since November 2020. We recorded that risk in writing.
4. The first wave of order, targeted, following the audit’s recommendations. We added swap wherever it was missing, the same way and on the same path everywhere. Together with the client we decided against repartitioning to give swap a separate volume: this is a VDS, you can’t boot from a live CD, and a swap file is enough. On every server with a control panel we assembled a unified nginx access log while keeping the per-user logs for their own sites: now bots and DDoS show up under a single tail. The most rewarding case was the store whose database used to fall over “two or three times a day”: we found a swap entry written into fstab with a typo, so swap never came up after a reboot; trimmed the MySQL buffer from 4 to 3 GB to match a real database of about 3 GB; and finished a half-configured nginx cache with a 1-minute TTL: “the job here is to smooth the spikes, not to cache everything.” After nine days of observation without a single crash, we closed the task.
5. Process from the first month, built on live precedents. We didn’t write the rules in advance. We recorded them after the first collision with each one. A low-priority task ate the hours needed for updates? We agreed: work runs strictly by the client’s priority or by agreement, and anything done off-plan can move to the next invoice. Redmine is awkward for discussion? Chat for discussion, the tracker for recording the fact, the estimate, and the hours. After a discussion we open the tasks, with the client’s note “approved.” A client user’s antivirus scan loaded CPU for all the neighbors for 20 minutes? We proposed adding to the contract the right to kill processes that interfere with other clients: a practice we brought from running our own hosting. By the end of July the client asked for the main thing: “so I don’t have to remember myself that our servers need updating.” We set up a recurring monthly task, and since then updates run in cycles, without reminders.
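The fstab typo from step 4 is the kind of defect a small pre-reboot check can catch. Below is a minimal sketch of such a check, not the tooling used on the project. The case does not record what the typo actually was, so the three rules here (field count, the `swap` type, an existing backing file) are assumptions about what commonly goes wrong, and `check_swap_entries` and `FstabProblem` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FstabProblem:
    line_no: int
    reason: str


def check_swap_entries(
    fstab_text: str, exists: Callable[[str], bool] = os.path.exists
) -> List[FstabProblem]:
    """Flag fstab lines that look like swap entries but are malformed."""
    problems: List[FstabProblem] = []
    for line_no, raw in enumerate(fstab_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split()
        # Heuristic (assumption): a swap entry has "swap" as a whole field
        # or "none" as its mount point.
        looks_like_swap = "swap" in fields or (len(fields) >= 2 and fields[1] == "none")
        if not looks_like_swap:
            continue
        if len(fields) < 4:
            problems.append(FstabProblem(line_no, "fewer than 4 fields"))
            continue
        device, mount_point, fs_type = fields[0], fields[1], fields[2]
        if fs_type != "swap":
            problems.append(FstabProblem(line_no, f"type is '{fs_type}', expected 'swap'"))
        if mount_point not in ("none", "swap"):
            problems.append(FstabProblem(line_no, f"unexpected mount point '{mount_point}'"))
        # Only plain paths are checked; UUID=/LABEL= devices are skipped.
        if device.startswith("/") and not exists(device):
            problems.append(FstabProblem(line_no, f"{device} does not exist"))
    return problems
```

Run against the text of `/etc/fstab` before a planned reboot, an empty result means none of these three rules fired. It does not prove the entry is correct, so it complements rather than replaces looking at the active swap after the reboot.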
What went wrong
Without this section the case would be a lie. While setting up the unified log, the engineer overwrote hand-edited nginx configs that handled proxying, and one service’s static assets started returning 404. The client noticed about ten minutes later, and the fix took roughly another ten. Then we worked on the cause rather than the symptom: we put those configs into the wiki and verified that the panel backs up all of /etc (full once a week, changes daily). The second episode was caught by the monitoring itself: after a reboot one server dropped out of Zabbix for 7 minutes. The firewall rule for the agent’s port had once been added on the fly and never saved, and the reboot wiped it. Both cases are exactly the mines that justify documenting an inherited fleet first and rebooting it only afterward.
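A rule that exists only at runtime can be found before the reboot by comparing the running firewall state with the saved one. The sketch below assumes the server uses firewalld, which the case does not state. On such hosts `firewall-cmd --list-ports` prints the runtime ports and `firewall-cmd --permanent --list-ports` prints the saved ones. The helper names are hypothetical, and 10050/tcp in the usage note is the default port of a passive Zabbix agent.

```python
import subprocess
from typing import Iterable, Set


def parse_ports(output: str) -> Set[str]:
    """Turn 'firewall-cmd --list-ports' output ('80/tcp 10050/tcp') into a set."""
    return set(output.split())


def unsaved_ports(runtime: Iterable[str], permanent: Iterable[str]) -> Set[str]:
    """Ports open right now that would disappear after a reboot or reload."""
    return set(runtime) - set(permanent)


def firewalld_ports(permanent: bool) -> Set[str]:
    """Read open ports from firewalld (assumption: firewalld is the firewall in use)."""
    cmd = ["firewall-cmd"]
    if permanent:
        cmd.append("--permanent")
    cmd.append("--list-ports")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_ports(result.stdout)


if __name__ == "__main__":
    # Requires firewalld and sufficient privileges on the target host.
    for port in sorted(unsaved_ports(firewalld_ports(False), firewalld_ports(True))):
        print(f"open at runtime but not saved: {port}")
```

If the runtime set contains `10050/tcp` and the permanent set does not, the script prints that port, which is the situation described above. Services opened by name rather than by port number would need the same comparison on `--list-services`.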
There was a longer episode too: the evening after the update, the central server held a load average above 10, and for about two hours both teams hunted the culprit together. It turned out to be a “ghost” node process from a project long deleted off disk: it lived in memory until the reboot, then tried to start again and loaded the CPU. Along the way we raised the database’s innodb_buffer_pool_size from 1 to 1.5 GB, and the client drew their own conclusion: split projects across separate users.
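Processes like this can sometimes be spotted before a reboot. On Linux, when a process’s working directory or executable has been removed from disk, the `cwd` or `exe` link under `/proc/<pid>/` typically reads back with a ` (deleted)` suffix. The sketch below scans for that suffix. It is an illustration under that assumption, not the method the two teams used that evening, and `find_deleted_path_processes` is a hypothetical name.

```python
import os
from typing import Dict, List

DELETED_SUFFIX = " (deleted)"


def find_deleted_path_processes(proc_root: str = "/proc") -> List[Dict[str, str]]:
    """List processes whose cwd or exe link points at a path removed from disk."""
    found: List[Dict[str, str]] = []
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # not a PID directory
        for link_name in ("cwd", "exe"):
            link = os.path.join(proc_root, entry, link_name)
            try:
                target = os.readlink(link)
            except OSError:
                continue  # process exited, no such link, or permission denied
            if target.endswith(DELETED_SUFFIX):
                found.append({"pid": entry, "link": link_name, "target": target})
    return found


if __name__ == "__main__":
    # Run as root to see the links of other users' processes.
    for hit in find_deleted_path_processes():
        print(f"pid {hit['pid']}: {hit['link']} -> {hit['target']}")
```

A hit is not automatically a problem: a service restarted after a package upgrade may briefly show a deleted `exe`. A long-running process whose working directory no longer exists is still worth a line in the server’s wiki page before the reboot window.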
Results
| Metric | Value |
|---|---|
| Fleet audit | 12 VDS, ~30 min per server, 6h tracked, a wiki page per server |
| Fleet update | 10 servers in 1.75h, against the 6h the client had budgeted for the whole fleet at 0.5h/server |
| Risks surfaced | PHP 7.2 on the central server (support ended in 2020) — recorded in writing |
| Problem store’s database | fell “two or three times a day” → 0 falls over 9 days of observation, task closed |
| Monitoring | Zabbix alerts into the shared chat: major + critical, per server |
| Process | recurring monthly update task + a priority rule + the right to kill harmful processes, in the contract |
| Period effort | 11.25h tracked |
In short: in the first month an unfamiliar fleet became documented, updated, and observable. The background of “once or twice a week something reliably breaks” did not vanish by magic (the bots went nowhere, and migrations were still ahead), but every fall was now seen by a bot and an engineer, not by the client via a complaint from the end customer. Most importantly, updates, reboots, and config edits gained a ritual: a window, an on-call engineer, a report.
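The alert rule behind that visibility (only major and critical, one message per server rather than per site) can be illustrated with a short sketch. The case does not describe how the filter was configured inside Zabbix, so this shows the rule only, not the actual setup. The alert dictionaries and their `host`, `severity`, and `trigger` keys are hypothetical.

```python
from typing import Dict, Iterable, List

# Assumption: severities arrive as plain labels; only these two are forwarded.
FORWARDED = {"major", "critical"}


def collapse_alerts(alerts: Iterable[Dict[str, str]]) -> List[str]:
    """Drop low-severity alerts and emit one chat message per server."""
    per_host: Dict[str, List[Dict[str, str]]] = {}
    for alert in alerts:
        if alert["severity"] not in FORWARDED:
            continue
        per_host.setdefault(alert["host"], []).append(alert)

    messages = []
    for host, items in per_host.items():  # dicts keep insertion order
        worst = "critical" if any(a["severity"] == "critical" for a in items) else "major"
        messages.append(
            f"[{worst}] {host}: {len(items)} trigger(s), first: {items[0]['trigger']}"
        )
    return messages
```

With fifty site-level alerts from one downed server, the function returns a single line for that host, which is the behavior the shared chat needed.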
Process
| Phase | Dates (2024) | Result |
|---|---|---|
| Introduction and contract | 7 May – 3 Jun | choice among three contractors, sole-proprietor↔sole-proprietor contract for 3 months with renewal, EDM |
| Audit and monitoring | 4 Jun – 6 Jun | 12 VDS documented in the wiki, recommendations document, Zabbix bot in the shared chat |
| First wave of order | 10 Jun – 20 Jun | fleet updated in one pass, swap and a unified log everywhere, store database stabilized |
| Process setup | 14 Jun – 26 Jul | priority rule, tasks recorded in the tracker, recurring monthly update task |
Phases overlap: the process rules were born out of first-wave incidents, and the agreement on regular updates matured by the end of July, so the phase durations do not add up to the calendar span.
Team
- sysadmin-engineer (studio) — audit, fleet update, swap, unified log, database stabilization
- engineer (studio) — audit, estimates, process setup
- Anton Hersun, Xaver Pro — project manager
If you’ve inherited a fleet of servers with no documentation and no previous admin, send us the list of machines and what’s known about them. We’ll look, name the bottlenecks, and come back with a fixed estimate in hours. The fleet audit is free.