
From attack to hardening and migration: security and infrastructure for a B2B platform in 2025–2026

Investigating a server breach through an open Docker API: rootkit and miner removed, login history and an IP whitelist added, then a documented migration in January 2026.

Delivered
Security, incident response and server migration for B2B

In April 2025 we found xmrig on the client’s server, hidden behind a tampered system library, libsystemd-shared. The entry point was a Docker API left open without authentication, and the attacker had left more than a hundred stray layers in overlay2. We killed the miner overnight and wrote an incident report over the weekend. The story had started earlier, though: back in January 2025 a virus on the ClickHouse server forced an emergency move. After that came login history with an IP whitelist (it caught a real credential leak almost at once), export controls, and a planned infrastructure migration in January 2026 with documentation running past a hundred pages.

Snapshot

Industry: Product certification, B2B market analytics
End client: Certificate Analytics
Engagement: Regular support retainer + incident response + project migration
Project type: Response to two incidents, a system-hardening spec, an infrastructure migration
Work done: Incident response (rootkit, miner); investigation and report; security spec (access monitoring, roles, export controls) + an urgent slice; move to a new server under ISPManager
Project date: 28 January 2025 to 29 January 2026 (about a year)
Effort: Urgent security slice (10h, shipped in a day); migration and incident response billed separately
Team: Anton Hersun (lead) and the Xaver Pro infrastructure unit; the roles-and-permissions work was done by our dedicated developer
Tech stack: Docker (cleanup) · ClickHouse + replicator · ISPManager · Laravel
Delivered: Server cleaned of rootkit and miner, login history and IP whitelist in place (they caught access being passed to third parties), infrastructure moved to a new server with full documentation

The problem

The client’s internal platform had grown from an analytics dashboard into a system used by dozens of people on the client’s side and their partners’ side. The more visible a system is, the more people want a look at someone else’s compute and someone else’s data. Both threats showed up within the year.

On 28 January 2025 a virus took hold on the server running ClickHouse and started putting load on the system. The cause was ordinary: the server had gone a long time without updates and had accumulated vulnerabilities. That Saturday we planned and ran an urgent package of work: we temporarily moved document search back to MySQL, migrated everything to a new server, checked that it worked, and proposed a plan of regular preventive maintenance so it would not happen again.

It happened again in April. On 8 April users started complaining that the widget was slow to respond. By night it was clear: the ClickHouse server on Timeweb had been broken into through an open Docker API, and someone was mining cryptocurrency on it. The attacker created a privileged container that mounted the host’s root filesystem, ran commands as root, installed a hidden xmrig miner, and swapped a system library, libsystemd-shared, for a version carrying the signature HEUR:Rootkit.Linux.Processhider so the miner stayed invisible in ps and top.
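As a rough illustration of what to look for after a break-in of this kind, the sketch below scans the JSON array that `docker inspect` prints and flags containers that run privileged or bind-mount the host’s root filesystem. The function name and the sample data are hypothetical; this is not the tooling used during the incident.

```python
import json


def risky_containers(inspect_json: str) -> list[str]:
    """Return names of containers that are privileged or bind-mount host '/'.

    `inspect_json` is the JSON array printed by `docker inspect <ids...>`.
    """
    flagged = []
    for container in json.loads(inspect_json):
        host_config = container.get("HostConfig") or {}
        privileged = bool(host_config.get("Privileged"))
        mounts_host_root = any(
            mount.get("Type") == "bind" and mount.get("Source") == "/"
            for mount in container.get("Mounts") or []
        )
        if privileged or mounts_host_root:
            # Docker reports names with a leading slash, e.g. "/web".
            flagged.append(container.get("Name", "<unnamed>").lstrip("/"))
    return flagged


if __name__ == "__main__":
    # Hypothetical sample: one ordinary container, one of the kind described above.
    sample = json.dumps([
        {"Name": "/web", "HostConfig": {"Privileged": False}, "Mounts": []},
        {"Name": "/unknown", "HostConfig": {"Privileged": True},
         "Mounts": [{"Type": "bind", "Source": "/", "Destination": "/host"}]},
    ])
    print(risky_containers(sample))
```

A clean result here does not prove the host is clean: a process-hiding rootkit lives outside Docker’s own view, so a check like this only narrows down the entry point.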

Why this is hard. For an internal B2B platform the real danger is not an outage, it’s a quiet compromise. Ransomware is obvious right away: the business stops. A miner with a rootkit can eat the processor for months while hiding from standard tools. When you finally find it, the owner asks “what else could they have gotten?” and without a prepared infrastructure there is no answer. Deleting files doesn’t settle it. You need constant monitoring so the next incident isn’t discovered by accident, and you need to see who logs in and from where. And for the day the server has to be rebuilt anyway, a fresh copy of the data has to be on hand.

How we did it

Response to the April incident. We killed the miner and the rootkit overnight and restored service by the next morning. A fresh server backup was ready in case hidden changes turned up. Over the weekend we ran a full investigation and found the swapped system library, the hidden miner, and more than a hundred unused layers in /var/lib/docker/overlay2, all traces of the attacker’s work. We removed them, restored the original library, closed the entry point, and rotated the keys. We logged all of it in an internal incident report and handed it to the client. The response pushed that month over its planned budget: 15 hours instead of 5. We flagged it immediately and covered the overrun from other periods of the retainer: that’s our risk to carry, not a surprise on the invoice.

What grew out of the incident. After April we moved routine server work (monitoring with alerts, backups, scheduled server upgrades) into a regular server-support arrangement, which we describe in a separate case study. For the security story one effect matters: since June 2025, load and disk alerts go straight into the working chat, so the next anomaly will be seen first by automation, not by users whose work has started to slow down.
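To make the idea of load and disk alerts concrete, here is a minimal sketch of a threshold check. The thresholds, function names, and message wording are assumptions for illustration; the monitoring actually deployed is not described in this case, and delivery to the chat is left out on purpose.

```python
import os
import shutil


def collect_alerts(disk_used_fraction: float, load_per_cpu: float,
                   disk_limit: float = 0.85, load_limit: float = 1.5) -> list[str]:
    """Return human-readable alerts for metrics that crossed their limits."""
    alerts = []
    if disk_used_fraction >= disk_limit:
        alerts.append(f"disk {disk_used_fraction:.0%} used (limit {disk_limit:.0%})")
    if load_per_cpu >= load_limit:
        alerts.append(f"load {load_per_cpu:.2f} per CPU (limit {load_limit:.2f})")
    return alerts


def read_host_metrics(path: str = "/") -> tuple[float, float]:
    """Read disk usage and the 1-minute load average (Unix only)."""
    usage = shutil.disk_usage(path)
    load_1min = os.getloadavg()[0]
    return usage.used / usage.total, load_1min / (os.cpu_count() or 1)


if __name__ == "__main__":
    for message in collect_alerts(*read_host_metrics()):
        # A real setup would post this to the team chat instead of printing.
        print(message)
```

Run from a scheduler every few minutes, a check like this would have surfaced a miner’s sustained CPU load well before users noticed a slow widget.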

The security spec, and why we deliberately broke it up. In November 2025 the client came with a request about access control on the portal. We wrote a full technical spec with an estimate and then said plainly: the scope had come out larger than we’d want for a single pass. The heaviest block was control over multiple concurrent sessions tied to a device or a city: the standard feature set has nothing like it, and you can’t do it without rewriting authentication. Rather than dragging the big block along whole, we proposed cutting out what needed closing now and shipping it immediately as a small separate urgent spec of 10 hours. We split urgent from full on purpose: otherwise the urgent part never reaches delivery.

Why a suspicious-activity detector can’t be built on IP. During the spec discussion a detail surfaced from the client’s own practice: most users have a dynamic IP that changes over a day or a week, and in the logs it looks as if one person is logging in from a hundred addresses. So the heuristic for suspicious activity should be built on the pairing of “device fingerprint plus location (country, city)”, not on a bare IP. And one more thing: closing the browser doesn’t end the session, so a concurrent-session limit has to survive the “forgot to log out and logged in from another device” case — it has to wait for a timeout. Those caveats are exactly why full session control costs more and was lifted out of the urgent slice.
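The reasoning above can be sketched in a few lines. The record fields, the threshold of three origins, and the 30-minute idle timeout are assumptions chosen for illustration, not values from the spec.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LoginEvent:
    user: str
    ip: str
    fingerprint: str   # device fingerprint (hypothetical field)
    country: str
    city: str
    last_seen: datetime


def distinct_origins(events: list[LoginEvent], user: str) -> set[tuple[str, str, str]]:
    """Group by device and location; the IP is deliberately ignored."""
    return {(e.fingerprint, e.country, e.city) for e in events if e.user == user}


def is_suspicious(events: list[LoginEvent], user: str, max_origins: int = 3) -> bool:
    """Flag a user seen from more device-and-location pairs than allowed."""
    return len(distinct_origins(events, user)) > max_origins


def active_sessions(events: list[LoginEvent], user: str, now: datetime,
                    idle_timeout: timedelta = timedelta(minutes=30)) -> list[LoginEvent]:
    """A session counts as live until it has been idle past the timeout,
    because closing the browser does not end it."""
    return [e for e in events if e.user == user and now - e.last_seen < idle_timeout]
```

With this grouping, a hundred logins from one laptop on a dynamic IP count as a single origin, while the same account appearing on several devices in different cities stands out.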

Login history and IP whitelist: what they showed right away. Our roles-and-permissions developer built the urgent slice in a day. A “Login history” menu appeared: every user login is written to a table with its IP and is visible to the administrator. The user editor gained an IP whitelist field and an “only allow from permitted addresses” checkbox. A separate column holds shared company-level addresses, so the same server IP doesn’t have to be typed in for every user. The super-admin role stays unrestricted, and separate permissions were added for setting and clearing restrictions in bulk. The result came fast: from the login history the client recorded real cases of access being passed to third parties and opened an internal investigation. The tool paid for itself in the first days: it showed something that simply wasn’t visible without it.
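The whitelist rules described above fit in one small function. This is a sketch under assumptions: the parameter names are invented, and accepting CIDR ranges alongside single addresses is our addition for illustration. The production code lives in the Laravel application and is not shown here.

```python
import ipaddress


def login_allowed(ip: str, *, is_super_admin: bool, restrict_to_whitelist: bool,
                  user_whitelist: list[str], company_whitelist: list[str]) -> bool:
    """Decide whether a login from `ip` passes the whitelist rules."""
    # Super-admins stay unrestricted, and so do users without the checkbox set.
    if is_super_admin or not restrict_to_whitelist:
        return True
    address = ipaddress.ip_address(ip)
    # Per-user entries and shared company-level entries are checked together.
    # Each entry may be a single address or a CIDR range (assumption).
    return any(
        address in ipaddress.ip_network(entry, strict=False)
        for entry in user_whitelist + company_whitelist
    )
```

For example, a user with the checkbox set, an empty personal list, and a company entry of `203.0.113.0/24` can log in from `203.0.113.7` but not from `198.51.100.9`.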

Roles and export controls. Next we extended the permissions system: we added roles matched to real needs and locked exports behind permissions (general reports, individual reports, multi-reports, and the database of new market participants). We also put selecting and copying data out of tables behind a permission. We warned honestly: copy protection is shallow, and anyone who can read the server response in the browser console will still get the data out. For filtering out the obvious leaks it’s enough, and the client chose this level knowingly. For an analytics platform, export and copy are the main channels of a possible leak, so controlling them matters more than it looks.
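A permission gate of this kind might look like the sketch below. The role names and the role-to-permission mapping are hypothetical; only the export types come from the description above.

```python
from enum import Enum


class Export(Enum):
    GENERAL_REPORT = "general_report"
    INDIVIDUAL_REPORT = "individual_report"
    MULTI_REPORT = "multi_report"
    NEW_PARTICIPANTS_DB = "new_participants_db"
    TABLE_COPY = "table_copy"


# Hypothetical roles; the real platform defines its own.
ROLE_EXPORTS: dict[str, set[Export]] = {
    "viewer": set(),
    "analyst": {Export.GENERAL_REPORT, Export.INDIVIDUAL_REPORT},
    "manager": set(Export),
}


def can_export(role: str, kind: Export) -> bool:
    """Unknown roles get nothing: deny by default."""
    return kind in ROLE_EXPORTS.get(role, set())
```

An export check only holds if the server enforces it before building the file. The copy restriction is different: the browser has already received the data by then, which is why that protection stays shallow.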

Migration to a new server (January 2026), with full documentation. By January 2026 the main server had hit its limits: disk alerts were constant. We picked a plan with headroom (200 GB, ISPManager panel) and recorded the server’s initial state: panel and kernel version, the set of add-ons (scheduled antivirus, monitoring, file manager, site builder), domain limits, and the license expiry. On 20 January we prepared a migration document of more than a hundred pages and handed it to the host’s support team; the move followed that document and ran under supervision. On 24 January at 9:00 Moscow time we stopped the main server and started the move. We finished on 25 January, ran the checks, and along the way sped up the declarations table in ClickHouse. We didn’t delete the old server straight away: for several days we watched for errors and confirmed that parsing ran normally, and only on 29 January did we shut it down and remove it. One caveat: the server with ClickHouse, the replicator, and the S3 dumper stayed an independent machine, was not part of the migration, and kept running.

Results

| Metric | Value |
| --- | --- |
| Incident response | Two incidents (January and April 2025) closed; miner and rootkit removed, report logged |
| Recovery speed | Miner killed overnight, service restored by morning |
| Access control | Login history + IP whitelist (10h, shipped in a day); third-party access leaks surfaced at once |
| Leak control | Permissions on report and market-participant database exports, restriction on copying out of tables |
| Infrastructure migration | 24–25 Jan 2026, planned, following 100+ pages of documentation; old server removed 29 Jan |
| Calendar | 28 January 2025 to 29 January 2026 |

After two incidents the server is clean and the entry points are closed. Access control went from theoretical to working: in the first days, login history and the IP whitelist showed a real credential leak. Export and copying are locked behind permissions. In January 2026 the platform moved to a server with spare capacity, following detailed documentation and without data loss.

Process and timeline

| Stage | Period | Result |
| --- | --- | --- |
| Incident on the ClickHouse server | 28 January 2025 | Emergency move, temporary return of search to MySQL, prevention plan |
| Incident: compromise via Docker API | 8 April 2025 | Miner and rootkit found and removed overnight |
| Incident report | 14 April 2025 | Investigation over the weekend, document handed to the client |
| Security spec (full) | November 2025 | Estimate showed a large scope → decided to break it up |
| Urgent security slice | 18–19 November 2025 | Login history + IP whitelist in a day; caught an access leak |
| Roles and export controls | November 2025 | Permissions on export and copying, new roles |
| Migration prep | 16–20 January 2026 | Server analysis, migration document |
| Migration to new server | 24–29 January 2026 | Move, checks, removal of the old server |

Team

  • Anton Hersun, Xaver Pro, lead: coordinating incident response, formalizing the report, signing off the security spec, running the migration.
  • Infrastructure unit, a separate server-support team inside Xaver Pro. Theirs was the server cleanup after the incident and the migration to the new server. Server work and panel development are different disciplines, which is why the team is specialized.
  • Roles-and-permissions developer built the urgent security slice (login history, IP whitelist, permissions on export and copying). Access on the platform is owned by a dedicated person, not a general pool.

Screenshots and materials

To be added in a separate pass: we need monitoring screenshots, login-history captures, and a before/after infrastructure diagram, all run through privacy processing.

If something on your server looks off (unexplained CPU load, unknown containers, unfamiliar processes), send us the output of ps aux, docker ps -a, and the last week of logs. We’ll tell you whether to go looking for a crypto-miner and rootkit or whether it’s just a broken config. The urgent review is free.

Check your server →

