Parsing the Russian registry of certificates and declarations (FSA): engineering resilience against a hostile source

FSA (Rosaccreditation) declarations-and-certificates parser, inherited as legacy, moved to Laravel Horizon queues with proxy rotation. Four years dodging bans.

The whole platform stands on a single source it can’t control. The state registry of declarations and certificates serves data however its server feels like today: yesterday a date returned a thousand documents, today suddenly ten million; on Friday the proxies worked, on Saturday every machine was banned; in December the server infrastructure changed and the parser went silent. And the client’s analytics has to stay complete and fresh, whatever the weather on the other side. The client put it plainly: the parser is the foundation. This case is about keeping the foundation alive while the source changes, bans, and breaks four years running.

Snapshot

End-client sector: Product certification, B2B market analytics
End client: Certificate Analytics, the internal platform of a group of companies
Engagement: Ongoing support of the platform core; a steady stream of responses to changes in an external source
Project type: A parser of the Russian registry of declarations and certificates (FSA / Rosaccreditation) as the source engine of the whole system
Work done: Taking over a legacy engine, moving it to a modern architecture, then years of support and response to source incidents
Project date: November 2021 to present, more than four years
Effort: The parsing farm was formalized within a spec bundle; the bulk of the work runs through retainer support and incident response
Team: Anton Hersun (project lead; took the parsing over as legacy, took it apart, and moved it to Laravel queues himself); the direction is now supported inside the team, and the latest parsing orders were partly picked up by the widgets-and-parsers developer
Tech stack: Laravel 5.5.50 · Laravel Horizon · queues and jobs · Redis · Docker · ClickHouse · in-house HTTPS-proxy rotation · a mini-farm of cheap Russian servers
Delivered: An FSA-registry parser on queues, incremental resume from the point of failure, daily collection statistics, automatic reaction to outages

The problem

The platform lives on public data from the Russian registry: conformity declarations and certificates that the FSA publishes openly. There’s no proper public API. There is a web listing by date: the parser asks it “give me all documents for such-and-such date” and reads the response page by page. The source is state-run, no outsider controls its behavior, and it can change that behavior any day without warning.

This engine came to us as someone else’s code. Before our team, an outside developer handled the parsing and ran it by hand: some things got collected, some fell off, and you couldn’t tell where exactly it stopped, because there were no real statistics. In November 2021 the client dropped the previous arrangement and handed parsing to the team in full. Anton set out the plan right away: rewrite the whole system for full automation and add daily statistics showing how much was parsed and where, what wasn’t parsed, and what fell off or hit a ban, so we could react fast and stay aware. The direction was handed on inside the team only after the legacy had been taken apart by hand and moved onto a single architecture.

Why this is hard

The hard part isn’t parsing HTML. The hard part is that the other side behaves like a live opponent, even with no ill intent behind it. Over four years the registry managed to: start returning “ten million certificates” to any request and hang our services with it; close itself off to foreign proxies; ban every parser machine at once on a Friday evening; die mid-walk through the pages and quietly short us half the declarations; change its server infrastructure so the old parser went silent; and finally impose a hard listing limit of twenty pages per request. Each of these changes on its own stops collection, and you learn of it either from the silence in the logs or from an analyst’s complaint that the data hasn’t refreshed for a week. So we built the architecture around one principle: the method of fetching the data can change as much as it likes, while the accumulated database and the storage schema stay intact.

How we did it

1. A parser on Laravel queues under Horizon, not a cron script.
The collection logic went into Laravel jobs. Horizon runs on top as the queue manager, with Redis underneath. The queue gives a job restart on failure, a re-collection of a specific period on demand without waiting for the schedule, and one shared panel instead of scattered monitoring. When the registry shorts us documents, a re-run of the needed month simply goes into the queue, and you can watch the remainder melt from tens of thousands down to zero.
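The queue behavior described here (restart on failure, re-collection of a chosen period on demand) can be sketched in a few lines. This is an illustration in Python, not the project's Laravel code: `CollectJob`, `CollectionQueue`, and the three-attempt limit are assumed names and values, and an in-memory deque stands in for the Redis-backed queue that Horizon manages.

```python
from collections import deque
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CollectJob:
    """One unit of work: collect every document for a single date."""
    day: date
    attempts: int = 0


class CollectionQueue:
    """In-memory stand-in for a Redis-backed job queue (illustration only)."""

    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts  # assumed retry limit
        self.pending = deque()
        self.failed = []

    def enqueue_range(self, start: date, end: date) -> None:
        """Queue one job per date, e.g. to re-collect a whole month on demand."""
        day = start
        while day <= end:
            self.pending.append(CollectJob(day))
            day += timedelta(days=1)

    def run(self, collect) -> None:
        """Process jobs; a failed job goes back to the queue until attempts run out."""
        while self.pending:
            job = self.pending.popleft()
            job.attempts += 1
            try:
                collect(job.day)
            except Exception:
                if job.attempts < self.max_attempts:
                    self.pending.append(job)   # retry later
                else:
                    self.failed.append(job)    # give up, keep for inspection
```

Re-running a month is then a single `enqueue_range` call with that month's first and last dates. Jobs that exhaust their attempts stay in `failed` instead of disappearing silently.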

2. Incremental collection with resume exactly at the point of failure.
The parser pulls documents one date at a time every day. The donor site regularly dies mid-walk through the pages: after three failed attempts the script stops, and if the break fell in the middle, the remainder doesn’t reach the database. To keep this from becoming a silent data loss, the walk state is saved, and re-collection picks up from the interrupted point rather than scanning the range again. The hole that was leaking declarations closed. The load on the source dropped as a side effect.
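A minimal sketch of resume-at-the-point-of-failure, assuming the walk state is simply the last fully stored page per date. The JSON state file, `WalkState`, `walk_date`, and the `fetch_page` and `store` callables are hypothetical names for illustration; the real parser keeps its state in its own storage.

```python
import json
from pathlib import Path


class WalkState:
    """Persists the last fully stored page per date (hypothetical JSON file)."""

    def __init__(self, path):
        self.path = Path(path)
        self.pages = json.loads(self.path.read_text()) if self.path.exists() else {}

    def next_page(self, day: str) -> int:
        """First page that has not been stored yet for this date."""
        return self.pages.get(day, 0) + 1

    def mark_done(self, day: str, page: int) -> None:
        self.pages[day] = page
        self.path.write_text(json.dumps(self.pages))


def walk_date(day, fetch_page, store, state, max_attempts=3):
    """Walk the listing for one date, starting after the last saved page.

    Returns True when the listing is exhausted, False when the source kept
    failing. In the second case the saved state marks where to resume.
    """
    page = state.next_page(day)
    while True:
        for _ in range(max_attempts):
            try:
                docs = fetch_page(day, page)
                break
            except ConnectionError:
                continue
        else:
            return False  # every attempt failed; resume point is already saved
        if not docs:
            return True   # empty page: nothing left for this date
        store(docs)
        state.mark_done(day, page)  # checkpoint only after the page is stored
        page += 1
```

The checkpoint is written after `store` succeeds, so a crash between the two repeats at most one page and never skips one.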

3. Daily collection statistics as an early detector.
From day one the parser has had daily statistics built in: how many documents are on the site, how many are in the database, and what’s missing. This is what catches anomalies before the client does. A suspiciously even, small count of declarations for July flagged the data loss immediately. A “this many in the database, this many on the site” mismatch shows the lag in real time. Without this panel, every source incident would become a week of blindness.
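The “on the site vs. in the database” comparison reduces to a per-date difference. A hedged sketch follows; `DayStat`, `build_report`, `days_to_recollect`, and the sample counts are invented for illustration and are not the project's panel.

```python
from dataclasses import dataclass


@dataclass
class DayStat:
    day: str      # ISO date, e.g. "2023-07-01"
    on_site: int  # count the registry reports for that date
    in_db: int    # count actually stored

    @property
    def missing(self) -> int:
        return max(self.on_site - self.in_db, 0)


def build_report(site_counts, db_counts):
    """One row per date the site reports, oldest first."""
    return [
        DayStat(day, site_counts[day], db_counts.get(day, 0))
        for day in sorted(site_counts)
    ]


def days_to_recollect(report, tolerance=0):
    """Dates whose gap exceeds the tolerance; candidates for a re-run."""
    return [row.day for row in report if row.missing > tolerance]
```

Feeding the output of `days_to_recollect` straight into the re-collection queue closes the loop between detection and repair.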

4. In-house HTTPS-proxy rotation plus a mini-farm of cheap servers.
When Russian resources closed off foreign proxies, the European channels fell off and the parser on the main server slowed to a crawl. We built the fix from cheap Russian machines: stood up extra servers at a token price, installed the parser on them, and spread the load, which also relieved the main server. On top runs an in-house HTTPS-proxy rotation system with temporary exclusion of banned addresses. That same system later pulled collection out of the heaviest ban.
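Rotation with temporary exclusion can be sketched as a round-robin over proxies that skips any address still in its cooldown window. `ProxyRotator`, the ten-minute cooldown, and the injectable clock are assumptions made for this example; the in-house system's actual rules are not described in the case.

```python
import time


class ProxyRotator:
    """Round-robin over proxies, skipping those temporarily excluded."""

    def __init__(self, proxies, cooldown=600.0, clock=time.monotonic):
        self.proxies = list(proxies)
        self.cooldown = cooldown   # seconds a banned proxy sits out (assumed)
        self.clock = clock         # injectable so the logic is testable
        self.banned_until = {}
        self._next = 0

    def get(self):
        """Return the next usable proxy, or None if all are excluded."""
        now = self.clock()
        for _ in range(len(self.proxies)):
            proxy = self.proxies[self._next]
            self._next = (self._next + 1) % len(self.proxies)
            if self.banned_until.get(proxy, 0.0) <= now:
                return proxy
        return None

    def report_ban(self, proxy):
        """Exclude a proxy for the cooldown period after the source bans it."""
        self.banned_until[proxy] = self.clock() + self.cooldown
```

A `None` result means every address is in cooldown. In that case the caller should pause collection rather than hammer the source from banned addresses.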

5. A parsing farm and automatic re-walk.
In autumn 2022 the scattered machines were brought into a single farm and formalized as a separate spec: three VPS, block bypass, automatic re-parse after failures, and detailed monitoring. It went in one bundle with an adjacent task, so there’s no separate figure for it in this case. The point of the farm is simple: a ban or a crash of one machine no longer stops collection, the queue redistributes, the re-walk starts on its own.

Incidents and response

The engine’s real value isn’t tested in the quiet weeks, but on the days the source breaks. The chronicle for the Russian registry over four years reads like this.

April 2022. “Ten million certificates”. The registry started answering any date request that it had found 10,000,000 certificates. The parser believed it: it stepped through the first pages, collected real documents, and then walked off into endless emptiness looking for records that don’t exist. The services lost their minds over this, and users stopped logging in. We patched it: the walk was capped at the first hundred pages, which in practice is enough to collect every existing record without wandering through empty ones.
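The patch amounts to no longer trusting the reported total when deciding how far to walk. A small sketch, assuming a page size of one hundred documents and a hypothetical helper name `pages_to_walk`:

```python
import math


def pages_to_walk(reported_total, page_size=100, max_pages=100):
    """Pages to request for one date, capped no matter what the source claims."""
    if reported_total <= 0:
        return 0
    return min(math.ceil(reported_total / page_size), max_pages)
```

With these assumed defaults, a claimed total of ten million gives one hundred pages instead of one hundred thousand. An honest total of a thousand documents still gives ten.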

June 2022. Closing off to foreign proxies. Russian resources stopped accepting requests from abroad, and the parser stalled. Over the weekend we stood up the mini-farm of Russian servers and brought collection back. We gathered all certificates and declarations for May, and caught June up right after.

July 2023. Silent loss of declarations. An analyst noticed a suspiciously even, small count of declarations. The cause: the donor site dies mid-walk, sometimes serving fifty thousand declarations in one go and then crashing. That’s where the long work on resume resilience began.

December 2023. Ban of all machines. On a Friday evening, the declarations service banned every parser machine at once. Over the weekend we rewrote the parsing system to run in several threads on a new proxy system: four machines, two threads through proxies on each. December’s data was brought fully up to date (100%), and from there we held the pace on the new scheme.

October 2024. The source went silent. The declarations and certificates services returned no data to the parser for almost a week, while the site opened fine by hand. Anton found a change on the FSA server side, adjusted the parser for it, and restarted collection.

December 2024. Infrastructure change. The registry moved to a new server infrastructure, and the old parser stopped working against it. We reworked the parser for the new source. 23,000 documents went into the re-collection queue.

March 2026. The twenty-page limit. The registry imposed a hard cap: no more than twenty pages of a hundred documents, that is two thousand rows per request. The old principle of “take everything for a date in one walk” stopped working. We built a new algorithm: once an hour we step through the fresh documents up to the limit, catch the end-of-listing error, and fill the gaps with a variable filter-based walk. The missing days were caught up right after.
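One way to fill gaps under a hard listing cap is to narrow the filter until each request fits under the cap. The sketch below is a guess at the general shape, not the project's algorithm. `collect_range` and the numeric filter key (for instance a time-of-day window or a registration-number range) are assumptions, and the cap is computed from the page limit and page size quoted above.

```python
MAX_PAGES = 20    # listing limit imposed by the source
PAGE_SIZE = 100   # documents per page
CAP = MAX_PAGES * PAGE_SIZE


def collect_range(query, lo, hi):
    """Collect all documents whose filter key falls in [lo, hi).

    `query(lo, hi)` is a hypothetical callable that returns at most CAP
    documents. A result that hits the cap is treated as truncated, and the
    range is split in half and retried.
    """
    docs = query(lo, hi)
    if len(docs) < CAP or hi - lo <= 1:
        # Fits under the cap, or the range cannot be narrowed any further.
        return docs
    mid = (lo + hi) // 2
    return collect_range(query, lo, mid) + collect_range(query, mid, hi)
```

The cost is a few extra requests on busy ranges, while quiet ranges still take a single request each.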

Between the big events ran small adaptation. When a new field “Person who accepted the declaration” appeared in the registry, it was added to collection without rewriting the receiver. And a client request to collect all technical regulations in the certificates exposed an old quirk of the legacy parser: it took only the first TR CU record from a document. The re-collection touched roughly 259,000 certificates, and without a separate spec from the client that task stayed on hold. The plan to connect the new REAEC registry, opened in 2026, lives in the same mode: a 13-hour plan was prepared the day after the request, with the start left to the client.

Results

Metric: Value
Role in the system: The source engine of the platform; all RU analytics rests on it
Architecture stack: Laravel 5.5.50 · Horizon · Redis · Docker · ClickHouse · in-house HTTPS-proxy rotation · mini-farm of servers
Major source incidents handled: seven in four years (ban, infrastructure change, listing limit, listing anomalies) plus a series of small adaptations
Reaction speed to a ban (12.2023): over a weekend the scheme was rewritten to 4 machines × 2 threads; December’s data brought 100% up to date
Reaction speed to infrastructure change (12.2024): parser reworked, re-collection queue of 23,000 documents
Response to the listing limit (03.2026): new hourly-collection algorithm with gap fill, missing days caught up
Work calendar: November 2021 → ongoing, more than four years

The engine was never rewritten from scratch as a whole. It grew from someone else’s legacy to today’s architecture through a chain of incidents, and each time only the part that hit new source behavior changed: now the page walk, now the transport through proxies, now the algorithm for getting around the limit. The storage schema and the data accumulated over the years stayed in place throughout. This same infrastructure of queues and proxy rotation later became the basis for parsers of foreign registries, but the Russian engine was first, and it remains the core.

Team

  • Anton Hersun, Xaver Pro, project lead, architectural decisions on the parser, spec formalization. He took the Russian parsing apart personally: took on someone else’s legacy with no real statistics or automation, studied it, rewrote it for full automation with daily monitoring, and moved it onto Laravel queues under Horizon with in-house proxy rotation. It’s rare for our project lead to do the build hands-on, but handing off tangled legacy to someone else without first taking it apart personally wasn’t an option.
  • Widgets-and-parsers developer picked up part of the latest parsing orders, once the direction already stood on a modern architecture. Continuity is held inside the team: a module written years ago is maintained by someone from inside the direction, not by “a new person from scratch”.

Screenshots and materials

To be added in a separate pass. Possible candidates: the daily collection-statistics panel (documents on the site vs. database), the Horizon queues panel, a simplified diagram of the parser job with walk resume.

If your business depends on a single external data source you don’t control, send us a description of the collection scheme. We’ll say what breaks first under a ban or a source change and which safeguard is the cheapest. We look for free.

Send us your collection scheme →

