
Parsing three government certification registries (KG/KZ/BY): around 260 hours on Laravel queues

Parsers for the Kyrgyz, Kazakh, and Belarusian certification registries: Laravel Horizon queues, proxy rotation, incremental collection by next number.

260h delivered
Parsing KG/KZ/BY certification registries on Laravel

A foreign government certification registry owes you nothing. Today it serves data, tomorrow it changes the URL, closes the public list, adds a geo-block, updates the format, or bans all your machines at once, and nobody warns you. An internal analytics platform has a simple choice: either a resilient parser on queues, with proxy rotation and incremental collection, or analytics on that registry stalls within a week, and you learn about it from the silence in the logs.

Snapshot

End-client industry: product certification, B2B market analytics
End client: Certificate Analytics, an internal platform of a group of companies
Engagement: retainer, a stream of orders as new registries appear and old sources change
Project type: a long series of parsing work on foreign government certification registries (Kyrgyzstan, Kazakhstan, Belarus)
Work done: nine parsing orders, plus support of existing parsers and a collection control panel
Project date: November 2022 to present, more than three years of active work
Effort: around 260 hours on agreed estimates; the actual volume is higher because small refinements were done without a separate estimate
Team: Anton Hersun (project manager; took the parsing on as legacy, untangled it, and moved it to a modern architecture himself), then the widgets-and-parsers developer, who has run the track since
Tech stack: Laravel 5.5.50 · Laravel Horizon · queues and jobs · Redis · Docker · ClickHouse · an in-house HTTPS proxy rotation system with random User-Agents
Delivered: parsers for three national registries (KG/KZ/BY), a collection control panel, automatic incremental collection on schedule and on demand

The problem

The platform works with government registries of product conformity certification: public data on issued declarations and certificates. The sources sit on the government sites of Kyrgyzstan, Kazakhstan, and Belarus, with collection running in parallel against the Russian registry. None of them has a proper public API. One has an API with token auth of unknown lifetime, one has only a web interface with sequential number iteration, one regularly changes its response structure or closes public access entirely.

The client already had an old parser running, written before us. Part worked, part didn’t, and there was no way to tell where it fell over: there were almost no logs, the Kazakhstan collection was held together with tape and caught up by hand on every ban. Anton sat down with the source, worked out what to keep and what to rewrite, and gradually moved it all onto a single architecture: Laravel jobs, queues under Horizon, in-house proxy rotation. Only then did he hand the track to a colleague.

The client placed sequential orders as new sources appeared or old ones changed:

  • November 2022: a parsing farm of three servers, working around Kazakhstan’s blocks, automatic re-parsing
  • early 2024: parsing Kyrgyz declarations by iteration after the public registry closed, then KG certificates by a country-probability matrix
  • 2024: a rewritten Belarusian registry, KZ parser refinements
  • 2025: a collection control panel and connecting the shared EAEU registry, moving to a new Kazakh registry via API
  • late 2025: a buffer table for documents without a number, KZ field refinements, identification acts

Why this is hard. The source changes without warning, and every change costs the platform data. Kyrgyzstan closed its public list, so we had to switch to number iteration. FSA, the Russian registry, banned every parser machine at once on a Friday evening; over the weekend we rewrote the system to run in several threads through proxies and brought the month’s data back to 100% up to date. Kazakhstan moved to a new registry with token auth, and the old web-page parser stopped working. Belarus added a geo-block that even caught a Belarusian proxy. The Russian registry set a limit of 20 pages per request. And so it goes every year. So we built the architecture from the start for a changing source: the data-fetch method can change as much as it likes, while the storage schema and the accumulated data stay put.

How we did it

1. The parser is not a cron script, it’s a Laravel queue under Horizon.
Bash with cron was rejected at once: no retry, no monitoring, no way to launch collection for a specific body on request. An in-house job system in Python was ruled out too, since it would duplicate the existing Laravel stack. We moved the collection logic into Laravel jobs. Horizon runs on top as the queue manager, Redis underneath. The queue gives us a restart after a worker crash, on-demand queuing of a specific body’s parser without waiting for the schedule, and a single Horizon panel instead of scattered monitoring. A new parser is one new job class.
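To make the retry argument concrete, here is a minimal in-memory sketch of what a queue adds over a cron script. It is an illustration only: it is written in Python for compactness, it is not Horizon's API or the production Laravel code, and the names Job, tries_left, and work are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Job:
    name: str
    run: Callable[[], None]
    tries_left: int = 3  # assumed retry budget, not a value from the project


def work(queue: deque) -> list[str]:
    """Drain the queue; re-queue a failed job until its tries run out."""
    failed = []
    while queue:
        job = queue.popleft()
        try:
            job.run()
        except Exception:
            job.tries_left -= 1
            if job.tries_left > 0:
                queue.append(job)  # retry later, behind the other jobs
            else:
                failed.append(job.name)  # give up and report the job
    return failed
```

On-demand collection for one body is then just one more Job appended to the same queue, which is the property a cron schedule cannot offer.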

2. Incrementality by “next expected number,” not range re-scanning.
A Kyrgyz declaration’s number structure is fixed: EAEU KG417/049.D.0006245, where KG417/049.D. is the static code of the certification body and 0006245 is the sequential number, which grows in steps of 1. Scanning the whole range every day is pointless: for each body we store the last collected number and check only the next 10–20 values (the window is configurable). The number of requests to the source drops by two orders of magnitude, and with it the ban risk. If a body has a numbering gap or a format change, the panel lets you set the start number for iteration by hand.
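The "next expected number" idea can be sketched as a pure function that turns the last collected number into a short list of candidates. This is a sketch under assumptions: the helper names split_number and next_candidates are hypothetical, and the only format it relies on is the example number quoted above (a prefix ending in a dot, followed by a zero-padded counter).

```python
import re

# Prefix up to the last dot, then the zero-padded sequential counter.
NUMBER_RE = re.compile(r"^(?P<prefix>.+\.)(?P<seq>\d+)$")


def split_number(number: str) -> tuple[str, int, int]:
    """Split a number into (body prefix, sequence value, counter width)."""
    match = NUMBER_RE.match(number)
    if match is None:
        raise ValueError(f"unexpected number format: {number!r}")
    seq = match.group("seq")
    return match.group("prefix"), int(seq), len(seq)


def next_candidates(last_number: str, window: int = 10) -> list[str]:
    """Return the next `window` numbers after the last collected one."""
    prefix, seq, width = split_number(last_number)
    return [f"{prefix}{n:0{width}d}" for n in range(seq + 1, seq + 1 + window)]
```

With a window of 10, each body costs ten requests per run regardless of how many numbers it has issued in total.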

3. Certificates: iteration by number and by country.
With declarations everything is linear, but KG certificate numbers embed the country-of-manufacture code (EAEU KG417/044.CN.02.02191, where CN denotes China), so the next number can’t be derived from the sequence alone. For each of the 28 bodies we therefore compute a separate country-probability matrix and sort it from most frequent to rarest. Starting from the last known number, the parser tries the countries in matrix order. If it finds nothing, it increments the number by one and repeats, up to ten times. It rarely reaches the end of the country list, so there are almost no wasted requests. Some bodies keep their numbers sloppily: a dot here, a slash there, the space after KG dropped. Those we collect in general order without the matrix, and discard the junk documents.
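A minimal sketch of the country-ordered search, assuming the number splits on dots into body, country code, a middle segment, and a zero-padded counter, as in the example above. The meaning of the middle segment ("02") is not stated in the text, so the sketch simply passes it through. The function names and the exists callback (one request to the registry) are hypothetical.

```python
from collections import Counter
from typing import Callable, Optional


def country_order(known_numbers: list[str]) -> list[str]:
    """Rank country codes for one body from most to least frequent."""
    counts = Counter(number.split(".")[1] for number in known_numbers)
    return [code for code, _ in counts.most_common()]


def find_next(body: str, middle: str, last_seq: int, width: int,
              countries: list[str], exists: Callable[[str], bool],
              max_steps: int = 10) -> Optional[str]:
    """Try every country for each of the next `max_steps` counters."""
    for seq in range(last_seq + 1, last_seq + 1 + max_steps):
        for code in countries:  # most frequent first, so hits come early
            candidate = f"{body}.{code}.{middle}.{seq:0{width}d}"
            if exists(candidate):
                return candidate
    return None  # nothing found within the allowed number of steps
```

Because the list is sorted by frequency, a body that certifies mostly Chinese goods usually resolves on the first request per counter.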

4. An in-house HTTPS proxy rotation system with random User-Agents.
Third-party proxy services were rejected: expensive, vendor dependence, and unstable on Russian sources. We built an internal system: a pool of HTTPS proxies with health checks, a random proxy per request, temporary exclusion of “toxic” ones (those that got a 403 or 429), periodic User-Agent rotation. All the KG, KZ, and BY parsers reuse this system. When FSA banned every machine in late 2023, it was on this system that we stood up a scheme of four machines with two threads each through proxies over the weekend and brought December’s data fully up to date. When token auth through a proxy was later needed for the new Kazakh API, the system picked it up with no changes.
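The pool behaviour described above (random pick, temporary exclusion after a 403 or 429) can be sketched in a few lines. This is not the in-house system itself: the class name, the cooldown length, and picking a fresh User-Agent on every request are assumptions made for the illustration.

```python
import random
import time


class ProxyPool:
    def __init__(self, proxies, user_agents, cooldown_seconds=600.0,
                 clock=time.monotonic):
        self._proxies = list(proxies)
        self._user_agents = list(user_agents)
        self._cooldown = cooldown_seconds  # assumed value, tune per source
        self._clock = clock
        self._blocked_until = {}  # proxy -> time when it may be used again

    def pick(self):
        """Return a random healthy (proxy, user_agent) pair."""
        now = self._clock()
        healthy = [p for p in self._proxies
                   if self._blocked_until.get(p, 0.0) <= now]
        if not healthy:
            raise RuntimeError("no healthy proxies available")
        return random.choice(healthy), random.choice(self._user_agents)

    def report(self, proxy, status_code):
        """Exclude a proxy for the cooldown period after a 403 or 429."""
        if status_code in (403, 429):
            self._blocked_until[proxy] = self._clock() + self._cooldown
```

A parser calls pick() before each request and report() after it, so a proxy that gets banned drops out of rotation without any manual step and returns once the cooldown expires.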

5. Adapting to an API change without rewriting the receiver.
Kazakhstan moved to a new registry with the API techreg.gov.kz/Synergy/rest/api/ and token auth in the header. The access turned out to be weakly protected: public documents were served on a simple role check. That’s what we used, and we scoped the work at 50 hours. The new source’s fields fit the existing database table with no schema migration. The job wrapper stayed the same; only the source inside changed. On launch, collection across 46 sources finished in about ten seconds, with around 32,000 documents queued. The old KZ parser ran in parallel, and deduplication kept documents from doubling. Debugging the new source, on the other hand, stretched across several months: BIN-vs-OKPO confusion (told apart by string length), duplicates, date formats, regulatory-document fields.
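The "same wrapper, different source" idea can be sketched as a small interface plus one receiver that deduplicates by document number. Everything here is hypothetical naming (Source, ListSource, collect); ListSource stands in for a real scraper or API client, and the dict stands in for the database table.

```python
from typing import Iterable, Protocol


class Source(Protocol):
    """Anything that can yield documents as dicts with a 'number' key."""

    def fetch(self) -> Iterable[dict]: ...


class ListSource:
    """Stand-in for a real source (web scraper or token API client)."""

    def __init__(self, docs: list[dict]) -> None:
        self._docs = docs

    def fetch(self) -> Iterable[dict]:
        return iter(self._docs)


def collect(sources: Iterable[Source], store: dict[str, dict]) -> int:
    """Run every source through one receiver; skip numbers already stored."""
    added = 0
    for source in sources:
        for doc in source.fetch():
            if doc["number"] in store:
                continue  # same document seen via another source
            store[doc["number"]] = doc
            added += 1
    return added
```

Running the old and the new source through collect() together is what lets both parsers work in parallel without doubling documents.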

6. Deferred handling of documents without a number (14 h).
Sometimes a document is published in the registry while the number field is empty: the number appears hours or days later. You can’t store such a document in the main table without a number, because you couldn’t tie it to a real declaration later. We built a buffer table: a document without a number sits there for up to 5 days and is re-checked every 12 hours. If no number appears in 5 days, it moves to the main table with a flag. The requirement isn’t obvious, but it’s critical for data quality.
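The buffer rules above (re-check every 12 hours, hold for at most 5 days, then promote with a flag) fit in one small decision function. This is a sketch: the field names, the lookup_number callback, and the "promote"/"wait" return values are hypothetical, and only the two intervals come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional

HOLD_LIMIT = timedelta(days=5)
RECHECK_EVERY = timedelta(hours=12)


@dataclass
class BufferedDoc:
    source_id: str
    first_seen: datetime
    last_checked: datetime
    number: Optional[str] = None
    number_missing: bool = False  # flag set when promoted without a number


def is_due(doc: BufferedDoc, now: datetime) -> bool:
    """True when the document should be looked up in the registry again."""
    return now - doc.last_checked >= RECHECK_EVERY


def recheck(doc: BufferedDoc,
            lookup_number: Callable[[str], Optional[str]],
            now: datetime) -> str:
    """Return 'promote' to move the doc to the main table, else 'wait'."""
    doc.last_checked = now
    number = lookup_number(doc.source_id)
    if number:
        doc.number = number
        return "promote"
    if now - doc.first_seen >= HOLD_LIMIT:
        doc.number_missing = True  # promoted anyway, but marked
        return "promote"
    return "wait"
```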

7. Skipping already-collected numbers during iteration.
We caught this subtlety only in operation: a duplicate counted as a “not found” document, and ten duplicates in a row would wrongly break off the iteration. We fixed the logic: if a number is already in the database, we don’t run a search for it and move straight to the next, and the “empty” counter doesn’t grow on already-known documents.
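The corrected loop can be sketched as follows. The function and parameter names are hypothetical, sequence numbers are plain integers for brevity, and resetting the counter after a successful find is an assumption the text does not state.

```python
from typing import Callable


def iterate(start_seq: int, known: set[int],
            fetch: Callable[[int], bool], max_empty: int = 10) -> list[int]:
    """Walk numbers upward until `max_empty` real misses occur in a row."""
    found = []
    empty_streak = 0
    seq = start_seq
    while empty_streak < max_empty:
        if seq in known:
            seq += 1
            continue  # already in the database: no request, counter untouched
        if fetch(seq):
            found.append(seq)
            empty_streak = 0  # assumption: a hit resets the miss counter
        else:
            empty_streak += 1
        seq += 1
    return found
```

With twelve already-collected numbers in a row, the old logic would have stopped after the tenth; this version passes over them without spending a request and keeps going.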

Results

Metric | Value
Work on agreed estimates | around 260 hours
Registries in operation | 3 countries (KG, KZ, BY), with the Russian registry collected in parallel
Parser architecture stack | Laravel 5.5.50 · Horizon · Redis · Docker · ClickHouse · in-house HTTPS proxy rotation
Effect of the new system on KG | +6400 certificates since 1 January 2025, 72.6% found via the shared EAEU registry
New KZ registry launch | 46 sources, ~32k documents queued at start
Collection rhythm | on schedule (from hourly to every few hours) plus on-demand launch
Work calendar | November 2022 to present, more than three years

Three national parsers, a shared Laravel wrapper over Horizon, a buffer table for documents without a number, and a panel where the client switches countries and bodies on. There’s no scaffolding beyond that. Each parser has its own data-fetch logic: sequential number iteration, iteration with a country matrix, an API with a token, classic scraping with proxy rotation. In three years not one parser was rewritten from scratch: each grew to its current form through a chain of orders, on the infrastructure of the ones before it.

Process and timeline

Stage | Period | Result
Parsing farm and block workarounds | November 2022 | three VPS, auto re-parsing, monitoring; parsing fully on the team’s side
KG moved to number iteration | early 2024 | declarations (~7300 of them) every 4 hours through 10 proxies; then certificates by country matrix, 28 bodies
One-offs and extensions | 2024 | one-time collection from eokno.gov.kz; rewritten Belarusian registry (30 h); KZ refinements
Control panel and shared EAEU registry | first half of 2025 | collection control panel, shared registry as a number source
New KZ source via API | second half of 2025 | moved to the techreg.gov.kz API with token auth
Data-quality refinement | late 2025 | KZ fields, deferred-number buffer, identification acts
Block and limit workarounds | late 2025 to 2026 | Belarus added a block (worked around in two days, back-collected from 1 November); the RF registry set a 20-page limit (moved to hourly collection with back-fill)

All orders are joined through a shared database and a shared proxy rotation system. Each new order reuses the infrastructure of the ones before.

Team

  • Anton Hersun, Xaver Pro: project manager, spec formalization, architectural decisions on the parsers. He untangled the parsing himself: took on a partly working legacy with no single approach, studied it, reworked it, and moved it to a modern architecture with Laravel jobs under Horizon, incremental number iteration, and in-house proxy rotation. It is rare for us that the project manager does the build himself, but otherwise the tangled legacy could not have been handed to anyone.
  • The widgets-and-parsers developer took the already-reworked track from Anton and has run it since. He owns the KG/KZ/BY parsers on the new architecture, the collection control panel, the move to the new KZ source via API, the deferred-number buffer table, and the identification acts. Continuity within the track holds: any parser written a year ago is refined by the same person.

Screenshots and materials

To be added in a separate pass. Possible candidates: the collection control panel interface (statistics by country of manufacture, the “active for iteration” flag, the expected next number), the queues panel, a simplified diagram of the parser-job architecture.

If your parser stopped yesterday and you learned about it from the silence in the logs, send the code. We’ll tell you where it leaks, which parts to move into a queue, and the minimal monitoring to put in so the silence doesn’t repeat. The review is free.

Send the parser for review →

