One IP, Six Crawler Identities, One Second: Detection Built Against Real Production Logs

How three production patches in 24 hours closed two leaks that synthetic testing missed

Published: May 17, 2026
Reading time: 16 min
[Figure: thirteen nginx log entries from a single source IP within one second, claiming six distinct named-bot identities while probing security-sensitive paths]

Rotational bot-identity spoofing is an attacker pattern where one source IP issues requests under multiple named-bot user agents within a short time window. On 16 May 2026 the axilog.io access log captured the pattern cleanly: source IP 5.255.104.83 issuing thirteen requests in one second under six distinct named-bot identities — ClaudeBot, GPTBot, PerplexityBot, YandexBot, Baiduspider, and bingbot — across /api/env, /actuator/env, /api/config, /config.json, /secrets.json, /appsettings.json, and the canonical-form variants of the same paths with trailing slashes.

No published crawler operates under multiple identities from a single egress IP. The combination is unambiguous. This article describes the pattern, the four-part discriminator used to detect it, and the three-release sequence that followed: the v1.10.0 detector shipped, its first run against the production log revealed two leaks the same day, v1.10.1 closed both, and v1.10.2 fixed a related counter-migration inconsistency that the second production run exposed.

The empirical claims are traceable: the nginx access log line excerpt is reproducible from any host running the attack, the analyser is open source under AGPL-3.0-or-later, and every fix described below corresponds to a tagged release with its own test suite.

By William Murray, Founder of SpeyTech — deterministic computing for safety-critical systems. Inverness, Scottish Highlands.

Definition: Rotational bot-identity spoofing

Rotational bot-identity spoofing is an attacker pattern where a single source IP issues requests under multiple distinct named-bot user agents within a short time window.

What the production log captured

The pattern was caught in raw form on the axilog.io seo log on 16 May 2026. The lines below are real production capture, abbreviated only in the user-agent column where the recognisable bot identifier is sufficient. Timestamps, IP, paths, and status codes are intact:

5.255.104.83 [16/May/2026:15:54:14] "GET /"                  200  "ClaudeBot/1.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /robots.txt"        200  "YandexBot/3.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /config.json"       404  "YandexBot/3.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /actuator/env"      301  "PerplexityBot/1.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /api/env"           301  "ClaudeBot/1.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /appsettings.json"  404  "YandexBot/3.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /api/config"        301  "Baiduspider/2.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /sitemap.xml"       301  "GPTBot/1.3..."
5.255.104.83 [16/May/2026:15:54:14] "GET /secrets.json"      404  "bingbot/2.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /api/env/"          404  "ClaudeBot/1.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /actuator/env/"     404  "PerplexityBot/1.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /api/config/"       404  "Baiduspider/2.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /sitemap-index.xml" 200  "GPTBot/1.3..."

Thirteen requests, one source IP, one second, six distinct named-bot identities. Four 301 redirects — three from nginx normalising /api/env, /actuator/env, and /api/config onto trailing-slash canonical forms that then 404, plus one content-route redirect from /sitemap.xml to /sitemap-index.xml that the attacker followed to a 200 on row 13. Six 404 probe failures. Three 2xx exploratory fetches. The 2xx requests under multiple identities serve a different purpose to the 404 probes: they fingerprint what an unknown server returns to each named bot before the targeted probes begin.
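The tallies above can be re-derived mechanically. The following is a minimal sketch that parses the abbreviated excerpt as printed in this article (not the raw nginx log format); the regular expression and the summarise() helper are illustrative assumptions, not part of the analyser.

```python
import re
from collections import Counter

# The thirteen rows from the excerpt, in the article's abbreviated layout.
LOG = """\
5.255.104.83 [16/May/2026:15:54:14] "GET /" 200 "ClaudeBot/1.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /robots.txt" 200 "YandexBot/3.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /config.json" 404 "YandexBot/3.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /actuator/env" 301 "PerplexityBot/1.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /api/env" 301 "ClaudeBot/1.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /appsettings.json" 404 "YandexBot/3.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /api/config" 301 "Baiduspider/2.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /sitemap.xml" 301 "GPTBot/1.3..."
5.255.104.83 [16/May/2026:15:54:14] "GET /secrets.json" 404 "bingbot/2.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /api/env/" 404 "ClaudeBot/1.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /actuator/env/" 404 "PerplexityBot/1.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /api/config/" 404 "Baiduspider/2.0..."
5.255.104.83 [16/May/2026:15:54:14] "GET /sitemap-index.xml" 200 "GPTBot/1.3..."
"""

# Groups: ip, timestamp, path, status, leading bot token of the user agent.
ROW = re.compile(r'^(\S+) \[([^\]]+)\] "GET ([^"\s]+)"\s+(\d{3})\s+"([A-Za-z]+)/')

def summarise(log_text):
    """Count requests, source IPs, timestamps, identities and status codes."""
    matches = (ROW.match(line) for line in log_text.splitlines())
    rows = [m.groups() for m in matches if m]
    return {
        "requests": len(rows),
        "ips": {r[0] for r in rows},
        "seconds": {r[1] for r in rows},
        "identities": {r[4] for r in rows},
        "statuses": Counter(r[3] for r in rows),
    }

if __name__ == "__main__":
    print(summarise(LOG))
```

Run against the excerpt, the summary shows thirteen requests, one IP, one timestamp, six identities, and a status split of three 200s, four 301s and six 404s.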

A reader running the analyser against this log will see the rotation section report five identities, not six. The path-class gate excludes the two GPTBot rows, because /sitemap.xml and /sitemap-index.xml are not in ROTATION_TRIGGER_PATHS and are not in the security-probe regex set. This is the gate doing its job: the same source IP issuing legitimate sitemap fetches under a verifiable public crawler stays attributed to GPTBot in the report, while its rotation-probe traffic under five other identities gets reclassified as SuspectedBotIdentityRotation. The discriminator is the combination of conditions, not any single one of them.

The attacker IP 5.255.104.83 is already publicly known through its probe campaigns against thousands of hosts. The paths probed are commodity exploit fingerprints. Nothing about either is sensitive. What matters is the combination visible in the log: one IP, multiple trusted-crawler identities, security-sensitive paths, sub-minute window. That combination is the signal.

Why naive detection approaches do not catch this

Three obvious detection approaches each fail in their own way. An IP blocklist is reactive and lags the threat — the same toolkit appears tomorrow from a different /24, and blocklists never catch up. A user-agent blocklist is structurally wrong here, because the entire premise of the attack is that the user-agent strings are valid — blocking the Googlebot user-agent string blocks Googlebot. Verifying a claimed crawler identity properly requires reverse-DNS lookups, which are too slow for log-time classification and only validate Googlebot in practice. A 404-rate detector sees the rotation entries as a content-404 burst inside a one-minute window, which fires the deploy-window anomaly detector with a false positive that looks like a real symptom — which is exactly what happened in production on the first deployment.

The signal that holds is the combination of four conditions: one source IP, multiple distinct named-bot identities, a short time window, and at least one path in a security-sensitive class. Each condition individually is benign. Two real crawlers can share a datacentre egress IP; two distinct named bots in one log is not suspicious. A single bot probing /api/env once is not suspicious either. The combination of all four is.

The v1.10 detector implements this as a stateful post-aggregation override. The detect_rotation_indices() function walks a per-IP identity log and returns indices that participate in any sliding window containing at least three distinct identities, bounded by a configurable --rotation-window-minutes value (default 5). The algorithm is O(N²) per IP in the worst case, bounded by a per-IP entry cap of 100. The reclassification then re-attributes counters from the named-bot buckets to a synthetic SuspectedBotIdentityRotation row, leaving the legitimate crawler statistics untouched.
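The sliding-window walk can be sketched as follows. This is an illustration of the behaviour described above, not the analyser's actual detect_rotation_indices() source: the function name, the (timestamp, identity) tuple shape, and the assumption that entries arrive in time order are all choices made for the example.

```python
from datetime import datetime, timedelta

def detect_rotation_indices_sketch(entries, window_minutes=5,
                                   min_identities=3, max_entries=100):
    """Return indices that sit inside any window with enough distinct identities.

    entries: list of (timestamp, identity) tuples for ONE source IP,
             assumed to be in ascending time order.
    """
    entries = entries[:max_entries]      # per-IP cap bounds the quadratic scan
    window = timedelta(minutes=window_minutes)
    flagged = set()
    for start in range(len(entries)):
        start_ts = entries[start][0]
        members, identities = [], set()
        for j in range(start, len(entries)):
            ts, identity = entries[j]
            if ts - start_ts > window:   # entry falls outside this window
                break
            members.append(j)
            identities.add(identity)
        if len(identities) >= min_identities:
            flagged.update(members)
    return sorted(flagged)

if __name__ == "__main__":
    t0 = datetime(2026, 5, 16, 15, 54, 14)
    per_ip_log = [
        (t0, "ClaudeBot"),
        (t0, "YandexBot"),
        (t0, "PerplexityBot"),
        (t0 + timedelta(minutes=20), "ClaudeBot"),  # well outside the window
    ]
    print(detect_rotation_indices_sketch(per_ip_log))
```

The nested loop is what makes the worst case quadratic per IP; the entry cap is what keeps that cost bounded on a noisy log.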

Rotation detection is the rotation signal plus a path-class gate. A request is reclassified only if its path is in ROTATION_TRIGGER_PATHS (/api/env, /actuator/env, /config.json, and so on; nineteen entries) or matches the existing security-probe regex set. Legitimate named-bot traffic on legitimate paths from the same IP is untouched.
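A minimal sketch of the gate, under stated assumptions: the set below holds only the three trigger paths this article names (the real set is described as having nineteen), the two regex patterns are placeholders standing in for the analyser's security-probe set, and reclassify() is a hypothetical helper. The exact-string lookup mirrors the v1.10.0 behaviour, before any trailing-slash handling.

```python
import re

ROTATION_DETECTION_NAME = "SuspectedBotIdentityRotation"

# Partial: only the entries named in the article.
ROTATION_TRIGGER_PATHS = frozenset({"/api/env", "/actuator/env", "/config.json"})

# Placeholder patterns (assumed), not the analyser's real probe regexes.
SECURITY_PROBE_PATTERNS = [re.compile(r"^/\.env$"), re.compile(r"^/\.git/")]

def passes_path_gate(path):
    """True if the path is a trigger path or matches a probe pattern."""
    if path in ROTATION_TRIGGER_PATHS:
        return True
    return any(p.search(path) for p in SECURITY_PROBE_PATTERNS)

def reclassify(rows, rotation_indices):
    """rows: list of (identity, path). Returns the attributed identity per row."""
    rotation = set(rotation_indices)
    return [
        ROTATION_DETECTION_NAME if i in rotation and passes_path_gate(path) else identity
        for i, (identity, path) in enumerate(rows)
    ]

if __name__ == "__main__":
    rows = [("ClaudeBot", "/api/env"),
            ("GPTBot", "/sitemap.xml"),
            ("YandexBot", "/config.json")]
    print(reclassify(rows, [0, 1, 2]))
```

Even when all three rows fall inside a rotation window, the sitemap fetch keeps its GPTBot attribution because its path does not pass the gate.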

What synthetic testing missed and production revealed

Synthetic verification before the first deployment

v1.10.0 shipped on 16 May 2026 with twenty-three tests, all green. The suite included an end-to-end check against a synthetic fourteen-line fixture reproducing the 5.255.104.83 pattern: one IP, multiple identities, probe paths, sub-minute window. The override reclassified everything correctly. The synthetic verification was thorough by every measure available before deployment.

Two leaks visible on the first production run

Then the first production run against the real axilog.io seo log surfaced two leaks immediately.

Leak one: trailing-slash variants. The override caught /api/env (in ROTATION_TRIGGER_PATHS) but missed /api/env/. The attacker’s first request to each path hit the no-slash form, which nginx redirected (301) to the canonical-form /api/env/, which then 404’d. Both forms entered the log under the same source IP and the same rotation burst, but only the no-slash form matched the exact-string set lookup. The trailing-slash 404s remained attributed to ClaudeBot and Baiduspider in the Crawler-Visible 404s section, exactly as if they were real crawler failures.

Leak two: stale burst counters. The deploy-window anomaly detector tracks content-404s in one-minute buckets. The aggregation pass populated those buckets during initial classification, when the rotation entries were still attributed to named bots. The post-aggregation override then re-attributed them to SuspectedBotIdentityRotation, but did not decrement the bucket counters. The result: a [HIGH] Content-404 burst: 5 404s finding for the rotation minute, fired by traffic that had already been reattributed and should no longer have contributed to the burst signal.
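The stale-counter leak is easy to reproduce in miniature. The sketch below uses assumed data shapes (a row dict with bot, status, minute and probe_matched fields) and hypothetical function names; it shows a two-pass pipeline in which the second pass either leaves the per-minute bucket untouched or decrements it under the same gate that incremented it.

```python
from collections import Counter

ROTATION = "SuspectedBotIdentityRotation"

def aggregate(rows):
    """First pass: attribute 404s to the claimed bot and fill minute buckets."""
    bot_404s, minute_content_404s = Counter(), Counter()
    for row in rows:
        if row["status"] == 404:
            bot_404s[row["bot"]] += 1
            if not row["probe_matched"]:            # increment gate
                minute_content_404s[row["minute"]] += 1
    return bot_404s, minute_content_404s

def apply_override(rows, rotation_indices, bot_404s, minute_content_404s,
                   fix_buckets):
    """Second pass: move rotation 404s onto the synthetic rotation row."""
    for i in rotation_indices:
        row = rows[i]
        if row["status"] != 404:
            continue
        bot_404s[row["bot"]] -= 1
        bot_404s[ROTATION] += 1
        # Undo only what the first pass counted: same gate as the increment.
        if fix_buckets and not row["probe_matched"]:
            minute_content_404s[row["minute"]] -= 1

if __name__ == "__main__":
    rows = [{"bot": f"Bot{i}", "status": 404, "minute": "15:54",
             "probe_matched": False} for i in range(5)]
    for fix in (False, True):
        bots, buckets = aggregate(rows)
        apply_override(rows, range(len(rows)), bots, buckets, fix_buckets=fix)
        print(fix, buckets["15:54"])
```

With fix_buckets=False the bucket for the rotation minute still holds five after the override, which is the condition that fired the burst finding; with fix_buckets=True it returns to zero.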

v1.10.1: trailing-slash normalisation and burst-counter decrement

v1.10.1 shipped within an hour with two fixes. The first normalised a single trailing slash in is_rotation_trigger_path() before the set lookup, with a len > 1 guard to prevent root-path collapse and a query-string strip to match the analyser’s existing clean_path convention. The second extended apply_rotation_override() to decrement minute_404s and minute_content_404s for any rotation entry that originally contributed to them, with gating conditions that mirror the aggregation increment gate exactly. Eleven new tests accompanied the fixes, including a regression guard that proves probe-matched rotation entries (which never entered the burst counter in the first place) do not double-decrement on their way out.
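The normalisation step might look like the sketch below. It follows the three behaviours described above (query-string strip, single trailing-slash removal, len > 1 guard) but is not the analyser's source; the trigger set is again limited to the three paths this article names, and the function carries a _sketch suffix to mark it as illustrative.

```python
# Partial: only the entries named in the article.
ROTATION_TRIGGER_PATHS = frozenset({"/api/env", "/actuator/env", "/config.json"})

def is_rotation_trigger_path_sketch(path):
    """Exact-set lookup after light normalisation of the request path."""
    path = path.split("?", 1)[0]              # strip any query string
    if len(path) > 1 and path.endswith("/"):  # guard keeps "/" from collapsing to ""
        path = path[:-1]                      # remove one trailing slash only
    return path in ROTATION_TRIGGER_PATHS

if __name__ == "__main__":
    for candidate in ("/api/env", "/api/env/", "/api/env/?debug=1", "/", "/api/env//"):
        print(candidate, is_rotation_trigger_path_sketch(candidate))
```

Removing only one slash is a deliberate limit in this sketch: /api/env// stays unmatched rather than being folded into the trigger path.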

v1.10.2: redirect-counter migration and the consistency invariant

The second production run confirmed both fixes. But the Redirect Analysis section now showed three distinct named-bot rows — Baiduspider, ClaudeBot, PerplexityBot, each with one redirect — while the Status Code Breakdown By Bot correctly unified the same three 301s under SuspectedBotIdentityRotation. The two views disagreed. The override had migrated bot_status, bot_urls, bot_url_status, bot_ips, special_files, crawler_404s, and the Googlebot-specific caches. It had not migrated redirects or redirect_attribution. Another leak in the same family of mistake: a counter populated at aggregation time, not migrated by the override.

v1.10.2 shipped twenty minutes later. The 3xx rotation entries now migrate agg.redirects and agg.redirect_attribution alongside the other counters. Five more tests. One of them is the consistency invariant in test form: after the override runs, the total 3xx count in bot_status[ROTATION_DETECTION_NAME] must equal the sum of redirects[ROTATION_DETECTION_NAME] across all paths. v1.10.0 violated it silently. v1.10.1 violated it silently. v1.10.2 holds it as a test that will fail loudly if any future override forgets to migrate a counter.
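The invariant can be expressed as a small check. The sketch assumes bot_status maps bot name to a {status: count} dict and redirects maps bot name to a {path: count} dict; those shapes are inferred from the wording above and may differ from the analyser's internals.

```python
ROTATION_DETECTION_NAME = "SuspectedBotIdentityRotation"

def redirect_counts_agree(bot_status, redirects, name=ROTATION_DETECTION_NAME):
    """True if the 3xx total in bot_status equals the redirect total for name."""
    total_3xx = sum(count for status, count in bot_status.get(name, {}).items()
                    if 300 <= status < 400)
    total_redirects = sum(redirects.get(name, {}).values())
    return total_3xx == total_redirects

if __name__ == "__main__":
    bot_status = {ROTATION_DETECTION_NAME: {301: 3, 404: 5}}
    # Redirects still attributed to the claimed bots: the two views disagree.
    stale = {"ClaudeBot": {"/api/env": 1},
             "PerplexityBot": {"/actuator/env": 1},
             "Baiduspider": {"/api/config": 1}}
    # Redirects migrated alongside the status counters: the two views agree.
    migrated = {ROTATION_DETECTION_NAME: {"/api/env": 1,
                                          "/actuator/env": 1,
                                          "/api/config": 1}}
    print(redirect_counts_agree(bot_status, stale))
    print(redirect_counts_agree(bot_status, migrated))
```

Wrapping a check like this in a test means any new counter that the override forgets to migrate shows up as a failing assertion rather than as a disagreement between two report sections.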

What the synthetic test fixture missed

The fixture reproduced the shape of the attack: one IP, multiple identities, probe paths, short window. The override caught it. What the fixture did not contain were two properties of the real attack as it actually appears in nginx: that nginx redirects no-slash forms to canonical-form slash variants before serving 404s, and that the attacker follows those redirects, so each probed path enters the log in both forms. The test fixture was constructed with the same mental model that produced the override. Production was the only place that mental model met an adversarial environment it had not been built against.

Synthetic tests verify the algorithm. Production observation verifies the algorithm meets the system. Both passes are necessary. The work done before deployment caught everything that had been imagined; the work done after deployment caught what had not.

Three properties that earned the right to ship

The analyser runs in cron against speytech.com daily and against axilog.io on demand. A feature that broke output on the common case (no rotation traffic) would be a regression in everyone’s pipeline. Three properties of the v1.10 design protected against that, and each maps to a specific failure mode it prevents.

Byte-identical output for the no-rotation case. The override only fires when at least three distinct named-bot identities appear from one IP inside the configured window, against trigger paths. On a clean log — speytech.com today, 1,178 lines, no rotation traffic — the override never activates. Verified with diff between v1.9.1 and v1.10.2 on the same input: zero differences in text, zero differences in JSON. The failure mode this prevents is cosmetic drift in routine reports being treated as a substantive change by operators who monitor the output for actual change.

Conditional JSON keys, not null placeholders. The rotational_bot_identity top-level key is emitted only when rotation is detected — not as null, not as {}, simply absent. Downstream JSON consumers that look for the key on the days it matters get structured data; consumers that have never seen the key get identical output on every other day. The failure mode this prevents is schema-stability anxiety: a consumer treating the appearance of a new key on a quiet day as a schema migration to handle.
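A sketch of the conditional-key pattern, assuming a flat report dict and a hypothetical build_report() helper; only the rotational_bot_identity key name comes from the analyser, and the shape of the findings payload is invented for the example.

```python
import json

def build_report(base, rotation_findings):
    """Serialise the report, adding the rotation key only when there are findings."""
    report = dict(base)
    if rotation_findings:                      # absent on quiet days, never null or {}
        report["rotational_bot_identity"] = rotation_findings
    return json.dumps(report, sort_keys=True)

if __name__ == "__main__":
    base = {"lines": 1178}
    print(build_report(base, []))
    print(build_report(base, [{"ip": "5.255.104.83", "identities": 5}]))
```

On a quiet day the serialised output is identical to what it would have been before the feature existed, which is what makes a byte-level diff against the previous version meaningful.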

Test-driven counter consistency. The internal invariant — that Status Code Breakdown and Redirect Analysis must agree on rotation attribution after the override runs — is now a property a test enforces, not a property the implementer remembers to maintain. v1.10.0 and v1.10.1 both violated it silently. The cost of v1.10.0 → .1 → .2 was three iterations. The cost of v1.11 forgetting to migrate counter X will be zero, because the test catches it before the commit. The failure mode this prevents is the same gap that produced the v1.10.2 patch in the first place: a counter migrated correctly in one place, forgotten in another, and the disagreement only visible against real production output.

Cross-reference: the broader operational case

The detector described here is one capability inside a tool whose origin is documented separately. The original purpose was deploy-window 404 detection on a static Astro site, which then revealed a much larger chronic crawl-rate suppression — both covered in the operational SEO observability case for speytech.com. The rotation detector is the v1.10 addition to the same tool; the two stories share the analyser and the access logs but solve different problems.

Frequently Asked Questions

What is rotational bot-identity spoofing?

Rotational bot-identity spoofing is an attacker pattern where a single source IP issues requests under multiple distinct named-bot user agents within a short time window. The attack tests whether a server treats trusted-crawler user-agents differently — by exempting them from rate limits, returning verbose errors, or relaxing security headers. Real published crawlers do not operate under multiple identities from a single egress IP, so the pattern is unambiguous when observed in raw access logs.

How do I detect rotational bot-identity spoofing in nginx access logs?

Rotational bot-identity spoofing is detectable by combining four conditions. The four are: one source IP, multiple distinct named-bot identities, a short time window holding them all, and at least one security-sensitive path. Each condition individually is benign; the combination is the signal. A practical implementation walks a per-IP identity log and reclassifies entries that fall inside any window meeting all four conditions, leaving legitimate crawler statistics untouched.

Why don’t user-agent allowlists catch rotational spoofing?

User-agent lists, whether allowlists or blocklists, cannot catch rotational spoofing because the attack works by claiming valid identities, not by using exotic user-agent strings. An allowlist admits the spoofed request because the string it presents is a valid one, and blocking the Googlebot user-agent string blocks Googlebot. Verifying a claimed crawler identity properly requires reverse-DNS lookups, which are too slow for log-time classification and only validate Googlebot reliably in practice. The detection signal has to come from request behaviour, not from the user-agent string in isolation.

What are the limitations of log-based rotational spoofing detection?

Log-based detection sees only the requests that actually reached the server, not the ones blocked upstream or targeted elsewhere. A determined attacker can defeat the per-IP signal by rotating source IPs alongside identities, though doing so loses the operational advantage the original attack was testing for. The minimum-identities threshold and the window length are both tunables: set the threshold too low and shared datacentre egress for two real crawlers can false-positive; set it too high, or the window too short, and campaigns that rotate through fewer identities or pace themselves more slowly slip past. The detector surfaces patterns upstream security controls did not catch; it does not replace them.

Closing

Rotational bot-identity spoofing is detectable in nginx access logs through a stateful per-IP identity-counting algorithm that combines four discriminating conditions against a configurable time window. The trade-off is that the detector itself must be verified against live production traffic, because synthetic tests reproduce the attack signature but not the structural behaviour of the system being attacked. As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context.

The analyser is open source under AGPL-3.0-or-later at github.com/SpeyTech/crawler-log-analyser. The v1.10.2 tag corresponds to every claim in this article; the changelog records the three-patch sequence in the order it happened.

Hope this helps.

— William

About the Author

William Murray is a Regenerative Systems Architect with 30 years of UNIX infrastructure experience, specializing in deterministic computing for safety-critical systems. Based in the Scottish Highlands, he operates SpeyTech and maintains several open-source projects including C-Sentinel and c-from-scratch.
