Update — May 16, 2026: The analyser described here is at v1.3 as published. The repository currently sits at v1.10.2, which adds SuspectedUASpoof behavioural detection, SuspectedBotIdentityRotation for attacker IPs cycling through named-bot identities, framework-fingerprint opacity scoring, a TOML config file, and AI crawler discovery-window classification. The minimum Python is now 3.11. The capabilities described in this article are a strict subset of what is live; the empirical claims and operational pattern are unchanged. The rotation detector and the three-patch sequence that produced v1.10.2 are covered in detail in the follow-up: One IP, Six Crawler Identities, One Second. See the repository README for the current capability set.
Search Console reported an average of 6.83 Googlebot requests per day on speytech.com across the 89 days between February 9 and May 8, 2026. Twenty-three of those days saw zero crawl activity at all. The site was passing Lighthouse, the sitemap was submitted, robots.txt was correct. The crawl rate was just quietly low.
On May 10, the same site received 568 Googlebot requests in a single day. That is 93.4% of the entire preceding three-month total of 608 requests. Lighthouse was unchanged. The sitemap location was unchanged. The robots.txt was unchanged. What changed was a small set of operational fixes that the conventional SEO tooling had no way to surface, because the conventional SEO tooling does not look at nginx access logs.
This article describes what the gap looks like in measured data, why standard SEO tooling cannot see it, what closing it required, and the open-source tool that made the diagnosis possible. The empirical claims are traceable: the Search Console crawl-stats export is reproducible, the nginx logs are timestamped, and the analyser used to read them is publicly available under AGPL-3.0-or-later.
Operational SEO concerns the runtime behaviour crawlers actually experience, not just the static correctness of pages at rest.
Operational SEO observability is the practice of reading live web-server access logs to verify what real search and AI crawlers experienced, rather than inferring it from synthetic audits or stabilised dashboard summaries.
What the data shows
The Search Console crawl-stats export for the 89 days from February 9 to May 8, 2026 contained 608 total Googlebot requests. That distributes as a mean of 6.83 requests per day, a median of 4, a peak of 30 (on April 26), and 23 days with zero activity. Just over a quarter of the period saw no Googlebot crawling at all.
On May 10, 2026, the nginx access log for speytech.com recorded 568 Googlebot requests across the day, restricted to the main Googlebot user-agent (Googlebot-Image excluded). The activity was not spread evenly. It clustered in two distinct passes:
- Pass 1: 16:10 to 16:11 UTC, 142 requests, two minutes.
- Pass 2: 20:26 to 20:40 UTC, 426 requests, roughly fourteen minutes.
The first pass started seven hours after a build that added per-URL <lastmod> timestamps to the sitemap. The second pass started within one minute of a further build which refreshed the <lastmod> values again. Pass 1 reads as the initial response to a freshness signal Google had not seen before; pass 2 reads as a follow-up triggered by the same signal advancing.
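For illustration, here is a minimal sketch of per-URL <lastmod> generation at build time, assuming a dist/ output directory, a page-per-directory layout, and a hypothetical base URL. A real pipeline should derive the timestamp from content changes rather than raw file mtimes, since an inaccurate <lastmod> is a weaker signal than none:

```python
# Sketch: emit per-URL <lastmod> entries from build-output mtimes.
# SITE and OUT are illustrative, not this site's actual build pipeline.
from datetime import datetime, timezone
from pathlib import Path

SITE = "https://example.com"   # hypothetical base URL
OUT = Path("dist")             # hypothetical build output directory

entries = []
for page in sorted(OUT.rglob("index.html")):
    rel = page.parent.relative_to(OUT).as_posix()
    loc = f"{SITE}/" if rel == "." else f"{SITE}/{rel}/"
    mtime = datetime.fromtimestamp(page.stat().st_mtime, tz=timezone.utc)
    entries.append(
        f"  <url>\n    <loc>{loc}</loc>\n"
        f"    <lastmod>{mtime:%Y-%m-%dT%H:%M:%SZ}</lastmod>\n  </url>"
    )

(OUT / "sitemap.xml").write_text(
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    + "\n".join(entries)
    + "\n</urlset>\n"
)
```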
The same May 10 produced 100/100/100/100 Lighthouse scores on both mobile (emulated Moto G Power, Slow 4G throttling, Lighthouse 13.0.1) and desktop, captured at 22:39 UTC. The scores were the same before the crawl burst as after. Lighthouse measures page quality and conformance to documented SEO checks; it does not measure whether Googlebot is actually fetching the page. Those are two different questions, and the second one is the operational one.
The sequence is suggestive of causation, but the causation itself is not directly observable. The sitemap freshness signal was the input and the burst was the output; Google’s scheduler made some decision between them. What the logs establish is correlation consistent with Google’s documented use of sitemap signals, not observed causation.
The failure mode the tool was originally built to detect
The tool that produced the May 10 measurement was not built for sitemap diagnosis. It was built earlier this year, in response to a different and more acute problem: transient Googlebot 404s during atomic deployment windows on a static Astro site.
The mechanism is structural. A site deployed via rsync, symlink swap, or atomic-rename strategies appears atomic at the directory level but is not atomic at the per-request level. A crawler request that begins reading file paths during the swap window can observe an inconsistent state: the new build’s HTML references new asset hashes, but the asset directory has not been swapped in yet, or vice versa. CDN cache invalidation can fall out of sync with origin file availability. The combination produces transient 404 responses that resolve themselves within seconds, but only after the crawler has logged the failure.
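For concreteness, a minimal sketch of the symlink-swap step itself, with illustrative paths. The rename is atomic for any single request, which is exactly why the directory-level view looks safe; nothing synchronises a multi-request page load across the swap:

```python
# Sketch: atomic symlink swap for a static-site deploy. Paths illustrative.
# Atomic for any single request; a crawler mid-way through a multi-request
# page load can still pair old HTML with new assets across the swap.
import os

def swap_live_symlink(live: str, new_release: str) -> None:
    tmp = live + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(new_release, tmp)   # stage the new symlink under a temp name
    os.replace(tmp, live)          # rename(2): no request sees a missing link

swap_live_symlink("/srv/www/current", "/srv/www/releases/2026-05-10T0800")
```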
Synthetic monitoring misses this entirely. A monitoring agent that polls the site every five minutes is overwhelmingly likely to poll outside the swap window and see a clean 200. Lighthouse audits are run on demand and are not in flight during deploys. Search Console reports daily aggregates that smooth bursts into averages. The failure mode is invisible to every observational tool except the one that captures every actual request: the web server’s own access log.
The acute incident produced a clear signature in the logs: a tight cluster of Googlebot 404 responses, all within the same two-minute window, all on URLs that were known to exist before and after the window. That signature has a name in the analyser’s output now (deploy-window anomaly), but at the time of the original investigation it had no name and no tool. There was just the raw log and the question “is this a thing that is happening, or am I imagining it?”
The first investigation
The first instinct was to write a one-shot awk script. That instinct lasted approximately one hour. Real production nginx logs contain a category of traffic that swamps any naive grep: security probes. A substantial fraction of the lines in a typical day on speytech.com are WordPress probe attempts, PHP exploit fingerprinting, .env scans, phpmyadmin discovery, and similar background noise. Filtering this out is straightforward conceptually but tedious in practice, and any analysis that does not filter it cleanly drowns the actual SEO signal.
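To make the filtering burden concrete, a minimal sketch of the kind of blocklist involved. The patterns are illustrative; the published analyser’s classifier is considerably more extensive:

```python
# Sketch: suppress common probe traffic before drawing any SEO conclusions.
# The pattern list is illustrative, not the analyser's actual blocklist.
import re

PROBE = re.compile(
    r"\.php\b|/wp-(?:admin|login|content)|/\.env\b|/phpmyadmin|/xmlrpc",
    re.IGNORECASE,
)

with open("access.log", encoding="utf-8", errors="replace") as fh:
    # Matching the whole line is crude (it can also hit referrers), which
    # is part of why the one-shot script stopped being good enough.
    kept = [line for line in fh if not PROBE.search(line)]

print(f"{len(kept)} lines after probe suppression")
```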
The second instinct was to look for an off-the-shelf tool. The category of “log analyser for SEO” exists, but everything in it is either commercial SaaS that expects log shipping to a third-party service, general-purpose log management software with no SEO-specific classification, or hand-rolled awk scripts in blog posts that do not handle the messy realities of production logs (HTTP/2 requests, IPv6 source addresses, binary noise in URLs from probe traffic, gzip-rotated history).
The diagnostic question was specific enough to require its own tool: “did Googlebot receive any 404s in the time window around my deploy, and if so, on which URLs, and at what rate, and how does it compare to the surrounding hours of crawl activity.” That question is not naturally answered by log management software. It needs an analyser that knows what Googlebot is, what a deploy window looks like, and what to ignore.
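Expressed as code, under naive combined-format assumptions and with illustrative window bounds, the core of that question looks like this (the tolerant parsing that real logs demand is the subject of the next section):

```python
# Sketch: which URLs did Googlebot 404 on inside a deploy window?
# Naive splitting; malformed probe lines will trip it, hence the real parser.
from datetime import datetime

START = datetime(2026, 5, 10, 16, 8)
END = datetime(2026, 5, 10, 16, 14)

with open("access.log", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        if "Googlebot" not in line or '" 404 ' not in line:
            continue
        # e.g. 1.2.3.4 - - [10/May/2026:16:10:02 +0000] "GET /x HTTP/1.1" 404 ...
        stamp = line.split("[", 1)[1].split()[0]
        ts = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S")
        if START <= ts <= END:
            print(line.split('"')[1].split()[1])  # the 404'd URL path
```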
Building the analyser
The design discipline was constrained from the start. Single-file Python, standard library only, runs on any nginx host that has Python 3.8 or later, no telemetry, no dashboards, no SaaS dependencies, no database. The output formats are plain text, markdown, and JSON, intended for terminal viewing, file-based reporting, and machine consumption respectively. The licence is AGPL-3.0-or-later, which permits commercial and internal use while keeping any networked modifications open.
Three iteration rounds against real production data shaped the analyser. The first round handled the parser tolerance problem. Real production nginx logs contain malformed probes, HTTP/2 requests, IPv6 traffic, binary URL payloads, and crawler behaviours that textbook combined-log parsers do not tolerate cleanly. The naive regex for the combined log format rejected approximately 41% of real production lines. The fix was a two-stage parser: a permissive skeleton regex extracts structural fields without imposing assumptions on the request line, and the request line is parsed separately with fallback behaviour for anything unusual.
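A minimal sketch of the two-stage shape, with illustrative group names rather than the analyser’s actual internals:

```python
# Sketch: permissive skeleton first, request line parsed separately with
# fallbacks. The real analyser tolerates more than this illustration does.
import re

SKELETON = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_request(raw: str):
    parts = raw.split()
    if len(parts) == 3:        # "GET /path HTTP/1.1" (or HTTP/2.0)
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:        # "GET /path" with no protocol token
        return parts[0], parts[1], ""
    return "", raw or "-", "" # binary noise, empty probes, anything else

line = '2001:db8::1 - - [10/May/2026:16:10:02 +0000] "GET /posts/ HTTP/2.0" 200 5120'
m = SKELETON.match(line)
if m:
    method, path, proto = parse_request(m.group("request"))
    print(m.group("ip"), m.group("status"), method, path, proto)
```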
The second round handled the probe-suppression problem. Once parsing was tolerant, the volume of probe traffic became visible: hundreds of .php requests, exploit fingerprints, .env scans, framework-manifest discovery. Including this in the 404 report drowned the actual signal. Excluding it required two separate detectors (security probes proper, and framework fingerprint probes like /_next/, /_nuxt/, webpack-stats.json), each summarised in its own section so that operators could see what was being filtered without it polluting SEO conclusions.
The third round added the deploy-window anomaly detector that was the original purpose. The detector watches three signals at one-minute resolution: robots.txt failures from any crawler, content-404 bursts (five or more within two minutes, excluding probe and framework traffic), and Googlebot 404 events followed by a crawl gap of two hours or more. Each signal is tiered by severity based on which bot family is affected. A Googlebot robots.txt failure is CRITICAL. The same failure from an AI crawler is a WARNING. The same failure from a generic bot is INFO. The tiering matches the operational reality: Google’s reaction to a robots.txt failure is the one that meaningfully suppresses crawling.
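A sketch of the burst signal’s shape, a two-minute sliding window over probe-filtered 404 timestamps. The thresholds are the ones just stated; the function itself is illustrative:

```python
# Sketch: flag any two-minute window holding five or more content 404s.
# Input timestamps are assumed to have probe traffic already excluded.
from datetime import datetime, timedelta

def find_404_bursts(timestamps, threshold=5, window=timedelta(minutes=2)):
    ts = sorted(timestamps)
    bursts, lo = [], 0
    for hi in range(len(ts)):
        while ts[hi] - ts[lo] > window:
            lo += 1
        if hi - lo + 1 >= threshold:
            bursts.append((ts[lo], ts[hi], hi - lo + 1))
    return bursts

base = datetime(2026, 5, 10, 16, 10)
demo = [base + timedelta(seconds=s) for s in (0, 10, 25, 40, 80)]
print(find_404_bursts(demo))   # five 404s inside 80 seconds -> one burst
```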
A fourth and more recent round, released as v1.3 the same week as this article, added support for nginx’s extended seo_crawl log format with $host, $scheme, $request_time, and $upstream_response_time. This enables concrete redirect attribution (HTTP→HTTPS vs www→apex vs trailing-slash, with the configuration-bug case flagged when a request was already on the canonical host and scheme) and per-request latency analysis. The standard combined format remains fully supported, and output for combined-format inputs is byte-identical to v1.2. The format is auto-detected per file.
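A sketch of the attribution logic that logging $scheme and $host makes possible, with the canonical target and category names illustrative:

```python
# Sketch: attribute a redirect observed in the log to its likely cause.
# CANON_* values and category names are illustrative.
CANON_SCHEME, CANON_HOST = "https", "example.com"

def classify_redirect(scheme: str, host: str, path: str) -> str:
    if scheme != CANON_SCHEME:
        return "http->https"
    if host == f"www.{CANON_HOST}":
        return "www->apex"
    if host != CANON_HOST:
        return "alias->canonical"
    last = path.rsplit("/", 1)[-1]
    if last and "." not in last:   # slashless directory-style URL
        return "trailing-slash"
    # Already on the canonical scheme and host, and not a slash fix:
    # a redirect here is the configuration-bug case worth flagging.
    return "configuration-bug"

print(classify_redirect("http", "example.com", "/posts/"))   # http->https
```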
The Googlebot health score is the analyser’s summary metric, a 0–100 number with explicit deduction reasons. The deduction logic is auditable: a 404 rate over 1% deducts 10 points, any 5xx response deducts 15, robots.txt failures deduct 25, and so on, with each deduction visible in the report. The score is not a magic number; it is an arithmetic summary of observable facts. The v1.3 release added a latency-based deduction when seo_crawl data is present and Googlebot’s p95 response time exceeds the Search Console “Average response time” warning threshold. Crawl health is now expressible as a single integer the operator can monitor without staring at the underlying log.
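The deduction arithmetic is simple enough to sketch. The three deduction values below are the ones just stated; the function shape and field names are illustrative:

```python
# Sketch: an auditable 0-100 score where every deduction names its reason.
# Deduction values from the prose; structure illustrative.
def health_score(total: int, n404: int, n5xx: int, robots_failures: int):
    score, reasons = 100, []
    if total and n404 / total > 0.01:
        score -= 10
        reasons.append(f"404 rate {n404 / total:.1%} exceeds 1% (-10)")
    if n5xx:
        score -= 15
        reasons.append(f"{n5xx} 5xx responses (-15)")
    if robots_failures:
        score -= 25
        reasons.append(f"{robots_failures} robots.txt failures (-25)")
    return max(score, 0), reasons

score, reasons = health_score(total=568, n404=2, n5xx=0, robots_failures=0)
print(f"Score: {score}/100")        # 2/568 is 0.35%: under the 1% threshold
for r in reasons:
    print(" -", r)
```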
The output of a healthy daily run looks like this:
```
Deploy-Window Anomaly Detection
───────────────────────────────
No deploy-window symptoms detected. ✓

Checks performed:
- robots.txt 404/5xx for any crawler
- content-404 bursts (≥5 within 2 minutes)
- Googlebot 404 followed by >2h crawl gap

Googlebot Crawl Health Score
────────────────────────────
Score: 100/100
Verdict: Excellent
No deductions applied.
```

The format is deliberately readable in a terminal without any rendering. The same content is available as markdown or JSON via --format for downstream tooling.
The chronic finding
The analyser was built to detect the acute incident. Once it was built and running over historical logs, it surfaced a second and unanticipated finding.
The acute deploy-window 404s had stopped. The atomic deploy fix had worked. But the analyser’s daily reports kept showing Googlebot activity well below what the site’s content volume would suggest. Forty articles published, multiple sitemaps, full structured data, sensible internal linking, conformant robots.txt, and Googlebot was visiting at a rate of fewer than seven requests per day with just over a quarter of days seeing no visits at all. The acute fix had stopped the bleeding, but the patient was not recovering.
The discovery of the chronic state was not the analyser’s intended purpose. It was a side effect of having continuous visibility into crawler behaviour. Without the tool, the suppressed crawl rate was visible only in Search Console’s smoothed daily averages, which read like a baseline rather than like a symptom. With the tool, the per-request granularity made it obvious that “low and steady” was the pathology, not the steady state.
The remediation was operational SEO work that any SaaS audit would have flagged but no SaaS audit had pushed to the top of the queue: adding per-URL <lastmod> timestamps to the sitemap, fixing a sitemap.xml redirect that returned 404 in some cache states, adding a plain-text sitemap.txt for crawlers that prefer it, consolidating legacy redirected URLs. The work was done on the morning of May 10. The first major Googlebot recheck pass began at 16:10 UTC the same day. The second, larger pass began at 20:26 UTC. By the end of the day the site had received 568 Googlebot requests.
The 568-request figure is a burst, not a new steady state. Google’s scheduler front-loads recheck activity when a site’s freshness signals advance, then settles into a higher steady-state cadence proportional to the new signals. The honest projection from this data is a sustained 7-15x improvement over the previous baseline, roughly 50-100 requests per day; that projection should be checkable in Search Console by late May.
What the tool teaches
A tool built to detect one specific failure mode surfaced a second, larger failure mode that no acute fix would have addressed. This is a familiar pattern in operational engineering. Tooling built for an acute issue produces visibility into the chronic conditions that surround it. The acute fix stops the bleeding; the observability reveals what was happening beneath.
The pattern generalises beyond SEO. Adding access logs to a web service to debug one incident often reveals a second class of slow queries that had been silently degrading performance. Adding tracing to a distributed system to find one race condition often reveals an entire category of latency tail behaviour. The instrumentation built for an emergency tends to outlast the emergency, and its second use is usually the more valuable one.
The lesson for self-hosted static sites specifically is that observability for outputs matters as much as observability for inputs. The build-time correctness of a site (its HTML, its sitemap, its robots.txt) is necessary but not sufficient. The deployment-time and crawl-time behaviour (whether deploys produce transient 404s, whether crawlers are actually fetching, whether the right URLs are being returned to the right user-agents) is observable only in live logs. Lighthouse runs against the deployed site after stabilisation; Search Console reports aggregated data after Google’s own batch processing; synthetic monitors poll on intervals that miss bursts. The access log is the only artefact that records what actually happened, in order, with timestamps, for every request.
The same discipline applied to a Content Security Policy rollout the next day caught a Partytown sandbox iframe issue during the Report-Only observation phase, before any visitor experience could be affected. Different concern, same pattern: structured observation reveals what assumption-driven deployment would have missed.
The pattern only works if the surrounding system is observable. Defensive deployment that silences error output also silences the diagnostic surface that would reveal failure. Build it defensively and visibly; the second property is non-negotiable.
The general lesson is not “every site needs this tool.” The general lesson is “every site that cares about real-world crawler behaviour needs some way to read its own logs.” The tool described here is one shape that observability can take. It is small on purpose.
Three honest limitations apply to the analysis presented here. The analyser is built for nginx and would require parser changes for Apache or Caddy. The Search Console baseline figures describe one site over one 89-day period; broader empirical claims would require multiple sites measured comparably. The 568-request figure is a burst triggered by simultaneous remediation of multiple freshness signals on a single day, not a sustainable daily rate. The durable claim is a 7-15x sustained improvement, not 83x.
The tool
The analyser is a single-file Python script with no external dependencies, runs on Python 3.8 or later, parses nginx logs in either the combined or extended seo_crawl format, and produces text, markdown, or JSON output. It supports the --strict flag for cron and CI integration, exiting non-zero on critical findings. The full source, README, changelog, and AGPL-3.0-or-later licence are at github.com/SpeyTech/crawler-log-analyser.
The repository contains no dashboard, no SaaS component, no database backend, and no external service dependency. It does not replace Search Console; it complements it. The tool’s value is concentrated in the parser tolerance, the classification of probe versus real traffic, and the deploy-window detector. The rest is plumbing.
Frequently Asked Questions
What is operational SEO observability?
Operational SEO observability is reading live web-server access logs to verify what real crawlers experienced, rather than inferring it from synthetic audits. The practice complements Search Console rather than replacing it. Lighthouse measures page quality; access logs measure actual crawler fetches. Both are needed, and they answer different questions.
How do I read my own nginx access logs for SEO signals?
Filter by known crawler user-agents (Googlebot, Bingbot, AI crawlers) and look at status codes, request timing, and URL paths. The raw approach is grep plus awk, which works for a one-off look. For ongoing use, an analyser that handles HTTP/2, IPv6, probe filtering, and crawler classification is more sustainable. The crawler-log-analyser tool described in this article is one such option; the broader category of log-based SEO tooling exists for those who prefer a different shape.
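For a one-off look without installing anything, a sketch along these lines works, under combined-format assumptions and with the fragility caveats above:

```python
# Sketch: count Googlebot requests per status code from a combined-format log.
from collections import Counter

counts = Counter()
with open("access.log", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        if "Googlebot" in line:
            try:
                counts[line.split('"')[2].split()[0]] += 1   # status code
            except IndexError:
                pass   # malformed probe lines: exactly the mess described above
print(dict(counts))
```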
Why doesn’t Lighthouse catch deploy-window 404s?
Lighthouse audits run on demand against a stabilised site, not during the deployment window itself. The transient 404 occurs in a brief inconsistent state between the HTML and asset directory swap, typically lasting seconds. A Lighthouse audit triggered before or after the swap sees a fully consistent site. Only continuous request-level logging captures requests that landed inside the window.
What are the limitations of log-based SEO observability?
Log-based observability sees only what crawlers actually fetched, not what they decided not to fetch or rank. It cannot replace Search Console for impressions, click-through rates, or search position data. The analyser is built for nginx; Apache or Caddy would require parser changes. The 568-request burst documented here is one site over one day, not a generalised effect size.
Closing
Whether the steady-state recovery sustains will be visible in Search Console over the coming weeks, and the claims above should be verifiable from that data rather than from this one day’s burst. The 568-request day shows the ceiling of Google’s current interest in the site; the durable number is whatever the median settles at by the end of June.
Twenty-two independent engineers cloned the analyser repository in its first 36 hours, with no promotional activity. The audience exists for tools that take static-site SEO observability seriously.
The discipline that produced the measurement is portable. Most static-site operators have nginx logs they have never read. Most of them would find at least one surprising thing in the first reading.
Hope this helps.
— William