
Syncing Enterprise Data Without Re-Indexing the World

Re-indexing a million-document corpus every night isn't a sync strategy — it's a bill. ZenSearch combines content-hash dedup, per-connector incremental sync, page-change detection, and set-diff deletion detection to sync only what actually changed.

March 3, 2026 · ZenSearch Team

Incremental sync means re-indexing only what actually changed since the last run, not the entire source corpus. ZenSearch ships this at four layers: content-hash dedup that short-circuits unchanged documents before they enter the parse pipeline, per-connector sync scheduling with per-document next-sync timestamps, a dedicated page-change detector for web crawls, and a deletion detector that diffs each connector's "seen source IDs" against the indexed set to find removals. Together they turn nightly sync from a full re-index into a delta operation.

The back-of-envelope difference: take a corpus of 500K documents with 2% daily churn. A full re-index parses all 500K docs; a correctly incremental sync parses 10K. Same coverage at a fiftieth of the parse work, and a fiftieth of the cost.

Content-Hash Dedup

Before a document enters the parse pipeline, the collector calls CoreAPIClient.CheckDuplicates() with a hash of the document's raw bytes. If the hash matches what's already indexed, the document is marked Unchanged in sync stats and skipped — it never hits the parser, projector, vectorizer, or classifier.

The wiring is deliberately fail-open: if the API call errors, the document is published to the parse queue anyway. Better to re-parse once than to silently drop a real change because of a transient error. Full-sync mode (SyncModeFull) bypasses dedup entirely, which is the escape hatch when you need to rebuild an index from scratch.

Per-Connector Sync Scheduling

Each connector has a NextSyncAt timestamp. A cron-driven scheduler (services/core-api/internal/cron/document_sync.go) picks up connectors whose NextSyncAt is in the past, runs their sync, and updates the timestamp based on the connector's configured cadence. This mirrors the schema-sync scheduler used for database connectors, but filters those out so they only run their schema-discovery path.

The practical effect: Confluence connectors can sync every 15 minutes while SAP connectors sync nightly, without either blocking the other or requiring a per-connector cron job.

PageChangeDetector for Web Crawls

Web crawls pose a particular challenge: the crawler encounters the same URL with identical content across many runs, and a naive crawler re-parses every page every time.

ZenSearch's webcrawler service uses an indexed_pages table with the URL, last-fetch timestamp, and content hash. Before parsing a page, it calls CompareContentHash() — if the hash matches, the page is marked Unchanged and skipped. The table is kept up to date by the core-api endpoints at /api/v1/indexed-pages, which the collector service talks to over the internal API.

DeletionDetector

The trickiest part of sync is finding deletions. A page deleted at the source doesn't show up in the collector's current sync output — it's simply missing. Without detection, stale documents accumulate in the index until someone notices.

ZenSearch solves this with a set-diff approach. At the end of each sync, the collector publishes a SyncCompleteMessage to NATS containing the set of source IDs it saw (capped at 50K — larger sets skip detection to avoid memory blowouts). The DeletionDetector service subscribes to events.*.sync.complete, queries the indexed document set with NOT IN against the seen set, and marks the diff for deletion.

One safety rail: if the "to delete" set exceeds 50% of the connector's indexed documents, the detector refuses to act and logs a warning. This prevents a broken sync (say, a misconfigured auth credential that returns zero results) from wiping the entire connector's indexed corpus. The threshold is a simple heuristic — the kind of guardrail that catches 99% of incidents without needing smarter logic.
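The diff and both guardrails fit in a few lines. This sketch computes the set difference in memory where the real detector uses a `NOT IN` query against the index; `detectDeletions` and the constant names are hypothetical, but the two limits mirror the ones described above — skip detection when the seen set exceeds 50K, refuse to act when the candidate deletions exceed half the indexed corpus.

```go
package main

import "fmt"

const (
	seenSetCap     = 50_000 // larger seen sets skip detection to avoid memory blowouts
	maxDeleteRatio = 0.5    // refuse to delete more than half the connector's corpus
)

// detectDeletions sketches the set-diff: indexed IDs absent from the
// connector's seen set are deletion candidates, subject to both guardrails.
// The second return value is false when detection was skipped or refused.
func detectDeletions(indexed []string, seen map[string]bool) ([]string, bool) {
	if len(seen) > seenSetCap {
		return nil, false // seen set too large: skip this run
	}
	var toDelete []string
	for _, id := range indexed {
		if !seen[id] {
			toDelete = append(toDelete, id)
		}
	}
	if len(indexed) > 0 && float64(len(toDelete))/float64(len(indexed)) > maxDeleteRatio {
		return nil, false // likely a broken sync (e.g. bad credentials): refuse to wipe
	}
	return toDelete, true
}

func main() {
	indexed := []string{"a", "b", "c", "d"}
	seen := map[string]bool{"a": true, "b": true, "c": true}

	del, ok := detectDeletions(indexed, seen)
	fmt.Println(del, ok) // [d] true: one document was removed at the source

	// A sync that saw nothing would delete 100% of the corpus — refused.
	_, ok = detectDeletions(indexed, map[string]bool{})
	fmt.Println(ok) // false
}
```

The refusal path deliberately returns nothing rather than a partial delete: a half-applied wipe is harder to reason about during an incident than a logged warning and an untouched index.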

SyncStats Metrics

Every sync job writes Added / Modified / Deleted / Unchanged / Errors counters to SyncJob.Metadata.sync_stats. These feed the Control Tower dashboard's per-connector health panel and surface as Prometheus metrics so operators can alert on "no successful sync in 24 hours" or "deletion rate 10x baseline".

The counters are also a debugging aid — a connector showing huge Modified numbers every sync usually has a timestamp-precision issue in its change-detection logic (pulling back "changed since X" where X loses milliseconds, causing the same documents to appear changed on every run).

Why This Matters

Most enterprise search platforms ship a collector and call the story done. The sync loop — what makes it robust, efficient, and recoverable when things go wrong — is where operational maturity actually lives. Content hashes, per-document scheduling, page-change detection, set-diff deletion, and safety thresholds aren't glamorous features, but they're what separate a platform you can run at scale from one that melts under a real corpus.