Architecture

The inner workings of instant-grep — from regex to result in under a millisecond.

Query Pipeline

The complete query pipeline — six stages from regex to results.

╔══════════════════════════════════════════════════════════════════╗
║               QUERY PIPELINE — SEARCH ARCHITECTURE               ║
╚══════════════════════════════════════════════════════════════════╝

  User Input
      │
      ▼
┌─────────────────────────────────────────┐
│  Stage 1 · regex-syntax Extractor       │  query/extract.rs
│  "async fn" → literal fragments         │
└─────────────────┬───────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────┐
│  Stage 2 · Sparse N-gram Generation     │  index/ngram.rs
│  fragments → variable-length n-grams    │
└─────────────────┬───────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────┐
│  Stage 3 · Covering Set Selection       │  index/ngram.rs
│  all n-grams → minimal covering set     │  build_covering_ngrams()
└─────────────────┬───────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────┐
│  Stage 4 · Hash Table Lookup            │  index/reader.rs
│  n-grams → posting list offsets (mmap)  │
└─────────────────┬───────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────┐
│  Stage 5 · Posting List Intersection    │  index/reader.rs
│  sorted file ID lists → candidates      │
└─────────────────┬───────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────┐
│  Stage 6 · Parallel Regex Verification  │  Rayon threadpool
│  candidates → verified matches          │
└─────────────────────────────────────────┘
                  │
                  ▼
            ✓ Results  (p50: 0.9ms)

Module Structure

src/index/ngram.rs — Core Algorithm
  Sparse n-gram generation and covering set selection
  hash_bigram() · build_all_ngrams() · build_covering_ngrams()

src/index/writer.rs — Index Build Pipeline
  Walks the filesystem, drives the SPIMI pipeline, serializes the index
  build_index() · write_lexicon() · write_postings()

src/index/spimi.rs — SPIMI Index Build
  Single-Pass In-Memory Indexing with a 128MB RAM budget — segments flushed to disk when the budget is hit
  SpimiBuildPipeline · flush_segment() · merge_segments()

src/index/merge.rs — Segment Merge
  Merges SPIMI segments into the final index with VByte-encoded postings
  merge_postings() · merge_lexicons()

src/index/vbyte.rs — VByte Compression
  Variable-byte delta encoding for posting lists — ~50–60% smaller postings.bin
  vbyte_encode() · vbyte_decode() · delta_encode()

src/index/overlay.rs — Overlay Incremental Index
  For <100 changed files: builds a small overlay index and merges it on read — avoids a full rebuild
  OverlayIndex · build_overlay() · merge_overlay()

src/index/reader.rs — Index Query
  mmap-backed hash table + posting list intersection; decodes VByte postings on the fly
  query_index() · intersect_postings() · mmap_open()

src/query/extract.rs — Regex → N-gram Conversion
  Parses the regex AST to extract searchable literal fragments
  extract_ngrams() · literal_fragments() · NgramQuery

src/daemon.rs — Background Daemon
  Unix socket server with full lifecycle: start/stop/status/install/uninstall + launchd auto-restart
  start_daemon() · stop_daemon() · daemon_status() · install_launchd() · uninstall_launchd()

src/read.rs — Smart File Reading
  Full file reading or signatures-only mode — extracts function/struct definitions to reduce agent token cost
  read_file() · read_signatures()

src/smart.rs — 2-line File Summaries
  Heuristic 2-line summaries for every project file — a high-level map before full reads
  summarize_project() · summarize_file()

src/pack.rs — Project Context Generator
  Generates .ig/context.md: tree, symbols, dependencies, Next.js API routes, .env.example keys
  pack_context() · emit_deps() · emit_routes() · emit_env()

src/ls.rs — Compact Directory Listing
  Groups files by directory with counts — a token-efficient alternative to raw find output
  ls_directory() · format_tree()

src/rewrite.rs — Command Rewrite Engine
  Translates grep/find/rg invocations to ig equivalents — used by the Claude Code PreToolUse hook
  rewrite_command() · RewriteRule

src/tracking.rs — Token Savings Tracking
  Appends rewrite events to a JSONL log; read by ig gain to compute cumulative savings
  record_rewrite() · load_history()

src/gain.rs — Savings Dashboard
  Reads the JSONL tracking log and renders a token savings + cost summary table
  print_gain() · aggregate_history()

src/main.rs — CLI Entry Point
  Command dispatch via clap — routes to index, search, status, watch, daemon, query, ls, read, smart, pack, gain
  Commands enum · clap::Parser · dispatch()

v1.0.0 — SPIMI Build Pipeline & Overlay Index

SPIMI Index Build

v1.0.0 uses a Single-Pass In-Memory Indexing (SPIMI) pipeline with a 128MB RAM budget. When the budget is hit, a segment is flushed to disk. After all files are processed, segments are merge-sorted into the final index.

spimi.rs → merge.rs → writer.rs
flush_segment() when RSS > 128MB
merge_segments() → final index
1,552 files → 2 segments → 450ms
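The budget-driven flush loop can be sketched in miniature. This is a toy model with an illustrative byte-cost estimate and a tiny budget to force flushes; the type and method names are assumptions, not ig's real SpimiBuildPipeline API.

```rust
use std::collections::HashMap;

/// Minimal SPIMI-style builder sketch: postings accumulate in memory and a
/// segment is cut whenever the estimated RAM usage exceeds the budget.
struct SpimiSketch {
    budget_bytes: usize,
    used_bytes: usize,
    postings: HashMap<u64, Vec<u32>>, // ngram hash -> doc IDs seen so far
    segments: Vec<HashMap<u64, Vec<u32>>>,
}

impl SpimiSketch {
    fn new(budget_bytes: usize) -> Self {
        SpimiSketch { budget_bytes, used_bytes: 0, postings: HashMap::new(), segments: Vec::new() }
    }

    fn add(&mut self, ngram_hash: u64, doc_id: u32) {
        self.postings.entry(ngram_hash).or_default().push(doc_id);
        self.used_bytes += 12; // rough cost: 8B key + 4B doc ID
        if self.used_bytes > self.budget_bytes {
            self.flush_segment();
        }
    }

    /// The real pipeline serializes the segment to disk; here we just move
    /// the in-memory map aside and reset the budget counter.
    fn flush_segment(&mut self) {
        if self.postings.is_empty() {
            return;
        }
        self.segments.push(std::mem::take(&mut self.postings));
        self.used_bytes = 0;
    }
}

fn main() {
    let mut b = SpimiSketch::new(48); // tiny budget to force flushes
    for doc in 0..10u32 {
        b.add(0xDEAD, doc);
    }
    b.flush_segment(); // final flush, as after the last file is processed
    println!("segments: {}", b.segments.len());
}
```

After the loop, the segments would be merge-sorted into the final index, as the text above describes.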
Overlay Incremental Index

When fewer than 100 files have changed since the last build, ig builds a small overlay index for the changed files only. Queries merge the base index and overlay on read — no full rebuild required.

overlay.rs
OverlayIndex + build_overlay()
Incremental (no-op): 25ms
Full rebuild: 450ms (if >100 files changed)
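Merge-on-read can be illustrated with a toy in-memory model: base results for changed files are masked out, then overlay results are layered on top. The maps, the `changed` set, and the function name are assumptions for illustration, not ig's actual OverlayIndex structures.

```rust
use std::collections::{HashMap, HashSet};

/// Query both the base index and the overlay; the overlay shadows the base
/// for files that changed since the last full build.
fn merged_hits(
    base: &HashMap<String, Vec<u32>>,    // ngram -> doc IDs in the base index
    overlay: &HashMap<String, Vec<u32>>, // ngram -> doc IDs in the overlay
    changed: &HashSet<u32>,              // files the overlay supersedes
    ngram: &str,
) -> Vec<u32> {
    let mut out: Vec<u32> = base
        .get(ngram)
        .into_iter()
        .flatten()
        .copied()
        .filter(|id| !changed.contains(id)) // stale base entries are masked
        .collect();
    out.extend(overlay.get(ngram).into_iter().flatten().copied());
    out.sort_unstable();
    out
}

fn main() {
    let base = HashMap::from([("async".to_string(), vec![1u32, 2, 5])]);
    let overlay = HashMap::from([("async".to_string(), vec![2u32])]);
    let changed = HashSet::from([2u32, 5]); // file 5 no longer matches at all
    println!("{:?}", merged_hits(&base, &overlay, &changed, "async"));
}
```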

On-disk Format (.ig/ directory) — INDEX_VERSION 13

Five binary files, all memory-mapped for zero-copy access. INDEX_VERSION must be bumped when the format changes. v13: 18-byte lexicon entries keep aggregate bloom + locMask metadata, while postings.bin stores tagged per-document posting entries with bloom, loc and position-zone masks. Dense posting lists add skip-block metadata for selective intersections.

.ig/                                                         Size (1,552 files)
├── metadata.bin    — bincode: paths, mtimes, git SHA, IDF flag                 112 KB
├── lexicon.bin     — Hash table: [key:u64, offset:u32, len:u32,                 35 MB
│                      bloom:u8, loc_mask:u8] (18 bytes/entry)
├── postings.bin    — Tagged VByte PostingEntry stream + skip blocks              7.8 MB
├── bigram_df.bin   — bincode: corpus bigram document frequencies               95 KB
├── filedata.bin    — bincode: line offsets, symbols, block boundaries          1.8 MB
├── tree.txt        — Compact directory tree snapshot (auto-generated)
└── context.md      — Structured AI agent snapshot: tree, symbols, deps, routes, env
lexicon.bin
Open-addressing hash table. Each slot is 18 bytes: 8B key (n-gram FNV-1a hash) + 4B posting offset + 4B posting length + 1B bloom mask + 1B locMask. Load factor: 0.7.
postings.bin
Tagged VByte delta-encoded sorted posting entries. Small lists use a compact stream; dense lists add skip-block metadata with aggregate masks so the reader can jump toward candidate doc IDs.
metadata.bin
Maps file IDs to paths. Stores mtimes and git SHAs for incremental detection. First 4 bytes: INDEX_VERSION (13). IDF flag marks whether bigram weights are available.
bigram_df.bin
Corpus bigram document frequencies collected during index build. Used at query time to weight n-gram boundary selection by inverse document frequency.
filedata.bin
Pre-computed per-file metadata: line byte offsets, symbol definitions with block boundaries, smart summaries. Enables O(1) lookup for ig read -s and ig context.
context.md
Rich project snapshot written by pack.rs. Sections: directory tree, symbol definitions, Cargo.toml / package.json dependencies, Next.js API routes, .env.example keys.
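The VByte delta scheme used by postings.bin works roughly like this. The function names mirror the module list above, but the implementation is a sketch, not ig's exact code.

```rust
/// Encode one value as VByte: 7 data bits per byte, high bit set on all
/// but the final byte.
fn vbyte_encode(mut v: u32, out: &mut Vec<u8>) {
    while v >= 0x80 {
        out.push((v as u8 & 0x7F) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

/// Decode one VByte value starting at *pos, advancing the cursor.
fn vbyte_decode(buf: &[u8], pos: &mut usize) -> u32 {
    let (mut v, mut shift) = (0u32, 0);
    loop {
        let b = buf[*pos];
        *pos += 1;
        v |= ((b & 0x7F) as u32) << shift;
        if b & 0x80 == 0 {
            return v;
        }
        shift += 7;
    }
}

/// Delta-encode a sorted doc-ID list, then VByte each delta: adjacent file
/// IDs shrink to one byte each, which is where the size win comes from.
fn encode_postings(doc_ids: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0;
    for &id in doc_ids {
        vbyte_encode(id - prev, &mut out);
        prev = id;
    }
    out
}

fn main() {
    let ids = [3u32, 4, 7, 300, 301];
    let bytes = encode_postings(&ids);
    // the four small deltas cost one byte each; the 293 gap costs two
    println!("{} IDs -> {} bytes", ids.len(), bytes.len());
}
```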

Sparse N-gram Algorithm

Port of github.com/danlark1/sparse_ngrams (GitHub Blackbird)

Traditional trigram search uses every contiguous 3-character substring — expensive and produces many false positives. The sparse n-gram algorithm uses variable-length n-grams that adapt to the string being indexed, concentrating selectivity where it matters.

Naive Trigrams
"async" → trigrams:
"asy", "syn", "ync"
Problems:
• Fixed length loses selectivity
• Many false candidates
• Large index size
Sparse N-grams
"async fn" → covering set:
"async", "fn"
Advantages:
• Variable length = high selectivity
• Fewer candidates to verify
• Smaller index, faster lookup

How hash_bigram() works:

Each n-gram is hashed by combining its characters using a polynomial rolling hash with a prime base. The hash fits in a u64 — enabling the open-addressing hash table lookup to be a single memory access when the index is warm in the OS page cache (or the daemon's mmap).
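A minimal sketch of one polynomial rolling hash of this shape. The base constant is illustrative, and note that the on-disk lexicon key elsewhere in this document is described as FNV-1a; this only shows the general technique.

```rust
/// Combine the bytes of an n-gram with a prime base, accumulating into a
/// u64 with wrapping arithmetic.
fn ngram_hash(ngram: &[u8]) -> u64 {
    const BASE: u64 = 1_000_003; // an arbitrary prime base
    let mut h: u64 = 0;
    for &b in ngram {
        h = h.wrapping_mul(BASE).wrapping_add(b as u64);
    }
    h
}

fn main() {
    // equal inputs hash equally; nearby inputs diverge
    println!("{:x}", ngram_hash(b"async"));
    println!("{:x}", ngram_hash(b"fn"));
}
```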

Covering Algorithm

From all possible n-grams for a query, we need only the minimal covering set — the smallest collection of n-grams that together guarantee we find all matching files. Enough to cover every position, no redundancy.

build_covering_ngrams() pseudocode:
fn build_covering_ngrams(ngrams: Vec<Ngram>) -> Vec<Ngram> {
    // sort ngrams by (position, length desc)
    // greedily select ngrams that cover uncovered positions
    // return the minimal subset that covers the full query string
}
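The pseudocode can be made concrete with a small runnable sketch. The `Ngram` type, the tie-breaking policy (longest n-gram first at each position), and the candidate set below are assumptions for illustration.

```rust
#[derive(Clone, Debug)]
struct Ngram {
    pos: usize,  // start offset in the query string
    text: String,
}

/// Greedy covering selection: at each uncovered frontier, take the n-gram
/// that starts at or before it and extends coverage the furthest.
fn build_covering_ngrams(mut ngrams: Vec<Ngram>, query_len: usize) -> Vec<Ngram> {
    // sort by start position, longest first at each position
    ngrams.sort_by(|a, b| a.pos.cmp(&b.pos).then(b.text.len().cmp(&a.text.len())));
    let mut covered = 0; // everything before this offset is covered
    let mut out = Vec::new();
    for ng in ngrams {
        if ng.pos <= covered && ng.pos + ng.text.len() > covered {
            covered = ng.pos + ng.text.len();
            out.push(ng);
            if covered >= query_len {
                break;
            }
        }
    }
    out
}

fn main() {
    // candidate n-grams for the query "async fn" (8 bytes)
    let candidates = vec![
        Ngram { pos: 0, text: "async".into() },
        Ngram { pos: 1, text: "syn".into() },
        Ngram { pos: 5, text: " fn".into() },
        Ngram { pos: 6, text: "fn".into() },
    ];
    for ng in build_covering_ngrams(candidates, 8) {
        println!("{:?} @ {}", ng.text, ng.pos);
    }
}
```

Redundant candidates like "syn" and "fn" are dropped: "async" plus " fn" already cover every position.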

Daemon Architecture

The daemon keeps the index mmap'd in process memory, eliminating OS page cache warmup on every query.

Without daemon
  • Cold start: open files, read mmap headers
  • OS page faults on first access
  • ~5-15ms for cold index
  • ~0.9ms when OS cache is warm
With daemon
  • Index already in process memory
  • Unix socket: no TCP overhead
  • Sub-0.5ms consistent latency
  • Ideal for repeated AI agent queries
# Start daemon (background)
ig daemon start .  # socket at .ig/daemon.sock

# Check status
ig daemon status .

# Stop daemon
ig daemon stop .

# Install as launchd service (auto-restart on macOS)
ig daemon install .
ig daemon uninstall .

# Query via socket (sub-ms)
ig query "async fn" .

Cursor-Inspired 3.5-gram Metadata

Sparse n-grams plus active bloom and locMask filtering before regex verification

The v13 index stores cheap masks per posting entry. Query execution uses sparse n-gram lookup, next-byte bloom filtering, modulo-8 location compatibility, exact small-position compatibility, posting-list intersection, and final regex verification. The masks only narrow candidates; regex remains authoritative.

Layer 1 · Sparse N-grams (3.0)

Variable-length n-grams from the danlark1 covering algorithm. Each n-gram covers a run of literal characters with maximal selectivity — no wasted index entries on fixed 3-char windows.

Layer 2 · Bloom Mask (+0.25)

An 8-bit mask on each posting entry encodes the byte that follows an indexed n-gram occurrence. At query time, this rejects files where the byte after the n-gram cannot match the pattern.

Layer 3 · locMask (+0.25)

An 8-bit mask encodes the position modulo 8 of each n-gram occurrence. The reader checks relative alignment across multiple n-grams before touching file contents.
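The two masks can be sketched together. The bit-mapping hashes below are illustrative stand-ins, not the index's actual hash functions.

```rust
/// Map the byte following an n-gram to one of 8 bloom bits (toy hash).
fn next_byte_bit(b: u8) -> u8 {
    1 << (b % 8)
}

/// Map an occurrence position to its modulo-8 bit.
fn loc_bit(pos: usize) -> u8 {
    1 << (pos % 8)
}

/// Index side: fold every (position, following byte) occurrence of an
/// n-gram in one file into the two 8-bit masks.
fn build_masks(occurrences: &[(usize, u8)]) -> (u8, u8) {
    let mut bloom = 0u8;
    let mut loc = 0u8;
    for &(pos, next) in occurrences {
        bloom |= next_byte_bit(next);
        loc |= loc_bit(pos);
    }
    (bloom, loc)
}

/// Query side: a file survives only if the bit for the required next byte
/// is set; otherwise it is rejected before any regex work.
fn bloom_compatible(bloom: u8, required_next: u8) -> bool {
    bloom & next_byte_bit(required_next) != 0
}

fn main() {
    // "async" occurs at byte 10 followed by ' ', and at byte 99 followed by '{'
    let (bloom, loc) = build_masks(&[(10, b' '), (99, b'{')]);
    println!("bloom={bloom:08b} loc={loc:08b}");
}
```

Because each mask is only 8 bits, false positives are expected; as the text says, the masks narrow candidates and regex verification remains authoritative.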

Query Pipeline — v13 Detail
regex pattern
    │
    ▼
regex-syntax::Extractor → extract literal sequences
    │
    ▼
IDF-weighted covering ngrams (corpus-tuned boundaries)
    │
    ▼
FNV-1a hash → NgramKey (u64) → lookup in mmap'd hash table
    │
    ▼
Next-byte bloom mask → reject incompatible per-file postings
    │
    ▼
locMask alignment   → keep compatible modulo-8 positions
    │
    ▼
Small-position mask → reject incompatible exact early-file positions
    │
    ▼
Intersect posting entries (VByte decoded or skip-block advanced)
    │
    ▼
Parallel regex verification (rayon, per-thread regex clone)
    │
    ▼
✓ Results  (colored / JSON / compact)

IDF-Weighted Ngram Boundaries

Corpus-tuned selectivity — boundaries that learn from your codebase

Not all bigrams are equal. In a Rust codebase, fn appears in almost every file — it's a terrible n-gram boundary. The IDF system measures this automatically and steers the covering algorithm away from common bigrams toward rare, selective ones.

Build Phase — Collect Frequencies
  • During ig index, bigram document frequencies are tallied across all files
  • Stored compactly in bigram_df.bin (~95 KB for 1,552 files)
  • First build uses heuristics; second build uses real IDF from the first
  • Convergence in 2 passes — subsequent rebuilds are self-reinforcing
Query Phase — Weighted Boundaries
  • Common bigrams (>50% of files): weight 0.25× → boundary pushed away
  • Typical bigrams: weight 1.0× → unchanged
  • Rare bigrams (<5% of files): weight 2.0× → boundary attracted
  • Result: more selective n-grams, fewer candidates to regex-verify

Example — "async fn handle_request"

Without IDF: boundary at "fn" (appears in 98% of Rust files)
→ ngrams: "async fn", "handle" — "fn" boundary is near-useless
With IDF: "fn" weight=0.25, boundary slides to "handle_request"
→ ngrams: "async", "handle_request" — highly selective
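The weighting rule above reduces to a small bucket function. The thresholds and multipliers are the ones stated in the text; the function name is illustrative.

```rust
/// Weight a bigram as a covering-set boundary by its document frequency:
/// common bigrams are pushed away, rare ones attracted.
fn boundary_weight(doc_freq: usize, total_docs: usize) -> f64 {
    let ratio = doc_freq as f64 / total_docs as f64;
    if ratio > 0.50 {
        0.25 // common bigram: poor boundary
    } else if ratio < 0.05 {
        2.0 // rare bigram: strong boundary
    } else {
        1.0 // typical bigram: unchanged
    }
}

fn main() {
    let total = 1552; // corpus size quoted in this document
    println!("'fn' in 1521 files -> {}", boundary_weight(1521, total));
    println!("rare bigram in 12  -> {}", boundary_weight(12, total));
}
```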

FileData Cache — Pre-computed Metadata

filedata.bin — pay once at index time, earn forever at query time

Every ig index run pre-computes per-file metadata and serializes it to filedata.bin. Commands that previously required a full file read now return instantly from the cache — with graceful fallback when the cache is absent.

Line byte offsets
Every line's byte offset pre-indexed. ig context does a binary search over the offsets to find the enclosing block — no file read required.
Symbol definitions
Function/struct/trait/impl definitions with block start and end lines. ig read -s returns cached symbols in ~0.1ms vs 5–10ms for a live regex scan.
Smart summaries
2-line heuristic summaries pre-generated for every file. ig smart renders the project map without opening a single file.
Command latency with filedata.bin cache:
ig read -s src/index/reader.rs → ~0.1ms (cached symbols)
ig context src/index/reader.rs 42 → O(1) block lookup
ig smart . → instant (no file reads)
Graceful fallback: if filedata.bin is missing, commands fall back to live regex scan.
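The line-offset lookup can be sketched as a binary search over the cached offsets; the names here are illustrative, not ig's API.

```rust
/// Map a byte position to its (1-based) line number using pre-computed
/// line start offsets. partition_point counts how many line starts are
/// <= byte_pos; the enclosing line is the last of those.
fn line_for_offset(line_offsets: &[usize], byte_pos: usize) -> usize {
    line_offsets.partition_point(|&off| off <= byte_pos)
}

fn main() {
    // line start offsets for a small file: lines begin at bytes 0, 12, 30, 31
    let offsets = [0usize, 12, 30, 31];
    println!("byte 15 is on line {}", line_for_offset(&offsets, 15));
}
```

The same array answers the reverse question (line → byte range) by direct indexing, which is what makes cached reads cheap.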

Posting Entry Decode — Masked Candidate Intersection

Posting entries carry file IDs plus per-document bloom and loc masks

The v13 reader decodes PostingEntry records from the mmap'd postings.bin slice. Each record contains a delta-encoded file ID, a next-byte bloom mask, a modulo-8 location mask, and an exact small-position mask with overflow. Dense lists include skip blocks so selective intersections can advance directly toward candidate file IDs before regex verification.

Lazy VByte Decode
  • Posting list slice is borrowed directly from mmap — no copy
  • Each record decodes one VByte doc-ID delta plus bloom, loc and zone masks
  • Decoder tracks cumulative delta sum for absolute doc IDs
  • One-byte VByte deltas stay compact for adjacent file IDs
Masked Intersection
  • Next-byte bloom removes files whose following byte cannot match
  • locMask keeps only relative modulo-8 alignments that can coexist
  • Rarest posting list is consumed first — fewest bytes decoded overall
  • Regex verification still decides the final result set
PostingEntry — conceptual layout:
struct PostingEntry {
    doc_id:    u32,  // delta-encoded in postings.bin
    next_mask: u8,   // hashed byte after the n-gram
    loc_mask:  u8,   // occurrence positions modulo 8
}

fn resolve_and_masked(query_ngrams) {
    // filter postings by next_mask
    // intersect doc_id sets
    // check loc_mask compatibility
    // verify candidates with regex
}
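A runnable version of the masked intersection sketch: walk two sorted posting lists, keep doc IDs present in both, and drop candidates whose bloom mask cannot contain the required next byte. The types and the toy next-byte hash are illustrative, not ig's on-disk layout.

```rust
#[derive(Clone, Copy)]
struct Posting {
    doc_id: u32,
    next_mask: u8, // bloom of bytes seen after the n-gram in this file
}

/// Two-pointer merge intersection with a bloom pre-filter; survivors would
/// still go to regex verification in the real pipeline.
fn intersect_masked(a: &[Posting], b: &[Posting], required_next: u8) -> Vec<u32> {
    let bit = 1u8 << (required_next % 8); // toy next-byte hash
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        if a[i].doc_id < b[j].doc_id {
            i += 1;
        } else if a[i].doc_id > b[j].doc_id {
            j += 1;
        } else {
            // doc in both lists: apply the bloom filter from list a
            if a[i].next_mask & bit != 0 {
                out.push(a[i].doc_id);
            }
            i += 1;
            j += 1;
        }
    }
    out
}

fn main() {
    let a = [Posting { doc_id: 1, next_mask: 0b0000_0001 },
             Posting { doc_id: 4, next_mask: 0b0000_1000 }];
    let b = [Posting { doc_id: 4, next_mask: 0xFF },
             Posting { doc_id: 9, next_mask: 0xFF }];
    // require a following '{' (toy hash: 123 % 8 = bit 3)
    println!("{:?}", intersect_masked(&a, &b, b'{'));
}
```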