ig
instant-grep
๐ŸŒ€ Secret Technique Scroll

Architecture

The inner workings of instant-grep โ€” from regex to result in under a millisecond. Every technique honed to perfection.

๐ŸŒ€ Rasengan Pipeline

The complete query pipeline โ€” six stages of pure concentrated energy.

โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—
โ•‘          โœฆ RASENGAN PIPELINE โ€” FORBIDDEN SCROLL OF SEARCH โœฆ          โ•‘
โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

  User Input
      โ”‚
      โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Stage 1 ยท regex-syntax Extractor       โ”‚  query/extract.rs
โ”‚  "async fn" โ†’ literal fragments         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                  โ”‚
                  โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Stage 2 ยท Sparse N-gram Generation     โ”‚  index/ngram.rs
โ”‚  fragments โ†’ variable-length n-grams    โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                  โ”‚
                  โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Stage 3 ยท Shadow Clone Selection       โ”‚  index/ngram.rs
โ”‚  all n-grams โ†’ minimal covering set     โ”‚  build_covering_ngrams()
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                  โ”‚
                  โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Stage 4 ยท Hash Table Lookup            โ”‚  index/reader.rs
โ”‚  n-grams โ†’ posting list offsets (mmap)  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                  โ”‚
                  โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Stage 5 ยท Posting List Intersection    โ”‚  index/reader.rs
โ”‚  sorted file ID lists โ†’ candidates      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                  โ”‚
                  โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Stage 6 ยท Parallel Regex Verification  โ”‚  Rayon threadpool
โ”‚  candidates โ†’ verified matches          โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                  โ”‚
                  โ–ผ
            โœ“ Results  (p50: 0.9ms)

๐Ÿ“‚ Module Structure

src/index/ngram.rs

Core Algorithm

Sparse n-gram generation and covering set selection

hash_bigram() build_all_ngrams() build_covering_ngrams()
src/index/writer.rs

Index Build Pipeline

Walks filesystem, builds and serializes the index

build_index() write_lexicon() write_postings()
src/index/reader.rs

Index Query

mmap-backed hash table + posting list intersection

query_index() intersect_postings() mmap_open()
src/query/extract.rs

Regex โ†’ N-gram Conversion

Parses regex AST to extract searchable literal fragments

extract_ngrams() literal_fragments() NgramQuery
src/daemon.rs

Sage Mode Daemon

Unix socket server for sub-ms repeated queries

start_daemon() handle_connection() serve_query()
src/main.rs

CLI Entry Point

Command dispatch via clap โ€” routes to index, search, status, watch, daemon, query

Commands enum clap::Parser dispatch()

๐Ÿ’พ On-disk Format (.ig/ directory)

Three binary files, all memory-mapped for zero-copy access. INDEX_VERSION must be bumped when the format changes.

.ig/
โ”œโ”€โ”€ lexicon.bin    โ€” Hash table: u64(n-gram hash) โ†’ u32(posting offset) + u32(count)
โ”œโ”€โ”€ postings.bin   โ€” Sorted u32[] file IDs per n-gram (delta-encoded)
โ””โ”€โ”€ metadata.bin   โ€” u32 version + u64[] mtime + []string file paths (null-separated)
lexicon.bin
Open-addressing hash table. Each slot: 8B key (n-gram hash) + 8B value (offset + count). Load factor: 0.7.
postings.bin
Concatenated sorted lists of file IDs. Each list referenced by lexicon offset + length. Intersection runs in O(n+m).
metadata.bin
Maps file IDs to paths. Stores mtimes for incremental rebuild detection. First 4 bytes: INDEX_VERSION.

๐ŸŒ€ Rasengan Algorithm โ€” Sparse N-grams

Port of github.com/danlark1/sparse_ngrams (GitHub Blackbird)

Traditional trigram search uses every contiguous 3-character substring โ€” expensive and produces many false positives. The Rasengan Algorithm uses variable-length n-grams that adapt to the string being indexed, concentrating energy where it matters.

โœ— Naive Trigrams (weak jutsu)
"async" โ†’ trigrams:
"asy", "syn", "ync"
Problems:
โ€ข Fixed length loses selectivity
โ€ข Many false candidates
โ€ข Large index size
โœ“ Sparse N-grams (Rasengan)
"async fn" โ†’ covering set:
"async", "fn"
Advantages:
โ€ข Variable length = high selectivity
โ€ข Fewer candidates to verify
โ€ข Smaller index, faster lookup

How hash_bigram() works:

Each n-gram is hashed by combining its characters using a polynomial rolling hash with a prime base. The hash fits in a u64 โ€” enabling the open-addressing hash table lookup to be a single memory access when the index is warm in the OS page cache (or the daemon's mmap).

๐Ÿ‘ฅ Shadow Clone Selection โ€” Covering Algorithm

From all possible n-grams for a query, we need only the minimal covering set โ€” the smallest collection of n-grams that together guarantee we find all matching files. Like selecting the perfect shadow clones: enough to cover every angle, no redundancy.

build_covering_ngrams() pseudocode:
fn build_covering_ngrams(ngrams: Vec<Ngram>) โ†’ Vec<Ngram> {
    sort ngrams by (position, length desc)
    greedily select ngrams that cover uncovered positions
    return minimal subset that covers the full query string
}

๐Ÿธ Sage Mode โ€” Daemon Architecture

In Sage Mode, the ninja draws power from nature without moving โ€” the daemon keeps the index mmap'd in process memory, eliminating OS page cache warmup on every query.

Without Sage Mode
  • โ€ข Cold start: open files, read mmap headers
  • โ€ข OS page faults on first access
  • โ€ข ~5-15ms for cold index
  • โ€ข ~0.9ms when OS cache is warm
With Sage Mode (daemon)
  • โ€ข Index already in process memory
  • โ€ข Unix socket: no TCP overhead
  • โ€ข Sub-0.5ms consistent latency
  • โ€ข Perfect for AI agent repeated queries
# Start Sage Mode
ig daemon .  # runs in background, socket at .ig/daemon.sock

# Query via socket (0.3ms)
ig query "async fn" .