OpenAI Embeddings — opt-in POC

A pedagogical playground for dense semantic search over your codebase, using OpenAI's text-embedding-3-small model. Disabled by default at two independent layers: compile-time (cargo feature flag) and runtime (ig emb on/off, since v1.14.2). The published binary contains zero OpenAI client code unless you build with --features embed-poc; even then, no network call fires until you flip ig emb on.

Why opt-in?

  • Cost. Indexing 3 k files ≈ $0.05; a runaway re-index could rack up real money.
  • Network. Each search is one OpenAI round-trip (~200–800 ms). The trigram daemon answers in < 1 ms.
  • Recall is similar at this scale. A well-tuned ig --semantic --top 10 (PMI) catches most of the queries that dense embeddings catch on a 3 k-file repo. Embeddings only start to pull clearly ahead on 50 k+-file, multi-language polyglot repos.

The fallback for users without an API key is the regular trigram path — sub-ms, no network, no cost.

Two-layer gating

Layer                 Mechanism                           Controls                                          Default
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Compile-time          cargo build --features embed-poc    Whether the subcommand is in the binary at all    absent
Runtime (v1.14.2+)    ig emb on / off                     Whether it executes when present                  disabled

Both layers are independent. The runtime toggle is always available — even on a binary built without --features embed-poc, ig emb on works and the flag is persisted; the embed-poc subcommand itself just isn't there until you rebuild.
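
A minimal sketch of how the two layers might compose (illustrative only; the module items and function names below are assumptions, not the actual source):

// Layer 1 (compile-time): the module, and with it the subcommand, only
// exists when the crate is built with --features embed-poc.
#[cfg(feature = "embed-poc")]
mod embed_poc;

// Layer 2 (runtime): even when compiled in, every operation first consults
// the persisted toggle and fails closed.
#[cfg(feature = "embed-poc")]
fn dispatch_embed_poc(op: embed_poc::Op) -> Result<(), String> {
    if !embed_poc::runtime_enabled() {        // reads ~/.config/ig/embed.toml
        return Err("embeddings are disabled. Enable with:  ig emb on".into());
    }
    embed_poc::run(op)
}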

Runtime toggle — ig emb

# Inspect the current state (default: disabled)
ig emb status

# Enable — accepts: on, true, 1, yes, y, enable, enabled
ig emb on

# Disable — accepts: off, false, 0, no, n, disable, disabled
ig emb off

State persisted at ~/.config/ig/embed.toml:

# Runtime toggle for ig emb — overridable with ig emb on/off
enabled = false

When the runtime toggle is off and someone calls ig embed-poc <op>:

$ ig embed-poc hello "test"
Error: embeddings are disabled.
Enable with:  ig emb on
(or build a binary without the embed-poc feature to remove the subcommand entirely.)

Fail-closed: if the config file is unreadable or malformed, embeddings stay off.
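
Concretely, the fail-closed read can be as small as this (a sketch; the struct and function names are assumptions, and it presumes the serde and toml crates):

use serde::Deserialize;

#[derive(Deserialize)]
struct EmbedToggle {
    #[serde(default)]              // a missing key counts as false
    enabled: bool,
}

/// Returns true only if ~/.config/ig/embed.toml exists, parses, and says so.
/// Every failure mode keeps embeddings off.
fn runtime_enabled() -> bool {
    let Ok(home) = std::env::var("HOME") else { return false };
    let path = std::path::Path::new(&home).join(".config/ig/embed.toml");
    match std::fs::read_to_string(path) {
        Ok(text) => toml::from_str::<EmbedToggle>(&text)
            .map(|t| t.enabled)
            .unwrap_or(false),     // malformed file → off
        Err(_) => false,           // missing or unreadable → off
    }
}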

Quickstart

# 1. Build with the feature on (compile-time opt-in)
cargo build --release --features embed-poc
cp target/release/ig ~/.local/bin/ig

# 2. Drop your OpenAI key in either ~/.config/ig/config.toml OR a project .env
mkdir -p ~/.config/ig && cat > ~/.config/ig/config.toml <<EOF
[providers.openai]
api_key = "sk-proj-XXXX"
default_model = "text-embedding-3-small"
EOF
chmod 600 ~/.config/ig/config.toml

# 3. Flip the runtime toggle ON (default is OFF, even when feature is built in)
ig emb on

# 4. Smoke-test — Phase 1: 1 string → 1 vector → console
ig embed-poc hello "function cancelSubscription(userId)"

# 5. Index a directory — Phase 2: chunk + embed + JSON store
ig embed-poc index ./src

# 6. Search — Phase 2: top-N cosine
ig embed-poc search "function that cancels a Stripe subscription"

# 7. (optional) — Phase 3: tiny_http JSON server + React SPA
ig embed-poc serve --port 7877 --ui ui/dist

# 8. When you're done — flip it back off
ig emb off

The pipeline

ig embed-poc index ./src
   │
   ├─▶ walk files (respects .ig/ excludes)
   ├─▶ chunk each file: 40 lines, 5-line overlap
   ├─▶ batch-embed via OpenAI /v1/embeddings (size 100, ureq sync)
   │     model = text-embedding-3-small
   │     dim   = 1536, L2-normalised → cosine = dot product
   ├─▶ persist as JSON  ──▶  .ig/poc-embeddings.json
   │     (~30 MB on a 3 k-file repo, deliberately readable: cat | jq)
   └─▶ console:  "Embedded 768 chunks · $0.0046 · 11.3 s"

ig embed-poc search "<query>"
   │
   ├─▶ embed the query (1 OpenAI call, ~$0.0000002)
   ├─▶ rayon par_iter cosine over the JSON store
   │     ~3 ms on 247 chunks · ~600 ms on 50 k chunks
   ├─▶ sort descending, take top-N
   └─▶ stdout:  file:lines · score · 5-line preview
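
Two of those steps are worth seeing in code. First the chunker: a sliding window of 40 lines that steps forward 35 lines at a time, so consecutive chunks share 5 lines of context. A minimal sketch (the real chunk.rs may differ in details):

/// Split a file's lines into 40-line chunks with a 5-line overlap,
/// returning 1-based (start_line, end_line) plus the chunk text.
fn chunk_lines(lines: &[&str]) -> Vec<(usize, usize, String)> {
    const SIZE: usize = 40;
    const OVERLAP: usize = 5;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < lines.len() {
        let end = (start + SIZE).min(lines.len());
        chunks.push((start + 1, end, lines[start..end].join("\n")));
        if end == lines.len() { break; }
        start = end - OVERLAP;     // step forward 35 lines
    }
    chunks
}

Then the batch call to /v1/embeddings, roughly what openai.rs has to do per batch of 100 chunk texts. The sketch assumes ureq 2.x with its json feature plus serde/serde_json; the helper name is illustrative:

use serde::Deserialize;

#[derive(Deserialize)]
struct EmbeddingItem { embedding: Vec<f32> }

#[derive(Deserialize)]
struct Usage { total_tokens: u64 }

#[derive(Deserialize)]
struct EmbeddingsResponse { data: Vec<EmbeddingItem>, usage: Usage }

/// Embed up to 100 texts in one synchronous request; returns the vectors
/// in input order plus the token count OpenAI billed for the batch.
fn embed_batch(api_key: &str, inputs: &[String])
    -> Result<(Vec<Vec<f32>>, u64), Box<dyn std::error::Error>>
{
    let resp: EmbeddingsResponse = ureq::post("https://api.openai.com/v1/embeddings")
        .set("Authorization", &format!("Bearer {api_key}"))
        .send_json(serde_json::json!({
            "model": "text-embedding-3-small",
            "input": inputs,       // the endpoint accepts an array of strings
        }))?
        .into_json()?;

    let vectors = resp.data.into_iter().map(|d| d.embedding).collect();
    Ok((vectors, resp.usage.total_tokens))
}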

What an embedding actually is

Phase 1 (ig embed-poc hello <text>) exists for one reason: to make the abstract concept tangible. Run it once and you'll see exactly what comes back from OpenAI.

$ ig embed-poc hello "function cancelSubscription(userId) { ... }"

Provider     : openai
Model        : text-embedding-3-small
Input tokens : 12
Cost         : $0.00000024
Vector dim   : 1536
First 10     : [-0.0123, 0.0456, -0.0211, 0.0089, ..., 0.0334]
L2 norm      : 1.0000   ← OpenAI vectors are L2-normalized,
                          so cosine(a, b) = dot(a, b): no division step needed

A 1 536-dimensional unit vector. Two semantically similar strings produce vectors with a high dot product (typically 0.4–0.7 for code/code); unrelated strings cluster around 0.1–0.2. That's the whole magic.

The store format — deliberately readable

The Phase-2 store is plain JSON, not bincode. You can cat .ig/poc-embeddings.json | jq '.chunks[0]' and inspect a single chunk plus its full vector. This is intentional: if we later switch to bincode + HNSW, you'll already know exactly what got replaced.

{
  "version": "poc-1",
  "model": "text-embedding-3-small",
  "dim": 1536,
  "provider": "openai",
  "total_tokens": 156783,
  "total_cost_usd": 0.00313566,
  "chunks": [
    {
      "id": 0,
      "file": "src/embed_poc/config.rs",
      "start_line": 1,
      "end_line": 40,
      "tokens": 312,
      "embedding": [0.0123, -0.0456, /* … 1534 more … */, 0.0089]
    },
    /* … */
  ]
}
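
The matching Rust types in store.rs are presumably little more than two serde structs (field names read straight off the JSON above; the derives and integer widths here are assumptions):

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
pub struct Store {
    pub version: String,           // "poc-1"
    pub model: String,             // "text-embedding-3-small"
    pub dim: usize,                // 1536
    pub provider: String,          // "openai"
    pub total_tokens: u64,
    pub total_cost_usd: f64,
    pub chunks: Vec<Chunk>,
}

#[derive(Serialize, Deserialize)]
pub struct Chunk {
    pub id: u32,
    pub file: String,
    pub start_line: u32,
    pub end_line: u32,
    pub tokens: u32,
    pub embedding: Vec<f32>,       // 1 536 values, L2-normalised
}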

Search math — brute-force cosine

No HNSW, no FAISS, no PCA. The whole search is a par_iter dot product across the chunk array:

pub fn rank<'a>(query: &[f32], store: &'a Store, top_n: usize)
    -> Vec<(f32, &'a Chunk)>
{
    let mut scored: Vec<_> = store.chunks
        .par_iter()                              // rayon
        .map(|c| (dot(query, &c.embedding), c)) // L2-normalised → cosine = dot
        .collect();
    scored.par_sort_unstable_by(|a, b|
        b.0.partial_cmp(&a.0).unwrap()
    );
    scored.truncate(top_n);
    scored
}
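
rank leans on a dot helper that isn't shown; with unit-length vectors it is the cosine similarity. A plausible definition (not necessarily the exact source):

/// Dot product of two equal-length f32 slices. Because both inputs are
/// L2-normalised, this is already the cosine similarity: no norms, no division.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}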

Latency on a Mac M4 Max:

  • 247 chunks × 1 536 dim: ~3 ms cosine + ~250 ms OpenAI = ~250 ms total.
  • 50 k chunks × 1 536 dim: ~600 ms cosine + ~250 ms OpenAI = ~850 ms total.

HNSW becomes worth it past ~100 k chunks. Below that, brute force is simpler, more debuggable, and the OpenAI round-trip dominates anyway.

Cost guard

Pricing as of April 2026 (always re-confirm in the OpenAI console):

Model                     Dim      $/M tokens    Index 3 k files
──────────────────────────────────────────────────────────────────
text-embedding-3-small    1 536    ~$0.02        ~$0.05
text-embedding-3-large    3 072    ~$0.13        ~$0.40

Set hard limits before generating your key.

OpenAI Settings → Billing → Usage limits. For this POC: Soft $2/mo + Hard $5/mo. The API will return errors past the hard limit. $5 covers ~100 full-repo re-indexes on a 3 k-file project.
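
The cost figure the indexer prints is plain arithmetic on the usage counts the API returns, not a hard guard. A sketch of the estimate (the constant and function name are assumptions):

/// List price for text-embedding-3-small in USD per million tokens, per the
/// table above. Re-check the OpenAI console before trusting this constant.
const SMALL_USD_PER_MTOK: f64 = 0.02;

/// Index-time cost estimate: the usage.total_tokens the API reports, priced
/// linearly. 156_783 tokens gives ~$0.0031, matching the sample store above.
fn estimate_cost_usd(total_tokens: u64) -> f64 {
    total_tokens as f64 / 1_000_000.0 * SMALL_USD_PER_MTOK
}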

Security — how the key never reaches the repo

  • .env is gitignored. Verified in .gitignore at the repo root.
  • Pre-commit hook (.githooks/pre-commit) blocks any staged content matching sk-[A-Za-z0-9]{20,} or a non-placeholder OPENAI_API_KEY=.
  • Project key only — the recommended OpenAI key is a Project key (sk-proj-…) with permissions restricted to Models: Read + Model capabilities: Write. It cannot generate text, cannot access your ChatGPT history.
  • If in doubt, revoke. A leaked key costs $0 to revoke + regenerate from the OpenAI dashboard.

Phase 3 — the React SPA

ig embed-poc serve starts a 3-route tiny_http JSON server (sync, blocking, ~200 LoC) plus an optional static SPA. Bound to 127.0.0.1 only — no auth, no TLS, single-user local POC.

# Backend (with the feature on)
ig embed-poc serve --port 7877 --ui ui/dist

# Routes
GET  /api/status               →  { ready, model, dim, chunks, total_tokens, total_cost_usd }
POST /api/search   { query }   →  { hits, openai_ms, cosine_ms, query_cost_usd }
GET  /api/chunks?limit=N       →  { total, returned, dim, chunks: [...] }
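
For a feel of how small the server is, here is what the /api/status handler could look like with tiny_http's blocking loop, reusing the Store struct sketched earlier (a sketch; everything beyond the route table above is an assumption about server.rs):

use tiny_http::{Header, Method, Response, Server};

/// Minimal blocking loop serving only the status route; other routes 404.
fn serve(store: &Store, port: u16) {
    let server = Server::http(("127.0.0.1", port)).expect("bind loopback");
    for request in server.incoming_requests() {
        let (status, body): (u16, String) = match (request.method(), request.url()) {
            (Method::Get, "/api/status") => (200, serde_json::json!({
                "ready": true,
                "model": store.model,
                "dim": store.dim,
                "chunks": store.chunks.len(),
                "total_tokens": store.total_tokens,
                "total_cost_usd": store.total_cost_usd,
            }).to_string()),
            _ => (404, r#"{"error":"not found"}"#.to_string()),
        };
        let json = Header::from_bytes(&b"Content-Type"[..], &b"application/json"[..]).unwrap();
        let _ = request.respond(Response::from_string(body)
            .with_status_code(status)
            .with_header(json));
    }
}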

The SPA (under ui/ in the repo, generated with shadcn@latest + Vite) ships three routes:

  • / — Home. Status cards (chunks, dim, tokens, cost), provider/model, store path.
  • /search — Semantic search. NL input + top-N spinbutton + ranked results with scores. Latency breakdown (OpenAI ms vs cosine ms vs query cost).
  • /inspect — Embedding heatmap. Browse the indexed chunks; click any chunk to render its 1 536-D vector as a 32×48 heatmap (red = positive, blue = negative, black = zero, scaled symmetrically around the max absolute value; a sketch of the mapping follows this list). This is what makes embeddings tangible: two semantically similar functions have visibly similar heatmaps.
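
The colour mapping is simple enough to state in a few lines. The SPA does it in TypeScript inside Inspect.tsx; the same logic, sketched in Rust for concreteness (the function name is hypothetical, not in the source):

/// Map one embedding component to an RGB pixel, scaling symmetrically
/// around the vector's largest absolute value. Hypothetical helper; the
/// real mapping lives in the SPA's TypeScript.
fn component_to_rgb(v: f32, max_abs: f32) -> (u8, u8, u8) {
    let t = if max_abs > 0.0 { (v / max_abs).clamp(-1.0, 1.0) } else { 0.0 };
    let intensity = (t.abs() * 255.0).round() as u8;
    if t > 0.0 {
        (intensity, 0, 0)   // positive → red
    } else if t < 0.0 {
        (0, 0, intensity)   // negative → blue
    } else {
        (0, 0, 0)           // exactly zero → black
    }
}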

Phase 0 — getting an API key safely

  1. Create an OpenAI account at platform.openai.com (a payment card is required; the API is not covered by a ChatGPT subscription).
  2. Set usage limits FIRST (before generating a key): Settings → Billing → Usage limits → Soft $2 + Hard $5.
  3. Generate a Project key (not a User key) at Dashboard → API keys. Permissions: Restricted → Models: Read + Model capabilities: Write only.
  4. Copy sk-proj-… immediately (never reshown).
  5. Persist locally with restrictive perms:
    mkdir -p ~/.config/ig && chmod 700 ~/.config/ig
    cat > ~/.config/ig/config.toml <<EOF
    [providers.openai]
    api_key = "sk-proj-XXXX"
    default_model = "text-embedding-3-small"
    EOF
    chmod 600 ~/.config/ig/config.toml
  6. Smoke-test:
    curl https://api.openai.com/v1/models \
      -H "Authorization: Bearer $(grep api_key ~/.config/ig/config.toml | cut -d'"' -f2)" \
      | head -5
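
For reference, the key lookup in config.rs presumably boils down to something like this (a sketch; which source wins when both exist isn't specified above, and the helper name is an assumption):

/// Resolve the OpenAI API key: ~/.config/ig/config.toml first, then a
/// project-local .env with OPENAI_API_KEY=...
fn resolve_api_key() -> Option<String> {
    // 1. ~/.config/ig/config.toml → [providers.openai] api_key = "..."
    if let Ok(home) = std::env::var("HOME") {
        let path = std::path::Path::new(&home).join(".config/ig/config.toml");
        if let Ok(text) = std::fs::read_to_string(path) {
            if let Ok(value) = text.parse::<toml::Value>() {
                if let Some(key) = value
                    .get("providers")
                    .and_then(|p| p.get("openai"))
                    .and_then(|o| o.get("api_key"))
                    .and_then(|k| k.as_str())
                {
                    return Some(key.to_string());
                }
            }
        }
    }
    // 2. project .env → OPENAI_API_KEY=sk-proj-…
    if let Ok(env) = std::fs::read_to_string(".env") {
        for line in env.lines() {
            if let Some(rest) = line.strip_prefix("OPENAI_API_KEY=") {
                return Some(rest.trim().trim_matches('"').to_string());
            }
        }
    }
    None
}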

What this POC does not do

  • ❌ No HNSW or vector index optimisation — brute-force cosine only.
  • ❌ No multi-provider — OpenAI only (no Voyage, no Cohere, no local model).
  • ❌ No hybrid search — no lexical + vector RRF blending. Use ig --semantic --top N for hybrid.
  • ❌ No auth, no TLS — 127.0.0.1 only.
  • ❌ No automatic cost guard — just an estimate printed at index time.
  • ❌ No async runtime — sync ureq + tiny_http blocking I/O.

This is by design. It's a teaching artefact: read the source, see the cost, decide if industrial-grade embedding search is worth building.

When to use embeddings vs --semantic vs trigram

Trigram (ig "pat")

  • Sub-ms, no network, no cost
  • Exact / regex matching
  • Best for: known-token searches, refactor queries, structural patterns
  • Default.

--semantic (PMI)

  • ~5 ms, no network, no cost
  • Synonyms learned from your repo at index time
  • Best for: queries where you know the concept but not the project's word for it
  • Combine with --top N for BM25 rerank.

Embeddings (POC)

  • ~250–800 ms, OpenAI round-trip, $0.0000002 per query
  • True natural-language understanding (cross-token, cross-language)
  • Best for: NL queries with no shared token ("function that cancels a Stripe subscription")
  • Opt-in only.

Source layout

src/embed_poc/
├── mod.rs       # entry: run_hello, run_index, run_inspect, run_search
├── config.rs    # parse ~/.config/ig/config.toml + .env
├── openai.rs    # POST /v1/embeddings via ureq, batched
├── chunk.rs     # 40-line chunker, 5-line overlap
├── store.rs     # JSON (de)serialisation of .ig/poc-embeddings.json
├── search.rs    # rayon par_iter cosine
└── server.rs    # tiny_http: /api/status, /api/search, /api/chunks

ui/                                # generated via `bunx shadcn@latest init`
├── src/routes/Home.tsx             # status + nav
├── src/routes/SearchPage.tsx       # NL search + ranked hits
├── src/routes/Inspect.tsx          # 32×48 embedding heatmap
└── src/lib/api.ts                  # typed fetch helpers

Next steps