Deep Dive: Embedding

robotmem uses embedding vectors to enable semantic similarity search. Two backends are supported: ONNX (local, default) and Ollama (HTTP API).

Embedder Protocol

All embedding backends implement the same protocol:

from typing import Protocol

class Embedder(Protocol):
    @property
    def available(self) -> bool: ...

    @property
    def unavailable_reason(self) -> str: ...

    @property
    def model(self) -> str: ...

    @property
    def dim(self) -> int: ...

    async def embed_one(self, text: str) -> list[float]: ...

    async def embed_batch(self, texts: list[str], batch_size: int = 32) -> list[list[float] | None]: ...

    async def check_availability(self) -> bool: ...

    async def close(self) -> None: ...

This protocol-based design allows clean backend switching through the factory function:

embedder = create_embedder(config)
# config.embed_backend == "onnx"  → FastEmbedEmbedder
# config.embed_backend == "ollama" → OllamaEmbedder
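A minimal sketch of the dispatch, using stand-in classes and config fields mirroring the configuration sections below (the real constructors and field handling are assumptions):

```python
from dataclasses import dataclass


class FastEmbedEmbedder:  # stand-in for the real class
    def __init__(self, model, dim, cache_dir=""):
        self.model, self.dim = model, dim


class OllamaEmbedder:  # stand-in for the real class
    def __init__(self, model, dim, url):
        self.model, self.dim = model, dim


@dataclass
class Config:
    embed_backend: str = "onnx"
    onnx_model: str = "BAAI/bge-small-en-v1.5"
    onnx_dim: int = 384
    fastembed_cache_dir: str = ""
    embedding_model: str = "nomic-embed-text"
    embedding_dim: int = 768
    ollama_url: str = "http://localhost:11434"


def create_embedder(config: Config):
    """Pick a backend from config.embed_backend; fail fast on unknown values."""
    if config.embed_backend == "onnx":
        return FastEmbedEmbedder(config.onnx_model, config.onnx_dim,
                                 config.fastembed_cache_dir)
    if config.embed_backend == "ollama":
        return OllamaEmbedder(config.embedding_model, config.embedding_dim,
                              config.ollama_url)
    raise ValueError(f"unknown embed_backend: {config.embed_backend!r}")
```

Because every backend satisfies the same protocol, callers never branch on the backend type after this point.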

ONNX Backend (Default)

The ONNX backend uses fastembed by Qdrant for local CPU inference with zero external service dependencies.

Key Properties

| Property | Value |
| --- | --- |
| Model | BAAI/bge-small-en-v1.5 |
| Dimension | 384 |
| Model Size | ~67MB (auto-downloaded) |
| Latency | ~5ms/query |
| Dependencies | fastembed (108KB wheel, no PyTorch) |
| Execution | Pure CPU, ONNX Runtime |
| Cache | ~/.cache/fastembed/ |

Initialization

The encoder is lazily initialized on first use (not at startup):

import threading

class FastEmbedEmbedder:
    def __init__(self, model, dim, cache_dir):
        self._model_name = model
        self._dim = dim
        self._cache_dir = cache_dir
        self._encoder = None  # lazy init
        self._init_lock = threading.Lock()

    def _ensure_encoder(self):
        """Thread-safe lazy init — only loads model once"""
        if self._encoder is not None:
            return
        with self._init_lock:
            if self._encoder is not None:
                return
            from fastembed import TextEmbedding
            self._encoder = TextEmbedding(
                model_name=self._model_name,
                cache_dir=self._cache_dir or None,
            )

Key design decisions:

- Lazy loading: Model downloaded/loaded only when first embedding is needed
- Thread-safe: threading.Lock() prevents duplicate model loading
- Double-check locking: Avoids lock contention after initialization

Async Execution

ONNX inference is synchronous, so it's wrapped in run_in_executor to avoid blocking the event loop:

async def embed_one(self, text: str) -> list[float]:
    loop = asyncio.get_running_loop()
    embeddings = await loop.run_in_executor(
        None, lambda: list(self._encoder.embed([text]))
    )
    return embeddings[0].tolist()

Availability Check

async def check_availability(self) -> bool:
    try:
        # 1. Load model (lazy init)
        self._ensure_encoder()
        # 2. Test embed "ping"
        loop = asyncio.get_running_loop()
        test_result = await loop.run_in_executor(
            None, lambda: list(self._encoder.embed(["ping"]))
        )
        # 3. Verify dimension matches config
        return len(test_result[0]) == self._dim
    except Exception as e:
        self._unavailable_reason = str(e)
        return False
Failure modes:

- fastembed not installed → clear error message
- Dimension mismatch → suggests updating onnx_dim config
- Model download fails → reports the exception

Configuration

{
  "embed_backend": "onnx",
  "onnx_model": "BAAI/bge-small-en-v1.5",
  "onnx_dim": 384,
  "fastembed_cache_dir": ""
}

Set fastembed_cache_dir to override the default ~/.cache/fastembed/ location.

Ollama Backend

The Ollama backend connects to a local or remote Ollama server via HTTP API, supporting both native Ollama API and OpenAI-compatible endpoints.

Key Properties

| Property | Value |
| --- | --- |
| Default Model | nomic-embed-text |
| Dimension | 768 |
| Model Size | ~274MB |
| Latency | ~20-50ms/query |
| Dependencies | Running Ollama server |
| API Modes | ollama (native) or openai_compat |

HTTP Client

Uses httpx.AsyncClient with connection pooling:

self._client = httpx.AsyncClient(
    base_url=self._ollama_url,
    timeout=httpx.Timeout(connect=3.0, read=10.0, write=10.0, pool=10.0),
    transport=httpx.AsyncHTTPTransport(
        limits=httpx.Limits(max_connections=10, max_keepalive_connections=5),
    ),
)

The client is lazily created with an asyncio lock to prevent race conditions.
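The same double-checked pattern as the ONNX backend, but with an async lock. A generic sketch (class and method names here are illustrative, not robotmem's):

```python
import asyncio


class LazyResource:
    """Create a resource once via double-checked locking, even when many
    coroutines race on first use."""

    def __init__(self, factory):
        self._factory = factory
        self._value = None
        self._lock = asyncio.Lock()

    async def get(self):
        if self._value is None:            # fast path: no lock once created
            async with self._lock:
                if self._value is None:    # re-check under the lock
                    self._value = self._factory()
        return self._value
```

In the Ollama embedder, the factory would build the httpx.AsyncClient shown above; concurrent first calls then share one pooled client instead of each opening their own.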

Retry Strategy

Every embed call retries with exponential backoff:

Attempt 1 → fail → wait 1.0s
Attempt 2 → fail → wait 2.0s
Attempt 3 → fail → raise exception

Parameters:

- Max retries: 3
- Backoff base: 1.0s
- Total timeout: 30s (kills even pending retries)
- Concurrent batch limit: 4 (semaphore-controlled)

async def _embed_one_inner(self, text):
    for attempt in range(self._MAX_RETRIES):
        try:
            resp = await client.post(endpoint, json=payload)
            resp.raise_for_status()
            return self._parse_embeddings(resp.json())[0]
        except (ConnectError, TimeoutException):
            if attempt == self._MAX_RETRIES - 1:
                raise  # out of retries, propagate to the caller
            wait = 1.0 * (2 ** attempt)  # 1s, 2s
            await asyncio.sleep(wait)
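The 30s total timeout listed above can be enforced by wrapping the whole retry loop, which cancels it even while it is sleeping between attempts. A sketch (the wrapper name is an assumption):

```python
import asyncio


async def embed_with_deadline(embed_inner, text, total_timeout=30.0):
    """Cap the entire retry loop, cancelling even a pending backoff sleep."""
    return await asyncio.wait_for(embed_inner(text), timeout=total_timeout)
```

This guarantees an upper bound on latency regardless of how the per-attempt timeouts and backoff waits add up.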

Batch Embedding

For multiple texts, batches are processed with controlled concurrency:

async def embed_batch(self, texts, batch_size=32):
    batches = [texts[i:i+batch_size] for i in range(0, len(texts), batch_size)]
    sem = asyncio.Semaphore(4)  # Max 4 concurrent batches

    async def _limited(batch):
        async with sem:
            return await self._embed_batch_inner(batch)

    results = await asyncio.gather(
        *[_limited(b) for b in batches],
        return_exceptions=True,
    )
    # Failed batches → None fill (partial success)
Failed batch positions are filled with None, allowing partial success — the caller decides how to handle missing embeddings.
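On the caller's side, that contract reduces to pairing each text with its slot in the result list. A sketch of the handling (the helper name is hypothetical):

```python
def split_by_embedding(texts, vectors):
    """Separate texts whose batch succeeded from those filled with None."""
    embedded, missing = [], []
    for text, vec in zip(texts, vectors):
        if vec is None:
            missing.append(text)           # stored without a vector, BM25-only
        else:
            embedded.append((text, vec))   # stored with its vector
    return embedded, missing
```

Texts in the missing list are still stored and searchable via BM25; they are natural candidates for a later backfill pass.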

API Compatibility

Two API modes are supported:

| Mode | Endpoint | Request Format | Response Format |
| --- | --- | --- | --- |
| ollama | /api/embed | {"model": "...", "input": "..."} | {"embeddings": [[...]]} |
| openai_compat | /v1/embeddings | {"model": "...", "input": "..."} | {"data": [{"embedding": [...], "index": 0}]} |

OpenAI-compatible mode sorts responses by index field to handle out-of-order returns.
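Normalizing the two response shapes is a small pure function. A sketch of what _parse_embeddings might look like, based on the formats in the table (the exact signature is an assumption):

```python
def parse_embeddings(payload: dict, api_mode: str) -> list[list[float]]:
    """Normalize both response shapes to a plain list of vectors."""
    if api_mode == "ollama":
        # Native API: {"embeddings": [[...], ...]}
        return payload["embeddings"]
    # openai_compat: items may arrive out of order, so sort by "index"
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```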

Availability Check (Ollama Mode)

Three-phase verification:

Phase 1: Identity Check
    GET /api/version
    → Verify "version" in response (not a different service on same port)

Phase 2: Model Check
    GET /api/tags
    → Verify target model is downloaded

Phase 3: Embed Test
    POST /api/embed {"model": "...", "input": "ping"}
    → Verify non-empty vector returned

Each phase has specific error messages:

- Port not Ollama → "Check port conflict"
- Model not found → "Run: ollama pull <model>"
- Embed timeout → "Possibly out of memory"

Configuration

{
  "embed_backend": "ollama",
  "embedding_model": "nomic-embed-text",
  "embedding_dim": 768,
  "ollama_url": "http://localhost:11434",
  "embed_api": "ollama"
}

For OpenAI-compatible servers:

{
  "embed_backend": "ollama",
  "embed_api": "openai_compat",
  "embedding_model": "your-model",
  "embedding_dim": 768,
  "ollama_url": "http://your-server:8080"
}

Service Cooldown

When the embedding service fails, a ServiceCooldown mechanism prevents repeated failed connections:

Failure 1 → cooldown 60s
Failure 2 → cooldown 120s
Failure 3 → cooldown 240s
Failure 4+ → cooldown 300s (max)
Success → reset counter

class ServiceCooldown:
    base_cooldown = 60.0    # Initial cooldown
    max_cooldown = 300.0    # Maximum cooldown (5 minutes)
    backoff_factor = 2.0    # Exponential multiplier

    def __init__(self):
        self.failures = 0   # consecutive failures; reset on success

    @property
    def current_backoff(self):
        return min(
            self.base_cooldown * (self.backoff_factor ** (self.failures - 1)),
            self.max_cooldown,
        )

During cooldown:

- embedder.available returns False
- recall degrades to BM25-only mode
- No embedding requests are attempted
- Cooldown expires → next check_availability() call retries
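The schedule above can be sanity-checked with the backoff formula on its own (a standalone restatement, not robotmem's code):

```python
def backoff_for(failures, base=60.0, factor=2.0, cap=300.0):
    """Cooldown length after the n-th consecutive failure."""
    return min(base * factor ** (failures - 1), cap)


# Reproduces the schedule: 60s, 120s, 240s, then capped at 300s
assert [backoff_for(n) for n in (1, 2, 3, 4, 5)] == [60.0, 120.0, 240.0, 300.0, 300.0]
```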

Vector Storage

Embeddings are stored in two places:

| Location | Format | Purpose |
| --- | --- | --- |
| memories.embedding | BLOB (float32 array) | Backup / re-indexing |
| memories_vec | vec0 virtual table | KNN search |

Conversion between Python floats and SQLite blobs:

import struct

def floats_to_blob(floats: list[float]) -> bytes:
    return struct.pack(f'{len(floats)}f', *floats)

def blob_to_floats(blob: bytes) -> list[float]:
    count = len(blob) // 4  # 4 bytes per float32
    return list(struct.unpack(f'{count}f', blob))

Backend Comparison

| Aspect | ONNX (default) | Ollama |
| --- | --- | --- |
| Setup | Zero config | Requires ollama serve + model pull |
| Speed | ~5ms/query | ~20-50ms/query |
| Model size | 67MB (auto-downloaded) | 274MB (nomic-embed-text) |
| CPU/GPU | Pure CPU | CPU (GPU optional) |
| Offline | Fully offline after first download | Requires local Ollama server |
| Dimension | 384d | 768d |
| Quality | MTEB retrieval 51.68 | Higher for some tasks |
| Multilingual | Limited | Better with multilingual models |
| Failure mode | Process crash → restart needed | HTTP timeout → graceful degradation |
| Retry | No retry (local, fast) | 3 retries with exponential backoff |

When to Choose ONNX

- You want zero-setup embedding that works fully offline after the first model download
- Latency matters (~5ms/query vs ~20-50ms)
- You don't want to run or maintain an extra service

When to Choose Ollama

- You need higher-quality or multilingual embeddings from larger models
- You already run an Ollama server (optionally GPU-accelerated)
- You want to point robotmem at a remote or OpenAI-compatible embedding endpoint

Graceful Degradation

The embedding system is designed to never block core functionality:

| Failure | Impact | Behavior |
| --- | --- | --- |
| ONNX model not installed | No vector search | BM25-only search works fine |
| Ollama server down | No vector search | BM25-only, ServiceCooldown activated |
| Embedding fails for one memory | That memory has no vector | Still findable via BM25 |
| sqlite-vec not installed | No vec0 table | BM25-only, embeddings still stored in memories.embedding |

The learn tool always succeeds even if embedding fails — the memory is stored without an embedding, and can be backfilled later using get_memories_missing_embedding().