Embedding
robotmem uses embedding vectors to enable semantic similarity search. Two backends are supported: ONNX (local, default) and Ollama (HTTP API).
Embedder Protocol
All embedding backends implement the same protocol:
```python
from typing import Protocol

class Embedder(Protocol):
    @property
    def available(self) -> bool: ...

    @property
    def unavailable_reason(self) -> str: ...

    @property
    def model(self) -> str: ...

    @property
    def dim(self) -> int: ...

    async def embed_one(self, text: str) -> list[float]: ...

    async def embed_batch(
        self, texts: list[str], batch_size: int = 32
    ) -> list[list[float] | None]: ...

    async def check_availability(self) -> bool: ...

    async def close(self) -> None: ...
```
This protocol-based design allows clean backend switching through the factory function:
```python
embedder = create_embedder(config)
# config.embed_backend == "onnx"   → FastEmbedEmbedder
# config.embed_backend == "ollama" → OllamaEmbedder
```
ONNX Backend (Default)
The ONNX backend uses fastembed by Qdrant for local CPU inference with zero external service dependencies.
Key Properties
| Property | Value |
|---|---|
| Model | BAAI/bge-small-en-v1.5 |
| Dimension | 384 |
| Model Size | ~67MB (auto-downloaded) |
| Latency | ~5ms/query |
| Dependencies | fastembed (108KB wheel, no PyTorch) |
| Execution | Pure CPU, ONNX Runtime |
| Cache | ~/.cache/fastembed/ |
Initialization
The encoder is lazily initialized on first use (not at startup):
```python
import threading

class FastEmbedEmbedder:
    def __init__(self, model: str, dim: int, cache_dir: str):
        self._model_name = model
        self._dim = dim
        self._cache_dir = cache_dir
        self._encoder = None  # lazy init
        self._init_lock = threading.Lock()

    def _ensure_encoder(self):
        """Thread-safe lazy init: loads the model at most once."""
        if self._encoder is not None:
            return
        with self._init_lock:
            if self._encoder is not None:
                return
            from fastembed import TextEmbedding
            self._encoder = TextEmbedding(model_name=self._model_name)
```
Key design decisions:
- Lazy loading: Model downloaded/loaded only when first embedding is needed
- Thread-safe: threading.Lock() prevents duplicate model loading
- Double-check locking: Avoids lock contention after initialization
Async Execution
ONNX inference is synchronous, so it's wrapped in run_in_executor to avoid blocking the event loop:
```python
async def embed_one(self, text: str) -> list[float]:
    loop = asyncio.get_running_loop()
    embeddings = await loop.run_in_executor(
        None, lambda: list(self._encoder.embed([text]))
    )
    return embeddings[0].tolist()
```
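The pattern can be illustrated in isolation with a dummy CPU-bound function standing in for ONNX inference (a sketch, not robotmem code):

```python
import asyncio

def fake_embed(text: str) -> list[float]:
    # Placeholder for the blocking self._encoder.embed(...) call
    return [float(len(text))]

async def main() -> list[float]:
    loop = asyncio.get_running_loop()
    # Offload the blocking call to the default thread pool so the
    # event loop stays responsive
    return await loop.run_in_executor(None, fake_embed, "ping")

print(asyncio.run(main()))  # → [4.0]
```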
Availability Check
```python
async def check_availability(self) -> bool:
    # 1. Load model (lazy init)
    self._ensure_encoder()
    # 2. Test embed "ping"
    loop = asyncio.get_running_loop()
    test_result = await loop.run_in_executor(
        None, lambda: list(self._encoder.embed(["ping"]))
    )
    # 3. Verify dimension matches config
    return len(test_result[0]) == self._dim
```
Failure modes:
- fastembed not installed → clear error message
- Dimension mismatch → suggests updating onnx_dim config
- Model download fails → reports the exception
Configuration
```json
{
  "embed_backend": "onnx",
  "onnx_model": "BAAI/bge-small-en-v1.5",
  "onnx_dim": 384,
  "fastembed_cache_dir": ""
}
```
Set fastembed_cache_dir to override the default ~/.cache/fastembed/ location.
Ollama Backend
The Ollama backend connects to a local or remote Ollama server via HTTP API, supporting both native Ollama API and OpenAI-compatible endpoints.
Key Properties
| Property | Value |
|---|---|
| Default Model | nomic-embed-text |
| Dimension | 768 |
| Model Size | ~274MB |
| Latency | ~20-50ms/query |
| Dependencies | Running Ollama server |
| API Modes | ollama (native) or openai_compat |
HTTP Client
Uses httpx.AsyncClient with connection pooling:
```python
self._client = httpx.AsyncClient(
    base_url=self._ollama_url,
    timeout=httpx.Timeout(connect=3.0, read=10.0, write=10.0, pool=10.0),
    transport=httpx.AsyncHTTPTransport(
        limits=httpx.Limits(max_connections=10, max_keepalive_connections=5),
    ),
)
```
The client is lazily created with an asyncio lock to prevent race conditions.
Retry Strategy
Every embed call retries with exponential backoff:

```
Attempt 1 → fail → wait 1.0s
Attempt 2 → fail → wait 2.0s
Attempt 3 → fail → raise exception
```

Parameters:
- Max retries: 3
- Backoff base: 1.0s
- Total timeout: 30s (caps even pending retries)
- Concurrent batch limit: 4 (semaphore-controlled)
```python
async def _embed_one_inner(self, text: str) -> list[float]:
    for attempt in range(self._MAX_RETRIES):
        try:
            resp = await client.post(endpoint, json=payload)
            resp.raise_for_status()
            return self._parse_embeddings(resp.json())[0]
        except (httpx.ConnectError, httpx.TimeoutException):
            if attempt == self._MAX_RETRIES - 1:
                raise  # out of retries
            wait = 1.0 * (2 ** attempt)  # 1s, then 2s
            await asyncio.sleep(wait)
```
Batch Embedding
For multiple texts, batches are processed with controlled concurrency:
```python
async def embed_batch(self, texts, batch_size=32):
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    sem = asyncio.Semaphore(4)  # Max 4 concurrent batches

    async def _limited(batch, index):
        async with sem:  # hold a semaphore slot while embedding this batch
            return await self._embed_batch_inner(batch)

    results = await asyncio.gather(
        *[_limited(b, i) for i, b in enumerate(batches)],
        return_exceptions=True,
    )
    # Failed batches → None fill (partial success)
```
# Failed batches → None fill (partial success)
Failed batch positions are filled with None, allowing partial success — the caller decides how to handle missing embeddings.
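For example, a caller might pair each text with its result and set aside the ones that failed (a sketch of caller-side handling, not robotmem's actual code):

```python
texts = ["alpha", "beta", "gamma"]
# Suppose embed_batch returned this, with "beta"'s batch having failed:
results = [[0.1, 0.2], None, [0.3, 0.4]]

embedded, missing = [], []
for text, vec in zip(texts, results):
    if vec is None:
        missing.append(text)          # retry later or rely on BM25
    else:
        embedded.append((text, vec))

print(missing)  # → ['beta']
```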
API Compatibility
Two API modes are supported:
| Mode | Endpoint | Request Format | Response Format |
|---|---|---|---|
| ollama | /api/embed | {"model": "...", "input": "..."} | {"embeddings": [[...]]} |
| openai_compat | /v1/embeddings | {"model": "...", "input": "..."} | {"data": [{"embedding": [...], "index": 0}]} |
OpenAI-compatible mode sorts responses by index field to handle out-of-order returns.
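A parser covering both shapes might look like this (a standalone sketch of the parsing step; the function name is illustrative):

```python
def parse_embeddings(data: dict, api_mode: str) -> list[list[float]]:
    if api_mode == "ollama":
        # Native API returns vectors already in input order
        return data["embeddings"]
    # openai_compat: sort by "index" to handle out-of-order returns
    items = sorted(data["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```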
Availability Check (Ollama Mode)
Three-phase verification:
```
Phase 1: Identity Check
  GET /api/version
  → verify "version" in response (not a different service on the same port)

Phase 2: Model Check
  GET /api/tags
  → verify the target model is downloaded

Phase 3: Embed Test
  POST /api/embed {"model": "...", "input": "ping"}
  → verify a non-empty vector is returned
```
Each phase has specific error messages:
- Port not Ollama → "Check port conflict"
- Model not found → "Run: ollama pull <model>"
- Embed timeout → "Possibly out of memory"
Configuration
```json
{
  "embed_backend": "ollama",
  "embedding_model": "nomic-embed-text",
  "embedding_dim": 768,
  "ollama_url": "http://localhost:11434",
  "embed_api": "ollama"
}
```
For OpenAI-compatible servers:
```json
{
  "embed_backend": "ollama",
  "embed_api": "openai_compat",
  "embedding_model": "your-model",
  "embedding_dim": 768,
  "ollama_url": "http://your-server:8080"
}
```
Service Cooldown
When the embedding service fails, a ServiceCooldown mechanism prevents repeated failed connections:
```
Failure 1  → cooldown 60s
Failure 2  → cooldown 120s
Failure 3  → cooldown 240s
Failure 4+ → cooldown 300s (max)
Success    → reset counter
```
```python
class ServiceCooldown:
    base_cooldown = 60.0    # Initial cooldown
    max_cooldown = 300.0    # Maximum cooldown (5 minutes)
    backoff_factor = 2.0    # Exponential multiplier

    @property
    def current_backoff(self):
        return min(
            self.base_cooldown * (self.backoff_factor ** (self.failures - 1)),
            self.max_cooldown,
        )
```
During cooldown:
- embedder.available returns False
- recall degrades to BM25-only mode
- No embedding requests are attempted
- Cooldown expires → next check_availability() call retries
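The full lifecycle can be sketched as a small self-contained class (method names such as record_failure are illustrative, not necessarily robotmem's):

```python
import time

class ServiceCooldown:
    base_cooldown = 60.0
    max_cooldown = 300.0
    backoff_factor = 2.0

    def __init__(self):
        self.failures = 0
        self._until = 0.0

    @property
    def current_backoff(self) -> float:
        return min(
            self.base_cooldown * (self.backoff_factor ** (self.failures - 1)),
            self.max_cooldown,
        )

    @property
    def in_cooldown(self) -> bool:
        return time.monotonic() < self._until

    def record_failure(self) -> None:
        self.failures += 1
        self._until = time.monotonic() + self.current_backoff

    def record_success(self) -> None:
        self.failures = 0     # success resets the counter
        self._until = 0.0
```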
Vector Storage
Embeddings are stored in two places:
| Location | Format | Purpose |
|---|---|---|
| memories.embedding | BLOB (float32 array) | Backup / re-indexing |
| memories_vec (vec0) | vec0 virtual table | KNN search |
Conversion between Python floats and SQLite blobs:
```python
import struct

def floats_to_blob(floats: list[float]) -> bytes:
    return struct.pack(f'{len(floats)}f', *floats)

def blob_to_floats(blob: bytes) -> list[float]:
    count = len(blob) // 4  # 4 bytes per float32
    return list(struct.unpack(f'{count}f', blob))
```
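A quick round trip with these helpers (redefined here so the snippet runs standalone). Note the pack format is float32, so values not exactly representable in 32 bits lose precision through the round trip:

```python
import struct

def floats_to_blob(floats: list[float]) -> bytes:
    return struct.pack(f'{len(floats)}f', *floats)

def blob_to_floats(blob: bytes) -> list[float]:
    return list(struct.unpack(f'{len(blob) // 4}f', blob))

vec = [0.25, -1.5, 3.0]            # all exactly representable in float32
blob = floats_to_blob(vec)
assert len(blob) == 4 * len(vec)   # 4 bytes per float32
assert blob_to_floats(blob) == vec
```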
Backend Comparison
| Dimension | ONNX (default) | Ollama |
|---|---|---|
| Setup | Zero config | Requires ollama serve + model pull |
| Speed | ~5ms/query | ~20-50ms/query |
| Model size | 67MB (auto-downloaded) | 274MB (nomic-embed-text) |
| CPU/GPU | Pure CPU | CPU (GPU optional) |
| Offline | Fully offline after first download | Requires local Ollama server |
| Dimension | 384d | 768d |
| Quality | MTEB retrieval 51.68 | Higher for some tasks |
| Multilingual | Limited | Better with multilingual models |
| Failure mode | Process crash → restart needed | HTTP timeout → graceful degradation |
| Retry | No retry (local, fast) | 3 retries with exponential backoff |
When to Choose ONNX
- Embedded/edge deployment with no external services
- Low-latency requirements (<10ms)
- English-primary workloads
- Minimal setup preferred
When to Choose Ollama
- Multilingual requirements (CJK, etc.)
- Higher embedding quality needed
- GPU available for faster inference
- Already running Ollama for other tasks
Graceful Degradation
The embedding system is designed to never block core functionality:
| Failure | Impact | Behavior |
|---|---|---|
| ONNX model not installed | No vector search | BM25-only search works fine |
| Ollama server down | No vector search | BM25-only, ServiceCooldown activated |
| Embedding fails for one memory | That memory has no vector | Still findable via BM25 |
| sqlite-vec not installed | No vec0 table | BM25-only, embeddings still stored in memories.embedding |
The learn tool always succeeds even if embedding fails — the memory is stored without an embedding, and can be backfilled later using get_memories_missing_embedding().
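A backfill pass might look like this, assuming a store API alongside get_memories_missing_embedding() (save_embedding() and the .id/.content attributes are illustrative names, not confirmed robotmem API):

```python
async def backfill_embeddings(store, embedder, batch_size: int = 32) -> int:
    """Embed memories that have no vector yet; returns how many succeeded."""
    memories = await store.get_memories_missing_embedding()
    vectors = await embedder.embed_batch(
        [m.content for m in memories], batch_size=batch_size
    )
    saved = 0
    for memory, vec in zip(memories, vectors):
        if vec is not None:               # skip texts whose batch failed
            await store.save_embedding(memory.id, vec)
            saved += 1
    return saved
```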