Embedding
robotmem uses embedding vectors to enable semantic similarity search. Two backends are supported: ONNX (local, default) and Ollama (HTTP API).
Embedder Protocol
All embedding backends implement the same protocol:
```python
from typing import Protocol

class Embedder(Protocol):
    @property
    def available(self) -> bool: ...

    @property
    def unavailable_reason(self) -> str: ...

    @property
    def model(self) -> str: ...

    @property
    def dim(self) -> int: ...

    async def embed_one(self, text: str) -> list[float]: ...

    async def embed_batch(
        self, texts: list[str], batch_size: int = 32
    ) -> list[list[float] | None]: ...

    async def check_availability(self) -> bool: ...

    async def close(self) -> None: ...
```
This protocol-based design allows clean backend switching through the factory function:
```python
embedder = create_embedder(config)
# config.embed_backend == "onnx"   → FastEmbedEmbedder
# config.embed_backend == "ollama" → OllamaEmbedder
```
ONNX Backend (Default)
The ONNX backend uses fastembed by Qdrant for local CPU inference with zero external service dependencies.
Key Properties
| Property | Value |
|---|---|
| Model | BAAI/bge-small-en-v1.5 |
| Dimension | 384 |
| Model Size | ~67MB (auto-downloaded) |
| Latency | ~5ms/query |
| Dependencies | fastembed (108KB wheel, no PyTorch) |
| Execution | Pure CPU, ONNX Runtime |
| Cache | ~/.cache/fastembed/ |
Initialization
The encoder is lazily initialized on first use (not at startup):
```python
import threading

class FastEmbedEmbedder:
    def __init__(self, model: str, dim: int, cache_dir: str):
        self._model_name = model
        self._dim = dim
        self._cache_dir = cache_dir
        self._encoder = None  # lazy init
        self._init_lock = threading.Lock()

    def _ensure_encoder(self):
        """Thread-safe lazy init: loads the model at most once."""
        if self._encoder is not None:
            return
        with self._init_lock:
            if self._encoder is not None:
                return
            from fastembed import TextEmbedding
            self._encoder = TextEmbedding(model_name=self._model_name)
```
Key design decisions:
- Lazy loading: Model downloaded/loaded only when first embedding is needed
- Thread-safe: threading.Lock() prevents duplicate model loading
- Double-check locking: Avoids lock contention after initialization
Async Execution
ONNX inference is synchronous, so it's wrapped in run_in_executor to avoid blocking the event loop:
```python
async def embed_one(self, text: str) -> list[float]:
    loop = asyncio.get_running_loop()
    embeddings = await loop.run_in_executor(
        None, lambda: list(self._encoder.embed([text]))
    )
    return embeddings[0].tolist()
```
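The pattern can be illustrated in isolation with a dummy CPU-bound function standing in for ONNX inference (a sketch, not robotmem code):

```python
import asyncio

def fake_embed(text: str) -> list[float]:
    # Placeholder for the blocking self._encoder.embed(...) call
    return [float(len(text))]

async def main() -> list[float]:
    loop = asyncio.get_running_loop()
    # Offload the blocking call to the default thread pool so the
    # event loop stays responsive
    return await loop.run_in_executor(None, fake_embed, "ping")

print(asyncio.run(main()))  # → [4.0]
```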
Availability Check
```python
async def check_availability(self) -> bool:
    # 1. Load model (lazy init)
    self._ensure_encoder()
    # 2. Test embed "ping"
    loop = asyncio.get_running_loop()
    test_result = await loop.run_in_executor(
        None, lambda: list(self._encoder.embed(["ping"]))
    )
    # 3. Verify dimension matches config
    return len(test_result[0]) == self._dim
```
Failure modes:
- fastembed not installed → clear error message
- Dimension mismatch → suggests updating onnx_dim config
- Model download fails → reports the exception
Configuration
```json
{
  "embed_backend": "onnx",
  "onnx_model": "BAAI/bge-small-en-v1.5",
  "onnx_dim": 384,
  "fastembed_cache_dir": ""
}
```
Set fastembed_cache_dir to override the default ~/.cache/fastembed/ location.
Ollama Backend
The Ollama backend connects to a local or remote Ollama server via HTTP API, supporting both native Ollama API and OpenAI-compatible endpoints.
Key Properties
| Property | Value |
|---|---|
| Default Model | nomic-embed-text |
| Dimension | 768 |
| Model Size | ~274MB |
| Latency | ~20-50ms/query |
| Dependencies | Running Ollama server |
| API Modes | ollama (native) or openai_compat |
HTTP Client
Uses httpx.AsyncClient with connection pooling:
```python
self._client = httpx.AsyncClient(
    base_url=self._ollama_url,
    timeout=httpx.Timeout(connect=3.0, read=10.0, write=10.0, pool=10.0),
    transport=httpx.AsyncHTTPTransport(
        limits=httpx.Limits(max_connections=10, max_keepalive_connections=5),
    ),
)
```
The client is lazily created with an asyncio lock to prevent race conditions.
Retry Strategy
Every embed call retries with exponential backoff:

```
Attempt 1 → fail → wait 1.0s
Attempt 2 → fail → wait 2.0s
Attempt 3 → fail → raise exception
```

Parameters:
- Max retries: 3
- Backoff base: 1.0s
- Total timeout: 30s (caps even pending retries)
- Concurrent batch limit: 4 (semaphore-controlled)
```python
async def _embed_one_inner(self, text: str) -> list[float]:
    for attempt in range(self._MAX_RETRIES):
        try:
            resp = await client.post(endpoint, json=payload)
            resp.raise_for_status()
            return self._parse_embeddings(resp.json())[0]
        except (httpx.ConnectError, httpx.TimeoutException):
            if attempt == self._MAX_RETRIES - 1:
                raise  # out of retries
            wait = 1.0 * (2 ** attempt)  # 1s, then 2s
            await asyncio.sleep(wait)
```
Batch Embedding
For multiple texts, batches are processed with controlled concurrency:
```python
async def embed_batch(self, texts, batch_size=32):
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    sem = asyncio.Semaphore(4)  # Max 4 concurrent batches

    async def _limited(batch, index):
        async with sem:  # hold a semaphore slot while embedding this batch
            return await self._embed_batch_inner(batch)

    results = await asyncio.gather(
        *[_limited(b, i) for i, b in enumerate(batches)],
        return_exceptions=True,
    )
    # Failed batches → None fill (partial success)
```
# Failed batches → None fill (partial success)
Failed batch positions are filled with None, allowing partial success — the caller decides how to handle missing embeddings.
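For example, a caller might pair each text with its result and set aside the ones that failed (a sketch of caller-side handling, not robotmem's actual code):

```python
texts = ["alpha", "beta", "gamma"]
# Suppose embed_batch returned this, with "beta"'s batch having failed:
results = [[0.1, 0.2], None, [0.3, 0.4]]

embedded, missing = [], []
for text, vec in zip(texts, results):
    if vec is None:
        missing.append(text)          # retry later or rely on BM25
    else:
        embedded.append((text, vec))

print(missing)  # → ['beta']
```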
API Compatibility
Two API modes are supported:
| Mode | Endpoint | Request Format | Response Format |
|---|---|---|---|
| ollama | /api/embed | {"model": "...", "input": "..."} | {"embeddings": [[...]]} |
| openai_compat | /v1/embeddings | {"model": "...", "input": "..."} | {"data": [{"embedding": [...], "index": 0}]} |
OpenAI-compatible mode sorts responses by index field to handle out-of-order returns.
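A parser covering both shapes might look like this (a standalone sketch of the parsing step; the function name is illustrative):

```python
def parse_embeddings(data: dict, api_mode: str) -> list[list[float]]:
    if api_mode == "ollama":
        # Native API returns vectors already in input order
        return data["embeddings"]
    # openai_compat: sort by "index" to handle out-of-order returns
    items = sorted(data["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```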
Availability Check (Ollama Mode)
Three-phase verification:
```
Phase 1: Identity Check
  GET /api/version
  → verify "version" in response (not a different service on the same port)

Phase 2: Model Check
  GET /api/tags
  → verify the target model is downloaded

Phase 3: Embed Test
  POST /api/embed {"model": "...", "input": "ping"}
  → verify a non-empty vector is returned
```
Each phase has specific error messages:
- Port not Ollama → "Check port conflict"
- Model not found → "Run: ollama pull <model>"
- Embed timeout → "Possibly out of memory"
Configuration
```json
{
  "embed_backend": "ollama",
  "embedding_model": "nomic-embed-text",
  "embedding_dim": 768,
  "ollama_url": "http://localhost:11434",
  "embed_api": "ollama"
}
```
For OpenAI-compatible servers:
```json
{
  "embed_backend": "ollama",
  "embed_api": "openai_compat",
  "embedding_model": "your-model",
  "embedding_dim": 768,
  "ollama_url": "http://your-server:8080"
}
```
Service Cooldown
When the embedding service fails, a ServiceCooldown mechanism prevents repeated failed connections:
```
Failure 1  → cooldown 60s
Failure 2  → cooldown 120s
Failure 3  → cooldown 240s
Failure 4+ → cooldown 300s (max)
Success    → reset counter
```
```python
class ServiceCooldown:
    base_cooldown = 60.0    # Initial cooldown
    max_cooldown = 300.0    # Maximum cooldown (5 minutes)
    backoff_factor = 2.0    # Exponential multiplier

    @property
    def current_backoff(self):
        return min(
            self.base_cooldown * (self.backoff_factor ** (self.failures - 1)),
            self.max_cooldown,
        )
```
During cooldown:
- embedder.available returns False
- recall degrades to BM25-only mode
- No embedding requests are attempted
- Cooldown expires → next check_availability() call retries
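The full lifecycle can be sketched as a small self-contained class (method names such as record_failure are illustrative, not necessarily robotmem's):

```python
import time

class ServiceCooldown:
    base_cooldown = 60.0
    max_cooldown = 300.0
    backoff_factor = 2.0

    def __init__(self):
        self.failures = 0
        self._until = 0.0

    @property
    def current_backoff(self) -> float:
        return min(
            self.base_cooldown * (self.backoff_factor ** (self.failures - 1)),
            self.max_cooldown,
        )

    @property
    def in_cooldown(self) -> bool:
        return time.monotonic() < self._until

    def record_failure(self) -> None:
        self.failures += 1
        self._until = time.monotonic() + self.current_backoff

    def record_success(self) -> None:
        self.failures = 0     # success resets the counter
        self._until = 0.0
```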
Vector Storage
Embeddings are stored in two places:
| Location | Format | Purpose |
|---|---|---|
| memories.embedding | BLOB (float32 array) | Backup / re-indexing |
| memories_vec (vec0) | vec0 virtual table | KNN search |
Conversion between Python floats and SQLite blobs:
```python
import struct

def floats_to_blob(floats: list[float]) -> bytes:
    return struct.pack(f'{len(floats)}f', *floats)

def blob_to_floats(blob: bytes) -> list[float]:
    count = len(blob) // 4  # 4 bytes per float32
    return list(struct.unpack(f'{count}f', blob))
```
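A quick round trip with these helpers (redefined here so the snippet runs standalone). Note the pack format is float32, so values not exactly representable in 32 bits lose precision through the round trip:

```python
import struct

def floats_to_blob(floats: list[float]) -> bytes:
    return struct.pack(f'{len(floats)}f', *floats)

def blob_to_floats(blob: bytes) -> list[float]:
    return list(struct.unpack(f'{len(blob) // 4}f', blob))

vec = [0.25, -1.5, 3.0]            # all exactly representable in float32
blob = floats_to_blob(vec)
assert len(blob) == 4 * len(vec)   # 4 bytes per float32
assert blob_to_floats(blob) == vec
```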
Backend Comparison
| Dimension | ONNX (default) | Ollama |
|---|---|---|
| Setup | Zero config | Requires ollama serve + model pull |
| Speed | ~5ms/query | ~20-50ms/query |
| Model size | 67MB (auto-downloaded) | 274MB (nomic-embed-text) |
| CPU/GPU | Pure CPU | CPU (GPU optional) |
| Offline | Fully offline after first download | Requires local Ollama server |
| Dimension | 384d | 768d |
| Quality | MTEB retrieval 51.68 | Higher for some tasks |
| Multilingual | Limited | Better with multilingual models |
| Failure mode | Process crash → restart needed | HTTP timeout → graceful degradation |
| Retry | No retry (local, fast) | 3 retries with exponential backoff |
When to Choose ONNX
- Embedded/edge deployment with no external services
- Low-latency requirements (<10ms)
- English-primary workloads
- Minimal setup preferred
When to Choose Ollama
- Multilingual requirements (CJK, etc.)
- Higher embedding quality needed
- GPU available for faster inference
- Already running Ollama for other tasks
Graceful Degradation
The embedding system is designed to never block core functionality:
| Failure | Impact | Behavior |
|---|---|---|
| ONNX model not installed | No vector search | BM25-only search works fine |
| Ollama server down | No vector search | BM25-only, ServiceCooldown activated |
| Embedding fails for one memory | That memory has no vector | Still findable via BM25 |
| sqlite-vec not installed | No vec0 table | BM25-only, embeddings still stored in memories.embedding |
The learn tool always succeeds even if embedding fails — the memory is stored without an embedding, and can be backfilled later using get_memories_missing_embedding().
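A backfill pass might look like this, assuming a store API alongside get_memories_missing_embedding() (save_embedding() and the .id/.content attributes are illustrative names, not confirmed robotmem API):

```python
async def backfill_embeddings(store, embedder, batch_size: int = 32) -> int:
    """Embed memories that have no vector yet; returns how many succeeded."""
    memories = await store.get_memories_missing_embedding()
    vectors = await embedder.embed_batch(
        [m.content for m in memories], batch_size=batch_size
    )
    saved = 0
    for memory, vec in zip(memories, vectors):
        if vec is not None:               # skip texts whose batch failed
            await store.save_embedding(memory.id, vec)
            saved += 1
    return saved
```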