# Deduplication & Consolidation
robotmem prevents redundant memories through a multi-layer deduplication pipeline at write time, and consolidates similar memories at episode end.
## Deduplication Pipeline

When `learn` is called, new memories pass through three layers of duplicate detection before being stored.
### Overview

```
learn(insight="grip_force=12.5N works best")
    │
    ├── Layer 1: Exact Match
    │     SHA-256(content)[:16] → content_hash lookup
    │     O(1) — instant reject
    │
    ├── Layer 2: Jaccard Token Similarity
    │     FTS5 candidates → pairwise token overlap
    │     Threshold: > 0.70 → duplicate
    │
    └── Layer 3: Cosine Vector Similarity
          embed_one(content) → vec_search top-3
          Threshold: > 0.85 → duplicate
    │
    ▼
is_dup=True  → return {status: "duplicate", method, existing_id, similarity}
is_dup=False → proceed to insert
```
### Layer 1: Exact Match

The fastest check — an O(1) hash lookup:

```python
content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]

# Check in DB
existing = conn.execute(
    "SELECT 1 FROM memories WHERE content_hash = ? AND collection = ? "
    "AND status = 'active' LIMIT 1",
    (content_hash, collection),
).fetchone()
```
- Uses the first 16 hex characters of the SHA-256 digest
- Scoped to the same collection and active status
- Catches exact (character-for-character) duplicates
- Indexed via `idx_mem_hash` for fast lookup
### Layer 2: Jaccard Token Similarity

For near-duplicates with minor wording differences:

```python
def jaccard_similarity(a: str, b: str) -> float:
    tokens_a = set(a.lower().split()) - STOPWORDS
    tokens_b = set(b.lower().split()) - STOPWORDS
    if not tokens_a or not tokens_b:
        return 0.0  # guard: avoid division by zero when all tokens are stopwords
    intersection = tokens_a & tokens_b
    union = tokens_a | tokens_b
    return len(intersection) / len(union)
```
Process:
1. Use FTS5 to find top-5 candidates matching the new content
2. Calculate Jaccard similarity for each candidate
3. Threshold: > 0.70 → mark as duplicate
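As a worked example, here is the Jaccard check run end-to-end. The stopword list here is a small illustrative subset so the snippet is self-contained; the full list is shown in the next section.

```python
STOPWORDS = frozenset("a an the is to of and in for on with".split())  # toy subset

def jaccard_similarity(a: str, b: str) -> float:
    tokens_a = set(a.lower().split()) - STOPWORDS
    tokens_b = set(b.lower().split()) - STOPWORDS
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

a = "grip_force=12.5N works best for cups"
b = "grip_force=12.5N works for cups"
# 3 shared content tokens, 4 tokens total → 0.75, above the 0.70 bar
print(jaccard_similarity(a, b))  # → 0.75
```

Because stopwords ("for") are stripped first, only content-bearing tokens count toward the overlap.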
Stopwords are filtered in both English and Chinese:

```python
STOPWORDS = frozenset(
    "的 了 在 是 把 被 给 和 与 从 到 也 都 就 对 又 所 而 且 但 或 "
    "a an the is are was were be to of and in for on with "
    "preference constraint decision observation code config pattern "
    "architecture root_cause tradeoff revert".split()
)
```
Category-related terms are included as stopwords to prevent false matches between memories that only share classification keywords.
### Layer 3: Cosine Vector Similarity

For semantic duplicates that use different words:

```python
# Only runs if embedder is available
embedding = await embedder.embed_one(assertion)
vec_results = db_cog.vec_search_memories(
    query_embedding=embedding, collection=collection, limit=3
)
for vr in vec_results:
    cosine_sim = 1.0 - vr["distance"]
    if cosine_sim >= 0.85:
        return DedupResult(is_dup=True, method="cosine", similarity=cosine_sim)
```
- Only runs when an embedder backend is available
- Falls back gracefully — if embedding fails, Layer 2 still catches most duplicates
- Cannot run inside an already-running event loop (logs a debug message and skips the check)
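The snippet above converts a stored distance back to a similarity with `1.0 - vr["distance"]`, which assumes the vector index reports cosine *distance*. A minimal pure-Python sketch of both sides of that relationship (illustrative only, no embedder required):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def is_cosine_dup(distance: float, threshold: float = 0.85) -> bool:
    """Invert a cosine distance and compare against the dedup threshold."""
    return (1.0 - distance) >= threshold
```

A distance of 0.10 means similarity 0.90 (duplicate), while 0.20 means 0.80 (below the 0.85 bar).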
### Similar (Non-Duplicate) Tracking

Memories with moderate similarity (0.40–0.70 Jaccard or 0.40–0.85 cosine) are tracked as `similar_facts` in the result:

```python
{
    "is_dup": False,
    "similar_facts": [
        {"id": 15, "assertion": "similar content...", "similarity": 0.55}
    ],
    "method": "none",
    "similarity": 0.55
}
```
These are returned to the caller for informational purposes but do not block insertion.
### Session-Scoped Cosine Dedup

An additional dedup layer operates within the current session:

```python
def check_session_cosine_dup(assertion, session_id, collection, db_cog, embedder):
    """Within the same session, check for semantic duplicates."""
    embedding = embedder.embed_one(assertion)
    results = db_cog.vec_search_memories(embedding, collection, limit=20)
    for r in results:
        if r["session_id"] != session_id:
            continue  # only compare against memories from this session
        cosine_sim = 1.0 - r["distance"]
        if cosine_sim >= 0.85:
            return DedupResult(is_dup=True, method="session_cosine")
    return DedupResult(is_dup=False)
```
This catches cases where a robot records the same observation multiple times within a single episode.
### Dedup Thresholds
| Layer | Method | Duplicate Threshold | Similar Threshold |
|---|---|---|---|
| 1 | Exact hash | 1.0 (exact) | — |
| 2 | Jaccard tokens | > 0.70 | > 0.40 |
| 3 | Cosine vectors | > 0.85 | > 0.40 |
| 3b | Session cosine | > 0.85 | — |
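The thresholds in the table amount to a three-way classification per method. `classify` below is an illustrative helper, not part of robotmem's API:

```python
DUP_THRESHOLDS = {"jaccard": 0.70, "cosine": 0.85}
SIMILAR_MIN = 0.40

def classify(method: str, score: float) -> str:
    """Map a similarity score to duplicate / similar / distinct."""
    if score > DUP_THRESHOLDS[method]:
        return "duplicate"
    if score > SIMILAR_MIN:
        return "similar"
    return "distinct"
```

Note that the same score lands in different buckets depending on the method: 0.75 is a duplicate under Jaccard but only "similar" under cosine.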
### Batch Cleanup

For existing databases with accumulated duplicates, `cleanup_exact_duplicates()` provides a batch cleanup tool:

```python
# Preview changes first
ops = cleanup_exact_duplicates(db_cog, collection="default", dry_run=True)
# Returns: [{"old_id": 5, "keep_id": 3, "assertion_preview": "..."}]

# Execute cleanup
ops = cleanup_exact_duplicates(db_cog, collection="default", dry_run=False)
```
Safety features:
- Dry run mode: Preview changes before applying
- Max 200 operations per run: Prevents runaway cleanup
- Keeps highest confidence: Among duplicates, the one with highest confidence (then newest) is preserved
- Soft delete: Duplicates are marked superseded, not physically deleted
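A sketch of how such a keep/supersede plan could be computed, assuming rows carry `id`, `content_hash`, `confidence`, and an ISO-formatted `created_at`; `plan_cleanup` is a hypothetical helper, not robotmem's actual implementation:

```python
from collections import defaultdict

def plan_cleanup(rows, max_ops=200):
    """Group rows by content_hash; keep highest confidence, then newest."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["content_hash"]].append(row)
    ops = []
    for group in groups.values():
        if len(group) < 2:
            continue  # no duplicates for this hash
        # ISO dates sort lexicographically, so one tuple key handles both criteria
        group.sort(key=lambda r: (r["confidence"], r["created_at"]), reverse=True)
        keep = group[0]
        for old in group[1:]:
            ops.append({"old_id": old["id"], "keep_id": keep["id"]})
            if len(ops) >= max_ops:  # cap per run to prevent runaway cleanup
                return ops
    return ops
```

The survivors are never touched; only the losing rows would be marked superseded.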
## Consolidation

At `end_session`, robotmem consolidates redundant memories from the episode through greedy Jaccard clustering.
### Algorithm

```
end_session(session_id="abc-123")
    │
    1. Query consolidatable memories
       ├── Same session + collection
       ├── status = 'active'
       ├── category NOT IN (constraint, postmortem, gotcha)
       ├── confidence < 0.95
       └── perception_type IS NULL
    │
    2. Skip if < 3 memories
    │
    3. Group by category
    │
    4. Within each group: pairwise Jaccard similarity
       ├── > 0.50 threshold → candidate pair
       └── Greedy clustering (all pairs in cluster must exceed threshold)
    │
    5. Per cluster: select representative
       ├── Priority: confidence DESC
       ├── Tiebreak: access_count DESC
       └── Tiebreak: created_at DESC
    │
    6. Non-representatives → status = 'superseded'
       └── superseded_by = representative.id
```
### Protected Memories

The following memories are never consolidated, even if similar:

| Protection | Reason |
|---|---|
| Category `constraint` | Safety rules must never be merged |
| Category `postmortem` | Lessons learned are individually valuable |
| Category `gotcha` | Each pitfall is unique context |
| Confidence >= 0.95 | High-confidence memories are too valuable to merge |
| Type `perception` | Sensor data should be preserved individually |
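The protection rules amount to a simple predicate. `is_consolidatable` below is an illustrative mirror of the consolidation query's filters, not robotmem's own code:

```python
PROTECTED_CATEGORIES = {"constraint", "postmortem", "gotcha"}

def is_consolidatable(mem: dict) -> bool:
    """True if a memory may be merged; mirrors the WHERE clause in step 1."""
    return (
        mem.get("status") == "active"
        and mem.get("category") not in PROTECTED_CATEGORIES
        and mem.get("confidence", 0.0) < 0.95
        and mem.get("perception_type") is None
    )
```

A memory is excluded if it fails any one condition; the checks are independent, not cumulative.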
### Greedy Clustering

The clustering algorithm ensures high-quality clusters:

```python
clusters = []
used = set()  # indices already assigned to a cluster
for i, a in enumerate(mems):
    if i in used:
        continue
    cluster, members = [a], [i]
    for j in range(i + 1, len(mems)):
        if j in used:
            continue
        b = mems[j]
        # ALL-PAIRS constraint: b must be similar to every member in cluster
        sim_ok = all(
            jaccard_similarity(b["content"], c["content"]) > 0.50
            for c in cluster
        )
        if sim_ok:
            cluster.append(b)
            members.append(j)
    if len(cluster) > 1:  # singletons are not clusters
        used.update(members)
        clusters.append(cluster)
```
This is stricter than single-linkage clustering — every pair within a cluster must exceed the threshold, preventing "chain drift" where A→B→C are linked but A and C are actually dissimilar.
### Representative Selection

Within each cluster, the representative is chosen by:
- Highest `confidence` — most reliable memory survives
- Highest `access_count` — most frequently recalled = most useful
- Newest (`created_at DESC`) — most recent information
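The three-level priority maps directly onto a tuple sort key; `pick_representative` is an illustrative sketch:

```python
def pick_representative(cluster: list) -> dict:
    """Highest confidence wins; ties break on access_count, then recency.

    Assumes created_at is an ISO-8601 string, so lexicographic max = newest.
    """
    return max(
        cluster,
        key=lambda m: (m["confidence"], m["access_count"], m["created_at"]),
    )
```

Run on the example from the section below (confidences 0.85 / 0.80 / 0.90), this picks memory #3.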
### Consolidation Response

```json
{
    "merged_groups": 2,
    "superseded_count": 3,
    "compression_ratio": 0.20,
    "avg_similarity": 0.65
}
```

| Field | Description |
|---|---|
| `merged_groups` | Number of clusters found |
| `superseded_count` | Total memories marked superseded |
| `compression_ratio` | `superseded_count / total_consolidatable` |
| `avg_similarity` | Average Jaccard similarity within clusters |
### Example

Given 5 memories in a session:

```
#1: "grip_force=12.5N works for cups"         (category=observation, confidence=0.85)
#2: "12.5N grip force optimal for cylinders"  (category=observation, confidence=0.80)
#3: "force 12.5N best for cup grasping"       (category=observation, confidence=0.90)
#4: "red objects need 15N force"              (category=observation, confidence=0.85)
#5: "always calibrate before grasping"        (category=constraint,  confidence=0.95)
```
Result:
- #5 is protected (constraint category + high confidence)
- #1, #2, #3 cluster together (Jaccard > 0.50 pairwise)
- #3 is selected as representative (highest confidence: 0.90)
- #1, #2 → superseded, superseded_by = 3
- #4 stays independent (not similar enough to join the cluster)
## Time Decay

Also triggered by `end_session`, time decay reduces the confidence of memories that have not been recently accessed.
### Formula

```
confidence_new = confidence × (1 - decay_rate) ^ days_since_last_access
```
### Parameters

| Parameter | Default | Description |
|---|---|---|
| `decay_rate` | 0.01 | Per-day decay rate (1%) |
| `min_interval_days` | 1.0 | Only decay if last access > 1 day ago |
| Confidence floor | 0.05 | Stop decaying below this threshold |
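Folding the formula and the parameters above into one helper gives a minimal sketch of the decay rule (`decayed_confidence` is a hypothetical name for illustration):

```python
def decayed_confidence(
    confidence: float,
    days_since_access: float,
    decay_rate: float = 0.01,
    min_interval_days: float = 1.0,
    floor: float = 0.05,
) -> float:
    """Apply exponential time decay, respecting the interval gate and floor."""
    if days_since_access <= min_interval_days:
        return confidence  # accessed too recently: no decay
    return max(floor, confidence * (1.0 - decay_rate) ** days_since_access)
```

For confidence 0.90 at 30 days this yields about 0.666; at 365 days the raw value (about 0.023) is clamped up to the 0.05 floor.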
### Base Time

The decay reference point is `last_accessed` (updated by recall hits), with fallback to `created_at`:

```sql
julianday('now') - julianday(COALESCE(last_accessed, created_at))
```

This means:
- Frequently recalled memories maintain high confidence (each recall resets the clock)
- Unused memories gradually fade
- Never-recalled memories decay from their creation date
### Decay Curve

For a memory with confidence=0.90 and decay_rate=0.01:

| Days Since Last Access | Confidence |
|---|---|
| 0 | 0.900 |
| 7 | 0.839 |
| 30 | 0.666 |
| 90 | 0.364 |
| 180 | 0.147 |
| 365 | 0.050 (clamped at the 0.05 floor) |

At the default `min_confidence=0.3` recall filter, this memory would stop appearing in search results after ~109 days without being recalled (0.90 × 0.99^109 ≈ 0.30).
## Database Operations

All consolidation and dedup operations use safe database primitives:

| Operation | Primitive | Behavior on Failure |
|---|---|---|
| Supersede memory | `safe_db_transaction` | Atomic: all or nothing |
| Time decay batch | `safe_db_transaction` | Atomic: all or nothing |
| Cleanup duplicates | `safe_db_transaction` | Atomic per batch |
| Touch memories | `safe_db_transaction` | Atomic per batch |

Failed transactions are logged but never crash the server.
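A plausible shape for such a primitive, sketched with sqlite3; `safe_db_transaction`'s real implementation lives in robotmem, so this is only an assumption about its behavior (roll back on any error, log, and swallow the exception so the server keeps running):

```python
import logging
import sqlite3
from contextlib import contextmanager

@contextmanager
def safe_db_transaction(conn: sqlite3.Connection):
    """All-or-nothing: commit on success, roll back and log on failure.

    Exceptions are suppressed after rollback, so callers never crash.
    """
    try:
        conn.execute("BEGIN")
        yield
        conn.commit()
    except Exception:
        conn.rollback()
        logging.exception("transaction failed; rolled back")

# usage (autocommit connection so BEGIN is issued explicitly):
# conn = sqlite3.connect("memories.db", isolation_level=None)
# with safe_db_transaction(conn):
#     conn.execute("UPDATE memories SET status = 'superseded' WHERE id = ?", (5,))
```

Because the context manager catches the exception itself, a failure inside the `with` block leaves the database untouched and execution continues on the next statement.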