
Deduplication & Consolidation

robotmem prevents redundant memories through a multi-layer deduplication pipeline at write time, and consolidates similar memories at episode end.

Deduplication Pipeline

When learn is called, new memories pass through three layers of duplicate detection before being stored.

Overview

learn(insight="grip_force=12.5N works best")
         │
         ├── Layer 1: Exact Match
         │   SHA-256(content)[:16] → content_hash lookup
         │   O(1) — instant reject
         │
         ├── Layer 2: Jaccard Token Similarity
         │   FTS5 candidates → pairwise token overlap
         │   Threshold: > 0.70 → duplicate
         │
         └── Layer 3: Cosine Vector Similarity
         │   embed_one(content) → vec_search top-3
         │   Threshold: > 0.85 → duplicate
         │
         ▼
    is_dup=True → return {status: "duplicate", method, existing_id, similarity}
    is_dup=False → proceed to insert

Layer 1: Exact Match

The fastest check — O(1) hash lookup:

content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]

# Check in DB
existing = conn.execute(
    "SELECT 1 FROM memories WHERE content_hash = ? AND collection = ? "
    "AND status = 'active' LIMIT 1",
    (content_hash, collection),
).fetchone()
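Layer 1 only catches byte-identical content — even a case change produces a different hash and falls through to Layer 2. A quick illustration:

```python
import hashlib

def content_hash(content: str) -> str:
    # Same truncated SHA-256 scheme as the snippet above
    return hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]

h1 = content_hash("grip_force=12.5N works best")
h2 = content_hash("grip_force=12.5N works best")
h3 = content_hash("grip_force=12.5N works BEST")  # only the case differs

identical = h1 == h2   # True — exact duplicate, rejected in O(1)
case_diff = h1 == h3   # False — not caught here; Layer 2 handles near-matches
```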

Layer 2: Jaccard Token Similarity

For near-duplicates with minor wording differences:

def jaccard_similarity(a: str, b: str) -> float:
    tokens_a = set(a.lower().split()) - STOPWORDS
    tokens_b = set(b.lower().split()) - STOPWORDS
    union = tokens_a | tokens_b
    if not union:  # both strings contained only stopwords
        return 0.0
    return len(tokens_a & tokens_b) / len(union)

Process:

  1. Use FTS5 to find the top-5 candidates matching the new content
  2. Calculate Jaccard similarity for each candidate
  3. Threshold: > 0.70 → mark as duplicate

Stopwords are filtered in both English and Chinese:

STOPWORDS = frozenset(
    "的 了 在 是 把 被 给 和 与 从 到 也 都 就 对 又 所 而 且 但 或 "
    "a an the is are was were be to of and in for on with "
    "preference constraint decision observation code config pattern "
    "architecture root_cause tradeoff revert".split()
)

Category-related terms are included as stopwords to prevent false matches between memories that only share classification keywords.
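A self-contained illustration of the Layer 2 token-overlap check (using a small stand-in stopword set, not the full production list):

```python
# Illustrative subset of STOPWORDS, not the production frozenset
STOPWORDS = frozenset("a an the is are to of and for on with".split())

def jaccard_similarity(a: str, b: str) -> float:
    tokens_a = set(a.lower().split()) - STOPWORDS
    tokens_b = set(b.lower().split()) - STOPWORDS
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

# Minor wording change: token sets are identical after stopword removal
near_dup = jaccard_similarity(
    "grip_force=12.5N works best for cups",
    "grip_force=12.5N works best for the cups",
)
# Unrelated insight: no token overlap at all
distinct = jaccard_similarity(
    "grip_force=12.5N works best for cups",
    "red objects need 15N force",
)
# near_dup == 1.0 (> 0.70 → duplicate); distinct == 0.0 (passes through)
```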

Layer 3: Cosine Vector Similarity

For semantic duplicates that use different words:

# Only runs if embedder is available
embedding = await embedder.embed_one(assertion)
vec_results = db_cog.vec_search_memories(
    query_embedding=embedding, collection=collection, limit=3
)
for vr in vec_results:
    cosine_sim = 1.0 - vr["distance"]
    if cosine_sim >= 0.85:
        return DedupResult(is_dup=True, method="cosine", similarity=cosine_sim)
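The vector search returns a distance, which the code converts back to similarity via 1.0 - distance. For intuition, cosine similarity can be computed directly from two embeddings (hypothetical 3-dimensional vectors here, purely for illustration):

```python
import math

def cosine_similarity(u, v):
    # dot(u, v) / (|u| * |v|); equals 1 - cosine_distance
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two hypothetical embeddings pointing in nearly the same direction
sim = cosine_similarity([0.9, 0.1, 0.3], [0.85, 0.15, 0.28])
is_dup = sim >= 0.85  # True — would be rejected as a semantic duplicate
```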

Similar (Non-Duplicate) Tracking

Memories with moderate similarity (0.40–0.70 Jaccard or 0.40–0.85 cosine) are tracked as similar_facts in the result:

{
    "is_dup": False,
    "similar_facts": [
        {"id": 15, "assertion": "similar content...", "similarity": 0.55}
    ],
    "method": "none",
    "similarity": 0.55
}

These are returned to the caller for informational purposes but do not block insertion.

Session-Scoped Cosine Dedup

An additional dedup layer operates within the current session:

def check_session_cosine_dup(assertion, session_id, collection, db_cog, embedder):
    """Within the same session, check for semantic duplicates."""
    embedding = embedder.embed_one(assertion)
    results = db_cog.vec_search_memories(embedding, collection, limit=20)
    for r in results:
        if r["session_id"] != session_id:
            continue  # only compare against the same session
        cosine_sim = 1.0 - r["distance"]
        if cosine_sim >= 0.85:
            return DedupResult(is_dup=True, method="session_cosine")
    return DedupResult(is_dup=False)

This catches cases where a robot records the same observation multiple times within a single episode.

Dedup Thresholds

| Layer | Method | Duplicate Threshold | Similar Threshold |
|-------|--------|---------------------|-------------------|
| 1 | Exact hash | 1.0 (exact) | — |
| 2 | Jaccard tokens | > 0.70 | > 0.40 |
| 3 | Cosine vectors | > 0.85 | > 0.40 |
| 3b | Session cosine | > 0.85 | — |

Batch Cleanup

For existing databases with accumulated duplicates, cleanup_exact_duplicates() provides a batch cleanup tool:

ops = cleanup_exact_duplicates(db_cog, collection="default", dry_run=True)
# Returns: [{"old_id": 5, "keep_id": 3, "assertion_preview": "..."}]

# Execute cleanup
ops = cleanup_exact_duplicates(db_cog, collection="default", dry_run=False)

Safety features:

- Dry run mode: preview changes before applying
- Max 200 operations per run: prevents runaway cleanup
- Keeps highest confidence: among duplicates, the one with the highest confidence (then the newest) is preserved
- Soft delete: duplicates are marked superseded, not physically deleted
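The planning step can be sketched as follows — assumed internals, not the actual implementation: group active memories by content hash, keep the best copy per group, and cap the batch size.

```python
import hashlib

def plan_exact_cleanup(memories, max_ops=200):
    """Sketch: group memories by content hash, keep the highest-confidence
    (then newest) copy, and emit supersede operations for the rest."""
    groups = {}
    for m in memories:
        h = hashlib.sha256(m["content"].encode("utf-8")).hexdigest()[:16]
        groups.setdefault(h, []).append(m)

    ops = []
    for dupes in groups.values():
        if len(dupes) < 2:
            continue  # no duplicates in this group
        dupes.sort(key=lambda m: (m["confidence"], m["created_at"]), reverse=True)
        keep = dupes[0]
        ops.extend({"old_id": d["id"], "keep_id": keep["id"]} for d in dupes[1:])
    return ops[:max_ops]  # safety cap per run

ops = plan_exact_cleanup([
    {"id": 3, "content": "grip_force=12.5N", "confidence": 0.9, "created_at": 1},
    {"id": 5, "content": "grip_force=12.5N", "confidence": 0.8, "created_at": 2},
])
# ops == [{"old_id": 5, "keep_id": 3}] — the lower-confidence copy is superseded
```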

Consolidation

At end_session, robotmem consolidates redundant memories from the episode through greedy Jaccard clustering.

Algorithm

end_session(session_id="abc-123")
         │
    1. Query consolidatable memories
       ├── Same session + collection
       ├── status = 'active'
       ├── category NOT IN (constraint, postmortem, gotcha)
       ├── confidence < 0.95
       └── perception_type IS NULL
         │
    2. Skip if < 3 memories
         │
    3. Group by category
         │
    4. Within each group: pairwise Jaccard similarity
       ├── > 0.50 threshold → candidate pair
       └── Greedy clustering (all pairs in cluster must exceed threshold)
         │
    5. Per cluster: select representative
       ├── Priority: confidence DESC
       ├── Tiebreak: access_count DESC
       └── Tiebreak: created_at DESC
         │
    6. Non-representatives → status = 'superseded'
       └── superseded_by = representative.id

Protected Memories

The following memories are never consolidated, even if similar:

| Protection | Reason |
|------------|--------|
| Category constraint | Safety rules must never be merged |
| Category postmortem | Lessons learned are individually valuable |
| Category gotcha | Each pitfall is unique context |
| Confidence >= 0.95 | High-confidence memories are too valuable to merge |
| Type perception | Sensor data should be preserved individually |

Greedy Clustering

The clustering algorithm ensures high quality clusters:

clusters, used = [], set()
for i, a in enumerate(mems):
    if a["id"] in used:
        continue  # already assigned to an earlier cluster
    cluster = [a]
    for b in mems[i + 1:]:
        if b["id"] in used:
            continue
        # ALL-PAIRS constraint: b must be similar to every member of the cluster
        sim_ok = all(
            jaccard_similarity(b["content"], c["content"]) > 0.50
            for c in cluster
        )
        if sim_ok:
            cluster.append(b)
    if len(cluster) > 1:
        used.update(m["id"] for m in cluster)
        clusters.append(cluster)

This is stricter than single-linkage clustering — every pair within a cluster must exceed the threshold, preventing "chain drift" where A→B→C are linked but A and C are actually dissimilar.

Representative Selection

Within each cluster, the representative is chosen by:

  1. Highest confidence — Most reliable memory survives
  2. Highest access_count — Most frequently recalled = most useful
  3. Newest (created_at DESC) — Most recent information
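The three-step priority maps naturally onto a single tuple key (field names assumed from the criteria above; a sketch, not the actual implementation):

```python
def pick_representative(cluster):
    # Tuple comparison gives confidence > access_count > recency priority
    return max(
        cluster,
        key=lambda m: (m["confidence"], m["access_count"], m["created_at"]),
    )

rep = pick_representative([
    {"id": 1, "confidence": 0.85, "access_count": 4, "created_at": 100},
    {"id": 3, "confidence": 0.90, "access_count": 1, "created_at": 90},
])
# rep["id"] == 3 — highest confidence wins despite fewer recalls
```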

Consolidation Response

{
    "merged_groups": 2,
    "superseded_count": 3,
    "compression_ratio": 0.20,
    "avg_similarity": 0.65
}
| Field | Description |
|-------|-------------|
| merged_groups | Number of clusters found |
| superseded_count | Total memories marked superseded |
| compression_ratio | superseded_count / total_consolidatable |
| avg_similarity | Average Jaccard similarity within clusters |

Example

Given 5 memories in a session:

#1: "grip_force=12.5N works for cups"        (category=observation, confidence=0.85)
#2: "12.5N grip force optimal for cylinders" (category=observation, confidence=0.80)
#3: "force 12.5N best for cup grasping"      (category=observation, confidence=0.90)
#4: "red objects need 15N force"             (category=observation, confidence=0.85)
#5: "always calibrate before grasping"       (category=constraint, confidence=0.95)

Result:

- #5 is protected (constraint category + high confidence)
- #1, #2, #3 cluster together (Jaccard > 0.50 pairwise)
- #3 is selected as representative (highest confidence: 0.90)
- #1, #2 → superseded, superseded_by = 3
- #4 stays independent (not similar enough to join the cluster)

Time Decay

Also triggered by end_session, time decay reduces confidence of memories not recently accessed:

Formula

confidence_new = confidence × (1 - decay_rate) ^ days_since_last_access

Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| decay_rate | 0.01 | Per-day decay rate (1%) |
| min_interval_days | 1.0 | Only decay if last access was more than 1 day ago |
| confidence floor | 0.05 | Stop decaying below this threshold |
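The formula and guards above can be combined into one helper (parameter names taken from the table; a sketch, not the actual implementation):

```python
def decayed_confidence(confidence, days_since_access,
                       decay_rate=0.01, min_interval_days=1.0, floor=0.05):
    if days_since_access <= min_interval_days:
        return confidence  # accessed too recently to decay
    decayed = confidence * (1 - decay_rate) ** days_since_access
    return max(decayed, floor)  # never decay below the confidence floor

after_30_days = round(decayed_confidence(0.90, 30), 3)   # 0.666
recent = decayed_confidence(0.90, 0.5)                   # 0.90 — unchanged
ancient = decayed_confidence(0.90, 10_000)               # 0.05 — floored
```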

Base Time

The decay reference point is last_accessed (updated by recall hits), with fallback to created_at:

julianday('now') - julianday(COALESCE(last_accessed, created_at))

This means:

- Frequently recalled memories maintain high confidence (each recall resets the clock)
- Unused memories gradually fade
- Never-recalled memories decay from their creation date

Decay Curve

For a memory with confidence=0.90 and decay_rate=0.01:

| Days Since Last Access | Confidence |
|------------------------|------------|
| 0 | 0.900 |
| 7 | 0.839 |
| 30 | 0.666 |
| 90 | 0.364 |
| 180 | 0.147 |
| 365 | 0.023 |

At the default min_confidence=0.3 recall filter, this memory would stop appearing in search results after roughly 109 days without being recalled.
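The crossing point follows directly from the formula: solve confidence × (1 − decay_rate)^d = threshold for d.

```python
import math

def days_until(confidence, threshold, decay_rate=0.01):
    """Days of inactivity before confidence decays below threshold."""
    return math.log(threshold / confidence) / math.log(1 - decay_rate)

d = days_until(0.90, 0.30)  # ≈ 109.3 days for the example above
```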

Database Operations

All consolidation and dedup operations use safe database primitives:

| Operation | Primitive | Behavior on Failure |
|-----------|-----------|---------------------|
| Supersede memory | safe_db_transaction | Atomic: all or nothing |
| Time decay batch | safe_db_transaction | Atomic: all or nothing |
| Cleanup duplicates | safe_db_transaction | Atomic per batch |
| Touch memories | safe_db_transaction | Atomic per batch |

Failed transactions are logged but never crash the server.