Deep Dive: Auto Classification

robotmem automatically classifies every memory at write time using a pure regex-based L0 engine. No LLM dependency — sub-millisecond execution.

Pipeline Overview

learn(insight="grip_force=12.5N works best because sensor was calibrated")
         │
         ├── classify_category()
         │   Regex first-match → "root_cause" (matched "because")
         │
         ├── estimate_confidence()
         │   Base 0.80 + signals → 0.90
         │   (causal +0.05, backtick +0.05)
         │
         ├── extract_scope()
         │   → {scope_files: [], scope_entities: ["grip_force"]}
         │
         ├── classify_tags()
         │   Multi-label regex → ["root_cause", "observation"]
         │
         └── build_context_json()
             Merge user context + source marker

Category Classification

Priority Order

Categories are matched in strict priority order — first match wins:

Priority High ─────────────────────────────── Priority Low

constraint → preference → worldview →
tradeoff → root_cause → decision → revert →
pattern → architecture → config →
postmortem → gotcha → self_defect →
observation_debug → observation_code → observation →
code (default fallback)
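The first-match rule can be sketched directly. The pattern list below is an abbreviated, illustrative subset of the real priority table, not the full engine:

```python
import re

# Abbreviated, illustrative subset of the priority-ordered pattern list.
_CATEGORY_PATTERNS = [
    ("constraint",  re.compile(r"must\s+(?:always|never|not)|forbidden", re.I)),
    ("root_cause",  re.compile(r"root\s*cause|caused?\s*by|because", re.I)),
    ("decision",    re.compile(r"\b(?:chose|decided|instead of)\b", re.I)),
    ("observation", re.compile(r"found that|observed|noticed|discovered", re.I)),
]

def classify_category(text: str) -> str:
    for category, pattern in _CATEGORY_PATTERNS:
        if pattern.search(text):
            return category  # first match wins
    return "code"            # default fallback

print(classify_category("Must never exceed 15N grip force"))  # constraint
print(classify_category("Failure caused by sensor drift"))    # root_cause
```

Because iteration order encodes priority, a sentence that mentions both a constraint and a cause is still classified as `constraint`.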

Category Rules

| # | Category | Trigger Patterns | Decay | Example |
|---|----------|------------------|-------|---------|
| 1 | constraint | must always, never, forbidden, must not, 必须, 禁止, 不允许, 强制 | Protected | "Must never exceed 15N grip force" |
| 2 | preference | prefer X over Y, recommended to use, 优先使用, 推荐使用 | Normal | "Prefer approach from left side" |
| 3 | worldview | X is better than Y, the right way, from now on, 比...好, 最佳做法 | Normal | "ONNX is faster than Ollama for small models" |
| 4 | tradeoff | tradeoff, pros and cons, 权衡, advantage, 优缺点, vs | Normal | "Speed vs accuracy: use 10Hz for real-time" |
| 5 | root_cause | caused by, because, root cause, 原因, 根因, 导致, 由于 | Normal | "Failure caused by sensor drift" |
| 6 | decision | chose, decided, instead of, 选择, 决定, 决策, 采用 | Normal | "Chose PID over MPC for simplicity" |
| 7 | revert | reverted, rollback, undo, 回滚, 撤销 | Normal | "Reverted to previous PID gains" |
| 8 | pattern | every time, whenever, recurring, 规律, 总是, 反复出现 | Normal | "Every time humidity > 80%, grip fails" |
| 9 | architecture | architecture, module, pipeline, 架构, 模块, 系统设计, 分层 | Normal | "Memory pipeline: learn → dedup → store" |
| 10 | config | configuration, env var, setting, port, 配置, 环境变量, 版本 | Normal | "Port 6889 for web UI" |
| 11 | postmortem | postmortem, lesson learned, 教训, 复盘, 事后分析 | Protected | "Lesson: always calibrate before new session" |
| 12 | gotcha | gotcha, pitfall, 踩坑, 陷阱, 坑: | Protected | "Pitfall: joint limits not checked in sim" |
| 13 | self_defect | AI defect, 训练偏好, 幻觉倾向, 注意力衰减, 讨好倾向 | Protected | "AI overengineering tendency on simple tasks" |
| 14 | observation_debug | found/noticed/discovered + error/bug/crash/timeout | Normal | "Found that timeout errors spike at noon" |
| 15 | observation_code | found/noticed/discovered + .py/.rs/.js/function/module | Normal | "Noticed search.py returns stale results" |
| 16 | observation | found that, observed, noticed, discovered, 发现, 观察到, 实测 | Normal | "Found that red cups require more force" |
| 17 | code | (default fallback) | Normal | General technical notes |

Protected Categories

Three categories are marked as "protected" — they are never consolidated and never time-decayed:

| Category | Why Protected |
|----------|---------------|
| constraint | Safety rules must persist forever |
| postmortem | Lessons learned are irreplaceable |
| gotcha | Each pitfall has unique context |

Bilingual Matching

All patterns support both English and Chinese:

# constraint pattern (excerpt)
r"(?:must\s+(?:always|never|not)|(?:必须|禁止|不[允准]许|强制|绝不|一定要))"

# root_cause pattern (excerpt)
r"(?:root\s*cause|caused?\s*by|because|原因|根因|导致|问题出在|之所以|是因为|由于)"

Confidence Estimation

The confidence score indicates how reliable a memory is, based on content richness signals.

Formula

base = 0.80
+ 0.05 if file path detected (e.g., "src/search.py")
+ 0.05 if code reference detected (backticks or function calls)
+ 0.05 if causal language detected ("because", "caused by", etc.)
+ 0.05 if context JSON provided (> 20 chars)
= cap at 0.95
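The additive formula can be sketched as follows. The signal regexes here are simplified, illustrative stand-ins for the real detectors:

```python
import re

# Illustrative sketch of the additive confidence formula; the regexes below
# are simplified stand-ins for the real signal detectors.
_FILE_RE   = re.compile(r"\w[\w./-]*\.(?:py|rs|js|ts|go|md)")
_CODE_RE   = re.compile(r"`[^`]+`|\w+\(\)")
_CAUSAL_RE = re.compile(r"because|caused\s*by|root\s*cause", re.IGNORECASE)

def estimate_confidence(text: str, context: str = "") -> float:
    score = 0.80
    if _FILE_RE.search(text):
        score += 0.05  # file path detected
    if _CODE_RE.search(text):
        score += 0.05  # code reference detected
    if _CAUSAL_RE.search(text):
        score += 0.05  # causal language detected
    if len(context.strip()) > 20:
        score += 0.05  # rich context provided
    return round(min(score, 0.95), 2)

print(estimate_confidence("failed because `embed_one()` timed out"))  # 0.9
```

The example fires two signals (causal language and a backticked code reference), landing at 0.90; even with all four signals the score is capped at 0.95.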

Signal Detection

| Signal | Regex Pattern | Example |
|--------|---------------|---------|
| File path | `\w[\w./-]*\.(py\|rs\|js\|ts\|go\|md\|...)` | src/robotmem/search.py |
| Code reference | Backticks or `\w+()` pattern | `embed_one()` |
| Causal language | `because\|caused by\|原因\|根因\|导致` | "failed because of timeout" |
| Rich context | `len(context.strip()) > 20` | JSON context with params |

Confidence Range

| Signals | Confidence | Interpretation |
|---------|------------|----------------|
| 0 signals | 0.80 | Base level — general observation |
| 1 signal | 0.85 | Moderate — some evidence |
| 2 signals | 0.90 | Good — well-evidenced |
| 3 signals | 0.95 | High — strong evidence (max) |
| 4 signals | 0.95 | Capped at 0.95 |

Scope Extraction

The scope extractor identifies referenced files, entities, and modules from the memory text.

File Path Detection

# Regex matches paths like: src/robotmem/search.py, /etc/config.yaml
_FILE_PATH_RE = re.compile(
    r"(?:^|[\s\"'`(,])(/?\w[\w./-]*\.(?:py|rs|js|ts|tsx|go|md|toml|yaml|yml|json|sql|sh|css|html))"
)

Supported extensions: .py, .rs, .js, .ts, .tsx, .go, .md, .toml, .yaml, .yml, .json, .sql, .sh, .css, .html

Entity Detection

Two extraction methods:

  1. Backtick entities: `ClassName`, `function_name()`
  2. PascalCase names: CogDatabase, FastEmbedEmbedder

# Backtick: `embed_one()` → "embed_one"
_BACKTICK_ENTITY_RE = re.compile(r"`(\w[\w.]*)(?:\(\))?`")

# PascalCase: CogDatabase → "CogDatabase"
_PASCAL_CASE_RE = re.compile(r"\b([A-Z][a-z]+(?:[A-Z][a-z]+)+)\b")

Module Inference

Modules are inferred from file paths:

# "src/robotmem/ops/search.py" → module "ops"
# (parent directory, excluding src/lib/app/tests)
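A minimal sketch of this inference, assuming the skip list from the comment above (`src`, `lib`, `app`, `tests`):

```python
from pathlib import PurePosixPath

# Parent directory of the file, unless it is a generic container directory.
_SKIP_DIRS = {"src", "lib", "app", "tests"}

def infer_module(path: str) -> "str | None":
    parent = PurePosixPath(path).parent.name
    return parent if parent and parent not in _SKIP_DIRS else None

print(infer_module("src/robotmem/ops/search.py"))  # ops
print(infer_module("src/foo.py"))                  # None
```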

Scope Output

{
    "scope_files": ["src/robotmem/search.py"],
    "scope_entities": ["embed_one", "CogDatabase"],
    "scope_modules": ["robotmem"]
}

Tag Classification

Tags provide multi-label classification — a single memory can have multiple tags.

Process

import json

def classify_tags(text, context_json=None):
    tags = []
    # 1. Match ALL regex patterns (not just first)
    for category, pattern in _CATEGORY_PATTERNS:
        if pattern.search(text):
            tags.append(category)

    # 2. Extract scenario_tags from context JSON
    if context_json:
        ctx = json.loads(context_json)
        for tag in ctx.get("scenario_tags", []):
            if tag in VALID_TAGS:
                tags.append(tag)

    # 3. Fallback
    if not tags:
        tags.append("code")

    return tags

Key difference from classify_category():

  - classify_category() → first match (single label)
  - classify_tags() → all matches (multi-label)

Context-Based Tags

Tags can also be injected via the context JSON:

{
    "scenario_tags": ["debug", "concurrency", "pattern"]
}

These are validated against the VALID_TAGS whitelist (50+ tags) before insertion.
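The whitelist check can be sketched as below; VALID_TAGS here is a small assumed subset of the real 50+ tag vocabulary:

```python
# VALID_TAGS is an illustrative subset of the real whitelist.
VALID_TAGS = {"debug", "concurrency", "pattern", "gotcha", "root_cause"}

def filter_scenario_tags(ctx: dict) -> list:
    """Keep only whitelisted scenario_tags; free-form tags are dropped."""
    return [t for t in ctx.get("scenario_tags", []) if t in VALID_TAGS]

print(filter_scenario_tags({"scenario_tags": ["debug", "free_form_tag"]}))
# ['debug']
```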

Tag Taxonomy

The tag system uses a 9-dimension hierarchical tree with 50+ tags:

metacognition          ← reasoning, cognitive_bias, decision_framework,
│                        systems_thinking, risk_thinking, worldview, decision
│
capability             ← build, debug, design, review, explain, optimize,
│                        plan, architecture, code
│
domain                 ← cs_fundamentals, ai_ml, finance, business,
│                        cross_domain, config, observation, observation_code,
│                        observation_debug
│
technique              ← patterns, anti_patterns, recipes, language_specific,
│                        pattern
│
timing                 ← when_to_start, when_to_stop, when_to_switch
│
boundary               ← tradeoff, not_applicable, diminishing_returns,
│                        constraint
│
experience             ← war_story, postmortem, gotcha, root_cause, revert
│
self_defect            ← hallucination, sycophancy, overengineering,
│                        no_verification
│
reflection             ← accuracy_calibration, behavior_rule, blind_spot,
                         preference

Dimension Prefix

Each tag maps to a human-readable dimension prefix for display:

dimension_prefix("gotcha")        # → "[经验/踩坑]"    (experience/gotcha)
dimension_prefix("architecture")  # → "[能力/架构设计]"  (capability/architecture design)
dimension_prefix("metacognition") # → "[元认知]"       (metacognition; root node)
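A minimal sketch of the prefix lookup, assuming a parent map and display names derived from TAG_META_TREE (the two-entry maps below are a hypothetical excerpt):

```python
# Hypothetical two-entry excerpt of the taxonomy; the real mappings live in
# TAG_META_TREE / tag_meta.
_PARENT  = {"gotcha": "experience", "experience": None}
_DISPLAY = {"gotcha": "踩坑", "experience": "经验"}

def dimension_prefix(tag: str) -> str:
    # Walk up the parent chain, then join display names root-first.
    chain = []
    while tag is not None:
        chain.append(_DISPLAY.get(tag, tag))
        tag = _PARENT.get(tag)
    return "[" + "/".join(reversed(chain)) + "]"

print(dimension_prefix("gotcha"))  # [经验/踩坑]
```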

Tag Storage

Tags are stored in the memory_tags table:

| Column | Description |
|--------|-------------|
| memory_id | FK to memories.id |
| tag | Tag from controlled vocabulary |
| source | "auto" (regex inferred) or "user" (explicitly set) |

Primary key: (memory_id, tag) — no duplicate tags per memory.
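The composite primary key can be demonstrated with an in-memory SQLite sketch (column names come from the table above; the column types are assumptions):

```python
import sqlite3

# Sketch of the memory_tags schema described above; types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE memory_tags (
        memory_id INTEGER NOT NULL,
        tag       TEXT NOT NULL,
        source    TEXT NOT NULL DEFAULT 'auto',
        PRIMARY KEY (memory_id, tag)
    )
""")
conn.execute("INSERT INTO memory_tags VALUES (1, 'gotcha', 'auto')")

# A duplicate (memory_id, tag) pair is rejected by the primary key:
try:
    conn.execute("INSERT INTO memory_tags VALUES (1, 'gotcha', 'user')")
except sqlite3.IntegrityError:
    print("duplicate tag rejected")
```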

Tag Metadata

The tag_meta table maintains the taxonomy hierarchy:

| Column | Description |
|--------|-------------|
| tag | Tag identifier (PK) |
| parent | Parent tag (NULL = root dimension) |
| display_name | Human-readable name |

This table is automatically synced from TAG_META_TREE at database initialization.

Context JSON Builder

The build_context_json() function merges user-provided context with automatic metadata:

import json

def build_context_json(insight, context):
    result = {"source": "learn_tool"}

    if context:
        try:
            parsed = json.loads(context)
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, dict):
            result.update(parsed)  # Merge user fields
        else:
            result["user_context"] = context  # Store raw string as-is

    return json.dumps(result)

This ensures every memory has at least a source field for provenance tracking.

Path Normalization

File paths extracted by scope are normalized for consistency:

def normalize_scope_files(files, project_root=None):
    """Absolute → relative, deduplicated, sorted"""
    # /Users/jia/project/src/foo.py → src/foo.py

This prevents duplicate entries when the same file is referenced with both absolute and relative paths.
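A minimal sketch of this normalization, assuming it simply strips the project_root prefix, deduplicates, and sorts (the real implementation may handle more cases):

```python
import os

# Sketch: absolute → relative (when under project_root), dedup, sort.
def normalize_scope_files(files, project_root=None):
    out = set()
    for f in files:
        if project_root and os.path.isabs(f) and f.startswith(project_root):
            f = os.path.relpath(f, project_root)
        out.add(f)
    return sorted(out)

print(normalize_scope_files(
    ["/Users/jia/project/src/foo.py", "src/foo.py"],
    project_root="/Users/jia/project",
))  # ['src/foo.py']
```

The absolute and relative references to the same file collapse to a single normalized entry.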

Design Principles

  1. Pure regex, no LLM: Sub-millisecond execution, deterministic, no hallucination risk
  2. First-match priority: Categories have strict ordering — safety constraints always win
  3. Multi-label tags: A memory can be both root_cause and observation
  4. Bilingual: All patterns work for both English and Chinese text
  5. Graceful fallback: Unknown text defaults to code category with 0.80 confidence
  6. Controlled vocabulary: All tags are validated against a whitelist — no free-form tagging