# Auto Classification

robotmem automatically classifies every memory at write time using a pure regex-based L0 engine: no LLM dependency, sub-millisecond execution.
## Pipeline Overview

```
learn(insight="grip_force=12.5N works best because sensor was calibrated")
  │
  ├── classify_category()
  │     Regex first-match → "root_cause" (matched "because")
  │
  ├── estimate_confidence()
  │     Base 0.80 + signals → 0.90
  │     (causal +0.05, backtick +0.05)
  │
  ├── extract_scope()
  │     → {scope_files: [], scope_entities: ["grip_force"]}
  │
  ├── classify_tags()
  │     Multi-label regex → ["root_cause", "observation"]
  │
  └── build_context_json()
        Merge user context + source marker
```
## Category Classification

### Priority Order

Categories are matched in strict priority order; the first match wins:

```
HIGH PRIORITY ─────────────────────────────── LOW PRIORITY
constraint → preference → worldview →
tradeoff → root_cause → decision → revert →
pattern → architecture → config →
postmortem → gotcha → self_defect →
observation_debug → observation_code → observation →
code (default fallback)
```
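The first-match loop can be sketched with a trimmed-down pattern table. The four-entry `_CATEGORY_PATTERNS` list below is a hypothetical excerpt for illustration; the real engine covers all 17 categories with fuller bilingual patterns.

```python
import re

# Minimal sketch of first-match classification; this pattern table is a
# hypothetical four-entry excerpt, not robotmem's full bilingual set.
_CATEGORY_PATTERNS = [
    ("constraint",  re.compile(r"must\s+(?:always|never|not)|forbidden", re.I)),
    ("root_cause",  re.compile(r"root\s*cause|caused?\s*by|because", re.I)),
    ("decision",    re.compile(r"\bchose\b|\bdecided\b|instead\s+of", re.I)),
    ("observation", re.compile(r"found\s+that|observed|noticed|discovered", re.I)),
]

def classify_category(text):
    """Return the first matching category; 'code' is the default fallback."""
    for category, pattern in _CATEGORY_PATTERNS:
        if pattern.search(text):
            return category
    return "code"
```

Because of the ordering, a sentence containing both "must never" and "because" classifies as `constraint`, never `root_cause`.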
### Category Rules

| # | Category | Trigger Patterns | Decay | Example |
|---|---|---|---|---|
| 1 | `constraint` | must always, never, forbidden, must not, 必须, 禁止, 不允许, 强制 | Protected | "Must never exceed 15N grip force" |
| 2 | `preference` | prefer X over Y, recommended to use, 优先使用, 推荐使用 | Normal | "Prefer approach from left side" |
| 3 | `worldview` | X is better than Y, the right way, from now on, 比...好, 最佳做法 | Normal | "ONNX is faster than Ollama for small models" |
| 4 | `tradeoff` | tradeoff, pros and cons, 权衡, advantage, 优缺点, vs | Normal | "Speed vs accuracy: use 10Hz for real-time" |
| 5 | `root_cause` | caused by, because, root cause, 原因, 根因, 导致, 由于 | Normal | "Failure caused by sensor drift" |
| 6 | `decision` | chose, decided, instead of, 选择, 决定, 决策, 采用 | Normal | "Chose PID over MPC for simplicity" |
| 7 | `revert` | reverted, rollback, undo, 回滚, 撤销 | Normal | "Reverted to previous PID gains" |
| 8 | `pattern` | every time, whenever, recurring, 规律, 总是, 反复出现 | Normal | "Every time humidity > 80%, grip fails" |
| 9 | `architecture` | architecture, module, pipeline, 架构, 模块, 系统设计, 分层 | Normal | "Memory pipeline: learn → dedup → store" |
| 10 | `config` | configuration, env var, setting, port, 配置, 环境变量, 版本 | Normal | "Port 6889 for web UI" |
| 11 | `postmortem` | postmortem, lesson learned, 教训, 复盘, 事后分析 | Protected | "Lesson: always calibrate before new session" |
| 12 | `gotcha` | gotcha, pitfall, 踩坑, 陷阱, 坑: | Protected | "Pitfall: joint limits not checked in sim" |
| 13 | `self_defect` | AI defect, 训练偏好, 幻觉倾向, 注意力衰减, 讨好倾向 | Protected | "AI overengineering tendency on simple tasks" |
| 14 | `observation_debug` | found/noticed/discovered + error/bug/crash/timeout | Normal | "Found that timeout errors spike at noon" |
| 15 | `observation_code` | found/noticed/discovered + .py/.rs/.js/function/module | Normal | "Noticed search.py returns stale results" |
| 16 | `observation` | found that, observed, noticed, discovered, 发现, 观察到, 实测 | Normal | "Found that red cups require more force" |
| 17 | `code` | (default fallback) | Normal | General technical notes |
### Protected Categories

Three categories are marked as "protected": they are never consolidated and never time-decayed.

| Category | Why Protected |
|---|---|
| `constraint` | Safety rules must persist forever |
| `postmortem` | Lessons learned are irreplaceable |
| `gotcha` | Each pitfall has unique context |
### Bilingual Matching

All patterns support both English and Chinese:

```python
# constraint pattern (excerpt)
r"(?:must\s+(?:always|never|not)|(?:必须|禁止|不[允准许]许|强制|绝不|一定要))"

# root_cause pattern (excerpt)
r"(?:root\s*cause|caused?\s*by|because|原因|根因|导致|问题出在|之所以|是因为|由于)"
```
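A quick check that the excerpted patterns match both languages. The variable names are mine, for illustration; only the regex bodies come from the source above.

```python
import re

# The two pattern excerpts above, compiled verbatim.
CONSTRAINT_RE = re.compile(
    r"(?:must\s+(?:always|never|not)|(?:必须|禁止|不[允准许]许|强制|绝不|一定要))"
)
ROOT_CAUSE_RE = re.compile(
    r"(?:root\s*cause|caused?\s*by|because|原因|根因|导致|问题出在|之所以|是因为|由于)"
)

# Each pattern fires on either language:
assert CONSTRAINT_RE.search("must never exceed 15N")      # English trigger
assert CONSTRAINT_RE.search("必须先校准传感器")              # Chinese trigger ("must calibrate the sensor first")
assert ROOT_CAUSE_RE.search("failed because of timeout")   # English trigger
assert ROOT_CAUSE_RE.search("失败是由于传感器漂移")           # Chinese trigger ("failure was due to sensor drift")
```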
## Confidence Estimation

The confidence score indicates how reliable a memory is, based on content-richness signals.

### Formula

```
base = 0.80
     + 0.05 if file path detected        (e.g. "src/search.py")
     + 0.05 if code reference detected   (backticks or function calls)
     + 0.05 if causal language detected  ("because", "caused by", etc.)
     + 0.05 if context JSON provided     (> 20 chars)
     = capped at 0.95
```
### Signal Detection

| Signal | Check | Example |
|---|---|---|
| File path | `\w[\w./-]*\.(py\|rs\|js\|ts\|go\|md\|...)` | `src/robotmem/search.py` |
| Code reference | backticks or a `\w+\(\)` pattern | `` `embed_one()` `` |
| Causal language | `because\|caused by\|原因\|根因\|导致` | "failed because of timeout" |
| Rich context | `len(context.strip()) > 20` | JSON context with params |
### Confidence Range
| Signals | Confidence | Interpretation |
|---|---|---|
| 0 signals | 0.80 | Base level — general observation |
| 1 signal | 0.85 | Moderate — some evidence |
| 2 signals | 0.90 | Good — well-evidenced |
| 3 signals | 0.95 | High — strong evidence (max) |
| 4 signals | 0.95 | Capped at 0.95 |
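The formula above can be written out directly. The signal regexes below are simplified stand-ins for robotmem's actual detectors, and the function name `estimate_confidence` mirrors the pipeline diagram; treat this as a sketch.

```python
import re

# Simplified stand-ins for the four signal detectors (not the exact
# patterns robotmem ships).
_FILE_PATH_RE = re.compile(r"\w[\w./-]*\.(?:py|rs|js|ts|go|md)")
_CODE_REF_RE = re.compile(r"`[^`]+`|\w+\(\)")
_CAUSAL_RE = re.compile(r"because|caused\s*by|原因|根因|导致", re.IGNORECASE)

def estimate_confidence(text, context=None):
    score = 0.80
    if _FILE_PATH_RE.search(text):
        score += 0.05  # file path signal
    if _CODE_REF_RE.search(text):
        score += 0.05  # code reference signal
    if _CAUSAL_RE.search(text):
        score += 0.05  # causal language signal
    if context and len(context.strip()) > 20:
        score += 0.05  # rich context signal
    return min(score, 0.95)  # even four signals cap at 0.95
```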
## Scope Extraction

The scope extractor identifies referenced files, entities, and modules in the memory text.

### File Path Detection

```python
import re

# Matches paths like: src/robotmem/search.py, /etc/config.yaml
_FILE_PATH_RE = re.compile(
    r"(?:^|[\s\"'`(,])(/?\w[\w./-]*\.(?:py|rs|js|ts|tsx|go|md|toml|yaml|yml|json|sql|sh|css|html))"
)
```

Supported extensions: `.py`, `.rs`, `.js`, `.ts`, `.tsx`, `.go`, `.md`, `.toml`, `.yaml`, `.yml`, `.json`, `.sql`, `.sh`, `.css`, `.html`
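Applying `_FILE_PATH_RE` to a sample insight; the `extract_file_paths` helper is mine, for illustration, and simply collects the capture group from each match.

```python
import re

_FILE_PATH_RE = re.compile(
    r"(?:^|[\s\"'`(,])(/?\w[\w./-]*\.(?:py|rs|js|ts|tsx|go|md|toml|yaml|yml|json|sql|sh|css|html))"
)

def extract_file_paths(text):
    """Collect unique file paths in order of first appearance (sketch)."""
    paths = []
    for match in _FILE_PATH_RE.finditer(text):
        path = match.group(1)
        if path not in paths:
            paths.append(path)
    return paths
```

Both relative and absolute forms are captured: `extract_file_paths("Noticed src/robotmem/search.py and /etc/config.yaml drift")` yields both paths.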
### Entity Detection

Two extraction methods:

- Backtick entities: `` `ClassName` ``, `` `function_name()` ``
- PascalCase names: `CogDatabase`, `FastEmbedEmbedder`

```python
# Backtick: `embed_one()` → "embed_one"
_BACKTICK_ENTITY_RE = re.compile(r"`(\w[\w.]*)(?:\(\))?`")

# PascalCase: CogDatabase → "CogDatabase"
_PASCAL_CASE_RE = re.compile(r"\b([A-Z][a-z]+(?:[A-Z][a-z]+)+)\b")
```
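Both entity regexes in action. The combining helper `extract_entities` is illustrative, not robotmem's API; it just concatenates the two match sets without duplicates.

```python
import re

_BACKTICK_ENTITY_RE = re.compile(r"`(\w[\w.]*)(?:\(\))?`")
_PASCAL_CASE_RE = re.compile(r"\b([A-Z][a-z]+(?:[A-Z][a-z]+)+)\b")

def extract_entities(text):
    """Backtick entities first, then PascalCase names, deduplicated (sketch)."""
    entities = _BACKTICK_ENTITY_RE.findall(text)
    for name in _PASCAL_CASE_RE.findall(text):
        if name not in entities:
            entities.append(name)
    return entities
```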
### Module Inference

Modules are inferred from file paths:

```python
# "src/robotmem/ops/search.py" → module "ops"
# (parent directory, excluding src/lib/app/tests)
```
### Scope Output

```json
{
  "scope_files": ["src/robotmem/search.py"],
  "scope_entities": ["embed_one", "CogDatabase"],
  "scope_modules": ["robotmem"]
}
```
## Tag Classification

Tags provide multi-label classification: a single memory can carry multiple tags.

### Process

```python
def classify_tags(text, context_json=None):
    tags = []

    # 1. Match ALL regex patterns (not just the first)
    for category, pattern in _CATEGORY_PATTERNS:
        if pattern.search(text):
            tags.append(category)

    # 2. Extract scenario_tags from the context JSON
    if context_json:
        try:
            ctx = json.loads(context_json)
        except json.JSONDecodeError:
            ctx = {}
        if isinstance(ctx, dict):
            for tag in ctx.get("scenario_tags", []):
                if tag in VALID_TAGS and tag not in tags:
                    tags.append(tag)

    # 3. Fallback
    if not tags:
        tags.append("code")
    return tags
```
Key difference from `classify_category()`:

- `classify_category()` → first match (single label)
- `classify_tags()` → all matches (multi-label)
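The contrast fits in two tiny functions. The two-entry `_PATTERNS` table is hypothetical; the real engine iterates all 17 category patterns.

```python
import re

# Hypothetical two-entry pattern table for illustration.
_PATTERNS = [
    ("root_cause",  re.compile(r"because|caused\s*by", re.I)),
    ("observation", re.compile(r"found\s+that|noticed|observed", re.I)),
]

def classify_category(text):
    for category, pattern in _PATTERNS:  # stops at the first hit
        if pattern.search(text):
            return category
    return "code"

def classify_tags(text):
    tags = [c for c, p in _PATTERNS if p.search(text)]  # collects every hit
    return tags or ["code"]
```

For "Found that grip fails because humidity is high", `classify_category` returns only `"root_cause"`, while `classify_tags` returns both labels.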
### Context-Based Tags

Tags can also be injected via the context JSON:

```json
{
  "scenario_tags": ["debug", "concurrency", "pattern"]
}
```

These are validated against the `VALID_TAGS` whitelist (50+ tags) before insertion.
### Tag Taxonomy

The tag system uses a 9-dimension hierarchical tree with 50+ tags:

```
metacognition ← reasoning, cognitive_bias, decision_framework,
│               systems_thinking, risk_thinking, worldview, decision
│
capability    ← build, debug, design, review, explain, optimize,
│               plan, architecture, code
│
domain        ← cs_fundamentals, ai_ml, finance, business,
│               cross_domain, config, observation, observation_code,
│               observation_debug
│
technique     ← patterns, anti_patterns, recipes, language_specific,
│               pattern
│
timing        ← when_to_start, when_to_stop, when_to_switch
│
boundary      ← tradeoff, not_applicable, diminishing_returns,
│               constraint
│
experience    ← war_story, postmortem, gotcha, root_cause, revert
│
self_defect   ← hallucination, sycophancy, overengineering,
│               no_verification
│
reflection    ← accuracy_calibration, behavior_rule, blind_spot,
                preference
```
### Dimension Prefix

Each tag maps to a human-readable dimension prefix for display:

```python
dimension_prefix("gotcha")         # → "[经验/踩坑]"    (experience / gotcha)
dimension_prefix("architecture")   # → "[能力/架构设计]" (capability / architecture design)
dimension_prefix("metacognition")  # → "[元认知]"       (metacognition, a root node)
```
### Tag Storage

Tags are stored in the `memory_tags` table:

| Column | Description |
|---|---|
| `memory_id` | FK to `memories.id` |
| `tag` | Tag from the controlled vocabulary |
| `source` | `"auto"` (regex-inferred) or `"user"` (explicitly set) |

Primary key: `(memory_id, tag)`, so a memory can never carry duplicate tags.
### Tag Metadata

The `tag_meta` table maintains the taxonomy hierarchy:

| Column | Description |
|---|---|
| `tag` | Tag identifier (PK) |
| `parent` | Parent tag (NULL = root dimension) |
| `display_name` | Human-readable name |
This table is automatically synced from `TAG_META_TREE` at database initialization.
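A plausible sqlite rendering of the two tables; the DDL is my reconstruction from the column descriptions above, not the shipped schema.

```python
import sqlite3

# Hypothetical DDL reconstructed from the column descriptions above.
SCHEMA = """
CREATE TABLE memory_tags (
    memory_id INTEGER NOT NULL,
    tag       TEXT    NOT NULL,
    source    TEXT    NOT NULL DEFAULT 'auto',  -- 'auto' (regex) or 'user'
    PRIMARY KEY (memory_id, tag)
);
CREATE TABLE tag_meta (
    tag          TEXT PRIMARY KEY,
    parent       TEXT,              -- NULL marks a root dimension
    display_name TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO memory_tags (memory_id, tag) VALUES (1, 'root_cause')")

# The composite primary key rejects a duplicate (memory_id, tag) pair:
try:
    conn.execute("INSERT INTO memory_tags (memory_id, tag) VALUES (1, 'root_cause')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```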
## Context JSON Builder

The `build_context_json()` function merges user-provided context with automatic metadata:

```python
def build_context_json(insight, context):
    result = {"source": "learn_tool"}
    if context:
        try:
            parsed = json.loads(context)
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, dict):
            result.update(parsed)             # merge user fields
        else:
            result["user_context"] = context  # store the raw string as-is
    return json.dumps(result)
```

This ensures every memory has at least a `source` field for provenance tracking.
## Path Normalization

File paths extracted by scope are normalized for consistency:

```python
def normalize_scope_files(files, project_root=None):
    """Absolute → relative, deduplicated, sorted"""
    # /Users/jia/project/src/foo.py → src/foo.py
```

This prevents duplicate entries when the same file is referenced with both absolute and relative paths.
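One possible implementation of this normalization, assuming POSIX-style paths; a sketch, not the actual robotmem code.

```python
import os

def normalize_scope_files(files, project_root=None):
    """Absolute → relative (when under project_root), deduplicated, sorted."""
    normalized = set()
    for path in files:
        if project_root and os.path.isabs(path):
            rel = os.path.relpath(path, project_root)
            if not rel.startswith(".."):  # only rewrite paths inside the root
                path = rel
        normalized.add(path)
    return sorted(normalized)
```

With this, `/Users/jia/project/src/foo.py` and `src/foo.py` collapse into a single entry, while absolute paths outside the project root are left untouched.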
## Design Principles

- Pure regex, no LLM: sub-millisecond execution, deterministic, no hallucination risk
- First-match priority: categories have strict ordering, so safety constraints always win
- Multi-label tags: a memory can be both `root_cause` and `observation`
- Bilingual: all patterns work for both English and Chinese text
- Graceful fallback: unknown text defaults to the `code` category with 0.80 confidence
- Controlled vocabulary: all tags are validated against a whitelist; no free-form tagging