KeywordTreeDesign

关键词知识树系统设计（KeywordTree）

本文件是基于 基于关键词的知识树系统设计方案.md 的落地设计，范围限定： 骨架（完整 API）+ MVP 实现（Search / CreateKeyword / CreateInfo / Agent 下钻 / SuggestParent）。代码交付物：Agent/keyword_tree.py（单文件）+ Agent/test_keyword_tree.py（pytest）。

1. 范围

有实现（MVP 必须跑通）：
- Search(query, llm_expand_query=True)（精确倒排 + Agent 多层下钻）
- CreateKeyword(..., llm_auto_place=True)（可 LLM 自动选父）
- CreateInfo(content, keyword_ids=..., auto_link=False)
- LinkInfo / GetInfosOfKeyword / GetKeywordsOfInfo / GetKeyword / GetChildren / GetPath
- 归一化、精确倒排、Agent 下钻循环 _AgentDescend、_SuggestParent
- jsonl 追加存储、启动重建索引、MRU 缓存、threading.Lock 串行化写
只有骨架（... 占位）：
- UpdateKeyword / MoveKeyword / MergeKeywords / DeleteKeyword / AddAlias / RemoveAlias
- UpdateInfo / DeleteInfo / UnlinkInfo
- BatchCreateKeywords / BatchCreateInfos
- Reorganize / ApplyReorganizePlan / TreeMetrics / Undo / CompactStorage / Snapshot / Restore
不在本次范围内：持久化索引文件（indexes/*.json）、cache/ 目录、snapshots/、dedup_check / disambiguate_same_name / link_info Agent 函数（V2 再做）。

2. 关键设计决策

维度	选择	备注
代码布局	单文件 `Agent/keyword_tree.py`	与 `Agent/WeaveAgent.py` 同级
命名风格	公开 API 用 CamelCase	对齐 `WeaveAgent.py`，便于后者薄封装
LLM 依赖	抽象类 `LLMClient` + 依赖注入	测试用 `StubLLMClient`，不依赖真实 LLM
归一化	小写 + 去标点 / 空白 + 全/半角统一	不做繁简、不做拼音；留意 `_Normalize` 是可替换函数
ID 格式	UUID4 字符串	`"root"` 为唯一特例，保留可读
`RelationType`	固定四值枚举：`PRIMARY / RELATED / EXAMPLE / SOURCE`	字符串 Enum，jsonl 写 `.value`
`Search` 返回类型	单一 dataclass `SearchResult` + `status` 字段	取代 Union，便于字段统一检索
数据目录	默认 `./data/keyword_tree`，构造时可覆盖	相对当前工作目录
并发	单进程 + `threading.Lock`	多进程 `fcntl` 留给 V3
候选视野	按 `max_candidates`（默认 50）截断的扁平候选列表	不按深度展开，BFS 到达上限即停
LLM 提示	每候选只含 `{idx, name, path}`；不下发 UUID；LLM 回传 `idx` 整数，内部映射回 node_id	省 token：UUID~36 字符 → idx 1-3 字符
缓存	固定容量 MRU 队列，只存 `node_id`	容量默认 128，不持久化
根节点	首次构造时自动创建 `id="root"`, `level=0`	`parent_id=None` 的新节点实际挂 root 下
`description`	可为空；Agent prompt 里对空描述标注"无描述"	—
测试	`Agent/test_keyword_tree.py` pytest，用 `StubLLMClient`	`tmp_path` fixture 隔离

3. 数据模型

所有 dataclass 与 jsonl 行一一对应（字段名一致）。

class RelationType(str, Enum):
    PRIMARY = "PRIMARY"
    RELATED = "RELATED"
    EXAMPLE = "EXAMPLE"
    SOURCE  = "SOURCE"

class SearchStatus(str, Enum):
    MATCHED   = "matched"
    AMBIGUOUS = "ambiguous"
    NOT_FOUND = "not_found"

@dataclass
class KeywordNode:
    id: str                      # uuid4，或特例 "root"
    name: str
    aliases: list[str]
    normalized: list[str]        # name + aliases 归一化 token；用作倒排 key
    level: int                   # root=0
    parent_id: Optional[str]     # root 为 None
    children: list[str]          # 冗余：以 _Indexes.children 为准
    description: str
    metadata: dict
    version: int                 # 乐观锁，写一次 +1
    created_at: float
    updated_at: float
    deleted: bool = False        # 软删

@dataclass
class Info:
    id: str
    content: str
    source: str
    metadata: dict
    version: int
    created_at: float
    updated_at: float
    deleted: bool = False

@dataclass
class InfoKeywordLink:
    info_id: str
    keyword_id: str
    relation: RelationType
    created_by: str              # "user" / "agent" / 具体来源标识
    created_at: float
    deleted: bool = False        # 软解链

@dataclass
class SearchResult:
    status: SearchStatus
    node: Optional[KeywordNode] = None
    path: list[KeywordNode] = field(default_factory=list)
    infos: list[Info] = field(default_factory=list)
    candidates: list[KeywordNode] = field(default_factory=list)  # AMBIGUOUS 用
    suggested_parent_id: Optional[str] = None                    # NOT_FOUND 用
    suggested_name: Optional[str] = None                         # NOT_FOUND 用
    reason: str = ""

@dataclass
class AgentDecision:
    action: Literal["jump", "match", "missing", "ambiguous"]
    # jump / match 支持单目标或多目标：优先读 node_ids（列表），否则回退 node_id 单值。
    node_id: Optional[str] = None
    node_ids: list[str] = field(default_factory=list)
    jump_depth: int = 1
    suggest_name: Optional[str] = None
    candidate_ids: list[str] = field(default_factory=list)  # ambiguous 专用
    reason: str = ""

软删语义：删除 / 解链 = 追加 deleted=true 的新版本行，启动重建时丢弃该 id。jsonl 行永不原地修改。

4. 存储与内存索引

4.1 目录结构（MVP 阶段）

<data_dir>/
  nodes.jsonl              # KeywordNode append-only
  infos.jsonl              # Info append-only
  links.jsonl              # InfoKeywordLink append-only
  change_log.jsonl         # 写操作审计（op/operation_id/timestamp/after…）

MVP 不写 indexes/ / cache/ / snapshots/ 三个目录——那是 compaction / persistent cache / snapshot 的范围，仅在对应骨架方法完成时启用。

4.2 `_JsonlStore`

职责：文件 IO + fsync。对外接口仅处理 dict，不处理 dataclass。

class _JsonlStore:
    def __init__(self, data_dir: str): ...
    def _EnsureDirs(self): ...
    def AppendNode(self, row: dict): ...
    def AppendInfo(self, row: dict): ...
    def AppendLink(self, row: dict): ...
    def AppendChangeLog(self, row: dict): ...
    def LoadNodes(self) -> Iterator[dict]: ...
    def LoadInfos(self) -> Iterator[dict]: ...
    def LoadLinks(self) -> Iterator[dict]: ...

写操作：open(path, "a", encoding="utf-8") → write(json.dumps(row, ensure_ascii=False) + "\n") → flush() → os.fsync(fileno())。每次 append 打开/关闭的开销可接受（MVP 规模）；若未来成瓶颈改为常驻句柄。

change_log 行格式：

{"op": "create_keyword", "operation_id": "<uuid>", "timestamp": 1712345678.9,
 "after": {...node dict...}, "agent_io": null}

MVP 不写 before 字段（undo 场景在骨架里）。

4.3 `_Indexes`

class _Indexes:
    nodes: dict[str, KeywordNode]
    infos: dict[str, Info]
    inverted: dict[str, set[str]]                      # normalized_token -> {node_id}
    children: dict[str, list[str]]                     # parent_id -> [child_id...]，root 用 key "root"
    info_by_kw: dict[str, list[str]]                   # keyword_id -> [info_id...]
    kw_by_info: dict[str, list[InfoKeywordLink]]       # info_id -> [link...]

4.4 启动重建流程

_JsonlStore.LoadNodes() → 按 id 保留 max(version) 的行；deleted=True 的最终版丢弃；填 nodes / inverted / children。
LoadInfos() 同理填 infos。
LoadLinks() → 按 (info_id, keyword_id) 取最新状态，deleted=True 的丢弃；填 info_by_kw / kw_by_info。
若 nodes 里没有 id="root" → 追加一行 root 节点到 nodes.jsonl 并补到索引。

启动路径是幂等的：反复构造不会重复 append root。

4.5 `_MRUCache`

class _MRUCache:
    def __init__(self, capacity: int = 128): ...
    def Touch(self, node_id: str): ...   # 已存在则移到队首；否则 push 队首；超容淘汰队尾
    def Remove(self, node_id: str): ...  # 写删/合并/移动时清
    def Snapshot(self) -> list[str]: ... # 供 _AgentDescend 作为优先候选

不持久化；进程重启即空。

5. LLM 抽象

class LLMClient(ABC):
    @abstractmethod
    def Chat(self, messages: list[dict], json_schema: dict) -> dict:
        """
        messages: OpenAI 风格 [{"role":"system"|"user","content":"..."}]
        json_schema: 期望输出的 JSON schema（库会把它拼到 system prompt 末尾要求返回 JSON）
        返回：解析后的 dict。
        - 如果模型返回非法 JSON 或字段缺失，实现方应抛异常，_AgentDescend 会降级为 NOT_FOUND。
        """

class StubLLMClient(LLMClient):
    """
    MVP 测试用。预置决策队列，按顺序返回。
    用法：
        stub = StubLLMClient(decisions=[
            {"action":"jump","node_id":"u1"},
            {"action":"match","node_id":"u11"},
        ])
    队列耗尽时返回 {"action":"missing","reason":"stub exhausted"}。
    """

真实 LLM 客户端（DeepSeek / Kimi / 其他）由调用方（如 WeaveAgent.LongTermMemory）构造注入。

6. Agent 下钻循环

宽视野候选（vs. 原始稿）：每轮喂给 LLM 的候选是一份扁平列表——

candidates：从 start_ids 出发广度优先收集，受 max_candidates（默认 50）截断；每项只含 {id, name, path}，不含 description / aliases / 嵌套子树，让 LLM 基于关键词本身 + 层级路径判断；可一轮直接 match 到任意深度的节点 id，无需逐层 jump。

候选数量由 max_candidates 控制（不按深度展开）。对大树建议调高；若需极省 token 调低。MRU 入口使"上次命中的节点"自然前置。

效率约定：能 match 就不 jump；jump 只在需要切换分支换视野时使用。配合 MRU 入口，常见 query 1 轮即可结束。

6.1 `_AgentDescend(query, intent)`

入参：query: str、intent: Literal["search","suggest_parent"]

返回（内部 dict）：

{
  "outcome": "matched" | "missing" | "ambiguous",
  "node_id": Optional[str],         # matched 时
  "path": list[str],                # matched 时，root -> ... -> node_id
  "candidates": list[str],          # ambiguous 时
  "suggested_parent_id": Optional[str],  # missing 时
  "suggested_name": Optional[str],  # missing 时
  "reason": str,
}

6.2 循环步骤

初始候选集合：
- 先取 _MRUCache.Snapshot() 作为"优先入口"；叠加 root 的直接子节点；
- _DedupAncestors 去掉被他项覆盖的后代，剩下的是独立子树根；
- _CollectCandidates(start_ids) 广度优先收集至多 max_candidates（默认 50）条，每项 {id, name, path}。
构建 prompt：
- system：任务说明（依 intent 切措辞）+ 返回 schema（对应 AgentDecision）+ 候选说明 + 效率提示（能 match 就不 jump）。
- user：query、当前 path（首轮空）、candidates。
调 LLMClient.Chat，得到 AgentDecision。
按 action 分派（targets = node_ids if node_ids else ([node_id] if node_id else [])）：
- jump：
  - 空 targets → missing。
  - 单目标 → 推进到该节点（下轮从它的 children 展开）。
  - 多目标 → 下轮并行展开这些目标的 children 合集；path_ids 置为这些目标的 LCA；若所有目标都是叶子，直接返回 ambiguous。
- match：
  - 空 targets → missing。
  - 单目标 → 终结，outcome=matched。
  - 多目标 → 终结，outcome=ambiguous（多个可能的答案）。
- missing → 终结，outcome=missing，suggested_parent_id = current_path 最后一节点（首轮则为 root），suggested_name = suggest_name。
- ambiguous → 终结，outcome=ambiguous，candidates=candidate_ids。
降级（任一条件触发）：
- LLMClient.Chat 抛异常
- 返回 JSON 解析失败或缺字段
- 循环超过 descend_max_rounds（默认 6）
- jump 目标 node_id 不存在或不在当前子树展开里
→ outcome=missing，suggested_parent_id = current_path 最后一节点，reason="agent_failure" / "agent_timeout" / "invalid_jump"。
MRU 回填：outcome=matched 时 Touch(node_id)，以及 path 上每个节点都 Touch。

6.3 `Search(query, llm_expand_query)` 包装

tokens = _Normalize(query)
hits   = _ExactLookup(tokens)     # 基于 inverted 索引，token 任一匹配即算命中
if len(hits) == 1:
    return _BuildMatchedResult(hits[0])           # 拉 GetPath / 拉 infos / MATCHED
if len(hits) >= 2:
    return SearchResult(status=AMBIGUOUS,
                        candidates=[nodes[h] for h in hits])
# 0 条命中：
if not llm_expand_query:
    return SearchResult(status=NOT_FOUND, reason="exact_miss_llm_disabled")
desc = _AgentDescend(query, "search")
return _ConvertToSearchResult(desc)

6.4 `_SuggestParent(name)`

desc = _AgentDescend(name, "suggest_parent")
if desc.outcome == "matched":
    return desc.node_id
if desc.outcome == "missing":
    return desc.suggested_parent_id or "root"
# ambiguous → fallback root（MVP 不做消歧交互）
return "root"

7. 公开 API（完整骨架）

class KeywordTree:
    def __init__(
        self,
        data_dir: str = "./data/keyword_tree",
        llm_client: Optional[LLMClient] = None,   # 不传 → StubLLMClient()
        mru_capacity: int = 128,
        max_candidates: int = 50,                 # 每轮喂给 LLM 的候选上限
        descend_max_rounds: int = 6,
    ): ...

    # 检索（MVP ✅）
    def Search(self, query: str, llm_expand_query: bool = True) -> SearchResult: ...
    def GetKeyword(self, id: str) -> Optional[KeywordNode]: ...
    def GetChildren(self, id: str) -> list[KeywordNode]: ...
    def GetPath(self, id: str) -> list[KeywordNode]: ...
    def GetInfosOfKeyword(self, id: str,
                          relation: Optional[RelationType] = None,
                          page: int = 0, size: int = 50) -> list[Info]: ...
    def GetKeywordsOfInfo(self, info_id: str) -> list[tuple[KeywordNode, RelationType]]: ...

    # 关键词写
    def CreateKeyword(self, name: str,                            # ✅
                      parent_id: Optional[str] = None,
                      aliases: Optional[list[str]] = None,
                      description: str = "",
                      llm_auto_place: bool = True) -> KeywordNode: ...
    def UpdateKeyword(self, id: str, patch: dict, version: int): ...       # ⏳
    def MoveKeyword(self, id: str, new_parent_id: str): ...                # ⏳
    def MergeKeywords(self, src_id: str, dst_id: str): ...                 # ⏳
    def DeleteKeyword(self, id: str,                                       # ⏳
                      cascade: bool = False,
                      info_policy: str = "reattach"): ...
    def AddAlias(self, id: str, alias: str): ...                           # ⏳
    def RemoveAlias(self, id: str, alias: str): ...                        # ⏳

    # 信息写
    def CreateInfo(self, content: str, source: str = "",                   # ✅
                   keyword_ids: Optional[list[str]] = None,
                   auto_link: bool = False) -> Info: ...
    def UpdateInfo(self, info_id: str, patch: dict): ...                   # ⏳
    def DeleteInfo(self, info_id: str): ...                                # ⏳
    def LinkInfo(self, info_id: str, keyword_id: str,                      # ✅
                 relation: RelationType = RelationType.PRIMARY,
                 created_by: str = "user"): ...
    def UnlinkInfo(self, info_id: str, keyword_id: str): ...               # ⏳

    # 批量与运维（全部 ⏳）
    def BatchCreateKeywords(self, specs: list[dict]) -> list[KeywordNode]: ...
    def BatchCreateInfos(self, specs: list[dict]) -> list[Info]: ...
    def Reorganize(self, scope_id: Optional[str] = None) -> dict: ...
    def ApplyReorganizePlan(self, plan: dict): ...
    def TreeMetrics(self) -> dict: ...
    def Undo(self, operation_id: str): ...
    def CompactStorage(self): ...
    def Snapshot(self) -> str: ...
    def Restore(self, snapshot_ts: str): ...

    # 内部（下划线开头，MVP 全部 ✅）
    def _Normalize(self, text: str) -> list[str]: ...
        # MVP 返回长度 1 的 list（text 经小写/去标点/全半角归一化后的单 token）。
        # 返回 list 是为了未来扩展（拼音 / 多 token 分词）不改签名。
    def _ExactLookup(self, tokens: list[str]) -> list[str]: ...
        # 对每个 token 查 _Indexes.inverted，结果取并集，返回去重后的 node_id 列表。
    def _AgentDescend(self, query: str, intent: str) -> dict: ...
    def _SuggestParent(self, name: str) -> str: ...
    def _CollectCandidates(self, start_ids: list[str]) -> list[dict]: ...
        # BFS 收集，返回 [{id, name, path}, ...]，受 max_candidates 截断
    def _AppendNodeVersion(self, node: KeywordNode): ...
    def _AppendInfoVersion(self, info: Info): ...
    def _AppendLink(self, link: InfoKeywordLink): ...
    def _InvalidateNodeFromCache(self, node_id: str): ...

图例：✅ MVP 完整实现；⏳ 仅骨架 ...，docstring 指向后续阶段。

8. 写路径骨架

所有写方法在 threading.Lock 保护下执行：

参数校验 + 归一化
->  生成 id / 读取并 +1 version
->  _JsonlStore.AppendX(row)
->  更新 _Indexes
->  _JsonlStore.AppendChangeLog({op, operation_id, timestamp, after})
->  （按需）_MRUCache.Remove(旧 id) / Touch(新 id)
->  返回新 dataclass 实例

崩溃恢复靠 nodes/infos/links.jsonl 是 append-only + fsync：最坏丢失最后一行未 fsync 记录。

9. 测试 `Agent/test_keyword_tree.py`

所有用例用 tmp_path fixture 隔离 data_dir；用 StubLLMClient 驱动 Agent 分支。

用例	断言要点
`test_auto_root_on_fresh_init`	空目录构造 → `nodes.jsonl` 含 root 行，`GetKeyword("root")` 返回 level=0
`test_reload_from_jsonl`	建节点后销毁实例再构造同 data_dir → 索引一致、节点可查
`test_create_keyword_explicit_parent`	显式 `parent_id` 时 `llm_auto_place` 被忽略，父子关系落盘
`test_create_keyword_llm_auto_place`	`parent_id=None + llm_auto_place=True` → Stub 返回 `jump→match`，新节点挂到目标下
`test_search_exact_hit`	建 "Python" 节点 → `Search("python")` 归一化后精确命中 MATCHED，path 正确
`test_search_ambiguous_same_name`	两同名节点 → `Search(name)` 返回 AMBIGUOUS + candidates 两条
`test_search_not_found_falls_to_agent`	倒排 0 命中 → Stub `missing` → 返回 NOT_FOUND + suggested_parent
`test_search_agent_jump_match`	Stub 预置 `jump → match` → MATCHED、path 包含中间节点
`test_alias_hit`	aliases 进 inverted，按 alias 能搜到同节点
`test_create_info_explicit_link`	`CreateInfo(keyword_ids=[kid])` → `GetInfosOfKeyword(kid)` 拿到，`GetKeywordsOfInfo(iid)` 返回 PRIMARY
`test_link_info_relations`	指定 `RelationType.EXAMPLE` → `relation` 过滤器命中精确；`RELATED` 过滤为空
`test_mru_cache_touched_on_match`	连续两次 Search 同 query → 第二次 Agent prompt 候选里包含前次命中 node_id
`test_agent_failure_fallback`	Stub 抛异常 → NOT_FOUND，`reason` 含 `agent_failure` 或 `agent_timeout`
`test_llm_expand_query_false_skips_agent`	`llm_expand_query=False` + 倒排不命中 → 直接 NOT_FOUND，`LLMClient.Chat` 未被调

运行：pytest Agent/test_keyword_tree.py -v（cwd 可为仓库根或 Agent/）。

10. 交付物

Agent/keyword_tree.py — 单文件实现（MVP 实现 + 全量骨架）
Agent/test_keyword_tree.py — pytest 测试
Agent/KeywordTreeDesign.md — 本文件

运行时产物（不进 git）：data/keyword_tree/*.jsonl。

11. 与 `WeaveAgent.py` 的对接

WeaveAgent.LongTermMemory 将持有一个 KeywordTree 实例：

class LongTermMemory:
    def __init__(self, tree: KeywordTree):
        self._tree = tree
    def Search(self, query, llm_expand_query=True):
        return self._tree.Search(query, llm_expand_query)
    def CreateKeyword(self, name, parent_id=None, aliases=None,
                      description="", llm_auto_place=True):
        return self._tree.CreateKeyword(name, parent_id, aliases,
                                         description, llm_auto_place)
    # 其余方法一一薄转发

骨架方法在 KeywordTree 端保持 ...，LongTermMemory 的转发方法也同样 ...，两侧一起演进。

Transformer

KV Cache

ChatGLM3典型计算图

Tokenization

Transformer in CV

新方法

Attention是不是必须的

神经网络的逻辑

改进大规模训练稀疏自编码器的方法

Transformer NLP到底有没有智能？

LLM大语言模型的训练

Transformer黑盒

NSA 稀疏注意力机制 by deepseek

FlashAttention

MLA by Deekseek

幻觉

LLM信息空间的映射

LLM推理行为

梯度下降-拟合高层次的信息

生物脑（人脑）的优势

意识（自我）

人脑的工作原理

仿真人脑的基本条件

人脑的推理模型

人脑的本能

CoT & 强化学习

分层LLM推理与Scaling思维模板

自然语言的内在逻辑

记忆

RAG

GraphRAG

记忆方法学

爬虫

思维编织器 WeaveMind

Meaning Dataset 详细介绍文档

附带meaning tree信息的数据集

AI突破的可能

计算模型

自洽后就有意识了吗？

动态算法

短期记忆和长期记忆

通用Agent的发展

Agentic Engineering 智能体编排

私人LLM评测 数据集和结果

卷积计算的测试

复杂工程的实践测试

KeywordTreeDesign

README

SVO表达指令

SVO语义检索的系统化方案

SVO语义矢量算子手册

WeaveAgentDesign

WeaveAgentImplementationPlan

一种理想的智能体编排架构

基于关键词的知识树系统设计方案

大脑记忆的核心机制

KeywordTreeDesign

关键词知识树系统设计（KeywordTree）

1. 范围

2. 关键设计决策

3. 数据模型

4. 存储与内存索引

4.1 目录结构（MVP 阶段）

4.2 _JsonlStore

4.3 _Indexes

4.4 启动重建流程

4.5 _MRUCache

5. LLM 抽象

6. Agent 下钻循环

6.1 _AgentDescend(query, intent)

6.2 循环步骤

6.3 Search(query, llm_expand_query) 包装

6.4 _SuggestParent(name)

7. 公开 API（完整骨架）

8. 写路径骨架

9. 测试 Agent/test_keyword_tree.py

10. 交付物

11. 与 WeaveAgent.py 的对接

私人LLM评测数据集和结果

4.2 `_JsonlStore`

4.3 `_Indexes`

4.5 `_MRUCache`

6.1 `_AgentDescend(query, intent)`

6.3 `Search(query, llm_expand_query)` 包装

6.4 `_SuggestParent(name)`

9. 测试 `Agent/test_keyword_tree.py`

11. 与 `WeaveAgent.py` 的对接