Document Extraction Agent

Overview

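document_extraction is the platform's built-in agent type for shipping-document recognition, but its current product boundary has been narrowed to ai_service/document_recognition/. It extracts structured data from shipping documents and generates a field-level review projection for subsequent human review.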

The current capabilities cover:

PDF input
Image input (image/*)
Structured JSON output
Field-level review persistence
Per-field revision timeline
Admin monitoring and queue statistics

This means it is no longer just a one-shot "PDF -> JSON" extractor, but one member of the document-recognition runtime family.

Module Structure

ai_service/agents/document_extraction/
├── __init__.py
├── graph.py
├── nodes.py
├── state.py
├── schemas.py
└── pdf_utils.py

ai_service/document_recognition/
├── domain/
├── application/
├── infrastructure/
└── interfaces/

Where:

graph.py and nodes.py implement the built-in LangGraph extraction flow
schemas.py defines the structured output and field detection models
document_recognition/application handles the orchestrator, runtime resolution, and review projection
document_recognition/infrastructure/runtimes/builtin_document_extraction_runtime.py wraps the current graph as a runtime adapter

Execution Flow

stateDiagram-v2
    [*] --> intake
    intake --> normalize
    normalize --> pdf_ingest
    pdf_ingest --> pdf_processor
    pdf_processor --> data_extraction
    data_extraction --> validation
    validation --> pdf_reporter
    pdf_reporter --> review_projection
    review_projection --> [*]
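
The diagram reads as a linear chain of stages. The following is a minimal sketch in plain Python, not the LangGraph implementation in graph.py: the stage names come from the diagram, while `run_pipeline`, the dict-based state, and the `normalize` handler are hypothetical stand-ins used only for illustration.

```python
from typing import Callable, Dict, List

# Stage order taken from the state diagram.
STAGES: List[str] = [
    "intake",
    "normalize",
    "pdf_ingest",
    "pdf_processor",
    "data_extraction",
    "validation",
    "pdf_reporter",
    "review_projection",
]

State = Dict[str, object]
Handler = Callable[[State], State]


def run_pipeline(state: State, handlers: Dict[str, Handler]) -> State:
    """Run every stage in order and record which stages were visited.

    Hypothetical helper: a stage without a registered handler passes
    the state through unchanged.
    """
    visited: List[str] = []
    for name in STAGES:
        handler = handlers.get(name)
        if handler is not None:
            state = handler(state)
        visited.append(name)
    return {**state, "visited": visited}


def normalize(state: State) -> State:
    # Assumption: image inputs are re-labelled as PDF assets before pdf_ingest.
    media_type = str(state.get("source_media_type", ""))
    if media_type.startswith("image/"):
        return {**state, "normalized_media_type": "application/pdf"}
    return {**state, "normalized_media_type": media_type}


result = run_pipeline({"source_media_type": "image/png"}, {"normalize": normalize})
```

A real graph would attach handlers to every stage; leaving most of them empty here keeps the focus on ordering and on where normalization sits relative to pdf_ingest.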

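Key Notes

Image input is first normalized into a PDF asset and then runs through the existing extraction graph.
After extraction completes, the raw structured result is still written to object storage.
Review projections such as summary_json, field_reviews, and issue_list are generated at the same time.
The Demo workbench writes review results back through field-level PATCH requests; the job-level status is aggregated automatically from the field rows.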
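
To make the field-to-job aggregation concrete, here is a minimal sketch. The field-level statuses (pending, accepted, corrected, flagged) come from the field review table in this document; the precedence rule and the job-level names `in_review` and `reviewed` are assumptions for illustration, since the actual aggregation rule is not specified here.

```python
from collections import Counter
from typing import Iterable

# Field-level statuses listed in the field review table.
FIELD_STATUSES = {"pending", "accepted", "corrected", "flagged"}


def aggregate_review_status(field_statuses: Iterable[str]) -> str:
    """Derive a job-level status from field-level statuses.

    Assumed rule (not confirmed by the source): any flagged field wins,
    then partially reviewed jobs are "in_review", and a job whose fields
    are all accepted or corrected is "reviewed".
    """
    counts = Counter(field_statuses)
    unknown = set(counts) - FIELD_STATUSES
    if unknown:
        raise ValueError(f"unknown field status: {sorted(unknown)}")
    if not counts:
        return "pending"
    if counts["flagged"]:
        return "flagged"
    if counts["pending"]:
        # Mixed pending and reviewed fields mean the review is under way.
        return "in_review" if len(counts) > 1 else "pending"
    return "reviewed"
```

Computing the job status from the field rows on every write, rather than storing it independently, keeps the two levels from drifting apart.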

Database Models

`DocumentExtractionJob`

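In addition to the original execution-state fields, it now also carries the recognition operations projection and the runtime snapshot:

Column	Description
`source_media_type`	MIME type of the source document
`source_filename`	Original file name
`runtime_agent_id`	Underlying runtime agent
`runtime_agent_version_id`	Runtime snapshot version
`runtime_agent_type_snapshot`	Snapshot of the runtime family
`execution_mode`	Legacy runtime snapshot, kept only for persistence compatibility; not part of the `/document-recognition/runs*` canonical response
`document_type`	Recognized document type
`review_status`	Job-level review status
`summary_json`	Summary projection for the frontend
`low_confidence_count`	Number of low-confidence fields
`corrected_field_count`	Number of corrected fields
`last_reviewed_at`	Time of the most recent review
`last_reviewed_by`	Most recent reviewer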

`DocumentExtractionFieldReview`

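The field-level review table holds manual corrections and issue tracking:

Column	Description
`field_key`	Field identifier
`field_label`	UI label
`extracted_value_json`	Originally extracted value
`current_value_json`	Current reviewed value
`confidence`	Field confidence
`page_number`	Page on which the field appears
`bbox_json`	Highlight box for the preview
`review_status`	pending / accepted / corrected / flagged
`issue_code`	Issue code, such as missing value, low confidence, or validation follow-up
`reviewer_note`	Reviewer note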
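
As an illustration of how the job-level counters `low_confidence_count` and `corrected_field_count` could be derived from these rows, consider the sketch below. The `FieldReview` dataclass mirrors a subset of the columns; the `job_counters` helper and the 0.8 threshold are assumptions, as the source does not state what counts as low confidence.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Assumed cut-off for "low confidence"; the real threshold is not documented here.
LOW_CONFIDENCE_THRESHOLD = 0.8


@dataclass
class FieldReview:
    """In-memory stand-in for a subset of the field review columns."""

    field_key: str
    extracted_value_json: Any
    current_value_json: Any
    confidence: Optional[float]
    review_status: str = "pending"


def job_counters(rows: List[FieldReview]) -> Dict[str, int]:
    """Compute the two job-level counters from the field rows (hypothetical helper)."""
    low = sum(
        1
        for row in rows
        if row.confidence is not None and row.confidence < LOW_CONFIDENCE_THRESHOLD
    )
    corrected = sum(1 for row in rows if row.review_status == "corrected")
    return {"low_confidence_count": low, "corrected_field_count": corrected}
```

Rows without a confidence value are skipped rather than treated as low confidence; that choice is also an assumption.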

`DocumentExtractionFieldReviewRevision`

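The field revision ledger records every effective manual change in an append-only manner:

Column	Description
`field_review_id`	Reference to the current-state field row
`revision_number`	Monotonically increasing version number within a single field
`previous_value_json` / `next_value_json`	Field value before and after the change
`previous_review_status` / `next_review_status`	Review status before and after the change
`previous_reviewer_note` / `next_reviewer_note`	Reviewer note before and after the change
`reviewer_identity_snapshot`	Display-oriented snapshot of the operator
`change_source`	Source of the change
`created_at`	Time of this revision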
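
The append-only behavior can be sketched as follows. This is an in-memory illustration under stated assumptions, not the persistence code: the column names come from the table, while `FieldReviewRow`, `apply_edit`, and the `"demo_workbench"` change source are hypothetical, and "effective change" is assumed to mean that the value, status, or note actually differs.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, List, Optional


@dataclass(frozen=True)
class Revision:
    """One immutable ledger entry; existing entries are never updated."""

    revision_number: int
    previous_value_json: Any
    next_value_json: Any
    previous_review_status: str
    next_review_status: str
    previous_reviewer_note: Optional[str]
    next_reviewer_note: Optional[str]
    change_source: str
    created_at: datetime


@dataclass
class FieldReviewRow:
    """Current-state row plus its revision ledger (hypothetical in-memory model)."""

    current_value_json: Any
    review_status: str = "pending"
    reviewer_note: Optional[str] = None
    revisions: List[Revision] = field(default_factory=list)

    def apply_edit(
        self,
        value: Any,
        status: str,
        note: Optional[str] = None,
        change_source: str = "demo_workbench",  # hypothetical source label
    ) -> Optional[Revision]:
        before = (self.current_value_json, self.review_status, self.reviewer_note)
        if (value, status, note) == before:
            return None  # no effective change, so nothing is appended
        revision = Revision(
            revision_number=len(self.revisions) + 1,
            previous_value_json=before[0],
            next_value_json=value,
            previous_review_status=before[1],
            next_review_status=status,
            previous_reviewer_note=before[2],
            next_reviewer_note=note,
            change_source=change_source,
            created_at=datetime.now(timezone.utc),
        )
        self.revisions.append(revision)
        self.current_value_json, self.review_status, self.reviewer_note = value, status, note
        return revision
```

Because entries are only appended, `revision_number` can be derived from the ledger length; a database-backed version would need a uniqueness constraint or locking to keep that guarantee under concurrent edits.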

API Roles

The legacy document_extraction Agent paths under /agents/{agent_id}/extraction-jobs* have been retired. External callers of document recognition should now use the canonical document-recognition surface:

GET /document-recognition/runtime-agents
GET /document-recognition/runtime-agents/{runtime_agent_id}
POST /document-recognition/runs
GET /document-recognition/runs
GET /document-recognition/runs/{run_id}
PATCH /document-recognition/runs/{run_id}/field-reviews/{field_id}
GET /document-recognition/runs/{run_id}/field-reviews/{field_id}/revisions
GET /document-recognition/runs/{run_id}/source-document
GET /document-recognition/runs/{run_id}/source-pdf
GET /document-recognition/runs/{run_id}/result

Admin monitoring endpoints:

GET /admin/document-recognition/overview
GET /admin/document-recognition/runs
GET /admin/document-recognition/runtime-agents
PUT /admin/document-recognition/runtime-agents/{agent_id}
DELETE /admin/document-recognition/runtime-agents/{agent_id}

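Runtime agent selection no longer relies on the caller guessing; it is maintained explicitly by the admin registry. Callers should first read /document-recognition/runtime-agents, then call /document-recognition/runtime-agents/{runtime_agent_id} to obtain the published version, the upload slot resolution, and the execution policy summary. Only registered Fusion agents appear under /document-recognition/runtime-agents*, and only their runs enter /document-recognition/runs*. The canonical run response exposes only the fields needed for run, review, and projection, and no longer returns execution_mode.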
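
A caller-side sketch of this flow, using only the Python standard library, might look like the following. The endpoint paths are the ones listed in this section; the base URL, the helper names, and the PATCH payload keys are assumptions, since the request and response schemas are not documented here. The helpers only build `urllib.request.Request` objects and do not send them.

```python
import json
from typing import Any, Dict
from urllib.parse import quote
from urllib.request import Request

# Hypothetical base URL; replace with the real deployment address.
BASE_URL = "https://fusion.example.com/api"


def list_runtime_agents_request() -> Request:
    """Step 1: discover the registered runtime agents."""
    return Request(f"{BASE_URL}/document-recognition/runtime-agents", method="GET")


def runtime_agent_detail_request(runtime_agent_id: str) -> Request:
    """Step 2: fetch the published version and execution policy summary."""
    agent = quote(runtime_agent_id, safe="")
    return Request(f"{BASE_URL}/document-recognition/runtime-agents/{agent}", method="GET")


def field_review_patch_request(run_id: str, field_id: str, payload: Dict[str, Any]) -> Request:
    """Write one field-level review result back.

    The payload shape is an assumption; only the path comes from the endpoint list.
    """
    url = (
        f"{BASE_URL}/document-recognition/runs/{quote(run_id, safe='')}"
        f"/field-reviews/{quote(field_id, safe='')}"
    )
    body = json.dumps(payload).encode("utf-8")
    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="PATCH")
```

Passing a built request to `urllib.request.urlopen` would perform the call; authentication headers are omitted because the source does not describe them.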

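Frontend Division of Responsibilities

demo-frontend is the primary operating surface: upload, preview, bbox linkage, and accepting, correcting, or flagging fields
admin-frontend is the control plane: overview, filtering, queues, details, and jumping to Demo review

This division is the fixed pattern for the document-recognition product surface in the current repository.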

API Reference

Document extraction agent type plugin.

Builds a multi-node LangGraph that processes PDF documents through ingestion, image conversion, LLM extraction, validation, and reporting.

Attributes:

Name	Type	Description
`agent_type`	`str`	`"document_extraction"`.
`display_name`	`str`	Human-readable name.
`description`	`str`	Brief description.

build_graph

build_graph(**kwargs)

Build and compile the document extraction LangGraph.

The graph implements a multi-node pipeline with retry logic: pdf_ingest -> pdf_processor -> data_extraction -> validation -> pdf_reporter

Parameters:

Name	Type	Description	Default
`**kwargs`	`Any`	Reserved for future configuration.	`{}`

Returns:

Name	Type	Description
`CompiledStateGraph`	`CompiledStateGraph`	Compiled document extraction processing graph.

get_state_schema

get_state_schema()

Return the DocumentExtractionState schema.

Returns:

Name	Type	Description
`type`	`type`	The `DocumentExtractionState` class.

Bases: BaseModel

Complete shipping booking data model.

Comprehensive model for structured shipping document extraction, containing 100+ fields covering booking metadata, parties, ports, vessel info, containers, cargo, compliance, and logistics details.

to_dict

to_dict()

Return a JSON-serializable dict representation.

Returns:

Type	Description
`Dict[str, Any]`	Model data as a dictionary keyed by field aliases.

prompt_schema

prompt_schema()

Return a JSON schema block for prompt injection with extraction instructions.

Returns:

Name	Type	Description
`str`	`str`	Formatted schema string with field type instructions.