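Parsers

This page collects the public parsing interfaces, selection functions, and built-in parser objects in ai_service.services.parsers. The details are rendered directly from the source docstrings.

For subsystem responsibilities, selection logic, and extension conventions, see Parser Subsystem.
For how the main ingestion flow uses parsers, see IngestionService Overview.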

Base Types

ParseResult

Result of parsing a document.

Attributes:

Name	Type	Description
`text`	`str`	Extracted text content from the document.
`metadata`	`dict[str, Any]`	Additional metadata extracted during parsing.
`page_count`	`Optional[int]`	Number of pages (for paginated documents).
`char_count`	`int`	Total character count of extracted text.

__post_init__

__post_init__()

Calculate character count after initialization.
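
The attribute table and the `__post_init__` hook suggest a dataclass whose `char_count` is derived from `text`. The following is a minimal sketch under that assumption, not the package's actual definition: the default values and the choice to exclude `char_count` from the constructor are guesses made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ParseResult:
    """Illustrative stand-in for the documented result type."""

    text: str
    # Assumption: metadata defaults to an empty dict.
    metadata: dict[str, Any] = field(default_factory=dict)
    # Assumption: page_count is None for non-paginated documents.
    page_count: Optional[int] = None
    # Assumption: char_count is derived, so it is not a constructor argument.
    char_count: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        # Calculate the character count once the text is known.
        self.char_count = len(self.text)


result = ParseResult(text="hello world", metadata={"filename": "a.txt"})
print(result.char_count, result.page_count)
```

Deriving the count in `__post_init__` keeps it consistent with `text` at construction time. It is not recomputed if `text` is reassigned later.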

BaseDocumentParser

Bases: ABC

Abstract base class for document parsers.

All document parsers must implement the parse method to extract text content from documents.

supported_content_types `abstractmethod` `property`

supported_content_types

Get the list of supported MIME types.

Returns:

Type	Description
`list[str]`	List of supported MIME types.

supported_extensions `abstractmethod` `property`

supported_extensions

Get the list of supported file extensions.

Returns:

Type	Description
`list[str]`	List of supported file extensions (e.g., ['.txt', '.md']).

parse `abstractmethod`

parse(file_data, filename)

Parse a document and extract text content.

Parameters:

Name	Type	Description	Default
`file_data`	`bytes`	Raw file content as bytes.	required
`filename`	`str`	Original filename (used for metadata).	required

Returns:

Name	Type	Description
`ParseResult`	`ParseResult`	Parsed content and metadata.

Raises:

Type	Description
`ValueError`	If the document cannot be parsed.

can_parse

can_parse(content_type)

Check if this parser can handle the given content type.

Parameters:

Name	Type	Description	Default
`content_type`	`str`	MIME type to check.	required

Returns:

Name	Type	Description
`bool`	`bool`	True if this parser can handle the content type.

can_parse_extension

can_parse_extension(extension)

Check if this parser can handle the given file extension.

Parameters:

Name	Type	Description	Default
`extension`	`str`	File extension to check.	required

Returns:

Name	Type	Description
`bool`	`bool`	True if this parser can handle the extension.
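
To make the contract above concrete, here is a hedged sketch of how an abstract base with these members might look, together with a custom subclass. The method bodies, the case-insensitive matching in `can_parse` and `can_parse_extension`, and the `CsvParser` class are assumptions for illustration only. `CsvParser` is hypothetical and is not one of the built-in parsers.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParseResult:
    """Trimmed-down stand-in for the documented result type."""

    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


class BaseDocumentParser(ABC):
    """Sketch of the documented interface; method bodies are assumptions."""

    @property
    @abstractmethod
    def supported_content_types(self) -> list[str]:
        """MIME types this parser accepts."""

    @property
    @abstractmethod
    def supported_extensions(self) -> list[str]:
        """File extensions this parser accepts, with the leading dot."""

    @abstractmethod
    def parse(self, file_data: bytes, filename: str) -> ParseResult:
        """Extract text, raising ValueError on failure."""

    def can_parse(self, content_type: str) -> bool:
        # Assumption: matching is case-insensitive.
        return content_type.lower() in self.supported_content_types

    def can_parse_extension(self, extension: str) -> bool:
        # Assumption: matching is case-insensitive.
        return extension.lower() in self.supported_extensions


class CsvParser(BaseDocumentParser):
    """Hypothetical parser, not one of the built-ins."""

    @property
    def supported_content_types(self) -> list[str]:
        return ["text/csv"]

    @property
    def supported_extensions(self) -> list[str]:
        return [".csv"]

    def parse(self, file_data: bytes, filename: str) -> ParseResult:
        try:
            text = file_data.decode("utf-8")
        except UnicodeDecodeError as exc:
            # Surface decoding problems as ValueError, as the contract requires.
            raise ValueError(f"Cannot parse {filename}: {exc}") from exc
        return ParseResult(text=text, metadata={"filename": filename})


parser = CsvParser()
print(parser.can_parse("text/csv"), parser.can_parse_extension(".CSV"))
```

Because the three abstract members are enforced by `ABC`, instantiating the base class directly, or a subclass that omits one of them, fails with `TypeError`.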

Registry Helpers

Get a parser instance for the given content type.

Parameters:

Name	Type	Description	Default
`content_type`	`str`	MIME type of the document.	required

Returns:

Name	Type	Description
`BaseDocumentParser`	`BaseDocumentParser`	Parser instance for the content type.

Raises:

Type	Description
`ValueError`	If no parser is available for the content type.

Get a parser instance for the given file extension.

Parameters:

Name	Type	Description	Default
`extension`	`str`	File extension (e.g., '.pdf', '.txt').	required

Returns:

Name	Type	Description
`BaseDocumentParser`	`BaseDocumentParser`	Parser instance for the extension.

Raises:

Type	Description
`ValueError`	If no parser is available for the extension.
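
The two lookup helpers above share one shape: find a parser that claims the content type or extension, otherwise raise `ValueError`. The sketch below illustrates that shape only. The function names `get_parser_for_content_type` and `get_parser_for_extension`, the `StubParser` class, and the registry contents are all hypothetical, since this page does not show the real names or mappings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StubParser:
    """Stand-in for a BaseDocumentParser subclass (hypothetical)."""

    name: str
    supported_content_types: tuple[str, ...]
    supported_extensions: tuple[str, ...]


# Assumed registry contents; the real mapping lives in the package source.
_PARSERS = (
    StubParser("text", ("text/plain", "text/markdown"), (".txt", ".md")),
    StubParser("pdf", ("application/pdf",), (".pdf",)),
)


def get_parser_for_content_type(content_type: str) -> StubParser:
    """Return the first registered parser that accepts the MIME type."""
    for parser in _PARSERS:
        if content_type in parser.supported_content_types:
            return parser
    raise ValueError(f"No parser available for content type: {content_type}")


def get_parser_for_extension(extension: str) -> StubParser:
    """Return the first registered parser that accepts the extension."""
    for parser in _PARSERS:
        # Assumption: extensions are compared case-insensitively.
        if extension.lower() in parser.supported_extensions:
            return parser
    raise ValueError(f"No parser available for extension: {extension}")


print(get_parser_for_extension(".md").name)
```

Callers that cannot guarantee a supported input would wrap the lookup in `try`/`except ValueError` or check support first.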

Get the content type for a file extension.

Parameters:

Name	Type	Description	Default
`extension`	`str`	File extension (e.g., '.pdf', '.txt').	required

Returns:

Name	Type	Description
`str`	`str`	MIME type for the extension.

Raises:

Type	Description
`ValueError`	If no content type is mapped for the extension.
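
An extension-to-MIME-type helper like the one documented above is typically a dictionary lookup that converts a missing key into `ValueError`. The following sketch assumes exactly that. The function name `content_type_for_extension` and the table contents are illustrative guesses, although the MIME types themselves are the standard ones for these formats.

```python
# Assumed mapping; the package's actual table may differ.
_EXTENSION_TO_CONTENT_TYPE = {
    ".txt": "text/plain",
    ".md": "text/markdown",
    ".pdf": "application/pdf",
    ".docx": (
        "application/vnd.openxmlformats-officedocument"
        ".wordprocessingml.document"
    ),
}


def content_type_for_extension(extension: str) -> str:
    """Map a file extension (with leading dot) to a MIME type."""
    # Assumption: lookups are case-insensitive.
    key = extension.lower()
    try:
        return _EXTENSION_TO_CONTENT_TYPE[key]
    except KeyError:
        # Hide the KeyError so callers only deal with ValueError.
        raise ValueError(
            f"No content type mapped for extension: {extension}"
        ) from None


print(content_type_for_extension(".pdf"))
```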

Check if a content type is supported.

Parameters:

Name	Type	Description	Default
`content_type`	`str`	MIME type to check.	required

Returns:

Name	Type	Description
`bool`	`bool`	True if the content type is supported.

Check if a file extension is supported.

Parameters:

Name	Type	Description	Default
`extension`	`str`	File extension to check.	required

Returns:

Name	Type	Description
`bool`	`bool`	True if the extension is supported.
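
The two boolean checks above are the non-raising counterparts of the lookup helpers, which makes them convenient for validating an upload before any parsing is attempted. A rough sketch follows, with hypothetical names (`is_supported_content_type`, `is_supported_extension`, `is_supported_filename`) and assumed supported sets.

```python
from pathlib import Path

# Assumed supported sets; the real ones are defined by the registered parsers.
_SUPPORTED_CONTENT_TYPES = frozenset(
    {"text/plain", "text/markdown", "application/pdf"}
)
_SUPPORTED_EXTENSIONS = frozenset({".txt", ".md", ".pdf", ".docx"})


def is_supported_content_type(content_type: str) -> bool:
    """Return True if some parser accepts the MIME type."""
    return content_type.lower() in _SUPPORTED_CONTENT_TYPES


def is_supported_extension(extension: str) -> bool:
    """Return True if some parser accepts the extension."""
    return extension.lower() in _SUPPORTED_EXTENSIONS


def is_supported_filename(filename: str) -> bool:
    """Hypothetical convenience wrapper: check a filename by its suffix."""
    # Path.suffix returns only the final suffix, or "" when there is none.
    return is_supported_extension(Path(filename).suffix)


print(is_supported_filename("report.PDF"), is_supported_filename("README"))
```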

Built-in Parsers

Bases: BaseDocumentParser

Parser for plain text and markdown documents.

This parser handles .txt and .md files by decoding them as UTF-8 text.

supported_content_types `property`

supported_content_types

Get the list of supported MIME types.

Returns:

Type	Description
`list[str]`	List of supported MIME types.

supported_extensions `property`

supported_extensions

Get the list of supported file extensions.

Returns:

Type	Description
`list[str]`	List of supported file extensions.

parse

parse(file_data, filename)

Parse a text document and extract content.

Parameters:

Name	Type	Description	Default
`file_data`	`bytes`	Raw file content as bytes.	required
`filename`	`str`	Original filename (used for metadata).	required

Returns:

Name	Type	Description
`ParseResult`	`ParseResult`	Parsed content and metadata.

Raises:

Type	Description
`ValueError`	If the document cannot be decoded.
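
The text parser's behavior, as described, comes down to a UTF-8 decode that reports failure as `ValueError`. The sketch below shows that core step with a hypothetical free function `parse_text` returning a plain dict. The real parser is a class returning `ParseResult`, and its metadata keys are not documented on this page.

```python
def parse_text(file_data: bytes, filename: str) -> dict:
    """Decode raw bytes as UTF-8 and return text plus basic metadata."""
    try:
        text = file_data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # UnicodeDecodeError already subclasses ValueError; re-raising gives
        # callers a message that names the offending file.
        raise ValueError(f"Could not decode {filename} as UTF-8") from exc
    return {
        "text": text,
        # Assumption: the filename is recorded under a "filename" key.
        "metadata": {"filename": filename},
        "char_count": len(text),
    }


parsed = parse_text("héllo".encode("utf-8"), "greeting.txt")
print(parsed["char_count"])
```

Note that `char_count` counts decoded characters, not bytes, so multi-byte characters such as `é` count once.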

Bases: BaseDocumentParser

Parser for PDF documents.

This parser uses pypdf to extract text content from PDF files.

supported_content_types `property`

supported_content_types

Get the list of supported MIME types.

Returns:

Type	Description
`list[str]`	List of supported MIME types.

supported_extensions `property`

supported_extensions

Get the list of supported file extensions.

Returns:

Type	Description
`list[str]`	List of supported file extensions.

parse

parse(file_data, filename)

Parse a PDF document and extract text content.

Parameters:

Name	Type	Description	Default
`file_data`	`bytes`	Raw file content as bytes.	required
`filename`	`str`	Original filename (used for metadata).	required

Returns:

Name	Type	Description
`ParseResult`	`ParseResult`	Parsed content and metadata.

Raises:

Type	Description
`ValueError`	If the PDF cannot be parsed.

Bases: BaseDocumentParser

Parser for Microsoft Word DOCX documents.

This parser uses python-docx to extract text content from DOCX files.

supported_content_types `property`

supported_content_types

Get the list of supported MIME types.

Returns:

Type	Description
`list[str]`	List of supported MIME types.

supported_extensions `property`

supported_extensions

Get the list of supported file extensions.

Returns:

Type	Description
`list[str]`	List of supported file extensions.

parse

parse(file_data, filename)

Parse a DOCX document and extract text content.

Parameters:

Name	Type	Description	Default
`file_data`	`bytes`	Raw file content as bytes.	required
`filename`	`str`	Original filename (used for metadata).	required

Returns:

Name	Type	Description
`ParseResult`	`ParseResult`	Parsed content and metadata.

Raises:

Type	Description
`ValueError`	If the DOCX cannot be parsed.
