Parsers

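This page summarizes the public parsing interfaces, selection functions, and built-in parser objects in ai_service.services.parsers. The details are rendered directly from the source docstrings.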

Base Types

ParseResult

Result of parsing a document.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| text | str | Extracted text content from the document. |
| metadata | dict[str, Any] | Additional metadata extracted during parsing. |
| page_count | Optional[int] | Number of pages (for paginated documents). |
| char_count | int | Total character count of extracted text. |

__post_init__

__post_init__()

Calculate character count after initialization.
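The relationship between text and char_count can be sketched as a dataclass. This is a minimal illustration only, assuming the result type is a dataclass (suggested by __post_init__); the default values and the choice to exclude char_count from the constructor are assumptions, not taken from the source.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ParseResult:
    """Illustrative stand-in for the documented result type."""

    text: str
    # Defaults below are assumptions made for this sketch.
    metadata: dict[str, Any] = field(default_factory=dict)
    page_count: Optional[int] = None
    # Assumption: char_count is derived, so it is not a constructor argument.
    char_count: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        # Derive the character count from the extracted text.
        self.char_count = len(self.text)


result = ParseResult(text="hello world", metadata={"filename": "a.txt"})
print(result.char_count)  # 11
```

Deriving char_count in __post_init__ keeps it in step with text at construction time, so callers never have to compute it themselves.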

BaseDocumentParser

Bases: ABC

Abstract base class for document parsers.

All document parsers must implement the parse method to extract text content from documents.

supported_content_types abstractmethod property

supported_content_types

Get the list of supported MIME types.

Returns:

| Type | Description |
| --- | --- |
| list[str] | List of supported MIME types. |

supported_extensions abstractmethod property

supported_extensions

Get the list of supported file extensions.

Returns:

| Type | Description |
| --- | --- |
| list[str] | List of supported file extensions (e.g., ['.txt', '.md']). |

parse abstractmethod

parse(file_data, filename)

Parse a document and extract text content.

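Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| file_data | bytes | Raw file content as bytes. | required |
| filename | str | Original filename (used for metadata). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| ParseResult | ParseResult | Parsed content and metadata. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the document cannot be parsed. |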

can_parse

can_parse(content_type)

Check if this parser can handle the given content type.

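Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| content_type | str | MIME type to check. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| bool | bool | True if this parser can handle the content type. |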

can_parse_extension

can_parse_extension(extension)

Check if this parser can handle the given file extension.

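Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| extension | str | File extension to check. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| bool | bool | True if this parser can handle the extension. |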
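The contract described above (two abstract properties, an abstract parse method, and two concrete checks) might look roughly like the following. This is a sketch under stated assumptions: the bodies of can_parse and can_parse_extension (plain membership tests, with case-insensitive extension matching) are guesses, ParseResult is reduced to one field, and CsvParser is a hypothetical subclass that is not part of the package.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ParseResult:
    # Reduced to a single field to keep the sketch short.
    text: str


class BaseDocumentParser(ABC):
    """Illustrative stand-in for the documented abstract base class."""

    @property
    @abstractmethod
    def supported_content_types(self) -> list[str]:
        """MIME types this parser accepts."""

    @property
    @abstractmethod
    def supported_extensions(self) -> list[str]:
        """File extensions this parser accepts, such as '.txt'."""

    @abstractmethod
    def parse(self, file_data: bytes, filename: str) -> ParseResult:
        """Extract text, raising ValueError if the document cannot be parsed."""

    def can_parse(self, content_type: str) -> bool:
        # Assumption: exact membership test against the declared MIME types.
        return content_type in self.supported_content_types

    def can_parse_extension(self, extension: str) -> bool:
        # Assumption: extensions are compared case-insensitively.
        return extension.lower() in self.supported_extensions


class CsvParser(BaseDocumentParser):
    """Hypothetical subclass; not one of the built-in parsers."""

    @property
    def supported_content_types(self) -> list[str]:
        return ["text/csv"]

    @property
    def supported_extensions(self) -> list[str]:
        return [".csv"]

    def parse(self, file_data: bytes, filename: str) -> ParseResult:
        try:
            return ParseResult(text=file_data.decode("utf-8"))
        except UnicodeDecodeError as exc:
            raise ValueError(f"Cannot parse {filename}: {exc}") from exc


parser = CsvParser()
print(parser.can_parse("text/csv"))         # True
print(parser.can_parse_extension(".CSV"))   # True
print(parser.can_parse("application/pdf"))  # False
```

Because the abstract members are enforced by ABC, instantiating the base class directly, or a subclass that omits one of them, raises TypeError.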

Registry Helpers

Get a parser instance for the given content type.

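Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| content_type | str | MIME type of the document. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| BaseDocumentParser | BaseDocumentParser | Parser instance for the content type. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If no parser is available for the content type. |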

Get a parser instance for the given file extension.

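Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| extension | str | File extension (e.g., '.pdf', '.txt'). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| BaseDocumentParser | BaseDocumentParser | Parser instance for the extension. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If no parser is available for the extension. |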

Get the content type for a file extension.

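Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| extension | str | File extension (e.g., '.pdf', '.txt'). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| str | str | MIME type for the extension. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If no content type is mapped for the extension. |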

Check if a content type is supported.

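Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| content_type | str | MIME type to check. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| bool | bool | True if the content type is supported. |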

Check if a file extension is supported.

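Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| extension | str | File extension to check. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| bool | bool | True if the extension is supported.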

Built-in Parsers

Bases: BaseDocumentParser

Parser for plain text and markdown documents.

This parser handles .txt and .md files by decoding them as UTF-8 text.

supported_content_types property

supported_content_types

Get the list of supported MIME types.

Returns:

| Type | Description |
| --- | --- |
| list[str] | List of supported MIME types. |

supported_extensions property

supported_extensions

Get the list of supported file extensions.

Returns:

| Type | Description |
| --- | --- |
| list[str] | List of supported file extensions. |

parse

parse(file_data, filename)

Parse a text document and extract content.

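Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| file_data | bytes | Raw file content as bytes. | required |
| filename | str | Original filename (used for metadata). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| ParseResult | ParseResult | Parsed content and metadata. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the document cannot be decoded. |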
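The decode-or-fail behavior described for text and Markdown files can be sketched as a standalone function. This is an illustration, not the package's implementation: parse_text is a hypothetical name, the result is simplified to a plain dict instead of a ParseResult, and the metadata key is an assumption.

```python
def parse_text(file_data: bytes, filename: str) -> dict:
    """Hypothetical helper: decode UTF-8 bytes and report basic facts."""
    try:
        # Strict decoding, so invalid byte sequences are reported, not replaced.
        text = file_data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"Cannot decode {filename} as UTF-8: {exc}") from exc
    return {
        "text": text,
        "metadata": {"filename": filename},  # assumed metadata key
        "char_count": len(text),
    }


parsed = parse_text("héllo".encode("utf-8"), "note.txt")
print(parsed["char_count"])  # 5
```

The character count is measured on the decoded string, so a multi-byte character such as "é" counts once even though it occupies two bytes in the input.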

Bases: BaseDocumentParser

Parser for PDF documents.

This parser uses pypdf to extract text content from PDF files.

supported_content_types property

supported_content_types

Get the list of supported MIME types.

Returns:

| Type | Description |
| --- | --- |
| list[str] | List of supported MIME types. |

supported_extensions property

supported_extensions

Get the list of supported file extensions.

Returns:

| Type | Description |
| --- | --- |
| list[str] | List of supported file extensions. |

parse

parse(file_data, filename)

Parse a PDF document and extract text content.

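Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| file_data | bytes | Raw file content as bytes. | required |
| filename | str | Original filename (used for metadata). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| ParseResult | ParseResult | Parsed content and metadata. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the PDF cannot be parsed. |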

Bases: BaseDocumentParser

Parser for Microsoft Word DOCX documents.

This parser uses python-docx to extract text content from DOCX files.

supported_content_types property

supported_content_types

Get the list of supported MIME types.

Returns:

| Type | Description |
| --- | --- |
| list[str] | List of supported MIME types. |

supported_extensions property

supported_extensions

Get the list of supported file extensions.

Returns:

| Type | Description |
| --- | --- |
| list[str] | List of supported file extensions. |

parse

parse(file_data, filename)

Parse a DOCX document and extract text content.

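Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| file_data | bytes | Raw file content as bytes. | required |
| filename | str | Original filename (used for metadata). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| ParseResult | ParseResult | Parsed content and metadata. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the DOCX cannot be parsed. |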