Data Types

The core data structures used throughout Axetract.

axetract.data_types

AXEChunk

Bases: BaseModel

A single chunk of HTML content.

Attributes:

Name     Type  Description
chunkid  str   Unique identifier for the chunk.
content  str   The raw or cleaned HTML content.

Source code in src/axetract/data_types.py
class AXEChunk(BaseModel):
    """A single chunk of HTML content.

    Attributes:
        chunkid (str): Unique identifier for the chunk.
        content (str): The raw or cleaned HTML content.
    """

    chunkid: str
    content: str
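
As a minimal sketch of how this model behaves, the snippet below redeclares AXEChunk locally so that it runs without Axetract installed; in real use you would presumably import it from axetract.data_types. The chunk identifier format and the HTML string are illustrative assumptions, not values the library is known to produce.

```python
from pydantic import BaseModel, ValidationError


# Local copy of the model shown above, so the example is self-contained.
class AXEChunk(BaseModel):
    chunkid: str
    content: str


# Both fields are required; the id format here is made up for illustration.
chunk = AXEChunk(chunkid="sample-1-chunk-0", content="<div><h1>Title</h1></div>")
print(chunk.chunkid, len(chunk.content))

# Omitting a required field raises a pydantic ValidationError.
try:
    AXEChunk(chunkid="sample-1-chunk-1")
    missing_field_rejected = False
except ValidationError:
    missing_field_rejected = True

print("missing content rejected:", missing_field_rejected)
```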

AXEResult

Bases: BaseModel

Final extraction result returned to the user.

Attributes:

Name        Type                   Description
id          str                    Sample identifier.
prediction  Union[str, dict, Any]  The extracted structured data.
xpaths      Optional[dict]         Reference XPaths for the extracted values.
status      Status                 Success or failure indicator.
error       Optional[str]          Error message if processing failed.

Source code in src/axetract/data_types.py
class AXEResult(BaseModel):
    """Final extraction result returned to the user.

    Attributes:
        id (str): Sample identifier.
        prediction (Union[str, dict, Any]): The extracted structured data.
        xpaths (Optional[dict]): Reference XPaths for the extracted values.
        status (Status): Success or failure indicator.
        error (Optional[str]): Error message if processing failed.
    """

    id: str
    prediction: Union[str, dict, Any]
    xpaths: Optional[dict] = None
    status: Status
    error: Optional[str] = None
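
The following sketch shows one way a caller might inspect a result: check status first, then read either prediction or error. The models are redeclared locally so the snippet is self-contained. The describe helper, the sample values, and the choice of an empty dict as the prediction of a failed result are all assumptions made for illustration, not documented library behavior.

```python
from enum import Enum
from typing import Any, Optional, Union

from pydantic import BaseModel


# Local copies of the definitions on this page, so the example runs on its own.
class Status(Enum):
    PENDING = "pending"
    SUCCESS = "success"
    FAILED = "failed"


class AXEResult(BaseModel):
    id: str
    prediction: Union[str, dict, Any]
    xpaths: Optional[dict] = None
    status: Status
    error: Optional[str] = None


def describe(result: AXEResult) -> str:
    """Hypothetical helper: one-line summary of a result."""
    if result.status == Status.SUCCESS:
        return f"{result.id}: ok -> {result.prediction}"
    return f"{result.id}: {result.status.value} ({result.error})"


ok = AXEResult(
    id="sample-1",
    prediction={"title": "Example"},
    xpaths={"title": "/html/body/h1"},
    status=Status.SUCCESS,
)
# The empty-dict prediction for a failure is an illustrative choice only.
failed = AXEResult(
    id="sample-2",
    prediction={},
    status=Status.FAILED,
    error="fetch timed out",
)

print(describe(ok))
print(describe(failed))
```

Because xpaths and error default to None, a successful result can omit error and a failed one can omit xpaths.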

AXESample

Bases: BaseModel

A data container for a single extraction request throughout the pipeline.

Attributes:

Name            Type                                          Description
id              str                                           Unique identifier for the sample.
content         str                                           Input content (URL or raw HTML).
is_content_url  bool                                          Whether the content is a URL.
query           Optional[str]                                 Natural language extraction query.
schema_model    Optional[Union[str, Type[BaseModel], dict]]   Desired JSON schema.
chunks          List[AXEChunk]                                List of processed HTML chunks.
original_html   str                                           The original, uncleaned HTML content.
current_html    str                                           The current state of HTML (e.g., after cleaning or pruning).
prediction      Optional[Union[str, dict, Any]]               The LLM's raw output or parsed JSON.
xpaths          Optional[dict]                                Map of extracted fields to their source XPaths.
status          Status                                        Current processing status.

Source code in src/axetract/data_types.py
class AXESample(BaseModel):
    """A data container for a single extraction request throughout the pipeline.

    Attributes:
        id (str): Unique identifier for the sample.
        content (str): Input content (URL or raw HTML).
        is_content_url (bool): Whether the content is a URL.
        query (Optional[str]): Natural language extraction query.
        schema_model (Optional[Union[str, Type[BaseModel], dict]]): Desired JSON schema.
        chunks (List[AXEChunk]): List of processed HTML chunks.
        original_html (str): The original, uncleaned HTML content.
        current_html (str): The current state of HTML (e.g., after cleaning or pruning).
        prediction (Optional[Union[str, dict, Any]]): The LLM's raw output or parsed JSON.
        xpaths (Optional[dict]): Map of extracted fields to their source XPaths.
        status (Status): Current processing status.
    """

    id: str
    content: str
    is_content_url: bool
    query: Optional[str] = None
    schema_model: Optional[Union[str, Type[BaseModel], dict]] = None
    chunks: List[AXEChunk] = []
    original_html: str = ""
    current_html: str = ""
    prediction: Optional[Union[str, dict, Any]] = None
    xpaths: Optional[dict] = None

    status: Status = Status.PENDING
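
To illustrate the defaults and how a sample might accumulate state, here is a sketch that redeclares the models locally and then fills in fields by hand. The manual assignments only stand in for pipeline stages; how Axetract actually populates these fields is not shown on this page, and the ids, query, and HTML are made up.

```python
from enum import Enum
from typing import Any, List, Optional, Type, Union

from pydantic import BaseModel


# Local copies of the definitions on this page, so the example runs on its own.
class Status(Enum):
    PENDING = "pending"
    SUCCESS = "success"
    FAILED = "failed"


class AXEChunk(BaseModel):
    chunkid: str
    content: str


class AXESample(BaseModel):
    id: str
    content: str
    is_content_url: bool
    query: Optional[str] = None
    schema_model: Optional[Union[str, Type[BaseModel], dict]] = None
    chunks: List[AXEChunk] = []
    original_html: str = ""
    current_html: str = ""
    prediction: Optional[Union[str, dict, Any]] = None
    xpaths: Optional[dict] = None
    status: Status = Status.PENDING


html = "<html><body><h1>Example</h1></body></html>"

# Only id, content, and is_content_url are required.
sample = AXESample(
    id="sample-1",
    content=html,
    is_content_url=False,
    query="Extract the page title.",
)
initial_status = sample.status      # defaults to Status.PENDING
initial_chunks = list(sample.chunks)  # defaults to an empty list

# Hand-written stand-ins for pipeline stages (assumed, not the real pipeline).
sample.original_html = sample.content
sample.current_html = "<h1>Example</h1>"
sample.chunks = [AXEChunk(chunkid="sample-1-chunk-0", content=sample.current_html)]
sample.prediction = {"title": "Example"}
sample.status = Status.SUCCESS

print(initial_status, "->", sample.status)
print(len(sample.chunks), "chunk(s)")
```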

Status

Bases: Enum

Execution status for a processing sample.

Source code in src/axetract/data_types.py
class Status(Enum):
    """Execution status for a processing sample."""

    PENDING = "pending"
    SUCCESS = "success"
    FAILED = "failed"
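
Since Status is a standard Enum with string values, members can be looked up by value and converted back to strings, for example when reading or writing serialized results. The sketch below uses only the standard library and a local copy of the enum; the "done" value is an intentionally invalid example.

```python
from enum import Enum


# Local copy of the enum shown above.
class Status(Enum):
    PENDING = "pending"
    SUCCESS = "success"
    FAILED = "failed"


# Look up a member from its string value.
parsed = Status("success")

# Convert a member back to its string value.
as_text = Status.FAILED.value

# Iterating an Enum yields members in definition order.
all_values = [s.value for s in Status]

# An unknown value raises ValueError.
try:
    Status("done")
    unknown_rejected = False
except ValueError:
    unknown_rejected = True

print(parsed, as_text, all_values, unknown_rejected)
```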