Data Types
The core data structures used throughout Axetract.
axetract.data_types
AXEChunk
Bases: BaseModel
A single chunk of HTML content.
Attributes:
| Name | Type | Description |
|---|---|---|
| `chunkid` | `str` | Unique identifier for the chunk. |
| `content` | `str` | The raw or cleaned HTML content. |
Source code in src/axetract/data_types.py
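To illustrate the shape of a chunk, here is a minimal sketch that uses a standard-library dataclass as a stand-in for `AXEChunk` (the real class is a pydantic `BaseModel`). The names `ChunkSketch` and `make_chunks`, the fixed-size slicing, and the `chunk-N` id format are assumptions made for illustration, not part of Axetract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChunkSketch:
    """Stand-in mirroring the documented AXEChunk fields (not the real class)."""

    chunkid: str
    content: str


def make_chunks(html: str, size: int) -> List[ChunkSketch]:
    """Hypothetical helper: slice HTML into fixed-size pieces with sequential ids."""
    return [
        ChunkSketch(chunkid=f"chunk-{start // size}", content=html[start:start + size])
        for start in range(0, len(html), size)
    ]


html = "<ul><li>a</li><li>b</li></ul>"
chunks = make_chunks(html, size=10)
```

Each chunk carries only an identifier and its HTML text, so concatenating the `content` fields in order reproduces the input in this sketch.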
AXEResult
Bases: BaseModel
Final extraction result returned to the user.
Attributes:
| Name | Type | Description |
|---|---|---|
| `id` | `str` | Sample identifier. |
| `prediction` | `Union[str, dict, Any]` | The extracted structured data. |
| `xpaths` | `Optional[dict]` | Reference XPaths for the extracted values. |
| `status` | `Status` | Success or failure indicator. |
| `error` | `Optional[str]` | Error message if processing failed. |
Source code in src/axetract/data_types.py
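A caller would typically check `status` before reading `prediction`, and fall back to `error` otherwise. The sketch below shows that pattern with standard-library stand-ins. The members of `Status` are not documented on this page, so `StatusSketch.SUCCESS` and `StatusSketch.FAILED` are hypothetical, as are `ResultSketch` and `summarize`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class StatusSketch(Enum):
    """Hypothetical stand-in for Status; the real member names may differ."""

    SUCCESS = "success"
    FAILED = "failed"


@dataclass
class ResultSketch:
    """Stand-in mirroring the documented AXEResult fields (not the real class)."""

    id: str
    prediction: Any
    status: StatusSketch
    xpaths: Optional[dict] = None
    error: Optional[str] = None


def summarize(result: ResultSketch) -> str:
    """Return the prediction on success, otherwise the error message."""
    if result.status is StatusSketch.SUCCESS:
        return f"{result.id}: {result.prediction}"
    return f"{result.id}: failed ({result.error})"


ok = ResultSketch(id="s1", prediction={"title": "Hello"}, status=StatusSketch.SUCCESS)
bad = ResultSketch(id="s2", prediction=None, status=StatusSketch.FAILED, error="timeout")
```

Here `summarize(ok)` formats the extracted dictionary, while `summarize(bad)` reports the error string instead.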
AXESample
Bases: BaseModel
A data container for a single extraction request throughout the pipeline.
Attributes:
| Name | Type | Description |
|---|---|---|
| `id` | `str` | Unique identifier for the sample. |
| `content` | `str` | Input content (URL or raw HTML). |
| `is_content_url` | `bool` | Whether the content is a URL. |
| `query` | `Optional[str]` | Natural language extraction query. |
| `schema_model` | `Optional[Union[str, Type[BaseModel], dict]]` | Desired JSON schema. |
| `chunks` | `List[AXEChunk]` | List of processed HTML chunks. |
| `original_html` | `str` | The original, uncleaned HTML content. |
| `current_html` | `str` | The current state of HTML (e.g., after cleaning or pruning). |
| `prediction` | `Optional[Union[str, dict, Any]]` | The LLM's raw output or parsed JSON. |
| `xpaths` | `Optional[dict]` | Map of extracted fields to their source XPaths. |
| `status` | `Status` | Current processing status. |
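The fields above suggest a sample that is filled in progressively as it moves through the pipeline. The following sketch traces that idea with a standard-library dataclass standing in for `AXESample`. The default values, the tag-stripping "cleaning" step, and the example XPath are assumptions for illustration only; `schema_model`, `chunks`, and `status` are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SampleSketch:
    """Stand-in mirroring a subset of the documented AXESample fields."""

    id: str
    content: str
    is_content_url: bool
    query: Optional[str] = None
    original_html: str = ""  # defaults here are assumptions, not documented
    current_html: str = ""
    prediction: Optional[Any] = None
    xpaths: Optional[dict] = None


# 1. A request arrives with raw HTML rather than a URL.
sample = SampleSketch(
    id="s1",
    content="<p>Price: <b>$5</b></p>",
    is_content_url=False,
    query="What is the price?",
)

# 2. The untouched input is kept in original_html.
sample.original_html = sample.content

# 3. A hypothetical cleaning step writes its output to current_html only.
sample.current_html = sample.original_html.replace("<b>", "").replace("</b>", "")

# 4. Extraction output and the XPaths it came from are attached last.
sample.prediction = {"price": "$5"}
sample.xpaths = {"price": "/p/b"}
```

Keeping `original_html` separate from `current_html` means XPaths can still be resolved against the unmodified document after cleaning or pruning has changed the working copy.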