Extractor API
axetract.extractor.axe_extractor.AXEExtractor
Bases: BaseExtractor
Component for extracting structured data from HTML using LLMs.
Attributes:
| Name | Type | Description |
|---|---|---|
| llm_extractor_client | BaseClient | The LLM client used for extraction. |
| schema_prompt_template | str | Template for schema-based extraction prompts. |
| query_prompt_template | str | Template for natural language query prompts. |
| name | str | Component name. |
| batch_size | int | Processing batch size. |
| num_workers | int | Number of parallel workers. |
Source code in src/axetract/extractor/axe_extractor.py
__call__(samples)
Run the extraction process on a batch of samples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| samples | List[AXESample] | Input samples with clean HTML. | required |
Returns:
| Type | Description |
|---|---|
| List[AXESample] | Samples with LLM-generated predictions. |
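The page does not show how `batch_size` and `num_workers` are used internally, so the following is only a minimal sketch of one common pattern: split the samples into batches and process the batches on a thread pool. `Sample`, `chunk`, `run_batches`, and `fake_predict` are hypothetical stand-ins introduced for illustration; they are not part of axetract, and the real `AXESample` fields and client calls may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Sample:
    """Hypothetical stand-in for AXESample; the real fields are not shown here."""
    html: str
    prediction: Optional[str] = None


def chunk(items: List[Sample], batch_size: int) -> List[List[Sample]]:
    """Split items into consecutive batches of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def run_batches(
    samples: List[Sample],
    predict_batch: Callable[[List[str]], List[str]],
    batch_size: int = 16,
    num_workers: int = 4,
) -> List[Sample]:
    """Fill in predictions batch by batch, using up to num_workers threads."""
    batches = chunk(samples, batch_size)

    def process(batch: List[Sample]) -> List[Sample]:
        # One (assumed) LLM call per batch; outputs align with inputs by position.
        outputs = predict_batch([s.html for s in batch])
        for sample, output in zip(batch, outputs):
            sample.prediction = output
        return batch

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map yields results in submission order, so input order is kept.
        processed = list(pool.map(process, batches))
    return [sample for batch in processed for sample in batch]


def fake_predict(htmls: List[str]) -> List[str]:
    """Placeholder for an LLM call: reports the length of each input string."""
    return [f"len={len(h)}" for h in htmls]


samples = [Sample(html="<p>" + "x" * i + "</p>") for i in range(5)]
results = run_batches(samples, fake_predict, batch_size=2, num_workers=2)
```

Threads are a reasonable fit when the per-batch work is a network call to an LLM endpoint, since the workers spend most of their time waiting on I/O. Whether the library itself uses threads, processes, or async calls is not stated in this page.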
__init__(llm_extractor_client, schema_generation_prompt_template, query_generation_prompt_template, name='axe_extractor', batch_size=16, num_workers=4)
Initialize the extractor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| llm_extractor_client | BaseClient | LLM client. | required |
| schema_generation_prompt_template | str | Schema prompt template. | required |
| query_generation_prompt_template | str | Query prompt template. | required |
| name | str | Component name. | 'axe_extractor' |
| batch_size | int | Batch size. | 16 |
| num_workers | int | Parallel workers. | 4 |
axetract.extractor.base_extractor.BaseExtractor
Bases: ABC
Abstract base class for all extractors.
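The abstract interface of `BaseExtractor` is not listed on this page. As a rough illustration of the ABC pattern it is based on, the sketch below defines a hypothetical `SketchBaseExtractor` with an abstract `__call__` and one toy subclass. The class names and the choice of `__call__` as the abstract method are assumptions made for this example, not the library's actual definition.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class SketchBaseExtractor(ABC):
    """Hypothetical mirror of an abstract extractor base; not the library's class."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def __call__(self, samples: List[Any]) -> List[Any]:
        """Process a batch of samples and return the processed batch."""


class UppercaseExtractor(SketchBaseExtractor):
    """Toy subclass: 'extracts' by upper-casing the string form of each sample."""

    def __call__(self, samples: List[Any]) -> List[Any]:
        return [str(s).upper() for s in samples]


extractor = UppercaseExtractor(name="demo")
output = extractor(["<p>a</p>", "<p>b</p>"])
```

Because `__call__` is marked abstract, Python refuses to instantiate `SketchBaseExtractor` directly and raises `TypeError`; only subclasses that implement the method can be constructed.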