Preprocessor API
axetract.preprocessor.axe_preprocessor.AXEPreprocessor
Bases: BasePreprocessor
Component for fetching and chunking HTML content.
Attributes:
| Name | Type | Description |
|---|---|---|
| fetch_workers | int | Number of parallel threads for fetching URLs. |
| cpu_workers | int | Number of parallel processes/threads for cleaning and chunking. |
| extra_remove_tags | List[str] | Additional HTML tags to remove. |
| strip_attrs | bool | Whether to remove all tag attributes. |
| strip_links | bool | Whether to replace link tags with their text. |
| keep_tags | bool | Whether to preserve HTML tags in the output. |
| use_clean_rag | bool | Whether to use htmlrag for cleaning. |
| use_clean_chunker | bool | Whether the chunker should expect clean HTML. |
| chunk_size | int | Target token size for each chunk. |
| attr_cutoff_len | int | Length threshold for attribute retention. |
| disable_chunking | bool | Whether to skip the chunking step. |
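To make the idea of a target token size concrete, here is a minimal sketch of budget-based chunking. It is not the library's chunker: the function name `chunk_by_token_budget` is hypothetical, and whitespace-separated words stand in for real tokens, which a tokenizer would count differently.

```python
from typing import List


def chunk_by_token_budget(text: str, chunk_size: int) -> List[str]:
    """Greedily pack whitespace-delimited tokens into chunks of at most chunk_size tokens.

    Hypothetical helper for illustration; whitespace words approximate tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    # Step through the token list in strides of chunk_size and rejoin each slice.
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


chunks = chunk_by_token_budget("one two three four five", chunk_size=2)
print(chunks)  # ['one two', 'three four', 'five']
```

A larger `chunk_size` yields fewer, longer chunks; the last chunk may be shorter than the target.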
Source code in src/axetract/preprocessor/axe_preprocessor.py
__call__(samples)
Fetch, clean, and chunk a batch of samples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| samples | list[AXESample] | Input samples (URLs or raw HTML). | required |
Returns:
| Type | Description |
|---|---|
| List[AXESample] | Samples with chunks populated. |
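The fetch, clean, and chunk flow can be pictured as two pooled stages, one sized by `fetch_workers` and one by `cpu_workers`. The sketch below is an assumption about the general shape, not the library's implementation: `Sample`, `fake_fetch`, `clean_and_chunk`, and `preprocess` are hypothetical stand-ins, and no network access takes place.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    """Hypothetical stand-in for AXESample; the real class may differ."""
    source: str
    html: str = ""
    chunks: List[str] = field(default_factory=list)


def fake_fetch(sample: Sample) -> Sample:
    # Placeholder for I/O-bound work: a real fetcher would download sample.source.
    sample.html = f"<p>content of {sample.source}</p>"
    return sample


def clean_and_chunk(sample: Sample) -> Sample:
    # Placeholder for CPU-bound work: drop the <p> wrapper and emit a single chunk.
    text = sample.html.replace("<p>", "").replace("</p>", "")
    sample.chunks = [text]
    return sample


def preprocess(samples: List[Sample], fetch_workers: int = 4, cpu_workers: int = 2) -> List[Sample]:
    # Stage 1: fetching is I/O-bound, so a thread pool is a natural fit.
    with ThreadPoolExecutor(max_workers=fetch_workers) as pool:
        fetched = list(pool.map(fake_fetch, samples))
    # Stage 2: cleaning and chunking; threads are used here for simplicity,
    # a process pool could be swapped in for heavy CPU-bound work.
    with ThreadPoolExecutor(max_workers=cpu_workers) as pool:
        return list(pool.map(clean_and_chunk, fetched))


result = preprocess([Sample("https://example.com/a"), Sample("https://example.com/b")])
print([s.chunks for s in result])
```

`pool.map` returns results in input order, so the output list lines up with the input samples.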
__init__(name='AXEPreprocessor', fetch_workers=mp.cpu_count(), cpu_workers=mp.cpu_count(), extra_remove_tags=None, strip_attrs=True, strip_links=True, keep_tags=True, use_clean_rag=True, use_clean_chunker=True, chunk_size=2000, attr_cutoff_len=5, disable_chunking=False)
Initialize the preprocessor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Component name. | 'AXEPreprocessor' |
| fetch_workers | int | Fetching thread count. | mp.cpu_count() |
| cpu_workers | int | Cleaning process count. | mp.cpu_count() |
| extra_remove_tags | List[str] | Tags to strip. | None |
| strip_attrs | bool | Strip attributes flag. | True |
| strip_links | bool | Strip links flag. | True |
| keep_tags | bool | Keep HTML tags flag. | True |
| use_clean_rag | bool | Use htmlrag flag. | True |
| use_clean_chunker | bool | Clean chunker flag. | True |
| chunk_size | int | Chunk token limit. | 2000 |
| attr_cutoff_len | int | Attribute length limit. | 5 |
| disable_chunking | bool | Disable chunking flag. | False |
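As a rough usage sketch, the keyword arguments documented above can be collected in a dictionary and passed to the constructor. The import path follows the module path shown on this page; the specific values, including the tag names `nav` and `footer`, are illustrative assumptions rather than recommended settings. The import is guarded so the snippet still runs where the package is not installed.

```python
import multiprocessing as mp

# Keyword arguments taken from the documented __init__ signature.
# The values are illustrative, not recommendations.
config = {
    "fetch_workers": min(8, mp.cpu_count()),  # cap the number of fetching threads
    "cpu_workers": mp.cpu_count(),
    "extra_remove_tags": ["nav", "footer"],   # hypothetical extra tags to drop
    "chunk_size": 1000,                       # smaller than the default of 2000
    "disable_chunking": False,
}

try:
    from axetract.preprocessor.axe_preprocessor import AXEPreprocessor
except ImportError:
    AXEPreprocessor = None  # package not installed; the config is still inspectable

# Construct the preprocessor only when the package is available.
preprocessor = AXEPreprocessor(**config) if AXEPreprocessor is not None else None
```

Any argument left out of `config` falls back to the default listed in the table.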
axetract.preprocessor.base_preprocessor.BasePreprocessor
Bases: ABC
Abstract base class for all preprocessors.
Source code in src/axetract/preprocessor/base_preprocessor.py
__call__(samples) (abstract method)
__init__(name)
Initialize the preprocessor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Component name. | required |
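The interface above (an `__init__(name)` plus an abstract `__call__(samples)`) suggests how a custom preprocessor might be written. The sketch below defines a local stand-in for `BasePreprocessor` so that it runs without the package; storing `name` on the instance and using plain strings as samples are assumptions, and `LowercasePreprocessor` is a hypothetical subclass.

```python
from abc import ABC, abstractmethod
from typing import List


class BasePreprocessor(ABC):
    """Local stand-in mirroring the documented interface, not the library class."""

    def __init__(self, name: str):
        # Assumption: the component name is kept as an instance attribute.
        self.name = name

    @abstractmethod
    def __call__(self, samples: List[str]) -> List[str]:
        """Process a batch of samples; subclasses must implement this."""


class LowercasePreprocessor(BasePreprocessor):
    """Hypothetical subclass that lowercases each sample."""

    def __init__(self, name: str = "LowercasePreprocessor"):
        super().__init__(name)

    def __call__(self, samples: List[str]) -> List[str]:
        return [sample.lower() for sample in samples]


pre = LowercasePreprocessor()
print(pre.name, pre(["<P>Hello</P>"]))
```

Because `__call__` is abstract, instantiating `BasePreprocessor` directly raises `TypeError`; only subclasses that implement `__call__` can be created.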