AXEtract

abdo-Mansour/axetract

Batch Processing

AXEtract is optimized for processing multiple documents efficiently. Use `extract()` to run a single query across multiple URLs, or `extract_batch()` for heterogeneous tasks where each item has its own query or schema.

Using `extract()` with Multiple URLs

Use this to extract the same kind of information from several similar pages. For large batches, the pipeline automatically switches to pipelined (multi-threaded) execution.

from axetract import AXEPipeline

pipeline = AXEPipeline.from_config()

urls = [
    "https://example.com/item1",
    "https://example.com/item2",
    "https://example.com/item3",
]

results = pipeline.extract(
    input_data=urls,
    query="Extract the title and price"
)

for url, res in zip(urls, results):
    print(f"Results for {url}: {res.prediction}")
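
For very long URL lists, one option is to submit the work in fixed-size chunks, for example to report progress between calls. The following is only a sketch: `chunked`, `extract_in_chunks`, and `chunk_size` are hypothetical names that are not part of AXEtract, and the code assumes that `extract` accepts `input_data` and `query` as shown above and returns one result per URL in input order.

```python
from typing import Any, Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_in_chunks(
    extract_fn: Callable[..., List[Any]],
    urls: Sequence[str],
    query: str,
    chunk_size: int = 50,
) -> List[Any]:
    """Call `extract_fn` once per chunk and concatenate the results.

    Assumption: `extract_fn` behaves like `pipeline.extract` above, taking
    `input_data` and `query` keyword arguments and returning one result
    per URL, in the same order as the input.
    """
    results: List[Any] = []
    for chunk in chunked(urls, chunk_size):
        results.extend(extract_fn(input_data=list(chunk), query=query))
    return results
```

With the pipeline from the example above, this could be called as `extract_in_chunks(pipeline.extract, urls, "Extract the title and price", chunk_size=2)`. Whether chunking helps in practice depends on how the pipeline schedules large batches internally, so treat the chunk size as something to tune.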

Using `extract_batch()`

Use this when each item requires a different query or schema. Each item is an `AXESample` that carries its own `query` or `schema_model` and must set all required fields (`id`, `content`, `is_content_url`).

import uuid
from axetract import AXEPipeline
from axetract.data_types import AXESample, Status
from pydantic import BaseModel

# Schema describing the structured output expected for the product page
class Product(BaseModel):
    name: str
    price: float

pipeline = AXEPipeline.from_config()

batch = [
    AXESample(
        id=str(uuid.uuid4()),
        content="https://site1.com/article",
        is_content_url=True,
        query="Get the article abstract",
    ),
    AXESample(
        id=str(uuid.uuid4()),
        content="https://site2.com/product",
        is_content_url=True,
        schema_model=Product,
    ),
]

results = pipeline.extract_batch(batch)

# Check each result's status before reading its prediction
for res in results:
    if res.status == Status.SUCCESS:
        print(res.prediction)
    else:
        print(f"Failed: {res.error}")
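
When a batch mixes successes and failures, it can be convenient to separate the two before further processing, for example to retry only the failed samples. A minimal sketch follows. `split_by_status` is a hypothetical helper, not an AXEtract function, and the `SimpleNamespace` objects and string statuses are stand-ins for real results and `Status.SUCCESS`. It relies only on what the loop above already uses: each result exposes a `status` attribute.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Tuple


def split_by_status(
    results: Iterable[Any], success: Any
) -> Tuple[List[Any], List[Any]]:
    """Partition results into (succeeded, failed) by comparing `status` to `success`."""
    succeeded: List[Any] = []
    failed: List[Any] = []
    for res in results:
        (succeeded if res.status == success else failed).append(res)
    return succeeded, failed


# Stand-in results for illustration only. Real code would pass the list
# returned by pipeline.extract_batch(batch) and success=Status.SUCCESS.
demo = [
    SimpleNamespace(status="success", prediction={"name": "Widget"}, error=None),
    SimpleNamespace(status="failed", prediction=None, error="timeout"),
]
ok, bad = split_by_status(demo, success="success")
print(f"{len(ok)} succeeded, {len(bad)} failed")
```

Keeping the success value as a parameter means the helper does not need to import anything from `axetract`, which keeps it easy to test in isolation.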