Skip to content

Batch Processing

AXEtract is optimized for processing multiple documents efficiently. Use extract() for a single query across multiple URLs, or extract_batch() for heterogeneous tasks with different queries or schemas.

Using extract() with Multiple URLs

Use this when you want to extract the same type of information from several similar pages. The pipeline automatically uses pipelined (multi-threaded) execution for large batches.

from axetract import AXEPipeline

pipeline = AXEPipeline.from_config()

urls = [
    "https://example.com/item1",
    "https://example.com/item2",
    "https://example.com/item3",
]

results = pipeline.extract(
    input_data=urls,
    query="Extract the title and price"
)

for url, res in zip(urls, results):
    print(f"Results for {url}: {res.prediction}")

Using extract_batch()

Use this when each item requires a different query or schema. You must provide AXESample objects with all required fields (id, content, is_content_url).

import uuid
from axetract import AXEPipeline
from axetract.data_types import AXESample, Status
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float

pipeline = AXEPipeline.from_config()

batch = [
    AXESample(
        id=str(uuid.uuid4()),
        content="https://site1.com/article",
        is_content_url=True,
        query="Get the article abstract",
    ),
    AXESample(
        id=str(uuid.uuid4()),
        content="https://site2.com/product",
        is_content_url=True,
        schema_model=Product,
    ),
]

results = pipeline.extract_batch(batch)

for res in results:
    if res.status == Status.SUCCESS:
        print(res.prediction)
    else:
        print(f"Failed: {res.error}")