Architecture

Axetract follows a modular pipeline architecture designed for high-performance extraction.

```mermaid
graph TD
    A[Raw URL/HTML] --> B[Preprocessor]
    B -->|Cleaned Chunks| C[Pruner]
    C -->|Relevant Nodes| D[Extractor]
    D -->|JSON + XPaths| E[Postprocessor]
    E --> F[Final AXEResult]

    subgraph "Small LLM (0.6B)"
        C
        D
    end
```

The Three Pillars

1. Preprocessor

The preprocessor fetches the raw HTML (if a URL is provided) and performs initial cleaning. It uses html-chunking to break down the document into manageable pieces while preserving semantic structure.
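A minimal sketch of this step, using a hand-rolled chunker as a dependency-free stand-in for html-chunking (the function name and chunking heuristic are illustrative, not the library's actual API):

```python
import re

def preprocess(html: str, max_chars: int = 500) -> list[str]:
    """Simplified stand-in for the preprocessing step: strip
    non-content blocks, collapse whitespace, then split the
    document at tag boundaries into bounded-size chunks."""
    # Drop script/style blocks, which never carry extractable data.
    html = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html, flags=re.S | re.I)
    # Collapse runs of whitespace left behind by the removal.
    html = re.sub(r"\s+", " ", html).strip()
    # Split after each ">" so chunks end on an element boundary,
    # which is what "preserving semantic structure" buys you.
    parts = re.split(r"(?<=>)", html)
    chunks, buf = [], ""
    for part in parts:
        if buf and len(buf) + len(part) > max_chars:
            chunks.append(buf)
            buf = ""
        buf += part
    if buf:
        chunks.append(buf)
    return chunks
```

Splitting on element boundaries (rather than raw character offsets) keeps each chunk a self-contained fragment the downstream stages can reason about.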

2. AI Extractor (LoRA Powered)

The AI Extractor runs in two sequential stages:

  • Pruner: Most web pages contain vast amounts of boilerplate (headers, footers, ads). The Pruner uses a specific LoRA adapter to identify and keep only the parts of the DOM relevant to the user's query. This drastically reduces the number of tokens passed to the Extractor.
  • Extractor: The Extractor is the brain of the pipeline. It takes the pruned HTML and the desired schema/query to produce a structured JSON output or answer. What makes Axetract unique is Grounded XPath Resolution (GXR): the extractor doesn't just return text; it returns the exact XPaths of the elements in the original document.
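The two-stage flow above can be sketched as follows. Here `prune_fn` and `extract_fn` stand in for the two LoRA-adapted model calls, and all names are illustrative rather than Axetract's actual API; the point is the shape of the data, in particular that every extracted field carries the XPath that grounds it in the original document:

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str    # schema field the value satisfies
    value: str   # extracted text
    xpath: str   # GXR: where the value lives in the original DOM

def run_pipeline(chunks, query, prune_fn, extract_fn):
    """Stage 1: keep only chunks the Pruner flags as relevant.
    Stage 2: run the Extractor over the much smaller pruned input."""
    kept = [c for c in chunks if prune_fn(c, query)]
    pruned = "".join(kept)
    return [Field(**f) for f in extract_fn(pruned, query)]
```

For example, with stub stages:

```python
chunks = ["<header>Nav</header>", "<span class='price'>9.99</span>", "<footer>Legal</footer>"]
prune = lambda chunk, query: query in chunk
extract = lambda html, query: [{"name": "price", "value": "9.99", "xpath": "/html/body/span[1]"}]
fields = run_pipeline(chunks, "price", prune, extract)
```

Because the Pruner discards the header and footer before the Extractor ever sees them, the expensive extraction call runs on a fraction of the original tokens.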

3. Postprocessor

The postprocessor performs final cleanup, schema validation (using json-repair if needed), and formatting of the results.
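A sketch of this step, assuming the model's raw output may be truncated or malformed JSON. Axetract uses the json-repair library for the repair; to keep the example dependency-free, the fallback below only fixes unclosed objects and trailing commas, and the function signature is hypothetical:

```python
import json
import re

def postprocess(raw: str, required: set) -> dict:
    """Parse model output, repairing it if it is not valid JSON,
    then check that every required schema field is present."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Minimal repair: close unbalanced objects, then drop
        # trailing commas before a closing brace/bracket.
        fixed = raw + "}" * (raw.count("{") - raw.count("}"))
        fixed = re.sub(r",\s*([}\]])", r"\1", fixed)
        data = json.loads(fixed)
    missing = required - data.keys()
    if missing:
        raise ValueError(f"schema validation failed, missing: {missing}")
    return data
```

Small models frequently truncate a closing brace or emit a trailing comma, so a repair pass recovers otherwise-usable extractions instead of failing the whole request.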

Multi-LoRA Strategy

Axetract uses a "base model + adapters" strategy. Both the Pruner and the Extractor share the same base model (e.g., Qwen3-0.6B). When processing:

  1. Load the base model once.
  2. Attach the Pruner adapter and run the pruning pass.
  3. Switch to the Extractor adapter for the final inference.

This allows high intelligence without the VRAM cost of multiple models.
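The memory argument can be made concrete with a toy model of the strategy (class and method names are illustrative, not Axetract's actual API): one set of base weights stays resident, and only small per-stage adapter deltas are loaded and swapped.

```python
class LoRAServer:
    """Toy model of the base-model + adapters strategy."""

    def __init__(self, base_params):
        self.base_params = base_params   # full weights, loaded once
        self.adapters = {}               # adapter name -> adapter params
        self.active = None

    def load_adapter(self, name, params):
        self.adapters[name] = params     # LoRA deltas are tiny vs. the base

    def set_adapter(self, name):
        self.active = name               # cheap switch, no base reload

    def resident_params(self):
        # Memory cost: one base model plus all adapters,
        # instead of one full model per pipeline stage.
        return self.base_params + sum(self.adapters.values())

server = LoRAServer(base_params=600_000_000)   # e.g. a 0.6B base
server.load_adapter("pruner", 10_000_000)      # adapter sizes are made up
server.load_adapter("extractor", 10_000_000)
server.set_adapter("pruner")                   # stage 1: prune
server.set_adapter("extractor")                # stage 2: extract
```

In a real deployment the same shape is available off the shelf: Hugging Face PEFT's `load_adapter`/`set_adapter` methods implement exactly this one-base, many-adapters pattern.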