vLLM Serving
For production environments requiring high throughput, Axetract supports vLLM for serving the base model and dynamically routing LoRA requests.
Setup
Ensure you have a GPU with sufficient VRAM and that vllm is installed in your environment.
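If vLLM is not yet installed, it can typically be added from PyPI (a CUDA-capable GPU and a compatible PyTorch build are assumed):

```shell
# Install vLLM from PyPI (requires a CUDA-compatible GPU)
pip install vllm

# Verify the installation by printing the installed version
python -c "import vllm; print(vllm.__version__)"
```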
Usage
Pass `use_vllm=True` to `from_config`:
```python
from axetract import AXEPipeline

# This will initialize the LocalVLLMClient
# and load the base model into the vLLM engine.
pipeline = AXEPipeline.from_config(use_vllm=True)

# Inference remains the same, but benefits from vLLM's
# continuous batching and efficient LoRA swapping.
result = pipeline.extract(
    "https://example.com",
    query="Extract the main headline",
)
```
Configuration
You can customize the vLLM engine by passing a config dictionary:
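A sketch of what such a config might look like, assuming the keyword argument is named `vllm_config` (the exact parameter name may differ); the keys shown are standard vLLM engine arguments:

```python
from axetract import AXEPipeline

# Hypothetical example: the vllm_config parameter name is an assumption,
# but the dictionary keys are real vLLM engine arguments.
pipeline = AXEPipeline.from_config(
    use_vllm=True,
    vllm_config={
        "gpu_memory_utilization": 0.90,  # fraction of GPU VRAM vLLM may claim
        "max_model_len": 4096,           # maximum context length per request
        "enable_lora": True,             # required for dynamic LoRA adapter routing
    },
)
```

Tuning `gpu_memory_utilization` down leaves headroom for other processes on the same GPU, at the cost of fewer cached KV blocks and therefore lower batch concurrency.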