Intelligent Document Processing
An AI pipeline covering ingestion, classification, dual-engine OCR, entity extraction, 3-way validation, and an active-learning human-in-the-loop (HITL) review loop, built for AP invoices, contracts, and compliance documents.
From Document Chaos to Structured Data
Six stages that turned 15,000 documents per day from a manual bottleneck into an automated competitive advantage.
The Document Problem: 12 Types, 3 Channels
AP, Legal, and Compliance receive 15,000+ documents per day via email, SFTP, and web portals. PDFs, TIFFs, scanned handwritten forms, mixed-language invoices — zero consistency across 200+ vendors. Eight people spent 6 hours daily just on data entry before anything was validated.
Classification First: LayoutLM Reads Layout Signals
Before extracting a single field, LayoutLM classifies every document. Unlike text-only classifiers, LayoutLM incorporates bounding-box coordinates — so it recognizes that a large number in the top-right corner is a total amount, not a page number. 98.5% accuracy on 12 doc types in under 300ms.
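LayoutLM-family models take each token's bounding box scaled to a 0 to 1000 integer grid relative to the page size. The following is a minimal sketch of that normalization step only; the function name and the sample coordinates are hypothetical and not taken from the pipeline.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def normalize_box(box: Box, page_width: float, page_height: float) -> List[int]:
    """Scale a pixel-space box to the 0-1000 integer grid used by LayoutLM inputs."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp in case OCR boxes spill slightly past the page edge.
    return [min(1000, max(0, v)) for v in scaled]


# Hypothetical token: a total amount printed near the top-right of a 1700x2200 px scan.
total_box = normalize_box((1360, 110, 1615, 165), page_width=1700, page_height=2200)
print(total_box)  # [800, 50, 950, 75]
```

Because position is part of the input, the same digits in a footer would reach the model with a very different box, which is the signal a text-only classifier never sees.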
Multi-Engine OCR: Azure Primary, Tesseract Fallback
Azure Document Intelligence achieves 99.2% accuracy on clean prints but drops to 87% on handwritten annotations and faded receipts. Tesseract runs with adaptive deskew + binarization preprocessing on all low-confidence pages, recovering 40% of Azure failures. Combined accuracy: 97.2%.
Structured Extraction: 18–28 Fields Per Invoice
spaCy NER identifies vendor names, amounts, dates, PO numbers, and addresses. Azure’s key-value extraction captures structured fields with confidence scores. Table parser reconstructs line items preserving row/column relationships. GPT-4o handles novel layouts that no template matches.
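The text does not say how outputs from the different extractors are reconciled. One plausible approach, sketched here purely as an assumption, is to keep the highest-confidence candidate per field. The `Candidate` type, the source labels, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Candidate:
    field: str         # e.g. "vendor_name", "total_amount"
    value: str
    confidence: float  # 0.0-1.0 as reported by the extractor
    source: str        # hypothetical labels: "ner", "key_value", "llm"


def merge_candidates(candidates: Iterable[Candidate]) -> Dict[str, Candidate]:
    """Keep the highest-confidence candidate for each field name."""
    best: Dict[str, Candidate] = {}
    for cand in candidates:
        current = best.get(cand.field)
        if current is None or cand.confidence > current.confidence:
            best[cand.field] = cand
    return best


merged = merge_candidates([
    Candidate("vendor_name", "Acme Corp", 0.91, "ner"),
    Candidate("vendor_name", "ACME Corporation", 0.97, "key_value"),
    Candidate("po_number", "PO-1001", 0.88, "ner"),
])
for name, cand in merged.items():
    print(name, cand.value, cand.source)
```

Keeping the `source` on each winning candidate preserves a record of which engine produced every exported field.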
3-Way Match + pHash: 8 Validation Rules in 1.5s
Every invoice is cross-referenced against Purchase Order and Goods Receipt before payment. Perceptual hashing (pHash) catches visually-identical re-submitted invoices even when filenames and numbers change — a vector that exact-match dedup misses entirely. Catches 94% of AP fraud pre-payment.
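The comparison behind perceptual-hash deduplication is a Hamming distance: count the bits that differ between two hashes and treat small counts as the same image. The sketch below shows only that comparison on hex-encoded 64-bit hashes, using the threshold of 6 that appears in the matcher code later on this page. The hash values themselves are made up for illustration.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count differing bits between two equal-length hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def is_visual_duplicate(hash_a: str, hash_b: str, threshold: int = 6) -> bool:
    """Treat two pages as duplicates when at most `threshold` bits differ."""
    return hamming_distance(hash_a, hash_b) <= threshold


# Hypothetical 64-bit hashes: the resubmission differs from the original in 2 bits.
original = "c3d4e5f6a7b80912"
resubmitted = "c3d4e5f6a7b80911"
unrelated = "3c2b1a09f8e7d6c5"

print(hamming_distance(original, resubmitted))     # 2
print(is_visual_duplicate(original, resubmitted))  # True
print(is_visual_duplicate(original, unrelated))    # False
```

An exact-match check on invoice number would treat the first two as different documents; the bit distance shows they are nearly the same image.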
Active Learning: HITL Corrections Train the Model
Invoices below 95% confidence route to human review. Every correction becomes a labeled training example, so no separate annotation pass is needed. Weekly fine-tuning cycles use a confidence-weighted loss (low-confidence errors count more). The straight-through processing (STP) rate climbed from 67% to 85% over six months with no additional engineering effort.
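A rough sketch of the routing rule and of how a reviewer's edit can become a training record. Two things here are assumptions: routing on the weakest field (the text gives only the 95% threshold, not how it is aggregated) and the invoice dictionary shape. The record keys mirror those read by the fine-tuning code later on this page.

```python
from typing import Dict, Optional

REVIEW_THRESHOLD = 0.95  # from the text: below 95% confidence goes to human review


def route(invoice: Dict) -> str:
    """Route on the weakest field (an assumption: one doubtful field triggers review)."""
    lowest = min(invoice["field_confidences"].values())
    return "straight_through" if lowest >= REVIEW_THRESHOLD else "human_review"


def to_training_example(invoice: Dict, field_name: str, corrected: str) -> Optional[Dict]:
    """Turn a reviewer edit into a labeled example; return None if nothing changed."""
    predicted = invoice["fields"][field_name]
    if predicted == corrected:
        return None
    return {
        "document_id": invoice["document_id"],
        "field_name": field_name,
        "model_output": predicted,
        "human_correction": corrected,
        "original_confidence": invoice["field_confidences"][field_name],
    }


# Hypothetical invoice where OCR confused the digit 0 with the letter O.
invoice = {
    "document_id": "doc-001",
    "fields": {"total_amount": "1250.00", "po_number": "PO-1O01"},
    "field_confidences": {"total_amount": 0.99, "po_number": 0.81},
}
print(route(invoice))
example = to_training_example(invoice, "po_number", "PO-1001")
print(example)
```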
Document Processing Demo
Select your lens: see what it means for your role, or step through the engineering pipeline.
From 8 Hours of Typing to 90 Minutes of Reviewing
You used to manually type data from invoices into the ERP for 8 hours a day. Now the AI handles 85% automatically — you only review the 15% where confidence is below 95%. That’s roughly 2,250 invoices per day handled without you. You review the 405 uncertain ones.
Search Contracts Instead of Reading 200 PDFs
Every contract and compliance document gets structured extraction — parties, effective dates, obligations, renewal terms, liability caps. You type “contracts expiring this quarter with auto-renew clauses” and get 12 matches in 0.3 seconds, instead of reading 200 PDFs over two days.
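Once contracts are stored as structured records, a query such as "expiring this quarter with auto-renew" reduces to a date-range filter plus a boolean. The sketch below shows only that filter. How the natural-language query is translated into it is not described here, and the record fields and sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class ContractRecord:
    contract_id: str
    counterparty: str
    expiry_date: date
    auto_renew: bool


def quarter_bounds(today: date) -> Tuple[date, date]:
    """Return the first and last day of the calendar quarter containing `today`."""
    first_month = 3 * ((today.month - 1) // 3) + 1
    start = date(today.year, first_month, 1)
    # The day before the next quarter starts is this quarter's last day.
    if first_month == 10:
        next_start = date(today.year + 1, 1, 1)
    else:
        next_start = date(today.year, first_month + 3, 1)
    return start, date.fromordinal(next_start.toordinal() - 1)


def expiring_with_auto_renew(records: List[ContractRecord], today: date) -> List[ContractRecord]:
    """Contracts that auto-renew and expire within the current calendar quarter."""
    start, end = quarter_bounds(today)
    return [r for r in records if r.auto_renew and start <= r.expiry_date <= end]


records = [
    ContractRecord("C-101", "Acme Corp", date(2024, 6, 30), True),
    ContractRecord("C-102", "Globex", date(2024, 5, 1), False),
    ContractRecord("C-103", "Initech", date(2024, 9, 15), True),
]
matches = expiring_with_auto_renew(records, today=date(2024, 5, 15))
print([r.contract_id for r in matches])  # ['C-101']
```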
Serverless, Zero-VM, Scale to 100K Docs/Day
The pipeline is stateless Azure Functions with Redis queues. No VMs to provision, patch, or scale. Going from 2,500 to 25,000 documents per day means adjusting one queue concurrency parameter. Total infrastructure cost: $0.0031 per document processed end-to-end.
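To put the per-document figure in daily terms, the arithmetic below multiplies it by a few volumes mentioned on this page. It assumes cost scales linearly with volume, which the text does not state. Real serverless pricing may include tiers or fixed components.

```python
COST_PER_DOCUMENT_USD = 0.0031  # end-to-end figure quoted in the text


def daily_cost(documents_per_day: int) -> float:
    """Projected daily spend, assuming cost scales linearly with volume."""
    return round(documents_per_day * COST_PER_DOCUMENT_USD, 2)


for volume in (2_500, 25_000, 100_000):
    print(f"{volume:>7,} docs/day -> ${daily_cost(volume):,.2f}")
```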
Blob Trigger → 7 Functions → Full Lineage in Cosmos
Blob storage trigger → Logic App router → Queue → 7 Azure Functions in sequence (classify, OCR, extract, validate, route, export, audit). Every field has a confidence score, bounding box, and source page logged. New doc types: retrain the LayoutLM classifier, zero pipeline changes.
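The sequential shape of the pipeline can be sketched as an ordered list of stage handlers, with a lineage entry recorded after each one. The stage names follow the text. Every handler below is a hypothetical stand-in, and the in-memory lineage list only illustrates the idea of per-stage logging, not the actual Cosmos schema.

```python
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Dict], Dict]]


def add_field(key: str, value) -> Callable[[Dict], Dict]:
    """Build a placeholder stage that returns a copy of the document with one field set."""
    return lambda doc: {**doc, key: value}


# Stage order follows the text; every handler here is a hypothetical stand-in.
STAGES: List[Stage] = [
    ("classify", add_field("doc_type", "invoice")),
    ("ocr", add_field("text", "Invoice 1001 Total 250.00")),
    ("extract", add_field("total_amount", 250.00)),
    ("validate", add_field("valid", True)),
    ("route", add_field("queue", "straight_through")),
    ("export", add_field("exported", True)),
    ("audit", add_field("audited", True)),
]


def run_pipeline(document: Dict, stages: List[Stage]) -> Dict:
    """Run stages in order and record which fields each stage added."""
    lineage = []
    for name, handler in stages:
        before = set(document)
        document = handler(document)
        lineage.append({"stage": name, "added": sorted(set(document) - before)})
    return {"document": document, "lineage": lineage}


result = run_pipeline({"blob": "invoices/1001.pdf"}, STAGES)
for entry in result["lineage"]:
    print(entry["stage"], entry["added"])
```

Because the stage list is data rather than control flow, supporting a new document type touches the classifier behind the `classify` handler and leaves the runner unchanged.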
| Invoice | Vendor | Amount | Fields | Accuracy | Status | Time |
|---|---|---|---|---|---|---|
IDP Deep Dives
Six lessons that build from the core problem to the most sophisticated techniques in the pipeline.
Key Results That Matter
Four measurable improvements that justify every engineering decision in the pipeline.
Active Learning Drives Continuous Improvement
Straight-through processing rate rose from 67% to 85% over six months — purely from HITL corrections feeding back into weekly fine-tuning. Zero manual annotation effort required.
Combined OCR Accuracy Across All Document Types
Dual-engine architecture (Azure primary + Tesseract fallback) achieves 97.2% word accuracy even on degraded scans, handwritten annotations, and thermally-faded receipts.
Duplicate Payments Prevented Per Year
Perceptual hash deduplication catches the 40% of duplicates that invoice-number matching misses, running at 12ms per document with a 0.3% false positive rate.
End-to-End Processing Time vs 8+ Hours Manual
All 7 pipeline stages — intake, classify, OCR, extract, validate, route, export — complete in 4.8 seconds on average. Manual intake alone took 8+ hours per day for a team of 8.
Production Implementation
Core pipeline components — dual-engine OCR, active learning, and 3-way fraud matching.
Azure Doc Intelligence + Tesseract Dual-Engine Pipeline
import cv2
import numpy as np
import pytesseract
from azure.ai.formrecognizer.aio import DocumentAnalysisClient
from azure.identity.aio import DefaultAzureCredential
from PIL import Image


class DualEngineOCR:
    """Multi-engine OCR: Azure primary, Tesseract fallback for low-confidence pages."""

    def __init__(self, endpoint: str, confidence_threshold: float = 0.85):
        # Async client and credential, because extract() awaits the analysis call.
        credential = DefaultAzureCredential()
        self.client = DocumentAnalysisClient(endpoint, credential)
        self.threshold = confidence_threshold

    def preprocess(self, image_path: str) -> np.ndarray:
        """Deskew and binarize for Tesseract fallback."""
        img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        if img is None:
            raise FileNotFoundError(image_path)
        # Estimate skew from dark (text) pixels; minAreaRect expects float32 points.
        coords = np.column_stack(np.where(img < 128)).astype(np.float32)
        angle = cv2.minAreaRect(coords)[-1]
        # This mapping assumes angles in [-90, 0); newer OpenCV releases report a
        # different range, so verify it against the installed version.
        angle = -(90 + angle) if angle < -45 else -angle
        h, w = img.shape
        M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
        img = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                             borderMode=cv2.BORDER_REPLICATE)
        img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                    cv2.THRESH_BINARY, 31, 11)
        return img

    async def extract(self, doc_path: str) -> dict:
        """Run Azure first; fall back to Tesseract on low-confidence pages."""
        with open(doc_path, "rb") as f:
            poller = await self.client.begin_analyze_document(
                "prebuilt-invoice", document=f
            )
            result = await poller.result()

        pages, fallback_pages = [], []
        for page in result.pages:
            avg_conf = float(np.mean([w.confidence for w in page.words])) if page.words else 0.0
            if avg_conf >= self.threshold:
                pages.append({"engine": "azure", "page": page.page_number,
                              "confidence": avg_conf,
                              "words": [w.content for w in page.words]})
            else:
                fallback_pages.append(page.page_number)

        for page_num in fallback_pages:
            # Assumes each page was already rendered to "<doc_path>_page<N>.png".
            preprocessed = self.preprocess(f"{doc_path}_page{page_num}.png")
            data = pytesseract.image_to_data(
                Image.fromarray(preprocessed), output_type=pytesseract.Output.DICT
            )
            # Tesseract reports -1 for non-word boxes; conf may arrive as str or float.
            confs = [float(c) for c in data["conf"]]
            word_confs = [c for c in confs if c > 0]
            pages.append({"engine": "tesseract", "page": page_num,
                          "confidence": float(np.mean(word_confs)) / 100 if word_confs else 0.0,
                          "words": [w for w, c in zip(data["text"], confs)
                                    if c > 40 and w.strip()]})

        return {"pages": pages, "fallback_count": len(fallback_pages),
                "avg_confidence": float(np.mean([p["confidence"] for p in pages])) if pages else 0.0}
Active Learning: HITL Corrections Fine-Tune the Model
from datetime import datetime, timedelta

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from transformers import LayoutLMv3ForTokenClassification, LayoutLMv3Processor


class ActiveLearningPipeline:
    """Ingest HITL corrections and fine-tune extraction model weekly."""

    def __init__(self, model_name: str, corrections_db):
        self.processor = LayoutLMv3Processor.from_pretrained(model_name)
        self.model = LayoutLMv3ForTokenClassification.from_pretrained(model_name)
        self.corrections_db = corrections_db

    def ingest_corrections(self, since: timedelta = timedelta(days=7)) -> list:
        """Extract labeled correction pairs from HITL review queue."""
        cutoff = datetime.utcnow() - since
        corrections = self.corrections_db.find({
            "reviewed_at": {"$gte": cutoff},
            "status": "corrected"
        })
        return [
            {
                "document_id": c["document_id"],
                "original_extraction": c["model_output"],
                "corrected_extraction": c["human_correction"],
                "field": c["field_name"],
                # The model's confidence in the value the reviewer later corrected.
                "original_confidence": c["original_confidence"]
            }
            for c in corrections
        ]

    def confidence_weighted_loss(self, logits, labels, confidences):
        """Weight loss inversely by original confidence.

        Low-confidence errors that humans corrected contribute more to learning."""
        base_loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), labels.view(-1), reduction="none"
        )
        # Raw weight is 1 - confidence, clamped so it stays within [0.01, 0.5];
        # the weights are then rescaled to average 1 across the batch.
        weights = 1.0 - confidences.view(-1).clamp(0.5, 0.99)
        weights = weights / weights.sum() * len(weights)
        return (base_loss * weights).mean()

    def retrain(self, training_pairs: list, epochs: int = 3, lr: float = 2e-5) -> dict:
        """Fine-tune on HITL corrections with confidence-weighted loss."""
        # _build_dataset is not shown here; it must yield batches with
        # "inputs", "labels", and "confidences" entries.
        dataset = self._build_dataset(training_pairs)
        # Oversample the corrections the model was least confident about.
        weights = [1.0 / max(p["original_confidence"], 0.01) for p in training_pairs]
        sampler = WeightedRandomSampler(weights, len(weights))
        loader = DataLoader(dataset, batch_size=8, sampler=sampler)
        optimizer = torch.optim.AdamW(self.model.parameters(), lr=lr)

        self.model.train()
        epoch_losses = []
        for epoch in range(epochs):
            total = 0
            for batch in loader:
                logits = self.model(**batch["inputs"]).logits
                loss = self.confidence_weighted_loss(
                    logits, batch["labels"], batch["confidences"]
                )
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
                total += loss.item()
            epoch_losses.append(total / len(loader))

        return {"corrections_used": len(training_pairs),
                "final_loss": epoch_losses[-1],
                "epoch_losses": epoch_losses}
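To see what the weighting in confidence_weighted_loss does numerically, the same formula can be reproduced on plain Python floats. This is an illustrative sketch with made-up confidences, not part of the training code.

```python
from typing import List


def confidence_weights(confidences: List[float]) -> List[float]:
    """Mirror the loss weighting: 1 - clamp(confidence, 0.5, 0.99), rescaled to mean 1."""
    clamped = [min(0.99, max(0.5, c)) for c in confidences]
    raw = [1.0 - c for c in clamped]
    total = sum(raw)
    return [w / total * len(raw) for w in raw]


# Hypothetical original confidences for three corrected fields.
weights = confidence_weights([0.55, 0.80, 0.99])
print([round(w, 3) for w in weights])
```

The field the model was least sure about (0.55) receives the largest weight and the near-certain one (0.99) the smallest, so in this formula low-confidence corrections dominate the loss.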
3-Way PO Match + pHash Duplicate Detection
from dataclasses import dataclass
from typing import Optional

import imagehash
from PIL import Image


@dataclass
class MatchResult:
    status: str  # "matched" | "variance" | "no_po" | "no_receipt" | "duplicate"
    variance_pct: float
    details: dict


class ThreeWayMatcher:
    """Cross-reference PO + Goods Receipt + Invoice; catch visual duplicates via pHash."""

    AMOUNT_TOLERANCE = 0.02  # 2% for currency rounding differences
    PHASH_THRESHOLD = 6      # Hamming distance threshold for visual duplicate

    def __init__(self, po_store, receipt_store, invoice_store):
        self.po_store = po_store
        self.receipt_store = receipt_store
        self.invoice_store = invoice_store

    def detect_visual_duplicate(self, invoice_path: str) -> Optional[str]:
        """Perceptual hash to catch re-submitted invoices with modified filenames/numbers."""
        # Image.open needs a raster image, so a PDF must be rendered to an image first.
        current_hash = imagehash.phash(Image.open(invoice_path))
        # Linear scan over stored hashes; subtracting two hashes gives their Hamming distance.
        for inv in self.invoice_store.find({"status": "processed"}):
            if current_hash - imagehash.hex_to_hash(inv["phash"]) <= self.PHASH_THRESHOLD:
                return inv["invoice_number"]
        return None

    def match(self, invoice: dict) -> MatchResult:
        # 1. Visual dedup via pHash (catches re-submissions with altered metadata)
        if invoice.get("file_path"):
            dup = self.detect_visual_duplicate(invoice["file_path"])
            if dup:
                return MatchResult("duplicate", 0,
                                   {"duplicate_of": dup, "method": "perceptual_hash"})

        # 2. Exact-match dedup (invoice number + vendor)
        existing = self.invoice_store.find_one({
            "invoice_number": invoice["invoice_number"],
            "vendor_id": invoice["vendor_id"],
            "status": {"$ne": "rejected"}
        })
        if existing:
            return MatchResult("duplicate", 0,
                               {"duplicate_of": existing["invoice_number"],
                                "method": "invoice_number+vendor"})

        # 3. PO existence check
        po = self.po_store.find_one({"po_number": invoice["po_number"]})
        if not po:
            return MatchResult("no_po", 0, {"po_number": invoice["po_number"]})

        # 4. Goods receipt confirmation
        receipt = self.receipt_store.find_one({
            "po_number": invoice["po_number"], "status": "received"
        })
        if not receipt:
            return MatchResult("no_receipt", 0, {"po_number": invoice["po_number"],
                                                 "reason": "goods not yet received"})

        # 5. Amount tolerance check (2%); assumes the PO total is non-zero
        variance = abs(invoice["total_amount"] - po["total_amount"]) / po["total_amount"]
        if variance > self.AMOUNT_TOLERANCE:
            return MatchResult("variance", round(variance * 100, 2),
                               {"po_amount": po["total_amount"],
                                "invoice_amount": invoice["total_amount"],
                                "threshold": "2%"})

        return MatchResult("matched", round(variance * 100, 2),
                           {"po": po["po_number"],
                            "receipt": receipt["receipt_id"],
                            "invoice": invoice["invoice_number"],
                            "amount": invoice["total_amount"]})
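The 2% tolerance check sits right at a boundary where binary floating point can misjudge an amount that is exactly at the limit. A hedged alternative, shown here as a standalone sketch and not as the pipeline's method, is to do the comparison with `decimal.Decimal` on string amounts. The function names and the sample totals are hypothetical.

```python
from decimal import Decimal

AMOUNT_TOLERANCE = Decimal("0.02")  # 2%, the same tolerance the matcher uses


def amount_variance(invoice_total: str, po_total: str) -> Decimal:
    """Relative difference between invoice and PO totals, as a fraction of the PO."""
    po = Decimal(po_total)
    if po == 0:
        raise ValueError("PO total must be non-zero")
    return abs(Decimal(invoice_total) - po) / abs(po)


def within_tolerance(invoice_total: str, po_total: str) -> bool:
    """True when the invoice is within the allowed variance of the PO total."""
    return amount_variance(invoice_total, po_total) <= AMOUNT_TOLERANCE


print(within_tolerance("1020.00", "1000.00"))  # True: exactly 2%
print(within_tolerance("1020.01", "1000.00"))  # False: just over 2%
```

The explicit zero check also covers the case the float version leaves implicit, where a PO with a zero total would raise a division error.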
Technology Stack
Every library and service that powers the end-to-end IDP pipeline in production.
Intelligent Document Processing
A 7-stage Azure-native pipeline (dual-engine OCR, LayoutLMv3 classification, spaCy NER, 3-way PO matching, pHash deduplication, and an active-learning HITL loop) processing 15,000 documents per day with 97.2% combined OCR word accuracy and an 85% straight-through rate.