Why vision-language models finally cracked messy financial documents
For most of the last decade, document AI for finance was a catalogue of brittle pipelines. OCR followed by rules. Templates per format. Per-bank fine-tunes that broke when a layout changed. In emerging markets, where documents are screenshots, photos taken at angles, and locally formatted PDFs, none of those approaches scaled past a single pilot.
What changed in 2025 was simple, and I think underappreciated: vision-language models started reading documents the way a human does. Not as a 2D grid of pixels to OCR, then a string to parse, but as a single perceptual object, with page layout, table structure, handwritten annotations, language, currency, and tampering signals all handled in one forward pass.
What changed
Three things, technically. First, training corpora finally started including high-quality, multilingual financial documents, not just English bank statements. Second, context windows grew long enough to read a 30-page audited financial statement in one pass, with cited line numbers. Third, instruction tuning on structured-output formats (JSON schemas, cited fields) became reliable enough to skip the schema-coercion hacks we used to need.
The first time we ran a tilted GCash screenshot through one of these models and got back a clean JSON with cited bounding boxes, we knew underwriting in emerging markets was about to change.
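To make "clean JSON with cited bounding boxes" concrete, here is a minimal sketch of checking that every extracted field carries a usable citation. The field layout (`value`, `page`, `bbox`), the `CitedField` type, the normalized 0..1 coordinates, and the sample payload are all assumptions made for illustration, not the output format of any particular model.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CitedField:
    """One extracted value plus where on the page it was read (hypothetical shape)."""
    name: str
    value: str
    page: int
    bbox: tuple  # (x0, y0, x1, y1), assumed normalized to the 0..1 range


def parse_cited_fields(raw, required):
    """Parse model output and reject any field that lacks a usable citation."""
    payload = json.loads(raw)
    missing = set(required) - set(payload)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")

    fields = {}
    for name, item in payload.items():
        if "page" not in item or "bbox" not in item:
            raise ValueError(f"{name}: no citation attached")
        bbox = tuple(float(c) for c in item["bbox"])
        if len(bbox) != 4:
            raise ValueError(f"{name}: bbox must have 4 coordinates")
        x0, y0, x1, y1 = bbox
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"{name}: bbox out of range or degenerate")
        page = int(item["page"])
        if page < 1:
            raise ValueError(f"{name}: page numbers start at 1")
        fields[name] = CitedField(name, str(item["value"]), page, bbox)
    return fields


# Illustrative payload only; not real model output.
SAMPLE = json.dumps({
    "account_holder": {"value": "J. Dela Cruz", "page": 1,
                       "bbox": [0.10, 0.08, 0.55, 0.12]},
    "closing_balance": {"value": "12450.75", "page": 1,
                        "bbox": [0.60, 0.82, 0.90, 0.86]},
})

fields = parse_cited_fields(SAMPLE, required={"account_holder", "closing_balance"})
print(fields["closing_balance"].value, fields["closing_balance"].bbox)
```

Refusing uncited values at the parsing boundary means a reviewer can always trace a number back to a region of the page instead of taking it on trust.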
Why this makes underwriting a software problem
The bottleneck of underwriting in emerging markets has never really been the credit model. It has been the file. Officers spend most of their time chasing missing pages, keying in numbers, and reconciling figures across documents. Modeling is the last 20%.
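Reconciliation is the most mechanical part of that work. As one example of a cross-document check, the sketch below verifies that each statement's opening balance matches the previous statement's closing balance. The function name, the tuple layout, the tolerance, and the sample figures are hypothetical, and a real file would need many more checks than this one.

```python
from decimal import Decimal


def find_balance_gaps(statements, tolerance=Decimal("0.01")):
    """Flag consecutive statements whose balances do not chain together.

    `statements` is a chronological list of (label, opening, closing),
    with amounts given as decimal strings. Returns a list of
    (previous_label, next_label, difference) for every mismatch.
    """
    gaps = []
    for prev, nxt in zip(statements, statements[1:]):
        prev_label, _, prev_closing = prev
        next_label, next_opening, _ = nxt
        diff = Decimal(next_opening) - Decimal(prev_closing)
        if abs(diff) > tolerance:
            gaps.append((prev_label, next_label, diff))
    return gaps


# Illustrative figures only.
STATEMENTS = [
    ("2025-01", "1000.00", "1350.50"),
    ("2025-02", "1350.50", "900.00"),
    ("2025-03", "950.00", "1200.00"),  # opens 50.00 above February's close
]

gaps = find_balance_gaps(STATEMENTS)
for prev_label, next_label, diff in gaps:
    print(f"{prev_label} -> {next_label}: off by {diff}")
```

Using `Decimal` rather than floats keeps currency arithmetic exact, so a flagged gap reflects the documents and not rounding error.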
When document reading becomes reliable software, the whole stack flips. The officer's job collapses into the part that should always have been theirs: judgment. Everything else is software.
That is what LendTrace is for.
See it on your own documents.
Bring a hard borrower file. We'll extract it live.