Baidu’s model reads the whole document, not just the text on the scan
Baidu’s 4B OCR marries vision and language📷 Scraped: Mar 18, 2026
- ★Qianfan-OCR scores 93.12 on OmniDocBench v1.5, outperforming rivals in the end-to-end category
- ★Model supports prompt-driven features: table extraction, document Q&A, and two-column PDF processing
- ★Unlike Tesseract or ABBYY, it skips multi-stage pipelines and goes straight from pixels to Markdown
Baidu's Qianfan team has released a 4-billion-parameter model that collapses layout analysis, text recognition, and document understanding into a single end-to-end neural stack. Most OCR still runs through brittle, multi-stage pipelines that chain detection, recognition, and parsing modules like so many rusty pipe couplings. Qianfan-OCR cuts through this complexity by taking the entire workflow straight from pixels to Markdown. The parameter count is not mere marketing math: 4 billion transformer weights buy a shared understanding of shapes, text, and structure that separately trained pipeline stages cannot easily replicate.
The model scores 93.12 on OmniDocBench v1.5, outperforming rivals in the end-to-end category. This matters because benchmark leadership in document intelligence has historically belonged to modular systems that stitch together specialized components. A unified model beating that paradigm suggests the field is approaching an inflection point similar to the one in machine translation, when neural models built on attention displaced phrase-based statistical systems.
Prompt-driven features separate this release from conventional OCR tooling. Beyond raw text extraction, the stack accepts instructions for table extraction and document Q&A, transforming static pages into queryable knowledge representations. Early demonstrations show it handling two-column PDFs and nested tables without degradation—scenarios that routinely fracture modular pipelines where layout detection errors cascade catastrophically into recognition failures.
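To make "prompt-driven" concrete, here is a minimal sketch of how a client might pair one page image with one task instruction. Everything in it is an assumption for illustration: the prompt wording, the `build_request` helper, and the `prompt` and `image_base64` field names are hypothetical and are not taken from Baidu's SDK or API documentation.

```python
import base64
import json

# Hypothetical prompt templates; the wording is illustrative,
# not Baidu's documented prompts.
PROMPTS = {
    "markdown": "Convert this document page to Markdown.",
    "tables": "Extract every table on this page as a Markdown table.",
    "qa": "Answer the question using only this document: {question}",
}


def build_request(task: str, image_bytes: bytes, question: str = "") -> str:
    """Return a JSON payload pairing one page image with one task prompt.

    The payload shape (field names included) is an assumption for this sketch.
    """
    if task not in PROMPTS:
        raise ValueError(f"unknown task: {task}")
    if task == "qa" and not question:
        raise ValueError("the 'qa' task needs a question")
    # str.format ignores the unused keyword for templates without a placeholder.
    prompt = PROMPTS[task].format(question=question)
    payload = {
        "prompt": prompt,
        "image_base64": base64.b64encode(image_bytes).decode("ascii"),
    }
    return json.dumps(payload)
```

The point of the sketch is the shape of the interaction: the same image goes to the same model, and only the instruction changes between text extraction, table extraction, and question answering.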
Chinese document intelligence model converts images directly to Markdown, including tables and question answering
One architecture, zero glue-code overhead
The direct image-to-Markdown conversion is what gives this launch practical teeth. Traditional OCR pipelines export plain text or malformed HTML; downstream applications then wrestle with layout metadata reconstruction. Qianfan-OCR bakes formatting awareness into its decoder, so a scanned resume comes out as clean Markdown that renders consistently across GitHub, Obsidian, or static site generators. This eliminates an entire class of post-processing scripts that engineering teams currently maintain as technical debt.
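As a rough illustration of why Markdown output is convenient downstream, the sketch below turns a pipe-delimited Markdown table into a list of row dictionaries using only the Python standard library. The `parse_markdown_table` helper and the sample invoice are made up for this example and are not actual Qianfan-OCR output; the parser assumes a well-formed table with a separator row and no escaped pipes inside cells.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse the first pipe-delimited Markdown table into a list of row dicts."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("|") and line.endswith("|"):
            # Drop the outer pipes, then split the remaining text into cells.
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
        elif rows:
            break  # the first table has ended
    if len(rows) < 2:
        return []
    # Assumes rows[1] is the |---|---| separator row and skips it.
    header, body = rows[0], rows[2:]
    return [dict(zip(header, row)) for row in body]


# Made-up sample standing in for model output.
sample = """# Invoice

| Item | Qty | Price |
|------|-----|-------|
| Paper | 2 | 4.50 |
| Toner | 1 | 39.00 |

Thank you.
"""

records = parse_markdown_table(sample)
print(records)
```

A dozen lines of parsing is enough here because the table structure arrives already encoded in the text; with plain-text OCR output, the same rows would first have to be rebuilt from word coordinates.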
Baidu's release notes claim up to 6% accuracy improvements over state-of-the-art two-stage pipelines on public benchmarks. Whether these numbers survive contact with real-world filing cabinets—smudged receipts, skewed mobile captures, century-old typewriter pages—remains the open question that separates research demonstrations from production reliability. The open-source SDK and cloud API wrapper suggest Baidu is betting on developer adoption rather than keeping this capability proprietary, a strategy that accelerates iteration through community stress-testing.
For practitioners, the operational implication is significant: one model endpoint replaces three to five specialized services, cutting latency budgets and failure modes simultaneously. The trade-off is familiar from other unified architectures—slightly worse at any single task than a purpose-built specialist, but dramatically more robust at the messy boundaries where real documents actually live.

