- JPMorgan introduced DocLLM, a new AI model for understanding complex documents such as forms, invoices, and reports.
- DocLLM strategically focuses on spatial layout information instead of expensive image encoders, making it lightweight.
- The model uses a novel disentangled spatial attention mechanism to align text and layout.
- DocLLM is adept at tackling irregular document layouts via its text infilling pre-training objective.
- Extensive evaluations show DocLLM outperforming state-of-the-art models on a majority of document intelligence benchmarks.
- JPMorgan plans to infuse computer vision into DocLLM to further improve its capabilities.
JPMorgan’s DocLLM Brings Multimodal Intelligence to Document Analysis
JPMorgan has introduced DocLLM, a new AI model designed to understand and analyze documents spanning formats such as forms, invoices, reports, and contracts. DocLLM stands out from other multimodal language models through its strategic focus on lightweight spatial layout information rather than expensive image encoders.
The model features a novel disentangled spatial attention mechanism that decomposes the standard transformer attention so that the textual and spatial (bounding-box) modalities of a visual document can be cross-aligned separately. Combined with a pre-training objective in which the model learns to infill missing text segments, this makes DocLLM adept at handling the irregular layouts and heterogeneous content common in real-world documents.
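One way to picture the mechanism is the single-head sketch below: text and bounding-box features each get their own query/key projections, and the attention score is a weighted sum of the text-text, text-spatial, spatial-text, and spatial-spatial interaction terms. The class name, the scalar lambda weights, and the single-head simplification are assumptions made for this illustration; this is a rough sketch of the idea, not JPMorgan's released implementation.

```python
# Illustrative sketch of disentangled spatial attention: text and layout
# (bounding-box) embeddings interact through separate projections, and the
# four cross-modal score terms are summed before the softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledSpatialAttention(nn.Module):
    def __init__(self, d_model: int, lambda_ts=1.0, lambda_st=1.0, lambda_ss=1.0):
        super().__init__()
        # Separate projections for the text and spatial modalities.
        self.q_text = nn.Linear(d_model, d_model)
        self.k_text = nn.Linear(d_model, d_model)
        self.v_text = nn.Linear(d_model, d_model)
        self.q_spat = nn.Linear(d_model, d_model)
        self.k_spat = nn.Linear(d_model, d_model)
        # Scalar weights controlling each cross-modal term (assumed here).
        self.lambda_ts = lambda_ts
        self.lambda_st = lambda_st
        self.lambda_ss = lambda_ss
        self.scale = d_model ** -0.5

    def forward(self, text_emb, spatial_emb):
        # text_emb, spatial_emb: (batch, seq_len, d_model)
        qt, kt, vt = self.q_text(text_emb), self.k_text(text_emb), self.v_text(text_emb)
        qs, ks = self.q_spat(spatial_emb), self.k_spat(spatial_emb)

        # Attention score = text-text + text-spatial + spatial-text + spatial-spatial.
        scores = (
            qt @ kt.transpose(-2, -1)
            + self.lambda_ts * (qt @ ks.transpose(-2, -1))
            + self.lambda_st * (qs @ kt.transpose(-2, -1))
            + self.lambda_ss * (qs @ ks.transpose(-2, -1))
        ) * self.scale
        return F.softmax(scores, dim=-1) @ vt

# Toy usage: 2 documents, 8 tokens each, 64-dim embeddings.
text = torch.randn(2, 8, 64)
boxes = torch.randn(2, 8, 64)   # projected bounding-box features
print(DisentangledSpatialAttention(64)(text, boxes).shape)  # torch.Size([2, 8, 64])
```

The infilling objective builds on this representation: segments of text are withheld from a document and the model is trained to regenerate them, which is what lets it cope with fragmented content and irregular reading order.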
For pre-training data, JPMorgan gathered documents from two primary sources – the IIT-CDIP Test Collection 1.0 and DocBank. The IIT-CDIP Test Collection comprises over 5 million legal documents related to tobacco industry lawsuits from the 1990s. DocBank consists of 500,000 documents spanning different layouts and structures.
Extensive evaluations demonstrate DocLLM’s superior performance over state-of-the-art language models on a battery of document intelligence tasks. It achieved the top score on 14 out of 16 benchmark datasets. The model also displayed robust generalization ability on 4 out of 5 unseen datasets.
Going forward, JPMorgan aims to further enhance DocLLM’s capabilities by infusing lightweight computer vision, allowing the model to leverage both text and images for document understanding.