Embedding Layout in Text for Document Understanding Using Large Language Models

Mohammad Minouei, Mohammad Reza Soheili, Didier Stricker (Eds.)
International Conference on Document Analysis and Recognition (ICDAR-2024), Springer, Cham, 2024.

Abstract:
In this paper, we address the challenge of effectively using Large Language Models (LLMs) for Visually Rich Document Understanding (VRDU), a key part of intelligent document processing systems. While LLMs excel at a variety of Natural Language Processing (NLP) tasks, their use for extracting information from complex structured documents such as invoices and forms remains limited. This limitation arises from the difficulty of understanding these documents in context, largely because layout information is missing. Our research aims to unlock the full potential of LLMs for VRDU by converting OCR data into an HTML format that preserves the spatial layout essential for accurate information extraction. The empirical results show a notable improvement of more than 20% over baseline performance. This research highlights the potential of LLMs in VRDU and sets the stage for further innovations in automated document processing.
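
The idea of carrying OCR layout into an HTML representation can be illustrated with a minimal sketch. The abstract does not specify the HTML schema used in the paper, so everything below is an assumption made for illustration: the input format (a list of word dictionaries with pixel bounding boxes), the hypothetical function names `group_into_lines` and `ocr_to_html`, the line-grouping tolerance, and the choice to encode positions as percentage offsets in an inline style.

```python
from html import escape


def group_into_lines(words, line_tol=5):
    """Group OCR words into text lines by vertical center.

    Assumption: words whose vertical centers lie within `line_tol` pixels
    of the first word of a line belong to that line. This is a simple
    heuristic, not the method described in the paper.
    """
    lines = []
    ordered = sorted(words, key=lambda w: ((w["y0"] + w["y1"]) / 2, w["x0"]))
    for word in ordered:
        center = (word["y0"] + word["y1"]) / 2
        if lines and abs(center - lines[-1]["center"]) <= line_tol:
            lines[-1]["words"].append(word)
        else:
            lines.append({"center": center, "words": [word]})
    # Within each line, restore left-to-right reading order.
    for line in lines:
        line["words"].sort(key=lambda w: w["x0"])
    return lines


def ocr_to_html(words, page_width, page_height, line_tol=5):
    """Serialize OCR words as HTML, keeping coarse position information.

    Each line becomes a <p>; each word becomes a <span> whose inline style
    records its top-left corner as a percentage of the page size
    (an assumed encoding, chosen to keep the markup short).
    """
    parts = ['<div class="page">']
    for line in group_into_lines(words, line_tol):
        parts.append("  <p>")
        for w in line["words"]:
            left = round(100 * w["x0"] / page_width)
            top = round(100 * w["y0"] / page_height)
            text = escape(w["text"])  # protect &, <, > in OCR output
            parts.append(f'    <span style="left:{left}%;top:{top}%">{text}</span>')
        parts.append("  </p>")
    parts.append("</div>")
    return "\n".join(parts)
```

The resulting string could then be placed in an LLM prompt in place of the flat OCR text, so that the model sees which words share a line and roughly where they sit on the page. Percentages rather than raw pixel coordinates are used here only to keep the markup compact; other encodings (for example HTML tables for grid-like regions) would be equally plausible.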