Abstract

With the advent of end-to-end models and the remarkable performance of foundation models, the question arises whether preliminary steps such as layout analysis and optical character recognition (OCR) remain relevant for information extraction from document images. We attempt to provide some answers through experiments conducted on a new database of food labels. The goal is to extract nutritional values from cellphone pictures taken in grocery stores. We compare the results of OCR-free models that take the raw images as input (Donut and GPT-4-Vision) with two-stage systems that first perform OCR and then extract information from the recognized text using large language models (LLMs) (Mistral, GPT-3, and GPT-4). To assess the impact of layout analysis, we applied the same systems to three different views of the image: the original full image, a large manual crop containing the entire food label, and a small crop focusing on the relevant nutrition information. Comparative experiments are also conducted on the CORD database of receipts. Our results demonstrate that although OCR-free models achieve remarkable performance, they still require some guidance regarding the layout, and two-stage systems achieve better results overall.
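
As an illustration of the two-stage setup described above, the sketch below chains an off-the-shelf OCR engine with an LLM prompt. The choice of pytesseract, the OpenAI client, the model name, the prompt wording, and the field list are assumptions made for illustration, not the authors' exact configuration.

    import json
    import pytesseract
    from PIL import Image
    from openai import OpenAI

    def extract_nutrition(image_path):
        # Stage 1: OCR the (pre-cropped) food-label photo into plain text.
        text = pytesseract.image_to_string(Image.open(image_path))
        # Stage 2: ask an LLM to structure the recognized text (illustrative prompt).
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system",
                 "content": "Extract the nutritional values (energy, fat, "
                            "carbohydrates, protein, salt) from the OCR text "
                            "of a food label. Reply with a JSON object only."},
                {"role": "user", "content": text},
            ],
        )
        return json.loads(response.choices[0].message.content)

    # e.g. extract_nutrition("label_small_crop.jpg")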
