DocDB: A Database for Unstructured Document Analysis

Zequn Li, Yuanhao Zhong, Chengliang Chai*, Zhaoze Sun, Yuhao Deng*, Ye Yuan, Guoren Wang, Lei Cao

*Corresponding author for this work

Research output: Contribution to journal, Conference article, peer-review

Abstract

Recent studies have developed LLM-powered data systems that enable database-like analysis of unstructured text documents. While LLMs excel at extracting attributes from documents, their high computational cost and latency make extraction operations the primary performance bottleneck. Existing systems typically adopt query optimization strategies from traditional relational databases, which prove ineffective at minimizing LLM-related expenses. To fill this gap, we propose DocDB, a prototype system that features a set of novel optimization strategies designed for unstructured document analysis. First, we employ a two-level index to reduce LLM extraction costs by selectively retrieving and processing only the text segments relevant to the target attributes. Second, DocDB employs adaptive execution, generating document-specific plans that minimize the number of LLM extractions based on the varying per-document cost of extracting each attribute. Using a real-life scenario, we demonstrate that DocDB allows users to analyze unstructured documents accurately and affordably with SQL-like queries. The corresponding video is available at http://youtu.be/8yDIKOBHIOg.
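
The two ideas in the abstract can be illustrated with a minimal Python sketch. This is not DocDB's implementation: the abstract does not describe the index structure or the cost model, so a plain keyword filter stands in for the index, the number of characters sent to the LLM stands in for extraction cost, and `Predicate`, `relevant_segments`, `plan_for_document`, `evaluate`, and the `extract` callback are all hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Predicate:
    attribute: str                  # attribute the LLM would extract
    keywords: frozenset             # hypothetical index terms for that attribute
    test: Callable[[object], bool]  # filter applied to the extracted value


def relevant_segments(segments, keywords):
    """Stand-in for the index: keep segments mentioning at least one keyword."""
    return [s for s in segments
            if keywords & set(re.findall(r"\w+", s.lower()))]


def plan_for_document(segments, predicates):
    """Order predicates by their estimated extraction cost for this document."""
    def cost(pred):
        # Assumed cost model: characters that would be sent to the LLM.
        return sum(len(s) for s in relevant_segments(segments, pred.keywords))
    return sorted(predicates, key=cost)


def evaluate(segments, predicates, extract):
    """Run the cheapest extractions first and stop at the first failed filter.

    `extract(attribute, context)` is a placeholder for an LLM call.
    Returns (document_passes, number_of_extract_calls).
    """
    calls = 0
    for pred in plan_for_document(segments, predicates):
        context = relevant_segments(segments, pred.keywords)
        if not context:
            # Assumption: an attribute with no supporting text fails the filter.
            return False, calls
        calls += 1
        if not pred.test(extract(pred.attribute, context)):
            return False, calls
    return True, calls
```

Because the plan is computed per document, the same query may evaluate its filters in a different order on each document, and a document rejected by a cheap filter never pays for the expensive extractions.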

Original language: English
Pages (from-to): 5387-5390
Number of pages: 4
Journal: Proceedings of the VLDB Endowment
Volume: 18
Issue number: 12
Publication status: Published - 2025
Event: 51st International Conference on Very Large Data Bases, VLDB 2025 - London, United Kingdom
Duration: 1 Sept 2025 to 5 Sept 2025
