DocDB: A Database for Unstructured Document Analysis

Zequn Li; Yuanhao Zhong; Chengliang Chai; Zhaoze Sun; Yuhao Deng; Ye Yuan; Guoren Wang; Lei Cao

doi:10.14778/3750601.3750678

Zequn Li, Yuanhao Zhong, Chengliang Chai*, Zhaoze Sun, Yuhao Deng*, Ye Yuan, Guoren Wang, Lei Cao

*Corresponding author for this work

School of Computer Science and Technology

Research output: Contribution to journal › Conference article › peer-review

Abstract

Recent studies have developed LLM-powered data systems that enable database-like analysis of unstructured text documents. While LLMs excel at attribute extraction from documents, their high computational costs and latency make extraction operations the primary performance bottleneck. Existing systems typically adopt traditional relational database query optimization strategies, which prove ineffective at minimizing LLM-related expenses. To fill this gap, we propose DocDB, a prototype system that features a set of novel optimization strategies designed for unstructured document analysis. First, we employ a two-level index to reduce LLM extraction costs by selectively retrieving and processing only text segments relevant to target attributes. Second, DocDB employs adaptive execution, generating document-specific plans to minimize LLM extraction frequency based on varying per-document attribute extraction costs. Using a real-life scenario, we demonstrate that DocDB allows users to analyze unstructured documents accurately and affordably using SQL-like queries. The corresponding video is available at http://youtu.be/8yDIKOBHIOg.
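The abstract names two ideas, a two-level index and per-document adaptive execution, without describing how either is built. The following is only a minimal sketch of how such ideas could fit together, not DocDB's actual design. Everything in it is an assumption made for illustration: the keyword hints in `HINTS`, the use of character counts as a stand-in for extraction cost, the rule that a document with no relevant segment cannot satisfy a predicate, and the regex-based `fake_extract` that replaces a real LLM call.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

# ASSUMPTION: keyword hints stand in for whatever relevance signal a real
# index would use (for example embeddings); they are not taken from the paper.
HINTS: Dict[str, List[str]] = {
    "salary": ["salary", "compensation"],
    "location": ["located", "office", "city"],
}

@dataclass
class Document:
    doc_id: str
    segments: List[str]

def build_index(docs: List[Document]):
    """Level 1: attribute -> ids of documents that mention it.
    Level 2: (attribute, doc id) -> positions of the relevant segments."""
    doc_level: Dict[str, Set[str]] = {attr: set() for attr in HINTS}
    seg_level: Dict[Tuple[str, str], List[int]] = {}
    for doc in docs:
        for attr, hints in HINTS.items():
            hits = [i for i, seg in enumerate(doc.segments)
                    if any(h in seg.lower() for h in hints)]
            if hits:
                doc_level[attr].add(doc.doc_id)
                seg_level[(attr, doc.doc_id)] = hits
    return doc_level, seg_level

def run_query(docs, predicates: Dict[str, Callable], extract, doc_level, seg_level):
    """Evaluate a conjunction of predicates with a separate plan per document."""
    matches = []
    for doc in docs:
        # Level-1 pruning. ASSUMPTION: a document with no relevant segment
        # for some predicate attribute cannot satisfy the query.
        if any(doc.doc_id not in doc_level[attr] for attr in predicates):
            continue

        def cost(attr: str) -> int:
            # ASSUMPTION: characters sent to the extractor approximate its cost.
            return sum(len(doc.segments[i]) for i in seg_level[(attr, doc.doc_id)])

        # Document-specific plan: cheapest extraction first, stop on first failure.
        satisfied = True
        for attr in sorted(predicates, key=cost):
            text = " ".join(doc.segments[i] for i in seg_level[(attr, doc.doc_id)])
            if not predicates[attr](extract(attr, text)):
                satisfied = False
                break
        if satisfied:
            matches.append(doc.doc_id)
    return matches

calls: List[Tuple[str, int]] = []

def fake_extract(attr: str, text: str):
    """Hypothetical stand-in for an LLM extraction call; records each call."""
    calls.append((attr, len(text)))
    if attr == "salary":
        m = re.search(r"\d+", text)
        return int(m.group()) if m else None
    m = re.search(r"located in (\w+)", text)
    return m.group(1) if m else None

docs = [
    Document("d1", ["Acme is located in Berlin.",
                    "The company was founded decades ago and has grown steadily.",
                    "The salary is 90000."]),
    Document("d2", ["The main office is located in Paris, close to the river.",
                    "Salary: 50000."]),
    Document("d3", ["Nothing relevant is stated here."]),
]
doc_level, seg_level = build_index(docs)
predicates = {
    "salary": lambda v: v is not None and v > 60000,
    "location": lambda v: v == "Berlin",
}
matches = run_query(docs, predicates, fake_extract, doc_level, seg_level)
print(matches, len(calls))
```

In this toy run the extractor is invoked three times instead of six: `d3` is pruned by the first index level, and `d2` stops after its cheaper salary check fails. Only the segments selected by the second index level are ever passed to the extractor.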

Original language	English
Pages (from-to)	5387-5390
Number of pages	4
Journal	Proceedings of the VLDB Endowment
Volume	18
Issue number	12
DOIs	http://doi.org/10.14778/3750601.3750678
Publication status	Published - 2025
Event	51st International Conference on Very Large Data Bases, VLDB 2025, London, United Kingdom (1 Sept 2025 → 5 Sept 2025)

Cite this

@article{e696f396b9494113aedd5f9d0592d672,
  title = "DocDB: A Database for Unstructured Document Analysis",
  abstract = "Recent studies have developed LLM-powered data systems that enable database-like analysis of unstructured text documents. While LLMs excel at attribute extraction from documents, their high computational costs and latency make extraction operations the primary performance bottleneck. Existing systems typically adopt traditional relational database query optimization strategies, which prove ineffective at minimizing LLM-related expenses. To fill this gap, we propose DocDB, a prototype system that features a set of novel optimization strategies designed for unstructured document analysis. First, we employ a two-level index to reduce LLM extraction costs by selectively retrieving and processing only text segments relevant to target attributes. Second, DocDB employs adaptive execution, generating document-specific plans to minimize LLM extraction frequency based on varying per-document attribute extraction costs. Using a real-life scenario, we demonstrate that DocDB allows users to analyze unstructured documents accurately and affordably using SQL-like queries. The corresponding video is available at http://youtu.be/8yDIKOBHIOg.",
  author = "Zequn Li and Yuanhao Zhong and Chengliang Chai and Zhaoze Sun and Yuhao Deng and Ye Yuan and Guoren Wang and Lei Cao",
  note = "Publisher Copyright: {\textcopyright} 2025, VLDB Endowment. All rights reserved.; 51st International Conference on Very Large Data Bases, VLDB 2025 ; Conference date: 01-09-2025 Through 05-09-2025",
  year = "2025",
  doi = "10.14778/3750601.3750678",
  language = "English",
  volume = "18",
  pages = "5387--5390",
  journal = "Proceedings of the VLDB Endowment",
  issn = "2150-8097",
  publisher = "Very Large Data Base Endowment Inc.",
  number = "12",
}

TY  - JOUR
T1  - DocDB: A Database for Unstructured Document Analysis
T2  - 51st International Conference on Very Large Data Bases, VLDB 2025
AU  - Li, Zequn
AU  - Zhong, Yuanhao
AU  - Chai, Chengliang
AU  - Sun, Zhaoze
AU  - Deng, Yuhao
AU  - Yuan, Ye
AU  - Wang, Guoren
AU  - Cao, Lei
PY  - 2025
Y1  - 2025
AB  - Recent studies have developed LLM-powered data systems that enable database-like analysis of unstructured text documents. While LLMs excel at attribute extraction from documents, their high computational costs and latency make extraction operations the primary performance bottleneck. Existing systems typically adopt traditional relational database query optimization strategies, which prove ineffective at minimizing LLM-related expenses. To fill this gap, we propose DocDB, a prototype system that features a set of novel optimization strategies designed for unstructured document analysis. First, we employ a two-level index to reduce LLM extraction costs by selectively retrieving and processing only text segments relevant to target attributes. Second, DocDB employs adaptive execution, generating document-specific plans to minimize LLM extraction frequency based on varying per-document attribute extraction costs. Using a real-life scenario, we demonstrate that DocDB allows users to analyze unstructured documents accurately and affordably using SQL-like queries. The corresponding video is available at http://youtu.be/8yDIKOBHIOg.
UR  - http://www.scopus.com/pages/publications/105016625642
U2  - 10.14778/3750601.3750678
DO  - 10.14778/3750601.3750678
M3  - Conference article
AN  - SCOPUS:105016625642
SN  - 2150-8097
VL  - 18
SP  - 5387
EP  - 5390
JO  - Proceedings of the VLDB Endowment
JF  - Proceedings of the VLDB Endowment
IS  - 12
Y2  - 1 September 2025 through 5 September 2025
ER  -
