TY - JOUR
T1 - DocDB
T2 - 51st International Conference on Very Large Data Bases, VLDB 2025
AU - Li, Zequn
AU - Zhong, Yuanhao
AU - Chai, Chengliang
AU - Sun, Zhaoze
AU - Deng, Yuhao
AU - Yuan, Ye
AU - Wang, Guoren
AU - Cao, Lei
N1 - Publisher Copyright: © 2025, VLDB Endowment. All rights reserved.
PY - 2025
Y1 - 2025
N2 - Recent studies have developed LLM-powered data systems that enable database-like analysis of unstructured text documents. While LLMs excel at attribute extraction from documents, their high computational cost and latency make extraction operations the primary performance bottleneck. Existing systems typically adopt traditional relational database query optimization strategies, which prove ineffective at minimizing LLM-related expenses. To fill this gap, we propose DocDB, a prototype system that features a set of novel optimization strategies designed for unstructured document analysis. First, DocDB employs a two-level index to reduce LLM extraction costs by selectively retrieving and processing only the text segments relevant to the target attributes. Second, DocDB employs adaptive execution, generating document-specific plans that minimize LLM extraction frequency based on varying per-document attribute extraction costs. Using a real-life scenario, we demonstrate that DocDB allows users to analyze unstructured documents accurately and affordably with SQL-like queries. The corresponding video is available at http://youtu.be/8yDIKOBHIOg.
AB - Recent studies have developed LLM-powered data systems that enable database-like analysis of unstructured text documents. While LLMs excel at attribute extraction from documents, their high computational cost and latency make extraction operations the primary performance bottleneck. Existing systems typically adopt traditional relational database query optimization strategies, which prove ineffective at minimizing LLM-related expenses. To fill this gap, we propose DocDB, a prototype system that features a set of novel optimization strategies designed for unstructured document analysis. First, DocDB employs a two-level index to reduce LLM extraction costs by selectively retrieving and processing only the text segments relevant to the target attributes. Second, DocDB employs adaptive execution, generating document-specific plans that minimize LLM extraction frequency based on varying per-document attribute extraction costs. Using a real-life scenario, we demonstrate that DocDB allows users to analyze unstructured documents accurately and affordably with SQL-like queries. The corresponding video is available at http://youtu.be/8yDIKOBHIOg.
UR - http://www.scopus.com/pages/publications/105016625642
U2 - 10.14778/3750601.3750678
DO - 10.14778/3750601.3750678
M3 - Conference article
AN - SCOPUS:105016625642
SN - 2150-8097
VL - 18
SP - 5387
EP - 5390
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 12
Y2 - 1 September 2025 through 5 September 2025
ER -