DocDB: A Database for Unstructured Document Analysis

Zequn Li, Yuanhao Zhong, Chengliang Chai*, Zhaoze Sun, Yuhao Deng*, Ye Yuan, Guoren Wang, Lei Cao

*Corresponding author for this work

Research output: Contribution to journal, Conference article, peer-reviewed

Abstract

Recent studies have developed LLM-powered data systems that enable database-like analysis of unstructured text documents. While LLMs excel at extracting attributes from documents, their high computational cost and latency make extraction operations the primary performance bottleneck. Existing systems typically adopt query optimization strategies from traditional relational databases, which prove ineffective at minimizing LLM-related expenses. To fill this gap, we propose DocDB, a prototype system that features a set of novel optimization strategies designed for unstructured document analysis. First, DocDB employs a two-level index to reduce LLM extraction costs by selectively retrieving and processing only the text segments relevant to the target attributes. Second, DocDB employs adaptive execution, generating document-specific plans that minimize the number of LLM extractions based on the varying per-document cost of extracting each attribute. Using a real-life scenario, we demonstrate that DocDB allows users to analyze unstructured documents accurately and affordably with SQL-like queries. The corresponding video is available at http://youtu.be/8yDIKOBHIOg.
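
The adaptive-execution idea in the abstract can be illustrated with a minimal sketch. This is not DocDB's implementation: the names `Predicate`, `plan_for_document`, and `evaluate`, the cost numbers, and the pass-rate estimates are all hypothetical, and the ordering rule is the classic cost-over-selectivity ranking from relational query optimization, applied here per document rather than once per query. The `extract` callback stands in for an LLM extraction call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Predicate:
    attribute: str                # attribute the predicate reads (hypothetical)
    test: Callable[[str], bool]   # check applied to the extracted value
    pass_rate: float              # assumed estimate of how often the test passes


def plan_for_document(predicates: List[Predicate],
                      costs: Dict[str, float]) -> List[Predicate]:
    """Order predicates for ONE document: cheap and selective ones first.

    `costs` maps attribute name to the assumed extraction cost for this
    specific document, so two documents can receive different plans.
    """
    def rank(p: Predicate) -> float:
        # Lower rank runs earlier: low cost, high chance of rejecting the doc.
        return costs[p.attribute] / max(1e-9, 1.0 - p.pass_rate)

    return sorted(predicates, key=rank)


def evaluate(doc_id: str,
             predicates: List[Predicate],
             costs: Dict[str, float],
             extract: Callable[[str, str], str]) -> Tuple[bool, float]:
    """Run the per-document plan, stopping at the first failed predicate.

    Returns (document_matches, total_extraction_cost_spent).
    """
    spent = 0.0
    for p in plan_for_document(predicates, costs):
        value = extract(doc_id, p.attribute)   # stand-in for an LLM call
        spent += costs[p.attribute]
        if not p.test(value):
            return False, spent                # skip remaining extractions
    return True, spent
```

In this sketch, a document whose cheapest attribute already fails its predicate never pays for the more expensive extractions, which is the kind of saving a document-specific plan aims for.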

Original language: English
Pages (from-to): 5387-5390
Number of pages: 4
Journal: Proceedings of the VLDB Endowment
Volume: 18
Issue number: 12
DOI
Publication status: Published - 2025
Event: 51st International Conference on Very Large Data Bases, VLDB 2025 - London, United Kingdom
Duration: 1 Sep 2025 to 5 Sep 2025
