Researchers at Stanford Introduce SUQL: A Formal Query Language for Integrating Structured and Unstructured Data

Large Language Models (LLMs) have gained traction for their exceptional performance across a variety of tasks. Recent research aims to enhance their factuality by integrating external resources, including structured data and free text. However, many data sources, such as patient records and financial databases, contain a mix of both kinds of information. To answer a question like “Can you find me an Italian restaurant with a romantic atmosphere?”, an agent needs to combine the structured attribute cuisines with the free-text attribute reviews.

Previous chat systems typically employ classifiers to direct queries to specialized modules for handling structured data, unstructured data, or chitchat. However, this method falls short for questions requiring both structured and free-text data. Another approach converts structured data into free text, which forfeits both the precision of SQL queries over databases and the effectiveness of free-text retrievers. The necessity for hybrid data queries is underscored by datasets like HybridQA, which contains questions requiring information from both structured and free-text sources. Prior endeavours to ground question-answering systems on hybrid data either operate on small datasets, sacrifice the richness of structured data queries, or support only limited combinations of structured and unstructured knowledge queries.

Stanford researchers introduce an approach to grounding conversational agents in hybrid data sources, utilizing both structured data queries and free-text retrieval techniques. Their study empirically demonstrates that users frequently ask questions spanning both structured and unstructured data in real-life conversations, with over 49% of queries requiring knowledge from both types. To enhance expressiveness and precision, they propose SUQL (Structured and Unstructured Query Language), a formal language augmenting SQL with primitives for processing free text, enabling a combination of off-the-shelf retrieval models and LLMs with SQL semantics and operators.

SUQL’s design aims for expressiveness, accuracy, and efficiency. SUQL extends SQL with NLP operators like SUMMARY and ANSWER, facilitating full-spectrum queries on hybrid knowledge sources. Because LLMs are proficient at translating complex natural-language requests into SQL, they can be leveraged to generate complex SUQL queries as well. While SUQL queries can run on standard SQL compilers, a naive implementation may be inefficient. The paper details SUQL’s free-text primitives and highlights what distinguishes it from retrieval-based methods: the ability to express queries over hybrid data comprehensively rather than over free text alone.
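For illustration, the restaurant request from earlier might be expressed as a SUQL query along the following lines. This is a hypothetical sketch based on the SUMMARY and ANSWER primitives described above; the table name, column names, and exact syntax are assumptions for illustration, not taken from the paper.

```sql
-- Hypothetical schema: restaurants(name, cuisines, rating, reviews),
-- where cuisines is a structured array column and reviews is free text.
-- ANSWER poses a natural-language question to an LLM over the free text;
-- SUMMARY condenses it. Both are SUQL primitives, not standard SQL.
SELECT name, SUMMARY(reviews)
FROM restaurants
WHERE 'italian' = ANY (cuisines)                                  -- structured filter
  AND ANSWER(reviews, 'Is this restaurant romantic?') = 'Yes'     -- free-text filter
ORDER BY rating DESC
LIMIT 3;
```

Evaluating such a query naively would invoke the LLM on every row’s reviews, which illustrates why, as noted above, an unoptimized execution can be inefficient.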

Researchers evaluate SUQL through two experiments: one on HybridQA, a question-answering dataset, and another on real restaurant data from Yelp.com. The HybridQA experiment utilizes LLMs and SUQL to achieve 59.3% Exact Match (EM) and a 68.3% F1 score. SUQL outperforms existing in-context-learning models by 8.9% EM and 7.1% F1 on the test set. In the real-life restaurant experiments, SUQL demonstrates 93.8% and 90.3% turn accuracy on single-turn and conversational queries, respectively, surpassing linearization-based methods by up to 36.8% and 26.9%.

To conclude, this paper introduces SUQL as the first formal query language for hybrid knowledge corpora encompassing structured and unstructured data. Its innovation lies in integrating free-text primitives into a precise and succinct query framework. On HybridQA, in-context learning with SUQL achieves results within 8.9% of the state of the art, which was trained on 62K samples. Unlike prior methods, SUQL accommodates large databases and free-text corpora. Experiments on Yelp data demonstrate SUQL’s effectiveness, with a 90.3% success rate in satisfying user queries compared to 63.4% for linearization baselines.
