SQLyzr: A Comprehensive Benchmark and Framework for Evaluating Text-to-SQL Systems
Advisor
Özsu, M. Tamer
Publisher
University of Waterloo
Abstract
Natural language–to–SQL (text-to-SQL) systems aim to enable users to interact with relational databases using natural language instead of SQL. Recent advances in large language models have significantly improved the performance of these systems, making them increasingly practical for real-world applications. With the rapid pace of progress and the growing adoption of text-to-SQL systems, robust benchmarking has become essential. However, existing benchmarks typically rely on a single correctness metric, lack alignment with real-world query usage patterns, and do not evaluate the scalability of generated queries, which limits their ability to provide realistic and practically meaningful evaluation.
This thesis introduces SQLyzr, a comprehensive text-to-SQL benchmark and evaluation framework designed to address these limitations. SQLyzr incorporates a fine-grained taxonomy of SQL queries and reports evaluation results at the level of query categories and subcategories, enabling detailed insights into system performance across different query types. In addition, SQLyzr extends traditional evaluation by introducing complementary metrics that assess not only the correctness but also the efficiency and structural complexity of generated SQL queries. To better reflect real-world usage, SQLyzr aligns the distribution of query categories with empirical SQL workload distributions and supports dataset scaling to enable evaluation on larger databases.
Building on these ideas, we also introduce a configurable text-to-SQL benchmarking framework that allows users to customize and extend benchmark components such as workloads, datasets, and evaluation metrics. The framework further provides novel features, including detailed error analysis for identifying queries that are incorrect only due to minor issues, and workload augmentation for synthesizing additional question-SQL pairs that target the weaknesses of a given text-to-SQL system.
We use SQLyzr to evaluate two state-of-the-art text-to-SQL systems that achieve similar overall correctness scores. Our results demonstrate that SQLyzr enables a clearer comparison between systems and reveals deeper insights into their relative strengths and weaknesses.