SQLyzr: A Comprehensive Benchmark and Framework for Evaluating Text-to-SQL Systems
Advisor
Özsu, M. Tamer
Publisher
University of Waterloo
Abstract
Natural language–to–SQL (text-to-SQL) systems aim to enable users to interact with relational databases using natural language instead of SQL. Recent advances in large language models have significantly improved the performance of these systems, making them increasingly practical for real-world applications. With the rapid pace of progress and the growing adoption of text-to-SQL systems, robust benchmarking has become essential. However, existing benchmarks typically rely on a single correctness metric, lack alignment with real-world query usage patterns, and do not evaluate the scalability of generated queries, which limits their ability to provide realistic and practically meaningful evaluation.
This thesis introduces SQLyzr, a comprehensive text-to-SQL benchmark and evaluation framework designed to address these limitations. SQLyzr incorporates a fine-grained taxonomy of SQL queries and reports evaluation results at the level of query categories and subcategories, enabling detailed insights into system performance across different query types. In addition, SQLyzr extends traditional evaluation by introducing complementary metrics that assess not only the correctness but also the efficiency and structural complexity of generated SQL queries. To better reflect real-world usage, SQLyzr aligns the distribution of query categories with empirical SQL workload distributions and supports dataset scaling to enable evaluation on larger databases.
Building on these ideas, we also introduce a configurable text-to-SQL benchmarking framework that allows users to customize and extend benchmark components such as workloads, datasets, and evaluation metrics. The framework further provides novel features, including detailed error analysis for identifying queries that are incorrect only due to minor issues, and workload augmentation for synthesizing additional question-SQL pairs that target the weaknesses of a given text-to-SQL system.
We use SQLyzr to evaluate two state-of-the-art text-to-SQL systems that achieve similar overall correctness scores. Our results demonstrate that SQLyzr enables a clearer comparison between systems and reveals deeper insights into their relative strengths and weaknesses.