Understanding the Evolution of Text-to-SQL with Large Language Models
The transformation of natural language (NL) queries into executable SQL statements—commonly known as Text-to-SQL—has become a cornerstone in democratizing access to relational databases. With the rise of Large Language Models (LLMs), this field has seen unprecedented advancements, enabling non-experts to interact with complex data systems using everyday language. This comprehensive survey explores the current state and future trajectory of Text-to-SQL technologies, focusing on model design, data synthesis, evaluation frameworks, error analysis, and real-world deployment challenges.
Modern Text-to-SQL systems no longer operate as monolithic end-to-end models. Instead, they follow a modular architecture composed of pre-processing, translation, and post-processing stages. This shift reflects the increasing complexity of real-world database environments and the need for fine-grained control over query generation.
Core Components of LLM-Powered Text-to-SQL Systems
Pre-Processing: Enhancing Input Understanding
Before any SQL generation occurs, effective pre-processing modules prepare the input for accurate translation.
Schema Linking
Schema linking identifies relevant database tables and columns based on the natural language query. In the LLM era, three primary strategies dominate:
- String Matching: Uses lexical similarity between NL and schema elements.
- Neural Network-Based: Leverages deep learning to capture semantic relationships.
- In-Context Learning (ICL): Employs LLMs like GPT-4 to dynamically link schema components using few-shot reasoning.
While ICL shows strong performance, it faces limitations with large schemas due to context length constraints.
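The string-matching strategy above can be sketched in a few lines. This is a minimal illustration, not a production linker: the `link_schema` function and the toy schema are hypothetical, and real systems add fuzzy matching, abbreviation expansion, and value-based signals on top of this.

```python
import re

def link_schema(question, schema):
    """Keep only tables/columns whose name fragments appear in the question."""
    tokens = set(re.findall(r"[a-z0-9]+", question.lower()))
    linked = {}
    for table, columns in schema.items():
        # a column matches if any underscore-separated fragment occurs in the question
        hits = [c for c in columns if set(c.lower().split("_")) & tokens]
        if hits or table.lower() in tokens:
            linked[table] = hits or columns
    return linked

schema = {
    "orders": ["order_id", "order_date", "customer_id"],
    "products": ["product_id", "name", "price"],
}
print(link_schema("How many orders were placed on each date?", schema))
# → {'orders': ['order_date']}
```

Because the match is purely lexical, a question using "purchases" instead of "orders" would link nothing, which is exactly the synonym weakness that motivates neural and ICL-based linking.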
Database Content Retrieval
This module extracts specific cell values referenced in the NL query (e.g., “orders placed on May 1st”). Approaches include:
- String Matching: Fast but struggles with synonyms.
- Neural Methods: Use attention mechanisms or knowledge graphs for better semantic alignment.
- Indexing Strategies: Employ BM25 or Locality-Sensitive Hashing for efficient retrieval in large databases.
Efficiency remains a key challenge, especially when dealing with dirty or voluminous data.
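To make the indexing idea concrete, here is a self-contained Okapi BM25 scorer over candidate cell values. It is a teaching sketch with standard default parameters (`k1=1.5`, `b=0.75`); real deployments use an inverted index or a library rather than scoring every cell.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(docs)
    # document frequency: in how many docs each term occurs
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for q in query.lower().split():
            if q not in tf:
                continue
            idf = math.log((n - df[q] + 0.5) / (df[q] + 0.5) + 1)
            score += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

cells = ["orders placed on may 1st", "may day parade photos", "product price list"]
scores = bm25_scores("orders on may 1st", cells)
best = cells[max(range(len(cells)), key=scores.__getitem__)]
# best → "orders placed on may 1st"
```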
Additional Information Acquisition
To improve accuracy, models incorporate domain-specific knowledge such as date conventions ("Labor Day" = May 1 in China), unit conversions, or business rules. Two main approaches exist:
- Sample-Based: Injects demonstration examples into prompts.
- Retrieval-Based: Dynamically fetches relevant context from external knowledge bases.
Despite their effectiveness, these methods increase token usage and computational cost.
Translation Phase: From Natural Language to SQL
The translation stage is where the actual NL-to-SQL conversion happens. It consists of several interrelated components.
Encoding Strategies
Encoding transforms unstructured text and structured schema into a format suitable for model processing.
- Sequential Encoding: Treats NL and schema as a flat token sequence (used in T5-based models).
- Graph-Based Encoding: Preserves relational structure using graph neural networks (e.g., RAT-SQL).
- Separate Encoding: Processes NL and schema independently before fusion (e.g., SC-Prompt).
Graph-based methods excel in complex joins but require more training data.
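Sequential encoding is the simplest of the three: question and schema are flattened into one token stream. The separator format below (`| table : col , col`) mirrors the convention popularized by T5-based Text-to-SQL models, though the exact delimiters vary by system and the `serialize` helper here is illustrative.

```python
def serialize(question, schema):
    """Flatten question and schema into a single sequence (sequential encoding)."""
    parts = [question]
    for table, columns in schema.items():
        parts.append(f"| {table} : " + " , ".join(columns))
    return " ".join(parts)

schema = {"singer": ["singer_id", "name", "age"],
          "concert": ["concert_id", "singer_id"]}
prompt = serialize("How old is each singer?", schema)
# "How old is each singer? | singer : singer_id , name , age | concert : concert_id , singer_id"
```

Note that this flat form discards foreign-key structure, which is precisely what graph-based encoders like RAT-SQL preserve.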
Decoding Strategies
Decoding determines how SQL tokens are generated step by step.
- Greedy Search: Fast but prone to early errors.
- Beam Search: Explores multiple paths for higher accuracy at increased computational cost.
- Constraint-Aware Incremental Decoding: Enforces SQL grammar during generation (e.g., PICARD), ensuring syntactic validity.
The latter is particularly valuable for generating deeply nested queries.
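The core idea behind constraint-aware incremental decoding can be shown with a toy validity check: at each generation step, continuations that would make the partial SQL unparseable are pruned. This is a drastically simplified stand-in for what PICARD does with a real incremental parser; the clause-order and parenthesis checks below are hypothetical simplifications.

```python
KEYWORD_ORDER = ["SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY"]

def is_valid_prefix(sql):
    """Toy incremental check: clause keywords must appear in canonical order,
    and parentheses must never close before they open."""
    upper = sql.upper()
    seen = [p for p in (upper.find(k) for k in KEYWORD_ORDER) if p != -1]
    if seen != sorted(seen):
        return False
    depth = 0
    for ch in sql:
        depth += ch == "("
        depth -= ch == ")"
        if depth < 0:
            return False
    return True

def constrained_step(prefix, candidates):
    """Keep only continuations that leave the partial query valid."""
    return [c for c in candidates if is_valid_prefix(prefix + " " + c)]

# a candidate that would close an unopened paren is pruned at generation time
survivors = constrained_step("SELECT name", [") WHERE id = 1", "FROM users"])
# survivors → ["FROM users"]
```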
Task-Specific Prompting Techniques
With LLMs, prompt engineering plays a crucial role:
- Chain-of-Thought (CoT): Encourages step-by-step reasoning, improving interpretability.
- Decomposition: Breaks down queries into subtasks (e.g., identify SELECT first, then WHERE).
These strategies enhance both accuracy and transparency but can increase latency.
Intermediate Representations
To bridge the gap between free-form NL and rigid SQL syntax, researchers use intermediate representations (IR):
- SQL-like Syntax Language (e.g., NatSQL): Simplifies SQL grammar while preserving structure.
- Sketch Structures: Defines templates with placeholders for entities.
IRs reduce ambiguity and allow models to focus on one aspect at a time.
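A sketch structure can be as simple as a SQL skeleton with named slots that the model fills one at a time. The template and slot names below are illustrative, not taken from any specific system.

```python
from string import Template

# A sketch is a SQL skeleton with placeholders; the model predicts
# slot values instead of emitting free-form SQL.
SKETCH = Template("SELECT $agg($col) FROM $table WHERE $cond")

def fill_sketch(slots):
    return SKETCH.substitute(slots)

query = fill_sketch({"agg": "COUNT", "col": "order_id",
                     "table": "orders", "cond": "order_date = '2023-05-01'"})
# "SELECT COUNT(order_id) FROM orders WHERE order_date = '2023-05-01'"
```

Because each slot has a constrained type (aggregate, column, table, predicate), the model's search space at every step is far smaller than over raw SQL tokens.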
Post-Processing: Refining the Output
Even advanced models produce imperfect SQL. Post-processing techniques help correct and validate outputs.
SQL Correction
Self-correction modules identify and fix syntax errors. For example, DIN-SQL uses zero-shot prompts to repair faulty queries.
Output Consistency
Self-consistency sampling generates multiple reasoning paths and selects the most frequent valid output, reducing randomness.
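A minimal form of self-consistency is a majority vote over sampled outputs after light normalization; production systems typically vote over execution results rather than query strings. The helper below is a sketch under that simplification.

```python
from collections import Counter

def self_consistency(candidates):
    """Pick the most frequent candidate after whitespace/case normalization."""
    normalized = [" ".join(c.split()).lower().rstrip(";") for c in candidates]
    winner, _ = Counter(normalized).most_common(1)[0]
    return winner

samples = [
    "SELECT name FROM users WHERE age > 30",
    "select name  from users where age > 30;",
    "SELECT name, age FROM users WHERE age > 30",
]
best = self_consistency(samples)
# → "select name from users where age > 30" (two of three samples agree)
```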
Execution-Guided Refinement
By executing candidate queries and analyzing results (e.g., NULL returns), systems can iteratively refine their output. CHESS and CodeS use this feedback loop effectively.
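The feedback loop can be approximated with SQLite: run each candidate, discard those that error out or return only empty/NULL results. This is a simplified sketch of the idea, not the CHESS or CodeS implementation; the `execution_filter` function and the toy table are hypothetical.

```python
import sqlite3

def execution_filter(candidates, setup_sql):
    """Keep candidates that execute without error and return non-NULL results."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_sql)
    survivors = []
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # syntax or schema error: discard this candidate
        if rows and any(v is not None for row in rows for v in row):
            survivors.append((sql, rows))
    conn.close()
    return survivors

setup = "CREATE TABLE users(name TEXT, age INT); INSERT INTO users VALUES ('Ada', 36);"
good = execution_filter(
    ["SELECT name FROM users WHERE age > 30",   # runs and returns a row: kept
     "SELECT nme FROM users"],                  # misspelled column: discarded
    setup)
```

In an iterative system, the error messages and empty results captured here would be fed back into the prompt for another generation round.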
N-Best Reranking
Top-k candidates are re-ranked using a secondary model or execution results, improving final selection accuracy.
Evaluating Text-to-SQL Performance: Beyond Accuracy Metrics
Accurate evaluation is essential for guiding development and deployment decisions.
Key Evaluation Metrics
- Execution Accuracy (EX): Measures whether the predicted SQL returns the same result set as the ground truth.
- Exact Match (EM): Requires full string-level match between predicted and reference SQL.
- Component Match (CM): Evaluates correctness per SQL clause (SELECT, WHERE, etc.).
- Valid Efficiency Score (VES): Introduced in BIRD, this metric weighs execution efficiency (e.g., runtime) alongside result correctness.
- Query Variance Testing (QVT): Tests robustness against paraphrased inputs.
Comprehensive Evaluation Frameworks
- NL2SQL360: Enables fine-grained analysis across different query types, domains, and complexity levels.
- MT-TEQL: Uses metamorphic testing to automatically generate input variations for stress-testing models.
These frameworks move beyond static benchmarks, offering scenario-based insights.
Error Analysis: Diagnosing Failures to Improve Models
Understanding why models fail is critical for improvement.
A Two-Level Error Taxonomy
We propose a structured approach:
- Error Localization: Identifies which SQL component contains the error (e.g., an incorrect JOIN condition).
- Cause of Error: Determines the root cause:
  - Schema linking failure
  - Misinterpreted database content
  - Missing domain knowledge
  - Logical reasoning gap
  - Syntax violation
This taxonomy helps developers pinpoint weaknesses and target improvements systematically.
Practical Guidance for Building Text-to-SQL Solutions
Roadmap for Optimizing LLMs in Text-to-SQL
Your optimization strategy should depend on two key factors:
Data Privacy
- Use open-source LLMs when handling sensitive data to maintain full control.
- Closed-source APIs may expose data to third parties.
Data Volume
- Large labeled datasets: Pre-train or fine-tune open-source models.
- Medium datasets: Fine-tuning works well.
- Small or no labeled data: Rely on few-shot or zero-shot prompting with closed-source LLMs.
Hardware availability and API budget also influence choices.
Decision Flow for Module Selection
Choose components based on your use case:
| Scenario | Recommended Module | Benefit | Trade-off |
|---|---|---|---|
| Complex schema | Schema linking | Reduces noise and token cost | Increases latency |
| Dirty or large DB | Index-based content retrieval | Improves speed | Requires index maintenance |
| Ambiguous queries | Chain-of-thought prompting | Enhances reasoning | Higher token cost |
| High accuracy needed | Execution-guided refinement | Filters invalid queries | Slower response time |
Balancing performance, cost, and reliability is key to successful deployment.
Open Challenges and Future Directions
Despite rapid progress, significant hurdles remain:
Open-Domain Text-to-SQL
Current systems assume a single known database. Real-world applications often require querying multiple databases across domains. Challenges include:
- Cross-database schema matching
- Federated query planning
- Result aggregation and conflict resolution
Cost-Efficient Solutions
LLM inference consumes large numbers of tokens, driving up cost. Hybrid approaches that combine lightweight PLMs with selective LLM calls show promise for reducing costs without sacrificing quality.
Trustworthiness and Debuggability
Users need confidence in generated SQL. Future systems must offer:
- Explainable reasoning traces
- Interactive debugging tools
- Semantic error detection
These features will be essential for enterprise adoption.
Frequently Asked Questions
What is Text-to-SQL?
Text-to-SQL is the process of converting natural language questions into executable SQL queries. It allows users without technical expertise to retrieve data from relational databases using plain English (or other languages).
Why are Large Language Models important for Text-to-SQL?
LLMs bring emergent reasoning capabilities that enable them to understand complex queries, handle ambiguity, and generate syntactically correct SQL—even with minimal training data—through in-context learning and chain-of-thought prompting.
How do you evaluate a Text-to-SQL system?
Evaluation goes beyond simple accuracy. Use execution accuracy to check result correctness, exact match for structural fidelity, VES for efficiency, and QVT for robustness across paraphrased inputs. Tools like NL2SQL360 provide comprehensive multi-angle assessments.
What causes most errors in Text-to-SQL models?
Common error sources include:
- Incorrect schema linking (wrong table/column selection)
- Misinterpretation of database values (e.g., date formats)
- Poor handling of aggregations or nested queries
- Failure to apply domain-specific knowledge
Can Text-to-SQL work without training data?
Yes—through zero-shot prompting with powerful LLMs like GPT-4. However, performance improves significantly with even small amounts of fine-tuning data or well-designed few-shot examples.
What are the limitations of current Text-to-SQL systems?
Key limitations include:
- High inference costs due to token consumption
- Lack of cross-database query support
- Limited interpretability and debugging capabilities
- Sensitivity to schema complexity and naming inconsistencies
Conclusion
Text-to-SQL has evolved from rule-based parsers to sophisticated LLM-powered systems capable of handling complex, real-world queries. While significant progress has been made, challenges around scalability, efficiency, trustworthiness, and open-domain applicability remain active areas of research. By adopting modular architectures, leveraging hybrid PLM/LLM strategies, and implementing robust evaluation and error analysis practices, developers can build more reliable and accessible data interfaces for the future.