Command a: An enterprise-ready large language model
12 Pith papers cite this work.
citing papers explorer
- NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions
NL2SQLBench is a new modular benchmarking framework that evaluates LLM NL2SQL methods across three core modules on existing datasets, exposing large accuracy gaps and computational inefficiency.
- CALIBER: Calibrating Confidence Before and After Reasoning in Language Models
CALIBER elicits and supervises pre-reasoning confidence with prompt-level success probability and post-reasoning confidence with answer-level correctness, cutting ECE by 52.5% on BigMathDigits for a 7B model while remaining competitive on accuracy.
- LLMs Get Lost In Multi-Turn Conversation
LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.
- Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators
LLM safety judges resist adjusting evaluations when given contradictory context or new safety definitions, despite some ability to learn from new information.
- On the Limits of Model Merging for Multilinguality in Pre-Training
Merging any combination of monolingual pre-trained models leads to performance collapse due to interference, indicating that merging flexibility from fine-tuning does not extend to pre-training.
- AgentNLQ: A General-Purpose Agent for Natural Language to SQL
A multi-agent LLM framework with schema enrichment and business rules achieves 78.1% semantic accuracy on the BIRD NL2SQL benchmark.
- Anthropogenic Regional Adaptation in Multimodal Vision-Language Model
Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
- Offline Evaluation Measures of Fairness in Recommender Systems
The thesis identifies theoretical, empirical, and conceptual flaws in offline fairness measures for recommender systems and contributes new evaluation methods and practical guidelines.
- Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-context performance, scaling, and efficiency to derive optimal design recipes.
- Think in English, Answer in Korean: Efficient Adaptation of Multilingual Tool-Using Agents
LuckyStar 111B adapts Cohere's Command A model with four scaling techniques to improve tool-use, math reasoning, and NL2SQL in Korean-English while preserving general instruction following.
- RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
- Reinforcement Learning from Human Feedback