Command A: An enterprise-ready large language model

Team Cohere, Arash Ahmadian, Marwan Ahmed, Jay Alammar, Milad Alizadeh, Yazeed Alnumay, Sophia Althammer, Arkady Arkhangorodsky, Viraat Aryabumi, Dennis Aumiller, et al. · 2025 · arXiv 2504.00698

12 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3 · dataset 1

citation-polarity summary

background 3 · use dataset 1

representative citing papers

NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions

cs.DB · 2026-04-13 · conditional · novelty 7.0

NL2SQLBench is a new modular benchmarking framework that evaluates LLM NL2SQL methods across three core modules on existing datasets, exposing large accuracy gaps and computational inefficiency.

CALIBER: Calibrating Confidence Before and After Reasoning in Language Models

cs.CL · 2026-06-23 · unverdicted · novelty 6.0

CALIBER elicits and supervises pre-reasoning confidence with prompt-level success probability and post-reasoning confidence with answer-level correctness, cutting ECE by 52.5% on BigMathDigits for a 7B model while remaining competitive on accuracy.

LLMs Get Lost In Multi-Turn Conversation

cs.CL · 2025-05-09 · unverdicted · novelty 6.0

LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.

Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

cs.AI · 2026-06-05 · unverdicted · novelty 5.0

LLM safety judges resist adjusting evaluations when given contradictory context or new safety definitions, despite some ability to learn from new information.

On the Limits of Model Merging for Multilinguality in Pre-Training

cs.CL · 2026-05-25 · unverdicted · novelty 5.0

Merging any combination of monolingual pre-trained models leads to performance collapse due to interference, indicating that merging flexibility from fine-tuning does not extend to pre-training.

AgentNLQ: A General-Purpose Agent for Natural Language to SQL

cs.AI · 2026-05-18 · unverdicted · novelty 5.0

A multi-agent LLM framework with schema enrichment and business rules achieves 78.1% semantic accuracy on the BIRD NL2SQL benchmark.

Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

cs.AI · 2026-04-13 · unverdicted · novelty 5.0

Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.

Offline Evaluation Measures of Fairness in Recommender Systems

cs.IR · 2026-04-27 · unverdicted · novelty 4.0

The thesis identifies theoretical, empirical, and conceptual flaws in offline fairness measures for recommender systems and contributes new evaluation methods and practical guidelines.

Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

cs.CL · 2025-10-06 · unverdicted · novelty 4.0

This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-context performance, scaling, and efficiency to derive optimal design recipes.

Think in English, Answer in Korean: Efficient Adaptation of Multilingual Tool-Using Agents

cs.AI · 2026-06-30 · unverdicted · novelty 3.0

LuckyStar 111B adapts Cohere's Command A model with four scaling techniques to improve tool-use, math reasoning, and NL2SQL in Korean-English while preserving general instruction following.

RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

cs.LG · 2026-02-20 · 2 refs

Reinforcement Learning from Human Feedback

cs.LG · 2025-04-16

citing papers explorer

Showing 12 of 12 citing papers.

NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions

cs.DB · 2026-04-13 · conditional · none · ref 7

NL2SQLBench is a new modular benchmarking framework that evaluates LLM NL2SQL methods across three core modules on existing datasets, exposing large accuracy gaps and computational inefficiency.

CALIBER: Calibrating Confidence Before and After Reasoning in Language Models

cs.CL · 2026-06-23 · unverdicted · none · ref 4

CALIBER elicits and supervises pre-reasoning confidence with prompt-level success probability and post-reasoning confidence with answer-level correctness, cutting ECE by 52.5% on BigMathDigits for a 7B model while remaining competitive on accuracy.

LLMs Get Lost In Multi-Turn Conversation

cs.CL · 2025-05-09 · unverdicted · none · ref 16

LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.

Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

cs.AI · 2026-06-05 · unverdicted · none · ref 27

LLM safety judges resist adjusting evaluations when given contradictory context or new safety definitions, despite some ability to learn from new information.

On the Limits of Model Merging for Multilinguality in Pre-Training

cs.CL · 2026-05-25 · unverdicted · none · ref 12

Merging any combination of monolingual pre-trained models leads to performance collapse due to interference, indicating that merging flexibility from fine-tuning does not extend to pre-training.

AgentNLQ: A General-Purpose Agent for Natural Language to SQL

cs.AI · 2026-05-18 · unverdicted · none · ref 1

A multi-agent LLM framework with schema enrichment and business rules achieves 78.1% semantic accuracy on the BIRD NL2SQL benchmark.

Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

cs.AI · 2026-04-13 · unverdicted · none · ref 19

Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.

Offline Evaluation Measures of Fairness in Recommender Systems

cs.IR · 2026-04-27 · unverdicted · none · ref 47

The thesis identifies theoretical, empirical, and conceptual flaws in offline fairness measures for recommender systems and contributes new evaluation methods and practical guidelines.

Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

cs.CL · 2025-10-06 · unverdicted · none · ref 9

This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-context performance, scaling, and efficiency to derive optimal design recipes.

Think in English, Answer in Korean: Efficient Adaptation of Multilingual Tool-Using Agents

cs.AI · 2026-06-30 · unverdicted · none · ref 1

LuckyStar 111B adapts Cohere's Command A model with four scaling techniques to improve tool-use, math reasoning, and NL2SQL in Korean-English while preserving general instruction following.

RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

cs.LG · 2026-02-20 · unreviewed · ref 5 · 2 links

Reinforcement Learning from Human Feedback

cs.LG · 2025-04-16 · unreviewed · ref 57
