Cornetto is the first benchmark that synthesizes 231 network misconfiguration problems across topologies of 20-754 nodes and uses formal verification to show that nine state-of-the-art LLMs often introduce regressions and degrade at scale.
hub
Chain-of-thought prompting elicits reasoning in large language models
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
Latent chain-of-thought via recurrent feedback tokens from compressed hidden states improves transformer performance on time-series forecasting and tabular prediction across 36 datasets.
StepFly automates TSG execution via TSG Mentor, LLM-based DAG extraction with QPPs, and a DAG-guided parallel scheduler, reaching 94% success on GPT-4.1 with 32.9-70.4% time savings on parallelizable guides.
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
BEDTime benchmark tests 17 models on describing time series structure and finds vision-language models outperform dedicated time-series-language models and language-only approaches, with all models fragile to robustness tests.
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.
A small language model fine-tuned on tool-augmented chain-of-thought data generated by a larger LLM learns to selectively call tools, delivering better content moderation accuracy at lower inference cost.
Proposes token-significance and dynamic length rewards in RL to reduce LLM response length while preserving or improving reasoning correctness across benchmarks.
TableMaster improves LM table understanding by verbalizing tables with enriched semantics and using adaptive textual-symbolic reasoning, reaching 78.13% accuracy on WikiTQ with GPT-4o-mini.
A survey deriving a unified policy gradient framework for LLM post-training methods and providing technical comparisons of PPO, GRPO, DPO variants.
The paper introduces a collaborative multi-agent framework for LLMs and applies it conceptually to existing models like Auto-GPT, BabyAGI, and Gorilla through case studies in domains such as courtroom simulations and software development.
citing papers explorer
-
Benchmarking LLM-Driven Network Configuration Repair
Cornetto is the first benchmark that synthesizes 231 network misconfiguration problems across topologies of 20-754 nodes and uses formal verification to show that nine state-of-the-art LLMs often introduce regressions and degrade at scale.
-
Latent Chain-of-Thought Improves Structured-Data Transformers
Latent chain-of-thought via recurrent feedback tokens from compressed hidden states improves transformer performance on time-series forecasting and tabular prediction across 36 datasets.
-
StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis
StepFly automates TSG execution via TSG Mentor, LLM-based DAG extraction with QPPs, and a DAG-guided parallel scheduler, reaching 94% success on GPT-4.1 with 32.9-70.4% time savings on parallelizable guides.
-
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
-
BEDTime: A Unified Benchmark for Automatically Describing Time Series
BEDTime benchmark tests 17 models on describing time series structure and finds vision-language models outperform dedicated time-series-language models and language-only approaches, with all models fragile to robustness tests.
-
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
-
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
-
Llemma: An Open Language Model For Mathematics
Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.
-
Tool-MCoT: Tool Augmented Multimodal Chain-of-Thought for Content Safety Moderation
A small language model fine-tuned on tool-augmented chain-of-thought data generated by a larger LLM learns to selectively call tools, delivering better content moderation accuracy at lower inference cost.
-
Not All Tokens Matter: Towards Efficient LLM Reasoning via Token Significance in Reinforcement Learning
Proposes token-significance and dynamic length rewards in RL to reduce LLM response length while preserving or improving reasoning correctness across benchmarks.
-
TableMaster: A Recipe to Advance Table Understanding with Language Models
TableMaster improves LM table understanding by verbalizing tables with enriched semantics and using adaptive textual-symbolic reasoning, reaching 78.13% accuracy on WikiTQ with GPT-4o-mini.
-
Reinforcement Learning for LLM Post-Training: A Survey
A survey deriving a unified policy gradient framework for LLM post-training methods and providing technical comparisons of PPO, GRPO, DPO variants.
-
Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents
The paper introduces a collaborative multi-agent framework for LLMs and applies it conceptually to existing models like Auto-GPT, BabyAGI, and Gorilla through case studies in domains such as courtroom simulations and software development.