Manning, Peter Henderson, and Daniel E
13 Pith papers cite this work.
Citing papers by year: 2026 (13)
Citation roles: background (2)
Citation polarities: background (2)
Representative citing papers:
NJ BriefBank is a domain-adapted legal retrieval tool for public defenders that improves on standard benchmarks by incorporating legal reasoning, domain data, and synthetic examples, with a new released taxonomy and annotated evaluation dataset.
Citing papers
-
Mean-based algorithms: A lower bound and regret
Derives the first lower bound on γ_t for mean-based algorithms in unknown-horizon bandit settings, proposes two new algorithms, and shows that some of them are also no-regret.
-
Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering
LLMs show severe staleness after training cutoffs and recency bias on historical German statutes; RAG with version filtering mitigates both better than web search.
-
Justified or Just Convincing? Error Verifiability as a Dimension of LLM Quality
Error verifiability is a dimension of LLM quality distinct from accuracy; improving it requires targeted, domain-aware interventions such as reflect-and-rephrase and oracle-rephrase.
-
Beyond Probabilistic Similarity: Structural, Temporal, and Causal Limitations of Retrieval-Augmented Generation in the Legal Domain
The paper identifies three pathologies of probabilistic RAG in legal retrieval (mereological blindness, diachronic blindness, causal opacity) and derives four deterministic architectural commitments to address the hierarchical, temporal, and institutional structure of legal knowledge.
-
HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models
HARVE removes the component of the reward-head vector aligned with a multi-directional hacking subspace from residual streams using a small set of contrastive examples, improving robustness on RewardHackBench across eight models without fine-tuning while preserving general capability.
-
Fairness vs Performance: Characterizing the Pareto Frontier of Algorithmic Decision Systems
The Pareto frontier of fair algorithmic decisions consists of deterministic, group-specific threshold rules on predicted success probabilities; for some fairness metrics these rules can include upper bounds, and the result holds regardless of how the model is trained.
-
Kernel Affine Hull Machines as Compute-Efficient Encoders for Frozen Semantic Spaces
KAHM yields a compute-efficient query encoder that outperforms matched learned adapters in reconstructing a frozen Mixedbread embedding space on an Austrian-law retrieval task while delivering an 8.53x CPU speedup.
-
A Survey of Reasoning-Intensive Retrieval: Progress and Challenges
A survey that categorizes RIR benchmarks by domain and modality, proposes a taxonomy for integrating reasoning into retrieval pipelines, and outlines key challenges.
-
Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models
Introduces BMC, a manifold bandit framework that organizes problems into a hierarchical task tree and applies Bayesian learning to balance productivity, diversity, and utility in LLM curriculum sampling.
-
GradeLegal: Automated Grading for German Legal Cases
Reasoning-oriented LLMs reach up to 0.91 quadratic weighted kappa agreement with experts on public law cases when given sample solutions and grading rubrics, but only 0.60 on criminal law cases.
-
Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization
Automatic prompt optimization using lenient LLM judges improves performance and transferability in legal QA evaluations compared with human-designed prompts or strict judges.
-
Legal Domain Adaptation of Modern BERT Models
Further pre-training ModernBERT on US court opinions improves results on legal datasets compared with the base model, with gains similar to those reported in early BERT domain-adaptation work.