Bamboo: A comprehensive benchmark for evaluating long text modeling capacities of large language models
3 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: background (1).
Citing papers
- RULER: What's the Real Context Size of Your Long-Context Language Models?
  RULER shows that most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining their results at 32K tokens.
- SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding
  SPEED-Bench is a new standardized benchmark for speculative decoding that supplies semantically diverse qualitative data and throughput-oriented splits across concurrency levels, integrated with vLLM and TensorRT-LLM.
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
  The paper surveys hallucination in LLMs, presenting an innovative taxonomy, contributing factors, detection methods, benchmarks, mitigation strategies, and open research directions.