super hub Mixed citations

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Ahmed Awadallah, Ammar Ahmad Awan, Hany Awadalla, Jyoti Aneja, Marah Abdin, Nguyen Bach · 2024 · cs.CL · arXiv 2404.14219

Mixed citation behavior. Most common role is background (62%).

204 Pith papers citing it

Background 62% of classified citations

open full Pith review browse 204 citing papers more from Ahmed Awadallah arXiv PDF

abstract

We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with a 7B, 14B models trained for 4.8T tokens, called phi-3-small, phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75%, 78% on MMLU, and 8.7, 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and text prompts, as well as multi-image and text prompts.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 25 baseline 9 method 4 dataset 2

citation-polarity summary

background 25 baseline 9 use method 4 use dataset 2

claims ledger

abstract We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide param

authors

Ahmed Awadallah Ammar Ahmad Awan Hany Awadalla Jyoti Aneja Marah Abdin Nguyen Bach

co-cited works

representative citing papers

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

cs.CL · 2026-06-14 · unverdicted · novelty 8.0

EHRNote-ChatQA is the first benchmark for evidence-grounded multi-turn clinical QA over longitudinal discharge summaries, containing 16,072 medical-expert-verified pairs across eight categories and revealing LLM weaknesses in evidence grounding and multi-turn consistency.

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

cs.LG · 2026-06-01 · unverdicted · novelty 8.0

KV cache quantization silently erodes LLM safety alignment via vulnerable low-dimensional subspaces, diagnosed by Per-Channel Reduction into three failure modes and mitigated training-free with up to 97% recovery.

Acceptance Cards:A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this protocol SafeLoRA fails the full-card pass on Gemma-2-2B-it.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

RULER: What's the Real Context Size of Your Long-Context Language Models?

cs.CL · 2024-04-09 · accept · novelty 8.0

RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

Disentangling Pictorial Cue Understanding from Language Bias in VLMs via Depth Ordering Task

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

VLMs show chance-level depth ordering performance (47-56%) on controlled images, driven by language bias rather than pictorial cues, with no improvement from CoT or ICL.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

GaussLite: Online Task-Conditioned 3D Gaussian Splatting for Real-Time Robotic Mapping

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

GaussLite conditions 3D Gaussian Splatting seeding density, gradient flow, and scaling on task relevance masks derived from LLM-parsed natural language and open-vocabulary detection, yielding +2.72 dB ROI PSNR gains on Replica and +2.23 dB on real hardware at fixed budget.

Information Dynamics of Language Communication

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

The paper defines STE and SPID, two information-theoretic measures of semantic flow and decomposition in language exchanges, and applies them to four dialogue datasets.

Obliviate: Erasing Concepts from Autoregressive Image Generation Models

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

Obliviate erases targeted concepts from autoregressive image generators via KL supervision on visual tokens over full trajectories, cutting nudity rates sharply on benchmarks while keeping general performance.

Looking Is Not Picking: An Attention-Segment Account of Tool-Selection Failures in LLM Agents

cs.AI · 2026-06-15 · unverdicted · novelty 7.0

Attention analysis shows that LLM tool selection failures occur at the readout/decision stage, not because the model fails to attend to the correct tool definition.

Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization

cs.CL · 2026-06-11 · unverdicted · novelty 7.0

Fine-tuned Mistral-7B via QLoRA achieves up to 12% higher F1 than GPT-4o on biomedical claim verification with 1008 examples, identifies a structural shortcut in SciFact, and shows robust cross-domain transfer from sound data.

UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

cs.CL · 2026-06-05 · unverdicted · novelty 7.0

UrduMMLU is a new native-source MCQ benchmark for Urdu that reveals top LLMs reach only ~90% accuracy with large gaps on region-specific humanities content.

Learning What to Forget: Improving LLM Unlearning via Learned Token-Level Importance

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

ATWU jointly optimizes model parameters and token weights via a linear scorer on hidden states, recovering oracle forget-specific tokens under a separation condition and achieving SOTA forget-retain trade-offs on TOFU and RWKU.

DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

DisasterBench is a new multi-stage multimodal reasoning benchmark for UAV disaster response with 14 scenes and 9 tasks; the accompanying 2B DisasterVL model outperforms open-source MLLMs and approaches GPT-4o efficiency.

Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

Defines representational capacity as the upper bound on distinguishable near-orthogonal directions in transformer latent spaces, derived from embedding similarity distributions and an adjusted Johnson-Lindenstrauss formula dependent on the k/d ratio.

Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

cs.DC · 2026-06-01 · unverdicted · novelty 7.0

A new fault-injection framework enables a systematic empirical study that produces 17 takeaways on error propagation in LLM inference and four software-only mitigation directions.

CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

CultureForest benchmark shows top LLMs degrade sharply on open-ended cultural reasoning tasks, exhibit regional disparities, and are limited more by effective use of knowledge than by lack of knowledge itself.

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

No tested model showed robust format-independent refusal on biosecurity hazards; a new divergence score between behavioral labels and SAE activations separated responses in one preliminary case.

AsymVLM: Asymmetric Token Pruning for Efficient Vision-Language Model Inference

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

AsymVLM introduces asymmetric token pruning for vision and text in VLMs to deliver up to 54% FLOPs reduction while matching or exceeding prior methods on localized visual tasks.

citing papers explorer

Showing 50 of 204 citing papers.

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries cs.CL · 2026-06-14 · unverdicted · none · ref 1 · internal anchor
EHRNote-ChatQA is the first benchmark for evidence-grounded multi-turn clinical QA over longitudinal discharge summaries, containing 16,072 medical-expert-verified pairs across eight categories and revealing LLM weaknesses in evidence grounding and multi-turn consistency.
Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation cs.LG · 2026-06-01 · unverdicted · none · ref 72 · internal anchor
KV cache quantization silently erodes LLM safety alignment via vulnerable low-dimensional subspaces, diagnosed by Per-Channel Reduction into three failure modes and mitigated training-free with up to 97% recovery.
Acceptance Cards:A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims cs.CR · 2026-05-11 · unverdicted · none · ref 24 · internal anchor
Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this protocol SafeLoRA fails the full-card pass on Gemma-2-2B-it.
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks cs.CL · 2026-04-19 · unverdicted · none · ref 84 · internal anchor
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection cs.CL · 2024-10-06 · unverdicted · none · ref 2 · internal anchor
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models cs.CV · 2024-09-25 · accept · none · ref 1 · internal anchor
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark cs.CL · 2024-09-04 · accept · none · ref 1 · internal anchor
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
LiveBench: A Challenging, Contamination-Limited LLM Benchmark cs.CL · 2024-06-27 · unverdicted · none · ref 1 · internal anchor
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
RULER: What's the Real Context Size of Your Long-Context Language Models? cs.CL · 2024-04-09 · accept · none · ref 1 · internal anchor
RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
Disentangling Pictorial Cue Understanding from Language Bias in VLMs via Depth Ordering Task cs.CV · 2026-07-01 · unverdicted · none · ref 1 · internal anchor
VLMs show chance-level depth ordering performance (47-56%) on controlled images, driven by language bias rather than pictorial cues, with no improvement from CoT or ICL.
FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model cs.SD · 2026-06-30 · unverdicted · none · ref 211 · internal anchor
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
GaussLite: Online Task-Conditioned 3D Gaussian Splatting for Real-Time Robotic Mapping cs.CV · 2026-06-29 · unverdicted · none · ref 41 · internal anchor
GaussLite conditions 3D Gaussian Splatting seeding density, gradient flow, and scaling on task relevance masks derived from LLM-parsed natural language and open-vocabulary detection, yielding +2.72 dB ROI PSNR gains on Replica and +2.23 dB on real hardware at fixed budget.
Information Dynamics of Language Communication cs.CL · 2026-06-29 · unverdicted · none · ref 73 · internal anchor
The paper defines STE and SPID, two information-theoretic measures of semantic flow and decomposition in language exchanges, and applies them to four dialogue datasets.
Obliviate: Erasing Concepts from Autoregressive Image Generation Models cs.CV · 2026-06-26 · unverdicted · none · ref 1 · internal anchor
Obliviate erases targeted concepts from autoregressive image generators via KL supervision on visual tokens over full trajectories, cutting nudity rates sharply on benchmarks while keeping general performance.
Looking Is Not Picking: An Attention-Segment Account of Tool-Selection Failures in LLM Agents cs.AI · 2026-06-15 · unverdicted · none · ref 21 · internal anchor
Attention analysis shows that LLM tool selection failures occur at the readout/decision stage, not because the model fails to attend to the correct tool definition.
Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization cs.CL · 2026-06-11 · unverdicted · none · ref 6 · internal anchor
Fine-tuned Mistral-7B via QLoRA achieves up to 12% higher F1 than GPT-4o on biomedical claim verification with 1008 examples, identifies a structural shortcut in SciFact, and shows robust cross-domain transfer from sound data.
UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding cs.CL · 2026-06-05 · unverdicted · none · ref 32 · internal anchor
UrduMMLU is a new native-source MCQ benchmark for Urdu that reveals top LLMs reach only ~90% accuracy with large gaps on region-specific humanities content.
Learning What to Forget: Improving LLM Unlearning via Learned Token-Level Importance cs.LG · 2026-06-04 · unverdicted · none · ref 39 · internal anchor
ATWU jointly optimizes model parameters and token weights via a linear scorer on hidden states, recovering oracle forget-specific tokens under a separation condition and achieving SOTA forget-retain trade-offs on TOFU and RWKU.
DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments cs.CV · 2026-06-04 · unverdicted · none · ref 59 · internal anchor
DisasterBench is a new multi-stage multimodal reasoning benchmark for UAV disaster response with 14 scenes and 9 tasks; the accompanying 2B DisasterVL model outperforms open-source MLLMs and approaches GPT-4o efficiency.
Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models cs.LG · 2026-06-01 · unverdicted · none · ref 27 · internal anchor
Defines representational capacity as the upper bound on distinguishable near-orthogonal directions in transformer latent spaces, derived from embedding similarity distributions and an adjusted Johnson-Lindenstrauss formula dependent on the k/d ratio.
Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference cs.DC · 2026-06-01 · unverdicted · none · ref 1 · internal anchor
A new fault-injection framework enables a systematic empirical study that produces 17 takeaways on error propagation in LLM inference and four software-only mitigation directions.
CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs cs.CL · 2026-06-01 · unverdicted · none · ref 1 · internal anchor
CultureForest benchmark shows top LLMs degrade sharply on open-ended cultural reasoning tasks, exhibit regional disparities, and are limited more by effective use of knowledge than by lack of knowledge itself.
BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders cs.AI · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
No tested model showed robust format-independent refusal on biosecurity hazards; a new divergence score between behavioral labels and SAE activations separated responses in one preliminary case.
AsymVLM: Asymmetric Token Pruning for Efficient Vision-Language Model Inference cs.LG · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
AsymVLM introduces asymmetric token pruning for vision and text in VLMs to deliver up to 54% FLOPs reduction while matching or exceeding prior methods on localized visual tasks.
Large Language Model Selection with Limited Annotations cs.CL · 2026-05-24 · unverdicted · none · ref 1 · internal anchor
SELECT-LLM is the first active model selection framework for LLMs that uses expected information gain from pairwise output similarities to minimize required annotations, reporting up to 84.78% cost reduction across 23 datasets and 156 models.
IQA-Spider: Unifying Multi-Granularity Image Quality Assessment with Reasoning, Grounding and Referring cs.CV · 2026-05-23 · unverdicted · none · ref 1 · internal anchor
IQA-Spider unifies reasoning, grounding, and referring for multi-granularity image quality assessment via a four-task paradigm and two-stage LMM training with training-free text-to-point mapping.
Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning cs.CL · 2026-05-22 · unverdicted · none · ref 1 · internal anchor
Representational convergence across 16 LLMs on 800 reasoning problems is stronger for failed tasks and pre-decision stages but shows minimal causal influence on predictions, pointing to shared processing constraints over shared reasoning.
TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization cs.CL · 2026-05-20 · unverdicted · none · ref 49 · internal anchor
TextReg mitigates prompt distributional overfitting via regularized text-space optimization, reporting up to +16.5% OOD accuracy gains over prior methods on reasoning benchmarks.
Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment cs.LG · 2026-05-13 · unverdicted · none · ref 1 · internal anchor
Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment performance.
DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models cs.AI · 2026-05-12 · unverdicted · none · ref 12 · internal anchor
DisaBench supplies a participatory taxonomy of twelve disability harm types, paired benign-adversarial prompts across seven life domains, and human-annotated data showing that standard safety tests miss context-dependent harms.
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark cs.CV · 2026-05-12 · unverdicted · none · ref 43 · internal anchor
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents cs.MA · 2026-05-05 · unverdicted · none · ref 1 · internal anchor
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs cs.LG · 2026-05-01 · unverdicted · none · ref 14 · internal anchor
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks cs.CR · 2026-04-30 · unverdicted · none · ref 1 · internal anchor
MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates substantially across multiple models.
Evaluating Temporal Consistency in Multi-Turn Language Models cs.CL · 2026-04-24 · unverdicted · none · ref 1 · internal anchor
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding eess.AS · 2026-04-24 · unverdicted · none · ref 23 · internal anchor
LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation cs.CL · 2026-04-22 · conditional · none · ref 19 · internal anchor
Clinical narrative format beats raw JSON for LLMs up to 8B parameters on medication reconciliation but raw JSON wins at 70B scale, with omissions as the main error type.
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms cs.CL · 2026-04-21 · unverdicted · none · ref 7 · internal anchor
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
DocQAC: Adaptive Trie-Guided Decoding for Effective In-Document Query Auto-Completion cs.IR · 2026-04-20 · conditional · none · ref 1 · internal anchor
Adaptive trie-guided decoding with document context and tunable penalties improves in-document query auto-completion, outperforming baselines and larger models like LLaMA-3 on seen queries.
Towards Unconstrained Human-Object Interaction cs.CV · 2026-04-15 · unverdicted · none · ref 1 · internal anchor
Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation cs.CV · 2026-04-15 · conditional · none · ref 55 · internal anchor
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing cs.CV · 2026-04-10 · unverdicted · none · ref 1 · internal anchor
GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.
SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses cs.CV · 2026-02-26 · conditional · none · ref 1 · internal anchor
SUPERGLASSES is the first VQA benchmark built from actual smart glasses data, and SUPERLENS is an agent using automatic object detection, query decoupling, and multimodal search that outperforms GPT-4o by 2.19% on it.
Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models cs.CL · 2026-02-10 · unverdicted · none · ref 1 · internal anchor
Top-W applies Wasserstein-regularized truncation on token-embedding geometry to create a closed-form optimal crop for LLM sampling that outperforms prior methods by up to 33.7% on GSM8K, GPQA, AlpacaEval, and MT-Bench.
LSTM-MAS: A Long Short-Term Memory Inspired Multi-Agent System for Long-Context Understanding cs.CL · 2026-01-17 · unverdicted · none · ref 48 · internal anchor
LSTM-MAS uses a chained multi-agent architecture modeled on LSTM input, forget, and output gates to improve long-context QA performance and reduce hallucinations compared with prior multi-agent baselines.
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance? cs.CV · 2025-11-27 · unverdicted · none · ref 18 · internal anchor
Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.
Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation cs.CL · 2025-09-02 · unverdicted · none · ref 1 · internal anchor
Top-H decoding is a computationally efficient greedy algorithm for an entropy-constrained mass maximization problem that improves the creativity-coherence trade-off over min-p sampling in LLM text generation.
PRIMETIME : Limits of LLMs in Temporal Primitives cs.NE · 2025-04-22 · unverdicted · none · ref 99 · internal anchor
PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding cs.CV · 2025-04-14 · unverdicted · none · ref 1 · internal anchor
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory cs.CL · 2024-10-14 · unverdicted · none · ref 58 · internal anchor
LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer