
arxiv: 2606.19706 · v1 · pith:D6RYSXRUnew · submitted 2026-06-18 · 💻 cs.CV · cs.CL

NEST: Narrative Event Structures in Time for Long Video Understanding

Pith reviewed 2026-06-26 17:55 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords long video understanding · narrative events · event detection · movie dataset · event relation extraction · multimodal annotations · video benchmarks · temporal relations

The pith

The NEST dataset of 1005 movies shows vision-language models fail at discovering narrative events but handle their relations better once events are supplied.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark to measure whether models can track how low-level actions become events and how those events link into coherent stories across full-length videos with gaps and flashbacks. It supplies 1005 movies each with 102 events annotated from visual content, dialogue, and audio, connected by temporal order, hierarchy, and long-range links. Models score below 8 percent on event trigger detection, under 6 percent on localization, and below 11 percent on argument extraction, yet reach 35 percent F1 zero-shot on relation extraction. A reader would care because narrative understanding requires connecting distant plot points rather than retrieving isolated moments.

Core claim

The authors claim that the capacity to ingest long token streams does not produce narrative understanding, and they demonstrate this by releasing NEST together with baselines showing that grounded event discovery stays very hard while relation extraction becomes tractable once events are given.

What carries the argument

The NEST dataset of 1005 full-length movies each annotated with 102 multimodal narrative events and the relations among them.
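
As a rough picture of what such an annotation might look like, here is a minimal Python sketch. The class and field names (`Event`, `Relation`, `Movie`, `gap_minutes`) are hypothetical and do not come from the NEST release. The relation label is one shown in the figure captions on this page, and the timestamps are placeholder values that only approximate the Caddo Lake example.

```python
from dataclasses import dataclass, field

# Hypothetical schema: names are illustrative, not NEST's actual file format.
@dataclass
class Event:
    event_id: str
    trigger: str                      # e.g. "attack", "help", "search"
    start_s: float                    # span start, seconds from movie start
    end_s: float                      # span end, seconds from movie start
    modality: str                     # "visual", "dialogue", or "sound"
    arguments: dict[str, str] = field(default_factory=dict)  # role -> value

@dataclass
class Relation:
    head: str                         # event_id of the earlier event
    tail: str                         # event_id of the later event
    label: str                        # e.g. "TEMPORAL", "CAUSAL", "PRECONDITIONED"

@dataclass
class Movie:
    title: str
    events: list[Event]
    relations: list[Relation]

    def gap_minutes(self, rel: Relation) -> float:
        """Minutes between the end of the head event and the start of the tail event."""
        by_id = {e.event_id: e for e in self.events}
        return (by_id[rel.tail].start_s - by_id[rel.head].end_s) / 60.0

# Placeholder spans: only "near 19 min" and "near 71 min" come from the page.
help_event = Event("E1", "help", 19 * 60, 20 * 60, "visual")
search_event = Event("E2", "search", 71 * 60, 72 * 60, "visual")
rel = Relation("E1", "E2", "PRECONDITIONED")
movie = Movie("Caddo Lake", [help_event, search_event], [rel])
gap = movie.gap_minutes(rel)
```

Keeping relations as separate records that point at event IDs, rather than nesting them inside events, makes long-range links such as this one easy to enumerate and measure.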

If this is right

  • Grounded discovery of events from video, dialogue, and audio forms the main remaining obstacle for long-video narrative tasks.
  • Relation extraction improves markedly once events are supplied, reaching 44 percent F1 after fine-tuning.
  • Standard long-video benchmarks centered on retrieval miss the harder problem of linking events across time.
  • Multimodal grounding is required because many events depend on dialogue or audio cues in addition to visuals.
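
For readers who want a concrete sense of how an F1 score over relation labels can be computed, the following is a minimal sketch. The page does not say whether the reported F1 is micro- or macro-averaged, so macro-averaging is an assumption here, and the function name `macro_f1` and the toy label lists are illustrative only.

```python
from collections import Counter

def macro_f1(gold, pred):
    """Macro-averaged F1 over every label seen in gold or pred (assumed averaging)."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted label was wrong
            fn[g] += 1   # gold label was missed
    scores = []
    for lab in labels:
        denom = 2 * tp[lab] + fp[lab] + fn[lab]
        scores.append(2 * tp[lab] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Toy data, not taken from the paper.
gold = ["CAUSAL", "PRECONDITIONED", "TEMPORAL", "CAUSAL"]
pred = ["CAUSAL", "TEMPORAL", "TEMPORAL", "TEMPORAL"]
score = macro_f1(gold, pred)
```

In this toy case the model defaults to TEMPORAL whenever it is unsure, which mirrors the failure mode described in the Figure 11 caption: sequential order is recognized while the causal dependency is missed.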

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures may need explicit mechanisms for chaining events over hour-long spans rather than relying on uniform attention over tokens.
  • The same annotation style could be applied to long documents or podcasts to test whether narrative deficits appear across modalities.
  • Pre-training losses that reward event boundary prediction might close part of the performance gap observed here.
  • Flashback and reframing scenes in the data offer a natural test for whether models can revise earlier event interpretations.

Load-bearing premise

The human annotations of the 102 events per movie and their narrative relations are accurate, consistent, and sufficient to measure genuine narrative understanding.

What would settle it

A model that reaches above 30 percent on event localization or argument extraction on the NEST test set would show that the discovery tasks are more solvable by current methods than the reported baselines indicate.
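
One common way to score localization is temporal intersection-over-union against the annotated span, counted as correct above a threshold. The page does not state which metric or threshold the paper uses, so the sketch below is an assumption; `temporal_iou`, `localization_accuracy`, and the 0.5 threshold are illustrative choices, and the predicted spans are invented.

```python
def temporal_iou(pred, gold):
    """Intersection-over-union of two (start, end) spans given in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def localization_accuracy(preds, golds, threshold=0.5):
    """Fraction of events whose predicted span reaches the IoU threshold (assumed 0.5)."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, golds))
    return hits / len(golds) if golds else 0.0

# Gold span approximates the "escape" event in Figure 9 (about 19 to 40 minutes).
gold_span = (19 * 60, 40 * 60)
good_pred = (20 * 60, 40 * 60)    # invented prediction close to the gold span
bad_pred = (60 * 60, 70 * 60)     # invented prediction far from the gold span
acc = localization_accuracy([good_pred, bad_pred], [gold_span, gold_span])
```

Under this scoring, a prediction far from the ground truth earns nothing regardless of how plausible its duration is, which is why timestamps that miss the event entirely drive scores toward zero.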

Figures

Figures reproduced from arXiv: 2606.19706 by Ali Asgarov, Anushka Sivakumar, Chia-Wei Tang, Chris Thomas, Hani Alomari, Kaushik Narasimhan, Najibul Haque Sarker, Shaurya Mallampati, Zaber Ibn Abdul Hakim.

Figure 1: NEST evaluates four narrative event tasks on full-length movies: Event Trigger Detection (ETD), Event …

Figure 2: Genre distribution across the 1,005 movies in …

Figure 3: Full-length movies with plots/scripts and audio descriptions (AD) are segmented into scenes at contextual …

Figure 4: Event Argument Extraction example from American Psycho (102 min). Given the “attack” event trigger and four semantic roles (ARG0: attacker, ARG1: entity attacked, ARGM-INS: instrument, ARGM-MNR: manner), models extract argument values. Both Qwen2.5-VL variants (7B and 32B) correctly identify all roles, with the 7B model providing richer visual descriptions. Qwen3-VL (8B) hallucinates a celebrity name (“Chr…

Figure 5: Average number of annotated events per movie in the NEST dataset, broken down by visual events, audio dialogue events, and audio sound events. Bars show per-movie averages, with total counts reported for each modality.
    the boundaries of narrative events are inherently subjective. An event such as “breakup” may arguably begin as tension escalates, include several decisive moments, and resolve as character…

Figure 6: Annotation agreement on NEST. Weighted Cohen’s κ and mean semantic similarity for GOLD–GOLD (two annotators) and GOLD–SILVER.
    a correction layer on the SILVER data. Agreement Analysis. Although the GOLD subset consists of five movies, this corresponds to approximately 350 annotated events and 250 annotated relations. We measured weighted Cohen’s κ ≈ 0.50 for GOLD–SILVER agreement, compared to inter-ann…

Figure 7: Temporal distance between related events …

Figure 8: Task-specific training loss curves for Qwen3- …

Figure 9: Event Localization example from Caddo Lake (97 min). The “escape” event occurs between approximately 19–40 minutes (green span). Given the full movie and the event description with structured arguments (left panel), seven models predict temporal boundaries. Only Qwen3-VL (30B) and LongVU-LLaMA3 correctly localize the event, while others predict timestamps far from the ground truth. Common failure modes inc…

Figure 10: Event Relation Extraction example from Caddo Lake (97 min). Two events are separated by approximately 50 minutes: a “help” event (E1, near 19 min) and a “search” event (E2, near 71 min). The ground-truth relation is PRECONDITIONED, as the earlier helping event creates conditions that enable the later search. Only Qwen2.5-VL (32B) correctly identifies this relation. Three models (Qwen3-VL, LongVU-LLaMA3, L…

Figure 11: Event Relation Extraction example from Gangs of Lagos (120 min). A “fight” event (E1, near 60 min) is followed by a “mourn” event (E2, near 80 min), with the ground-truth relation being CAUSAL. Only Qwen2.5-VL (32B) and LongVU-LLaMA3 correctly identify the causal link. Qwen2.5-VL (7B) predicts TEMPORAL, recognizing the sequential ordering but missing the causal dependency. The remaining four models predi…

Figure 12: Event Trigger Detection example from Caddo Lake (97 min). Given the scene between 61–100 seconds, models must identify the narrative event trigger. Only Qwen3-VL (8B) correctly predicts “search” with accurate context describing the narrative situation. Qwen2.5-VL (32B) identifies the correct trigger (“search”) but provides a generic context that misses the specific narrative details. Qwen2.5-VL (7B) predi…

Figure 13: Event Trigger Detection example from Already Tomorrow in Hong Kong (67 min). Given the scene between 1170–1210 seconds, all four models fail to identify the correct narrative event. Three out of four models (Qwen2.5-VL 7B, Qwen3-VL 30B, and Qwen3-VL 8B) predict atomic verbs (“walk” or “leave”), describing surface-level physical motions rather than the narrative-level event taking place. Qwen2.5-VL (32B) p…
Original abstract

Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long videos. Existing long video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress, for example, whether a model can connect an early setback, such as a job loss to a later relationship breakup, despite long gaps, intervening scenes, or flashbacks that reframe what occurred. We introduce NEST (Narrative Event Structures in Time for Long Video Understanding), a dataset of 1005 full-length movies (avg. 98 minutes), each annotated with 102 multimodal narrative events grounded in visual content, dialogue, and audio. NEST captures multimodal narrative events with structured annotations grounded in visual content, dialogue, and audio, and links them through relations that reflect narrative structure, including temporal ordering, hierarchical composition, and long-range dependencies. We introduce baselines for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). The benchmark is highly challenging for grounded event discovery, with ETD below 8%, EL under 6%, and EAE below 11%. In contrast, ERE is more tractable once events are given, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the NEST dataset of 1005 full-length movies (avg. 98 min) each annotated with 102 multimodal narrative events grounded in visual content, dialogue, and audio, linked by temporal, hierarchical, and long-range relations. It defines four tasks (event trigger detection ETD, event localization EL, event argument extraction EAE, event relation extraction ERE) and reports baseline results claiming the first three tasks are highly challenging (ETD <8%, EL <6%, EAE <11%) while ERE reaches 35.45% F1 zero-shot and 44.42% F1 after fine-tuning, arguing that current vision-language models fail to capture narrative structure beyond retrieval.

Significance. If the annotations are shown to be reliable and the baselines are reproducible, the work could be significant by shifting long-video evaluation from needle-in-haystack retrieval to structured narrative understanding, supplying a large-scale resource with explicit multimodal grounding and relational annotations that current benchmarks lack.

major comments (2)
  1. [Abstract] The central claim that 'the benchmark is highly challenging for grounded event discovery' with ETD below 8%, EL under 6%, and EAE below 11% rests on the 102 events per movie and their narrative relations constituting accurate, consistent ground truth. The abstract supplies no information on annotation protocol, number of annotators, inter-annotator agreement, adjudication, or external validation, so low baseline scores cannot be unambiguously attributed to model shortcomings rather than label noise.
  2. [Abstract] Baseline performance figures are stated without any description of model implementations, data splits, training details, or statistical measures (error bars, significance tests). This prevents assessment of whether the reported numbers support the difficulty claim.
minor comments (1)
  1. [Abstract] The abstract repeats the phrase 'structured annotations grounded in visual content, dialogue, and audio' without clarifying how multimodal grounding is operationalized or verified during annotation.
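
The inter-annotator agreement raised in the first major comment is reported in the paper's Figure 6 caption as weighted Cohen's κ. For reference, a minimal sketch of a linearly weighted κ follows. It assumes ratings are integers on an ordinal scale `0..k-1` with linear disagreement weights; the paper's actual rating scale and weighting scheme are not given on this page, and the function name `weighted_kappa` is illustrative.

```python
def weighted_kappa(a, b, k):
    """Linearly weighted Cohen's kappa for two raters with ordinal labels 0..k-1 (k >= 2)."""
    n = len(a)
    # Observed joint distribution of the two raters' labels.
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[x][y] += 1.0 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)        # linear disagreement weight (assumed)
            num += w * obs[i][j]            # observed weighted disagreement
            den += w * row[i] * col[j]      # disagreement expected by chance
    return 1.0 - num / den if den else 1.0

# Toy ratings, not taken from the paper.
perfect = weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1], k=3)
chance = weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], k=2)
```

A value of 1 means the raters always agree and a value of 0 means agreement is no better than chance, so a κ near 0.50 sits between the two.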

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We agree that it should more explicitly support the claims regarding annotation quality and baseline reproducibility. We will revise the abstract to address both points by incorporating concise references to the relevant sections of the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'the benchmark is highly challenging for grounded event discovery' with ETD below 8%, EL under 6%, and EAE below 11% rests on the 102 events per movie and their narrative relations constituting accurate, consistent ground truth. The abstract supplies no information on annotation protocol, number of annotators, inter-annotator agreement, adjudication, or external validation, so low baseline scores cannot be unambiguously attributed to model shortcomings rather than label noise.

    Authors: We agree that the abstract would be strengthened by briefly indicating annotation reliability. The manuscript (Section 3) details the annotation protocol, annotator count, agreement metrics, and adjudication process. We will revise the abstract to include a short summary of these elements so that the ground-truth quality is more clearly established. revision: yes

  2. Referee: [Abstract] Baseline performance figures are stated without any description of model implementations, data splits, training details, or statistical measures (error bars, significance tests). This prevents assessment of whether the reported numbers support the difficulty claim.

    Authors: We agree that the abstract should point readers to the supporting experimental details. The manuscript (Section 4) specifies the model implementations, data splits, training procedures, and reports statistical measures including error bars. We will revise the abstract to reference these details or add a brief clause summarizing them. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset construction with no derivations or self-referential predictions

full rationale

The paper introduces the NEST dataset of 1005 movies with 102 annotated events each and reports baseline performance on four tasks (ETD, EL, EAE, ERE). No equations, parameters, or derivations appear in the provided text. Claims rest on empirical annotation and model evaluation rather than any chain that reduces a result to its own inputs by construction. Self-citations are absent from the abstract and described sections; the work is self-contained as dataset release plus baseline reporting against external models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that the created annotations faithfully represent narrative structure; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Human annotations of narrative events and relations are reliable and capture the intended narrative phenomena.
    All reported tasks and performance numbers depend on the quality of these annotations.

pith-pipeline@v0.9.1-grok · 5838 in / 1253 out tokens · 25146 ms · 2026-06-26T17:55:36.840513+00:00 · methodology

