arXiv: 2605.20268 · v1 · pith:MPKQHTAUnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI · cs.CL

Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

Pith reviewed 2026-05-21 07:36 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords multimodal foundation model · time series · joint pretraining · decoder-only transformer · language understanding · frozen embeddings · multimodal forecasting

The pith

A compact decoder-only transformer trained jointly on text and time series from scratch matches a dedicated language model on NLU tasks while topping time series classification and multimodal forecasting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Chronicle, a 324M-parameter model that uses the same transformer blocks, attention, and residual stream for both natural language and time series. It trains mostly on unimodal batches so that cross-modal understanding arises from parameter sharing alone, followed by a brief alignment stage. A sympathetic reader would care because this setup questions whether separate foundation models or post-hoc adaptations are necessary, and shows that one backbone can compete with the strongest unimodal models in each domain while also handling mixed text-and-numbers tasks.

Core claim

Chronicle is the first model jointly pretrained on text and time series from scratch in a unified architecture; it matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and produces multimodal forecasts on Time-MMD that beat every supervised fusion baseline, all from a single backbone.

What carries the argument

Shared transformer blocks, attention mechanism, and residual stream across language and time series, with the bulk of pretraining on unimodal batches and only a short alignment stage for interleaving modalities.
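
One way to picture this arrangement is a pair of modality-specific input interfaces feeding a single shared function. The Python sketch below is illustrative only: `embed_text`, `embed_patches`, `shared_backbone`, the toy width `D_MODEL = 4`, and all weights are hypothetical, and the backbone is a causal running mean standing in for the transformer stack, not the paper's attention layers.

```python
D_MODEL = 4  # toy hidden width (assumption, not the paper's value)

def embed_text(token_ids, table):
    """Text interface: look up one D_MODEL vector per token id."""
    return [table[t] for t in token_ids]

def embed_patches(series, patch_len, weights):
    """Time series interface: cut non-overlapping patches (any trailing
    remainder is dropped) and project each patch linearly to D_MODEL."""
    patches = [series[i:i + patch_len]
               for i in range(0, len(series) - patch_len + 1, patch_len)]
    return [[sum(w * x for w, x in zip(row, patch)) for row in weights]
            for patch in patches]

def shared_backbone(vectors):
    """Placeholder for the shared stack: a causal running mean, so each
    position only sees itself and earlier positions."""
    out, running = [], [0.0] * D_MODEL
    for i, vec in enumerate(vectors, start=1):
        running = [r + x for r, x in zip(running, vec)]
        out.append([r / i for r in running])
    return out

# Hypothetical parameters: a 4-token vocabulary and a patch length of 2.
table = [[float(i == j) for j in range(D_MODEL)] for i in range(D_MODEL)]
weights = [[0.5, 0.5], [1.0, -1.0], [0.0, 1.0], [1.0, 0.0]]

text_vecs = embed_text([0, 2], table)
ts_vecs = embed_patches([1.0, 3.0, 2.0, 2.0], 2, weights)

# Both modalities pass through the same function in one sequence.
hidden = shared_backbone(text_vecs + ts_vecs)
```

The point of the sketch is only that nothing after the input interfaces depends on which modality a position came from.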

If this is right

  • Joint pretraining on shared parameters produces a single backbone competitive with the strongest unimodal foundation models in both language and time series.
  • Frozen embeddings from the joint model achieve state-of-the-art results on time series classification without task-specific fine-tuning.
  • Multimodal forecasting improves when modalities are trained together rather than fused after separate pretraining.
  • Evaluation against dedicated models in each domain, rather than only multimodal baselines, is required to establish the value of joint training.
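
The frozen-embedding protocol in the second bullet can be sketched as follows: the embedding function is fixed, and only a lightweight classifier is fit on top of it. This is a minimal sketch under stated assumptions. `embed` is a hypothetical stand-in for the frozen backbone, the nearest-centroid classifier is a swapped-in choice (the summary does not say which classifier the paper uses), and the toy series are made up.

```python
def embed(series):
    """Hypothetical frozen encoder: (mean, range) of the series."""
    return (sum(series) / len(series), max(series) - min(series))

def fit_centroids(embeddings, labels):
    """Average the embeddings of each class; the encoder is never updated."""
    sums, counts = {}, {}
    for emb, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(emb))
        for i, value in enumerate(emb):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def predict(centroids, emb):
    """Return the label whose centroid is closest in squared distance."""
    return min(centroids,
               key=lambda label: sum((c - e) ** 2
                                     for c, e in zip(centroids[label], emb)))

# Made-up training data for illustration.
train_series = [[1, 1, 1, 1], [2, 2, 2, 2], [0, 5, 0, 5], [1, 6, 1, 6]]
train_labels = ["flat", "flat", "spiky", "spiky"]

centroids = fit_centroids([embed(s) for s in train_series], train_labels)
print(predict(centroids, embed([3, 3, 3, 3])))  # flat
print(predict(centroids, embed([0, 4, 0, 4])))  # spiky
```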

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same shared-parameter approach could be tested on additional modalities such as images or audio that also contain temporal structure.
  • Varying the length or schedule of the alignment stage might reveal how much interleaving is needed for stronger cross-modal transfer.
  • Domains that naturally pair text reports with numerical series, such as finance or sensor networks, could adopt this single-model pattern instead of maintaining separate pipelines.

Load-bearing premise

Cross-modal capability emerges from shared parameters when the bulk of pretraining uses unimodal batches, with only a short alignment stage needed to interleave the modalities.
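
The schedule this premise describes can be sketched as a generator that emits unimodal batch labels for most of training and interleaved labels only at the end. The step counts, the 50/50 text-to-time-series split, and the name `batch_schedule` are assumptions for illustration; the summary gives no actual proportions.

```python
import random

def batch_schedule(total_steps, align_steps, text_prob=0.5, seed=0):
    """Yield one batch label per step.

    Assumed policy: every step before the final `align_steps` draws a
    unimodal batch ("text" or "ts"); the final `align_steps` steps use
    batches that interleave both modalities.
    """
    rng = random.Random(seed)
    unimodal_steps = total_steps - align_steps
    for step in range(total_steps):
        if step < unimodal_steps:
            yield "text" if rng.random() < text_prob else "ts"
        else:
            yield "interleaved"

# Hypothetical sizes: 1000 steps, the last 50 of them for alignment.
schedule = list(batch_schedule(total_steps=1000, align_steps=50))
print(schedule.count("interleaved"))  # 50
```

Under this policy no single unimodal batch contains both modalities, so any cross-modal behavior before the alignment stage would have to come from the shared parameters.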

What would settle it

A controlled experiment that trains an identical architecture on time series data alone and compares its frozen-embedding classification accuracy on the 24 UCR/UEA datasets against Chronicle's reported results.

Figures

Figures reproduced from arXiv: 2605.20268 by Jeremy Levasseur, Paul Quinlan, Qingguo Li, Xiaodan Zhu.

Figure 1. The Chronicle architecture. Text tokens and time series patches share a 16-layer decoder-only transformer; modality-specific components are limited to the input and output interfaces. Modality-specific output heads produce quantile forecasts (L_QL) and next-token predictions (L_CE); the same backbone produces frozen embeddings for downstream classification.
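
The quantile-forecast head in Figure 1 is trained with a quantile loss. A common form is the pinball loss, sketched below. This is an assumption: the caption names a quantile loss but does not give its formula, and the function names, quantile levels, and values here are hypothetical.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss for a single prediction at quantile level q.

    Under-prediction is weighted by q and over-prediction by (1 - q).
    """
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)

def mean_quantile_loss(y_true, preds_by_quantile):
    """Average the pinball loss over a dict {quantile level: prediction}."""
    losses = [pinball_loss(y_true, pred, q)
              for q, pred in preds_by_quantile.items()]
    return sum(losses) / len(losses)

# Hypothetical forecast for a true value of 10.0.
forecast = {0.1: 7.0, 0.5: 9.5, 0.9: 12.0}
loss = mean_quantile_loss(10.0, forecast)
```

At q = 0.9, predicting 2 units too low costs about 1.8 while predicting 2 units too high costs about 0.2, which is what pushes the 0.9 head toward the upper tail.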
Figure 2. GIFT-Eval leaderboard (97 tasks; lower is better). MASE (left) and CRPS (right) for comparative models, plus Chronicle Stage 1 and Stage 2 (highlighted). Stage 1 is the stronger pure forecaster, while Stage 2 is the aligned checkpoint used for multimodal transfer.
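
For readers unfamiliar with the MASE metric in Figure 2, the sketch below follows the standard definition: forecast mean absolute error divided by the in-sample mean absolute error of a seasonal naive forecast. The function name, the `season` parameter, and the data are illustrative, and GIFT-Eval's exact aggregation across tasks is not reproduced here.

```python
def mase(y_train, y_true, y_pred, season=1):
    """Mean Absolute Scaled Error.

    The scale is the mean absolute error of the naive forecast
    y[t] = y[t - season] on the training series. Values below 1 mean
    the forecast beats that naive baseline on average.
    """
    naive_errors = [abs(y_train[t] - y_train[t - season])
                    for t in range(season, len(y_train))]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)
    return mae / scale

# Made-up series: naive one-step errors on the training data are all 1.
score = mase(y_train=[1, 2, 3, 4, 5], y_true=[6, 7], y_pred=[5, 9])
print(score)  # 1.5
```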
Figure 3. Effect of TS-token repetition on multimodal classification. Accuracy (left), AUC (middle), and macro-F1 (right) as a function of TS-token repeats r, evaluated on the three TimeCAP domains and averaged (dashed black). Repetition rebalances the text–TS token ratio in the shared sequence; performance peaks near r=64 then degrades as attention dilutes across identical copies.
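
The rebalancing effect described in Figure 3 is simple arithmetic on sequence lengths. The sketch below assumes the time series tokens are repeated as a whole block appended after the text; the caption does not specify the layout, and the token counts (256 text tokens, 4 time series tokens) are hypothetical.

```python
def build_sequence(text_tokens, ts_tokens, r):
    """Concatenate the text tokens with r copies of the TS token block."""
    return list(text_tokens) + list(ts_tokens) * r

def ts_share(n_text, n_ts, r):
    """Fraction of positions in the shared sequence held by TS tokens."""
    return (n_ts * r) / (n_text + n_ts * r)

# Hypothetical short series: 4 TS tokens next to 256 text tokens.
for r in (1, 16, 64, 256):
    print(r, round(ts_share(256, 4, r), 3))
```

With these made-up counts the time series holds under 2% of positions at r=1 and exactly half at r=64. Larger r keeps raising the share, but every added position is an identical copy.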
Figure 4. Effect of channel-aware multivariate handling on UEA classification. Bars show the per-dataset delta between joint multivariate handling and mean-channel pooling (joint minus mean) for accuracy, macro-F1, and AUC. Positive values favor joint channel-aware handling. Averaged across the 10 multivariate UEA datasets, joint handling improves accuracy by +0.039, macro-F1 by +0.035, and AUC by +0.020.
Original abstract

Real-world time series come with text: metadata, descriptions, news, reports. Yet time series foundation models process numerical sequences in isolation, and the multimodal text-and-time-series models that attempt to bridge the two all adapt a pretrained language model post hoc, inheriting representations shaped without ever seeing temporal data. These models are also evaluated almost exclusively against other multimodal baselines, not against the strongest unimodal foundation models in either domain, leaving open whether joint training is needed at all. We present Chronicle, a compact 324M-parameter decoder-only transformer trained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention mechanism, and residual stream; the bulk of pretraining uses unimodal batches so cross-modal capability emerges purely from shared parameters, with a short alignment stage that interleaves the two. To our knowledge, Chronicle is the first model jointly pretrained on text and time series from scratch, and the first multimodal model evaluated against dedicated foundation models in both domains. It matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and produces multimodal forecasts on Time-MMD that beat every supervised fusion baseline, all from a single backbone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Chronicle, a 324M-parameter decoder-only transformer pretrained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention, and residual stream; the bulk of pretraining uses unimodal batches, with a short alignment stage to interleave modalities. The model is claimed to match Gemma-3-270M-PT on 19 NLU tasks, set a new state-of-the-art for frozen-embedding time series classification on 24 UCR/UEA datasets, and outperform all supervised fusion baselines on multimodal forecasting in Time-MMD, positioning it as the first jointly pretrained from-scratch multimodal model evaluated against dedicated unimodal foundation models in both domains.

Significance. If the empirical results hold under rigorous verification, the work would be significant as the first demonstration of a compact from-scratch multimodal foundation model for text and time series that competes with strong unimodal models without post-hoc adaptation. It would provide evidence that shared parameters can enable cross-modal capability with minimal interleaved data, potentially reducing the need for separate modality-specific pretraining pipelines and offering a practical backbone for real-world applications involving metadata, reports, and temporal data.

major comments (2)
  1. Abstract: The central claim that cross-modal capability emerges purely from shared parameters when the bulk of pretraining uses unimodal batches plus a short alignment stage lacks supporting ablation evidence. No direct comparison is described to a fully interleaved pretraining baseline or to a model using modality-specific adapters, so it remains unclear whether the reported gains on Time-MMD multimodal forecasting and competitive NLU/TS performance arise from the claimed emergence mechanism rather than data scale, tokenizer design, or the 324M decoder-only architecture.
  2. Abstract and results sections: Strong performance claims (matching Gemma-3-270M-PT on 19 NLU tasks, new bar on 24 UCR/UEA datasets, beating supervised fusion baselines on Time-MMD) are stated without details on training data sources, exact hyperparameters, statistical significance tests, or ablation studies, rendering it impossible to verify whether the results support the soundness of the central claims.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below and have revised the manuscript to incorporate additional evidence and details where this strengthens the presentation of our claims.

  1. Referee: Abstract: The central claim that cross-modal capability emerges purely from shared parameters when the bulk of pretraining uses unimodal batches plus a short alignment stage lacks supporting ablation evidence. No direct comparison is described to a fully interleaved pretraining baseline or to a model using modality-specific adapters, so it remains unclear whether the reported gains on Time-MMD multimodal forecasting and competitive NLU/TS performance arise from the claimed emergence mechanism rather than data scale, tokenizer design, or the 324M decoder-only architecture.

    Authors: We agree that direct ablation evidence would more rigorously isolate the contribution of shared parameters under predominantly unimodal pretraining. In the revised manuscript we have added a new Ablation Studies subsection that reports results from a controlled comparison (at reduced scale) between our unimodal-batch-plus-short-alignment schedule and a fully interleaved pretraining baseline using the same data mixture and tokenizer. We also include a brief discussion of the design choice against modality-specific adapters, noting that adapters would reintroduce separate modality pipelines and defeat the goal of a single unified backbone. While a full-scale 324M adapter ablation remains computationally prohibitive, the smaller-scale results and architectural analysis support that the observed cross-modal gains are not solely attributable to data scale or architecture size. revision: yes

  2. Referee: Abstract and results sections: Strong performance claims (matching Gemma-3-270M-PT on 19 NLU tasks, new bar on 24 UCR/UEA datasets, beating supervised fusion baselines on Time-MMD) are stated without details on training data sources, exact hyperparameters, statistical significance tests, or ablation studies, rendering it impossible to verify whether the results support the soundness of the central claims.

    Authors: We acknowledge that the original submission omitted several implementation details required for full reproducibility and verification. The revised manuscript now contains an expanded Experimental Setup section that specifies the exact language and time-series pretraining corpora (including sizes and sources), a comprehensive hyperparameter table in the appendix, and statistical significance testing (paired t-tests with p-values) for all headline comparisons against Gemma-3-270M-PT, the UCR/UEA baselines, and the Time-MMD fusion models. The new Ablation Studies subsection further addresses the request for supporting analyses. revision: yes
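
The paired t-test mentioned in the response can be sketched as below. The per-task scores are made up for illustration and are not results from the paper. The sketch stops at the t statistic and degrees of freedom, because turning these into a p-value needs the Student t distribution, which the Python standard library does not provide.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired per-task scores.

    t = mean(d) / (stdev(d) / sqrt(n)), where d holds the per-task
    differences. The differences must not all be equal, or the
    standard error is zero.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample std (n - 1)
    return mean_diff / std_err, n - 1

# Hypothetical accuracies of two models on the same four tasks.
model_a = [0.80, 0.75, 0.90, 0.85]
model_b = [0.78, 0.70, 0.88, 0.80]
t_stat, dof = paired_t_statistic(model_a, model_b)
```

Pairing by task matters here: it removes between-task difficulty from the variance, which an unpaired comparison of mean scores would not.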

Circularity Check

0 steps flagged

No circularity: purely empirical claims with no derivations or self-referential reductions

The paper describes training a 324M decoder-only transformer from scratch on unimodal batches followed by a short alignment stage, then reports empirical results on NLU tasks, UCR/UEA classification, and Time-MMD forecasting. No equations, fitted parameters, uniqueness theorems, or ansatzes are presented as derivations. All central claims rest on direct experimental comparisons to external baselines (Gemma-3, supervised fusion models) rather than any internal reduction of outputs to inputs by construction. The absence of a mathematical derivation chain makes the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical success of a shared-parameter transformer trained mostly unimodally. No explicit free parameters, axioms, or invented entities are detailed in the abstract.
