pith. machine review for the scientific record.

arxiv: 2501.08313 · v1 · submitted 2025-01-14 · 💻 cs.CL · cs.CV

Recognition: 3 theorem links · Lean Theorem

MiniMax-01: Scaling Foundation Models with Lightning Attention

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 06:21 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords lightning attention · mixture of experts · long context · foundation models · large language models · vision language models · model scaling

The pith

MiniMax-01 matches GPT-4o and Claude-3.5-Sonnet performance while supporting 20-32 times longer contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the MiniMax-01 series, including a text model and a vision-language model, that reach performance levels comparable to current top models on standard and internal benchmarks. The central advance is lightning attention, which scales efficiently when paired with a Mixture of Experts design and specialized parallel and overlap methods. This combination supports training on contexts of 1 million tokens and inference extrapolation to 4 million tokens for a 456-billion-parameter model with 45.9 billion parameters active per token. The approach keeps computational demands manageable while delivering the long-context gains.

Core claim

Lightning attention combined with a 32-expert Mixture of Experts architecture, optimized parallel strategies, and computation-communication overlap techniques enables efficient training and inference for models with hundreds of billions of parameters across million-token contexts. The resulting MiniMax-Text-01 reaches 1 million tokens during training and extrapolates to 4 million during inference, while MiniMax-VL-01 adds vision-language capabilities through continued training on 512 billion tokens, and both match the performance of GPT-4o and Claude-3.5-Sonnet.

What carries the argument

Lightning attention, which, when integrated with MoE parallel scheduling and overlap techniques, supports stable scaling to very long contexts without proportional growth in compute.
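The page does not reproduce the attention kernel, but lightning attention belongs to the linear-attention family, whose core trick is a running key-value state that replaces the quadratic score matrix. A minimal sketch of that family, assuming a simple positive feature map; the function below is illustrative, not the authors' implementation:

```python
import numpy as np

def linear_attention(Q, K, V):
    """Causal linear attention via a running (d x d_v) state.

    Cost is O(n * d * d_v) in sequence length n, versus O(n^2 * d)
    for standard softmax attention. This is a toy sketch of the
    linear-attention family, not the paper's lightning kernel.
    """
    n, d = Q.shape
    # A positive feature map stands in for softmax (a common choice).
    phi = lambda X: np.maximum(X, 0.0) + 1e-6
    Qf, Kf = phi(Q), phi(K)

    S = np.zeros((d, V.shape[1]))   # running sum of k_t v_t^T
    z = np.zeros(d)                 # running sum of k_t (normalizer)
    out = np.empty_like(V)
    for t in range(n):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

Because the state S has fixed size regardless of n, memory and per-token compute stay flat as context grows, which is the property the paper's million-token claims depend on.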

If this is right

  • Models can process 20-32 times more context than current leading systems while matching their benchmark scores.
  • Training and inference become practical for 456-billion-parameter models with million-token contexts.
  • Vision-language training can be added via continued pretraining without losing the long-context advantage.
  • The released models allow direct testing of million-token applications at affordable cost.
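The 20-32x figure is consistent with the commonly cited comparator windows of roughly 128K tokens (GPT-4o) and 200K tokens (Claude-3.5-Sonnet), an assumption this page does not state explicitly:

```python
# Hypothetical comparator context windows (tokens); these figures are
# commonly cited for the comparator models, not taken from this page.
minimax_inference = 4_000_000
comparators = {"GPT-4o": 128_000, "Claude-3.5-Sonnet": 200_000}

for name, window in comparators.items():
    print(f"{name}: {minimax_inference / window:.0f}x")
# GPT-4o: 31x · Claude-3.5-Sonnet: 20x, matching the claimed 20-32x range.
```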

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applications that currently rely on chunking or retrieval could shift to single-pass processing over entire documents or conversations.
  • The overlap techniques may transfer to other attention variants to improve efficiency at scale.
  • Further extrapolation tests beyond 4 million tokens would show whether quality remains flat or begins to degrade.

Load-bearing premise

Lightning attention plus the described MoE parallel and overlap methods preserve full model quality and training stability at the claimed parameter and context sizes with no hidden performance costs.

What would settle it

A controlled benchmark comparison at the full claimed context length where MiniMax-01 scores materially lower than GPT-4o or Claude-3.5-Sonnet on the same tasks, or requires substantially more compute to reach parity.

read the original abstract

We introduce MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01 is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.
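The abstract's expert count and activation figures follow the standard top-k MoE pattern: a gate scores all experts per token and only the top few run. A minimal router sketch, assuming softmax top-k gating (the exact routing function is not specified on this page):

```python
import numpy as np

def topk_moe(x, W_gate, experts, k=2):
    """Route a token to its top-k of n experts and mix their outputs.

    With 32 experts and a small k, only a fraction of total
    parameters is active per token -- the mechanism behind 45.9B
    active out of 456B total. Gating details here are assumptions,
    not taken from the paper.
    """
    logits = x @ W_gate                   # one score per expert
    top = np.argsort(logits)[-k:]         # indices of the top-k experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                          # softmax over selected experts
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(1)
d, n_exp = 8, 32
W_gate = rng.normal(size=(d, n_exp))
# Each toy expert is a fixed linear map; real experts are FFN blocks.
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n_exp)]
y = topk_moe(rng.normal(size=d), W_gate, experts)
print(y.shape)  # (8,)
```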

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the MiniMax-01 series (MiniMax-Text-01 and MiniMax-VL-01), which combine a novel lightning attention mechanism with a 32-expert MoE architecture (456B total parameters, 45.9B active per token). It claims efficient training and inference at scale, supporting 1M-token contexts during training and extrapolation to 4M tokens at inference, while matching the performance of GPT-4o and Claude-3.5-Sonnet on standard and in-house benchmarks and delivering 20-32x longer context windows. The work also describes optimized parallel strategies and computation-communication overlap for MoE and lightning attention, with public release of the models.

Significance. If the empirical claims are substantiated, the work would demonstrate a practical route to scaling foundation models to hundreds of billions of parameters while extending context lengths by an order of magnitude without proportional compute increases, which could meaningfully advance long-context and multimodal applications.

major comments (2)
  1. [Abstract] Abstract and §1: The central claim that the models 'match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet' is unsupported by any quantitative benchmark scores, per-task results, ablation studies isolating lightning attention, or error analysis. Without these data the headline empirical result cannot be evaluated.
  2. [§3] §3 (Lightning Attention) and §4 (MoE scaling): No ablation is presented that holds total parameters and training data fixed while comparing lightning attention against standard attention; the claim that the combination 'preserves model quality' therefore rests on an untested assumption at the reported scale.
minor comments (1)
  1. [Abstract] The abstract states context lengths of '1 million tokens during training and extrapolate to 4 million tokens during inference' but does not specify the exact extrapolation method or any degradation metrics at 4M tokens.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the empirical presentation.

read point-by-point responses
  1. Referee: [Abstract] Abstract and §1: The central claim that the models 'match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet' is unsupported by any quantitative benchmark scores, per-task results, ablation studies isolating lightning attention, or error analysis. Without these data the headline empirical result cannot be evaluated.

    Authors: We agree that the current manuscript does not present the detailed quantitative benchmark tables needed to fully substantiate the claim. In the revised version we will add comprehensive tables reporting exact scores on standard benchmarks (MMLU, GSM8K, HumanEval, MATH, etc.) and in-house evaluations, with per-task breakdowns and direct comparisons to GPT-4o and Claude-3.5-Sonnet. Relevant error analysis and any available ablations isolating lightning attention will also be included. revision: yes

  2. Referee: [§3] §3 (Lightning Attention) and §4 (MoE scaling): No ablation is presented that holds total parameters and training data fixed while comparing lightning attention against standard attention; the claim that the combination 'preserves model quality' therefore rests on an untested assumption at the reported scale.

    Authors: We acknowledge that no controlled ablation holding total parameters and training data fixed is provided. Training an additional 456B-parameter model with standard attention for direct comparison was not feasible within our compute budget. Lightning attention is derived from a theoretical approximation that preserves the same attention matrix expressiveness while enabling linear scaling; the observed parity with SOTA models on diverse benchmarks offers indirect support. We will expand the discussion in §3 to clarify this design rationale and explicitly note the missing ablation as a limitation. revision: partial

standing simulated objections not resolved
  • Direct ablation study holding total parameters and training data fixed while comparing lightning attention to standard attention

Circularity Check

0 steps flagged

No circularity in derivation chain; claims rest on empirical training runs

full rationale

The paper introduces lightning attention and its integration with MoE (32 experts, 45.9B active params out of 456B) for long-context scaling, then reports benchmark results matching GPT-4o/Claude-3.5-Sonnet at 1M-4M tokens. No equations, derivations, or predictions are present that reduce by construction to fitted parameters or self-definitions. Performance claims are grounded in training runs and external benchmark comparisons rather than any self-referential mathematical structure or load-bearing self-citation chain. This is the expected outcome for an empirical scaling paper with no theoretical derivation steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claims rest on the unverified computational efficiency and quality preservation of lightning attention at million-token scales plus the effectiveness of the custom MoE parallel strategy; these are treated as domain assumptions rather than derived results.

free parameters (2)
  • Number of experts
    Set to 32 to achieve the reported activation ratio of 45.9B active parameters out of 456B total.
  • Active parameters per token
    Fixed at 45.9 billion by the MoE routing design.
axioms (1)
  • domain assumption Lightning attention enables efficient scaling to million-token contexts without quality loss
    Invoked as the core enabler of the reported training and inference lengths.
invented entities (1)
  • Lightning attention no independent evidence
    purpose: Provide efficient attention computation for very long sequences
    New mechanism introduced to support the scaling claims
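The two ledger parameters are mutually consistent; a quick arithmetic check of the activation ratio implied by the reported totals:

```python
# Figures from the abstract: 456B total parameters, 45.9B active per token.
total = 456e9
active = 45.9e9
ratio = active / total
print(f"activation ratio: {ratio:.1%}")  # activation ratio: 10.1%
# Roughly one in ten parameters participates in any given token's forward pass.
```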

pith-pipeline@v0.9.0 · 5884 in / 1330 out tokens · 45665 ms · 2026-05-16T06:21:57.513106+00:00 · methodology


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

    cs.AI 2026-05 conditional novelty 7.0

    EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.

  2. EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

    cs.AI 2026-05 unverdicted novelty 7.0

    EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.

  3. MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

    cs.LG 2026-05 conditional novelty 7.0

    MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

  4. When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    cs.PF 2026-05 unverdicted novelty 7.0

    Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and through...

  5. When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    cs.PF 2026-05 conditional novelty 7.0

    Hosted open-weight LLMs function as heterogeneous, time-varying services rather than uniform model artifacts, with concentrated demand, decoupled supply and adoption, and measurable gains from task-aware routing.

  6. OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice

    cs.CL 2026-05 unverdicted novelty 7.0

    OralMLLM-Bench reveals performance gaps between multimodal large language models and clinicians on cognitive tasks for dental radiographic analysis across periapical, panoramic, and cephalometric images.

  7. OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice

    cs.CL 2026-05 unverdicted novelty 7.0

    OralMLLM-Bench is a new benchmark with 27 tasks in four cognitive categories that evaluates six MLLMs on dental radiographs and shows clear performance gaps versus clinicians.

  8. BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs

    cs.CL 2026-04 unverdicted novelty 7.0

    BOSCH decomposes attention-head selection for short-context hybridization into layer probing, adaptive ratio assignment, and grouped binary optimization, yielding better efficiency-performance tradeoffs than static or...

  9. UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification

    cs.CL 2026-05 unverdicted novelty 6.0

    UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.

  10. The Impossibility Triangle of Long-Context Modeling

    cs.CL 2026-05 unverdicted novelty 6.0

    No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

  11. Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

    cs.CL 2026-04 unverdicted novelty 6.0

    HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

  12. MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games

    cs.AI 2026-04 unverdicted novelty 6.0

    MISID is a multimodal multi-turn dataset for intent recognition in strategic deception games, paired with the FRACTAM framework that improves MLLM performance on hidden intent detection via decouple-anchor-reason steps.

  13. ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

    cs.CR 2026-04 unverdicted novelty 6.0

    ClawGuard enforces user-derived access constraints at tool-call boundaries to block indirect prompt injection in tool-augmented LLM agents across web, MCP, and skill injection channels.

  14. ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

    cs.CR 2026-04 unverdicted novelty 6.0

    ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.

  15. HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention

    cs.LG 2026-03 unverdicted novelty 6.0

    HISA speeds up fine-grained sparse attention indexers via block-then-token hierarchy, delivering substantial speedups at 64K context with no training and quality matching the original DSA on long-context benchmarks.

  16. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  17. MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

    cs.CL 2025-07 unverdicted novelty 6.0

    MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.

  18. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    cs.CL 2025-06 unverdicted novelty 6.0

    MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...

  19. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    cs.CL 2025-05 conditional novelty 6.0

    Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.

  20. MoBA: Mixture of Block Attention for Long-Context LLMs

    cs.LG 2025-02 unverdicted novelty 6.0

    MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.

  21. Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

    cs.CE 2026-05 unverdicted novelty 5.0

    LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.

  22. MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

  23. Disposition Distillation at Small Scale: A Three-Arc Negative Result

    cs.LG 2026-04 accept novelty 5.0

    Multiple standard techniques for instilling dispositions in small LMs consistently failed across five models, with initial apparent gains revealed as artifacts and cross-validation collapsing to chance.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 19 Pith papers · 9 internal anchors


  39. [39]

    Crosslingual generalization through multitask finetuning

    Muennighoff N, Wang T, Sutawika L, et al. Généralisation interlinguistique grâce au multitâche finetuning. 2022. arXiv preprint arXiv:2211.01786. 84. Du N, Huang Y, Dai AM,etal. Glam: miseàl’échelleefficacedes modèles de langage avec mélange d’experts. Dans : International Conference on Ma- chine Learning. PMLR; 2022:5547-5569

  40. [40]

    Jurassic-1 : Détails techniques et éval- uation

    Lieber O, Sharir O, Lenz B, Shoham Y. Jurassic-1 : Détails techniques et éval- uation. White Paper. AI21 Labs; 2021:1

  41. [41]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Rae JW, Borgeaud S, Cai T, et al. Scal- ing Language Models : Methods, Analy- sis & Insights from Training Gopher. 2021. arXiv preprint arXiv:2112.11446. 87. Sun Y, Wang S, Feng S, et al. Ernie 3.0 : Large- Scale Knowledge Enhanced Pretraining for Language Understanding and Generation

  42. [42]

    arXiv preprint arXiv:2107.02137

  43. [43]

    Fun and Dystopia with Ai- BasedCodeGenerationUsingGpt-J-6b,June

    Woolf M. Fun and Dystopia with Ai- BasedCodeGenerationUsingGpt-J-6b,June

  44. [44]

    https://minimaxir.com/2021/06/gpt- j-6b/. 89. Black S, Biderman S, Hallahan E, et al. Gpt-neox-20b : Un modèle de lan- gage autorégressif open-source. 2022. arXiv preprint arXiv:2204.06745. 90. Zoph B, Bello I, Kumar S, et al. St-moe : Designing Stable and Transferable Sparse Expert Mod- els. 2022. arXiv preprint arXiv:2202.08906

  45. [45]

    Whispers of the Lost City

B.5. Lyrics Writing

User Request: Create lyrics for a ballad about a traveler who finds an ancient city lost in time.

MiniMax-Text-01:

Title: "Whispers of the Lost City"

(Verse 1)
In the heart of the desert, where the sands softly sigh,
A lone traveler wanders beneath the vast sky.
With ...

1. Keep current lane: Keep driving in your current lane, making sure you don't miss an upcoming exit ramp.

2. Deceleration: When approaching the exit ramp, slow down appropriately to ensure a safe entry into the ramp. Usually, there will be a speed sign on the exit ramp of the expressway, so keep an eye out for it.

3. Turn signal: Before you are ready to enter the ramp, turn on the right turn signal to alert the vehicle behind you that you are about to change lanes.

4. Approach ramp: Follow the navigation directions and enter the ramp after 38 meters. Be aware of curves and slopes on ramps and maintain an appropriate speed.

5. Continue to follow navigation: After entering the ramp, continue to follow the directions of the navigation. Navigation will guide you through the next segments until you reach your destination. Be aware of your safety and focus, especially when changing lanes on highways and onto ramps. Wishing you a safe journey!

First row:
- The first set: short skirts
- The fifth set: long skirts
- The sixth set: long skirts

There are a total of 3 sets in the first row with long skirts.

Second row:
- The first set: trousers
- The second set: long skirts
- The sixth set: long skirts

There are 2 sets of long skirts in the second row.

Third row:
- The first set: long skirts
- The second set: trousers
- The third set: trousers
- The fourth set: trousers
- The fifth set: trousers
- The sixth set: trousers

There is a total of 1 set in the third row with a long skirt.

To sum up, there are 3 sets, 2 sets, and 1 set of long skirts in each row.

C.3. Digital Scenarios

Example 4

User Request: Based on the provided flowchart and statistical data, determine which step takes the longest...

1. Automate data entry: Consider using OCR (Optical Character Recognition) technology to automatically read invoice information and fill it into the system. This will significantly reduce the time spent on manual entry.

2. Simplify the data entry interface: Redesign the data entry interface to make it more intuitive and user-friendly. For example, use drop-down menus, auto-fill, and smart suggestion features to reduce user input time and error rates.

3. Batch processing: If possible, allow users to upload invoice information in batches instead of entering them one by one. This can be achieved by developing a batch import feature, where users only need to upload a file containing all invoice information, and the system will automatically parse and fill in the data.

4. Training and guidance: Provide users with detailed training and operation manuals to help them become familiar with the data entry process, reducing extra time consumption due to unfamiliarity with the operations.

5. Parallel processing: If the data entry task can be broken down into multiple sub-tasks, consider assigning these sub-tasks to different team members for parallel processing, thereby shortening the overall processing time.

6. Feedback mechanism: Establish a feedback mechanism that allows users to report issues and difficulties encountered during the data entry process. Based on feedback, continuously improve the data entry process and tools.

By implementing the above methods, we can significantly reduce the time spent on the "Enter invoice details" step, improving the overall ...
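The batch-import suggestion in the example above can be sketched in a few lines. This is a minimal illustration, not part of the paper: the CSV column names (`invoice_id`, `amount`, `date`) and the `parse_invoice_batch` helper are hypothetical stand-ins for whatever schema and entry system a real deployment would use.

```python
import csv
import io

def parse_invoice_batch(csv_text):
    """Parse a CSV upload into invoice records, skipping malformed rows.

    The column names here are illustrative; a real system would map
    them onto its own invoice schema.
    """
    records = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        try:
            records.append({
                "invoice_id": row["invoice_id"].strip(),
                "amount": float(row["amount"]),
                "date": row["date"].strip(),
            })
        except (KeyError, ValueError):
            # Log or collect bad rows instead of aborting the whole batch.
            continue
    return records

# Example upload containing one malformed row.
upload = """invoice_id,amount,date
INV-001,125.50,2024-03-01
INV-002,not-a-number,2024-03-02
INV-003,89.00,2024-03-03
"""
batch = parse_invoice_batch(upload)
print(len(batch))  # 2 valid records; the malformed row is skipped
```

Skipping bad rows rather than rejecting the file keeps a single typo from blocking the rest of the batch, which is the point of replacing one-by-one manual entry.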