MiniMax-01: Scaling Foundation Models with Lightning Attention
Pith reviewed 2026-05-16 06:21 UTC · model grok-4.3
The pith
MiniMax-01 matches GPT-4o and Claude-3.5-Sonnet performance while supporting 20-32 times longer contexts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lightning attention combined with a 32-expert Mixture of Experts architecture, optimized parallel strategies, and computation-communication overlap techniques enables efficient training and inference for models with hundreds of billions of parameters across million-token contexts. The resulting MiniMax-Text-01 reaches a context window of 1 million tokens during training and extrapolates to 4 million tokens during inference, while MiniMax-VL-01 adds vision-language capabilities through continued training on 512 billion vision-language tokens; both match the performance of GPT-4o and Claude-3.5-Sonnet.
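For orientation, the reported configuration gathered into one place, a minimal sketch using only the figures stated in the abstract; the field names are illustrative, not taken from the released code.

```python
from dataclasses import dataclass

@dataclass
class MiniMax01Config:
    """Figures as reported in the abstract; field names are hypothetical."""
    total_params: float = 456e9              # total parameters
    active_params_per_token: float = 45.9e9  # parameters activated per token via MoE routing
    num_experts: int = 32                    # Mixture-of-Experts experts
    train_context_tokens: int = 1_000_000    # context window reached during training
    infer_context_tokens: int = 4_000_000    # extrapolated context window at inference

cfg = MiniMax01Config()
print(f"active fraction per token: {cfg.active_params_per_token / cfg.total_params:.1%}")  # ~10.1%
```

The roughly 10% active fraction is what keeps per-token compute of the 456-billion-parameter model close to that of a dense model of about 46 billion parameters.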
What carries the argument
Lightning attention, which, when integrated with MoE parallel scheduling and overlap techniques, supports stable scaling to very long contexts without proportional compute growth.
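Lightning attention belongs to the linear-attention family, so the cost argument rests on replacing the quadratic softmax-attention pattern with a recurrence over a fixed-size state. Below is a minimal sketch of that general recurrence, assuming the standard unnormalized causal form; the paper's actual kernel is a tiled, hardware-aware variant and is not reproduced here.

```python
import numpy as np

def causal_linear_attention(q, k, v):
    """Causal linear attention via a running state S = sum_t k_t v_t^T.
    Cost is O(n * d * d_v) rather than the O(n^2 * d) of softmax attention.
    This is the textbook recurrence, not the paper's lightning-attention kernel."""
    n, d = q.shape
    d_v = v.shape[1]
    state = np.zeros((d, d_v))           # fixed-size memory, independent of n
    out = np.empty((n, d_v))
    for t in range(n):
        state += np.outer(k[t], v[t])    # fold the new token into the state
        out[t] = q[t] @ state            # read-out cost does not grow with t
    return out

# toy check: shapes only, no claim about model quality
q, k, v = (np.random.randn(8, 4) for _ in range(3))
print(causal_linear_attention(q, k, v).shape)  # (8, 4)
```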
If this is right
- Models can process 20-32 times more context than current leading systems while matching their benchmark scores (the arithmetic behind the multiplier is sketched after this list).
- Training and inference become practical for 456-billion-parameter models with million-token contexts.
- Vision-language training can be added via continued pretraining without losing the long-context advantage.
- The released models allow direct testing of million-token applications at affordable cost.
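A quick check on where the 20-32x multiplier plausibly comes from, assuming the comparison is the 4-million-token inference window against roughly 128K (GPT-4o) and 200K (Claude-3.5-Sonnet) context windows; those baseline figures are an assumption, not stated in the abstract.

```python
# Hedged arithmetic behind the "20-32 times longer context" claim.
# Baseline context windows below are assumed, not quoted from the paper.
minimax_inference_ctx = 4_000_000
baselines = {"GPT-4o (assumed 128K)": 128_000, "Claude-3.5-Sonnet (assumed 200K)": 200_000}
for name, ctx in baselines.items():
    print(f"{name}: {minimax_inference_ctx / ctx:.1f}x longer")
# -> 31.2x and 20.0x, which brackets the paper's 20-32x range
```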
Where Pith is reading between the lines
- Applications that currently rely on chunking or retrieval could shift to single-pass processing over entire documents or conversations.
- The overlap techniques may transfer to other attention variants to improve efficiency at scale.
- Further extrapolation tests beyond 4 million tokens would show whether quality remains flat or begins to degrade.
Load-bearing premise
Lightning attention plus the described MoE parallel and overlap methods preserve full model quality and training stability at the claimed parameter and context sizes with no hidden performance costs.
What would settle it
A controlled benchmark comparison at the full claimed context length where MiniMax-01 scores materially lower than GPT-4o or Claude-3.5-Sonnet on the same tasks, or requires substantially more compute to reach parity.
Original abstract
We introduce MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01 is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the MiniMax-01 series (MiniMax-Text-01 and MiniMax-VL-01), which combine a novel lightning attention mechanism with a 32-expert MoE architecture (456B total parameters, 45.9B active per token). It claims efficient training and inference at scale, supporting 1M-token contexts during training and extrapolation to 4M tokens at inference, while matching the performance of GPT-4o and Claude-3.5-Sonnet on standard and in-house benchmarks and delivering 20-32x longer context windows. The work also describes optimized parallel strategies and computation-communication overlap for MoE and lightning attention, with public release of the models.
Significance. If the empirical claims are substantiated, the work would demonstrate a practical route to scaling foundation models to hundreds of billions of parameters while extending context lengths by an order of magnitude without proportional compute increases, which could meaningfully advance long-context and multimodal applications.
major comments (2)
- [Abstract] Abstract and §1: The central claim that the models 'match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet' is unsupported by any quantitative benchmark scores, per-task results, ablation studies isolating lightning attention, or error analysis. Without these data the headline empirical result cannot be evaluated.
- [§3] §3 (Lightning Attention) and §4 (MoE scaling): No ablation is presented that holds total parameters and training data fixed while comparing lightning attention against standard attention; the claim that the combination 'preserves model quality' therefore rests on an untested assumption at the reported scale.
minor comments (1)
- [Abstract] The abstract states context lengths of '1 million tokens during training and extrapolate to 4 million tokens during inference' but does not specify the exact extrapolation method or any degradation metrics at 4M tokens.
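As an illustration of the degradation metrics this comment asks for, a minimal length-sweep sketch follows; `evaluate_at_length` is a hypothetical callable and the lengths are chosen to straddle the 1M-token training window, so this is not the paper's protocol.

```python
def degradation_curve(evaluate_at_length,
                      lengths=(128_000, 1_000_000, 2_000_000, 4_000_000)):
    """Score the same long-context task at growing lengths and report the drop
    relative to the shortest length. `evaluate_at_length` is a hypothetical
    callable returning an accuracy in [0, 1]."""
    scores = {n: evaluate_at_length(n) for n in lengths}
    baseline = scores[lengths[0]]
    return {n: round(baseline - s, 3) for n, s in scores.items()}

# toy usage with a fake evaluator that loses a little accuracy as length grows
print(degradation_curve(lambda n: 0.95 - 1e-8 * n))
```

A flat curve past 1 million tokens would support the extrapolation claim; a widening drop would bound it.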
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the empirical presentation.
Point-by-point responses
-
Referee: [Abstract] Abstract and §1: The central claim that the models 'match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet' is unsupported by any quantitative benchmark scores, per-task results, ablation studies isolating lightning attention, or error analysis. Without these data the headline empirical result cannot be evaluated.
Authors: We agree that the current manuscript does not present the detailed quantitative benchmark tables needed to fully substantiate the claim. In the revised version we will add comprehensive tables reporting exact scores on standard benchmarks (MMLU, GSM8K, HumanEval, MATH, etc.) and in-house evaluations, with per-task breakdowns and direct comparisons to GPT-4o and Claude-3.5-Sonnet. Relevant error analysis and any available ablations isolating lightning attention will also be included. Revision: yes.
-
Referee: [§3] §3 (Lightning Attention) and §4 (MoE scaling): No ablation is presented that holds total parameters and training data fixed while comparing lightning attention against standard attention; the claim that the combination 'preserves model quality' therefore rests on an untested assumption at the reported scale.
Authors: We acknowledge that no controlled ablation holding total parameters and training data fixed is provided. Training an additional 456B-parameter model with standard attention for direct comparison was not feasible within our compute budget. Lightning attention is derived from a theoretical approximation that preserves the same attention matrix expressiveness while enabling linear scaling; the observed parity with SOTA models on diverse benchmarks offers indirect support. We will expand the discussion in §3 to clarify this design rationale and explicitly note the missing ablation as a limitation. Revision: partial. (A rough sketch of the linear-versus-quadratic cost argument follows this exchange.)
- Suggested experiment: a direct ablation study holding total parameters and training data fixed while comparing lightning attention to standard attention.
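To ground the "linear scaling" part of the rebuttal, an order-of-magnitude, attention-only FLOP comparison at the claimed context length; constants are dropped and the lightning-attention kernel is simplified, so this is a sketch of the asymptotics rather than the paper's accounting.

```python
def attention_flops(n, d, linear=False):
    """Rough attention-only FLOPs per layer per head, constants dropped.
    Softmax attention: QK^T and AV each cost about n^2 * d.
    Linear/lightning-style attention: updating and reading a d x d state
    costs about n * d^2."""
    return n * d * d if linear else n * n * d

n, d = 1_000_000, 128   # illustrative: 1M-token context, head dimension 128
softmax_cost = attention_flops(n, d)
linear_cost = attention_flops(n, d, linear=True)
print(f"softmax ~{softmax_cost:.1e}, linear ~{linear_cost:.1e}, ratio ~{softmax_cost / linear_cost:,.0f}x")
# ratio is n / d, i.e. about 7,800x fewer attention FLOPs at this length
```

The ratio covers only the attention term; MoE feed-forward cost is unchanged, so the end-to-end speedup is far smaller than this attention-only figure.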
Circularity Check
No circularity in derivation chain; claims rest on empirical training runs
full rationale
The paper introduces lightning attention and its integration with MoE (32 experts, 45.9B active params out of 456B) for long-context scaling, then reports benchmark results matching GPT-4o/Claude-3.5-Sonnet at 1M-4M tokens. No equations, derivations, or predictions are present that reduce by construction to fitted parameters or self-definitions. Performance claims are grounded in training runs and external benchmark comparisons rather than any self-referential mathematical structure or load-bearing self-citation chain. This is the expected outcome for an empirical scaling paper with no theoretical derivation steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- Number of experts
- Active parameters per token
axioms (1)
- Domain assumption: lightning attention enables efficient scaling to million-token contexts without quality loss.
invented entities (1)
- Lightning attention (no independent evidence)
Forward citations
Cited by 23 Pith papers
-
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.
-
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.
-
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and through...
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLMs function as heterogeneous, time-varying services rather than uniform model artifacts, with concentrated demand, decoupled supply and adoption, and measurable gains from task-aware routing.
-
OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice
OralMLLM-Bench reveals performance gaps between multimodal large language models and clinicians on cognitive tasks for dental radiographic analysis across periapical, panoramic, and cephalometric images.
-
OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice
OralMLLM-Bench is a new benchmark with 27 tasks in four cognitive categories that evaluates six MLLMs on dental radiographs and shows clear performance gaps versus clinicians.
-
BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs
BOSCH decomposes attention-head selection for short-context hybridization into layer probing, adaptive ratio assignment, and grouped binary optimization, yielding better efficiency-performance tradeoffs than static or...
-
UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification
UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.
-
The Impossibility Triangle of Long-Context Modeling
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
-
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
-
MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games
MISID is a multimodal multi-turn dataset for intent recognition in strategic deception games, paired with the FRACTAM framework that improves MLLM performance on hidden intent detection via decouple-anchor-reason steps.
-
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
ClawGuard enforces user-derived access constraints at tool-call boundaries to block indirect prompt injection in tool-augmented LLM agents across web, MCP, and skill injection channels.
-
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.
-
HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention
HISA speeds up fine-grained sparse attention indexers via block-then-token hierarchy, delivering substantial speedups at 64K context with no training and quality matching the original DSA on long-context benchmarks.
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
-
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
-
MoBA: Mixture of Block Attention for Long-Context LLMs
MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.
-
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
-
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
-
Disposition Distillation at Small Scale: A Three-Arc Negative Result
Multiple standard techniques for instilling dispositions in small LMs consistently failed across five models, with initial apparent gains revealed as artifacts and cross-validation collapsing to chance.