{"total":78,"items":[{"citing_arxiv_id":"2606.01172","ref_index":114,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Revisiting Neural Processes via Fourier Transform and Volterra Series","primary_cat":"cs.LG","submitted_at":"2026-05-31T11:27:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces SFConvCNPs and SFVConvCNPs using set Fourier convolutions and Volterra expansions for translation-equivariant neural processes on irregular data with global receptive fields and linear scaling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22705","ref_index":75,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tokenization with Split Trees","primary_cat":"cs.CL","submitted_at":"2026-05-21T16:46:23+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22064","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild","primary_cat":"cs.CL","submitted_at":"2026-05-21T07:00:06+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21178","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Metaphors in Literary Post-Editing: Opening Pandora's Box?","primary_cat":"cs.CL","submitted_at":"2026-05-20T13:45:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Post-editors changed one in 
three metaphors in NMT and LLM outputs for literary texts, rated quality poor, and found post-editing more laborious than original translation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19717","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Physics-in-the-Loop: A Hybrid Agentic Architecture for Validated CAD Engineering Design","primary_cat":"cs.CV","submitted_at":"2026-05-19T11:52:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A hybrid agentic architecture integrates knowledge-based physical verification tools into LLM-driven CAD design loops, producing more complex and functionally valid designs than prior agentic baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17187","ref_index":256,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media","primary_cat":"cs.CL","submitted_at":"2026-05-16T22:52:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17152","ref_index":57,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource 
Languages","primary_cat":"cs.CL","submitted_at":"2026-05-16T20:56:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A tutorial synthesizing foundations, recent models such as PALO and Maya, and low-cost methods for tri-modal multilingual AI in resource-constrained settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18870","ref_index":58,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multi-Headed Transformer Architectures as Time-dependent Wasserstein Gradient Flows","primary_cat":"cs.LG","submitted_at":"2026-05-15T15:32:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Models multi-head transformer data flow as time-dependent Wasserstein gradient flows of an attention-capturing interaction energy, with proofs on omega-limit stationary points and stability under weight and input perturbations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15976","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective","primary_cat":"cs.CL","submitted_at":"2026-05-15T14:11:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GRPO with reference-free rewards improves NLLB-200 translation quality on 13 languages up to +5.03 chrF++, competing with supervised fine-tuning on complex languages without target data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13896","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Neural Code Translation of Legacy Code: APL to 
C#","primary_cat":"cs.SE","submitted_at":"2026-05-12T12:11:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Guided LLM strategies with custom datasets and execution-based verification enable functional APL-to-C# translation across a range of program complexities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11501","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Decaf: Improving Neural Decompilation with Automatic Feedback and Search","primary_cat":"cs.SE","submitted_at":"2026-05-12T04:21:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Decaf uses compiler feedback and search to improve neural decompilation, boosting semantic success rate from 26.0% to 83.9% on ExeBench Real -O2 split.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"code re-compiled with Clang -O2 and -Os respectively: higher than N=1 performance of 59.1%. However, on the unstripped split the phenomenon is very different. Log Probability reranking strongly outperforms on the R EAL split, and moreover neural reranking leads to degenerate performance on the S YNTH split. We hypothesize that our neural reranker generally struggles due to the distribution shift [30], [31]. While it may learn assembly semantics from its discriminative training task, it may not have been exposed to certain patterns of assembly the Clang may generate that GCC may not generate. 
This underscores an importance that a neural reranker should be trained on a highly diverse set of generated assembly pairs from different compiler configurations if it is expected to generalize to such"},{"citing_arxiv_id":"2605.09949","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models","primary_cat":"cs.LG","submitted_at":"2026-05-11T03:53:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing.arXiv [cs.CL], 2018. doi:10.48550/arXiv.1808.06226. [33] RDKit: Open-source cheminformatics. [34] Esben Jannik Bjerrum. SMILES enumeration as data augmentation for neural network modeling of molecules. arXiv [cs.LG], 2017. doi:10.48550/arXiv.1703.07076. [35] Josep Arús-Pous, Simon Viet Johansson, Oleksii Prykhodko, Esben Jannik Bjerrum, Christian Tyrchan, Jean-Louis Reymond, Hongming Chen, and Ola Engkvist. Randomized SMILES strings improve the quality of molecular generative models.J. Cheminform., 11(1):71, 2019. ISSN 1758-2946,1758-2946. doi:10.1186/s13321-019-0393-0. 
[36] Yasuhiro Yoshikai, Tadahaya Mizuno, Shumpei Nemoto, and Hiroyuki Kusuhara."},{"citing_arxiv_id":"2605.09630","ref_index":93,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-10T16:18:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07711","ref_index":44,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-08T13:16:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"For a minimal aligned unit u∈ A , let τM(u) = (v1, . . . , vk) be its tokenizer-M realization, and define sM(u|x <t) = 1 k logp M(u|x <t) = 1 k kX j=1 logp M(vj |x <t, v<j).(7) The factor 1/k uses the average log-likelihood rather than the raw sequence log-likelihood, avoiding systematically penalizing units that require more tokens under a given tokenizer [44]. This normal- ization is a standard practical choice, not a requirement of the theoretical construction in §3.4. 
The normalized supervision distribution over the candidate set is then qSimCT M (u|x <t) = exp(sM(u|x <t))P u′∈USimCT exp(sM(u′ |x <t)) ,(8) which we denote as qSimCT M = Π SimCT M (pM). Importantly, qSimCT M should not be interpreted as a"},{"citing_arxiv_id":"2605.03185","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AI and the Research-Education Environment of Physics","primary_cat":"physics.ed-ph","submitted_at":"2026-05-04T22:00:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":1.0,"formal_verification":"none","one_line_summary":"A summary of expert opinions on AI's impact on the research-education environment in physics from a KITP discussion session.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02123","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Context-Aware Wireless Token Communication via Joint Token Masking and Detection","primary_cat":"eess.SP","submitted_at":"2026-05-04T01:06:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A joint token masking and detection scheme with masked language models improves token reconstruction over noisy wireless channels by up to 1.77x on Europarl and 1.63x on WikiText-103 compared to conventional methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25486","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ReTokSync: Self-Synchronizing Tokenization Disambiguation for Generative Linguistic Steganography","primary_cat":"cs.CR","submitted_at":"2026-04-28T10:42:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ReTokSync resolves tokenization ambiguity in generative linguistic 
steganography via targeted self-synchronizing resets, achieving over 99.7% extraction accuracy and 100% recovery with an auxiliary channel while matching baseline security and quality.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"08144 [31] Ruiyi Yan and Yugo Murawaki. 2025. Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models. InPro- ceedings of the 2025 Conference on Empirical Methods in Natural Language Pro- cessing. Association for Computational Linguistics, Suzhou, China, 7076-7098. doi:10.18653/v1/2025.emnlp-main.361 [32] Ruiyi Yan and Yugo Murawaki. 2025. Low-Overhead Disambiguation for Genera- tive Linguistic Steganography via Tokenization Consistency. InProceedings of the Thirty-first Annual Meeting of the Association for Natural Language Processing. The Association for Natural Language Processing, Nagasaki, Japan, 2053-2056. https://www.anlp.jp/proceedings/annual_meeting/2025/pdf_dir/Q5-1."},{"citing_arxiv_id":"2604.23627","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Neural Grammatical Error Correction for Romanian","primary_cat":"cs.CL","submitted_at":"2026-04-26T09:42:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new Romanian GEC corpus of 10k pairs plus pretraining a Transformer on artificial errors generated via POS tagger yields F0.5 of 53.76, beating the 44.38 baseline from training only on the corpus.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20030","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning to count small and clustered objects with application to bacterial 
colonies","primary_cat":"cs.CV","submitted_at":"2026-04-21T22:21:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ACFamNet Pro reaches 9.64% mean normalized absolute error on bacterial colony images under 5-fold cross-validation, beating FamNet by 12.71%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19144","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation","primary_cat":"cs.CL","submitted_at":"2026-04-21T06:48:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18327","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PARM: Pipeline-Adapted Reward Model","primary_cat":"cs.AI","submitted_at":"2026-04-20T14:29:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"For instance, Best-of-N sampling has demonstrated remarkable effectiveness in open-domain tasks like question answering and web search [3], [4], offering a simple yet powerful way to select high-quality responses. 
In machine translation, Beam Search-often paired with modified scoring functions-remains a standard technique for enhancing trans- lation quality [5], [6]. More recently, the use of MCTS with policy-driven LLMs has enabled the autonomous generation of high-quality training data, bypassing the need for exhaustive human annotation [7]. Collectively, these advances have es- tablished reward models as indispensable tools for optimizing LLM outputs in a diverse range of single-stage language tasks."},{"citing_arxiv_id":"2604.16037","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Stochasticity in Tokenisation Improves Robustness","primary_cat":"cs.CL","submitted_at":"2026-04-17T13:05:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Stochastic tokenisation during pre-training and fine-tuning improves LLM robustness to perturbations while preserving accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14053","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution","primary_cat":"cs.CL","submitted_at":"2026-04-15T16:32:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SA-BPE regularizes standard BPE training for code by incorporating source diversity to skip problematic merges, substantially cutting unused tokens without altering inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08028","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Comparative Study of Semantic Log Representations for Software Log-based Anomaly 
Detection","primary_cat":"cs.SE","submitted_at":"2026-04-09T09:30:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"QTyBERT matches or exceeds BERT-based log anomaly detection effectiveness while reducing embedding generation time to near static word embedding levels.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Yuqing Wang, Ying Song, Xiaozhou Li, Nana Reinikainen, and Mika V. Mäntylä. 2025. A Comparative Study of Semantic Log Representations for Software Log-based Anomaly Detection.Proc. ACM Softw. Eng.1, 1 (April 2025), 12 pages. https://doi.org/XXXXXXX.XXXXXXX 1 Introduction As modern software systems become increasingly complex, the potential for anomalies grows [44]. The anomalies may arise from various causes, e.g., misconfigurations, resource contention, or un- predictable workloads [11]. Even a small anomaly may compromise system reliability and performance [17]. Timely and effective anom- aly detection is critical to prevent anomalies from escalating into severe failures [11, 39]. 
Software logs record runtime information"},{"citing_arxiv_id":"2604.06789","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Video-guided Machine Translation with Global Video Context","primary_cat":"cs.CV","submitted_at":"2026-04-08T07:57:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A globally video-guided multimodal translation framework retrieves semantically related video segments with a vector database and applies attention mechanisms to improve subtitle translation accuracy in long videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.01444","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data","primary_cat":"cs.LG","submitted_at":"2026-03-02T04:47:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"ORiGAMi synthesizes sparse semi-structured mixed-type JSON data using path-encoded autoregressive tokenization and schema constraints, outperforming flattened tabular baselines on 17 of 18 fidelity, detection, and utility metrics while keeping privacy above 96%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.16294","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Space Filling Curves is All You Need: Communication-Avoiding Matrix Multiplication Made Simple","primary_cat":"cs.DC","submitted_at":"2026-01-22T19:56:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Space-filling curves enable platform- and shape-oblivious communication-avoiding matrix multiplication that outperforms vendor libraries by up to 5.5x on 
CPUs while also accelerating LLM prefill and distributed workloads.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.07285","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Non-Intrusive Automatic Speech Recognition Refinement: A Survey","primary_cat":"eess.AS","submitted_at":"2025-08-10T10:46:14+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey that classifies non-intrusive ASR refinement methods into five categories, reviews domain adaptation and evaluation datasets, proposes standardized metrics, and identifies future research directions.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"log p(y | x) + λT log ptarget LM (y) − λS log psource LM (y) # , (2) where ptarget LM and psource LM denote the external and source language model probabilities, respectively. DRM consistently outperforms shallow fusion in cross-domain evaluations [32]. 
4 Methods Correction Rescoring Fusion Distillation Training Adjustment Deep Shallow Cold [9], [27], [28], [29], [30], [31], [22], [32], [33], [34], [35], [36] [23], [37], [38] [39], [40], [41] [42], [43], [44], [45], [46], [47], [48], [49], [50], [13], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61] Neural LM Rule-based & N-gram LLM-based RAG-inclusive Encoder-based Second-pass Decoder [7], [62], [8], [63], [64] [65], [12], [66], [67],"},{"citing_arxiv_id":"2507.16632","ref_index":71,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Step-Audio 2 Technical Report","primary_cat":"cs.CL","submitted_at":"2025-07-22T14:23:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.01919","ref_index":56,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation","primary_cat":"cs.CL","submitted_at":"2025-04-02T17:26:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A literature survey that organizes prompting, fine-tuning, preference optimization, and context-aware techniques for LLM-based machine translation with emphasis on low-resource languages.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"tence without any examples. The model relies solely on its pre-trained knowledge to generate translations. 
ˆY=f θLLM (prompt, X) (1) Different approaches in zero-shot prompting vary by prompt format and the use of a pivot language for low-resource languages. It has been observed that zero-shot prompting with ChatGPT lags behind MT systems by Google MT [56], Tencent [57], and DeepL by around 5.0 BLEU points [58]. Pivot prompting has been explored to translate between distant languages, where the LLM first translates the sentence to English and then into the target language. This strategy, which uses a resource- rich language (English) as a pivot, improves translation quality between De→Zh and Ro→Zh [58]."},{"citing_arxiv_id":"2502.12187","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hallucinations are inevitable but can be made statistically negligible","primary_cat":"cs.CL","submitted_at":"2025-02-15T07:28:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hallucinations are inevitable on an infinite set of inputs but can be made statistically negligible with sufficient training data quality and quantity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.12917","ref_index":88,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Training Language Models to Self-Correct via Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2024-09-19T17:16:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.19427","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models","primary_cat":"cs.LG","submitted_at":"2024-02-29T18:24:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.17762","ref_index":93,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Massive Activations in Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-02-27T18:55:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2311.16867","ref_index":173,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Falcon Series of Open Language Models","primary_cat":"cs.CL","submitted_at":"2023-11-28T15:12:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset 
extracts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2311.04799","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DA-Cramming: Enhancing Cost-Effective Language Model Pretraining with Dependency Agreement Integration","primary_cat":"cs.CL","submitted_at":"2023-11-08T16:18:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DA-Cramming inserts chunk-level dependency agreement embeddings into a dual-stage pretraining pipeline and reports better downstream performance than prior Cramming baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.05737","ref_index":201,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation","primary_cat":"cs.CV","submitted_at":"2023-10-09T14:10:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.06180","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Efficient Memory Management for Large Language Model Serving with PagedAttention","primary_cat":"cs.LG","submitted_at":"2023-09-12T12:50:04+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior 
systems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ing in missed opportunities for optimization. Specialized serving systems for transformers. Due to the significance of the transformer architecture, numerous specialized serving systems for it have been developed. These systems utilize GPU kernel optimizations [1, 29, 31, 56], ad- vanced batching mechanisms [14, 60], model parallelism [1, 41, 60], and parameter sharing [ 64] for efficient serving. Among them, Orca [60] is most relevant to our approach. Comparison to Orca. The iteration-level scheduling in Orca [60] and PagedAttention in vLLM are complementary techniques: While both systems aim to increase the GPU utilization and hence the throughput of LLM serving, Orca achieves it by scheduling and interleaving the requests so"},{"citing_arxiv_id":"2309.00267","ref_index":85,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback","primary_cat":"cs.CL","submitted_at":"2023-09-01T05:53:33+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.15717","ref_index":159,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The False Promise of Imitating Proprietary LLMs","primary_cat":"cs.CL","submitted_at":"2023-05-25T05:00:12+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Finetuning open LMs on ChatGPT outputs creates models that mimic style and fool human raters but fail to close the performance gap to proprietary 
systems on tasks not well-represented in the imitation data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2303.18223","ref_index":247,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey of Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-03-31T17:28:46+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"By comparing the performance of models trained on the filtered and unfiltered corpus, they have reached the similar conclusion that pre-training LLMs on cleaned data can improve the model performance. More specifically, the duplication of data may result in \"double descent\" (referring to the phenomenon of performance initially deteriorating and subsequently improving) [234, 247], or even overwhelm the training process [234]. In addition, it has been shown that duplicate data degrades the ability of LLMs to copy from the context, which might further affect the generalization capacity of LLMs using in-context learning [234]. 
Therefore, as suggested in [56, 64, 78, 227], it is essential to utilize preprocessing methods like quality"},{"citing_arxiv_id":"2303.17564","ref_index":132,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BloombergGPT: A Large Language Model for Finance","primary_cat":"cs.LG","submitted_at":"2023-03-30T17:30:36+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2208.03299","ref_index":124,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Atlas: Few-shot Learning with Retrieval Augmented Language Models","primary_cat":"cs.CL","submitted_at":"2022-08-05T17:39:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2207.04672","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"No Language Left Behind: Scaling Human-Centered Machine Translation","primary_cat":"cs.CL","submitted_at":"2022-07-11T07:33:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A sparsely gated mixture-of-experts model trained on newly mined low-resource data achieves 44% relative BLEU improvement across 200 languages while adding human safety 
evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2206.10789","ref_index":94,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Scaling Autoregressive Models for Content-Rich Text-to-Image Generation","primary_cat":"cs.CV","submitted_at":"2022-06-22T01:11:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"munication with non-literate groups including language learners (including children, e.g., storybook illustrations), low-literacy social groups (e.g., up until the late modern period, religious illustrations for low-literacy congregations), and speakers of other languages. Parti uses an architecture and strategy that is directly connected to the neural sequence-to-sequence models used for machine translation [94] and other communication aids such as sentence simplification [95] and paraphrasing [96]. This potentially strengthens the temptation to use large text-to-image models to assist with communication. However, we caution against the use of text-to-image models as communication aids, including for education (cf. 
[89]), until further research has examined questions of efficacy and utility,"},{"citing_arxiv_id":"2202.08906","ref_index":208,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ST-MoE: Designing Stable and Transferable Sparse Expert Models","primary_cat":"cs.CL","submitted_at":"2022-02-17T21:39:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2112.09118","ref_index":96,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unsupervised Dense Information Retrieval with Contrastive Learning","primary_cat":"cs.IR","submitted_at":"2021-12-16T18:57:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2111.07832","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"iBOT: Image BERT Pre-Training with Online Tokenizer","primary_cat":"cs.CV","submitted_at":"2021-11-15T15:18:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"iBOT achieves 82.3% linear probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K using masked image modeling with a jointly trained online 
tokenizer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2104.05565","ref_index":139,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Survey on reinforcement learning for language processing","primary_cat":"cs.CL","submitted_at":"2021-04-12T15:33:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This survey reviews reinforcement learning applications to natural language processing problems, especially conversational systems, including problem descriptions, suitability of RL, advantages, limitations, and promising directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"with a one-armed bandit machine, without apriori knowledge of the reward distribution function of the bandit machine. Sokolov et al. [120] propose a structured prediction in SMT based on bandit feedback, called bandit expected loss minimization. This approach uses stochastic optimization for learning from partial feedback in the form of an expected 1-BLEU loss criterion [95], [139], as opposed to learning from a gold standard reference translation. This is a non-convex optimization problem, which they analyzed in the stochastic gradient method of pseudogradient adaptation [102] that allowed to show convergence of the algorithm. Nevertheless, the algorithm of Sokolov et al. [120] presents slow convergence. 
In other words, such a system needs many rounds of user feedback"},{"citing_arxiv_id":"2009.03393","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Generative Language Modeling for Automated Theorem Proving","primary_cat":"cs.LG","submitted_at":"2020-09-07T19:50:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}