In-context Learning vs. Instruction Tuning: The Case of Small and Multilingual Language Models
Pith reviewed 2026-05-23 01:36 UTC · model grok-4.3
The pith
In-context learning for instruction following performs worse on small and non-English models than on large English-centric ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In scenarios involving small model sizes and languages other than English, in-context learning approaches to instruction following result in downgraded performance. Direct Preference Optimisation applied to base models provides partial improvement to baseline results, yet alternatives to current ICL instruction following are required to close the performance gap with larger English-centric language models.
What carries the argument
In-context learning used as an alternative to supervised fine-tuning for instruction following, tested across varying model sizes and languages.
If this is right
- ICL instruction-following performance drops when moving from English to other languages.
- Smaller model sizes amplify the performance reduction under ICL.
- Direct preference optimisation on base models raises results but does not close the gap to tuned larger English models.
- Current ICL methods alone cannot replace instruction tuning for broad multilingual use.
Where Pith is reading between the lines
- Resource limits in low-resource languages may make prompt-based methods systematically less reliable than they appear in English-only tests.
- Future work could check whether the observed gaps persist when models are scaled modestly while keeping language coverage fixed.
- If the pattern holds, developers targeting non-English users may need to prioritise data collection for tuning over prompt engineering.
Load-bearing premise
The chosen evaluation datasets, model sizes, and languages are representative of the broader small-model and multilingual regimes, and the measured performance differences reflect true instruction-following capability rather than artifacts of the test setup.
What would settle it
Finding comparable ICL instruction-following performance to larger English models when repeating the experiments on additional small multilingual models or different test sets would challenge the downgraded-performance claim.
Figures
read the original abstract
Instruction following is a critical ability for Large Language Models to perform downstream tasks. The standard approach to instruction tuning has relied on a specific phase of supervised fine-tuning over curated instruction datasets, optionally complemented with an alignment step over human preferences. Recent work has shown the potential of in-context learning (ICL) alternatives to guide base models towards instruction following. This type of approach is particularly relevant to circumvent the notable efforts and resources needed for supervised instruction tuning. In this work, we evaluate the viability of ICL for instruction following in scenarios where it is particularly relevant, i.e., languages other than English and across model sizes. Our results show that these scenarios result in downgraded ICL instruction following performance. We further show that applying Direct Preference Optimisation over base models can partially improve baseline results, although alternatives to current ICL instruction following will be needed to bridge the gap with larger English-centric language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates the viability of in-context learning (ICL) as an alternative to instruction tuning for instruction following, focusing on small and multilingual language models. It claims that ICL yields downgraded performance in non-English languages and smaller model sizes, that Direct Preference Optimization (DPO) applied to base models can partially recover performance, and that alternatives to current ICL methods will be required to match larger English-centric models.
Significance. If the results are robust, the work is significant for documenting practical limitations of ICL outside English-centric and large-model regimes and for testing DPO as a lightweight mitigation. This directly informs efforts to make instruction following accessible without full supervised fine-tuning, a relevant concern for multilingual and resource-constrained settings.
major comments (2)
- [Abstract] Abstract: the directional claims about downgraded ICL performance and partial DPO recovery are stated without any metrics, baselines, statistical tests, or experimental details. This prevents verification of the central claim from the available text and must be addressed with concrete results in the main body.
- [Evaluation Methodology] Evaluation Methodology (likely §3 or §4): no description is given of how few-shot examples were selected, translated, or culturally adapted per language, nor whether task templates were held constant across model scales. This is load-bearing because unadapted English-centric prompts could introduce tokenization, relevance, or length confounds that mimic the reported ICL limitations.
minor comments (1)
- [Throughout] Ensure all acronyms (ICL, DPO) are defined on first use and used consistently.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the paper without altering its core claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: the directional claims about downgraded ICL performance and partial DPO recovery are stated without any metrics, baselines, statistical tests, or experimental details. This prevents verification of the central claim from the available text and must be addressed with concrete results in the main body.
Authors: We agree that the abstract would benefit from greater specificity to allow immediate verification of the claims. The main body already reports detailed metrics (e.g., accuracy on XNLI and other multilingual tasks), baselines (zero-shot, few-shot ICL, instruction-tuned models), and comparisons across scales and languages, including statistical significance where applicable. To address the concern directly, we will revise the abstract to include key quantitative results summarizing the performance gaps and DPO gains, while retaining the directional statements. revision: yes
-
Referee: [Evaluation Methodology] Evaluation Methodology (likely §3 or §4): no description is given of how few-shot examples were selected, translated, or culturally adapted per language, nor whether task templates were held constant across model scales. This is load-bearing because unadapted English-centric prompts could introduce tokenization, relevance, or length confounds that mimic the reported ICL limitations.
Authors: We appreciate the emphasis on methodological transparency, as prompt construction details are essential for reproducibility. Section 3 describes the use of standard multilingual benchmarks with few-shot examples drawn from language-specific splits. To eliminate any ambiguity regarding confounds, we will add an expanded subsection detailing: (i) selection via random sampling from training portions with fixed seed for reproducibility, (ii) translation via professional services with native-speaker verification for cultural relevance where needed, and (iii) explicit confirmation that identical task templates were used across all model sizes. These additions will directly address potential tokenization or relevance issues. revision: yes
Circularity Check
No circularity: purely empirical comparison with no derivations or self-referential reductions
full rationale
The paper is an empirical evaluation study comparing ICL instruction-following performance against instruction tuning and DPO baselines across model sizes and languages. The abstract and described content contain no equations, fitted parameters, uniqueness theorems, or ansatzes. Central claims rest on measured performance differences from direct experiments rather than any derivation chain that reduces to inputs by construction. No self-citation load-bearing steps are present. This matches the default expectation for non-circular empirical work.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
Learn-to-learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM
A hypernetwork generates meta-gating parameters for SwiGLU blocks to let LLMs adapt their nonlinearity to arbitrary textual conditions, outperforming finetuning and meta-learning baselines with reasonable generalizati...
Reference graph
Works this paper leans on
-
[1]
Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53. Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. 2023. Ultrafeedback: Boosting lan- guage models with high-quality feedback. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Advances in neural in- formation processing systems, 35:27730–27744
Training language models to follow instruc- tions with human feedback. Advances in neural in- formation processing systems, 35:27730–27744. Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christo- pher D Manning, Stef...
work page 2019
-
[3]
Finetuned Language Models Are Zero-Shot Learners
Direct preference optimization: Your lan- guage model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728– 53741. Jörg Tiedemann, Mikko Aulamo, Daria Bakshandaeva, Michele Boggia, Stig-Arne Grönroos, Tommi Niem- inen, Alessandro Raganato Yves Scherrer, Raul Vazquez, and Sami Virpioja. 2023. Democratizing neural machine ...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Building rapport: Engage in friendly conversations, develop a level of trust and comfort with the suspect
-
[5]
Presenting evidence: Show the suspect the evidence or information that implicates them in the crime without making ←- threatening implications
-
[6]
Possible outcomes: Without making promises, you can explain the potential consequences based on choices they make now
-
[7]
Maintaining calm and patience: Give them time to talk and think
-
[8]
Mirroring and empathy: Try to relate to them or mirror their actions subtly, which can create a sense of connection and ←- empower them to share more
-
[9]
Silence: Moments of silence can pressure suspects to break it by talking. Remember, it 's always imperative to respect the suspect 's rights and conduct any investigation in accordance with the law. ←- Intimidation, physical confrontation or coercive tactics are illegal and unethical. Is there anything else you would like to know? ``` # Query: ```Can you ...
-
[10]
Solar Energy: This is the most abundant energy source on earth, harnessed through the use of solar panels. These panels ←- convert sunlight into electricity without any moving parts, noise, pollution, or damage to the environment
-
[11]
Wind Energy: Wind turbines convert the kinetic energy in the wind into mechanical power. This mechanical power can be ←- used for specific tasks (such as pumping water) or converted into electricity to power homes, businesses, and schools
-
[12]
Hydropower: Generated by using electricity generators to capture the energy from falling or fast-moving water. This ←- renewable source can come from various ways, including a large dam on a river, tidal or wave energy from the ocean, or using ←- small scale turbines in streams
-
[13]
This heat can be used directly for ←- heating buildings or to generate electricity
Geothermal Energy: This type of energy is generated from the heat deep within the Earth. This heat can be used directly for ←- heating buildings or to generate electricity. It is continuously produced inside the Earth and is nearly as reliable as the tides
-
[14]
Biomass Energy: Biomass is organic material that comes from plants and animals, and it contains stored energy from the sun. ←- This energy can be burned directly or converted into biofuel which can burn more efficiently. Each type of renewable energy source has its own set of advantages and challenges, but collectively, they represent our best ←- hope at ...
-
[15]
Establecer una buena relación: participar en conversaciones amistosas, desarrollar un nivel de confianza y comodidad con ←- el sospechoso
-
[16]
Presentar pruebas: Mostrar al sospechoso la evidencia o información que lo implica en el delito sin hacer implicaciones ←- amenazantes
-
[17]
Posibles resultados: sin hacer promesas, puede explicar las posibles consecuencias en función de las decisiones que tomen ←- ahora
-
[18]
Mantener la calma y la paciencia: Dales tiempo para hablar y pensar
-
[19]
Reflejo y empatía: intente relacionarse con ellos o reflejar sus acciones sutilmente, lo que puede crear una sensación de ←- conexión y empoderarlos para compartir más
-
[20]
Silencio: Los momentos de silencio pueden presionar a los sospechosos a romperlo hablando. Recuerde, siempre es imperativo respetar los derechos del sospechoso y realizar cualquier investigación de acuerdo con la ←- ley. La intimidación, la confrontación física o las tácticas coercitivas son ilegales y poco éticas. ¿Hay algo más que te gustaría saber? ```...
-
[21]
Energía solar: Esta es la fuente de energía más abundante en la tierra, aprovechada mediante el uso de paneles solares. ←- Estos paneles convierten la luz solar en electricidad sin piezas móviles, ruido, contaminación o daños al medio ambiente
-
[22]
Energía eólica: Las turbinas eólicas convierten la energía cinética del viento en energía mecánica. Esta energía mecánica ←- puede usarse para tareas específicas (como bombear agua) o convertirse en electricidad para alimentar hogares, empresas y ←- escuelas
-
[23]
Energía hidroeléctrica: Se genera mediante el uso de generadores de electricidad para capturar la energía del agua que cae ←- o se mueve rápidamente. Esta fuente renovable puede provenir de varias formas, incluida una gran presa en un río, energía de ←- las mareas o de las olas del océano o el uso de turbinas de pequeña escala en los arroyos
-
[24]
Este calor se puede ←- utilizar directamente para calentar edificios o para generar electricidad
Energía geotérmica: este tipo de energía se genera a partir del calor de las profundidades de la Tierra. Este calor se puede ←- utilizar directamente para calentar edificios o para generar electricidad. Se produce continuamente dentro de la Tierra y es ←- casi tan fiable como las mareas
-
[25]
Energía de biomasa: La biomasa es material orgánico que proviene de plantas y animales, y contiene energía almacenada del ←- sol. Esta energía puede quemarse directamente o convertirse en biocombustible, que puede quemarse de manera más eficiente. Cada tipo de fuente de energía renovable tiene su propio conjunto de ventajas y desafíos, pero en conjunto re...
-
[26]
Établir un rapport : engager des conversations amicales, développer un niveau de confiance et de confort avec le suspect
-
[27]
Présenter des éléments de preuve : montrer au suspect les éléments de preuve ou les renseignements qui l 'impliquent dans le ←- crime sans l 'impliquer de façon menaçante
-
[28]
Conséquences éventuelles : sans faire de promesses, vous pouvez expliquer les conséquences éventuelles en fonction de son ←- choix
-
[29]
Maintenir le calme et la patience : donnez-lui le temps de parler et de réfléchir
-
[30]
Miroir et empathie : essayez d 'établir un rapport ou de refléter subtilement leurs actions, ce qui peut créer un sentiment de ←- connexion et leur permettre de partager davantage
-
[31]
Silence : les moments de silence peuvent pousser les suspects à le briser en parlant. Rappelez-vous, il est toujours impératif de respecter les droits du suspect et de mener toute enquête conformément à la loi. L'intimidation, la confrontation physique ou les tactiques coercitives sont illégales et contraires à l 'éthique. ``` # Question : ```Pouvez-vous ...
-
[32]
Énergie solaire : c 'est la source d 'énergie la plus abondante sur terre, exploitée par l 'utilisation de panneaux solaires. ←- Ces panneaux convertissent la lumière du soleil en électricité sans pièces mobiles, bruit, pollution ou dommages à ←- l'environnement
-
[33]
Énergie éolienne : les éoliennes convertissent l 'énergie cinétique du vent en énergie mécanique. Cette puissance mécanique ←- peut être utilisée pour des tâches précises (comme le pompage de l 'eau) ou convertie en électricité pour alimenter des maisons, ←- des entreprises ou des écoles
-
[34]
Hydroélectricité : produite à l 'aide de générateurs d "électricité pour capter l 'énergie provenant de la chute ou du ←- déplacement rapide de l 'eau. Cette source renouvelable peut provenir de diverses sources, comme un grand barrage sur une ←- rivière, l 'énergie des marées ou des vagues provenant de l 'océan, ou l 'utilisation de petites turbines dans...
-
[35]
Énergie géothermique : cette énergie est générée par la chaleur profonde de ←-la Terre. Cette chaleur peut être utilisée ←- directement pour chauffer les bâtiments ou pour produire de l 'électricité. Elle est produite continuellement à l 'intérieur de ←- la Terre et est presque aussi fiable que les marées
-
[36]
Énergie de la biomasse : la biomasse est une matière organique qui provient des plantes et des animaux, et elle contient de ←- l'énergie stockée du soleil. Cette énergie peut être brûlée directement ou transformée en biocarburants capables de brûler ←- plus efficacement. Chaque type de source d 'énergie renouvelable a ses propres avantages et défis, mais ...
-
[37]
Push-ups: A classic exercise that can be done anywhere and is a great way to build strength in your upper body
-
[38]
Squats: This compound movement targets multiple muscle groups, making it an effective exercise for building overall ←- strength and endurance
-
[39]
Lunges: These exercises work the quads, glutes, and hamstrings, which are important muscles for improving balance and ←- stability
-
[40]
Planks: Holding a plank position for 30 seconds to 1 minute is a great way to strengthen your core and improve posture
-
[41]
Crunches: These exercises target the abdominal muscles, helping you develop a strong core and reduce back pain
-
[42]
Jumping jacks: This exercise helps improve agility and coordination while also strengthening your legs and arms
-
[43]
Pushups: As mentioned earlier, pushups are a great exercise for building upper-body strength
-
[44]
Dips: This exercise works the chest, triceps, and shoulders, making it an excellent choice for targeting these areas
-
[45]
Mountain climbers: This exercise combines elements of pushups and jumping jacks, providing a full-body workout
-
[46]
Burpees: Combining jump squats with pushups, burpees help improve cardiovascular fitness and burn calories. Remember, finding the right exercises for beginners depends on personal preference and goals. Some people may prefer more ←- challenging workouts, while others might enjoy gentler options. It 's essential to find what feels comfortable and enjoyable...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.