Conversational Domain Adaptation of IndicTrans2 across 21 Indic Languages via Experience Replay and Model Soups

Aditya Pratap Singh

arxiv: 2606.29024 · v1 · pith:WN4YXNKMnew · submitted 2026-06-27 · 💻 cs.CL

Pith reviewed 2026-06-30 09:35 UTC · model grok-4.3

keywords: conversational domain adaptation, IndicTrans2, experience replay, model soups, Indic languages, machine translation, multilingual models, domain adaptation

The pith

Experience replay combined with model averaging lets IndicTrans2 handle conversational input across 21 languages while preserving general-domain accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

IndicTrans2-1B produces stiff output on casual conversation because it was trained on general text. Plain fine-tuning on public conversational corpora improves conversational scores but lowers performance on general benchmarks such as FLORES (by 3.9 chrF for Hindi). Adding experience replay, which mixes general data back into the fine-tuning set, and then model souping, which averages the resulting weights with the original base weights, removes that drop. The combined procedure raises conversational chrF in every one of the 21 Indic languages while leaving FLORES scores without significant degradation. The author treats the metric gains as register alignment rather than evidence of superior translation quality.

Core claim

IndicTrans2-1B can be adapted to conversational register across all 21 Indic languages by mixing general-domain data back into the fine-tuning process and then averaging the fine-tuned weights with the base model weights. This combination eliminates the typical trade-off where conversational improvements come at the expense of general-domain performance. The resulting models show higher chrF scores on conversational test sets in every language and statistically indistinguishable scores on the FLORES general-domain test set.
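The replay step described above can be sketched as a data-mixing function. This is a minimal illustration, not the paper's recipe: the function name `mix_with_replay`, the `replay_fraction` parameter, and the value 0.2 used in the example are assumptions, since the mixing ratio is not stated in this summary.

```python
import random


def mix_with_replay(conversational, general, replay_fraction, seed=0):
    """Return a shuffled training list in which about `replay_fraction`
    of the examples are replayed general-domain pairs.

    All names and the ratio are hypothetical; the source does not give them.
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (n_conv + n_replay) = replay_fraction for n_replay.
    n_replay = round(len(conversational) * replay_fraction / (1.0 - replay_fraction))
    # Never ask for more general examples than are available.
    n_replay = min(n_replay, len(general))
    replayed = rng.sample(general, n_replay)
    mixed = list(conversational) + replayed
    rng.shuffle(mixed)
    return mixed


# Toy usage: 80 conversational pairs plus enough general pairs to make up 20%.
conv_pairs = [("conv", i) for i in range(80)]
general_pairs = [("general", i) for i in range(100)]
train_set = mix_with_replay(conv_pairs, general_pairs, replay_fraction=0.2)
```

Sampling the replayed pairs once per run keeps the sketch simple; resampling them every epoch would be an equally plausible reading of "mixing general data back in".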

What carries the argument

Experience replay, which mixes general data into conversational fine-tuning, combined with model souping, which averages the fine-tuned weights with the original base weights.
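Weight averaging of this kind can be illustrated with plain Python dictionaries standing in for checkpoints. The sketch below assumes both checkpoints share parameter names and shapes; the function name `soup` and the interpolation coefficient `alpha` are hypothetical, and `alpha=0.5` (a plain average) is an assumption because the coefficient is not given in this summary.

```python
def soup(base_weights, tuned_weights, alpha=0.5):
    """Linearly interpolate two checkpoints parameter by parameter.

    `alpha` is the weight on the fine-tuned checkpoint; 0.5 is a plain average.
    Checkpoints are modelled as {parameter name: flat list of floats}.
    """
    if base_weights.keys() != tuned_weights.keys():
        raise ValueError("checkpoints must share parameter names")
    souped = {}
    for name, base_values in base_weights.items():
        tuned_values = tuned_weights[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        souped[name] = [
            (1.0 - alpha) * b + alpha * t
            for b, t in zip(base_values, tuned_values)
        ]
    return souped


# Toy checkpoints with two parameters each.
base = {"w": [0.0, 2.0], "b": [1.0]}
tuned = {"w": [1.0, 4.0], "b": [3.0]}
averaged = soup(base, tuned)
```

With a real model the same loop would run over the framework's parameter tensors rather than Python lists; only the interpolation rule carries over.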

If this is right

Conversational chrF rises in all 21 languages with a mean gain of 6.2 points.
FLORES scores stay within 0.7 chrF of the base model, with a mean change of -0.17.
Paired bootstrap tests confirm the conversational improvements are significant (p <= 0.004) while the FLORES changes are not.
The procedure uses only publicly available data sources and applies uniformly across the 21 languages.
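A paired bootstrap test of the kind mentioned in the list above can be sketched as follows. This is a simplified illustration under stated assumptions: it resamples per-segment scores and compares their sums, whereas a corpus-level chrF test would recompute the corpus score on every resample. The function name, the resample count, and the toy scores are all hypothetical.

```python
import random


def paired_bootstrap_p(base_scores, new_scores, n_resamples=2000, seed=0):
    """Fraction of resamples in which the new system does NOT beat the base.

    Both score lists must be aligned segment by segment (hence "paired").
    """
    if len(base_scores) != len(new_scores):
        raise ValueError("score lists must be aligned and equally long")
    rng = random.Random(seed)
    n = len(base_scores)
    diffs = [new - base for base, new in zip(base_scores, new_scores)]
    not_better = 0
    for _ in range(n_resamples):
        # Draw n segments with replacement and sum their score differences.
        resampled_total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if resampled_total <= 0:
            not_better += 1
    return not_better / n_resamples


# Toy data: the new system is 5 points better on every segment.
p_clear_win = paired_bootstrap_p([50.0] * 20, [55.0] * 20)
# Toy data: identical systems, so the new one never strictly wins.
p_tie = paired_bootstrap_p([50.0] * 20, [50.0] * 20)
```

A small value for the conversational test sets together with a large value on FLORES is the pattern the summary reports, though the exact test statistic used in the paper may differ from this sketch.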

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same replay-plus-souping sequence could be applied to other multilingual models facing domain-shift trade-offs.
Further tests could measure whether the adapted models reduce the need for separate general and conversational systems in production pipelines.
Extending the method to additional low-resource language pairs would test whether the observed pattern holds beyond the current set of 21 Indic languages.

Load-bearing premise

The public conversational corpora used for adaptation are representative enough of real user conversational language that gains on held-out splits will appear in deployed systems.

What would settle it

A blind human preference study on fresh conversational inputs in which the adapted model is not preferred over the base model would show that the chrF gains do not translate to perceived quality.

Figures

Figures reproduced from arXiv: 2606.29024 by Aditya Pratap Singh.

**Figure 1.** On Hindi, plain fine-tuning lifts conversational chrF but drops FLORES below the base. Replay keeps most of the general quality, and averaging with the base restores FLORES while staying ahead on conversation.

Lang | conv (delta) | FLORES (delta)
asm | 61.9 to 69.4 (+7.6) | 44.2 to 44.1 (-0.1)
ben | 68.7 to 73.6 (+4.8) | 58.1 to 58.1 (0.0)
brx | 62.1 to 68.0 (+5.9) | 45.9 to 46.0 (+0.1)
doi | 63.4 to 71.7 (+8.3) | 49.8 to 49.…

**Figure 2.** Conversational chrF gain for all 21 languages. Blue marks the five languages with a hard subtitle test set; these are the gains to trust. The large grey gains come from easy in-domain test sets (§7).

**Figure 3.** Each point is a language. Every language gains on conversation (all points above zero) while general quality stays inside a 0.7 chrF band around the base.

The general result is a tie, not a win. On FLORES the soup matches the base (Hindi 61.8 versus 61.7, all 21 languages within 0.7). So we claim that general quality is preserved, not improved. Part of the conversational gain is reference style matching. O…

Original abstract

IndicTrans2 is the strongest open English to Indic translation system, but like most systems it is trained on general text and tends to sound stiff on casual, conversational input. We adapt IndicTrans2-1B to conversational register across all 21 Indic languages using only public data (OpenSubtitles, BPCC-H-Daily, Tatoeba). Plain fine-tuning improves conversational chrF but forgets the general domain (it drops 3.9 chrF on FLORES for Hindi). Mixing general data back into training (experience replay) and then averaging the fine-tuned weights with the base (model souping) removes that trade-off: the resulting model beats IndicTrans2-1B on conversational chrF in every one of the 21 languages (mean +6.2) while matching it on FLORES (mean change -0.17, all within 0.7 chrF). Paired bootstrap tests confirm the conversational gains are significant (p <= 0.004) and that FLORES is not significantly degraded. We are deliberate about scope: these are chrF gains, and a blind human plus multi-model LLM check does not confirm them as a perceived quality improvement, so we treat the conversational gain as largely a register match to the references rather than proof of better translation. The techniques are not new; the contribution is the honest, end-to-end study in the Indic conversational setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Experience replay plus model souping adapts IndicTrans2-1B to conversational data across 21 languages while keeping FLORES scores nearly flat, though the gains read as register matching rather than quality lift.

The main point is that the author takes standard techniques (experience replay to mix general data back in during fine-tuning, then model souping to average weights with the base model) and shows they remove the usual forgetting trade-off on IndicTrans2-1B. Conversational chrF rises by a mean of 6.2 across all 21 languages, FLORES drops by only 0.17 on average with every language within 0.7 chrF, and paired bootstrap tests back the conversational gains at p ≤ 0.004 while the FLORES changes are not significant.

The work is transparent about using only public corpora (OpenSubtitles, BPCC-H-Daily, Tatoeba) and about the methods not being new. It also reports that a blind human plus multi-model LLM check does not register the improvements as better perceived quality, so the author treats the result as register alignment with the test references. That level of honesty is useful.

The soft spots are modest but real. The claim that these public conversational sets stand in for actual user input is assumed rather than tested, and without released code or full hyperparameter details the exact recipe is harder to replicate. The human-eval result is handled openly, but it does limit how strongly one can sell the adaptation as successful beyond metric matching.

This is a solid empirical application study for people working on Indic MT or low-cost domain adaptation of existing models. The evidence is direct measurements on held-out public data with appropriate tests, and the scope is stated clearly. It deserves peer review because the central pattern holds up and the limitations are not hidden.

Referee Report

0 major / 2 minor

Summary. The manuscript claims that combining experience replay (mixing general-domain data into conversational fine-tuning) with model souping (averaging the resulting weights with the base IndicTrans2-1B model) adapts the system to conversational register across all 21 Indic languages using only public corpora (OpenSubtitles, BPCC-H-Daily, Tatoeba). This yields a mean +6.2 chrF gain on held-out conversational test sets while preserving general-domain performance on FLORES (mean change -0.17, all changes within 0.7 chrF). Paired bootstrap tests confirm significance for the conversational gains (p <= 0.004) but not for FLORES degradation. The authors explicitly scope the result to register matching with references rather than perceived quality improvement, citing a blind human plus multi-model LLM evaluation that does not confirm the metric gains as quality advances.

Significance. If the empirical pattern holds, the work supplies a reproducible, public-data-only recipe for domain adaptation that avoids catastrophic forgetting in a 21-language Indic setting. Strengths include the consistent directional gains across every language, explicit paired bootstrap testing, and the deliberate scoping that distinguishes metric improvement from human-perceived quality. The contribution lies in the end-to-end documentation rather than methodological novelty.

minor comments (2)

The methods section should report the precise experience-replay mixing ratios, learning-rate schedules, and souping coefficients (or the procedure used to select them) so that the reported chrF deltas can be exactly reproduced.
A per-language table of conversational and FLORES chrF scores (with the base model, fine-tuned model, replay-only, and soup variants) would allow readers to verify the uniformity of the +6.2 mean gain beyond the aggregate statistic.
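To show how such a per-language table would support the aggregate figures, the sketch below summarizes the three complete rows that survive in the Figure 1 excerpt (asm, ben, brx). Reading the two values in each cell as base score and adapted score is an assumption, and the helper name `summarize` is hypothetical. Deltas recomputed from rounded scores can differ by 0.1 from the reported ones (for asm, 69.4 - 61.9 gives 7.5 against a reported +7.6).

```python
# (conv before, conv after, FLORES before, FLORES after) in chrF, copied from
# the three complete rows of the Figure 1 excerpt. Treating "before" as the
# base model and "after" as the adapted model is an assumption.
ROWS = {
    "asm": (61.9, 69.4, 44.2, 44.1),
    "ben": (68.7, 73.6, 58.1, 58.1),
    "brx": (62.1, 68.0, 45.9, 46.0),
}


def summarize(rows):
    """Aggregate per-language deltas into the statistics a reader would check."""
    conv_deltas = {lang: round(r[1] - r[0], 1) for lang, r in rows.items()}
    flores_deltas = {lang: round(r[3] - r[2], 1) for lang, r in rows.items()}
    return {
        "conv_deltas": conv_deltas,
        "flores_deltas": flores_deltas,
        "mean_conv_delta": round(sum(conv_deltas.values()) / len(rows), 1),
        "max_abs_flores_delta": max(abs(d) for d in flores_deltas.values()),
        "all_conv_positive": all(d > 0 for d in conv_deltas.values()),
    }


summary = summarize(ROWS)
for lang, delta in summary["conv_deltas"].items():
    print(f"{lang}: conv {delta:+.1f}, FLORES {summary['flores_deltas'][lang]:+.1f}")
```

These three rows are only a subset, so their mean says nothing about the reported +6.2 across all 21 languages; the point is that a full table would let a reader rerun exactly this check.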

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the consistent gains across all 21 languages, the explicit statistical testing, and the deliberate scoping of results to register matching rather than perceived quality. The recommendation of minor revision is noted. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central claims consist of direct empirical measurements: chrF scores on held-out splits from public conversational corpora (OpenSubtitles, BPCC-H-Daily, Tatoeba) and FLORES after applying experience replay plus model souping to IndicTrans2-1B. No equations, fitted parameters, or self-citations are used to derive the reported deltas (+6.2 mean conversational, -0.17 mean FLORES); the numbers are computed from independent test evaluations with bootstrap significance tests. The methods (experience replay, model souping) are explicitly described as non-novel, and the paper includes explicit scope caveats that the gains reflect register matching rather than proven quality improvement. The derivation chain is therefore self-contained empirical reporting with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study rests on standard machine-learning assumptions about data representativeness and metric validity rather than new mathematical constructs.

axioms (2)

domain assumption: chrF is an appropriate automatic metric for measuring both conversational register match and general-domain translation quality.
All reported gains and non-degradations are measured in chrF; the paper qualifies the conversational gains as register matching after human/LLM inspection.
domain assumption: The chosen public conversational corpora are suitable proxies for the target conversational register across 21 Indic languages.
Training and evaluation rely exclusively on OpenSubtitles, BPCC-H-Daily, and Tatoeba without additional validation of domain match.
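Because the first assumption rests on what chrF measures, a simplified character n-gram F-score may help make it concrete. The sketch below is not the implementation used in the paper (which this summary does not name) and will not reproduce scores from standard tools exactly; the function names, the n-gram order of 6, and beta = 2 are assumptions chosen to resemble common chrF settings.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF on a 0-100 scale.

    Precision and recall are averaged over n-gram orders that exist in both
    strings, then combined into an F-score that weights recall by beta^2.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return 100.0 * (1 + b2) * precision * recall / (b2 * precision + recall)


# A hypothesis that matches part of the reference scores between 0 and 100.
partial = simple_chrf("kaise ho", "kaise ho aap")
```

Since the score only counts surface overlap with the reference, a hypothesis that copies the reference's register can score higher without being a better translation, which is the caveat the ledger entry records.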
