Reading Between the Lines: The One-Sided Conversation Problem

arxiv: 2511.03056 · v2 · submitted 2025-11-04 · 💻 cs.CL · cs.AI · cs.LG

Victoria Ebert, Rishabh Singh, Tuochao Chen, Noah A. Smith, Shyamnath Gollakota

Pith reviewed 2026-05-18 00:35 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords one-sided conversations · dialogue reconstruction · conversation summarization · large language models · prompting · finetuning · privacy-preserving AI · missing turns

The pith

Models reconstruct missing turns in one-sided conversations when given one future turn and utterance length information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines the one-sided conversation problem as the task of inferring content from only one recorded speaker in settings where both sides cannot be captured. It examines two practical goals: filling in the unseen speaker's contributions for immediate use and creating summaries straight from the partial transcript. Experiments across standard dialogue collections show that supplying a single upcoming turn plus length details raises the accuracy of reconstructions. Inserting placeholder markers in prompts cuts down on invented content, large models handle the task through prompting, and smaller models gain from additional training. Summaries reach good quality even when the reconstruction step is omitted entirely.

Core claim

We formalize the one-sided conversation problem (1SC) and evaluate prompting and finetuned models on reconstructing the missing speaker's turns and on generating summaries from one-sided transcripts. On MultiWOZ, DailyDialog, and Candor, access to one future turn together with utterance length information improves reconstruction quality. Placeholder prompting reduces hallucination, large models produce promising results with prompting alone while smaller models require finetuning, and high-quality summaries can be generated without first reconstructing the missing turns.
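To make the context conditions concrete, here is a minimal sketch of how a one-sided transcript could be rendered for a reconstruction prompt. It is not the authors' code: the `Turn` class, the `[MISSING]` marker, the word-count hint format, and the function name are all assumptions made for illustration. It also assumes that earlier turns from the hidden speaker stay masked rather than being replaced by earlier predictions.

```python
from dataclasses import dataclass
from typing import List

MASK = "[MISSING]"  # hypothetical marker for the unrecorded speaker's turns


@dataclass
class Turn:
    speaker: str
    text: str


def build_one_sided_context(
    turns: List[Turn],
    target_idx: int,
    hidden_speaker: str,
    use_future: bool = True,
    use_length: bool = True,
) -> str:
    """Render turns up to target_idx, hiding every turn by hidden_speaker.

    use_future appends the single turn after the target (if one exists).
    use_length adds a word-count hint to each hidden turn.
    """
    last = target_idx + 1 if use_future else target_idx
    last = min(last, len(turns) - 1)
    lines = []
    for i, turn in enumerate(turns[: last + 1]):
        if turn.speaker == hidden_speaker:
            body = MASK
            if use_length:
                body += f" (~{len(turn.text.split())} words)"
            if i == target_idx:
                body += " <- predict this turn"
        else:
            body = turn.text
        lines.append(f"Turn {i + 1} [{turn.speaker}]: {body}")
    return "\n".join(lines)
```

Calling it with `use_future=False` gives the strictly causal condition, in which nothing after the target turn is visible.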

What carries the argument

The one-sided conversation problem (1SC), which frames inference and learning from a single recorded side of a dialogue as the core task, carried by mechanisms of future-turn access, length metadata, and placeholder prompting.

If this is right

Reconstruction of missing turns becomes feasible for real-time applications when limited future context is available.
Placeholder tokens in prompts measurably lower the rate of fabricated content in generated turns.
Large models can perform reconstruction via prompting without task-specific training.
Conversation summaries can be produced at high quality while skipping the reconstruction stage.
The approach supports privacy-aware systems in domains where only one side of speech is recordable.
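The placeholder claim can be spot-checked with a crude heuristic. The sketch below flags digit-bearing tokens (times, prices, reference codes) that appear in a predicted turn but nowhere in the visible context. This is an illustrative proxy only, not the paper's hallucination metric; the function name and the regular expression are assumptions, and it ignores fabricated names or places.

```python
import re
from typing import List

# Matches runs such as "18:30", "12.50" or "48213", or a single digit.
_NUMBER_LIKE = re.compile(r"\d[\d:.,/-]*\d|\d")


def ungrounded_numbers(prediction: str, visible_context: str) -> List[str]:
    """Return number-like tokens in prediction that never occur in visible_context."""
    seen = set(_NUMBER_LIKE.findall(visible_context))
    return [tok for tok in _NUMBER_LIKE.findall(prediction) if tok not in seen]
```

A prediction that writes a placeholder instead of a made-up reference code yields an empty list, while an invented code is returned for inspection.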

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same limited-context techniques could apply to other asymmetric audio capture settings such as wearable devices or single-mic meetings.
Adding uncertainty estimates to reconstructions would help decide when to trust the filled-in turns for downstream actions.
Direct summarization from one side may extend naturally to domains with incomplete speaker coverage beyond the tested datasets.

Load-bearing premise

The chosen datasets capture the structure and content of real-world one-sided conversations that arise in privacy-restricted environments.

What would settle it

Human raters would judge reconstructions produced with one future turn and length information as no better than those produced without that information, or one-sided summaries would receive lower quality ratings than summaries built from complete transcripts in direct comparison.

Figures

Figures reproduced from arXiv: 2511.03056 by Noah A. Smith, Rishabh Singh, Shyamnath Gollakota, Tuochao Chen, Victoria Ebert.

**Figure 1.** We introduce the one-sided conversation (1SC) problem: making inferences from only one side of a conversation transcript. We focus on reconstruction of the missing content and creating summaries of the whole one-sided conversation.

**Figure 2.** Examples of different levels of context we con…

**Figure 3.** Using our extraction based metrics, we show…

**Figure 4.** Example cases of our evaluation rubric for…

**Figure 5.** Human evaluation summary results. The masked-dialogue summary either outperformed the reconstructed-dialogue summary (DailyDialog) or performed similarly (MultiWOZ).

**Figure 6.** Results for summary evaluation. n = 1000 for DailyDialog, n = 1313 for MultiWOZ, n = 5 for Candor. Masked-dialogue summaries were consistently ranked above predicted-dialogue summaries, and had higher precision and comparable recall.

**Figure 7.** The results of our rubric evaluation on DailyDialog and MultiWOZ. Legend: Full, Masked, Predicted.
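The captions refer to precision and recall over extracted details. As a rough illustration of that arithmetic, the sketch below compares two lists of detail strings by exact match after lowercasing and trimming. The extraction step itself (done by an LLM judge in the paper) is out of scope here, and the exact-match rule, the function name, and the convention of returning 0.0 for an empty list are assumptions.

```python
from typing import Iterable, Tuple


def detail_precision_recall(
    actual_details: Iterable[str], predicted_details: Iterable[str]
) -> Tuple[float, float]:
    """Precision and recall of predicted details against actual details.

    Details are compared by exact match after stripping and lowercasing.
    """
    actual = {d.strip().lower() for d in actual_details}
    predicted = {d.strip().lower() for d in predicted_details}
    matched = actual & predicted
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(actual) if actual else 0.0
    return precision, recall
```

Under this reading, a summary that replaces unknown specifics with placeholders can keep precision high, since it asserts fewer details that could be wrong, while recall depends on how many true details it still recovers.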

Original abstract

Conversational AI is constrained in many real-world settings where only one side of a dialogue can be recorded, such as telemedicine, call centers, and smart glasses. We formalize this as the one-sided conversation problem (1SC): inferring and learning from one side of a conversation. We study two tasks: (1) reconstructing the missing speaker's turns for real-time use cases, and (2) generating summaries from one-sided transcripts. Evaluating prompting and finetuned models on MultiWOZ, DailyDialog, and Candor with both human A/B testing and LLM-as-a-judge metrics, we find that access to one future turn and information about utterance length improves reconstruction, placeholder prompting helps to mitigate hallucination, and while large models generate promising reconstructions with prompting, smaller models require finetuning. Further, high-quality summaries can be generated without reconstructing missing turns. We present 1SC as a novel challenge and report promising results that mark a step toward privacy-aware conversational AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper names the one-sided conversation problem and reports that future turns plus length info help reconstruction while summaries can skip full rebuilds, but the real-time motivation does not match the offline experiments.

The punchline is that this paper carves out the one-sided conversation problem as its own thing and runs experiments showing that one future turn plus length information boosts reconstruction quality, while summaries work okay without full reconstruction. They handle the evaluation decently by using three datasets and mixing human A/B tests with LLM judges. The placeholder prompting idea to cut down on hallucinations looks like a useful practical tweak, and separating the reconstruction from summarization tasks makes sense. It's also clear that bigger models do fine with just prompting but smaller ones need training.

The main soft spot is the mismatch on real-time use. The abstract frames reconstruction for real-time cases, but the results rely on giving the model a future turn, which you wouldn't have in a live streaming setup. That means the improvements are for a delayed or batch setting, not the immediate one the motivation suggests. The datasets are fine for starters but probably don't match the messier, domain-heavy talks in actual call centers or medical visits.

This is for people working on dialogue systems that have to deal with partial recordings due to privacy rules. A reader focused on applied conversational AI in restricted settings would pick up some useful trends here. It has enough substance to go to a serious referee. I'd recommend putting it through peer review, mainly to get feedback on tightening the real-time claims and perhaps adding experiments that stay within past and current information only.

Referee Report

3 major / 2 minor

Summary. The paper formalizes the one-sided conversation (1SC) problem for settings such as telemedicine and call centers where only one side of a dialogue is recorded. It evaluates two tasks—reconstructing missing speaker turns and generating summaries from one-sided transcripts—using prompting and finetuning on MultiWOZ, DailyDialog, and Candor datasets. Evaluations combine human A/B testing and LLM-as-a-judge metrics, with reported findings that access to one future turn plus utterance-length information improves reconstruction, placeholder prompting mitigates hallucination, large models succeed with prompting while smaller models require finetuning, and high-quality summaries can be produced without full reconstruction.

Significance. If the empirical trends hold under closer scrutiny of prompt details and statistical tests, the work provides a useful empirical baseline for privacy-aware dialogue systems. The use of three public datasets together with dual human/LLM evaluation is a strength that supports reproducibility and allows direct comparison of prompting versus finetuning regimes.

major comments (3)

Abstract and §1 (motivation): the central claim that the reconstruction results address 'real-time use cases' is undercut by the experimental provision of one future turn. In a genuine streaming one-sided setting only the recorded side up to the current time is observable; future turns are unavailable by definition. The reported gains therefore characterize an offline or delayed regime rather than the real-time regime invoked in the problem statement.
Experimental setup (reconstruction task): the paper does not report an ablation that removes all future information while retaining only past and current recorded-side context. Without this control, it is unclear how much of the reported improvement is attributable to the one-sided constraint versus the leakage of future context.
§4 (evaluation): while dual human/LLM metrics are used, the manuscript does not report inter-annotator agreement, exact prompt templates for the LLM judge, or statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) on the A/B preference differences. These omissions make it difficult to assess whether the observed trends are robust.
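As an example of the kind of significance reporting the third comment asks for, a percentile bootstrap interval on an A/B preference rate might look like the sketch below. It is a generic illustration using only the Python standard library, not the authors' analysis; the 0/1 encoding of preferences, the function name, and the default of 2000 resamples are assumptions.

```python
import random
from typing import List, Tuple


def bootstrap_preference_ci(
    prefs: List[int], n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for the share of comparisons won by system A.

    prefs holds one entry per A/B comparison: 1 if A was preferred, 0 if B.
    Returns (observed_rate, ci_low, ci_high).
    """
    if not prefs:
        raise ValueError("prefs must not be empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(prefs)
    rates = sorted(sum(rng.choices(prefs, k=n)) / n for _ in range(n_resamples))
    low = rates[int((alpha / 2) * n_resamples)]
    high = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(prefs) / n, low, high
```

If the interval excludes 0.5, the preference for one system is unlikely to be a sampling artifact at the chosen level. Ties and multiple ratings per item would need separate handling that this sketch omits.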

minor comments (2)

The weakest-assumption note in the reader report—that MultiWOZ, DailyDialog, and Candor may not represent real-world one-sided distributions in telemedicine or call centers—is a scope limitation rather than a correctness error; a brief discussion of domain shift would strengthen the paper.
Notation for 'utterance length' and 'placeholder prompting' should be defined once in a dedicated subsection rather than introduced inline.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important clarifications needed in our presentation of the one-sided conversation problem. We address each major comment below and will incorporate revisions to improve the manuscript's precision and rigor.

Referee: Abstract and §1 (motivation): the central claim that the reconstruction results address 'real-time use cases' is undercut by the experimental provision of one future turn. In a genuine streaming one-sided setting only the recorded side up to the current time is observable; future turns are unavailable by definition. The reported gains therefore characterize an offline or delayed regime rather than the real-time regime invoked in the problem statement.

Authors: We agree that providing access to one future turn means the reported reconstruction results apply to a delayed or offline regime rather than a strictly streaming real-time setting. In the revised manuscript we will update the abstract and Section 1 to explicitly distinguish these regimes, clarify that the one-future-turn condition approximates practical scenarios with minimal buffering (e.g., call-center recordings), and add a discussion of the additional challenges posed by purely real-time constraints with no future context.

revision: yes

Referee: Experimental setup (reconstruction task): the paper does not report an ablation that removes all future information while retaining only past and current recorded-side context. Without this control, it is unclear how much of the reported improvement is attributable to the one-sided constraint versus the leakage of future context.

Authors: This is a valid concern. We will add a new ablation experiment that uses only past and current recorded-side context with no future turns at all. This control condition will allow us to isolate the contribution of the one-sided constraint itself from any benefit due to future-context leakage and will be reported alongside the existing results.

revision: yes

Referee: §4 (evaluation): while dual human/LLM metrics are used, the manuscript does not report inter-annotator agreement, exact prompt templates for the LLM judge, or statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) on the A/B preference differences. These omissions make it difficult to assess whether the observed trends are robust.

Authors: We acknowledge these evaluation details were omitted. In the revision we will report inter-annotator agreement for the human A/B tests, include the exact LLM-judge prompt templates in an appendix, and add statistical significance testing (paired t-tests and/or bootstrap confidence intervals) on the preference differences to substantiate the robustness of the observed trends.

revision: yes
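For the inter-annotator agreement promised in the last response, Cohen's kappa is one common choice when two raters label the same A/B items. The sketch below is a plain-Python illustration of that statistic. Whether the authors will use kappa or another measure is not stated, so treat the choice and the function name as assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Share of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

With more than two raters per item, a multi-rater statistic such as Fleiss' kappa would be the usual substitute.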

Circularity Check

0 steps flagged

No circularity in empirical evaluation on public benchmarks

The paper presents an empirical study of prompting and finetuning for one-sided conversation reconstruction and summarization on the public datasets MultiWOZ, DailyDialog, and Candor. No equations, derivations, or self-defined parameters appear; all reported improvements (e.g., gains from one future turn or placeholder prompting) are measured outcomes on held-out data rather than quantities fitted or renamed by construction. Claims rest on standard experimental protocols and external metrics (human A/B tests, LLM judges) without load-bearing self-citations or ansatzes that reduce the result to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard NLP assumptions about dialogue predictability and the reliability of LLM judges rather than new free parameters or invented entities.

axioms (2)

domain assumption: Dialogue turns exhibit predictable structure and length patterns that can be inferred from one side plus limited context.
Invoked when claiming that one future turn and utterance length improve reconstruction.

domain assumption: LLM-as-a-judge metrics align sufficiently with human judgments for evaluating reconstruction and summary quality.
Used to support the reported performance findings alongside human A/B testing.

pith-pipeline@v0.9.0 · 5717 in / 1353 out tokens · 38085 ms · 2026-05-18T00:35:53.817347+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We formalize this as the one-sided conversation problem (1SC): inferring and learning from one side of a conversation. We study two tasks: (1) reconstructing the missing speaker's turns for real-time use cases, and (2) generating summaries from one-sided transcripts.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: access to one future turn and information about utterance length improves reconstruction, placeholder prompting helps to mitigate hallucination

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
