Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Recognition: 1 theorem link · Lean theorem
Pith reviewed 2026-05-16 09:46 UTC · model grok-4.3
The pith
Pretrained models can be composed zero-shot through multimodal prompting to exchange information and gain new multimodal capabilities without finetuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Socratic Models (SMs) form a modular framework in which multiple pretrained models may be composed zero-shot via multimodal-informed prompting to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, and they enable new applications such as answering free-form questions about egocentric video, engaging in multimodal assistive dialogue by interfacing with external APIs, and supporting robot perception and planning.
What carries the argument
Socratic Models: a modular framework that composes pretrained models zero-shot through multimodal-informed prompting so they exchange information across domains.
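To make the composition pattern concrete, here is a minimal sketch, assuming hypothetical `vlm_caption` and `lm_complete` stubs in place of real pretrained models; it illustrates the prompting glue the framework describes, not the paper's actual implementation.

```python
# Sketch of Socratic-Models-style composition: a VLM's textual output is
# spliced into an LM prompt so the two models exchange information in
# language. Both model calls are hypothetical stubs; in practice each
# would wrap a real pretrained VLM / LM.

def vlm_caption(image_path: str) -> str:
    """Stub for a zero-shot VLM: image -> descriptive text."""
    return "a person slicing tomatoes on a wooden cutting board"

def lm_complete(prompt: str) -> str:
    """Stub for a zero-shot LM: prompt -> continuation."""
    return "They are most likely preparing a salad."

def answer_about_image(image_path: str, question: str) -> str:
    # Multimodal-informed prompting: the VLM's language output becomes
    # context for the LM, with no finetuning of either model.
    caption = vlm_caption(image_path)
    prompt = (
        f"Scene description: {caption}\n"
        f"Question: {question}\n"
        f"Answer:"
    )
    return lm_complete(prompt)

if __name__ == "__main__":
    print(answer_about_image("kitchen.jpg", "What is this person making?"))
```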
If this is right
- Competitive performance with state-of-the-art zero-shot image captioning and video-to-text retrieval is achieved.
- Free-form questions about egocentric video can be answered by chaining vision and language models.
- Multimodal assistive dialogue becomes possible by letting the composed system call external APIs and databases.
- Robot perception and planning tasks can be handled through the same prompting-based composition.
Where Pith is reading between the lines
- The same prompting composition could extend to other modality pairs such as audio-language or tactile-language without new training runs.
- Error accumulation across long prompting chains might limit reliability on complex multi-step tasks: if each stage succeeds independently with probability p, an n-stage chain succeeds with roughly p^n (e.g., 0.9^5 ≈ 0.59).
- The framework suggests a route to more modular AI systems in which new capabilities are added by swapping one component model rather than retraining the whole system.
Load-bearing premise
That distinct capabilities stored in separately trained foundation models can be reliably accessed and combined through prompting alone, without finetuning or task-specific adaptation.
What would settle it
A controlled test in which a Socratic Model chain is given multimodal queries that require both visual recognition and symbolic reasoning. If the chain answers no better than the individual models used in isolation, the load-bearing premise fails; if it clearly outperforms them, the composition claim stands. A sketch of such a comparison follows.
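A minimal sketch of how that comparison could be scored, assuming hypothetical answer functions for the chain and for each model in isolation, plus a toy labeled query set:

```python
# Sketch of the falsification test: score a Socratic chain against its
# component models in isolation on queries that need both visual
# recognition and symbolic reasoning. All systems and data here are
# illustrative placeholders.

def accuracy(system, labeled_queries):
    """Fraction of queries the system answers exactly right."""
    return sum(system(q) == gold for q, gold in labeled_queries) / len(labeled_queries)

def compare(chain, vlm_alone, lm_alone, labeled_queries):
    scores = {
        "SM chain": accuracy(chain, labeled_queries),
        "VLM alone": accuracy(vlm_alone, labeled_queries),
        "LM alone": accuracy(lm_alone, labeled_queries),
    }
    best_single = max(scores["VLM alone"], scores["LM alone"])
    # The load-bearing premise fails if composition adds nothing.
    verdict = "premise holds" if scores["SM chain"] > best_single else "premise fails"
    return scores, verdict

if __name__ == "__main__":
    demo = [("how many red blocks?", "3"), ("what comes next: 2, 4, 8?", "16")]
    chain = lambda q: "3" if "blocks" in q else "16"   # perception + reasoning
    vlm = lambda q: "3" if "blocks" in q else "?"      # perception only
    lm = lambda q: "16" if "next" in q else "?"        # reasoning only
    print(compare(chain, vlm, lm, demo))
```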
Original abstract
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Socratic Models (SMs), a modular framework for composing multiple pretrained foundation models (VLMs, LMs, etc.) zero-shot via multimodal-informed prompting. This enables information exchange across models to capture new multimodal capabilities without finetuning. The work reports competitive performance on zero-shot image captioning and video-to-text retrieval, plus new applications in egocentric video QA, multimodal assistive dialogue with external APIs, and robot perception/planning.
Significance. If the results hold under rigorous controls, the work is significant for showing that complementary knowledge stored in separately trained foundation models can be combined through prompting to enable new tasks with minimal engineering. This modular approach could reduce the need for task-specific finetuning and support rapid prototyping in robotics, video understanding, and assistive systems.
Major comments (2)
- §4 (Experiments): The abstract states competitive results on captioning and retrieval, but the manuscript provides no full baseline tables, statistical significance tests, or error analysis for the zero-shot composition claim; this is load-bearing because the central assertion of reliable exchange without finetuning cannot be verified from the reported metrics alone.
- §3 (Method): The framework assumes text prompts suffice to transfer visual information (e.g., from VLM detections to LM planning), yet no quantitative bound or ablation on information loss (spatial/temporal/relational details) is provided; this directly affects the zero-shot property and the new applications such as egocentric video QA.
Minor comments (2)
- The phrase 'minimal engineering' in the abstract is used without concrete examples of prompt templates or API interfaces in the main text.
- Figure captions and method diagrams would benefit from explicit notation for the prompting flow between models.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We provide point-by-point responses to the major comments and outline the revisions to be made in the updated manuscript.
Point-by-point responses
- Referee, §4 (Experiments): The abstract states competitive results on captioning and retrieval, but the manuscript provides no full baseline tables, statistical significance tests, or error analysis for the zero-shot composition claim; this is load-bearing because the central assertion of reliable exchange without finetuning cannot be verified from the reported metrics alone.
  Authors: We acknowledge the need for more rigorous experimental validation. In the revised version, we will include full baseline tables with additional zero-shot methods, conduct statistical significance tests (e.g., McNemar's test for classification-like metrics or a paired bootstrap for others; see the sketch after this list), and provide an error analysis highlighting where the multimodal composition excels or falls short. This will better substantiate the zero-shot claims. Revision: yes.
- Referee, §3 (Method): The framework assumes text prompts suffice to transfer visual information (e.g., from VLM detections to LM planning), yet no quantitative bound or ablation on information loss (spatial/temporal/relational details) is provided; this directly affects the zero-shot property and the new applications such as egocentric video QA.
  Authors: We agree that ablations on information transfer are valuable. We will add experiments ablating the prompt content and VLM output types to measure effects on task performance. However, a general quantitative bound on information loss is not feasible without further assumptions on the models' internal representations, since the transfer is through natural language, which is inherently lossy for visual details. Revision: partial.
Not addressed: a general theoretical quantitative bound on information loss in the text-based transfer between models.
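For the promised significance testing, a paired bootstrap over per-example metric scores is one standard choice; the sketch below assumes `sm_scores` and `baseline_scores` are aligned per-example arrays (e.g., per-image CIDEr), with placeholder data standing in for real evaluation output.

```python
# Paired bootstrap over per-example metric scores (e.g., per-image
# CIDEr for the SM chain vs. a baseline). Resamples examples with
# replacement and counts how often the SM mean beats the baseline mean.

import numpy as np

def paired_bootstrap(sm_scores, baseline_scores, n_resamples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    sm = np.asarray(sm_scores, dtype=float)
    base = np.asarray(baseline_scores, dtype=float)
    n = len(sm)
    wins = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # one bootstrap resample
        if sm[idx].mean() > base[idx].mean():
            wins += 1
    # wins / n_resamples near 1.0 means the improvement is stable;
    # 1 - wins / n_resamples approximates a one-sided p-value.
    return wins / n_resamples

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    sm = rng.normal(1.05, 0.30, size=500)    # placeholder per-image scores
    base = rng.normal(1.00, 0.30, size=500)
    print(f"P(SM mean > baseline mean) = {paired_bootstrap(sm, base):.3f}")
```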
Circularity Check
No significant circularity: empirical composition of pretrained models via prompting
Full rationale
The paper introduces Socratic Models as a modular framework for zero-shot composition of existing foundation models (VLMs, LMs) through multimodal-informed prompting. No equations, fitted parameters, or derivations are present that reduce outputs to inputs by construction. The central claim rests on empirical demonstrations of new capabilities (egocentric QA, assistive dialogue, robot planning) rather than self-definitional steps, self-citation load-bearing premises, or renamed known results. Self-citations to prior model work are standard and non-circular per the guidelines, as the framework itself adds no fitted or definitional reduction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Large pretrained models exhibit distinct capabilities depending on the domain of data they are trained on.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tagged: unclear)
  Unclear: the relation between the paper passage and the cited Recognition theorem could not be established.
  Paper passage: "Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
  API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.
- Code as Policies: Language Model Programs for Embodied Control
  Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.
- GAIA: a benchmark for General AI Assistants
  GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
  VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
- Voyager: An Open-Ended Embodied Agent with Large Language Models
  Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
- Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
  Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.
- Flamingo: a Visual Language Model for Few-Shot Learning
  Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
- Building a Precise Video Language with Human-AI Oversight
  CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video gene...
- Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs
  Perception Programs rewrite dense visual tool outputs into language-native summaries, boosting MLLM accuracy by 15-45% absolute on BLINK perception tasks and setting new state-of-the-art results.
- Demystifying CLIP Data
  MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
- Cognitive Architectures for Language Agents
  CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic de...
- Improving Factuality and Reasoning in Language Models through Multiagent Debate
  Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
  MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
- Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents
  DEPS combines LLM-based interactive planning with a trainable goal selector to create a zero-shot multi-task agent that completes 70+ Minecraft tasks and nearly doubles prior performance.
- Inner Monologue: Embodied Reasoning through Planning with Language Models
  LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.
- Emergent Abilities of Large Language Models
  Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
- CoCa: Contrastive Captioners are Image-Text Foundation Models
  CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
- From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs
  SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.
- A Survey on Multimodal Large Language Models
  This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.