PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
hub Mixed citations
Datacomp: In search of the next generation of multimodal datasets
Mixed citation behavior. Most common role is background (40%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faster convergence.
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
Objaverse-XL supplies over 10 million diverse 3D objects that, when used to render 100 million views, improve zero-shot novel-view synthesis in models such as Zero123.
L2C2 is a deep RL framework that learns to clean tabular data by aligning it to the synthetic prior of tabular foundation models, yielding higher accuracy on some benchmarks and cross-dataset policy transfer.
QuiLL is a new evaluation pipeline that uses optimized LLM prompts, dynamic in-context learning from an NVD vector store, and a novel accuracy-plus-reasoning metric to benchmark vulnerability detection in real code.
Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
mPLUG-Owl2 presents a modular MLLM architecture that enables modality collaboration via shared functional modules and modality-adaptive components, achieving SOTA on both text and multi-modal tasks with one generic model.
A review of AI sustainability studies finds inconsistent life cycle definitions and predominant reliance on coarse CO2e proxies, with limited coverage of water, materials, and multi-impact assessments.
OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.
citing papers explorer
-
Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
-
MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faster convergence.
-
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
-
Objaverse-XL: A Universe of 10M+ 3D Objects
Objaverse-XL supplies over 10 million diverse 3D objects that, when used to render 100 million views, improve zero-shot novel-view synthesis in models such as Zero123.
-
Prior-Aligned Data Cleaning for Tabular Foundation Models
L2C2 is a deep RL framework that learns to clean tabular data by aligning it to the synthetic prior of tabular foundation models, yielding higher accuracy on some benchmarks and cross-dataset policy transfer.
-
QuiLL: An LLM-Based Vulnerability Assessment Framework for the Wild
QuiLL is a new evaluation pipeline that uses optimized LLM prompts, dynamic in-context learning from an NVD vector store, and a novel accuracy-plus-reasoning metric to benchmark vulnerability detection in real code.
-
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.
-
DataComp-LM: In search of the next generation of training sets for language models
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
-
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
-
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.
-
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
-
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
mPLUG-Owl2 presents a modular MLLM architecture that enables modality collaboration via shared functional modules and modality-adaptive components, achieving SOTA on both text and multi-modal tasks with one generic model.
-
From Cradle to Cloud: A Life Cycle Review of AI's Environmental Footprint
A review of AI sustainability studies finds inconsistent life cycle definitions and predominant reliance on coarse CO2e proxies, with limited coverage of water, materials, and multi-impact assessments.
-
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.
- ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison