arXiv preprint arXiv:2501.05453

5 Pith papers cite this work. Polarity classification is still indexing.

citation summary

years: 2026 (1) · 2025 (4)

verdicts: UNVERDICTED (5)

roles: background (1)

polarities: background (1)

representative citing papers

Uncovering the Latent Potential of Deep Intermediate Representations

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Introduces LOES, a constructive spectral method to select task-discriminative subspaces from intermediate layer embeddings, and GeoReg for enforcing simplicial class geometry during fine-tuning, with reported gains increasing with model depth across modalities.

Frozen Forecasting: A Unified Evaluation

cs.CV · 2025-07-18 · unverdicted · novelty 6.0

A new evaluation framework using latent diffusion on frozen vision backbones shows video-pretrained models consistently outperform image-based ones in forecasting entire trajectories across abstraction levels.

citing papers explorer

  • Uncovering the Latent Potential of Deep Intermediate Representations · cs.LG · 2026-05-21 · unverdicted · none · ref 48

    Introduces LOES, a constructive spectral method to select task-discriminative subspaces from intermediate layer embeddings, and GeoReg for enforcing simplicial class geometry during fine-tuning, with reported gains increasing with model depth across modalities.

  • Frozen Forecasting: A Unified Evaluation · cs.CV · 2025-07-18 · unverdicted · none · ref 33

    A new evaluation framework using latent diffusion on frozen vision backbones shows video-pretrained models consistently outperform image-based ones in forecasting entire trajectories across abstraction levels.

  • V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning · cs.AI · 2025-06-11 · unverdicted · none · ref 45

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.

  • SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics · cs.LG · 2025-06-02 · unverdicted · none · ref 34

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  • Perception Encoder: The best visual embeddings are not at the output of the network · cs.CV · 2025-04-17 · unverdicted · none · ref 107

    Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.