BMD-45 is a new large-scale CCTV vehicle detection dataset from developing cities that reveals a 2.5x performance gap for models adapted from prior benchmarks.
Pp-ocrv3: More attempts for the im- provement of ultra lightweight ocr system
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
method 1polarities
use method 1representative citing papers
MNAFT identifies language-agnostic and language-specific neurons via activation analysis and selectively fine-tunes only relevant ones in MLLMs to close the modality gap and outperform full fine-tuning and other methods on image translation benchmarks.
StyleTextGen proposes a dual-branch style encoder, text style consistency loss, and mask-guided inference to achieve superior style consistency and cross-lingual performance in multilingual scene text generation on a new bilingual benchmark.
CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.
A proactive EMR assistant using streaming ASR and belief stabilization reaches 0.84 state-event F1, 0.87 retrieval Recall@5, and 83.3% coverage in a controlled pilot of ten doctor-patient dialogues.
PaddleOCR 3.0 releases compact open-source models for OCR, document structure parsing, and information extraction that rival billion-parameter VLMs.
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
citing papers explorer
-
BMD-45: A Large-Scale CCTV Vehicle Detection Dataset for Urban Traffic in Developing Cities
BMD-45 is a new large-scale CCTV vehicle detection dataset from developing cities that reveals a 2.5x performance gap for models adapted from prior benchmarks.
-
MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation
MNAFT identifies language-agnostic and language-specific neurons via activation analysis and selectively fine-tunes only relevant ones in MLLMs to close the modality gap and outperform full fine-tuning and other methods on image translation benchmarks.
-
StyleTextGen: Style-Conditioned Multilingual Scene Text Generation
StyleTextGen proposes a dual-branch style encoder, text style consistency loss, and mask-guided inference to achieve superior style consistency and cross-lingual performance in multilingual scene text generation on a new bilingual benchmark.
-
CogVLM2: Visual Language Models for Image and Video Understanding
CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.
-
A Proactive EMR Assistant for Doctor-Patient Dialogue: Streaming ASR, Belief Stabilization, and Preliminary Controlled Evaluation
A proactive EMR assistant using streaming ASR and belief stabilization reaches 0.84 state-event F1, 0.87 retrieval Recall@5, and 83.3% coverage in a controlled pilot of ten doctor-patient dialogues.
-
PaddleOCR 3.0 Technical Report
PaddleOCR 3.0 releases compact open-source models for OCR, document structure parsing, and information extraction that rival billion-parameter VLMs.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.