OCR Post Correction for Endangered Language Texts

Rei, R · 2020 · DOI 10.18653/v1/2020.emnlp-

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Evolution-Aware Regression Test Prioritization of ML-Enabled Systems Using Gradient-Based Behavior Vectors

cs.SE · 2026-06-26 · unverdicted · novelty 7.0

GBV-PD projects gradient-based behavior vectors of test cases and the parameter delta of an evolved model onto a shared PCA basis to estimate directional loss changes without executing the new model.

Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding

cs.CV · 2026-05-19 · conditional · novelty 7.0

Injecting pre-computed layout priors from RT-DETR into VLM prompts raises markdown F1 from 0.37 to 0.92 on a 10k-page OOD benchmark and cuts infinite-loop failures across domains.

MedStruct-S: A Benchmark for Key Discovery, Key-Conditioned QA and Semi-Structured Extraction from OCR Clinical Reports

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

MedStruct-S benchmark shows encoder-only models outperform larger decoder-only ones on key-conditioned QA from noisy OCR clinical reports, with fine-tuned large models winning only when scale is ignored.

Spectral Tempering for Embedding Compression in Dense Passage Retrieval

cs.IR · 2026-03-19 · unverdicted · novelty 7.0

Spectral Tempering derives an adaptive scaling factor γ(k) from the embedding eigenspectrum via local SNR analysis and knee-point normalization to achieve near-optimal compression without training or validation.

IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

IPO-Mine releases a toolkit and large multimodal dataset for structured analysis of IPO filings and shows state-of-the-art models diverge from human judgments on chart quality and misleadingness.

Entities as Retrieval Signals: A Systematic Study of Coverage, Supervision, and Evaluation in Entity-Oriented Ranking

cs.IR · 2026-04-06 · conditional · novelty 6.0

Entity signals cover only 19.7% of relevant documents on Robust04 and no configuration among 443 systems improves MAP by more than 0.05 in open-world evaluation, despite gains when entities are pre-restricted.

Bridging Scientific Heritage: An Arabic-Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer

cs.CL · 2026-06-29 · unverdicted · novelty 4.0

A new 27k-sentence Arabic-Russian parallel corpus supports fine-tuned LLM translation benchmarks that improve BLEU by 4.36 and COMET by 0.051 over zero-shot baselines for scientific content.

UniTranslator: A Unified Multi-modal Framework for End-to-end In-Image Machine Translation

cs.CV · 2026-06-23 · unverdicted · novelty 4.0

UniTranslator adds an Understand-Generation Alignment Module and Spatial Mask Decoder to a unified multimodal model to fix translation inconsistency and spatial misalignment in in-image machine translation, reporting SOTA results on multiple benchmarks.

Domain-Adaptive Dense Retrieval for Brazilian Legal Search

cs.IR · 2026-05-05 · unverdicted · novelty 4.0

Mixed training of Qwen3-Embedding-4B on legal data plus SQuAD-pt yields higher average NDCG@10 (0.447), MRR@10 (0.595), and MAP@10 (0.308) across six Portuguese retrieval datasets than legal-only or base models, with largest gains on out-of-domain question-based search.

Developing an ESG-Oriented Large Language Model through ESG Practices

cs.CE · 2026-03-20 · unverdicted · novelty 3.0

ESG-adapted versions of Qwen-3-4B using LoRA and IRM outperform the base model and Llama-3/Gemma-3 baselines on generative ESG question-answering tasks.

citing papers explorer

Showing 8 of 8 citing papers after filters.

Evolution-Aware Regression Test Prioritization of ML-Enabled Systems Using Gradient-Based Behavior Vectors

cs.SE · 2026-06-26 · unverdicted · none · ref 30

GBV-PD projects gradient-based behavior vectors of test cases and the parameter delta of an evolved model onto a shared PCA basis to estimate directional loss changes without executing the new model.

MedStruct-S: A Benchmark for Key Discovery, Key-Conditioned QA and Semi-Structured Extraction from OCR Clinical Reports

cs.CL · 2026-05-04 · unverdicted · none · ref 23

MedStruct-S benchmark shows encoder-only models outperform larger decoder-only ones on key-conditioned QA from noisy OCR clinical reports, with fine-tuned large models winning only when scale is ignored.

Spectral Tempering for Embedding Compression in Dense Passage Retrieval

cs.IR · 2026-03-19 · unverdicted · none · ref 12

Spectral Tempering derives an adaptive scaling factor γ(k) from the embedding eigenspectrum via local SNR analysis and knee-point normalization to achieve near-optimal compression without training or validation.

IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents

cs.CL · 2026-05-27 · unverdicted · none · ref 51

IPO-Mine releases a toolkit and large multimodal dataset for structured analysis of IPO filings and shows state-of-the-art models diverge from human judgments on chart quality and misleadingness.

Bridging Scientific Heritage: An Arabic-Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer

cs.CL · 2026-06-29 · unverdicted · none · ref 3

A new 27k-sentence Arabic-Russian parallel corpus supports fine-tuned LLM translation benchmarks that improve BLEU by 4.36 and COMET by 0.051 over zero-shot baselines for scientific content.

UniTranslator: A Unified Multi-modal Framework for End-to-end In-Image Machine Translation

cs.CV · 2026-06-23 · unverdicted · none · ref 28

UniTranslator adds an Understand-Generation Alignment Module and Spatial Mask Decoder to a unified multimodal model to fix translation inconsistency and spatial misalignment in in-image machine translation, reporting SOTA results on multiple benchmarks.

Domain-Adaptive Dense Retrieval for Brazilian Legal Search

cs.IR · 2026-05-05 · unverdicted · none · ref 8

Mixed training of Qwen3-Embedding-4B on legal data plus SQuAD-pt yields higher average NDCG@10 (0.447), MRR@10 (0.595), and MAP@10 (0.308) across six Portuguese retrieval datasets than legal-only or base models, with largest gains on out-of-domain question-based search.

Developing an ESG-Oriented Large Language Model through ESG Practices

cs.CE · 2026-03-20 · unverdicted · none · ref 26

ESG-adapted versions of Qwen-3-4B using LoRA and IRM outperform the base model and Llama-3/Gemma-3 baselines on generative ESG question-answering tasks.
