Review history

arxiv: 2606.22723 · 2 revisions

BLUEX v2: Benchmarking LLMs on Open-Ended Questions from Brazilian University Entrance Exams

  1. 2026-07-01 UNVERDICTED LOW v0.9.1-grok novelty 6.0
    50838 ms 5838 in 1189 out 2026-07-01T07:01:49.976634+00:00
  2. 2026-06-26 UNVERDICTED LOW v0.9.1-grok novelty 7.0
    27801 ms 5796 in 1161 out 2026-06-26T09:57:18.742807+00:00