OpenCoderRank: Personalized Technical Assessments with Generative AI

Hridoy Sankar Dutta; Sana Ansari; Shounak Ravi Bhalerao; Swati Kumari

arXiv: 2509.06774 · v2 · submitted 2025-09-08 · cs.SE


Pith reviewed 2026-05-18 18:24 UTC · model grok-4.3


keywords: self-hosted coding assessments; timed technical evaluations; automatic grading; generative AI for question creation; resource-constrained environments; problem setter solver platform


The pith

OpenCoderRank is a lightweight self-hosted platform for creating and automatically grading timed coding assessments with fine-grained controls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OpenCoderRank to help organizations and schools run realistic time-bound coding tests that measure both correctness and efficiency. Problem setters use it to craft challenges with adjustable difficulty, time limits, and execution rules, while problem solvers work in a controlled environment that limits easy access to external help. The platform stays model-agnostic so different generative AI tools can assist with question creation without locking users into one system. It runs locally with low resource needs, making it suitable where cloud services are unavailable or expensive. The authors evaluate the platform with BERTScore and an LLM-based evaluation, and report that it links setters and solvers for preparation and assessment.

Core claim

OpenCoderRank is a lightweight, self-hosted, model-agnostic platform that facilitates the creation, deployment and automatic grading of problems while offering fine-grained control over time limits, input-output pairs and execution constraints, thereby connecting problem setters and solvers by supporting time-constrained preparation and self-hosted, customizable assessments in resource-constrained settings.

What carries the argument

The OpenCoderRank platform, which is intentionally model-agnostic to allow flexible use of generative AI for question generation while enforcing assessment rules through time limits, input-output pairs, and execution constraints.

If this is right

Problem setters gain support for generating diverse questions while keeping assessments relevant and balanced.
Problem solvers receive structured time-constrained practice that mirrors real technical evaluations.
Educational and hiring groups in low-resource areas can run customizable assessments without external infrastructure.
Automatic grading handles correctness and efficiency checks consistently across deployments.
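The grading idea in the list above can be illustrated with a minimal sketch. It is not OpenCoderRank's implementation: the names `IOPair` and `grade`, the verdict strings, and the choice to run Python submissions through the standard `subprocess` module are all assumptions made for illustration. It shows how a per-case time limit and setter-supplied input-output pairs could yield a verdict per test case.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class IOPair:
    """One setter-defined test case (hypothetical structure)."""
    stdin: str            # input fed to the submission
    expected_stdout: str  # output the setter expects


def grade(source: str, cases: list[IOPair], time_limit_s: float = 2.0) -> list[str]:
    """Run a Python submission against each case and return one verdict per case."""
    verdicts = []
    for case in cases:
        try:
            result = subprocess.run(
                [sys.executable, "-c", source],  # run the submission in a child process
                input=case.stdin,
                capture_output=True,
                text=True,
                timeout=time_limit_s,            # enforce the per-case time limit
            )
        except subprocess.TimeoutExpired:
            verdicts.append("time_limit_exceeded")
            continue
        if result.returncode != 0:
            verdicts.append("runtime_error")
        elif result.stdout.strip() == case.expected_stdout.strip():
            verdicts.append("accepted")
        else:
            verdicts.append("wrong_answer")
    return verdicts
```

This sketch only enforces a wall-clock limit. It does not sandbox the submission or cap memory, so a real deployment would need stronger execution constraints than shown here.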

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The platform could integrate with local networks to support fully offline assessment sessions in regions with unreliable connectivity.
Extending its controls might allow dynamic adjustment of problem difficulty during a session based on early performance.
Wider adoption could reduce reliance on commercial cloud-based testing services for smaller institutions.

Load-bearing premise

The assumption that BERTScore and LLM evaluation methods are sufficient to confirm the platform emulates real assessments and maintains integrity against LLM assistance.
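To make concrete what a BERTScore-style number measures, here is a toy sketch of the greedy-matching computation behind the metric. Real BERTScore takes token vectors from a pretrained contextual model. The hand-written vectors in the usage note and the function names `cosine` and `greedy_match_f1` are assumptions for illustration, and nothing here reflects how the paper applied the metric.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_f1(
    reference: list[list[float]], candidate: list[list[float]]
) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from greedy cosine matching of token vectors."""
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, with reference vectors `[[1.0, 0.0], [0.0, 1.0]]` and a single candidate vector `[[1.0, 0.0]]`, precision is 1.0 and recall is 0.5. A score like this measures semantic overlap between two texts. It says nothing by itself about whether a timed assessment resists outside assistance, which is why the premise above carries so much weight.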

What would settle it

A side-by-side comparison of student solution quality, completion times, and external assistance usage rates between OpenCoderRank and conventional online judges or in-person proctored tests.

Figures

Figures reproduced from arXiv: 2509.06774 by Hridoy Sankar Dutta, Sana Ansari, Shounak Ravi Bhalerao, Swati Kumari.

**Figure 2.** (a) High-level pipeline of OpenCoderRank, (b) Question Generator with the prompt to generate questions on any ...

Original abstract

Organizations and educational institutions use time-bound assessment tasks to evaluate coding and problem-solving skills. These assessments measure not only the correctness of the solutions, but also their efficiency. Problem setters (educator/interviewer) are responsible for crafting these challenges, carefully balancing difficulty and relevance to create meaningful evaluation experiences. Conversely, problem solvers (student/interviewee) apply critical and logical thinking to arrive at correct solutions. In the era of Large Language Models (LLMs), LLMs assist problem setters in generating diverse and challenging questions, but they can undermine assessment integrity for problem solvers by providing easy access to solutions. We introduce OpenCoderRank, a lightweight, self-hosted platform that emulates real-world timed technical assessments in resource-constrained environments. OpenCoderRank is intentionally model-agnostic: it facilitates the creation, deployment and automatic grading of problems while offering fine-grained control over time limits, input-output pairs and execution constraints. OpenCoderRank is evaluated using two methods: 1. BERTScore, 2. LLM evaluation. Our findings indicate that OpenCoderRank connects problem setters and solvers by supporting time-constrained preparation and self-hosted, customizable assessments in resource-constrained settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OpenCoderRank describes a self-hosted assessment platform but lacks any reported results or baselines for its evaluation methods.


Hi colleague,

The key takeaway is that this paper presents OpenCoderRank, a self-hosted platform for running timed coding assessments, but it gives no numbers or comparisons to show that the evaluations actually work.

On the positive side, the work does lay out a clear system design. It's lightweight and model-agnostic, which means it can run locally without depending on specific LLMs or big infrastructure. Features like custom time limits, input-output pairs, and execution constraints sound practical for places with limited resources. The intent to help problem setters create challenges while limiting solver access to LLMs through self-hosting makes sense as a response to current tools.

The soft spots are in the evaluation and claims. They mention using BERTScore and LLM evaluation to assess the platform, yet nothing is shown: no scores, no baselines against human assessors or other systems, no details on how it preserves integrity. This means the statement that it connects setters and solvers through these assessments is not backed by evidence in what I see. The stress-test note about missing quantitative results holds up here.

This kind of paper is for educators, interviewers, or developers interested in open tools for technical hiring and teaching in constrained settings. Someone building similar systems could pick up implementation pointers, but it won't provide strong evidence for adoption.

I recommend sending it for peer review so the authors can get input on strengthening the evaluation section. It has enough of a concrete contribution to warrant referee time, even if revisions are needed.

Cheers,

Referee Report

1 major / 1 minor

Summary. The paper introduces OpenCoderRank, a lightweight, self-hosted, model-agnostic platform for creating, deploying, and automatically grading time-bound coding assessments with controls for time limits, input-output pairs, and execution constraints. It positions the tool as a means to emulate real-world timed technical assessments in resource-constrained settings while addressing LLM-assisted integrity issues for problem setters and solvers. Evaluation is described via two methods (BERTScore and LLM evaluation), leading to the claim that the platform connects setters and solvers through time-constrained preparation and customizable assessments.

Significance. If the empirical claims hold, the work could provide a practical, open-source alternative for educational institutions and organizations needing to run controlled coding assessments without external dependencies, particularly in low-resource environments where proprietary platforms are inaccessible.

major comments (1)

[Abstract] Abstract (and any Evaluation section): The manuscript states that OpenCoderRank 'is evaluated using two methods: 1. BERTScore, 2. LLM evaluation' and that 'Our findings indicate that OpenCoderRank connects problem setters and solvers...', yet supplies no metrics, baselines, effect sizes, error bars, or comparisons to human assessments or existing platforms. This directly undermines the central claim that the platform emulates real-world timed assessments and preserves integrity against LLM assistance.

minor comments (1)

[Abstract] The abstract could more clearly distinguish between the platform's design features and the specific outcomes of the BERTScore/LLM evaluations.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and commit to revisions that strengthen the empirical support for our claims.


Referee: [Abstract] Abstract (and any Evaluation section): The manuscript states that OpenCoderRank 'is evaluated using two methods: 1. BERTScore, 2. LLM evaluation' and that 'Our findings indicate that OpenCoderRank connects problem setters and solvers...', yet supplies no metrics, baselines, effect sizes, error bars, or comparisons to human assessments or existing platforms. This directly undermines the central claim that the platform emulates real-world timed assessments and preserves integrity against LLM assistance.

Authors: We agree that the abstract and evaluation sections would be strengthened by including concrete quantitative results. The current manuscript describes the two evaluation methods but does not report specific scores, baselines, or statistical details. In the revised version, we will expand the Evaluation section to report BERTScore values across sample problems, details of the LLM-as-judge prompts and agreement rates, and any available comparisons to human grading. We will also revise the abstract to accurately summarize these results rather than stating broad findings. This addresses the concern about overclaiming while preserving the paper's focus on the platform design for resource-constrained settings.

revision: yes
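As an illustration of what reporting "agreement rates" could involve, the sketch below computes raw agreement and Cohen's kappa between an LLM judge's verdicts and human grades. The function names and the pass/fail labels are hypothetical, and the rebuttal does not say which agreement statistic the authors would use. Kappa is shown only as one common chance-corrected option.

```python
from collections import Counter


def agreement_rate(a: list[str], b: list[str]) -> float:
    """Fraction of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    n = len(a)
    p_observed = agreement_rate(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if both raters labelled independently at their own base rates.
    p_expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)
```

With human grades `["pass", "pass", "fail", "fail"]` and LLM verdicts `["pass", "fail", "fail", "fail"]`, raw agreement is 0.75 while kappa is 0.5, which shows why a chance-corrected figure is more informative than raw agreement alone.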

Circularity Check

0 steps flagged

No circularity: descriptive platform paper with no derivation chain


The manuscript presents OpenCoderRank as a lightweight self-hosted assessment platform, detailing its model-agnostic design, time limits, input-output constraints, and evaluation via BERTScore plus LLM-as-judge. No equations, fitted parameters, predictions, or uniqueness theorems appear; the findings statement simply summarizes the platform's intended use case. Because the work contains no load-bearing derivation that reduces to its own inputs or self-citations, the analysis chain is self-contained and exhibits zero circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract; the platform description relies on standard assumptions about LLM capabilities and assessment validity without explicit listing.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear

Paper passage: "We introduce OpenCoderRank, a lightweight, self-hosted platform that emulates real-world timed technical assessments in resource-constrained environments... evaluated using two methods: 1. BERTScore, 2. LLM evaluation."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[3] Bhushan, S.; Thomas, D. R.; Borchers, C.; Raghuvanshi, I.; Abboud, R.; Gatz, E.; Gupta, S.; and Koedinger, K. 2025. Detecting LLM-Generated Short Answers and Effects on Learner Performance. arXiv preprint arXiv:2506.17196.

[4] Chu, Z.; Wang, S.; Xie, J.; Zhu, T.; Yan, Y.; Ye, J.; Zhong, A.; Hu, X.; Liang, J.; Yu, P. S.; et al. 2025a. LLM agents for education: Advances and applications. arXiv preprint arXiv:2503.11733.

[5] Chu, Z.; Xie, J.; Wang, S.; Wang, Z.; and Wen, Q. 2025b. UniEDU: A Unified Language and Vision Assistant for Education Applications. arXiv preprint arXiv:2503.20701.

[6] Desmond, M.; Ashktorab, Z.; Geyer, W.; Daly, E. M.; Cooper, M. S.; Pan, Q.; Nair, R.; Wagner, N.; and Pedapati, T. 2025. Evalassist: LLM-as-a-judge simplified. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 29637–29639.

[7] Doddapaneni, S.; Khan, M. S. U. R.; Verma, S.; and Khapra, M. M. 2024. Finding blind spots in evaluator LLMs with interpretable checklists. arXiv preprint arXiv:2406.13439.

[8] Fleckenstein, J.; Meyer, J.; Jansen, T.; Keller, S. D.; Köller, O.; and Möller, J. 2024. Do teachers spot AI? Evaluating the detectability of AI-generated texts among student essays. Computers and Education: Artificial Intelligence, 6: 100209.

[9] Kannam, S.; Yang, Y.; Dharm, A.; and Lin, K. 2025. Code Interviews: Design and Evaluation of a More Authentic Assessment for Introductory Programming Assignments. In Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1, 554–560.

[10] Koike, R.; Kaneko, M.; and Okazaki, N. 2024. Outfox: LLM-generated essay detection through in-context learning with adversarially generated examples. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 21258–21266.

[11] Liffiton, M.; Sheese, B. E.; Savelka, J.; and Denny, P. 2023. Codehelp: Using large language models with guardrails for scalable support in programming classes. In Proceedings of the 23rd Koli Calling International Conference on Computing Education Research, 1–11.

[12] Pădurean, V.-A.; Denny, P.; and Singla, A. 2024. Automated Generation of Code Debugging Exercises. arXiv e-prints, arXiv-2411.

[13] Verga, P.; Hofstätter, S.; Althammer, S.; Su, Y.; Piktus, A.; Arkhangorodsky, A.; Xu, M.; White, N.; and Lewis, P. 2024. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models. arXiv preprint arXiv:2404.18796.

[14] Zhang, Z.; Zhang-Li, D.; Yu, J.; Gong, L.; Zhou, J.; Hao, Z.; Jiang, J.; Cao, J.; Liu, H.; Liu, Z.; et al. 2024. Simulating classroom education with LLM-empowered agents. arXiv preprint arXiv:2406.19226.
