Benchmarking Large Language Models for Automated Verilog RTL Code Generation
3 Pith papers cite this work.
Citing papers by year: 2026 (3). Verdict status: unverdicted (3).
Citing papers
- HarmChip: Evaluating Hardware Security Centric LLM Safety via Jailbreak Benchmarking
  HarmChip is a new benchmark exposing an alignment paradox in which LLMs refuse legitimate hardware security queries but comply with semantically disguised malicious requests.
- RefEvo: Agentic Design with Co-Evolutionary Verification for Agile Reference Model Generation
  RefEvo achieves a 95% pass rate on 20 hardware modules for SystemC reference model generation, using dynamic multi-agent planning, co-evolutionary verification, and spec anchoring, while reducing token usage by 71%.
- Configuration Over Selection: Hyperparameter Sensitivity Exceeds Model Differences in Open-Source LLMs for RTL Generation
  Hyperparameter configuration in open-source LLMs for RTL generation produces up to 25.5% intra-model pass-rate variation on VerilogEval and RTLLM, exceeding inter-model spreads by 5x, and the optimal settings show near-zero correlation across benchmarks.