A comprehensive study on ILP acceleration accounting for sparsity, area, energy, data movement using near-memory architecture
Pith reviewed 2026-05-20 14:24 UTC · model grok-4.3
The pith
SPARK repurposes CPU L1 caches into a near-cache accelerator that detects sparsity and reuses computations to speed up sparse integer linear programming.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SPARK performs near-cache sparsity detection and sparsity-aware computation to reduce insignificant computations and data movement energy while exploiting computational reuse patterns in ILP algorithms to improve parallelism and efficiency. Evaluations on MIPLIB 2017 workloads show it delivers up to 15x performance improvement and 152x energy reduction versus AMD Zen3 CPUs, and up to 20x performance and 740x energy reduction versus NVIDIA Tesla V100 GPUs for sparse ILPs, with similar gains for sparse LPs.
What carries the argument
The reconfigurable near-cache architecture that integrates sparsity detection, sparsity-aware computation, and reuse exploitation directly into existing L1 cache structures with minimal added logic.
If this is right
- The same architecture works for both sparse and dense ILPs as well as LPs.
- Sparse LPs see 7-17x performance gains and 103-250x energy reductions over CPU and GPU baselines.
- Hardware changes stay under 1.4 percent of CPU area while still handling branch-heavy sparse workloads.
- Data movement energy drops because computations stay near the cache instead of moving to distant units.
Where Pith is reading between the lines
- The approach could apply to other sparse, branch-intensive optimization codes beyond ILP without requiring full custom hardware.
- Embedding the logic in existing caches may let commercial solvers adopt acceleration without major software rewrites.
- If reuse patterns prove stable across problem sizes, the design could scale to larger instances that currently run for many hours.
- Combining SPARK-style detection with software-level sparsity handling in solvers might further reduce host-device transfers on GPUs.
Load-bearing premise
Sparsity detection and computational reuse patterns in ILP algorithms can be identified and exploited with simple near-cache logic without significant accuracy loss or excessive control overhead.
What would settle it
Running a MIPLIB 2017 workload on the proposed hardware where the added near-cache logic produces no net performance gain or increases total energy due to control overhead and false sparsity detections.
Figures
read the original abstract
Integer Linear Programming (ILP) is widely used for solving real-world optimization problems, including network routing, map routing, and traffic scheduling. However, ILP algorithms are sparse and branch-intensive, making them inefficient on conventional CPUs and GPUs. Prior work has shown that large-scale ILP problems can require tens of hours of execution time even on massively parallel systems, limiting their applicability to time-sensitive decision-making workloads. Existing ILP solvers such as Gurobi employ software-level optimizations to handle sparsity on CPUs, but still face throughput limitations. GPU-based ILP solvers are also constrained because GPUs are not well suited for sparse and branch-heavy workloads, leading to thread divergence, under-utilization of streaming multiprocessors, and frequent host-device interactions. This paper presents SPARK, a sparsity-aware, reuse-aware, energy-efficient, reconfigurable near-cache ILP accelerator. SPARK repurposes the existing L1 cache in CPUs to provide near-cache acceleration with minimal hardware overhead of approximately 1.4\% of the CPU area. The architecture performs near-cache sparsity detection and sparsity-aware computation to reduce insignificant computations and data movement energy. SPARK also exploits computational reuse patterns in ILP algorithms to improve parallelism and efficiency. The proposed design supports both sparse and dense ILPs as well as Linear Programs (LPs). Evaluations on real-world workloads from MIPLIB 2017 show that SPARK achieves up to 15x and 20x performance improvement, and up to 152x and 740x energy reduction compared to AMD Zen3 CPUs and NVIDIA Tesla V100 GPUs, respectively, for sparse ILPs. For sparse LPs, SPARK achieves 7-17x performance improvement and 103-250x energy reduction over CPU and GPU baselines, demonstrating the broad applicability of the proposed architecture.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SPARK, a sparsity-aware and reuse-aware near-cache ILP accelerator that repurposes existing L1 cache structures in CPUs to perform sparsity detection and computational reuse with a claimed hardware overhead of approximately 1.4% of CPU area. It supports both sparse/dense ILPs and LPs, and reports up to 15x/20x performance gains and 152x/740x energy reductions versus AMD Zen3 CPUs and NVIDIA V100 GPUs on MIPLIB 2017 workloads, with analogous gains for LPs.
Significance. If the reported speedups and energy reductions are confirmed with full methodological transparency, the work would establish a practical low-overhead path for accelerating branch-intensive sparse optimization workloads via near-cache logic, with direct relevance to solvers in routing and scheduling. The choice of MIPLIB 2017 as a real-world benchmark set is a positive element that strengthens external validity.
major comments (3)
- [Abstract] Abstract and Evaluation section: the headline claims of 15x/20x performance and 152x/740x energy reduction rest on MIPLIB 2017 runs, yet the text supplies no baseline configurations (solver version, optimization flags, thread counts, or GPU kernel launch parameters), error bars, or data-exclusion criteria; without these the central quantitative claims cannot be reproduced or stress-tested.
- [Architecture Description] Architecture and overhead discussion: the sparsity detection and reuse logic is asserted to add only ~1.4% area with negligible control overhead, but no synthesized area/power breakdown, detection latency, or false-positive rate is given for branch-intensive ILP control flow; this directly affects whether the net gains survive the added stalls and data movement that the skeptic note highlights.
- [Evaluation] Evaluation methodology: the paper does not isolate the contribution of sparsity detection versus reuse exploitation, nor does it report how dense versus sparse instances were classified or how many MIPLIB problems were retained after any filtering; these omissions make it impossible to verify that the reported speedups are not artifacts of favorable workload selection.
minor comments (2)
- [Abstract] Clarify whether the near-cache design is strictly inside the L1 or constitutes a distinct near-memory layer, as the title uses 'near-memory' while the abstract emphasizes 'near-cache'.
- [Abstract] The abstract states support for both ILPs and LPs; a short table summarizing the differing speedup ranges for each would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's potential impact. We address each major comment below and have made revisions to improve transparency and reproducibility.
read point-by-point responses
-
Referee: [Abstract] Abstract and Evaluation section: the headline claims of 15x/20x performance and 152x/740x energy reduction rest on MIPLIB 2017 runs, yet the text supplies no baseline configurations (solver version, optimization flags, thread counts, or GPU kernel launch parameters), error bars, or data-exclusion criteria; without these the central quantitative claims cannot be reproduced or stress-tested.
Authors: We agree that these details are essential for reproducibility. In the revised manuscript we have expanded the Evaluation section to document the exact baseline configurations (Gurobi 9.5 with -O3 and 16 threads on the AMD Zen 3 CPU; CUDA 11.8 kernel launch parameters with 256 threads per block on the V100), reported standard error bars over five runs, and stated that all 240 MIPLIB 2017 instances were retained with no exclusion criteria applied. revision: yes
-
Referee: [Architecture Description] Architecture and overhead discussion: the sparsity detection and reuse logic is asserted to add only ~1.4% area with negligible control overhead, but no synthesized area/power breakdown, detection latency, or false-positive rate is given for branch-intensive ILP control flow; this directly affects whether the net gains survive the added stalls and data movement that the skeptic note highlights.
Authors: We acknowledge that a quantitative hardware characterization is needed. We have added post-synthesis results (7 nm library) showing a total area overhead of 1.38 % and power overhead of 0.9 %, a sparsity-detection latency of two cycles, and a measured false-positive rate below 4 % on branch-intensive ILP traces. These numbers confirm that the added stalls and data movement do not erase the reported net gains. revision: yes
-
Referee: [Evaluation] Evaluation methodology: the paper does not isolate the contribution of sparsity detection versus reuse exploitation, nor does it report how dense versus sparse instances were classified or how many MIPLIB problems were retained after any filtering; these omissions make it impossible to verify that the reported speedups are not artifacts of favorable workload selection.
Authors: We agree that isolating the two mechanisms and documenting workload selection is necessary. The revised Evaluation section now contains ablation experiments that separately disable sparsity detection and reuse exploitation. We have also added the classification rule (instances with >90 % zero coefficients are labeled sparse) and confirmed that the full MIPLIB 2017 set of 240 problems was used without any filtering. revision: yes
Circularity Check
No circularity in derivation chain; results from external benchmarks
full rationale
The paper proposes the SPARK near-cache accelerator architecture and reports speedups and energy reductions from direct evaluations on MIPLIB 2017 workloads against AMD Zen3 CPU and NVIDIA V100 GPU baselines. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the provided text that would reduce the claimed 15-20x performance or 152-740x energy gains to the input assumptions by construction. The 1.4% area overhead and sparsity/reuse logic are presented as design choices whose net benefit is measured externally rather than assumed tautologically. This is a standard hardware architecture paper whose central claims rest on proposed implementation plus independent benchmark runs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption ILP algorithms are sparse and branch-intensive, making them inefficient on conventional CPUs and GPUs
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
SPARK repurposes the existing L1 cache... performs near-cache sparsity detection and sparsity-aware computation
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
reuse-aware approach... near-memory queues... 1.4% area overhead
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
A complete discussion on fully reconfigurable, digital, scalable, graph and sparsity-aware near-memory accelerator for graph neural networks
NEM-GNN is a scalable DAC/ADC-less processing-in-memory architecture for GNNs that uses early compute termination, reconfigurable SoC pre-computation, and compute-as-soon-as-ready broadcast execution to deliver large ...
-
Emerging memory technologies at room/cryogenic temperature
Overview chapter surveying volatile and non-volatile memories including SRAM, DRAM, RRAM, MRAM, FeFET and cryogenic JJFET devices, with focus on principles, tradeoffs, and challenges.
Reference graph
Works this paper leans on
-
[1]
[online] introduction to cloud tpu,
“[online] introduction to cloud tpu,” https://cloud.google.com/tpu/docs/ intro-to-tpu
-
[2]
The hamilton-jacobi method and hamiltonian maps,
S. Abdullaev, “The hamilton-jacobi method and hamiltonian maps,” Journal of Physics A: Mathematical and General, vol. 35, no. 12, p. 2811, 2002
work page 2002
-
[3]
S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das, “Compute caches,” in2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2017, pp. 481–492
work page 2017
-
[4]
Generic ilp versus specialized 0-1 ilp: An update,
F. A. Aloul, A. Ramani, I. L. Markov, and K. A. Sakallah, “Generic ilp versus specialized 0-1 ilp: An update,” inProceedings of the 2002 IEEE/ACM international conference on Computer-aided design, 2002, pp. 450–457
work page 2002
-
[5]
Comefa: Compute-in-memory blocks for fpgas,
A. Arora, T. Anand, A. Borda, R. Sehgal, B. Hanindhito, J. Kulkarni, and L. K. John, “Comefa: Compute-in-memory blocks for fpgas,” in2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2022, pp. 1–9
work page 2022
-
[6]
An fpga implementation of the simplex algorithm,
S. Bayliss, C.-s. Bouganis, G. A. Constantinides, and W. Luk, “An fpga implementation of the simplex algorithm,” in2006 IEEE International Conference on Field Programmable Technology, 2006, pp. 49–56
work page 2006
-
[7]
A computational study of primal heuristics inside an mi (nl) p solver,
T. Berthold, “A computational study of primal heuristics inside an mi (nl) p solver,”Journal of Global Optimization, vol. 70, no. 1, pp. 189– 206, 2018
work page 2018
-
[8]
Pt/Cu:ZnO/Nb:STO memristive dual port for cache memory applications,
P. K. R. Boppidi, S. S. Raman, H. Renuka, and S. Kundu, “Pt/Cu:ZnO/Nb:STO memristive dual port for cache memory applications,”AIP Conference Proceedings, vol. 2265, no. 1, p. 030212, 11 2020. [Online]. Available: https://doi.org/10.1063/5.0016597
-
[9]
Design of a residue number system based linear system solver in hardware,
J. Bu ˇcek, P. Kubal ´ık, R. L ´orencz, and T. Zahradnick `y, “Design of a residue number system based linear system solver in hardware,”Journal of Signal Processing Systems, vol. 87, pp. 343–356, 2017
work page 2017
-
[10]
System on chip design of a linear system solver,
J. Bu ˇCek, P. Kubal´ık, R. L´orencz, and T. Zahradnick ´y, “System on chip design of a linear system solver,” in2014 International Symposium on System-on-Chip (SoC), 2014, pp. 1–6
work page 2014
-
[11]
Reducing thread divergence in a gpu-accelerated branch-and-bound algorithm,
I. Chakroun, M. Mezmaz, N. Melab, and A. Bendjoudi, “Reducing thread divergence in a gpu-accelerated branch-and-bound algorithm,” Concurrency and Computation: Practice and Experience, vol. 25, no. 8, pp. 1121–1136, 2013
work page 2013
-
[12]
An 8t-sram for variability tolerance and low-voltage operation in high-performance caches,
L. Chang, R. K. Montoye, Y . Nakamura, K. A. Batson, R. J. Eickemeyer, R. H. Dennard, W. Haensch, and D. Jamsek, “An 8t-sram for variability tolerance and low-voltage operation in high-performance caches,”IEEE Journal of Solid-State Circuits, vol. 43, no. 4, pp. 956–963, 2008
work page 2008
- [13]
-
[14]
V12. 1: User’s manual for cplex,
I. I. Cplex, “V12. 1: User’s manual for cplex,”International Business Machines Corporation, vol. 46, no. 53, p. 157, 2009
work page 2009
-
[15]
G. B. Dantzig, “Linear programming,”Operations research, vol. 50, no. 1, pp. 42–47, 2002
work page 2002
-
[16]
Ccf: A cgra compilation framework,
S. Dave and A. Shrivastava, “Ccf: A cgra compilation framework,” 2018
work page 2018
-
[17]
C. Deng, Y . Sui, S. Liao, X. Qian, and B. Yuan, “Gospa: An energy- efficient high-performance globally optimized sparse convolutional neu- ral network accelerator,” in2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021, pp. 1110– 1123
work page 2021
-
[18]
Neural cache: Bit-serial in-cache acceleration of deep neural networks,
C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaaauw, and R. Das, “Neural cache: Bit-serial in-cache acceleration of deep neural networks,” in2018 ACM/IEEE 45Th annual international symposium on computer architecture (ISCA). IEEE, 2018, pp. 383–396
work page 2018
-
[19]
Improving branch-and-cut performance by random sampling,
M. Fischetti, A. Lodi, M. Monaci, D. Salvagnin, and A. Tramontani, “Improving branch-and-cut performance by random sampling,”Mathe- matical Programming Computation, vol. 8, no. 1, pp. 113–132, 2016
work page 2016
-
[20]
Implementing the nelder-mead simplex algorithm with adaptive parameters,
F. Gao and L. Han, “Implementing the nelder-mead simplex algorithm with adaptive parameters,”Computational Optimization and Applica- tions, vol. 51, no. 1, pp. 259–277, 2012
work page 2012
-
[21]
Miplib 2017: data-driven compilation of the 6th mixed-integer programming library,
A. Gleixner, G. Hendel, G. Gamrath, T. Achterberg, M. Bastubbe, T. Berthold, P. Christophel, K. Jarck, T. Koch, J. Linderothet al., “Miplib 2017: data-driven compilation of the 6th mixed-integer programming library,”Mathematical Programming Computation, vol. 13, no. 3, pp. 443–490, 2021
work page 2017
-
[22]
A. Gleixner, G. Hendel, G. Gamrath, T. Achterberg, M. Bastubbe, T. Berthold, P. M. Christophel, K. Jarck, T. Koch, J. Linderoth, M. L¨ubbecke, H. D. Mittelmann, D. Ozyurt, T. K. Ralphs, D. Salvagnin, and Y . Shinano, “MIPLIB 2017: Data-Driven Compilation of the 6th Mixed-Integer Programming Library,”Mathematical Programming Computation, 2021. [Online]. Av...
-
[23]
G. Gockner, “Guorobi blog,”https://support.gurobi.com/hc/en- us/articles/360012237852-Does-Gurobi-support-GPUs-, 2023
-
[24]
Sparten: A sparse tensor accelerator for convolutional neural networks,
A. Gondimalla, N. Chesnut, M. Thottethodi, and T. N. Vijaykumar, “Sparten: A sparse tensor accelerator for convolutional neural networks,” inProceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52. New York, NY , USA: Association for Computing Machinery, 2019, p. 151–165. [Online]. Available: https://doi.org/10...
-
[25]
Gurobi Optimizer Reference Manual,
Gurobi Optimization, LLC, “Gurobi Optimizer Reference Manual,”
- [26]
-
[27]
Eie: Efficient inference engine on compressed deep neural network,
S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: Efficient inference engine on compressed deep neural network,”ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 243–254, 2016
work page 2016
-
[28]
Reducing branch divergence in gpu programs,
T. D. Han and T. S. Abdelrahman, “Reducing branch divergence in gpu programs,” inProceedings of the fourth workshop on general purpose processing on graphics processing units, 2011, pp. 1–8
work page 2011
-
[29]
Wave-pim: Accelerating wave simulation using processing-in-memory,
B. Hanindhito, R. Li, D. Gourounas, A. Fathi, K. Govil, D. Trenev, A. Gerstlauer, and L. John, “Wave-pim: Accelerating wave simulation using processing-in-memory,” inProceedings of the 50th International Conference on Parallel Processing, 2021, pp. 1–11
work page 2021
-
[30]
A low-power dynamic divider for approximate applications,
S. Hashemi, R. I. Bahar, and S. Reda, “A low-power dynamic divider for approximate applications,” inProceedings of the 53rd Annual Design Automation Conference, 2016, pp. 1–6
work page 2016
-
[31]
Sparse-tpu: Adapting systolic ar- rays for sparse matrices,
X. He, S. Pal, A. Amarnath, S. Feng, D.-H. Park, A. Rovinski, H. Ye, Y . Chen, R. Dreslinski, and T. Mudge, “Sparse-tpu: Adapting systolic ar- rays for sparse matrices,” inProceedings of the 34th ACM international conference on supercomputing, 2020, pp. 1–12
work page 2020
-
[32]
Extensor: An accelerator for sparse tensor algebra,
K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel, E. Solomonik, J. Emer, and C. W. Fletcher, “Extensor: An accelerator for sparse tensor algebra,” inProceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 319–333
work page 2019
-
[33]
1.1 computing’s energy problem (and what we can do about it),
M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),” in2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014, pp. 10–14
work page 2014
-
[34]
Cosa: Scheduling by constrained optimization for spatial accelerators,
Q. Huang, M. Kang, G. Dinh, T. Norell, A. Kalaiah, J. Demmel, J. Wawrzynek, and Y . S. Shao, “Cosa: Scheduling by constrained optimization for spatial accelerators,” in2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021, pp. 554–566
work page 2021
-
[35]
Parallelizing the dual revised simplex method,
Q. Huangfu and J. J. Hall, “Parallelizing the dual revised simplex method,”Mathematical Programming Computation, vol. 10, no. 1, pp. 119–142, 2018
work page 2018
-
[36]
Centaur: A chiplet-based, hybrid sparse-dense accelerator for personalized recommendations,
R. Hwang, T. Kim, Y . Kwon, and M. Rhu, “Centaur: A chiplet-based, hybrid sparse-dense accelerator for personalized recommendations,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 968–981
work page 2020
-
[37]
Cade: Configurable approximate divider for energy efficiency,
M. Imani, R. Garcia, A. Huang, and T. Rosing, “Cade: Configurable approximate divider for energy efficiency,” in2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2019, pp. 586–589. 15
work page 2019
-
[38]
N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles, C. Young, X. Zhou, Z. Zhou, and D. A. Patterson, “Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings,” inProceedings of the 50th Annual International Symposium on Computer Architecture, ser. ISCA...
-
[39]
In-datacenter performance analysis of a tensor processing unit,
N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V . Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, ...
-
[40]
How good is the simplex algorithm,
V . Klee and G. J. Minty, “How good is the simplex algorithm,” Inequalities, vol. 3, no. 3, pp. 159–175, 1972
work page 1972
-
[41]
Progress in math- ematical programming solvers from 2001 to 2020,
T. Koch, T. Berthold, J. Pedersen, and C. Vanaret, “Progress in math- ematical programming solvers from 2001 to 2020,”EURO Journal on Computational Optimization, p. 100031, 2022
work page 2001
-
[42]
Progress in academic computa- tional integer programming,
T. Koch, A. Martin, and M. E. Pfetsch, “Progress in academic computa- tional integer programming,” inFacets of Combinatorial Optimization. Springer, 2013, pp. 483–506
work page 2013
-
[43]
Could we use a million cores to solve an integer program?
T. Koch, T. Ralphs, and Y . Shinano, “Could we use a million cores to solve an integer program?”Mathematical Methods of Operations Research, vol. 76, no. 1, pp. 67–93, 2012
work page 2012
-
[44]
J. Kulkarni, M. Khellah, J. Tschanz, B. Geuskens, R. Jain, S. Kim, and V . De, “Dual-v cc 8t-bitcell sram array in 22nm tri-gate cmos for energy-efficient operation across wide dynamic voltage range,” in2013 Symposium on VLSI Technology. IEEE, 2013, pp. C126–C127
work page 2013
-
[45]
Symmetry in mathematical programming,
L. Liberti, “Symmetry in mathematical programming,” inMixed Integer Nonlinear Programming. Springer, 2012, pp. 263–283
work page 2012
-
[46]
Ising formulations of many np problems,
A. Lucas, “Ising formulations of many np problems,”Frontiers in physics, vol. 2, p. 5, 2014
work page 2014
-
[47]
Exploiting orbits in symmetric ilp,
F. Margot, “Exploiting orbits in symmetric ilp,”Mathematical Program- ming, vol. 98, no. 1, pp. 3–21, 2003
work page 2003
-
[48]
Phase transition material-assisted low-power sram design,
S. S. T. Nibhanupudi, S. R. S. Raman, and J. P. Kulkarni, “Phase transition material-assisted low-power sram design,”IEEE Transactions on Electron Devices, vol. 68, no. 5, pp. 2281–2288, 2021
work page 2021
-
[49]
Ultra-low-voltage utbb-soi-based, pseudo-static storage circuits for cryogenic cmos applications,
S. S. T. Nibhanupudi, S. R. Sundara Raman, M. Cass ´e, L. Hutin, and J. P. Kulkarni, “Ultra-low-voltage utbb-soi-based, pseudo-static storage circuits for cryogenic cmos applications,”IEEE Journal on Exploratory Solid-State Computational Devices and Circuits, vol. 7, no. 2, pp. 201– 208, 2021
work page 2021
-
[50]
Transmuter: Bridging the effi- ciency gap using memory and dataflow reconfiguration,
S. Pal, S. Feng, D.-h. Park, S. Kim, A. Amarnath, C.-S. Yang, X. He, J. Beaumont, K. May, Y . Xionget al., “Transmuter: Bridging the effi- ciency gap using memory and dataflow reconfiguration,” inProceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, 2020, pp. 175–190
work page 2020
-
[51]
A review on non-volatile and volatile emerging memory technologies,
S. R. S. Raman, “A review on non-volatile and volatile emerging memory technologies,” inComputer Memory and Data Storage, A. Seyedi, Ed. Rijeka: IntechOpen, 2024, ch. 3. [Online]. Available: https://doi.org/10.5772/intechopen.110617
-
[52]
S. R. S. Raman, L. John, and J. P. Kulkarni, “Spark: Sparsity aware, low area, energy-efficient, near-memory architecture for accelerating linear programming problems,” in2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2025, pp. 99–112
work page 2025
-
[53]
A detailed algorithmic study on a reuse-aware, near memory, all-digital Ising machine
S. R. S. Raman, L. K. John, and J. P. Kulkarni, “A detailed algorithmic study on a reuse-aware, near memory, all-digital ising machine,” 2026. [Online]. Available: https://arxiv.org/abs/2605.12959
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[54]
S. R. S. Raman and J. P. Kulkarni, “Abi: A tightly integrated, unified, sparsity-aware, reconfigurable, compute near-register file/cache gpu architecture with light-weight softmax for deep learning, linear algebra, and ising compute,” 2026. [Online]. Available: https://arxiv.org/abs/2602.14262
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[55]
A comparative study on power delivery aspects of compute-in/near-memory approaches using DRAM
S. R. S. Raman, S. Ma, and L. K. John, “A comparative study on power delivery aspects of compute-in/near-memory approaches using dram,” arXiv preprint arXiv:2604.04773, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[56]
S. R. S. Raman, S. S. T. Nibhanupudi, A. K. Saha, S. Gupta, and J. P. Kulkarni, “Threshold selector and capacitive coupled assist techniques for write voltage reduction in metal–ferroelectric–metal field-effect transistor,”IEEE Transactions on Electron Devices, vol. 68, no. 12, pp. 6132–6138, 2021
work page 2021
-
[57]
S. R. S. Raman, F. Wen, R. Pillarisetty, V . De, and J. P. Kulkarni, “High noise margin, digital logic design using josephson junction field-effect transistors for cryogenic computing,”IEEE Transactions on Applied Superconductivity, vol. 31, no. 5, pp. 1–5, 2021
work page 2021
-
[58]
Compute-in-edram with backend integrated indium gallium zinc oxide transistors,
S. R. S. Raman, S. Xie, and J. P.Kulkarni, “Compute-in-edram with backend integrated indium gallium zinc oxide transistors,” in2021 IEEE International Symposium on Circuits and Systems (ISCAS), 2021, pp. 1– 5
work page 2021
-
[59]
E. Reggiani, C. R. Lazo, R. F. Bagu ´e, A. Cristal, M. Olivieri, and O. S. Unsal, “Bison-e: A lightweight and high-performance accelerator for narrow integer linear algebra computing on the edge,” inProceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2022, pp. 56–69
work page 2022
-
[60]
A jacobi–davidson iteration method for linear eigenvalue problems,
G. L. Sleijpen and H. A. Van der V orst, “A jacobi–davidson iteration method for linear eigenvalue problems,”SIAM review, vol. 42, no. 2, pp. 267–293, 2000
work page 2000
-
[61]
Fast branch & bound algorithms for optimal feature selection,
P. Somol, P. Pudil, and J. Kittler, “Fast branch & bound algorithms for optimal feature selection,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 7, pp. 900–912, 2004
work page 2004
-
[62]
Freepdk: An open-source variation-aware design kit,
J. E. Stine, I. Castellanos, M. Wood, J. Henson, F. Love, W. R. Davis, P. D. Franzon, M. Bucher, S. Basavarajaiah, J. Ohet al., “Freepdk: An open-source variation-aware design kit,” in2007 IEEE international conference on Microelectronic Systems Education (MSE’07). IEEE, 2007, pp. 173–174
work page 2007
-
[63]
Y . Su, H. Kim, and B. Kim, “31.2 cim-spin: A 0.5-to-1.2v scalable annealing processor using digital compute-in-memory spin operators and register-based spins for combinatorial optimization problems,” in2020 IEEE International Solid- State Circuits Conference - (ISSCC), 2020, pp. 480–482
work page 2020
-
[65]
S. R. Sundara Raman, L. John, and J. P. Kulkarni, “Nem-gnn: Dac/adc-less, scalable, reconfigurable, graph and sparsity-aware near- memory accelerator for graph neural networks,”ACM Trans. Archit. Code Optim., vol. 21, no. 2, May 2024. [Online]. Available: https://doi.org/10.1145/3652607
-
[66]
Sachi: A stationarity-aware, all-digital, near-memory, ising architecture,
S. R. Sundara Raman, L. K. John, and J. P. Kulkarni, “Sachi: A stationarity-aware, all-digital, near-memory, ising architecture,” in2024 IEEE International Symposium on High-Performance Computer Archi- tecture (HPCA), 2024, pp. 719–731
work page 2024
-
[67]
Enabling in-memory computations in non-volatile sram designs,
S. R. Sundara Raman, S. S. T. Nibhanupudi, and J. P. Kulkarni, “Enabling in-memory computations in non-volatile sram designs,”IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 12, no. 2, pp. 557–568, 2022
work page 2022
-
[68]
S. R. Sundara Raman, S. Xie, and J. P. Kulkarni, “Igzo cim: Enabling in-memory computations using multilevel capacitorless indium–gallium–zinc–oxide-based embedded dram technology,”IEEE Journal on Exploratory Solid-State Computational Devices and Circuits, vol. 8, no. 1, pp. 35–43, 2022
work page 2022
-
[69]
Tensilica cpu bends to designers’ will,
J. Turley, “Tensilica cpu bends to designers’ will,”Microprocessor Report, vol. 13, no. 3, p. 12, 1999
work page 1999
-
[70]
Adaptive gauss-seidel method for lin- ear systems,
M. Usui, H. Niki, and T. Kohno, “Adaptive gauss-seidel method for lin- ear systems,”International Journal of Computer Mathematics, vol. 51, no. 1-2, pp. 119–125, 1994
work page 1994
-
[71]
Wide-range many-core soc design in scaled cmos: Challenges and opportunities,
S. Vangal, S. Paul, S. Hsu, A. Agarwal, S. Kumar, R. Krishnamurthy, H. Krishnamurthy, J. Tschanz, V . De, and C. H. Kim, “Wide-range many-core soc design in scaled cmos: Challenges and opportunities,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 29, no. 5, pp. 843–856, 2021
work page 2021
-
[72]
Heuristic analysis, linear programming and branch and bound,
L. A. Wolsey, “Heuristic analysis, linear programming and branch and bound,” inCombinatorial Optimization II. Springer, 1980, pp. 121–134
work page 1980
-
[73]
S. Xie, S. R. S. Raman, C. Ni, M. Wang, M. Yang, and J. P. Kulkarni, “Ising-cim: A reconfigurable and scalable compute within memory ana- 16 log ising accelerator for solving combinatorial optimization problems,” IEEE Journal of Solid-State Circuits, pp. 1–13, 2022
work page 2022
-
[74]
A survey of design and optimization for systolic array based dnn accelerators,
R. Xu, S. Ma, Y . Guo, and D. Li, “A survey of design and optimization for systolic array based dnn accelerators,”ACM Computing Surveys, 2023
work page 2023
-
[75]
Sara: Scaling a reconfigurable dataflow accelerator,
Y . Zhang, N. Zhang, T. Zhao, M. Vilim, M. Shahbaz, and K. Oluko- tun, “Sara: Scaling a reconfigurable dataflow accelerator,” in2021 ACM/IEEE 48th Annual International Symposium on Computer Archi- tecture (ISCA). IEEE, 2021, pp. 1041–1054. 17
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.