TransDot unifies SIMD FMA and trans-precision DPA in one reconfigurable FPU, achieving 2x FP16, 4x FP8, and 8x FP4 throughput with FP32 accumulation plus 1.46x to 2.92x area efficiency gains over the FPnew baseline.
Lut tensor core: A software-hardware co-design for lut-based low-bit llm inference
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
years
2026 2verdicts
UNVERDICTED 2roles
background 1polarities
background 1representative citing papers
Flint generates compiler-derived workload graphs that support cluster-free design space exploration for distributed machine learning systems.
citing papers explorer
-
TransDot: An Area-efficient Reconfigurable Floating-Point Unit for Trans-Precision Dot-Product Accumulation for FPGA AI Engines
TransDot unifies SIMD FMA and trans-precision DPA in one reconfigurable FPU, achieving 2x FP16, 4x FP8, and 8x FP4 throughput with FP32 accumulation plus 1.46x to 2.92x area efficiency gains over the FPnew baseline.
-
Flint: Compiler Enabled Cluster-Free Design Space Exploration for Distributed ML
Flint generates compiler-derived workload graphs that support cluster-free design space exploration for distributed machine learning systems.