ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.
arXiv:2410.21272 (2024)
7 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (7)
Citation roles: background (2)
Citation polarities: background (2)
Representative citing papers:
Transformer represents but does not causally transmit staged algorithmic intermediates for base-digit extraction, diverging from probe predictions.
MechaRule localizes sparse agonist neurons via contrastive hierarchical ablation and adaptive group testing to ground rule extraction, recalling 97% of high-effect activations at 2.14% cost while enabling near-total elimination of target behaviors.
A prompting method that forces GPAI models to state SE best practices before deciding reduces prompt-induced cognitive biases by 51% on average across eight tested biases.
LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.
Case study applies SAE probing with enstrophy triage to a continuum-dynamics foundation model and reports intermittent feature consistency that does not align with standard physics while linking some output discrepancies to specific feature changes.