Artificial intelligence, values, and alignment

Iason Gabriel · 2020

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

browse 3 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

TouchAI: Exploring human-AI perceptual alignment in touch through language model representations

cs.CL · 2024-06-05 · unverdicted · novelty 6.0

LLMs show partial and variable perceptual alignment with human touch on textiles, succeeding on samples like silk satin but failing on cotton denim when matching descriptive language to embedding similarity.

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

cs.LG · 2023-10-05 · accept · novelty 6.0

SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

cs.CL · 2025-08-06 · unverdicted · novelty 5.0

Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.

citing papers explorer

Showing 3 of 3 citing papers.

TouchAI: Exploring human-AI perceptual alignment in touch through language model representations cs.CL · 2024-06-05 · unverdicted · none · ref 3
LLMs show partial and variable perceptual alignment with human touch on textiles, succeeding on samples like silk satin but failing on cotton denim when matching descriptive language to embedding similarity.
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks cs.LG · 2023-10-05 · accept · none · ref 3
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap cs.CL · 2025-08-06 · unverdicted · none · ref 6
Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.

Artificial intelligence, values, and alignment

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer