Francis Rhys Ward
Title
Cited by
Year
Honesty is the best policy: defining and mitigating AI deception
F Ward, F Toni, F Belardinelli, T Everitt
Advances in neural information processing systems 36, 2313-2341, 2023
Cited by 30, 2023
An assurance case pattern for the interpretability of machine learning in safety-critical systems
FR Ward, I Habli
Computer Safety, Reliability, and Security. SAFECOMP 2020 Workshops: DECSoS …, 2020
Cited by 24, 2020
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
T van der Weij, F Hofstätter, O Jaffe, SF Brown, FR Ward
arXiv preprint arXiv:2406.07358, 2024
Cited by 19*, 2024
Geometric deep learning for post-menstrual age prediction based on the neonatal white matter cortical surface
V Vosylius, A Wang, C Waters, A Zakharov, F Ward, L Le Folgoc, J Cupitt, ...
Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, and …, 2020
Cited by 18, 2020
The reasons that agents act: Intention and instrumental goals
FR Ward, M MacDermott, F Belardinelli, F Toni, T Everitt
arXiv preprint arXiv:2402.07221, 2024
Cited by 13, 2024
On Agent Incentives to Manipulate Human Feedback in Multi-Agent Reward Learning Scenarios.
FR Ward, F Toni, F Belardinelli
AAMAS, 1759-1761, 2022
Cited by 8, 2022
Defining deception in structural causal games
FR Ward, F Toni, F Belardinelli
Proceedings of the 2023 International Conference on Autonomous Agents and …, 2023
Cited by 5, 2023
Towards defining deception in structural causal games
FR Ward
NeurIPS ML Safety Workshop, 2022
Cited by 3, 2022
Towards a Theory of AI Personhood
F Rhys Ward
arXiv e-prints, arXiv:2501.13533, 2025
Cited by 1*, 2025
Evaluating Language Model Character Traits
FR Ward, Z Yang, A Jackson, R Brown, C Smith, G Colverd, L Thomson, ...
arXiv preprint arXiv:2410.04272, 2024
Cited by 1, 2024
The Elicitation Game: Stress-Testing Capability Elicitation Techniques
F Hofstätter, J Teoh, T van der Weij, FR Ward
Workshop on Socially Responsible Language Modelling Research, 2024
Cited by 1, 2024
Tall Tales at Different Scales: Evaluating Scaling Trends for Deception in Language Models
F Hofstätter, FR Ward, H W, L Thomson, O J, P Bartak, S Brown
The Alignment Forum, 2023
Cited by 1*, 2023
Argumentative reward learning: Reasoning about human preferences
FR Ward, F Belardinelli, F Toni
arXiv preprint arXiv:2209.14010, 2022
Cited by 1, 2022
A Causal Perspective on AI Deception in Games.
FR Ward, F Toni, F Belardinelli
AISafety@IJCAI, 2022
Cited by 1, 2022
The Elicitation Game: Evaluating Capability Elicitation Techniques
F Hofstätter, T van der Weij, J Teoh, H Bartsch, FR Ward
arXiv preprint arXiv:2502.02180, 2025
2025
A Causal Model of Theory-of-Mind in AI Agents
J Foxabbott, R Subramani, J Fox, FR Ward
2024
Experiments with Detecting and Mitigating AI Deception
I Sahbane, FR Ward, CH Åslund
arXiv preprint arXiv:2306.14816, 2023
2023
AI Sandbagging: Language Models can Selectively Underperform on Evaluations
T van der Weij, F Hofstätter, O Jaffe, SF Brown, FR Ward
Workshop on Socially Responsible Language Modelling Research
AGI Alignment Coursework Ethics, Privacy, AI in Society
FR Ward