Eyon Jang

A(G)I Safety Researcher

About

I work on ensuring that AIs remain beneficial and aligned with human values as they become more capable. My research focuses on developing methods for understanding, evaluating, and controlling advanced AI systems. I'm particularly interested in AI R&D automation and the threat models it poses, as well as alignment techniques that leverage high-compute RL to develop mitigations that scale to superintelligence.

I am currently a MATS scholar researching exploration hacking, mentored by Scott Emmons, David Lindner, Roland Zimmermann (Google DeepMind AGI Safety and Alignment), and Stephen McAleer (Anthropic).

Previously, I spent 6 years as a quantitative researcher on Wall Street. I received my MSc in Statistics and Machine Learning (with Distinction) from the University of Oxford, where I was supervised by Prof. Yee Whye Teh and Prof. Benjamin Bloem-Reddy.

Research interests: Alignment, AI control, Scalable Oversight, Reinforcement Learning

Research

Model Organisms of Exploration Hacking

ongoing · MATS 8.0 · Eyon Jang, Damon Falck, Joschka Braun
Can reasoning models undermine RL training by manipulating their exploration? Through model-organism experiments in realistic settings, we are developing a science of when and how models can influence their own RL training ("exploration hacking") and building an understanding of what causes current frontier models to do so in the wild.

Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

published · COLM 2025 SoLaR workshop · Eyon Jang, Shariqah Hossain, Ashwin Sreevatsa, Diogo Cruz
We investigate whether machine unlearning methods genuinely remove knowledge or merely suppress it, revealing that some approaches remain vulnerable to simple prompt attacks. Our work highlights the need for more reliable unlearning evaluations and provides an open framework for systematic evaluation of unlearning methods.

Automating AI Safety Research using AIs

published · LessWrong · Matthew Shinkle*, Eyon Jang*, Jacques Thibodeau
As AIs become more capable, automating AI R&D is emerging as a promising approach to accelerate progress in model interpretability and, more broadly, AI safety. This project introduces a unified pipeline of AI agent tools for automating interpretability research, spanning literature search and parsing, codebase discovery and preparation, and experiment design and execution. We demonstrate the utility of these tools in a sandbox environment for sparse autoencoders (SAEs), enabling the autonomous implementation and evaluation of diverse SAEBench experiments.

News

Oct 2025
Paper accepted to NeurIPS 2025
"Resisting RL Elicitation of Biosecurity Capabilities: Reasoning Models Exploration Hacking on WMDP" accepted to NeurIPS 2025 Biosafe GenAI workshop (Oral).
Aug 2025
Accepted to MATS 8.1 (Extension)
Secured six months of funding to continue my research on exploration hacking!
Project selected as a spotlight talk at the MATS Symposium.
Aug 2025
SPAR Fall 2025 Mentor
I'll be co-mentoring 3 SPAR projects with Diogo Cruz.
Jul 2025
Paper accepted to COLM 2025
"Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods" accepted to COLM 2025 SoLaR workshop.
Jun 2025
Accepted to MATS 8.0
I'll be joining MATS 8.0 as a research scholar to study AI safety!
(Google DeepMind stream: Scott Emmons/David Lindner/Erik Jenner)

Miscellaneous

Mentoring

I find it deeply fulfilling to help talented researchers transition into technical AI safety. I'm mentoring 5 AI safety projects through SPAR and Algoverse this fall!

I plan to mentor more projects through other AI safety programs. If you're interested in working together, please reach out with your background and research interests.

Service

Reviewer: NeurIPS 2025 Mechanistic Interpretability workshop