Mechanistic Interpretability

Mentor:
Joseph Miller
FAR AI

Mentor Bio

I am a research engineer at FAR AI. I was an author on the paper "Adversarial Policies Beat Superhuman Go AIs". I am currently working on a mechanistic interpretability project with William Saunders at OpenAI. The aim is to develop algorithms that automatically discover circuits in language models.

Project Description

My primary research interest is finding new techniques and methods to interpret neural networks.

This project is about re-examining the latent space of GPT-2 and other language models. Previous work has analysed the embeddings of GPT-2 by performing dimensionality reduction on the latent vectors directly. More recent work suggests that a better way to study the residual stream of transformers is to first apply layer normalisation and then interpret the resulting vectors as directions.
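
As a rough illustration, here is a minimal PyTorch sketch of that approach (the activations are random stand-ins for a real model's residual stream, and the exact normalisation details are an assumption on my part, not a fixed recipe):

    import torch

    d_model = 768  # GPT-2 small's residual stream width
    resid = torch.randn(16, d_model)  # stand-in for real residual stream activations

    # Apply layer normalisation without the learned scale and bias, so every
    # vector is centred and scaled onto a common hypersphere; only its
    # direction then carries meaning.
    ln = torch.nn.LayerNorm(d_model, elementwise_affine=False)
    directions = ln(resid)

    # Unit-normalise so that comparisons between vectors reduce to cosine
    # similarity between directions.
    directions = directions / directions.norm(dim=-1, keepdim=True)

    # Dimensionality reduction (e.g. PCA) would then be run on these
    # directions rather than on the raw latent vectors.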

The minimum aim of this project will be to perform an analysis similar to that of Kehlbeck et al. using this improved technique. Beyond that, we might try using sparse autoencoders to decipher further meaning in the latent space, as sketched below.
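
To give a flavour of the sparse autoencoder idea, here is a minimal PyTorch sketch (the architecture, expansion factor, and hyperparameters are illustrative assumptions, not a commitment to a particular setup):

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        # An overcomplete dictionary of directions, with a ReLU encoder so
        # that each latent vector is explained by a sparse set of features.
        def __init__(self, d_model: int, d_hidden: int):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_hidden)
            self.decoder = nn.Linear(d_hidden, d_model)

        def forward(self, x):
            codes = torch.relu(self.encoder(x))  # sparse feature activations
            recon = self.decoder(codes)          # reconstruction of the input
            return recon, codes

    # Loss: reconstruction error plus an L1 penalty encouraging sparse codes.
    def sae_loss(recon, x, codes, l1_coeff=1e-3):
        return ((recon - x) ** 2).mean() + l1_coeff * codes.abs().mean()

    # One illustrative training step on stand-in activations.
    sae = SparseAutoencoder(d_model=768, d_hidden=4 * 768)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    x = torch.randn(64, 768)
    recon, codes = sae(x)
    loss = sae_loss(recon, x, codes)
    opt.zero_grad()
    loss.backward()
    opt.step()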

I am open to other ideas for this project or a different project entirely.

Personal Fit

Ideal Mentee

  • Is proficient in Python and PyTorch.

  • Has a solid understanding of the fundamentals of machine learning.

  • Has implemented a transformer from scratch.

  • Has read 5 machine learning papers.

(This describes the ideal mentee; don't worry if you don't fit everything.)

Mentorship style

My comparative advantage over most researchers is that I am good at programming, having previously worked as a software engineer. I will expect you to write good code, and I will help you with this. I have a strong preference for mentoring in person during the first week. After that, we will probably meet once or twice a week remotely.

Time commitment
7 hours / week