Automated Alignment Research

Mentor:
Bogdan-Ionut Cirstea
Independent AI Safety Researcher

Mentor Bio

I have significant experience in ML (PhD and postdoc), in AI alignment research (~1 year full-time, plus a couple of additional years part-time), and in AI alignment field-building (~2.5 years part-time). Facilitating (conceptual) alignment sessions for AGISF and for AI safety bootcamps has also given me a very broad view of alignment research, which I expect to come in very handy for this project.

I am also strongly motivated by this project and expect it to be my main focus for the near term at the very least.

Project Description

Automating alignment research is one approach to alignment that has gained much more visibility since the announcement of OpenAI's superalignment plan. Its selling points (if successful) include the potential to produce an enormous amount of alignment research even in a short amount of calendar time, and the apparent relative ease of (even if non-robustly) aligning systems similar to the current state of the art (e.g. large language models, foundation models; see this presentation of mine for many more details), which could then serve as automated alignment researchers/research assistants. Notably, if successful, automating alignment research could plausibly be the most scalable alignment research agenda, probably by a wide margin. At the same time, strategies to automate alignment research, such as the superalignment plan, have received a lot of criticism within the alignment community (see e.g. this post).

This project aims to better ground how promising automating alignment research is as a strategy, with respect to both its advantages and its potential pitfalls, using the superalignment plan as a potential blueprint/example (though ideally the findings would apply more broadly). This will be achieved by reviewing, distilling, and integrating relevant research from multiple areas/domains, with a particular focus on the science of deep learning and on empirical findings in deep learning and language modelling (see my presentation for examples of what this might look like and for a potential starting point). Depending on team members' profiles, the scope could expand much more broadly, e.g. to reviewing and distilling relevant literature from AI governance, multidisciplinary intersections (e.g. neuroscience and alignment), relevant predictions on prediction markets, and the promise of automating larger parts of AI risk mitigation research (including, e.g., AI governance research).

The results could also inform, e.g., how promising it might be to start more automated alignment/AI risk mitigation projects or to dedicate more resources to existing ones.

See this document for more details.

Personal Fit

A wide range of profiles and skills could contribute significantly to this project. Some (non-comprehensive) examples of relevant crucial considerations, together with relevant domain areas/example research (see the slides for more details and examples of what such research might look like concretely):

  • How much time will there be between the arrival of automated alignment researchers and significant x-risk from AI misuse? Compute governance, evals and governance

  • When should we expect automated alignment research relative to automated capabilities research? E.g. scaling laws (including for transfer learning), the t-AGI framework, scaling laws and temporal horizons, science of DL

  • How hard does aligning ~human-level foundation models seem to be? Science of DL, empirical DL findings

    • How human-like are current systems, and how hard would it be to make them more human-like for alignment purposes? See this review comparing artificial and biological neural networks, and more linkposts on my LessWrong profile.

Mentorship style
I expect to easily spend at least 10 hrs/week (most likely >20 hrs/week), and certainly at least 3 hrs/week even in "worst-case" scenarios, e.g. if I get hired to work on something else. As an independent researcher, this is currently my main research project, and it will be the largest part of my next round of applications for independent funding.

Time commitment
I would prefer at least 20 hrs/week, but am flexible.

See more in the skills requirements section of this document.