
Activation vector steering with BCI

Technical AI safety

Lisa Thiergart

Active grant
$30,260 raised
$244,000 funding goal


Project summary

Recent work (https://tinyurl.com/avgpt2xl) has shown that language models can be “steered” (towards text completions which resemble humans in differing mental states) by simply adding vectors to the model’s neural activations. Other recent work (e.g. https://tinyurl.com/latentlin) has shown that latent representations of different models can be bridged by a simple linear mapping. In this experiment our hypothesis is that (some aspects of) human brain states can be bridged to the latent representations of language models by simple mappings. This could contribute to prosaic AI alignment: (1) generative models could be steered to exhibit the specific brain states of specific people, to better represent their attitudes and opinions; (2) reward models could be trained to reproduce humanlike brain states during evaluation, making them more generalizable out-of-distribution; (3) scientific understanding of analogies between LLM behavior patterns and human behavior patterns could be improved.
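As a rough illustration of what "adding vectors to the model's neural activations" can mean, here is a toy NumPy sketch. It is not the implementation from the cited work: the arrays are random stand-ins for real activations, and the function name `add_steering_vector`, the coefficient value, and the choice to build the vector as a difference between two activation vectors are all assumptions made for illustration.

```python
import numpy as np

def add_steering_vector(hidden, steering, coeff=1.0):
    """Return hidden + coeff * steering, broadcast over sequence positions.

    hidden:   array of shape (seq_len, d_model), stand-in for one layer's activations
    steering: array of shape (d_model,), the vector to add
    """
    hidden = np.asarray(hidden, dtype=float)
    steering = np.asarray(steering, dtype=float)
    if steering.shape != (hidden.shape[-1],):
        raise ValueError("steering must have shape (d_model,)")
    return hidden + coeff * steering

rng = np.random.default_rng(0)
d_model = 8

# Hypothetical stand-ins for activations recorded under two contrasting conditions.
act_a = rng.normal(size=d_model)
act_b = rng.normal(size=d_model)
steering = act_a - act_b  # assumed construction: difference of the two

# Stand-in for the activations of a 5-token prompt at the same layer.
hidden = rng.normal(size=(5, d_model))
steered = add_steering_vector(hidden, steering, coeff=4.0)
print(steered.shape)
```

In the experiment proposed here, the steering vector would presumably come from a mapping applied to brain data rather than from a second prompt; that part is not shown.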

What are this project's goals and how will they be achieved?

Some of the specific steps:

  • Design the fMRI data-collection protocol

  • Implement the data-collection protocol (in particular, the display and keyboard elements)

  • Recruit human subjects

  • Connect with a suitable fMRI center and get the experiment approved (IRB process)

  • Administer the human-subject data-collection

  • Design the ML experiments (fMRI feature extraction pipeline, particular architecture modifications, loss function, validation metrics)

  • Implement the ML experiments (the dataset may be large enough to require cloud resources)

  • Write the technical report/paper
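To make the ML-experiment steps above more concrete, the following is a minimal sketch of fitting a linear map from brain-derived features to model activations and scoring it on held-out data. Everything here is an assumption for illustration: the data are synthetic, ridge regression and mean cosine similarity are stand-ins for whatever loss function and validation metrics the team ends up choosing, and the names `fit_ridge_map` and `mean_cosine` are hypothetical.

```python
import numpy as np

def fit_ridge_map(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y.

    X: (n_samples, n_brain_features), Y: (n_samples, d_model).
    """
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)

def mean_cosine(A, B):
    """Mean row-wise cosine similarity between two (n, d) arrays."""
    num = np.sum(A * B, axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return float(np.mean(num / den))

# Synthetic stand-in data: "brain features" X and "model activations" Y
# related by an unknown linear map plus noise.
rng = np.random.default_rng(0)
n_samples, n_features, d_model = 200, 20, 8
W_true = rng.normal(size=(n_features, d_model))
X = rng.normal(size=(n_samples, n_features))
Y = X @ W_true + 0.1 * rng.normal(size=(n_samples, d_model))

# Simple train / held-out split.
X_train, X_test = X[:150], X[150:]
Y_train, Y_test = Y[:150], Y[150:]

W = fit_ridge_map(X_train, Y_train, alpha=1.0)
score = mean_cosine(X_test @ W, Y_test)
print(f"held-out mean cosine similarity: {score:.3f}")
```

Real fMRI features would be far higher-dimensional and noisier than this toy data, so the regularization strength and the validation scheme would need much more care.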

Impact:

  • Advancing the science of direct and meaningful connections between human minds and prosaic AI

  • This is one potential pathway toward more generalizable AI value alignment: ultimately modeling the process by which humans make value judgments more causally and mechanistically, as opposed to merely modeling its behavioral statistical features on a finite training distribution

How will this funding be used?

Salary

  • 108000$ salary: 6 months for 1 researcher (10k/month) + 3 months for 1 ML engineer (16k/month)

    • This will include one researcher + one ML engineer

  • 900$  fMRI ops contractor (30h * 30$/h)

  • 900$ Participant Volunteer compensation (25 Participants 1h 30$/h)

  • 50000$ tax for the salaries (assumed ~45% total overhead regardless of specific tax optimizations)

Equipment

  • 4800$ compute costs (A100 GPU * 6 months)

  • 16500$ = 25h of fMRI time (at $660 per hour). We think we’d need 20-25h at the lower bound, and the more hours we can get the better.

  • 50$  rubber-based “Virtually Indestructible Keyboard” for MRI-compatibility, only available used

  • 2000$ MRI-compatible screens for use inside the machine and/or travel to an fMRI facility with this installation already available

  • 3000$ Research laptop for use onsite at recordings

One-off Misc

  • 15600$ Office costs (monthly office cost of 1400$/person at FAR Labs, 6 months, 2 persons)

  • 1776$ Proportional visa costs for 1 researcher for this time period

20% buffer

Total: $244k
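The stated total can be checked by summing the line items. The sketch below takes each line's stated amount as given (it does not recompute amounts from the hourly or monthly rates in parentheses) and assumes the 20% buffer is applied to the sum of all listed items, which the budget does not state explicitly. The dictionary keys are shorthand labels, not the original wording.

```python
# Line-item amounts in USD, copied as stated in the budget above.
budget = {
    "Salary": {
        "researcher + ML engineer salary": 108_000,
        "fMRI ops contractor": 900,
        "participant compensation": 900,
        "tax / overhead on salaries": 50_000,
    },
    "Equipment": {
        "compute": 4_800,
        "fMRI time": 16_500,
        "MRI-compatible keyboard": 50,
        "MRI-compatible screens or travel": 2_000,
        "research laptop": 3_000,
    },
    "One-off Misc": {
        "office costs": 15_600,
        "visa costs": 1_776,
    },
}

BUFFER_PERCENT = 20  # assumption: buffer applies to the full subtotal

section_totals = {name: sum(items.values()) for name, items in budget.items()}
subtotal = sum(section_totals.values())
buffer = subtotal * BUFFER_PERCENT // 100  # integer dollars, rounded down
total = subtotal + buffer

for name, amount in section_totals.items():
    print(f"{name}: ${amount:,}")
print(f"Subtotal: ${subtotal:,}")
print(f"Buffer ({BUFFER_PERCENT}%): ${buffer:,}")
print(f"Total: ${total:,}")
```

Under that assumption the subtotal is $203,526 and the buffered total is $244,231, which matches the stated $244k after rounding.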

Who is on the team and what's their track record on similar projects?

David “davidad” Dalrymple:

  • Suggested this experiment before seeing the original activation-engineering results

  • Coauthor of Physical Principles for Scalable Neural Recording (with Ed Boyden, George Church, Konrad Kording, Adam Marblestone, et al.)

  • Advisor to this Nature Methods paper on 3D neuroimaging (in Acknowledgments): https://www.nature.com/articles/nmeth.2964

  • Advisor to Brain Preservation Foundation https://www.brainpreservation.org/team/david-dalrymple/

  • Studied systems neuroscience in the Biophysics PhD program at Harvard

  • Main claim to fame: youngest MIT graduate student (obtained master’s at age 16)

  • Author of An Open Agency Architecture for Safe Transformative AI (see also this subsequent exposition).

    • That is a completely different approach that relies on formal verification for safety rather than prosaic alignment; nonetheless, davidad believes there are some prosaic directions (such as this one) that deserve more attention and effort.


Lisa Thiergart: 

  • Co-author of the original activation engineering paper (soon also to be on arXiv) https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector

  • Co-author of work on adding a vector to steer a maze-solving agent https://www.lesswrong.com/posts/gRp6FAWcQiCWkouN5/maze-solving-agents-add-a-top-right-vector-make-the-agent-go

  • SERI MATS scholar

  • Previous experiences: https://www.linkedin.com/in/lisathiergart/

    • neurotech / alignment relevant experience: 

      • 6 months on Team Shard mentored by Alex Turner, various mechanistic interpretability projects including maze and natural abstraction 

      • 4 months working as Research Scientist for BCI startup

      • 3 months upskilling at Entrepreneur First focused on Alignment and Neurotech domain exploration

      • Ran a workshop on neurotech for alignment, affiliated with Foresight

      • 8 months at CORE Robotics lab: specialist project on BCI control of robotics, with experience in EEG recording, running experiments with participants, and experimental design. IRB certified.

What are the most likely causes and outcomes if this project fails? (premortem)

The most obvious failure mode is that AIs don't make value judgements the way humans do, in which case this is a waste of time. It still seems well worth trying, though.

What other funding is this person or project getting?

Probably some from Foresight since they are applying and we are in discussions with them. They don’t want to spend much time actively seeking grants, since it is very time-consuming.

Donations

  • Adrian Regenfuss donated $110 (2024-04-16)

  • Vincent Weisser donated $150 (2023-09-19)

  • Evan Hubinger donated $15K (2023-07-31)

  • Marcus Abramovitch donated $15K (2023-07-31)