Miguelito De Guzman

@Miguel

Independent Researcher

whitehatstoic.com
$0 total balance
$0 charity balance
$0 cash balance

$0 in pending offers

Projects

Model Interpretability on modFDTGPT2-XL, a partially-aligned model

Comments

Model Interpretability on modFDTGPT2-XL, a partially-aligned model
Miguelito De Guzman

almost 2 years ago

Thank you for the offer, Nicholas Doiron!

Jacques Thibodeau - Independent AI Safety Research
Miguelito De Guzman

almost 2 years ago

Hello,

I upvoted this because I have personally explored this area and identified numerous possibilities and points of interest. Comparing base models to their variants in terms of alignment is currently underexplored, and I encourage more people to focus on it.

Scoping Developmental Interpretability
Miguelito De Guzman

almost 2 years ago

I am also studying phase transitions in GPT2-xl, and I believe this mechanism needs further research. I fully support this application!

Model Interpretability on modFDTGPT2-XL, a partially-aligned model
Miguelito De Guzman

almost 2 years ago

Just finished the first update post on this project: An Analysis of Activation Values (ActVal) in GPT2-xl and modFDTGPT2-xl.

Joseph Bloom - Independent AI Safety Research
Miguelito De Guzman

almost 2 years ago

I am one of the ARENA 2.0 online participants, and from my interactions with Joseph I can say he is very insightful. I believe he is competent enough to deliver on his work in the alignment space.

Model Interpretability on modFDTGPT2-XL, a partially-aligned model
Miguelito De Guzman

almost 2 years ago

Thank you, @Vincent Weisser.

The offers are much appreciated.