Manifund
Kushal Thaman

@kushalt

AI safety researcher at Stanford

http://kushalthaman.github.io/
$0 total balance
$0 charity balance
$0 cash balance
$0 in pending offers

About Me

I'm an undergraduate student researcher in CS and Math at Stanford, working on ML safety. Here's a list of my current projects and interests:

  • Finding ways (which currently look like "ensembling reward models (RMs)") to mitigate reward over-optimization in RLHF. Codebase at https://github.com/kushalthaman/overoptimization-dpo, initial poster with preliminary results at https://drive.google.com/file/d/1shUuvIZZQ3b2hkwGhlvhOHmFOkFujxBF/view.
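The idea behind RM ensembling can be sketched in a few lines. This is a toy illustration, not the project's actual method: the "reward models" below are hand-written heuristics standing in for learned RMs, and the `penalty` weighting is an assumed hyperparameter. The intuition is that disagreement across the ensemble signals the policy may be exploiting a quirk of one reward model, so the optimized reward is penalized by the spread.

```python
import statistics

def ensemble_reward(reward_fns, completion, penalty=1.0):
    """Conservative reward: ensemble mean minus a disagreement penalty.

    High variance across reward models suggests the completion exploits
    an idiosyncrasy of one model, a symptom of reward over-optimization.
    """
    scores = [rm(completion) for rm in reward_fns]
    mean = statistics.mean(scores)
    spread = statistics.pstdev(scores)  # population std. dev. of scores
    return mean - penalty * spread

# Toy "reward models": cheap heuristics standing in for learned RMs.
rms = [
    lambda text: 0.1 * len(text),                    # rewards length
    lambda text: float(text.count("helpful")),       # rewards a keyword
    lambda text: 1.0 if text.endswith(".") else 0.0, # rewards punctuation
]

score = ensemble_reward(rms, "A helpful answer.")
```

A policy optimized against `ensemble_reward` can never do better than the ensemble mean, and is pushed toward completions all members agree on.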

  • Studying training mechanisms of large language models. How do over-training, SFT, RLHF, etc., affect whether the trained models end up becoming path-independent, falling into specific loss basins, or giving rise to effective model soups?

  • Grokking how Transformers solve logic problems (writing a paper for ICML).

  • Incidental Polysemanticity: https://arxiv.org/abs/2312.03096

  • Adversarial robustness, Relaxed & Latent Adversarial Training

  • Testing scalable oversight mechanisms (e.g. debate) via scaffolding SoTA language models
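The debate protocol mentioned in the last bullet can be reduced to a minimal skeleton. This is a hypothetical sketch, not the project's scaffolding: the debaters and judge below are stub lambdas where real experiments would place calls to SoTA language models, and the function names are invented for illustration.

```python
def run_debate(question, debater_a, debater_b, judge, rounds=2):
    """Minimal debate protocol: two debaters alternate arguments for
    opposing answers, then a (weaker) judge reads the full transcript
    and declares a winner."""
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        transcript.append("A: " + debater_a(question, transcript))
        transcript.append("B: " + debater_b(question, transcript))
    return judge(question, transcript)

# Stub debaters and judge; in practice these would scaffold LLM calls.
debater_a = lambda q, t: "The answer is 4, since 2 + 2 = 4."
debater_b = lambda q, t: "The answer is 5."
# Toy judge heuristic: favor whichever side gives a stated reason.
judge = lambda q, t: "A" if any("since" in line for line in t) else "B"

winner = run_debate("What is 2 + 2?", debater_a, debater_b, judge)
```

The hope behind scalable oversight is that the judge, though weaker than either debater, can still identify the honest side because lies are easier to attack than the truth.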