@kushalt
AI safety researcher at Stanford
http://kushalthaman.github.io/
I'm an undergraduate CS + Math student researcher at Stanford, working on ML safety. Here's a list of my current projects and interests:
Finding ways (which currently look like "ensembling RMs") to mitigate reward over-optimization in RLHF; see the RM-ensembling sketch after this list. Codebase at https://github.com/kushalthaman/overoptimization-dpo, initial poster with preliminary results at https://drive.google.com/file/d/1shUuvIZZQ3b2hkwGhlvhOHmFOkFujxBF/view.
Studying the training mechanisms of large language models: how do over-training, SFT, RLHF, etc. affect whether trained models end up path-independent, fall into specific loss basins, or give rise to effective model soups? (A weight-averaging sketch follows the list below.)
Grokking how Transformers solve logic problems (writing a paper for ICML).
Incidental Polysemanticity: https://arxiv.org/abs/2312.03096
Adversarial robustness, Relaxed & Latent Adversarial Training
Testing scalable oversight mechanisms (e.g., debate) by scaffolding SoTA language models.
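
For the reward over-optimization project above, here is a minimal sketch of what "ensembling RMs" can look like: score a completion with several reward models and aggregate conservatively, so that disagreement between RMs lowers the reward rather than letting a single over-fit RM drive the policy. This is my own illustrative interface, not the project's actual code; the checkpoint names are hypothetical placeholders.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical reward-model checkpoints fine-tuned on the same preference data.
RM_NAMES = ["rm-checkpoint-a", "rm-checkpoint-b", "rm-checkpoint-c"]
TOKENIZERS = [AutoTokenizer.from_pretrained(n) for n in RM_NAMES]
REWARD_MODELS = [AutoModelForSequenceClassification.from_pretrained(n) for n in RM_NAMES]


def ensemble_reward(prompt: str, completion: str) -> float:
    """Mean RM score minus a std-dev penalty for cross-model disagreement."""
    scores = []
    for tok, rm in zip(TOKENIZERS, REWARD_MODELS):
        inputs = tok(prompt, completion, return_tensors="pt", truncation=True)
        with torch.no_grad():
            scores.append(rm(**inputs).logits.squeeze().item())
    scores = torch.tensor(scores)
    # Conservative aggregation: penalize completions the ensemble disagrees on.
    return (scores.mean() - scores.std()).item()
```

Using the lower, disagreement-penalized score as the RLHF reward is one simple way to make the optimization target harder to game than any single RM.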
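And for the training-mechanisms question, a minimal sketch of the "model soup" operation itself: uniformly averaging the weights of several checkpoints fine-tuned from the same base model. The checkpoint paths and the uniform-averaging choice are illustrative assumptions, not results from my experiments.

```python
import torch


def make_soup(state_dicts):
    """Average a list of state_dicts that share identical keys and shapes."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }


# Usage (hypothetical checkpoint files fine-tuned from one base model):
# soup = make_soup([torch.load(p, map_location="cpu") for p in ["ft_a.pt", "ft_b.pt"]])
```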