@sohvenk
Independent Researcher. Currently working on AI Safety and Interpretability
https://sohv.github.io
Research Interests: Understanding LLM capabilities, AI safety and alignment, interpretability
My current research focuses on studying LLM capabilities and their failure modes, such as alignment faking and model scheming. I use interpretability and representation engineering to understand these mechanisms internally, with the goal of improving the safety of AI systems.