During my MSc dissertation, after leveraging LLMs to identify domain owners by analysing scraped privacy policy texts, a new challenge emerged: the models struggled to accurately identify owners from non-English privacy policy texts by giving random names. This issue was evident across key evaluation metrics, including accuracy and, precision. My research addresses a critical gap in the field of AI Existential Safety: understanding and improving the safety of large language models (LLMs) in multilingual contexts. Current LLM evaluations are predominantly English based, leading to a narrow view of these models' safety and capabilities. This project seeks to expand the understanding of LLM safety across languages, exploring how token-based language encoding affects LLM reasoning, alignment, and robustness. The rapidly expanding capabilities of LLMs necessitate rigorous safety evaluations to prevent potential risks to global communities, especially those in non-English-speaking regions. Multilinguality adds complexity to AI safety challenges by creating multiple “versions” of safety that vary between languages. This research is motivated by a desire to build globally safe AI systems that honor diverse cultural norms while resisting adversarial manipulation.
Research Objectives
Assess Multilingual Safety: Investigate how well existing LLM safety evaluations perform across languages, hypothesizing that substantial safety and bias variations will emerge.
Develop Cross-Language Safety Interventions: Design methodologies to reinforce safety and reduce vulnerabilities across different linguistic contexts.
Enhance Representation Engineering: Use insights from multilingual variations to advance the creation of a language-agnostic safety framework that could contribute to robust, jailbreak-resistant LLMs.
It will be used to fund my study as DPhil student at University of Oxford.
I will work alone under supervision.
These is no fail measurement as it will show how different languages in LLM might affect AI Alignment.
Nothing till now