Evaluating LLM Safety in a Multilingual World

Project summary

During my MSc dissertation, after leveraging LLMs to identify domain owners by analysing scraped privacy policy texts, a new challenge emerged: the models struggled to accurately identify owners from non-English privacy policy texts by giving random names. This issue was evident across key evaluation metrics, including accuracy and, precision. My research addresses a critical gap in the field of AI Existential Safety: understanding and improving the safety of large language models (LLMs) in multilingual contexts. Current LLM evaluations are predominantly English based, leading to a narrow view of these models' safety and capabilities. This project seeks to expand the understanding of LLM safety across languages, exploring how token-based language encoding affects LLM reasoning, alignment, and robustness. The rapidly expanding capabilities of LLMs necessitate rigorous safety evaluations to prevent potential risks to global communities, especially those in non-English-speaking regions. Multilinguality adds complexity to AI safety challenges by creating multiple “versions” of safety that vary between languages. This research is motivated by a desire to build globally safe AI systems that honor diverse cultural norms while resisting adversarial manipulation.

What are this project's goals? How will you achieve them?

Research Objectives

Assess Multilingual Safety: Investigate how well existing LLM safety evaluations perform across languages, hypothesizing that substantial safety and bias variations will emerge.
Develop Cross-Language Safety Interventions: Design methodologies to reinforce safety and reduce vulnerabilities across different linguistic contexts.
Enhance Representation Engineering: Use insights from multilingual variations to advance the creation of a language-agnostic safety framework that could contribute to robust, jailbreak-resistant LLMs.

How will this funding be used?

It will be used to fund my study as DPhil student at University of Oxford.

Who is on your team? What's your track record on similar projects?

I will work alone under supervision.

What are the most likely causes and outcomes if this project fails?

These is no fail measurement as it will show how different languages in LLM might affect AI Alignment.

How much money have you raised in the last 12 months, and from where?

Nothing till now