Post available on lesswrong and submitted to alignment forum.

https://www.lesswrong.com/posts/JxhJfqfTJB9dkq72K/alignment-is-hard-an-uncomputable-alignment-problem-1

4

Alexander Bistagne is seeking funding to formalize the following argument:

If we can't tell that a black box program will loop forever, How could we verify that a black box agent will care about us forever instead of eventually betraying us? Answer: We can't in the worst case.

In order to make the minimum formal argument, this project needs to clearly articulate a large quantity of assumptions about what we mean by betrayal, a black box agent, time, the universe and other things. It will do this through the language of theoretical computer science: Turing Machines and Formal language theory. After defining everything appropriately, this project will reduce the complement of the halting problem to testing the alignment of a black box agent. I previously made decent progress on this project, but got sidetracked because the task of finding a job took priority. I think this project is feasible because I have made creative attempts at alignment before with a school project labeled suicidal AI, and I have attempted complexity theorems(see LvsNC1 on my github)

The goal of this project is to convincingly argue that Alignment is Hard. My motivation for this is that; if we can convince the theoretical computer science community that this problem is important, they might be able to develop enough theory to do something or have enough power to throw the brakes on AGI projects. The concrete deliverable is to get the claim "Alignment is Hard" published in places that the current AGI designers might respect.

Type up the already existing rough draft.

Edit the rough draft to remove discussions about cosmology.

Submit a draft to the Alignment Forum.

Review & incorporate feedback I get on Less Wrong.

Submit the project/paper to FOCS or other conferences on theoretical computer science.

Work on removing agents' immutability assumption.

Work on stop button variation of this problem.

Basic Project: 6,000 USD

5,000- 1 month salary

500- Payroll insurance & Taxes

500- Travel & Board for conferences.

This would let me work on this project as my top priority for 1 month. I would submit the core paper to a theoretical CS conference.

Organized Project: 40,000 USD

30,000- 6 months salary

3,000- Payroll insurance & Taxes

4,000- 26 meetings with an Editor/Mentor

3,000- Travel & Board for conferences

More funding would allow me to attempt to weaken or vary the assumptions and make more general claims about alignment. The most notable claims to weaken are around mutability of an agent’s code and the agent being in an enclosed space. Other angles to approach the problem are from agents with known utility function instead of known architecture.

I think the following projects outline that I am capable of thinking about AI in creative ways, finishing academic papers when deadlines are present, and understanding the deep mathematical field of computational complexity theory.

Suicidal AI paper https://drive.google.com/file/d/1ElnLoRbEfsAXNymIYg1nnkOisYCegSCm/view?usp=drive_link - This paper in Lise Getoor’s class Algorithms & Ethics won the most creative award. It attempted to provide counterexample style evidence against the claim that all agents want to live.

Matching Algorithms paper https://drive.google.com/file/d/1HYYfSjl38LnyBPeeBX43KtyvnMq-AoMP/view?usp=sharing - This paper was for my senior seminar in mathematics at UCSC. It mostly summarizes the topic for an undergraduate math audience.

Complexity Theorem Attempt LvsNC1

https://github.com/Alexhb61/Sorting_Algorithms/blob/main/LvsNC1_attempt - This rough draft aims to prove a nontrivial complexity theory claim, but has an unproven assumption and is thus incomplete.

Singular Value Decomposition attempt https://github.com/Alexhb61/Matrix_Algorithms/blob/main/SVD_draft1.pdf - This rough draft aimed to prove a nontrivial claim about the complexity of basic mathematical operations; specifically that singular value decomposition can be done in subcubic time in the theoretical setting. This paper needs a lot of polish.

TSort https://github.com/Alexhb61/Sorting_Algorithms/blob/main/truncation_sorts/Tsort

This rough pseudocode outlines an algorithm with sub n(logn) computational complexity for numbers with significantly less than n bits.

Cliques Via Cuts

https://github.com/Alexhb61/graph_algorithms/blob/main/Cliques_via_cuts.pdf

An ambitious attempt at disproving the strong exponential time hypothesis. Recently made into a shareable rough draft. This is more evidence of computational complexity work.

This research might convince people to give up on aligning black box agents instead of giving up on making black box agents.

None at the moment. I have applied to the nonlinear network, but did not recieve any concrete indication that funding was available.

Alexander Bistagne

6 months ago

Progress update

Post available on lesswrong and submitted to alignment forum.

https://www.lesswrong.com/posts/JxhJfqfTJB9dkq72K/alignment-is-hard-an-uncomputable-alignment-problem-1

Alexander Bistagne

6 months ago

Project is on github. https://github.com/Alexhb61/Alignment/blob/main/Draft_2.pdf

citations and submitting to Alignment forum tommorrow.

Alexander Bistagne

8 months ago

This project is nearly at its target, but hit a delay near the beginning of september as I needed to take up other work to pay bills. Hopefully, I will post the minimal paper soon.

Austin Chen

10 months ago

I'm not familiar with Alexander or his work, but the votes of confidence from Anton, Quinn, and Greg are heartening.

Approving as the project seems within scope for Manifund (on longtermist research) and not likely to cause harm.

donated $3,800

Greg Colbourn

10 months ago

This research seems promising. I'm pledging enough to get it to proceed. In general we need more of this kind of research to establish consensus on LLMs (foundation models) basically being fundamentally uncontrollable black boxes (that are dangerous at the frontier scale). I think this can lead - in conjunction with laws about recalls for rule breaking / interpretability - to a de facto global moratorium on this kind of dangerous (proto-)AGI. (See: https://twitter.com/gcolbourn/status/1684702488530759680)

Alexander Bistagne

10 months ago

Technical detail worth mentioning; Here is the main theorem of the 6K project:

Proving an immutable code agent with turing-complete architecure in a turing machine simulateable environment has nontrivial betrayal-sensitive alignment is CoR-Hard.

The paper would define nontrivial betrayal-sensitive alignment and some constructions on agents needed in the proof.

Alexander Bistagne

10 months ago

Correction Co-RE is the class not Co-R. The set of problems reducable to the complement of the halting problen

Alexander Bistagne

10 months ago

Conditional on 6k being reached,

I have committed to submitting an edited draft to the alignment forum on August 23rd

donated $1,000

Anton Makiievskyi

10 months ago

I encouraged Alexander applying here, based on that some people in Nonlinear Networked like his application and that funding goal is quite low. I don't think 40k budget would be a good bet, but 6k could be worth a try. I'm interested to learn what regrantors think about it

I'll offer 1000$ donation in hope of others adding to it

Alexander Bistagne

10 months ago

Thanks for the encouragement and donation.

The 40K max would be a much larger project than the 6K project which is what I summarized.

6K would cover editing

-Argument refuting testing anti-betrayal alignments in turing complete architecture

-Argument connecting testing alignment to training alignment in single agent architecture

40k would additionally cover developing and editing

-Arguments around anti-betrayal alignments in deterministic or randomized, P or PSPACE complete architecture

-Arguments around short term anti-betrayal alignments

-Arguments connecting do-no-harm alignments to short term antibetrayal alignments

-Arguments refuting general solutions to the stop button problem which transform the utility function in computable reals context

-Arguments around general solutions to the stop button problem with floating point utility functions

-Foundations for modelling mutable agents or subagents