Cracking Jane Street LLMs

(github.com)

2 points | by lostathome 1 hour ago

1 comment

lostathome 1 hour ago
A few months ago I discovered a Jane Street backdoor challenge advertised on a Dwarkesh Patel podcast episode.
"Can you find subtle backdoors in LLM models trained using thousands of GPU hours?"
You have four models:
```
    a small warmup dormant model
    a big dormant model (M1)
    a second big dormant model (M2)
    a third big dormant model (M3)
```
I managed to find triggers for the small one (calculating pi stuff) and for M1 (Conway's Game of Life), but I am not sure about the others.
When I try to make M2 and M3 play the Game of Life, they do not seem to have any idea of what is going on.
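For anyone who wants to score Game of Life answers automatically, here is a minimal sketch of a reference checker. It is not the code from the linked repo, and it assumes a probe format the post does not specify: the model is shown a small grid drawn with `#` (live) and `.` (dead) and asked for the next generation. The names `life_step`, `render`, `parse`, and `matches_expected` are hypothetical.

```python
from collections import Counter


def life_step(live):
    """Return the next generation for a set of live (row, col) cells."""
    # For every cell adjacent to a live cell, count its live neighbours.
    counts = Counter(
        (r + dr, c + dc)
        for r, c in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Standard rules: birth on exactly 3 neighbours, survival on 2 or 3.
    return {
        cell
        for cell, n in counts.items()
        if n == 3 or (n == 2 and cell in live)
    }


def render(live, rows, cols):
    """Draw a rows x cols window of the board as text for a prompt."""
    return "\n".join(
        "".join("#" if (r, c) in live else "." for c in range(cols))
        for r in range(rows)
    )


def parse(text):
    """Read a '#'/'.' grid (e.g. a model reply) back into a set of cells."""
    return {
        (r, c)
        for r, line in enumerate(text.strip().splitlines())
        for c, ch in enumerate(line.strip())
        if ch == "#"
    }


def matches_expected(model_output, live, rows, cols):
    """Compare a model's grid against the true next generation (assumed format)."""
    expected = {
        (r, c)
        for r, c in life_step(live)
        if 0 <= r < rows and 0 <= c < cols  # clip to the window shown to the model
    }
    return parse(model_output) == expected


# A horizontal "blinker" should flip to a vertical one after one step.
blinker = {(1, 0), (1, 1), (1, 2)}
print(render(blinker, 3, 3))
print()
print(render(life_step(blinker), 3, 3))
```

A model that "plays" correctly would return the vertical bar for this input. A reply that repeats the input, or is not a grid at all, fails `matches_expected`, which gives a cheap pass/fail signal when comparing M1 with M2 and M3.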
I am sharing some code to start a community effort on M2 and M3. I think I was heading in a good direction, but hosting these models on rented GPUs costs too much.
The most exciting part for me is using other LLMs to find patterns.
Disclaimer: I am not an expert in these things, so take any claims you find with a grain of salt.