8 comments

  • hazrmard 6 minutes ago
    Please check my understanding:

    An auto-encoder is trained on [activation] -AV-> [text] -AR-> [activation], where [activation] belongs to one layer in the LLM model M.

    Architecture:

        Model being analyzed (M): >|||||>  
        Verbalizer (AV) same as M: >|||||>  
        Reconstructor (AR) truncated up to the layer being analyzed: ||>
    
    The AV and AR models are initialized via supervised learning on a summarization task, the assumption being that a model's internal thoughts resemble a summary of its context.

    The AR is trained with a simple reconstruction loss.

    The AV is trained with an RL objective: reconstruction loss plus a KL penalty keeping the verbalizations close to the initial weights (to maintain linguistic fluency).

    - The authors acknowledge, and expect, confabulations in the verbalizations: factually incorrect or unsubstantiated statements. But the internal thought we seek is itself, by definition, unsubstantiated. How can we tell the verbalization is not duplicitous?

    - They test this on a layer roughly 2/3 of the way into the models. I wonder how shallow versus deep abstractions affect thought verbalization.
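
    The setup described above can be sketched in a toy form. Everything here is hypothetical for illustration: the shapes, the single-linear-layer "models", the bag-of-tokens reconstructor, and the KL coefficient are stand-ins — the real AV is a full copy of M and the real AR is a truncated transformer:

```python
import numpy as np

rng = np.random.default_rng(0)

D, V, T = 8, 5, 4   # activation dim, toy vocab size, verbalization length (all toy)

# Toy "verbalizer" AV: activation -> token distribution (one linear layer here).
# Toy "reconstructor" AR: tokens -> activation.
W_av = rng.normal(size=(D, V)) * 0.1
W_ar = rng.normal(size=(V, D)) * 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

activation = rng.normal(size=D)             # the hidden state to be verbalized

# AV samples a "verbalization" (a sequence of token ids)
probs = softmax(activation @ W_av)
tokens = rng.choice(V, size=T, p=probs)

# AR maps the tokens back to an activation (bag-of-tokens for simplicity)
bag = np.bincount(tokens, minlength=V).astype(float)
recon = bag @ W_ar

# AR objective: plain reconstruction loss
recon_loss = float(np.mean((recon - activation) ** 2))

# AV objective: RL reward = negative reconstruction loss, minus a KL penalty
# pulling the token distribution toward a reference (initial) policy --
# uniform here, purely as an assumption for the sketch.
ref = np.full(V, 1.0 / V)
kl = float(np.sum(probs * np.log(probs / ref)))
beta = 0.1                                  # KL coefficient (hypothetical value)
av_reward = -recon_loss - beta * kl
```

    Since token sampling is discrete and non-differentiable, the AV can only be trained through an RL-style reward like `av_reward`, which is presumably why the paper uses an RL objective for AV but a plain supervised loss for AR.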

  • zozbot234 57 minutes ago
    Anthropic has released open weight models for translating the activations of existing models, viz. Qwen 2.5 (7B), Gemma 3 (12B, 27B) and Llama 3.3 (70B) into natural language text. https://github.com/kitft/natural_language_autoencoders https://huggingface.co/collections/kitft/nla-models This is huge news and it's great to see Anthropic finally engage with the Hugging Face and open weights community!
  • NitpickLawyer 54 minutes ago
    > We also release an interactive frontend for exploring NLAs on several open models through a collaboration with Neuronpedia.

    Whatever they did on Llama didn't work; nothing makes sense in their example where they ask the model to lie about 1+1. Either the model is too old or whatever they used isn't working, but the autoencoder's output is nothing like their examples with Claude. Gemma is similarly bad.

    • fredericoluz 28 minutes ago
      Same. I'm trying to trigger the 'mom is in the next room' Russian example, but the model thinks the sentence is from American Reddit.
    • fredericoluz 24 minutes ago
      It seems that the examples they showed off with Haiku work. I'd guess Llama is just too weak.
  • Tossrock 57 minutes ago
    Anthropic Research going from strength to strength in interpretability. Publicly releasing the code so other labs can benefit from it is also a great move - very values aligned, and improves the overall AI safety ecosystem.
  • visarga 1 hour ago
    Beautiful idea: an autoencoder must represent everything without hiding anything if it is to recover the original data closely. So they train a model to verbalize embeddings faithfully. This reveals what we want to know about the model (such as when it thinks it is being tested, or other hidden thoughts).
    • sobellian 4 minutes ago
      It could just invent its own secret language embedded in English, akin to steganography. The explanation would not lose information but would remain uninterpretable to humans.
  • tjohnell 1 hour ago
    It will inevitably learn to think in a way that translates to one (moral) meaning and back, but carries an ulterior meaning underneath.
    • rotcev 54 minutes ago
      This is exactly what I first thought: “The user appears to be attempting to decode my previous thought process, …”. The question is whether the model can internalize this in a way that is undetectable to the aforementioned technique.
    • astrange 27 minutes ago
      That shouldn't happen as long as the autoencoder isn't used as an RL reward. It will happen (due to Goodhart's law) if it is.

      Of course, if you use it to make any decision that can still happen eventually.

  • firemelt 1 hour ago
    Finally something interesting, but this only makes me think that the final judgment still rests with humans: it's up to us to judge whether Claude's verbalized inner thoughts are correct or not.

    I mean, who knows whether those are really Claude's thoughts, or Claude just thinks they are its thoughts because humans want them to be.