Can you save on LLM tokens using images instead of text?

(pagewatch.ai)

33 points | by lpellis 6 days ago

4 comments

bikeshaving 9 hours ago
Does this mean we’ll finally get empirical proof for the aphorism “a picture is worth a thousand words”?
https://en.wikipedia.org/wiki/A_picture_is_worth_a_thousand_...
- heltale 8 hours ago
  I suppose it’s only worth 256 words at a time right now. ;)
  https://arxiv.org/abs/2010.11929
  - estebarb 8 hours ago
    The CALM paper https://shaochenze.github.io/blog/2025/CALM/ says it is possible to compress 4 tokens in a single embedding, so... image = 4×256=1024 words > 1000 words. QED
    - bikeshaving 6 hours ago
      2.4% relative error is not bad.
      - pastor_williams 1 hour ago
        Reminds me of Babbage making allowance for meter.
        """
        ... it is said that he [Babbage] sent the following letter to Alfred, Lord Tennyson about a couplet in "The Vision of Sin":
          Every minute dies a man,
          Every minute one is born
        I need hardly point out to you that this calculation would tend to keep the sum total of the world's population in a state of perpetual equipoise, whereas it is a well-known fact that the said sum total is constantly on the increase. I would therefore take the liberty of suggesting that in the next edition of your excellent poem the erroneous calculation to which I refer should be corrected as follows:
          Every minute dies a man,
          And one and a sixteenth is born
        I may add that the exact figures are 1.167, but something must, of course, be conceded to the laws of metre.
        """
        Charles Babbage and his Calculating Engines
    - behnamoh 5 hours ago
      how do you decompress all those 4 words from one token?
      - HarHarVeryFunny 32 minutes ago
        The mechanism would be prediction (learnt during training), not decompression.
        It's the same as LLMs being able to "decode" Base64, or work with sub-word tokens for that matter: the model just learns to predict that
        <compressed representation> will be followed by (or preceded by) <decompressed representation>, or vice versa.
      - estebarb 44 minutes ago
        Not from one token, from one embedding. Text has low information density: it is possible to compress a few token embeddings into a single token embedding.
        The how varies. The CALM paper seems to have used an MLP to compress an N×D input (N embeddings of size D) into a single embedding of size D, and another MLP to decompress it back.
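
A rough sketch of the compress/decompress idea described in the comment above, in plain Python. This is not the CALM implementation: the sizes `N`, `D`, `H`, the two-layer ReLU MLPs, and the helper names (`init_layer`, `compress`, `decompress`) are all hypothetical, and the weights are random and untrained, so it only shows how the shapes flow (N embeddings of size D in, one embedding of size D out, and back).

```python
import random

random.seed(0)

# Hypothetical sizes: N token embeddings of size D, hidden width H.
N, D, H = 4, 8, 16

def init_layer(in_dim, out_dim):
    """Return (weights, biases) for a dense layer with small random weights."""
    weights = [[random.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    biases = [0.0] * out_dim
    return weights, biases

def linear(layer, x):
    """Apply one dense layer to the vector x."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def mlp(layers, x):
    """Two-layer MLP: dense -> ReLU -> dense."""
    hidden = [max(0.0, v) for v in linear(layers[0], x)]
    return linear(layers[1], hidden)

# Untrained encoder (N*D -> D) and decoder (D -> N*D).
# In practice the pair would be trained so that decompress(compress(x)) ≈ x.
encoder = [init_layer(N * D, H), init_layer(H, D)]
decoder = [init_layer(D, H), init_layer(H, N * D)]

def compress(token_embeddings):
    """Flatten N embeddings of size D and map them to a single size-D embedding."""
    flat = [v for emb in token_embeddings for v in emb]
    return mlp(encoder, flat)

def decompress(latent):
    """Map one size-D embedding back to N embeddings of size D."""
    flat = mlp(decoder, latent)
    return [flat[i * D:(i + 1) * D] for i in range(N)]

tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
latent = compress(tokens)
restored = decompress(latent)
print(len(latent), len(restored), len(restored[0]))  # 8 4 8
```

With random weights `restored` will not resemble `tokens`; the reconstruction only becomes meaningful once the encoder and decoder are trained together as an autoencoder.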
floodfx 9 hours ago
Why are there more completion tokens with image prompts when the text output was about the same?
- Garlef 6 hours ago
  "Thinking" Mode
  - nunodonato 54 minutes ago
    it doesn't say that anywhere.
ashed96 5 hours ago
In my experience, LLMs tend to take noticeably longer to process images than text.