2 comments

  • dagmx 56 minutes ago
    This is a pretty phenomenal article.

    Even for those who don’t care about LLM use, this is just a great article on optimizing Swift performance, a topic that sadly doesn’t have much written material.

    I’m curious whether the AMX instructions are truly secret. In theory, I think you could use an M4 or later and get them via SME, but I’m just guessing, as I’ve never tried intrinsics from Swift myself.

    • mathisfun123 8 minutes ago
      > get them via SME

      I have no idea what this means: AMX was replaced by SME on M4. It's a new unit, not just an "abstract intrinsic" (which would make zero sense).

  • nromiun 6 minutes ago
    > Is 1.1 Tflop/s good? Theoretically, the GPU on my M3 Max is capable of around 15 Tflop/s. But the real ceiling for this kind of task is going to be 3-5 Tflop/s

    This is so true, and it is why people should not take basic GPU benchmarks so seriously. Getting peak performance out of a GPU is much more complex than it is with a CPU.

    And it is one of the reasons why Nvidia still has a software moat compared to other GPU companies. CUDA has so many small kernels tuned for getting peak performance for your dataset.