99% of the time is spent on matrix matrix or matrix vector calculation. Activati...

99% of the time is spent on matrix matrix or matrix vector calculation. Activation functions, softmax, RoPE, etc basically cost nothing in comparison.

Most NPUs are programmable, because the bottleneck is data SRAM and memory bandwidth instead of instruction SRAM.

For classic matrix matrix multiplication, the SRAM bottleneck is the number of matrix outputs you can store in SRAM. N rows and M columns get you N X M accumulator outputs. The calculation of the dot product can be split into separate steps without losing the N X M scaling, so the SRAM consumed by the row and column vectors is insignificant in the limit.

For the MLP layers in the unbatched case, the bottleneck lies in the memory bandwidth needed to load the model parameters. The problem is therefore how fast your DDR, GDDR, HBM memory and your NoC/system bus lets you transfer data to the NPU.

Having a programmable processor that controls the matrix multiplication function unit costs you silicon area for the instruction SRAM. For matrix vector multiplication, the memory bottleneck is so big, it doesn't matter what architecture you are using, even CPUs are fast enough. There is no demand for getting rid of the not very costly instruction SRAM.

"but what about the area taken up by the processor itself?"

HAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHA. Nice joke

Wait..., you were serious? The area taken up by an in order VLIW/TTA processor is so insignificant I jammed it in-between the routing gap of two SRAM blocks. Sure, the matrix multiplication unit might take up some space, bit decoding instructions is such an insignificant cost that anyone opposing programmability must have completely different goals and priorities than LLMs or machine learning.