Learning Triton One Kernel at a Time: Softmax

Within the previous article of this collection, operation in all fields of laptop science: matrix multiplication. It’s closely utilized in neural networks to compute the activation of linear layers. Nevertheless, activations on their very own are tough to interpret, since their values and statistics (imply, variance, min-max amplitude) can range wildly from layer to layer. This is without doubt one of the the explanation why we use activation features, for instance the logistic perform (aka sigmoid) which initiatives any actual quantity within the [0; 1] vary.

The softmax perform, also called the normalised exponential perform, is a multi-dimensional generalisation of the sigmoid. It converts a vector of uncooked scores (logits) right into a likelihood distribution over M courses. We will interpret it as a weighted common that behaves as a clean perform and may be conveniently differentiated. It’s a essential element of dot-product consideration, language modeling, and multinomial logistic regression.

On this article, we’ll cowl: