This is a chapter of the in-progress ebook on linear algebra, "A bird's eye view of linear algebra". The table of contents so far:
Stay tuned for future chapters.
Here, we will describe the operations we can do with two matrices, keeping in mind that they are just representations of linear maps.
I) Why care about matrix multiplication?
Almost any information can be embedded in a vector space. Images, video, language, speech, biometric information and whatever else you can imagine. And all the applications of machine learning and artificial intelligence (like the recent chat-bots, text to image, and so on) work on top of these vector embeddings. Since linear algebra is the science of dealing with high dimensional vector spaces, it is an indispensable building block.
A lot of these techniques involve taking input vectors from one space and mapping them to vectors in some other space.
But why the focus on "linear" when most interesting functions are non-linear? It's because the problem of making our models high dimensional and that of making them non-linear (general enough to capture all kinds of complex relationships) turn out to be orthogonal to each other. Many neural network architectures work by using linear layers with simple one dimensional non-linearities in between them. And there is a theorem (the universal approximation theorem) that says this kind of architecture can model any function.
Since the way we manipulate high-dimensional vectors is primarily matrix multiplication, it isn't a stretch to say it is the bedrock of the modern AI revolution.

II) Algebra on maps
In chapter 2, we learned how to quantify linear maps with determinants. Now, let's do some algebra with them. We'll need two linear maps and a basis.

II-A) Addition
If we can add matrices, we can add linear maps, since matrices are the representations of linear maps. And matrix addition is not very interesting if you already know scalar addition. Just as with vectors, it is only defined if the two matrices are the same size (same rows and columns) and involves lining them up and adding element by element.

So, we're just doing a bunch of scalar additions, which means that the properties of scalar addition logically extend.
Commutative: if you switch, the result won't twitch
A+B = B+A
But commuting to work might not be commutative, since going from A to B might take longer than going from B to A.
Associative: in a chain, don't refrain, take any 2 and proceed
A+(B+C) = (A+B)+C
Identity: and here I am where I began! That's no way to treat a man!
The presence of a special element that, when added to anything, results in the same thing. In the case of scalars, it's the number 0. In the case of matrices, it's a matrix full of zeros.
A + 0 = A or 0 + A = A
Also, it is possible to start at any element and end up at any other via addition. So it must be possible to start at A and end up at the additive identity, 0. The thing that must be added to A to achieve this is the additive inverse of A, and it is called -A.
A + (-A) = 0
For matrices, you just go to each scalar element in the matrix and replace it with its additive inverse (switching the signs if the scalars are numbers) to get the additive inverse of the matrix.
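As a quick illustration, here is a minimal numpy sketch (the matrix values are made up for the example) of element-wise addition, the zero matrix as the additive identity and -A as the additive inverse:

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[10., 20.], [30., 40.]])

print(A + B)                                     # element-by-element sums
print(np.allclose(A + B, B + A))                 # commutative: True
print(np.allclose(A + np.zeros_like(A), A))      # the zero matrix is the identity: True
print(np.allclose(A + (-A), np.zeros_like(A)))   # -A is the additive inverse: True
```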
II-B) Subtraction
Subtraction is just addition with the additive inverse of the second matrix instead.
A-B = A+(-B)
II-C) Multiplication
We could have defined matrix multiplication just as we defined matrix addition. Simply take two matrices that are the same size (rows and columns) and then multiply the scalars element by element. There is a name for that kind of operation, the Hadamard product.
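For contrast, here is a small numpy sketch (the values are made up for the example) showing the Hadamard product next to the usual matrix product that we will define in section III:

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

print(A * B)   # Hadamard (element-wise) product: [[5, 12], [21, 32]]
print(A @ B)   # the usual matrix product:        [[19, 22], [43, 50]]
```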
But no, we defined matrix multiplication as a far more convoluted operation, more "exotic" than addition. And it isn't complex just for the sake of it. It is the most important operation in linear algebra by far.
It enjoys this special status because it is the means by which linear maps are applied to vectors, building on top of dot products.
The way it actually works requires a dedicated section, so we'll cover that in section III. Here, let's list some of its properties.
Commutative
Unlike addition, matrix multiplication is not always commutative, which means that the order in which you apply linear maps to your input vector matters.
A.B != B.A
Associative
It is still associative:
A.B.C = A.(B.C) = (A.B).C
And there is a lot of depth to this property, as we will see in section IV.
Identity
Just like addition, matrix multiplication also has an identity element, I: an element such that multiplying any matrix by it results in the same matrix. The big caveat is that this element only exists for square matrices and is itself square.
Now, because of the importance of matrix multiplication, "the identity matrix" is typically defined as the identity element of matrix multiplication (not that of addition or the Hadamard product, for example).
The identity element for addition is a matrix composed of 0's and that of the Hadamard product is a matrix composed of 1's. The identity element of matrix multiplication is:

I = [[1, 0, …, 0], [0, 1, …, 0], …, [0, 0, …, 1]]

So, 1's on the main diagonal and 0's everywhere else. What kind of definition for matrix multiplication would lead to an identity element like this? We'll need to describe how it works to see, but first let's get to the final operation.
II-D) Division
Just as with addition, the presence of an identity matrix suggests that a matrix, A, can be multiplied with another matrix, A^-1, and taken to the identity (when such an inverse exists). This is called the inverse. Since matrix multiplication isn't commutative, there are two ways to do this. Luckily, both lead to the identity matrix.
A.(A^-1) = (A^-1).A = I
So, "dividing" a matrix by another is simply multiplication with the inverse of the second one, A.B^-1. If matrix multiplication is important, then this operation is as well, since it is its inverse. It is also related to how we historically developed (or perhaps stumbled upon) linear algebra. But more on that in the next chapter (4).
Another property we'll be using, which is a combined property of addition and multiplication, is the distributive property. It applies to all kinds of matrix multiplication, from the traditional one to the Hadamard product:
A.(B+C) = A.B + A.C
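Here is a small numpy sketch of the properties above, using randomly generated 3x3 matrices (a generic random matrix is invertible, so the inverse exists in this example):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))
I = np.eye(3)

print(np.allclose(A @ B, B @ A))                             # commutativity generally fails: False
print(np.allclose((A @ B) @ C, A @ (B @ C)))                 # associativity holds: True
print(np.allclose(A @ I, A), np.allclose(I @ A, A))          # identity element: True True
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, I), np.allclose(A_inv @ A, I))  # both orders give I: True True
print(np.allclose(A @ (B + C), A @ B + A @ C))               # distributivity: True
```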
III) Why is matrix multiplication defined this way?
We have finally arrived at the section where we will answer the question in the title, the meat of this chapter.
Matrix multiplication is the way linear maps act on vectors. So, we get to motivate it that way.
III-A) How are linear maps applied in practice?
Consider a linear map that takes m dimensional vectors (from R^m) as input and maps them to n dimensional vectors (in R^n). Let's call the m dimensional input vector v.
At this point, it might be helpful to picture yourself actually coding up this linear map in some programming language. It should be a function that takes the m-dimensional vector, v, as input and returns the n-dimensional vector, u.
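A naive first attempt might look like the sketch below (the original snippet isn't reproduced here, so this is a stand-in): it just returns a random n-dimensional vector.

```python
import numpy as np

def not_a_linear_map(v, n=5):
    # Returns some random n-dimensional vector, completely ignoring the input v.
    return np.random.uniform(size=n)
```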
The linear map has to take this vector and turn it into an n dimensional vector somehow. In the function above, you'll notice we just generated some vector at random. But this completely ignored the input vector, v. That's unreasonable, v should have some say. Now, v is just an ordered list of m scalars, v = [v1, v2, v3, …, vm]. What do scalars do? They scale vectors. And the output vector we need should be n dimensional. How about we take some (fixed) m vectors (pulled out of thin air, each n dimensional), w1, w2, …, wm. Then, scale w1 by v1, w2 by v2 and so on and add them all up. This leads to an equation for our linear map (with the output on the left):

u = f(v) = v1.w1 + v2.w2 + … + vm.wm        (1)
Make a note of equation (1) above since we'll be using it again.
Since the w1, w2, … are all n dimensional, so is u. And all the elements of v = [v1, v2, …, vm] have an influence on the output, u. The idea in equation (1) is implemented below. We take some randomly generated vectors for the w's, but with fixed seeds (ensuring that the vectors are the same across every call of the function).
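Below is a minimal sketch of that idea (assuming numpy; the dimensions and seeds are arbitrary choices for illustration):

```python
import numpy as np

def linear_map(v, n=5):
    """Implements equation (1): u = v1.w1 + v2.w2 + … + vm.wm."""
    m = len(v)
    u = np.zeros(n)
    for i in range(m):
        rng = np.random.default_rng(seed=i)   # fixed seed, so w_i is the same on every call
        w_i = rng.standard_normal(n)          # the i-th (fixed) n-dimensional vector
        u += v[i] * w_i                       # scale w_i by v_i and accumulate
    return u
```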
We now have a way to "map" m dimensional vectors (v) to n dimensional vectors (u). But does this "map" satisfy the properties of a linear map? Recall from chapter-1, section II the properties of a linear map, f (here, a and b are vectors and c is a scalar):
f(a+b) = f(a) + f(b)
f(c.a) = c.f(a)
It is clear that the map specified by equation (1) satisfies the above two properties of a linear map: substituting a+b into equation (1) gives (a1+b1).w1 + … + (am+bm).wm, which splits into f(a) + f(b), and the scalar c simply factors out.
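Reusing the linear_map sketch from above, we can also spot-check the two properties numerically:

```python
import numpy as np

a = np.array([1.0, -2.0, 0.5])
b = np.array([0.3, 4.0, -1.0])
c = 2.5

print(np.allclose(linear_map(a + b), linear_map(a) + linear_map(b)))  # True
print(np.allclose(linear_map(c * a), c * linear_map(a)))              # True
```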


The m vectors, w1, w2, …, wm are arbitrary, and no matter what we choose for them, the function, f, defined in equation (1) is a linear map. So, different choices for these w vectors lead to different linear maps. Moreover, for any linear map you can imagine, there will be some vectors w1, w2, … that can be used in conjunction with equation (1) to represent it.
Now, for a given linear map, we can collect the vectors w1, w2, … into the columns of a matrix. Such a matrix will have n rows and m columns. This matrix represents the linear map, f, and its multiplication with an input vector, v, represents the application of the linear map, f, to v. And this application is where the definition of matrix multiplication comes from.
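In numpy terms (a sketch with arbitrary sizes), collecting the w's into the columns of a matrix W means that W @ v scales each column by the corresponding component of v and adds them up, exactly as in equation (1):

```python
import numpy as np

rng = np.random.default_rng(42)
n, m = 5, 3
W = rng.standard_normal((n, m))          # the columns of W play the role of w1, …, wm
v = np.array([2.0, -1.0, 0.5])

by_columns = sum(v[i] * W[:, i] for i in range(m))   # equation (1), written out
print(np.allclose(W @ v, by_columns))                # True
```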

We can now see why the identity element for matrix multiplication is the way it is:

I.v = v1.[1, 0, …, 0] + v2.[0, 1, …, 0] + … + vn.[0, 0, …, 1] = v

We start with a column vector, v, and end with a column vector, u (so just one column for each of them). And since the components of v must align with the column vectors of the matrix representing the linear map, the number of columns of the matrix must equal the number of components in v. More on this in section III-C.
III-B) Matrix multiplication as a composition of linear maps
Now that we have described how a matrix is multiplied by a vector, we can move on to multiplying a matrix with another matrix.
The definition of matrix multiplication is much more natural when we consider the matrices as representations of linear maps.
Linear maps are functions that take a vector as input and produce a vector as output. Let's say the linear maps corresponding to two matrices are f and g. How would you think of adding these maps (f+g)?
(f+g)(v) = f(v)+g(v)
This is reminiscent of the distributive property of addition, where the argument goes inside the bracket to both functions and we add the results. And if we fix a basis, this corresponds to applying both linear maps to the input vector and adding the results. By the distributive property of matrix and vector multiplication, this is the same as adding the matrices corresponding to the linear maps and applying the result to the vector.
Now, let's think about multiplication (f.g).
(f.g)(v) = f(g(v))
Since linear maps are functions, the most natural interpretation of multiplication is to compose them (apply them one at a time, in sequence, to the input vector).
When two matrices are multiplied, the resulting matrix represents the composition of the corresponding linear maps. Consider matrices A and B; the product AB embodies the transformation achieved by applying the linear map represented by B to the input vector first and then applying the linear map represented by A.
So we have a linear map corresponding to the matrix, A, and a linear map corresponding to the matrix, B. We would like to know the matrix, C, corresponding to the composition of the two linear maps. So, applying B to any vector first and then applying A to the result should be equivalent to just applying C.
A.(B.v) = C.v = (A.B).v
In the last section, we learned how to multiply a matrix and a vector. Let's do that twice for A.(B.v). Say the columns of B are the column vectors, b1, b2, …, bm. From equation (1) in the previous section,

A.(B.v) = A.(v1.b1 + v2.b2 + … + vm.bm) = v1.(A.b1) + v2.(A.b2) + … + vm.(A.bm)

And what if we applied the linear map corresponding to C = A.B directly to the vector, v? If the column vectors of the matrix C are c1, c2, …, cm, then again by equation (1),

C.v = v1.c1 + v2.c2 + … + vm.cm
Comparing the two equations above, we get:

c1 = A.b1, c2 = A.b2, …, cm = A.bm        (3)
So, the columns of the product matrix, C = A.B, are obtained by applying the linear map corresponding to matrix A to each of the columns of the matrix B. And gathering these resulting vectors into a matrix gives us C.
We have just extended our matrix-vector multiplication result from the previous section to the multiplication of two matrices. We simply break the second matrix into a collection of vectors, multiply the first matrix by each of them, and collect the resulting vectors into the columns of the result matrix.
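This is easy to check numerically (a sketch with random rectangular matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 2))

# Build C column by column: each column of C is A applied to a column of B.
C_by_columns = np.column_stack([A @ B[:, j] for j in range(B.shape[1])])
print(np.allclose(A @ B, C_by_columns))  # True
```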

So the entry in the first row and first column of the result matrix, C, is the dot product of the first row of A and the first column of B. And in general, the entry in the i-th row and j-th column of C is the dot product of the i-th row of A and the j-th column of B. This is the definition of matrix multiplication most of us first learn.

Proof of associativity
We can now also show that matrix multiplication is associative. Instead of the single vector, v, let's apply the product C = A.B separately to a bunch of vectors, w1, w2, …, wl. Let's say the matrix that has these as column vectors is W. We can use the very same trick as above to show:
(A.B).W = A.(B.W)
This is because (A.B).w1 = A.(B.w1), and the same holds for all the other w vectors.
Sum of outer products
Say we're multiplying two matrices A and B, where A is n x k and B is k x m (so the result, C = A.B, is n x m):
Equation (3) can be generalized to show that the i,j element of the resulting matrix, C, is:

c_{i,j} = a_{i,1}.b_{1,j} + a_{i,2}.b_{2,j} + … + a_{i,k}.b_{k,j}
We have a sum over k terms. What if we took each of those terms and created k individual matrices out of them? For example, the first matrix will have as its i,j-th entry: a_{i,1}.b_{1,j}. The k matrices and their relationship to C:

C = C_1 + C_2 + … + C_k, where the l-th matrix C_l has a_{i,l}.b_{l,j} as its i,j-th entry        (4)
This process of summing over k matrices can be visualized as follows (reminiscent of the animation in section III-A that visualized a matrix being multiplied by a vector):

We see here the sum over k matrices, all of the same size (n x m), which is the same size as the result matrix, C. Notice in equation (4) how, for the first matrix, A, the column index stays the same, while for the second matrix, B, the row index stays the same. So the k matrices we're getting are the matrix products of the l-th column of A with the l-th row of B, for l = 1, …, k.
Matrix multiplication as a sum of outer products. Image by author.
Inside the summation, two vectors are multiplied to produce matrices. This is a special case of matrix multiplication applied to vectors (which are special cases of matrices) and is called the "outer product". Here is yet another animation to show this sum of outer products process:

This tells us why the number of row vectors in B should be the same as the number of column vectors in A: because they must be paired together to get the individual matrices.
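As a quick numerical check (a numpy sketch with random matrices), the sum of outer products does reproduce A.B:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 4, 3, 5
A = rng.standard_normal((n, k))
B = rng.standard_normal((k, m))

# l-th column of A times l-th row of B, summed over l.
C_outer = sum(np.outer(A[:, l], B[l, :]) for l in range(k))
print(np.allclose(A @ B, C_outer))  # True
```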
We have seen several visualizations and some math; now let's see the same thing via code for the special case where A and B are square matrices. This is based on section 4.2 of the book "Introduction to Algorithms", [2].
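The original snippet isn't reproduced here; below is a minimal Python sketch of the straightforward triple-loop procedure in that spirit (my code, not the original post's):

```python
import numpy as np

def square_matrix_multiply(A, B):
    """Multiplies two n x n matrices the textbook way: one dot product per entry."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for l in range(n):
                # dot product of row i of A and column j of B, one term at a time
                C[i, j] += A[i, l] * B[l, j]
    return C

A = np.arange(9, dtype=float).reshape(3, 3)
B = np.eye(3)
print(np.allclose(square_matrix_multiply(A, B), A @ B))  # True
```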
III-C) Matrix multiplication: the structural choices

Matrix multiplication seems to be structured in a weird way. It's clear that we need to take a bunch of dot products. So, one of the dimensions has to match. But why make the number of columns of the first matrix equal to the number of rows of the second?
Wouldn't it make things simpler if we redefined it so that the number of rows of the two matrices had to be the same (or the number of columns)? This would make it much easier to identify when two matrices can be multiplied.
The traditional definition, where we require the rows of the first matrix to line up with the columns of the second (each row of A has as many entries as each column of B), has more than one advantage. Let's start with matrix-vector multiplication. Animation (1) in section III-A showed us how the traditional version works. Let's visualize what would happen if we instead required the rows of the matrix to align with the components of the vector. Now, the n rows of the matrix will need to align with the n components of the vector.

We see that we would have to start with a column vector, v, with n rows and one column and end up with a row vector, u, with 1 row and m columns. This is awkward and makes defining an identity element for matrix multiplication tricky, since the input and output vectors can never have the same shape. With the traditional definition, this isn't an issue, since the input is a column vector and the output is also a column vector (see animation (1)).
Another consideration is multiplying a chain of matrices. In the traditional method, it is easy to see at a glance that the chain of matrices below can be multiplied together based on their dimensionalities.

Further, we can tell that the output matrix will have l rows and p columns.
In the framework where the rows of the two matrices have to line up, this quickly becomes a mess. For the first two matrices, we can tell that the rows should align and that the result will have n rows and l columns. But visualizing how many rows and columns the result will have, and then reasoning about whether it will be compatible with C, and so on, becomes a nightmare.

And that is why we require the rows of the first matrix to align with the columns of the second matrix. But maybe I missed something. Maybe there is an alternative definition that is "cleaner" and manages to side-step these two challenges. Would love to hear ideas in the comments 🙂
III-D) Matrix multiplication as a change of basis
So far, we have thought of matrix multiplication with vectors as a linear map that takes a vector as input and returns some other vector as output. But there is another way to think about matrix multiplication: as a way to change perspective.
Let's consider two-dimensional space, R². We represent any vector in this space with two numbers. What do those numbers represent? The coordinates along the x-axis and y-axis. A unit vector that points just along the x-axis is [1,0] and one that points along the y-axis is [0,1]. These are our basis for the space. Every vector now has an address. For example, the vector [2,3] means we scale the first basis vector by 2 and the second by 3.
But this isn't the only basis for the space. Someone else (say, he who shall not be named) might want to use two other vectors as their basis. For example, the vectors e1=[3,2] and e2=[1,1]. Any vector in the space R² can also be expressed in their basis. The same vector would have different representations in our basis and in theirs. Like different addresses for the same house (perhaps based on different postal systems).
When we're in the basis of he who shall not be named, the vector e1 = [1,0] and the vector e2 = [0,1] (which are the basis vectors from his perspective, by definition of basis vectors). And the functions that translate vectors from our basis system to that of he who shall not be named and vice-versa are linear maps. And so the translations can be represented as matrix multiplications. Let's call the matrix that takes vectors from our basis to his M1, and the matrix that does the opposite M2. How do we find these matrices?

We know that the vectors we call e1=[3,2] and e2=[1,1], he who shall not be named calls e1=[1,0] and e2=[0,1]. Let's collect our versions of the vectors into the columns of a matrix:

[[3, 1], [2, 1]]
And also collect the vectors, e1 and e2, of he who shall not be named into the columns of another matrix. This is just the identity matrix:

[[1, 0], [0, 1]]
Since matrix multiplication operates independently on the columns of the second matrix,

M1 . [[3, 1], [2, 1]] = [[1, 0], [0, 1]]
Multiplying both sides (on the right) by the inverse of our matrix of columns gives us M1:

M1 = [[3, 1], [2, 1]]^-1 = [[1, -1], [-2, 3]]
Doing the same thing in reverse gives us M2, which is just the inverse of M1:

M2 = M1^-1 = [[3, 1], [2, 1]]
This can all be generalized into the following statement: a matrix with column vectors w1, w2, …, wn translates vectors expressed in a basis where w1, w2, …, wn are the basis vectors into our basis.
And the inverse of that matrix translates vectors from our basis to the one where w1, w2, …, wn are the basis.
All (invertible) square matrices can therefore be thought of as "basis changers".
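Here is the example above checked numerically (a minimal numpy sketch):

```python
import numpy as np

M2 = np.array([[3., 1.],
               [2., 1.]])       # columns are e1=[3,2] and e2=[1,1], written in our basis
M1 = np.linalg.inv(M2)          # translates our coordinates into his

print(M1 @ np.array([3., 2.]))  # e1 in his basis: [1. 0.]
print(M1 @ np.array([1., 1.]))  # e2 in his basis: [0. 1.]
print(M2 @ np.array([1., 0.]))  # and back the other way: [3. 2.]
```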
Note: In the special case of an orthonormal matrix (where every column is a unit vector and is orthogonal to every other column), the inverse becomes the same as the transpose. So, changing to the basis of the columns of such a matrix becomes equivalent to taking the dot product of a vector with each of the rows.
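A quick check of that note with a 2D rotation matrix (whose columns are orthonormal):

```python
import numpy as np

theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(np.linalg.inv(Q), Q.T))  # inverse equals transpose: True
```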
For more on this, see the 3B1B video, [1].
Conclusion
Matrix multiplication is arguably one of the most important operations in modern computing, and it appears in almost every data science field. Understanding deeply how it works is essential for any data scientist. Most linear algebra textbooks describe the "what" but not why it is structured the way it is. Hopefully this blog filled that gap.
[1] 3B1B video on change of basis: https://www.youtube.com/watch?v=P2LTAUO1TdA&t=2s
[2] Introduction to Algorithms by Cormen et al., third edition
[3] Matrix multiplication as a sum of outer products: https://math.stackexchange.com/questions/2335457/matrix-at-a-as-sum-of-outer-products
[4] Catalan numbers Wikipedia article: https://en.wikipedia.org/wiki/Catalan_number