## 2.2 Linear algebra

### 2.2.1 Vector geometry

#### 2.2.1.1 Vector spaces

We have already studied vectors in the context of programming. Now we are going to discuss the mathematics behind vectors so that we can better understand learning procedures. A vector is simply an ordered collection of numbers. We are going to use column vectors throughout this book. Some people choose to use row vectors, but a collection of data points is usually thought of as a column in most programming languages, R included. We are going to call a column vector an \(n\times 1\) vector to be consistent with our matrix notation later: \[ x=\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \] The first element in the pair \(n\times r\), \(n\), represents the number of rows and the second element, \(r\), represents the number of columns. Because we have a vector, it has only \(1\) column.

Two critically important operations for vectors are the sum and the scalar product. If we have two vectors of the same length, then we can calculate the sum of the two vectors using: \[ x+y = \begin{bmatrix} x_1+y_1 \\ \vdots \\ x_n+y_n \end{bmatrix} \] or simply by adding the elements one at a time. We can represent the sum of two vectors geometrically in the following figure:

Summing two vectors can be thought of as stepping along the first vector and then along the second. This is equivalent to stepping along the second vector and then along the first, which makes vector addition commutative (like all addition operations). In other words, \(x+y=y+x\).

A basic number that is not a vector is called a scalar. So a scalar is a \(1\times 1\) object or just a single element. The scalar product is the multiplication of a vector by a scalar where we simply multiply each element of the vector by the scalar value or: \[ a x = \begin{bmatrix} a x_1 \\ \vdots \\ a x_n \end{bmatrix} \] Scalar multiplication can be represented graphically by:

A critical idea around vectors is the idea of linear combinations of vectors. A linear combination of two \(n\times 1\) vectors is \(ax+by\) where \(a\) and \(b\) are scalars. We can also have a linear combination of \(r\) different \(n\times 1\) vectors \(x_1,\dots,x_r\), which would be \(a_1 x_1 + \dots + a_r x_r\). Geometrically, this kind of linear combination can be thought of as a point that you can reach by combining steps using the \(r\) different vectors. So let's say we have two vectors, \(x\) and \(y\), and we want to get to some other vector \(z\):

Here we can scale \(x\) and \(y\) using some values to create the vector \(z\):

Because we can create \(z\) by using \(x\) and \(y\), we say that \(z\) is linearly dependent on \(x\) and \(y\). The set of all points linearly dependent on \(x\) and \(y\) is called the span of \(x\) and \(y\). In the above example, we would say that all the points on the page of paper can be spanned by \(x\) and \(y\). The page can also be spanned by \(x\) and \(z\), or even by \(y\) and \(z\). Of course, the page can also be spanned by \(x\), \(y\), and \(z\), but here we have a linear dependence, so it is not a minimal spanning set. A minimal spanning set is known as a basis. So any pair of \(x\), \(y\), and \(z\) is a basis for the page. It takes two basis vectors to span a plane, three basis vectors to span a 3-dimensional space, and so on (even though we cannot visualize spaces of more than three dimensions).

In R, we already know how to create vectors:

```
vx <- c(1,1,1)
vy <- c(1,2,3)
```

We can add vectors using the standard addition operations:

`vx + vy`

`## [1] 2 3 4`

Similarly we can perform scalar multiplication by:

`3*vx `

`## [1] 3 3 3`
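Linear combinations just chain these two operations together. Here is a quick sketch, reusing the vectors from above (re-created here so the snippet is self-contained):

```r
vx <- c(1, 1, 1)
vy <- c(1, 2, 3)
# The linear combination a*x + b*y with scalars a = 2 and b = 3
z <- 2*vx + 3*vy
z   # [1] 5 8 11
```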

#### 2.2.1.2 Measures of vectors

There are many different ways to find the norm of a vector. In general, the \(L_p\)-norm is defined by taking the sum of the absolute values of all the elements raised to the \(p\)th power and then taking the \(p\)th root of the result, or \[ \left\lVert x\right\rVert_p = \Big(\sum_{i=1}^n|x_i|^p\Big)^{1/p} \] You might notice that when \(p=2\) we have the usual Euclidean norm of the vector. Because this norm is so universally important, particularly for common statistical methodologies, we are going to use it as the default. So if \(p\) is not provided, you can assume that we mean \(p=2\): \[ \left\lVert x\right\rVert=\left\lVert x\right\rVert_2= \Big(\sum_{i=1}^nx_i^2\Big)^{1/2} \]
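R has no single built-in \(L_p\)-norm function for plain vectors, but the definition translates directly into code (the example values here are assumed for illustration):

```r
v <- c(3, -4)
# Euclidean (L2) norm: square root of the sum of squares
l2 <- sqrt(sum(v^2))
l2   # 5
# General Lp norm; note the absolute values, here with p = 1
p <- 1
lp <- sum(abs(v)^p)^(1/p)
lp   # 7
```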

The inner product is a way of taking the product of two vectors. Rather than operating element-wise like addition, the inner product takes the sum of all the element-wise products. This makes it something like the sum-product function you may be familiar with from spreadsheet programs. \[ \langle x,y\rangle=\sum_{i=1}^n x_i y_i \] This means we can rewrite our Euclidean norm as \[ \left\lVert x\right\rVert = \sqrt{\langle x,x\rangle} \] The inner product is particularly useful in statistics because it can easily be used to define the sample mean of a vector (consider using \(y\) as a vector of ones) and the sample covariance between two vectors (first subtracting the mean from each vector and then taking the inner product of the results).
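To see those statistical claims concretely, here is a small sketch (data values assumed for illustration) computing the sample mean and sample covariance as inner products:

```r
x <- c(2, 4, 6, 8)
y <- c(1, 3, 5, 7)
n <- length(x)
# Sample mean as an inner product with a vector of ones
mean_x <- sum(x * rep(1, n)) / n
mean_x                        # 5, the same as mean(x)
# Sample covariance: subtract the means, then take the inner product
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
all.equal(cov_xy, cov(x, y))  # TRUE
```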

Going back to R, the inner product of two vectors can be found using the `crossprod` function. If two inputs are provided to `crossprod`, then we get the inner product of the two vectors. When only one input is provided, the function takes the inner product of the vector with itself:

`vx`

`## [1] 1 1 1`

`vy`

`## [1] 1 2 3`

`crossprod(vx,vy)`

```
## [,1]
## [1,] 6
```

`crossprod(vx)`

```
## [,1]
## [1,] 3
```

If instead we want to take common summary statistics for a variable, several of which are built on inner products, like the total, the sample mean, the sample standard deviation, and the sample median of customer service calls:

`sum(churn$cs_calls) # Sum all the values in cs_calls`

`## [1] 5209`

`mean(churn$cs_calls) # Sample mean of cs_calls`

`## [1] 1.562856`

`sd(churn$cs_calls) # Sample standard deviation of cs_calls`

`## [1] 1.315491`

`median(churn$cs_calls) # Sample median of cs_calls`

`## [1] 1`

The inner product is also helpful in finding the angle between two vectors using the law of cosines: \[ \cos\theta=\frac{\langle x,y\rangle}{\left\lVert x\right\rVert\left\lVert y\right\rVert} \] Geometrically, this would look like:

From this, we can see what it takes for two vectors to be completely orthogonal: \[\begin{aligned} 0&=\cos(\pi/2)=\frac{\langle x,y\rangle}{\left\lVert x\right\rVert\left\lVert y\right\rVert}\\ 0&=\langle x,y\rangle \end{aligned}\] So the inner product is zero if and only if the two vectors are orthogonal.
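Orthogonality is easy to check with the same `crossprod` function; the two vectors below are assumed for illustration:

```r
u <- c(1, 0, 1)
v <- c(1, 0, -1)
# A zero inner product means the vectors are orthogonal
crossprod(u, v)    # 0
# The law of cosines then gives an angle of pi/2
costheta <- crossprod(u, v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
acos(costheta)     # 1.570796, i.e. pi/2
```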

ADD GRAM SCHMIDT ORTHOGONALIZATION
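As a preview of that topic, here is a minimal sketch of classical Gram–Schmidt orthogonalization in R (illustration only; it assumes the columns of `X` are linearly independent and is not numerically robust):

```r
gram_schmidt <- function(X) {
  Q <- X
  for (j in seq_len(ncol(X))) {
    for (k in seq_len(j - 1)) {
      # Remove the projection of column j onto the (already orthogonal) column k
      Q[, j] <- Q[, j] - c(crossprod(Q[, k], Q[, j]) / crossprod(Q[, k])) * Q[, k]
    }
  }
  Q
}
X <- cbind(c(1, 1, 0), c(1, 0, 1))
Q <- gram_schmidt(X)
crossprod(Q[, 1], Q[, 2])   # 0: the columns are now orthogonal
```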

#### 2.2.1.3 Variable modification

Sometimes it is necessary to modify variables to better understand and interpret data. For instance, we might want to make a variable in the churn dataset which is the sum of all the component charges. This would not be the total phone bill, but rather an approximation of the bill, because we don’t know if there are other costs associated with a phone bill (a fixed cost, for instance, would be very common). To make a variable like this, we need to create a space for that variable, `churn$bill`, and then assign the sum to it:

```
churn$bill <- churn$day_charges+churn$eve_charges+
churn$night_charges+churn$intl_charges
summary(churn$bill) ## Adding variables to make new variables
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.93 52.38 59.47 59.45 66.48 96.15
```

Variables with only two outcomes are called dummy variables. They typically indicate whether or not a condition is met. They are also known as true/false variables, binary variables, or indicator variables. Typically we transform these variables to take the values 0 and 1 – 0 when the condition is false and 1 when the condition is true. This way, if you sum the variable, you find out how many true cases are in the data set, and when you take the sample mean, you find the proportion of true cases in the data set. Currently, if we look at the `churn$churn` variable, we see that it takes the values `True.` and `False.`:

`table(churn$churn)`

```
##
## False. True.
## 2850 483
```

Right now churn has two different string values. That means the computer is actually saving the words “True.” and “False.” rather than just a single on/off indicator at every data point. In order to convert this variable, we are going to test whether the word is “True.”. We can test for equality using the `==` operator. There are also `<`, `>`, `<=`, and `>=` operators for less-than/greater-than comparisons (for strings, these compare alphabetical order rather than magnitude). The `!=` operator tests for inequality. Notice that the `==` operator is not the same as the `=` sign, which can be used like `<-` for assigning data to a variable.

Here we are going to test whether `churn$churn=="True."`. We put `True.` in quotes to indicate that it is a string. This test is going to give us `TRUE` and `FALSE`, R’s global representations of truth and falsehood. Once we have these, we can convert the result to numeric using the `as.numeric` command.

`churn$churn <- as.numeric(churn$churn=="True.")`

There is an alternative way to do this using the `ifelse` command. This command is notoriously slow, but for people familiar with Excel functions, it behaves very similarly to the `IF()` function in Excel. The test is the first parameter, the value to take if the test is true is the second, and the else (false) value is the third parameter:

`#churn$churn <- ifelse(churn$churn=="True.",1,0) # Slower technique, excel-friendly`

We want to do similar things to `intl_plan` and `vm_plan`, which take the values “yes” and “no”:

```
churn$intl_plan <- as.numeric(churn$intl_plan=="yes")
churn$vm_plan <- as.numeric(churn$vm_plan=="yes")
```

For the purposes of this tutorial, we are going to use some functions which require our data to have numeric values everywhere. The `state` and `phone` variables are strings right now. We could convert these to numeric, but it is easier for us to just drop them:

```
## Drop non-numeric variables
churn$state <- NULL
churn$phone <- NULL
```

Now when we summarize the data, you can see that all the variables are number valued:

`summary(churn)`

```
## length intl_plan vm_plan cs_calls
## Min. : 1.0 Min. :0.00000 Min. :0.0000 Min. :0.000
## 1st Qu.: 74.0 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:1.000
## Median :101.0 Median :0.00000 Median :0.0000 Median :1.000
## Mean :101.1 Mean :0.09691 Mean :0.2766 Mean :1.563
## 3rd Qu.:127.0 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:2.000
## Max. :243.0 Max. :1.00000 Max. :1.0000 Max. :9.000
## churn bill mins calls
## Min. :0.0000 Min. :22.93 Min. :284.3 Min. :191.0
## 1st Qu.:0.0000 1st Qu.:52.38 1st Qu.:531.5 1st Qu.:282.0
## Median :0.0000 Median :59.47 Median :593.6 Median :305.0
## Mean :0.1449 Mean :59.45 Mean :591.9 Mean :305.1
## 3rd Qu.:0.0000 3rd Qu.:66.48 3rd Qu.:652.4 3rd Qu.:328.0
## Max. :1.0000 Max. :96.15 Max. :885.0 Max. :416.0
```

### 2.2.2 Matrix algebra

#### 2.2.2.1 Definitions

As we saw with programming in R, matrices are collections of vectors which all have the same length. Consider taking \(r\) different \(n\times 1\) vectors:

\[ \mathbf{X}=\begin{bmatrix} x_{1,1} & x_{1,2} & \dots & x_{1,r} \\ x_{2,1} & x_{2,2} & \dots & x_{2,r} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \dots & x_{n,r} \end{bmatrix} \] This is what we would call an \(n\times r\) (pronounced \(n\) by \(r\)) matrix. Just like vectors, the sum of two matrices is the sum of all the elements: \[ \mathbf{X}+\mathbf{Y}=\begin{bmatrix} x_{1,1}+y_{1,1} & \dots & x_{1,r}+y_{1,r} \\ \vdots & \ddots & \vdots \\ x_{n,1}+y_{n,1} & \dots & x_{n,r}+y_{n,r} \end{bmatrix} \] This is in keeping with our definition of a matrix where we said that it was a collection of vectors. In the same way, the product of a matrix with a scalar is just every element multiplied by the scalar: \[ a\mathbf{X}=\begin{bmatrix} a x_{1,1} & \dots & a x_{1,r} \\ \vdots & \ddots & \vdots \\ a x_{n,1} & \dots & a x_{n,r} \end{bmatrix} \]

Sometimes, to save space, we are going to write matrices using indexing notation: \[ \mathbf{X}=\begin{bmatrix}x_{i,j}\end{bmatrix}_{i=1\dots n, j=1\dots r} \]

In this notation, the definition of the product would be \[ a\mathbf{X}=\begin{bmatrix}a x_{i,j}\end{bmatrix}_{i=1\dots n, j=1\dots r} \]

Another critically important matrix definition is the transpose operation. The transpose changes column vectors into row vectors, and so it flips a matrix along its diagonal (switching the rows and columns): \[ \mathbf{X}'=\begin{bmatrix} x_{1,1} & x_{2,1} & \dots & x_{n,1} \\ x_{1,2} & x_{2,2} & \dots & x_{n,2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1,r} & x_{2,r} & \dots & x_{n,r} \end{bmatrix} \] So if \(\mathbf{X}\) is \(n\times r\), then \(\mathbf{X}'\) is going to be \(r\times n\).

In R, we can construct matrices using the `matrix` function:

```
mx <- matrix(1:6,nrow=2,ncol=3)
mx
```

```
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
```

We can add matrices and perform scalar multiplication as well:

`mx+matrix(1,nrow=2,ncol=3)`

```
## [,1] [,2] [,3]
## [1,] 2 4 6
## [2,] 3 5 7
```

`2*mx`

```
## [,1] [,2] [,3]
## [1,] 2 6 10
## [2,] 4 8 12
```

The `t()` function performs the transpose:

`t(mx)`

```
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
## [3,] 5 6
```

#### 2.2.2.2 Linear systems

Typically people are confused when they first learn about the product of two matrices. The reason for this confusion is that there is no single well-defined product of two matrices; there are many. We are going to be concerned with three different kinds of products. Two of those we will discuss later, but the dot product is of particular importance to us. From now on, if we talk about multiplying two matrices, assume that we mean the dot product, because it is going to be our default.

The dot product is not what you would initially think of as a product for matrices, but it is a generalization of an idea we have already studied, the inner product of two vectors. That means that we are going to be doing two operations when we multiply two matrices: multiplication and summation.

The other important background for our understanding is knowing about the following linear equation: \[ \begin{aligned} y_{i} = \beta_{1} x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_r x_{i,r} + e_{i} \end{aligned} \label{eq1}\tag{1} \] This linear equation is going to be the first kind of equation we use for modeling data. We will study it extensively in Chapter 3. It is of foundational importance to us because it is a simple way to relate data. In the equation, we are modeling some variable in the data, \(y_i\), using some other variables, \(x_{i,1},\dots,x_{i,r}\). We are relating those variables using a linear (additive) equation of the \(\mathbf{X}\) variables and some unknown error \(e\). We are doing this for each observation (individual) in the data: \(i=1,\dots,n\). In vector notation, we could write this as: \[ y=\beta_1 x_1 + \dots + \beta_r x_r + e \] So how are we to write this if we are going to think about \(\mathbf{X}\) as an \(n\times r\) matrix and \(\beta\) as an \(r\times 1\) vector of the \(\beta_j\)s? The simple way is to write: \[ y=\mathbf{X}\beta + e \] This is going to be the equation for linear modeling, but to make this work, matrix multiplication must be defined correctly. Remember Eq. (1): in each row, we need to use the entire vector of \(\beta\). That means that we are going to have \(n\) copies of \(\beta\) when we do our multiplication. The other important part is that we are going to need to sum across the columns in order to create the desired result (because \(y\) and \(e\) are both \(n\times 1\) vectors, so \(\mathbf{X}\beta\) must be as well). The reason matrix multiplication involves both products and sums is that we are trying to simply express the linear model that we know from Eq. (1).

To make this kind of multiplication work, the following is the definition of the matrix dot product: \[ \mathbf{XY} = \begin{bmatrix} \sum_{k=1}^m x_{i,k}y_{k,j}\end{bmatrix}_{i=1\dots n, j=1\dots r} \] We can see from this definition that \(\mathbf{XY}\) is an \(n\times r\) matrix. To make this, we can see that we sum \(x_{i,k}\) from \(i=1,\dots,n\) and \(k=1,\dots,m\). This means that \(\mathbf{X}\) must be \(n\times m\). Similarly we sum \(y_{k,j}\) from \(k=1,\dots,m\) and \(j=1,\dots,r\). That means that \(\mathbf{Y}\) must be \(m\times r\). Notice that the number of columns of \(\mathbf{X}\) must be equal to the number of rows from \(\mathbf{Y}\). This is what makes matrix multiplication confusing, but if you remember Eq. (1), this definition makes sense. \[ \mathbf{X}\beta = \begin{bmatrix} \sum_{k=1}^r x_{i,k}\beta_{k,j}\end{bmatrix}_{i=1\dots n, j=1} = \begin{bmatrix} \sum_{k=1}^r \beta_{k}x_{i,k}\end{bmatrix}_{i=1\dots n} \] Which is the exact column vector that we need for Eq. (1).
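We can check this connection in R with a small assumed example: the `%*%` product of a matrix and a coefficient vector matches the hand-built linear combination of the matrix's columns:

```r
X <- matrix(c(1, 1, 1,
              1, 2, 3), nrow = 3)   # 3 x 2 matrix: a column of ones and (1,2,3)
beta <- c(2, 5)
# Row i of the product is beta[1]*x[i,1] + beta[2]*x[i,2]
X %*% beta                          # 7, 12, 17
beta[1]*X[, 1] + beta[2]*X[, 2]     # the same values
```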

What you will notice is that matrix multiplication is not commutative. That means that, \(\mathbf{XY}\neq\mathbf{YX}\), and in fact, if \(\mathbf{XY}\) exists because the inside dimensions match, that doesn’t mean that the inside dimensions \(\mathbf{YX}\) will match, so \(\mathbf{YX}\) might not even be defined.

We can now define the inner product between two vectors using matrix notation: \[ \langle x,y\rangle = x'y \] You can verify for yourself that the above is true. An important matrix to know about is \(\mathbf{I}_n\), the \(n\times n\) identity matrix: \[ \mathbf{I}_n=\begin{bmatrix} 1&0&\dots&0\\ 0&1&\dots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\dots&1 \end{bmatrix} \] Any matrix when multiplied by the identity matrix will remain itself. That is: \[ \mathbf{X}_{n\times r}=\mathbf{I}_n \mathbf{X}=\mathbf{X} \mathbf{I}_r \]
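In R, `diag(n)` constructs \(\mathbf{I}_n\), and we can verify the identity property directly on a small example matrix:

```r
X <- matrix(1:6, nrow = 2, ncol = 3)
# Multiplying by the identity on either side leaves X unchanged
all(diag(2) %*% X == X)   # TRUE
all(X %*% diag(3) == X)   # TRUE
```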

There is an important thing to note here. You don’t need to know exactly how to calculate matrix multiplication for the purposes of this book. In fact, for much of the book, you barely need to understand this at all. The computer is going to be doing a lot of matrix algebra, but most of this happens behind-the-scenes where a data analyst doesn’t even see it at all. However, many data science books are written almost entirely in matrix algebra, so to learn more about data science techniques and tools requires some awareness of matrices and their products.

If we want to multiply matrices in R, then we can use the `%*%` operator:

`mx`

```
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
```

```
my <- matrix(1,nrow=2,ncol=2)
my%*%mx
```

```
## [,1] [,2] [,3]
## [1,] 3 7 11
## [2,] 3 7 11
```

`mx%*%my`

`## Error in mx %*% my: non-conformable arguments`

The second result threw an error because we cannot multiply a \(2\times 3\) matrix by a \(2\times 2\) matrix: their inner dimensions do not match. Sometimes we want to take the transpose before multiplying. The helpful `crossprod` function which we used for the inner product does just that and will produce `t(my) %*% mx`:

```
my <- matrix(1:4,nrow=2,ncol=2)
crossprod(my,mx)
```

```
## [,1] [,2] [,3]
## [1,] 5 11 17
## [2,] 11 25 39
```

#### 2.2.2.3 Measures of matrices

There are two important summary statistics to be aware of when it comes to matrices. The first is the trace. The trace of any square (\(n\times n\)) matrix is the sum of the elements along its diagonal. That is \[ \text{tr}(\mathbf{X})=\sum_{i=1}^{n}x_{i,i} \] The trace is a useful operation for working on some problems because of its interesting properties. It is obvious that the trace of a \(1\times 1\) scalar is just the value of that scalar. This is helpful when combined with the fact that we can cycle any multiplied matrices inside the trace while it remains the same. That is \[ \text{tr}(\mathbf{ABC})=\text{tr}(\mathbf{CAB})=\text{tr}(\mathbf{BCA}) \] Sometimes it is easier to calculate the trace of the cycled matrices than it is to calculate the product by itself. It is important to note that cycling through the matrices is not the same thing as commutatively reordering them, that is: \[ \text{tr}(\mathbf{ABC})\neq\text{tr}(\mathbf{BAC}) \] where \(\mathbf{A}\) and \(\mathbf{B}\) have just been swapped. Instead, we can only move the first element to the back or the back element to the front, repeatedly.
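The cycling property is easy to verify numerically in R (the small matrices below are assumed for illustration):

```r
A <- matrix(1:4, nrow = 2)
B <- matrix(c(2, 0, 1, 3), nrow = 2)
C <- matrix(c(0, 1, 1, 0), nrow = 2)
tr <- function(M) sum(diag(M))
# Cycling the product leaves the trace unchanged...
tr(A %*% B %*% C)   # 14
tr(C %*% A %*% B)   # 14
# ...but swapping two matrices generally does not
tr(B %*% A %*% C)   # 16
```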

The determinant of a matrix is a concept which is generally poorly understood (and with good reason). Its definition is one of the worst in this book. For a square (\(n\times n\)) matrix \(\mathbf{X}\), the determinant of \(\mathbf{X}\) is: \[ \det(\mathbf{X}) = \sum_{p\in S_n}\Big(\text{sgn}(p)\prod_{i=1}^n x_{i,p_i}\Big) \] where \(S_n\) is the set of all the permutations of the numbers \(1,\dots,n\). The determinant of a matrix is something like a distance measure for the matrix (not exactly: there are matrix norms, which are different). Don’t be concerned if you can’t pin down an exact meaning of this statistic. It’s difficult. Just know that it’s a value of a matrix that our computers can calculate to tell us some summary information about the matrix. Most people working with matrices use the determinant. Few understand it completely.

Because we already know the concept of span for vectors, the column space is easy to define. The column space is the space spanned by the column vectors of a matrix. The rank is the number of dimensions in that space. So if we have an \(n\times 2\) matrix, then its rank could be at most two if the vectors are linearly independent, one if there is a linear dependence, or zero if the matrix consists entirely of zeros.

For a square matrix, the determinant is nonzero if and only if the rank of the matrix is equal to its number of rows and columns, that is: \[ \det(\mathbf X_{n\times n}) \neq 0 \iff \text{rank}(\mathbf{X}) = n \] We would say in that case that \(\mathbf{X}\) is full-rank. It is possible for a non-square matrix to be full-rank too. We say that happens when \(\text{rank}(\mathbf{X}_{n\times r})=\min(n,r)\).

To calculate the trace of a matrix in R, the easiest way is to use the `diag` function to pull out the diagonal elements of the matrix and then sum the result:

```
mx <- matrix(1:4,nrow=2,ncol=2)
mx
```

```
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
```

`sum(diag(mx)) ## Trace of a matrix`

`## [1] 5`

The determinant can be found easily using the `det` function:

`det(mx) ## Determinant`

`## [1] -2`

The rank of the matrix can be estimated using the QR decomposition and pulling the rank estimate from the result:

`qr(mx)$rank ## Rank of a matrix`

`## [1] 2`
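By contrast, a matrix with a linearly dependent column (here the second column is twice the first, an assumed example) has a determinant of zero and is not full-rank:

```r
S <- matrix(c(1, 2,
              2, 4), nrow = 2, byrow = TRUE)  # column 2 is 2 times column 1
det(S)        # 0
qr(S)$rank    # 1
```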

You don’t need to know what the QR decomposition is. It’s interesting and used heavily in computational statistics, but it is beyond the scope of this book. See Rao (1996???) Handbook of Statistics 9, Computational Statistics.

#### 2.2.2.4 Solving systems

Using our definition of matrices, we can define a linear system of equations as: \[ \mathbf{A}x=y \] where \(\mathbf{A}\) is an \(n\times n\) (square) matrix and \(x\) and \(y\) are \(n\times 1\) vectors. We can solve this system of equations for \(x\) if \(\mathbf{A}\) does not have any linear dependence among its columns, i.e., when \(\mathbf{A}\) is full-rank. When this equation is solvable, we can construct the inverse matrix \(\mathbf{A}^{-1}\). \(\mathbf{A}^{-1}\) is the unique \(n\times n\) matrix which, when multiplied by \(\mathbf{A}\), results in the identity matrix: \[ \mathbf{I}_n = \mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} \]

We don’t need to know how to calculate the inverse of a matrix; the computer will do this for us. What you should understand, however, is that using the inverse of the \(\mathbf{A}\) matrix, we can solve the linear system above: \[ \begin{aligned} \mathbf{A}x &= y \\ \mathbf{A}^{-1}\mathbf{A}x &= \mathbf{A}^{-1} y \\ \mathbf{I_n}x &= \mathbf{A}^{-1} y \\ x &= \mathbf{A}^{-1} y \end{aligned} \]

ADD PARTITIONED MATRICES

In R, we can invert matrices when the determinant is not zero:

```
A <- matrix( c(5, 1, 0,
3,-1, 2,
4, 0,-1), nrow=3, byrow=TRUE)
det(A)
```

`## [1] 16`

Since the determinant is not zero, we can invert the matrix using the `solve()` function:

`solve(A)`

```
## [,1] [,2] [,3]
## [1,] 0.0625 0.0625 0.125
## [2,] 0.6875 -0.3125 -0.625
## [3,] 0.2500 0.2500 -0.500
```

If we want to invert and then immediately multiply by another vector, we can use the `solve` function with two inputs. The following will give us \(\mathbf{A}^{-1}b\):

```
b <- matrix(1:3,nrow=3)
solve(A,b)
```

```
## [,1]
## [1,] 0.5625
## [2,] -1.8125
## [3,] -0.7500
```

This is faster than computing the inverse and then multiplying separately because it solves the system in one step.

#### 2.2.2.5 Additional products

There are two additional products that you should be aware of for matrices. The first is the Hadamard product. For many students, this is the matrix product they expect. It is also known as the element-wise product (a name which we will often use because it is more descriptive). The Hadamard product works element-by-element so if we have two \(n\times r\) matrices \(\mathbf{X}\) and \(\mathbf{Y}\), the Hadamard product will be: \[ \mathbf{X}\circ\mathbf{Y}=\begin{bmatrix}x_{i,j}y_{i,j}\end{bmatrix}_{i=1\dots n, j=1\dots r} \] This operation is very useful for operations such as scaling data.

The other product you should be aware of is the Kronecker product. The Kronecker product copies a matrix a number of times. \[ \mathbf{X}\otimes\mathbf{Y} = \begin{bmatrix} x_{1,1}\mathbf{Y} & \dots & x_{1,r} \mathbf{Y} \\ \vdots & \ddots & \vdots \\ x_{n,1}\mathbf{Y} & \dots & x_{n,r}\mathbf{Y} \end{bmatrix} \] Each block of the result is an element of \(\mathbf{X}\) times a copy of \(\mathbf{Y}\). If \(\mathbf{X}\) is \(n\times r\) and \(\mathbf{Y}\) is \(m\times s\), the resulting matrix is going to be \(nm \times rs\), which quickly becomes quite massive.

Relating back to the basic matrix dot product, we should also be aware of quadratic forms of matrices. If \(\mathbf{A}_{n\times n}x_{n\times 1}\) is an \(n\times 1\) linear system in \(x\), then a quadratic would be something like \(x'\mathbf{A}x\). We call this kind of relationship a quadratic form, and it results in a scalar value (\(1\times 1\)). As many of the optimization problems we are interested in solving are of a quadratic nature, this kind of form is extremely useful to understand.
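In R, a quadratic form is just two dot products chained together, and the result collapses to a \(1\times 1\) matrix (values below assumed for illustration):

```r
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
x <- matrix(c(1, 2), nrow = 2)
# x'Ax: a 1 x n row vector times an n x n matrix times an n x 1 column vector
t(x) %*% A %*% x   # 18
```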

In R, element-wise multiplication is performed with the basic `*` operator:

```
mx <- matrix(1:6,nrow=3)
my <- matrix(c(0,1,0,1,0,1),nrow=3)
mx*my
```

```
## [,1] [,2]
## [1,] 0 4
## [2,] 2 0
## [3,] 0 6
```

The Kronecker product can be found using the `kronecker` function or the `%x%` operator:

```
mx <- diag(3)
mx
```

```
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
```

```
my <- matrix(1,nrow=4,ncol=1)
my
```

```
## [,1]
## [1,] 1
## [2,] 1
## [3,] 1
## [4,] 1
```

`mx%x%my`

```
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 1 0 0
## [3,] 1 0 0
## [4,] 1 0 0
## [5,] 0 1 0
## [6,] 0 1 0
## [7,] 0 1 0
## [8,] 0 1 0
## [9,] 0 0 1
## [10,] 0 0 1
## [11,] 0 0 1
## [12,] 0 0 1
```

`kronecker(my,mx)`

```
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
## [4,] 1 0 0
## [5,] 0 1 0
## [6,] 0 0 1
## [7,] 1 0 0
## [8,] 0 1 0
## [9,] 0 0 1
## [10,] 1 0 0
## [11,] 0 1 0
## [12,] 0 0 1
```

Notice that while `mx %x% my` and `my %x% mx` are the same shape, they are very different matrices.

### 2.2.3 Calculus and Eigenanalysis

#### 2.2.3.1 Matrix calculus

In this section, we are going to talk about matrix calculus. Essentially, to do matrix calculus, we only need to know a few rules about how derivatives are going to apply to matrix expressions. For our purposes, we only need to be concerned about first and second derivatives of functionals. A functional is a function which can have many inputs, but only has a single scalar output: \[ f(x):\mathbb{R}^r \longrightarrow \mathbb{R} \] Most of the procedures we are going to use are defined to optimize some kind of functional. Ordinary least squares minimizes a functional of the squared errors. Maximum likelihood estimation maximizes a functional which is the likelihood.

Limiting our study to first and second derivatives of functionals is going to simplify our discussion dramatically. The first derivative of a vector-input functional is called a Jacobian. The second derivative is called a Hessian. Because we have a vector input, that means that we are going to have different derivatives for each variable (one with respect to each variable of the input). That means that if our original input vector was an \(r\times 1\) vector, the Jacobian or first derivative is also going to be an \(r\times 1\) vector. Mathematically we would write that as: \[ \text{J}(f(x))= \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\ \vdots \\ \frac{\partial f}{\partial x_r}(x) \end{bmatrix} \]

Because the Jacobian is an \(r\) vector already, when we calculate the Hessian we have to take the derivative of each derivative with respect to each variable. This means that the Hessian is going to be an \(r\times r\) matrix: \[ \mathbf{H}(f(x)) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(x) & \dots & \frac{\partial^2 f}{\partial x_1\partial x_r}(x) \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_r\partial x_1}(x) & \dots & \frac{\partial^2 f}{\partial x_r^2}(x) \end{bmatrix} \] Note that the Jacobian and Hessian matrices are still both functions just like \(f(x)\). You put a function in, you get a function out.

Now for the actual calculus portion. For this, we need to know three rules: how to calculate the derivative of constant terms with respect to \(x\), how to calculate the derivative of linear terms in \(x\), and how to calculate the derivative of quadratic forms of \(x\). If the function \(f(x)\) does not vary with \(x\), then the derivative is simply the same as basic calculus: the derivative is zero. \[ f(x)=a \Rightarrow \text{D}f(x)=0 \]

If \(f(x)\) is linear with respect to \(x\), the derivative is a bit more complicated because this can take two forms: \[ f(x)=x'\mathbf{A}\; \text{or}\; f(x)=\mathbf{A}'x \Rightarrow \text{D}f(x)=\mathbf{A} \] Either way it ends up being the same; just make sure that you apply the transpose operation carefully. For quadratic forms (with \(\mathbf{A}\) symmetric), the derivative becomes linear: \[ f(x) = x'\mathbf{A}x \Rightarrow \text{D}f(x)=2\mathbf{A}x \]

The last rule that we need to know is that the derivative of a sum is just the sum of the derivative of each piece separately, just like regular calculus: \[ f(x) = g(x) + h(x) \Rightarrow \text{D}f(x)=\text{D}g(x)+\text{D}h(x) \]
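These rules can be sanity-checked numerically. The sketch below (values assumed) compares the rule \(\text{D}(x'\mathbf{A}x)=2\mathbf{A}x\) for a symmetric \(\mathbf{A}\) against a finite-difference approximation of the derivative:

```r
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)   # symmetric matrix
f <- function(x) c(t(x) %*% A %*% x)           # the quadratic-form functional
x0 <- c(1, 2)
# Analytic derivative from the rule D(x'Ax) = 2Ax
analytic <- c(2 * A %*% x0)
analytic        # 8 14
# Central finite differences, one coordinate at a time
h <- 1e-6
numeric_grad <- sapply(1:2, function(i) {
  e <- rep(0, 2); e[i] <- h
  (f(x0 + e) - f(x0 - e)) / (2 * h)
})
numeric_grad    # approximately 8 14
```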

#### 2.2.3.2 Eigenvalues and eigenvectors

```
Determinant definition
Maximum variance definition
Eigenvalue decomposition
Singular value decomposition
Matrix square root
Generalized inverse
```