2.5 Algorithms

2.5.1 Measuring complexity

2.5.1.1 Order notation

Consider an algorithm to find the largest value in a vector. The simple way to do this is with a loop. In our study of algorithms, we will program the computations ourselves rather than call outside functions that we do not completely understand, because we don’t necessarily know how many calculations such “black box” functions will make. The simple code we might use could look something like this:

findmax <- function(vec){
  n <- NROW(vec)        # number of elements in the vector
  maxv <- vec[1]        # start with the first element as the current maximum
  for(i in 2:n){
    if(maxv<vec[i]){    # found a larger value, so update the maximum
      maxv <- vec[i]
    }
  }
  return(maxv)
}

Now think about how many calculations the computer is going to make here. First we find the length of the vector; that’s 1 calculation. Next we set maxv, which is going to become our eventual maximum value. The computer has to look up the value in vec[1] and then assign that value to maxv. That’s 2 steps. Then we set up a loop. On each pass through the loop, we perform the following calculations: update i to the current value, recall the values of maxv and vec[i], compare those values, and potentially change maxv to vec[i] if maxv is smaller than vec[i]. That’s 4 or 5 steps per pass (assuming we don’t count recalling values multiple times), depending on whether the update happens. Now how many times do we run this loop? We run it n-1 times. So in total the procedure takes at least 3+(n-1)*4 steps and at most 3+(n-1)*5 steps.

So we could have between \(3+4(n-1)\) and \(3+5(n-1)\) steps for the procedure to finish, where \(n\) is the number of points in the vector. Which should we choose? Sometimes we must consider the worst-case time complexity of a problem. Other times, we can consider the average time complexity (although this is often harder to compute). Here, it doesn’t matter which we choose. We would say that the computation time is linear in \(n\) because both \(3+4(n-1)\) and \(3+5(n-1)\) are linear functions of \(n\). Often we only care about the order of the complexity. That is, a linear function of \(n\) is better than a quadratic function of \(n\), which is better than an exponential function of \(n\). But while one linear function can be better than another, we are not concerned with such improvements. The reason is that it gets too complicated to count every single calculation that a more advanced algorithm will make. We could do it computationally, but differences in the order of the computation are enough for our purposes.

Mathematically, we say that a linear function of \(n\) is \(O(n)\). Technically, this means that the function can be bounded from above by a linear function, but when we use this notation we typically mean that this is the order of the function itself. That is, technically all \(O(n)\) functions are also \(O(n^2)\) and \(O(e^n)\), but we state the tightest bound that we can, so if a function is linear in \(n\) we will write \(O(n)\) unless otherwise stated.
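
To see this linear behavior in practice, we can time findmax() on vectors of increasing length. The sizes below are arbitrary choices and the exact times will depend on your machine, but doubling \(n\) should roughly double the elapsed time:

# Rough timing sketch: doubling the input size should roughly double the time
for(n in c(1e6, 2e6, 4e6)){
  vec <- runif(n)
  print(system.time(findmax(vec)))
}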

2.5.1.2 Searching

Consider a vector which has been sorted from smallest to largest. Suppose that we want to find whether or not a particular element is in the vector and maybe find its position in the list. Obviously there is a simple solution to this problem which is to just loop over the list looking for the element.

basicsearch <- function(element,vec){
  n <- NROW(vec)
  for(i in 1:n){
    if(element == vec[i]){
      return(i)
    }
  }
  return(FALSE)
}

This simple program would run in \(O(n)\) time, which is fairly fast. However, because we can rely on the knowledge that the vector is pre-sorted, there is a faster algorithm. Consider the following binary search:

binarysearch <- function(element, vec){
  n <- NROW(vec)
  mid <- floor((n+1)/2)            # index of the middle element
  diff <- element-vec[mid]         # compare our element to the middle element
  if(diff==0) return(mid)          # exact match: we are done
  if(n==1) return(FALSE)           # only one element and it did not match
  if(diff<0){                      # element is smaller: search the lower half
    result <- binarysearch(element,vec[1:mid])
    return(result)
  }
  if(diff>0){                      # element is larger: search the upper half
    result <- binarysearch(element,vec[(mid+1):n])
    if(result==FALSE){return(FALSE)}
    return(result+mid)             # shift the index back to the full vector
  }
}

This algorithm is a little different, so let’s go through it. Effectively what we are doing is splitting the vector at the middle position (line: mid <- floor((n+1)/2)). We calculate the difference between our element and the middle element to make the later checks quicker: diff <- element-vec[mid]. If our element is less than the middle element, then we know it would be in the lower half of the vector (if it is present at all, line: if(diff<0){). If our element is greater than the middle element, then we know it would be in the upper half (line: if(diff>0){). Therefore, we can “divide and conquer” the problem. We just have to perform the same function on the upper half or the lower half, so we call binarysearch on one of those pieces (e.g., result <- binarysearch(element,vec[1:mid])). We call this kind of function recursive because it calls itself. The rest is just housekeeping: if the middle element is exactly equal to our search value, then return that position (line: if(diff==0) return(mid)). And if the vector we are working on has only one element and it did not match, then don’t recurse; just return FALSE (line: if(n==1) return(FALSE)).
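
To see the halving behavior directly, here is a sketch of a variant of the function (called binarytrace() here purely for illustration) that prints the subvector it examines at each step:

# Same logic as binarysearch(), but printing the subvector at each recursive call
binarytrace <- function(element, vec){
  cat("searching", NROW(vec), "elements:", vec, "\n")
  n <- NROW(vec)
  mid <- floor((n+1)/2)
  diff <- element-vec[mid]
  if(diff==0) return(mid)
  if(n==1) return(FALSE)
  if(diff<0) return(binarytrace(element,vec[1:mid]))
  if(diff>0){
    result <- binarytrace(element,vec[(mid+1):n])
    if(result==FALSE){return(FALSE)}
    return(result+mid)
  }
}

binarytrace(106, seq(100,108,2))   # prints the shrinking subvector, then returns 4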

ADD A DIAGRAM FOR BINARY SEARCH

Let’s test the code to make sure that it runs properly. Remember that to R, FALSE and 0 are equivalent:

vec <- seq(100,108,2)
for(i in 100:110){
  result <- binarysearch(i,vec)
  print(c(i,result))
}
## [1] 100   1
## [1] 101   0
## [1] 102   2
## [1] 103   0
## [1] 104   3
## [1] 105   0
## [1] 106   4
## [1] 107   0
## [1] 108   5
## [1] 109   0
## [1] 110   0

When I was writing the above binary search program, I wrote the code in exactly the order that I described above, but you will notice that the program itself is not in the same order. This is very common when you are writing programs. The main “meat” of the program happens late in the code after the housekeeping has been done. Nonetheless, we write the important parts of the code first because those usually reveal what housekeeping problems we need to resolve.

Now think about the speed of this algorithm. This is harder to pin down. First we split the data into two halves. Then we split it again and again. We do this \(\log_2(n)=\lg(n)\) times in the worst-case scenario. That’s the speed of this algorithm: \(O(\lg n)\). When working with recursive functions, it can be difficult to work out the exact timing, so we commonly use a theorem from algorithm analysis known as the Master theorem to help us:

Theorem 3 (Master theorem) Suppose we have a recursive algorithm. Each time the algorithm runs, it first performs some work on the data which takes \(f(n)\) calculations. Next the function splits the data into \(a\) different subproblems, each of size \(n/b\). Then we can use the following conditions to bound the algorithm’s execution time:

  1. If \(f(n)=O(n^c)\) where \(c<\log_b a\), then the algorithm is \(O(n^{\log_b a})\).
  2. If \(f(n)=O(n^{\log_b a}(\lg n)^k)\), then the algorithm is \(O(n^{\log_b a}(\lg n)^{k+1})\).
  3. If \(f(n)=O(n^c)\) where \(c>\log_b a\), then we do not necessarily know how long the algorithm is going to take.
  4. If \(f(n)=O(n^c)\) where \(c>\log_b a\) and \(af(n/b)\leq kf(n)\) for some \(k<1\) for large values of \(n\), then the algorithm is bounded by \(O(f(n))\).

Consider the binary search problem. For this problem, each time the algorithm is run, it first does a number of checks before splitting the problem. These checks are all \(O(1)\) because they do not require looping over the data. Hence, \(c\) in condition 1 would be equal to zero. Then the algorithm is run on either the upper half or the lower half of the data. Because there is only one subproblem, \(a=1\). Because we then use half of the original size of the data, \(b=2\). Therefore, \(\log_b a = \log_2 1 = 0\). Because \(c=\log_b a\), we cannot use condition 1 to help us solve this problem. So we go to condition 2. In condition 2, \(f(n)\) would have to be \(O((\lg n)^k)\) because \(\log_b a = 0\). This is true for \(k=0\) because \(f(n)=O(1)\). Hence we know that the algorithm is going to be \(O(n^{\log_2 1}(\lg n)^{0+1})=O(\lg n)\).
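
We can also see the gap between \(O(n)\) and \(O(\lg n)\) empirically. The following sketch times many repeated searches with each function on a sorted vector; the vector size and number of repetitions are arbitrary choices, but the loop-based linear search should be noticeably slower:

# Rough comparison of the two search functions; sizes are arbitrary choices
n <- 1e4
vec <- sort(runif(n))
targets <- sample(vec, 500)                        # elements we know are present
system.time(for(tgt in targets) basicsearch(tgt, vec))
system.time(for(tgt in targets) binarysearch(tgt, vec))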

The Master theorem is extremely powerful and allows us to find the approximate run time of many algorithms. It is going to be extremely useful for hierarchical models later in this book.

2.5.1.3 Sorting

Another interesting problem in algorithm analysis is the problem of sorting a vector of numbers from smallest to largest. When sorting, we often want to swap two items in a list, so to make that easy we have written the following swap function. The inputs are the vector where items should be swapped and the two indices that we want to swap. The return value is the same vector with those items swapped:

swap <- function(vec,i,j){
  tmp <- vec[i]
  vec[i] <- vec[j]
  vec[j] <- tmp
  return(vec)
}

The following is a very basic sorting algorithm. Basically we are just repeatedly finding the smallest number in our list. Then we pull that number out and keep working on the rest of the list:

selectsort <- function(vec){
  n <- NROW(vec)
  for(j in 1:(n-1)){
    minv <- vec[j]
    mini <- j
    for(i in (j+1):n){
      if(vec[i]<minv){
        minv <- vec[i]
        mini <- i
      }
    }
    vec <- swap(vec,j,mini)
  }
  return(vec)
}

To show that this function works, consider the following check code:

vec <- rank(runif(10))
vec
##  [1]  5  1  9  7  2 10  8  4  6  3
selectsort(vec)
##  [1]  1  2  3  4  5  6  7  8  9 10

As you can see, our function works. Let’s explore how long this algorithm will take. Each time we want to find the smallest remaining number, we have to perform \(O(n)\) calculations (a loop over the data). We have to do this \(n-1\) times, so the total is going to be \(O(n^2)\). We can do better. Consider the following divide-and-conquer algorithm:

quicksort <- function(vec,st=1,end=NROW(vec)){
  if(end-st+1 <= 1) return(vec)
  result <- partition(vec,st,end)
  vec    <- result[[1]]
  i      <- result[[2]]
  vec    <- quicksort(vec,st,i-1)
  vec    <- quicksort(vec,i+1,end)
  return(vec)
}

In this algorithm, we first check to make sure that we have more than one element to be sorted. If we have one or no elements, then the procedure is done. Second, we partition the data. We haven’t yet defined the partition() function, but it is some function that takes the data, partitions it around some element (separating the values smaller and larger than the value in question), and returns the partitioned data as well as the final index of that partition element. Then we pull out vec and i from the results of the partition function. Then we execute quicksort() on the lower and upper portions of the data. This procedure divides the data into two pieces so that each piece can then be conquered separately, hence, divide and conquer. We just need to define the partition code:

partition <- function(vec,st,end){
  splitv <- vec[st]                      # the partition value
  small  <- c()                          # indices holding values smaller than splitv
  for(i in (st+1):end){
    if(vec[i]<splitv) small <- c(small,i)
  }
  spliti <- NROW(small)+st               # final position of the partition value
  if(spliti==st){                        # nothing is smaller: already partitioned
    result <- list()
    result[[1]] <- vec
    result[[2]] <- spliti
    return(result)
  }
  vec <- swap(vec,st,spliti)             # move the partition value into place
  for(i in st:(spliti-1)){
    if(vec[i]>splitv){                   # a large value sitting in the small side
      vec   <- swap(vec,i,tail(small,n=1))  # swap it with the last small index
      small <- head(small,n=-1)          # and drop that index from consideration
    }
  }
  result <- list()
  result[[1]] <- vec
  result[[2]] <- spliti
  return(result)
}

The partition function is more complicated, so let’s run through it. We take the first value in our vector as the partition value (splitv). In order to get the swaps right, we need to know where in the final list our partition value should be. So we make a list of all the indices where we have values smaller than our partitioned value. This list will be helpful later. So to create that list, we run:

small  <- c()
for(i in (st+1):end){
  if(vec[i]<splitv) small <- c(small,i)
}

Now, to find the position for our partition element, spliti, we can simply count how many elements are in small and add st to shift the number over. If spliti is the same as st, then we’re basically done, so we check for that case and handle it separately. If spliti does not equal st, then we have more work to do. We need to swap the values at spliti and st, so we do that. Then we need to move values greater than splitv to the right of spliti and values less than splitv to the left of spliti. So we look through the portion of the vector to the left of spliti again. If we find a value there that is greater than splitv, we swap it with the element at the largest index remaining in small. That index is going to be a number greater than spliti whose value is less than splitv, so it needs to be moved anyway. We then remove that last index from consideration with a head call:

for(i in st:(spliti-1)){
  if(vec[i]>splitv){
    vec   <- swap(vec,i,tail(small,n=1))
    small <- head(small,n=-1)
  }
}

This partitions our list. The final steps are just packaging the output for the next function to use both the partitioned vector and the split index. To test our code, we run the following:

vec <- rank(runif(10))
vec
##  [1]  1  8 10  4  2  5  6  9  3  7
quicksort(vec)
##  [1]  1  2  3  4  5  6  7  8  9 10
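
Before analyzing the running time, it may help to see what a single call to partition() produces. The numbers below are just an arbitrary illustration:

# One call to partition() on a small vector; the partition value is vec[1] = 5
partition(c(5,3,8,1,9,2), st=1, end=6)
# Should return the rearranged vector c(1,3,2,5,9,8) and the split index 4:
# everything before position 4 is smaller than 5, everything after it is larger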

Now consider how long this code will take. We have a recursive function, so we should use the Master theorem. The splits of the data are going to be roughly 50%/50% (\(b=2\)). We are splitting the data into two groups (\(a=2\)). The partition code which we have to run at each split contains two loops over the data, each of which is at most \(O(n)\), so \(f(n)=O(n^c)\) with \(c=1\). Since \(\log_b(a)=\log_2(2)=1=c\), condition 1 does not apply, so we use condition 2 (with \(k=0\)) and the algorithm will run in \(O(n\lg n)\) time, which is faster than our selection sort’s \(O(n^2)\) time.
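
As a rough empirical check on these orders, we can time both of our sorting functions on a moderately sized vector; the size here is an arbitrary choice, and base R’s compiled sort() is included only for reference:

# Rough timing sketch comparing the two sorts on a random permutation
vec <- sample(2000)
system.time(selectsort(vec))
system.time(quicksort(vec))
system.time(sort(vec))        # base R's compiled sort, for comparison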

ADD VISUALIZATION OF QUICKSORT

2.5.1.4 NP Completeness

Some problems in algorithm design are very difficult and cannot be solved in time that is any polynomial function of the data size. For instance, a problem might be \(O(2^n)\), which cannot be bounded by any function \(O(n^k)\) as \(n\rightarrow \infty\). There is a subset of these problems which is very interesting. Some of these problems seemingly cannot be solved in polynomial time (\(O(n^k)\)), but any potential solution to the problem can be verified in polynomial time. An example of this kind of problem is the factorization of large numbers. Consider a number which has \(n\) digits. Suppose we know that this number is the product of two prime numbers. If we know the prime numbers, it is easy to check that they multiply together to make the large number in question. But given only the number, it is going to take a very long time to check all the possible primes to see which ones are the factors that make up the larger number.
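
As a toy illustration of this asymmetry (using two small, well-known primes; real cryptographic moduli have hundreds of digits), verifying a proposed factorization is a single multiplication, while recovering a factor by trial division means checking a huge number of candidates:

p <- 99991; q <- 104729           # two small primes, for illustration only
N <- p*q
p*q == N                          # verification: one multiplication

trialdivide <- function(N){       # naive search for a factor by trial division
  for(d in 2:floor(sqrt(N))){
    if(N %% d == 0) return(d)
  }
  return(NA)                      # no factor found: N is prime
}
trialdivide(N)                    # tries roughly 100,000 candidates here;
                                  # hopeless for a number with hundreds of digits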

Problems whose proposed solutions can be verified in polynomial time make up the class known as NP; the hardest problems of this kind are called NP-complete, and problems at least that hard are called NP-hard. They are very interesting to computer scientists because they do not have easy solutions. Many modern cryptographic systems use hard problems like these to protect data. For instance, they will have a large integer and they will not let anyone access the data unless they can supply the factors. Anyone who can supply the factors has either solved a very difficult problem or knew the factors to begin with. This is a way of verifying that a person is who they claim to be. For our purposes, we are usually not interested in solving problems that are this difficult. In fact, if a problem can be reduced to a statement like this, we usually consider it unsolvable and move along (even if it could be solved). We actually do not know whether there is a polynomial-time solution to these problems, but that is beside the point. We are not going to work on problems this hard.

2.5.2 Paradigms

Paradigms are common computing strategies that help us solve common problems. They are strategies, not complete algorithms by themselves, but by using them, we can create good solutions which run in a very reasonable amount of time.

2.5.2.1 Divide and conquer

We have already discussed divide and conquer when talking about searching and sorting. The divide-and-conquer strategy is to break the problem apart into small pieces where the solution is obvious, and then combine those solutions to get the final answer. The divide-and-conquer approach to searching is therefore to split the data into two halves over and over until we find the location where our search element would be located in the list. Once we are down to a single potential element, it is easy to compare it with the element we are searching for.

2.5.2.2 Greedy algorithms

Greedy algorithms are important when we are talking about optimization. As you should already be aware, it is possible when minimizing a function to find a local minimum which is not the global minimum. Greedy algorithms look for local minima. The idea is to make the best choice available at each local step. That means that we can miss global minima, but we are sure that any solution we have is locally the best available.

Consider the traveling salesperson problem. This problem is NP-hard, so it is impractical for us to find the exact optimal solution for any sizable number of cities. The idea behind this problem is that we are trying to move a traveling salesperson from city to city. The cities are connected by a set of roads. We wish to have our salesperson visit each city exactly once while traveling the shortest distance possible.

A greedy algorithm for this problem would be to always visit the closest unvisited city. This is going to produce a route which is generally longer than the globally minimal one, but it will run in a relatively short amount of time, \(O(n^2)\), while finding the global solution by checking every route would take on the order of \(n!\) operations (where \(n\) is the number of cities). This is because the set of all permutations of \(n\) cities has size \(n!\), and we can see how quickly \(n!\) grows using Stirling’s approximation: \[ n!\approx \sqrt{2\pi n}\Big(\frac{n}{e}\Big)^n \]
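
Here is a sketch of this nearest-neighbor heuristic on randomly placed cities. The function name and setup are illustrative assumptions; it returns a visiting order and is \(O(n^2)\) because for each of the \(n\) stops it scans the remaining unvisited cities:

# Greedy nearest-neighbor tour: always move to the closest unvisited city
greedytour <- function(x, y){
  n <- length(x)
  dmat <- as.matrix(dist(cbind(x,y)))     # pairwise distances between cities
  tour <- 1                               # start (arbitrarily) at city 1
  unvisited <- 2:n
  while(length(unvisited) > 0){
    current <- tail(tour, n=1)
    nxt <- unvisited[which.min(dmat[current, unvisited])]  # closest unvisited city
    tour <- c(tour, nxt)
    unvisited <- setdiff(unvisited, nxt)
  }
  return(tour)
}

set.seed(1)
greedytour(runif(10), runif(10))          # a (probably suboptimal) tour of 10 cities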

2.5.2.3 Dynamic programming

Dynamic programming works on the principle that we should never calculate the same thing twice. While this may seem obvious, it has many not-so-obvious implementations and is very useful in writing code. For instance, consider the following program to print Pascal’s triangle:

getrowpascal <- function(n){
  row <- rep(0,n+3)
  for(i in 0:n){
    row[i+2] <- choose(n,i)
  } 
  return(row)
}
for(i in 0:4){
  print(getrowpascal(i))
}
## [1] 0 1 0
## [1] 0 1 1 0
## [1] 0 1 2 1 0
## [1] 0 1 3 3 1 0
## [1] 0 1 4 6 4 1 0
system.time(
  for(i in 0:nsim){
    getrowpascal(i)
  }
)
##    user  system elapsed 
##  21.985   0.000  21.985

The above program is going to run in \(O(n^3)\) time where \(n\) is the number of rows. The issue is that recalculating each row one element at a time using the choose() function takes a serious amount of time. choose() basically just calculates \(n!/((n-r)!r!)\), but this calculation involves multiplying up to roughly \(n/2\) numbers together. That is \(O(n)\) work on top of the two loops. If instead we use the previous row to calculate the next row, building the triangle takes \(O(n^2)\) time, which is much faster:

getpascal <- function(n){
  outmat <- matrix(0,nrow=n,ncol=n+2)
  outmat[1,2] <- 1
  for(i in 2:n){
    for(j in 1:i){
      outmat[i,j+1] <- outmat[i-1,j]+outmat[i-1,j+1]
    }
  }
  return(outmat)
}
system.time(getpascal(nsim))
##    user  system elapsed 
##  10.503   0.113  10.617

This is dynamic programming. We are using known information about the last row to save time in calculating the next row.
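
Written out, the update in the inner loop of getpascal() is just Pascal’s rule: each entry in a row is the sum of the two entries above it in the previous row, \[ \binom{n}{r} = \binom{n-1}{r-1} + \binom{n-1}{r}, \] so every new entry costs a single addition instead of an \(O(n)\) product.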

2.5.2.4 Linear programming

2.5.2.5 EM algorithms

EM algorithms are used when we have two goals that we need to accomplish at the same time. This happens when we have a “chicken and the egg” situation. For instance, we might want to “fill in” missing values to help us calculate a model. But how are we going to fill in those values? We probably want to use our model. So which do we do first, fill in the values or calculate the model? The answer is that we switch back and forth between the two repeatedly.

EM stands for expectation maximization (or minimization depending on the circumstance). For instance, in the missing data problem described, we would first calculate the model dropping the missing values, that is we would minimize the sum of the squared errors with something like an OLS model. Then we would calculate the expected missing values using our model (the E step). Then we would fit the model again with OLS (the M step). Then we would repeat until we found a stable result.

This kind of algorithm is used very often in statistics. It’s a simple way to accomplish two goals at once, and often we have many goals.
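
To make the back-and-forth concrete, here is a minimal sketch of the missing-data idea using simulated data and lm(). The data, variable names, and convergence tolerance are all arbitrary illustrative choices; a real EM implementation would be built around the likelihood of the specific model:

# Minimal sketch of the impute/refit loop for a regression with missing responses
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 + 3*x + rnorm(n)
y[sample(n, 40)] <- NA                    # knock out some responses

fit   <- lm(y ~ x)                        # initial fit: lm() drops the missing rows
yfill <- y
for(iter in 1:50){
  yfill[is.na(y)] <- predict(fit, newdata=data.frame(x=x[is.na(y)]))   # E step
  newfit <- lm(yfill ~ x)                                              # M step
  if(max(abs(coef(newfit)-coef(fit))) < 1e-8) break   # stop when coefficients stabilize
  fit <- newfit
}
coef(fit)                                 # should be close to the true values 2 and 3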

2.5.3 Practical computing

Practically speaking, there are other time savings that we can achieve in our code. In this section, we will discuss a few ways to speed up our code.

2.5.3.1 Vector operations

As we discussed in our simulation setup, avoiding loops can lead to enormous time savings when running code. Consider for a moment a simple program to calculate the sample mean:

getmean <- function(vec){
  n <- NROW(vec)
  suma <- 0
  for(i in 1:n){
    suma <- suma+vec[i]
  }
  return(suma/n)
}

This program should be very fast. It only involves one loop of the numbers, so it should run in \(O(n)\) time. We can measure how long it takes using the system.time() function.

x <- rnorm(nsim^2, mean=5)
system.time(mx <- getmean(x))
##    user  system elapsed 
##    5.52    0.00    5.52
mx
## [1] 4.999995

Even though it is fast, it is running on a billion numbers, so it takes some time. Vectorized and matrix operations in R are implemented in optimized, compiled code that runs much faster than an interpreted loop. The built-in mean() function gets the same answer, but the calculation is done much faster:

system.time(mx <- mean(x))
##    user  system elapsed 
##   0.272   0.000   0.273
mx
## [1] 4.999995

If you are using Microsoft R Open, these operations are handled by Intel’s Math Kernel Library, which is essentially the fastest way to run such code on Intel chipsets. If you have an AMD or ARM processor or you are just using CRAN R, then you are using BLAS and LAPACK, which are extremely fast as well (if you compile R yourself, be sure to link it against optimized BLAS/LAPACK libraries).

All this means that a simple way to speed up R code is to use vector/matrix operations instead of basic loops. We will be using these operations as much as possible in our code.

2.5.3.2 The data.table package

Another way to seriously speed up operations is to use the data.table package. We have already talked about how operations are faster because we can use group by calls instead of looping, but there are two more reasons that data.table code is faster. For one, data.table calls use multithreading which we talk about in the next section. That is going to lead to a huge speed up by itself. The other reason is more subtle but still very important. Essentially, data.table does not needlessly copy data over and over again.

All computer languages use a system of pointers to relate variable names to the data underneath. The problem is that copying a variable is ambiguous: are you copying the data, or are you copying the pointer and pointing to the same location? R was designed so that users do not have to think about pointers, which is normally very convenient. The cost is that, in effect, whenever data is modified after being passed between functions or assigned to a new name, R copies the entire data set. That takes time and memory. data.table is built on a system of pointers (modification by reference) so that we can pass data from one function to another without copying it over and over again. Of course, that means we have to care a little bit about pointers, which could occasionally lead to an error (occasionally, because most calls will still create a copy), but the time savings make it more than worthwhile.
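
As a small illustration of the difference (the objects here are toy examples), base R’s tracemem() reports whenever an object is duplicated. Adding a column to a data.frame the usual way should trigger a copy report, while data.table’s := operator modifies the table in place:

library(data.table)

df <- data.frame(x=1:5)
dt <- data.table(x=1:5)

tracemem(df)              # ask R to report whenever df is duplicated
df$y <- df$x*2            # typically prints a tracemem message: the data.frame was copied
untracemem(df)

tracemem(dt)
dt[, y := x*2]            # := modifies dt by reference: no copy report
untracemem(dt)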

2.5.3.3 Parallel computing

As we said in the last section, data.table calls use parallelism to speed up operations. That means that if we embed our calls inside of data.table calls (e.g., using group bys), we can do parallel computing without thinking about it. However, if we want to think about it, we can use the parallel package. The issue is that parallel computing works differently on Microsoft Windows than it does on Unix-based systems. On Unix, we can fork a subprocess to do a job and keep going. On Windows, processes cannot be forked, so creating subprocesses is more involved. As I am using Unix, I will use the fork notation. If you are using Windows, you can either stick with data.table calls or look into the %dopar% operator from the foreach package.

For Unix-based systems, we can fork a process to do a job with the mcparallel() function. This function tells the system to create a subprocess to do the job; once the subprocess is created (not completed), our central process keeps going. If we put this in a loop, we can create a number of subprocesses. Eventually, we are going to want the results from these processes, so we need to wait for them to finish and collect their results. We do this with the mccollect() function:

para <- function(dtable,func,jobs=parallel::detectCores()-2){
  bigN <- nrow(dtable)                 # number of rows in the table passed in
  for(i in 1:jobs){
    isim <- seq(i,bigN,jobs)           # the simulation indices this job handles
    parallel::mcparallel(dtable[sim%in%isim,eval(func),by=sim])
  }
  result <- rbindlist(parallel::mccollect())[order(sim)]   # wait, gather, and re-sort
  return(result)
}

To make these calls easier to use, we have wrapped them in the para() function above. It is important to note that this function is slower than typical data.table calls because data.table becomes single-threaded (not parallel) when it is used inside a parallel call. This is to prevent the creation of too many subprocesses, which could crash the computer. To test that our parallel code works, I wrote the following check code. This is the output with data.table calls only:

dt <- build.nsim(nsim,c(10,25,50,100,250))
bigN <- nrow(dt)
dt$x <- rnorm(bigN,0,1)
result <- dt[,lapply(.SD,mean),by=sim]
result[,sd(x),by=n]
##      n         V1
## 1:  10 0.31686429
## 2:  25 0.20142312
## 3:  50 0.14299160
## 4: 100 0.10126725
## 5: 250 0.06351962

And you can see that the output is the same with the para() function:

result <- para(dt,quote(lapply(.SD,mean)))
result[,sd(x),by=n]
##      n         V1
## 1:  10 0.31686429
## 2:  25 0.20142312
## 3:  50 0.14299160
## 4: 100 0.10126725
## 5: 250 0.06351962