2.1 Programming
At its most basic level, R is a calculator. Of course, it is a sophisticated calculator, but it is still just a calculator. You should already know the basic operations of R, such as adding, subtracting, multiplying, and dividing:
5+4
## [1] 9
7*3
## [1] 21
8/4
## [1] 2
When I write code here (as I did above), that is code that can be run in the R console. If you are using a development environment, such as RStudio, then you can also write entire R scripts (.R files), which can be run as well. You should not trust R to save your console work from session to session. When you restart R, you might lose all work that is not saved in scripts. Scripting is therefore a helpful way to keep track of your work so that you can reproduce your results. In RStudio you can run a script one line at a time by repeatedly pressing Ctrl+Enter. This runs the current line of code (or the highlighted region, if applicable) and advances to the next line.
When it comes to writing good code, you should comment your intentions. This is both for other people looking at your code today and so that you know what your code is doing when you look back later (possibly years later). Commenting is also a good way to think through what you are doing so that you code with purpose.
To write a comment, you can simply use the # character. If you place a # on any line, anything that occurs after that mark will be a comment, which R understands not to execute:
5+4 # 7+3
## [1] 9
You can see above that R executes 5+4 but ignores 7+3 because it occurs after the comment character. In the following code, you can see an entire line comment as well as a more informative inline comment.
#Simple product code
7*3 # takes 7 and multiplies by 3
## [1] 21
For the most part, I am not going to use entire line comments in this book. The text of the book already gives you that information, and such comments complicate the typesetting process. You should still put clear comments in your own code to document what it is doing.
2.1.1 Data types
In order to get started programming in R, it is critical to understand the primary data structures. As a reference here, you can read the Quick-R on data types or the first section of Hadley Wickham’s Advanced R book.
2.1.1.1 Primitives
Any single value in R can be represented with a type. We are going to call these fundamental data types ‘primitives.’ The primitive types are numeric, character, factor, and logical. Any single value must have one of these types, although there are more complicated objects which contain multiple data points and multiple types of data.
numeric variables are simply numbers. character variables (a.k.a. strings) are words, phrases, or collections of letters and numbers. Basically, these are just some kind of text. factor variables are R’s system for categorical data (internally, a factor is stored as integer codes together with a set of labels called levels). logical values are either TRUE or FALSE.
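As a quick check of these four types, the class function reports the type of any value. The following is only a small sketch; the variable names are made up for illustration:

```r
# Store one value of each type discussed above
num_val <- 3.14            # numeric
chr_val <- "hello"         # character
fct_val <- factor("red")   # factor (categorical)
lgl_val <- TRUE            # logical

# class() reports the type of each value
class(num_val)   # "numeric"
class(chr_val)   # "character"
class(fct_val)   # "factor"
class(lgl_val)   # "logical"
```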
2.1.1.2 Vectors
Vectors in R are collections of primitives of a single type. For instance, you can have each of the following:
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) # logical vector
Vectors are the most fundamental structures in R. Everything we do is going to be based on vectors and vector operations, which are typically very fast in R. Vectors are built to be a certain size. Changing the size of a vector usually means creating an entirely new vector from an existing one. It is important to note that R uses 1-based indexing. This means that the first element in a vector is obtained by calling a[1], not a[0] as in some other languages.
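To make the indexing and the vector operations concrete, here is a short sketch that reuses the numeric vector a from above. The other variable names are just for illustration:

```r
a <- c(1, 2, 5.3, 6, -2, 4)   # the numeric vector from above

first   <- a[1]     # first element (1-based indexing): 1
middle  <- a[2:3]   # elements 2 through 3: 2.0 5.3
doubled <- a * 2    # vectorized: every element is multiplied at once
empty   <- a[0]     # index 0 gives an empty vector, not the first element

# "Growing" a vector builds a new, longer vector from the old one
a_longer <- c(a, 10)
length(a_longer)    # 7
```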
2.1.1.3 Matrices
Matrices are simply a collection of vectors with the same length. All the elements in a matrix must have the same primitive type. This is a fairly limiting assumption. The following is a simple matrix of zeros:
mat <- matrix(0L,nrow=4,ncol=3)
mat
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0
## [3,]    0    0    0
## [4,]    0    0    0
Matrix elements are then referenced by calling mat[i,j], where i is the row and j is the column.
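Because a matrix of zeros makes it hard to see which element was picked, the following sketch fills a matrix with the numbers 1 to 12 instead (R fills matrices column by column). The name m is just for illustration:

```r
m <- matrix(1:12, nrow = 4, ncol = 3)   # filled column by column

m[2, 3]   # row 2, column 3: 10
m[2, ]    # all of row 2: 2 6 10
m[, 3]    # all of column 3: 9 10 11 12
dim(m)    # number of rows and columns: 4 3
```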
2.1.1.4 Lists
Lists are another kind of collection. Unlike vectors, lists can grow in size. Also unlike vectors, lists can contain objects which are not just primitives. The more advanced classes in R are all built off of the list model. To declare an empty list, we just call the list() command. Then we can put whatever we want into the list. Items in a list are referenced using the double bracket notation. The following is an example of creating and referencing a list:
newlist <- list()
newlist[[1]] <- 5
newlist[[2]] <- c(1,2,3)
print(newlist)
## [[1]]
## [1] 5
##
## [[2]]
## [1] 1 2 3
2.1.1.5 Data Frames
A data.frame is technically a list of vectors, but because it is constrained so that all the vectors have the same length, it functions like a matrix. You should think of a data.frame as a generalized matrix where the vectors can have different primitive types (e.g., string, numeric, factor, logical).
df <- data.frame(x=c(1,2,3),y=c(2.5,3.5,6.5),z=c('a','b','c'))
summary(df)
## x y z
## Min. :1.0 Min. :2.500 a:1
## 1st Qu.:1.5 1st Qu.:3.000 b:1
## Median :2.0 Median :3.500 c:1
## Mean :2.0 Mean :4.167
## 3rd Qu.:2.5 3rd Qu.:5.000
## Max. :3.0 Max. :6.500
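Since a data.frame behaves like both a list and a matrix, you can reach into it either way. The sketch below reuses df from above. Note that the type reported for z depends on your R version: before R 4.0.0, data.frame converted strings to factors by default (as in the summary above), while newer versions keep them as character.

```r
df <- data.frame(x = c(1, 2, 3), y = c(2.5, 3.5, 6.5), z = c("a", "b", "c"))

y_col <- df$y          # list-style access to one column: 2.5 3.5 6.5
y_two <- df[2, "y"]    # matrix-style access, row 2 of column y: 3.5
dim(df)                # 3 rows and 3 columns
sapply(df, class)      # type of each column (z depends on the R version)
```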
2.1.1.6 data.table
Data.tables are just a faster version of data.frames with improved syntax. We are going to be using data.tables, and you will be very familiar with their syntax before long.
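To give a first taste of that syntax, here is a small sketch on a made-up table (the column names group and value are hypothetical). It assumes the data.table package is already installed:

```r
library(data.table)

# Hypothetical toy data
dt <- data.table(group = c("a", "a", "b"), value = c(1, 2, 10))

big    <- dt[value > 1]                             # keep rows where value > 1
totals <- dt[, .(total = sum(value)), by = group]   # sum of value within each group
dt[, double_value := value * 2]                     # add a new column in place
```

The general pattern is dt[rows, columns, by]: the first slot filters rows, the second computes on columns, and by groups the computation.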
2.1.2 Databases
2.1.2.1 Reading files with data.table
Before reading any files it is critically important to set the working directory to the folder where your files are stored. To do this in RStudio you can press Ctrl+Shift+H or go to the Session menu > Set Working Directory > Choose Directory... Alternatively you can set the working directory manually using the following command:
setwd('~/Dropbox/UTDallas/buan6356/post')
In order to read a data set, it is useful to import the data.table (Dowle and Srinivasan 2019) library. If you have not yet installed the library, you will need to do this first (Tools > Install Packages… > type “data.table” and press return). Once you have installed the package, you can run the library command to load the package into memory. You will have to do this whenever you restart R and want to use this package:
library(data.table) # import library into memory
To read a file (e.g., churn.csv), you then have to run fread('churn.csv'), which will read the file. Typically we want to save the contents of the file to a particular variable where we can access them:
churn <- fread('churn.csv') # reading csv file using data.table
Now churn will be a data.table type variable you can modify or summarize. If you want to save a data.table to a file (e.g., MyFile.csv), type:
fwrite(churn,file='MyFile.csv') # data.table write
To learn more about data.tables, the following vignette (Dowle and Srinivasan 2019) is extremely helpful. It is possible to read and write data without using data.tables. Doing so is much slower, and the syntax for working with the resulting data.frames is different, but if you need to do this, you can use the commands:
churn <- read.csv('churn.csv') # data frame read/write (MUCH SLOWER)
write.csv(churn,file='MyFile.csv')
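If you do not have churn.csv at hand, you can still try the read and write commands. The following sketch writes a small made-up table to a temporary file and reads it back; the table small and its columns are hypothetical, and tempfile is used so that nothing is left in your working directory:

```r
library(data.table)

# Hypothetical toy data written to a temporary csv file
small <- data.table(id = 1:3, score = c(0.5, 1.5, 2.5))
tmp <- tempfile(fileext = ".csv")

fwrite(small, file = tmp)   # write the data.table to disk
back <- fread(tmp)          # read it back into a new data.table

class(back)                 # a data.table (and a data.frame)
unlink(tmp)                 # remove the temporary file
```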
2.1.2.2 Data validation
The class function will tell us what type of variable we are working with:
class(churn) # What even is churn?
## [1] "data.table" "data.frame"
Here, the data has class data.table and class data.frame because I read the data with the fread command. If the data had been read using read.csv, it would have only class data.frame and not data.table. Often the first thing to do with data is to get a basic understanding of what the data contains. To look at the first 6 rows of a data set, use:
head(churn) # Get the first 6 rows of churn
## state length code phone intl_plan vm_plan vm_mess day_mins day_calls
## 1: KS 128 415 382-4657 no yes 25 265.1 110
## 2: OH 107 415 371-7191 no yes 26 161.6 123
## 3: NJ 137 415 358-1921 no no 0 243.4 114
## 4: OH 84 408 375-9999 yes no 0 299.4 71
## 5: OK 75 415 330-6626 yes no 0 166.7 113
## 6: AL 118 510 391-8027 yes no 0 223.4 98
## day_charges eve_mins eve_calls eve_charges night_mins night_calls
## 1: 45.07 197.4 99 16.78 244.7 91
## 2: 27.47 195.5 103 16.62 254.4 103
## 3: 41.38 121.2 110 10.30 162.6 104
## 4: 50.90 61.9 88 5.26 196.9 89
## 5: 28.34 148.3 122 12.61 186.9 121
## 6: 37.98 220.6 101 18.75 203.9 118
## night_charges intl_mins intl_calls intl_charges cs_calls churn
## 1: 11.01 10.0 3 2.70 1 False.
## 2: 11.45 13.7 3 3.70 1 False.
## 3: 7.32 12.2 5 3.29 0 False.
## 4: 8.86 6.6 7 1.78 2 False.
## 5: 8.41 10.1 3 2.73 3 False.
## 6: 9.18 6.3 6 1.70 0 False.
To get only the first 5 rows, you can use head(churn,5). Similarly, there is also a tail function for looking at the last 6 rows of a data set. To get just the column names, we can use names:
names(churn) # What are the column names in churn?
## [1] "state" "length" "code" "phone"
## [5] "intl_plan" "vm_plan" "vm_mess" "day_mins"
## [9] "day_calls" "day_charges" "eve_mins" "eve_calls"
## [13] "eve_charges" "night_mins" "night_calls" "night_charges"
## [17] "intl_mins" "intl_calls" "intl_charges" "cs_calls"
## [21] "churn"
To find the number of rows in churn, we can use the nrow function:
nrow(churn) # How many rows are in churn?
## [1] 3333
Similarly, ncol can be used to find the number of columns in churn. To call only one variable in churn (e.g., cs_calls), just type churn$cs_calls. One useful function for working with a single variable is unique:
unique(churn$cs_calls) # What unique values does cs_calls have?
## [1] 1 0 2 3 4 5 7 9 6 8
To understand the data even better, we can pull summary statistics. The summary command provides these summary statistics for us to peruse:
summary(churn) # Summary statistics for churn
## state length code phone
## Length:3333 Min. : 1.0 Min. :408.0 Length:3333
## Class :character 1st Qu.: 74.0 1st Qu.:408.0 Class :character
## Mode :character Median :101.0 Median :415.0 Mode :character
## Mean :101.1 Mean :437.2
## 3rd Qu.:127.0 3rd Qu.:510.0
## Max. :243.0 Max. :510.0
## intl_plan vm_plan vm_mess day_mins
## Length:3333 Length:3333 Min. : 0.000 Min. : 0.0
## Class :character Class :character 1st Qu.: 0.000 1st Qu.:143.7
## Mode :character Mode :character Median : 0.000 Median :179.4
## Mean : 8.099 Mean :179.8
## 3rd Qu.:20.000 3rd Qu.:216.4
## Max. :51.000 Max. :350.8
## day_calls day_charges eve_mins eve_calls
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 87.0 1st Qu.:24.43 1st Qu.:166.6 1st Qu.: 87.0
## Median :101.0 Median :30.50 Median :201.4 Median :100.0
## Mean :100.4 Mean :30.56 Mean :201.0 Mean :100.1
## 3rd Qu.:114.0 3rd Qu.:36.79 3rd Qu.:235.3 3rd Qu.:114.0
## Max. :165.0 Max. :59.64 Max. :363.7 Max. :170.0
## eve_charges night_mins night_calls night_charges
## Min. : 0.00 Min. : 23.2 Min. : 33.0 Min. : 1.040
## 1st Qu.:14.16 1st Qu.:167.0 1st Qu.: 87.0 1st Qu.: 7.520
## Median :17.12 Median :201.2 Median :100.0 Median : 9.050
## Mean :17.08 Mean :200.9 Mean :100.1 Mean : 9.039
## 3rd Qu.:20.00 3rd Qu.:235.3 3rd Qu.:113.0 3rd Qu.:10.590
## Max. :30.91 Max. :395.0 Max. :175.0 Max. :17.770
## intl_mins intl_calls intl_charges cs_calls
## Min. : 0.00 Min. : 0.000 Min. :0.000 Min. :0.000
## 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300 1st Qu.:1.000
## Median :10.30 Median : 4.000 Median :2.780 Median :1.000
## Mean :10.24 Mean : 4.479 Mean :2.765 Mean :1.563
## 3rd Qu.:12.10 3rd Qu.: 6.000 3rd Qu.:3.270 3rd Qu.:2.000
## Max. :20.00 Max. :20.000 Max. :5.400 Max. :9.000
## churn
## Length:3333
## Class :character
## Mode :character
##
##
##
We will discuss the exact meaning behind these sample statistics later, so if some of these are not familiar to you, don’t worry. It is possible to call one particular vector of a data frame or data table using the $ notation. For instance, churn$cs_calls will tell us the number of customer service calls for each person. It is also possible to see how often cs_calls takes particular values:
table(churn$cs_calls) # Distribution of data values
##
## 0 1 2 3 4 5 6 7 8 9
## 697 1181 759 429 166 66 22 9 2 2
Some functions can be applied to two variables at once. For instance, table can be used with two variables to obtain a “two-way” table:
table(churn$cs_calls,churn$churn) # Two-way table distribution
##
## False. True.
## 0 605 92
## 1 1059 122
## 2 672 87
## 3 385 44
## 4 90 76
## 5 26 40
## 6 8 14
## 7 4 5
## 8 1 1
## 9 0 2
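Counts like these are often easier to read as shares. As a sketch, the code below builds a tiny made-up data set (toy is hypothetical and only mimics two columns of churn) and uses prop.table to turn a two-way table into row proportions:

```r
# Hypothetical toy data mimicking two columns of churn
toy <- data.frame(
  cs_calls = c(0, 1, 1, 2, 4, 4),
  churn    = c("False.", "False.", "True.", "False.", "True.", "True.")
)

counts  <- table(toy$cs_calls)              # how often each value occurs
two_way <- table(toy$cs_calls, toy$churn)   # counts by cs_calls and churn
shares  <- prop.table(two_way, margin = 1)  # each row rescaled to sum to 1
shares
```

With margin = 1 each row sums to one, so each entry is the share of customers with that number of calls who did or did not churn.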
2.1.2.3 Plotting
Basic plotting in R is quite simple. If you have two different variables, the plot function will produce a simple xy scatter plot:
plot(churn$length,churn$cs_calls)
This plot can be helpful, but it isn’t production quality. Typically when we want to make nice-looking plots we use the ggplot2 package (Wickham et al. 2018):
library(ggplot2)
ggplot(churn,aes(x=length,y=cs_calls)) + geom_point() +
scale_x_continuous(name="Account length") +
scale_y_continuous(name="Customer service calls")
The code here looks complicated, but if we break it down, it becomes clear. The ggplot command sets up our axes. This specifies that our data is churn and the x and y variables are going to be length and cs_calls. Then we added points to this plot with the geom_point() command. The other commands are just setting the axis labels and are not necessary. You can find a ggplot command for almost every kind of static plot you can think of. This reference has 50 different ggplots that you can use to create almost any kind of visualization you could want.
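The same layering idea works on any data set. The sketch below uses a small made-up data frame (toy is hypothetical) so that it runs without churn.csv, and it sets both axis labels in one step with labs, which is an alternative to the two scale commands above:

```r
library(ggplot2)

# Hypothetical toy data with the same column names as churn
toy <- data.frame(length = c(10, 50, 90, 130), cs_calls = c(0, 2, 1, 3))

p <- ggplot(toy, aes(x = length, y = cs_calls)) +   # data and axes
  geom_point() +                                    # add the points
  labs(x = "Account length", y = "Customer service calls")

print(p)   # draw the plot
```

Storing the plot in a variable such as p lets you add further layers later or print it only when you need it.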
2.1.2.4 Merging data
2.1.2.5 Structure of databases
2.1.2.6 Connecting to databases
To connect to SQL databases (specifically the wooldridge2 database we are going to use almost exclusively), we are going to use the DBI package (R Special Interest Group on Databases (R-SIG-DB), Wickham, and Müller 2018) and the RSQLite package (Müller et al. 2018). Install those packages and then you can call:
library(DBI)
This command loads the DBI library, which has the procedures for working with databases. We are then going to use the RSQLite library to access the wooldridge2.db file.
con <- dbConnect(RSQLite::SQLite(),'wooldridge2.db')
dbConnect from DBI opens a connection to the database. The file is now “in use” and cannot be modified by other programs until we close the connection. To see what is in the database (the table names) we can call:
dbListTables(con)
## [1] "401k" "401k_labels" "401ksubs"
## [4] "401ksubs_labels" "admnrev" "admnrev_labels"
## [7] "affairs" "affairs_labels" "airfare"
## [10] "airfare_labels" "alcohol" "alcohol_labels"
## [13] "apple" "apple_labels" "athlet1"
## [16] "athlet1_labels" "athlet2" "athlet2_labels"
## [19] "attend" "attend_labels" "audit"
## [22] "audit_labels" "barium" "barium_labels"
## [25] "beauty" "beauty_labels" "benefits"
## [28] "benefits_labels" "beveridge" "beveridge_labels"
## [31] "big9salary" "big9salary_labels" "bwght"
## [34] "bwght2" "bwght2_labels" "bwght_labels"
## [37] "campus" "campus_labels" "card"
## [40] "card_labels" "cement" "cement_labels"
## [43] "ceosal1" "ceosal1_labels" "ceosal2"
## [46] "ceosal2_labels" "charity" "charity_labels"
## [49] "consump" "consump_labels"
Now if we want to pull a particular table from the database, we can use the dbReadTable function on the connection with the table name. This function returns a data.frame object. Because we prefer data.tables, we are then going to convert the data.frame to a data.table with the data.table() function and store that data.table as a variable:
bwght <- data.table(dbReadTable(con,'bwght'))
In the wooldridge2 database, every table of data has some labels describing that data. To see the labels for the bwght
data, call:
dbReadTable(con,'bwght_labels')
## index variable.name type format variable.label
## 1 0 faminc float %9.0g 1988 family income, $1000s
## 2 1 cigtax float %9.0g cig. tax in home state, 1988
## 3 2 cigprice float %9.0g cig. price in home state, 1988
## 4 3 bwght int %8.0g birth weight, ounces
## 5 4 fatheduc byte %8.0g father's yrs of educ
## 6 5 motheduc byte %8.0g mother's yrs of educ
## 7 6 parity byte %8.0g birth order of child
## 8 7 male byte %8.0g =1 if male child
## 9 8 white byte %8.0g =1 if white
## 10 9 cigs byte %8.0g cigs smked per day while preg
Now all there is left to do is disconnect from the database:
dbDisconnect(con)
This closes the connection so that other programs can use or manipulate the file again. Don’t memorize these database commands. Below, we will write a wrapper function for all these operations so that we don’t have to remember them specifically.
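If you do not have wooldridge2.db yet, you can still practice the whole connect, list, read, disconnect cycle. The sketch below assumes the DBI and RSQLite packages are installed and uses a temporary in-memory SQLite database with a made-up table (toy and its columns are hypothetical). It also shows dbGetQuery, which runs an SQL query and returns the result as a data.frame:

```r
library(DBI)

# ":memory:" creates a temporary database that disappears on disconnect
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table
toy <- data.frame(id = 1:3, weight = c(110, 125, 98))
dbWriteTable(con, "toy", toy)   # store the data.frame as a table

tables <- dbListTables(con)         # table names in the database
pulled <- dbReadTable(con, "toy")   # whole table as a data.frame
heavy  <- dbGetQuery(con, "SELECT * FROM toy WHERE weight > 100")

dbDisconnect(con)                   # always close the connection
```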
2.1.3 Control structures
2.1.3.1 If statements
If statements are one of the core building blocks of computer code. If a condition is met, we are going to execute some code. Before we do this, let’s create a number x:
x <- rnorm(1,mean=5,sd=1)
x
## [1] 4.556368
Here x was randomly created, so I (the author) don’t know whether it is less than \(5\) or greater than \(5\) because the code hasn’t been executed at the time I’m writing this. Now look at the following if statement:
if(x < 5){
print("x less than 5")
}
The code here (if it were executed) would first check if the statement in the parentheses is true (i.e., x<5). If it is true, it will execute the code in the braces (i.e., print("x less than 5")). So if x is less than \(5\), this would print out “x less than 5”. We can also add a case for when the condition is false using the else clause:
if(x < 5){
print("x less than 5")
}else{
print("x greater than 5")
}
## [1] "x less than 5"
Here you can see the output which will either say “x less than 5” or “x greater than 5”.
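When there are more than two cases, conditions can be chained with else if. In the sketch below, x is set to a fixed value (an assumption made here so that the result is predictable) and the message is stored in a variable msg, a name chosen for illustration:

```r
x <- 5   # fixed value instead of a random draw

if (x < 5) {
  msg <- "x less than 5"
} else if (x == 5) {
  msg <- "x equal to 5"
} else {
  msg <- "x greater than 5"
}
print(msg)   # "x equal to 5"
```

R checks the conditions from top to bottom and runs only the first block whose condition is true.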
2.1.3.2 For loops
The second basic building block of coding is looping. For this purpose, we are going to use the for loop. A for loop is a loop that runs for a specific number of cycles. It uses an iterator, which is a variable that grows or shrinks each cycle. The following loop uses the iterator i, which starts at \(1\) and grows to \(5\) by steps of \(1\). So i goes \(1\), \(2\), \(3\), \(4\), and then \(5\) and the loop is over. The code inside the braces is going to be run each time. So for each value of i, R is going to print that value. So this loop prints the numbers \(1\) to \(5\), one at a time:
for(i in 1:5){
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
We can loop over other sets too. For instance, the seq(a,b,c) function generates a sequence of numbers from a to b in steps of c:
for(i in seq(0,1,0.2)){
print(i)
}
## [1] 0
## [1] 0.2
## [1] 0.4
## [1] 0.6
## [1] 0.8
## [1] 1
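Loops usually do more than print. Two common patterns are accumulating a result across cycles and looping over the positions of a vector. The following sketch shows both; the variable names are made up for illustration:

```r
# Accumulate a running total: 1 + 2 + 3 + 4 + 5
total <- 0
for (i in 1:5) {
  total <- total + i
}
total   # 15

# Loop over the positions of a vector with seq_along()
words   <- c("one", "two", "three")
n_chars <- numeric(length(words))      # pre-allocate the result vector
for (k in seq_along(words)) {
  n_chars[k] <- nchar(words[k])        # number of characters in each word
}
n_chars   # 3 3 5
```

Creating n_chars at its final size before the loop avoids rebuilding the vector each cycle.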
2.1.3.3 User-defined functions
The last important building block of computer code is the user-defined function. Here I am going to create a sample standard deviation function. This function already exists in R, so this is an illustration of the concept of creating functions rather than a practical example of code that is going to save us time. Below in section 2.4.1.5, we are going to create a function that is actually helpful to us. To define a function, you just have to save the function call to a variable name (e.g., sd):
sd <- function(data){
val <- var(data)
val <- sqrt(val)
return(val)
}
Here we are defining the sd function, which masks R’s built-in function of the same name for the rest of the session. This function takes some input data. At the end of the function, the function outputs something using the return call. Here we are returning val. So how is R going to calculate val? First, val is obtained by finding the sample variance of data. Once we have the sample variance, we modify val by taking its square root: val <- sqrt(val). Then we simply return val to the user. So:
sd(rnorm(100,mean=5,sd=2))
## [1] 2.233749
is about \(2\).
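A good habit with any function you write is to check it against a known answer. The sketch below defines the same calculation under the name my_sd (a name chosen here so that R's built-in sd stays available) and compares it with stats::sd on a small fixed vector:

```r
# Same calculation as above, under a name that does not mask the built-in sd
my_sd <- function(data) {
  val <- var(data)    # sample variance
  val <- sqrt(val)    # square root gives the sample standard deviation
  return(val)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)            # fixed data, so the result is repeatable
same <- all.equal(my_sd(x), stats::sd(x))
same   # TRUE
```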
2.1.3.4 wpull function
References
Dowle, Matt, and Arun Srinivasan. 2019. Data.table: Extension of ‘Data.frame’. https://CRAN.R-project.org/package=data.table.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, and Kara Woo. 2018. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
R Special Interest Group on Databases (R-SIG-DB), Hadley Wickham, and Kirill Müller. 2018. DBI: R Database Interface. https://CRAN.R-project.org/package=DBI.
Müller, Kirill, Hadley Wickham, David A. James, and Seth Falcon. 2018. RSQLite: ‘SQLite’ Interface for R. https://CRAN.R-project.org/package=RSQLite.