1 Introduction

The purpose of this project is to create a machine learning model that can predict the price of a diamond based on its charateristics. This report will start with an initial exploratory data analysis (EDA). The EDA will be filled with data visualizations and inital data insights to help understand a data set. The EDA is then followed by a model exploration. Different machine learning alogrithms will be used to create predictive models. Models will then be compared against each other and the most accurate predictive model will be determined. The conclusion section will provide an executive summary of data insights and the selected predictive model.

2 EDA

2.1 First Look

This data set is provided in the ggplot2 package from CRAN. Let’s load the tidyverse collection of packages, which includes ggplot2 and some other packages (like dplyr) which are useful for data manipulation. Let’s also load the plotly package, which allows us to create advanced visualizations.

# Load libraries needed for EDA
library(tidyverse)
library(plotly)

# save diamonds data into an object called data
data <- ggplot2::diamonds

A sample of the data is provided below:

head(data)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Looks like there is a mix of numeric and categorical data. The numeric data (carat, depth, table, x, y, & z) are doubles (dbl), which means they are numerics with decimal places. Price is a numeric integer, so it will not have any decimal places. The categorical data (cut, color, and clarity) are ordered (ord), which means it has a specific order, such as best to worst. Similar to the sample above, the glimpse function provides a sample of the data in a different format that is more preferable when a data set is very large.

glimpse(data)
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...

There are 53940 rows of data. Each row is one observation / one diamond. There are 10 columns of data. Each column is a variable / feature. Below is a quick statistical summary of each feature:

summary(data)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

A minimum, 1st quartile, median, mean, 3rd quartile, and maximum is provided for all the numeric data. For example, the median price of diamonds in this data set is 2401. A data count is provided for all categorical data. For example, there are 21551 diamonds that are Ideal cut in this data set. Since all the categorical data is ordered, the counts are in its pre-specified order. For the example, the order for color is: D, E, F, G, H, I, J. Below is a definition for each feature:

carat: weight of the diamond (0.2 to 5.01)
cut: quality of the cut (in order of worst to best; Fair, Good, Very Good, Premium, Ideal)
color: diamond color (in order of best to worst; D, E, F, G, H, I, J)
clarity: a measurement of how clear the diamond is (in order of worst to best; I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF)
depth: total depth percentage (total height divided by total width) = 2 * z / (x + y) (43 to 79)
table: width of the top of a diamond relative to its widest point (43 to 95)
price: price in US dollars (326 to 18823)
x: length in mm (0 to 10.74)
y: width in mm (0 to 58.9)
z: depth in mm (0 to 31.8)

My OCD is kicking in a little bit for color. It is in the order of best to worst, while cut and clarity are in order of worst to best. The script below reorders color from worst to best for consistency. This will have no effect what-so-ever on the analysis, it’s just my personal preference.

# Reordering color so it's from worst to best. Just like all the other ordered features.
# This is personal preference. It's not a necessary step.
data <- data %>%                  # Create a new object, data, from the old object, data
  mutate(                         # Change its columns
    color = ordered(color,        # Make the color column and ordered column
                    levels = c(   # The order/levels are as follows:
                      "J",
                      "I",
                      "H",
                      "G",
                      "F",
                      "E",
                      "D"
                    )))

Now we can get to the interesting stuff. Let’s look at how price is affected by each individual feature.

2.1.1 Price vs Carat

Below is a scatter plot of price vs carat:

ggplot(data, aes(x=carat, y=price)) + # Plot the object data, with carat mapped to the x axis and price mapped to the y axis
  geom_point(alpha=0.1)               # Make the plot a scatter plot. Make each point 10% transparent/alpha


Price increases at an increasing rate as carat increases.

2.1.2 Price vs Cut

Below is a box-plot of price by cut.

{ggplot(data, aes(x=cut, y=price)) + # Plot the object data, with cut mapped to the x axis and price mapped to the y axis
  geom_boxplot()} %>%                # Make the plot a box plot
  ggplotly()                         # Make the plot interactive with the ggplotly function


There is no immediately visible trend. I would expect price to increase as the cut becomes better. However, other features may be affecting this box-plot. For example, maybe the average carat size for ideal diamonds is lower than the average carat size for fair. Therefore, the average price for ideal cuts in this data set are lower than fair cuts. This will be explored in more depth later on in this analysis. For now, we are only going to look at price vs each individual feature.

2.1.3 Price vs Color

Below is a box-plot of price by color.

{ggplot(data, aes(x=color, y=price))+
  geom_boxplot()} %>% 
  ggplotly()


Very interesting! As color gets better, price decreases. This absolutely does not pass the common sense test. Similar to cut, other features (like carat) must be affecting these results.

2.1.4 Price vs Clarity

Below is a box-plot of price by clarity.

{ggplot(data, aes(x=clarity, y=price))+
  geom_boxplot()} %>% 
  ggplotly()


These are similar results to color. As clarity gets better, price decreases. Once again, this does not pass the common sense test.

2.1.5 Price vs Depth Percentage

Below is a scatter plot of price vs depth percentage.

ggplot(data, aes(x=depth, y=price))+
  geom_point(alpha=0.1)


It looks like as the range of depth narrows, the price increases. It appears more expensive diamonds are roughly around 58% to 63% depth percentage.

2.1.6 Price vs Table Percentage

Below is a scatter plot of price vs table percentage.

ggplot(data, aes(x=table, y=price))+
  geom_point(alpha=0.1)


This looks similar to depth percentage. As the range of table percentage narrows, the price increases. Although, it doesn’t to appear to be as drastic as depth percentage.

2.1.7 Price vs Length (x)

Below is a scatter plot of price vs length (x).

ggplot(data, aes(x=x, y=price))+
  geom_point(alpha=0.1)


This looks similar to carat. Price increases at an increasing rate as length increases. There’s also some data points that have a length of zero. These must be missing data points and should be converted to NAs. If they are not, this can mess up some predictive models down the line. First let’s find out how many zeros there are.

df <- data %>%     # Create a new object df from the object data
  filter(x==0)     # show only the data that has a zero value for x

print(df)          # Display the object df
## # A tibble: 8 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1.07 Ideal     F     SI2      61.6    56  4954     0  6.62     0
## 2  1    Very Good H     VS2      63.3    53  5139     0  0        0
## 3  1.14 Fair      G     VS1      57.5    67  6381     0  0        0
## 4  1.56 Ideal     G     VS2      62.2    54 12800     0  0        0
## 5  1.2  Premium   D     VVS1     62.1    59 15686     0  0        0
## 6  2.25 Premium   H     SI2      62.8    59 18034     0  0        0
## 7  0.71 Good      F     SI2      64.1    60  2130     0  0        0
## 8  0.71 Good      F     SI2      64.1    60  2130     0  0        0

There are 8 rows of data with x labeled as zero and there are 53940 rows of data total. This is a small percent (1.483129410^{-4}) of the total data, but this should still be addressed. These values will be converted into NAs and what to do with them will be determined during model exploration. These rows of data could be removed from the data set completely, or a median length can be assumed, or potentially x can be calculated from depth and table percentage. Is it even worth addressing it all? This will be determined later in this project.

I also see zeros for y and z. The script below determines how many zeros there are for width (y):

df <- data %>% 
  filter(y==0)

nrow(df)   # display how many rows are in object df
## [1] 7

And for depth (z):

df <- data %>% 
  filter(z==0)

nrow(df)
## [1] 20

There are 20 missing data points for z. That is 3.707823510^{-4} percent of the data. It’s slightly more that x and y, but still a small sample. The script below will convert all zeros for x, y, and z into NAs, specifically numeric NAs.

data <- data %>%          # create a new object called data, from the old object called data
  mutate(                 # change the columns in the old object
    x = case_when(        # Change the column x by the following conditions
      x==0 ~ NA_real_,    # When x is 0, make it a NA
      TRUE ~ x),          # For everything else, keep x as is
    y = case_when(        # same logic as x
      y==0 ~ NA_real_,   
      TRUE ~ y),   
    z = case_when(        # same logic as x
      z==0 ~ NA_real_,
      TRUE ~ z))

All zeros are now NAs.

2.1.8 Price vs Width (y)

Below is a scatter plot of price vs width (y).

ggplot(data, aes(x=y, y=price))+
  geom_point(alpha=0.1)


First off, there are some outliers past a y of 20. Let’s see what these points are.

df <- data %>% 
  filter(y>20)

print(df)
## # A tibble: 2 x 10
##   carat cut     color clarity depth table price     x     y     z
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  2    Premium H     SI2      58.9    57 12210  8.09  58.9  8.06
## 2  0.51 Ideal   E     VS1      61.8    55  2075  5.15  31.8  5.12

These values for y can’t be right. One way to double check is to use the equation for depth percentage: depth = 2 * z/(x + y). Using a little algebra we can solve for y: y = (2z/depth) - x. Before using these equations, we need a proof of concept. Depth and y will be calculated using these equations and we’ll check if they equal the depth and y provided in the data. If they do, the equations are true.

df <- data %>%
  select(depth, table, x, y, z) %>%   # Select only these columns from data
  mutate(
    depth_calc = round(((2*z)/(x+y))*100, # create a new column, depth_calc using the equation
                       digits = 1),       # round to the nearest one digit
    y_calc = round((2*z/(depth/100))-x,   # create a new column, y_calc using the equation
                   digits = 2),           # round to the nearest two digits
    depth_check = depth==depth_calc,      # create depth_check, which is true if depth equals depth_calc
    y_check = y==y_calc) %>%              # create y_check, which is true if y equals y_calc
    select(depth, depth_calc, depth_check, y, y_calc, y_check) # Select only these columns
  
head(df, n = 20) # show the first 20 rows
## # A tibble: 20 x 6
##    depth depth_calc depth_check     y y_calc y_check
##    <dbl>      <dbl> <lgl>       <dbl>  <dbl> <lgl>  
##  1  61.5       61.3 FALSE        3.98   3.95 FALSE  
##  2  59.8       59.8 TRUE         3.84   3.84 TRUE   
##  3  56.9       56.9 TRUE         4.07   4.07 TRUE   
##  4  62.4       62.4 TRUE         4.23   4.23 TRUE   
##  5  63.3       63.3 TRUE         4.35   4.35 TRUE   
##  6  62.8       62.8 TRUE         3.96   3.96 TRUE   
##  7  62.3       62.3 TRUE         3.98   3.98 TRUE   
##  8  61.9       61.9 TRUE         4.11   4.1  FALSE  
##  9  65.1       65.1 TRUE         3.78   3.78 TRUE   
## 10  59.4       59.4 TRUE         4.05   4.05 TRUE   
## 11  64         64   TRUE         4.28   4.28 TRUE   
## 12  62.8       62.8 TRUE         3.9    3.9  TRUE   
## 13  60.4       60.4 TRUE         3.84   3.84 TRUE   
## 14  62.2       62.2 TRUE         4.37   4.36 FALSE  
## 15  60.2       60.2 TRUE         3.75   3.75 TRUE   
## 16  60.9       60.9 TRUE         4.42   4.42 TRUE   
## 17  62         62   TRUE         4.34   4.34 TRUE   
## 18  63.4       63.4 TRUE         4.29   4.29 TRUE   
## 19  63.8       63.8 TRUE         4.26   4.27 FALSE  
## 20  62.7       62.7 TRUE         4.27   4.27 TRUE

It looks like the equations work for the most part. In some instances they don’t match 100%, but they are in the ballpark. This may be due to rounding errors. Now that these equations are confirmed. Let’s calculate y for the two outliers.

df <- data %>% 
  filter(y>20) %>% 
  mutate(y_calc = round((2*z/(depth/100))-x,
                   digits = 2))

print(df)
## # A tibble: 2 x 11
##   carat cut     color clarity depth table price     x     y     z y_calc
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>  <dbl>
## 1  2    Premium H     SI2      58.9    57 12210  8.09  58.9  8.06   19.3
## 2  0.51 Ideal   E     VS1      61.8    55  2075  5.15  31.8  5.12   11.4

These values seem a lot more reasonable. Now let’s commit this change to the data and re-plot.

data <- data %>% 
  mutate(
    y = case_when(
      y>20 ~ round((2*z/(depth/100))-x,digits = 2),
      TRUE ~ y))
ggplot(data, aes(x=y, y=price))+
  geom_point(alpha=0.1)


Those data points still look like they may be outliers but they are more reasonable than before, and trend is easier to see in this chart. Price increases at an increasing rate as y increases.

2.1.9 Price vs Depth (z)

Below is a scatter plot of price vs depth (z).

ggplot(data, aes(x=z, y=price))+
  geom_point(alpha=0.1)


Looks like there is an outlier past a depth (z) of 30. This may be a similar case as seen with width (y). Below shows this data point:

df <- data %>% 
  filter(z>30)

print(df)
## # A tibble: 1 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.51 Very Good E     VS1      61.8  54.7  1970  5.12  5.15  31.8

This doesn’t pass the common sense test. Using the depth percentage equation we can calcuate depth (z) by using: z = depth*(x+y)/2. The results are below:

df <- df %>% 
  mutate(
    z_calc = round(((depth/100)*(x+y))/2, digits = 2)
  )

print(df)
## # A tibble: 1 x 11
##   carat cut       color clarity depth table price     x     y     z z_calc
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>  <dbl>
## 1  0.51 Very Good E     VS1      61.8  54.7  1970  5.12  5.15  31.8   3.17

The calculated z looks more reasonable. The script below commits this change to the data and re-plots.

data <- data %>% 
  mutate(
    z = case_when(
      z>30 ~ round(((depth/100)*(x+y))/2, digits = 2),
      TRUE ~ z
    )
  )
ggplot(data, aes(x=z, y=price))+
  geom_point(alpha=0.1)


That’s better! This trend appears to be the same as carat, x, and y. Now, onto some more detailed data exploration.

2.2 Best of the Best

Best of the Best: Imagine someone wants to buy the best possible diamond, and money is no object. They only want to consider diamonds in the top categories of cut (Ideal), color (D), and clarity (IF). They want the most ideal range for depth (59-63) and table (54-57). Within the dataset, if we plot carat versus price can we fit a clean trendline? What’s the price of the largest carat, and is it the most expensive?

First we need to filter the data set to only the best cut, color, clarity, depth, and table.

df <- data %>% 
  filter(
    cut=="Ideal",
    color=="D",
    clarity=="IF",
    between(depth, 59, 63),
    between(table, 54, 57))

print(df)
## # A tibble: 24 x 10
##    carat cut   color clarity depth table price     x     y     z
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.51 Ideal D     IF       62      56  3446  5.14  5.18  3.2 
##  2  0.51 Ideal D     IF       62.1    55  3446  5.12  5.13  3.19
##  3  0.53 Ideal D     IF       61.5    54  3517  5.27  5.21  3.22
##  4  0.53 Ideal D     IF       62.2    55  3812  5.17  5.19  3.22
##  5  0.59 Ideal D     IF       60.9    57  4208  5.4   5.43  3.3 
##  6  0.56 Ideal D     IF       62.4    56  4216  5.24  5.28  3.28
##  7  0.56 Ideal D     IF       61.9    57  4293  5.28  5.31  3.28
##  8  0.63 Ideal D     IF       62.5    55  6549  5.47  5.5   3.43
##  9  0.63 Ideal D     IF       62.5    55  6607  5.5   5.47  3.43
## 10  1.04 Ideal D     IF       61.8    57 14494  6.49  6.52  4.02
## # ... with 14 more rows


This filters the data to only 24 options. Let’s plot carat versus price.

ggplot(df, aes(x=carat, y=price)) +
  geom_point()


Looks like a linear model could be viable. Let’s create one.

model_df_lm <- lm(price ~ carat, df)

summary(model_df_lm)
## 
## Call:
## lm(formula = price ~ carat, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2123.5 -1264.7   245.6  1026.5  2344.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5621.9      623.1  -9.023 7.57e-09 ***
## carat        20259.9      911.8  22.221  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1288 on 22 degrees of freedom
## Multiple R-squared:  0.9573, Adjusted R-squared:  0.9554 
## F-statistic: 493.8 on 1 and 22 DF,  p-value: < 2.2e-16


Carat passes the T test (p-value) by a mile and the R squared isn’t too bad. Let’s plot this model now.

pred_df_lm <- predict(model_df_lm, df)

df <- df %>% 
  mutate(pred = pred_df_lm)

ggplot(df, aes(x=carat, y=price)) +
  geom_point(color="blue") +
  geom_line(color="red", aes(y=pred))


Looks like the most expensive diamond might not be the largest one. Let’s find these two data points.

df2 <- df %>% 
  filter(
    carat == max(carat) |
      price == max(price)
  )

print(df2)
## # A tibble: 2 x 11
##   carat cut   color clarity depth table price     x     y     z   pred
##   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>  <dbl>
## 1  1.07 Ideal D     IF       60.9    54 17042  6.66  6.73  4.08 16056.
## 2  1.03 Ideal D     IF       62      56 17590  6.55  6.44  4.03 15246.


Interesting! The most expensive diamond is the not the largest diamond, but it has a larger depth and table compared to the largest carat diamond. This needs a closer look.

2.3 3D Scatter Plots

This next section will explore the data using 3-dimensional scatter plots.

First let’s create a 2D scatter plot of price vs carat.

ggplot(data, aes(x=carat, y=price)) +
  geom_point()


Next, let’s create a scatter plot of price vs depth.

ggplot(data, aes(x=depth, y=price)) +
  geom_point()


Now let’s create a scatter plot of price vs table.

ggplot(data, aes(x=table, y=price)) +
  geom_point()

2.3.1 Price vs Carat vs Depth

A 3D scatter plot of price vs carat vs depth. This chart is interactive. Click and drag your mouse to change the orientation of the chart.

plot_ly(data=data, x=~price, y=~carat, z=~depth, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="depth")))


A 3D scatter plot of price vs carat vs depth with cut mapped to color. This chart shows that the range of depth is narrower for diamonds with a better cut, like ideal.

plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~cut, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="depth")))


A 3D scatter plot of price vs carat vs depth with color mapped to color. Like previous charts, this chart shows that price increases as carat increases. However, it also shows that price increases at a faster rate for diamonds with a better color.

plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~color, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="depth")))


A 3D scatter plot of price vs carat vs depth with clarity mapped to color. It appears that price increases at a faster rate for diamonds with a better clarity, similar to color.

plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~clarity, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="depth")))

2.3.2 Price vs Carat vs Table

A 3D scatter plot of price vs carat vs table.

plot_ly(data=data, x=~price, y=~carat, z=~table, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Table")))


A 3D scatter plot of price vs carat vs table with cut mapped to color. Diamonds with a better cut have lower table percentages than diamonds with a worse cut.

plot_ly(data=data, x=~price, y=~carat, z=~table, color=~cut, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="table")))


A 3D scatter plot of price vs carat vs table with color mapped to color.

plot_ly(data=data, x=~price, y=~carat, z=~table, color=~color, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="table")))


A 3D scatter plot of price vs carat vs table with clarity mapped to color.

plot_ly(data=data, x=~price, y=~carat, z=~table, color=~clarity, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="table")))


3 Model Exploration

The script below summarizes all the data prep and cleaning steps in previous sections.

# Load libraries needed for model exploration
library(tidyverse)      # for data manipulation
library(Metrics)        # to quickly develop metrics to determine accuracy of models
library(rpart)          # to create decision tree models
library(rpart.plot)     # to plot decision tree models
library(randomForest)   # to create random forest models
library(gbm)            # to create gradient boosted models
library(broom)          # to analyze model performance

# This script is to prep the data
# This way I don't have to re-run everything before this
data <- ggplot2::diamonds

# Reordering color so it's from worst to best. Just like all the other ordered features.
# This is personal preference. It's not a necessary step.
data <- data %>%                  # Create a new object, data, from the old object, data
  mutate(                         # Change its columns
    color = ordered(color,        # Make the color column and ordered column
                    levels = c(   # The order/levels are as follows:
                      "J",
                      "I",
                      "H",
                      "G",
                      "F",
                      "E",
                      "D"
                    )))

# Changing zeros into NAs
data <- data %>%          # create a new object called data, from the old object called data
  mutate(                 # change the columns in the old object
    x = case_when(        # Change the column x by the following conditions
      x==0 ~ NA_real_,    # When x is 0, make it a NA
      TRUE ~ x),          # For everything else, keep x as is
    y = case_when(        # same logic as x
      y==0 ~ NA_real_,   
      TRUE ~ y),   
    z = case_when(        # same logic as x
      z==0 ~ NA_real_,
      TRUE ~ z))

# Estimate y outliers
data <- data %>% 
  mutate(
    y = case_when(
      y>20 ~ round((2*z/(depth/100))-x,digits = 2),
      TRUE ~ y))

# Estimate z outliers
data <- data %>% 
  mutate(
    z = case_when(
      z>30 ~ round(((depth/100)*(x+y))/2, digits = 2),
      TRUE ~ z
    )
  )

One more step before we go into creating models. Let’s create two random samples from the data. One sample will train the ML model, this is called the training data. And another smaller sample will be used to test the accuracy of the ML, this is called the testing data. Splitting up the data this way will allow for a fair comparison between models that doesn’t allow an overfit model to have a competitive edge over other models.

# Set the seed for reproducibility
set.seed(123)

# Create a random number vector from 1 to the number of rows in the data set.
# Make the length of the vector equal to about 80% of the data.
train_index <- sample(1:nrow(data), round(nrow(data)*0.8, digits = 0))

# subset the data into training data (80% of the data)
train <- data[train_index,]
# subset the data into testing data (the remaining 20%)
test <- data[-train_index,]

Now let’s create some predictive machine learning models!

3.1 Linear Regression

First, let’s use a classic linear regression model. We’ll start with a simple linear regression and then try some multivariate linear regression.

3.1.1 Simple Linear Regression

Simple linear regression is your classic y=b+mx. The y is your dependent variable; the variable we wish to predict. In this case, it is price. The x is you independent variable; the feature that will predict the dependent variable. In this case, we’ll start with carat which seemed to have a strong correlation with price. The m is your slope and the b is you y-intercept. The script below creates the simple linear model:

# Set the seed for reproducibility
set.seed(123)

# Create an object called model_lm1 by using the lm fucntion
model_lm1 <- lm(price ~ carat,        # The formula is price as a function of carat
                train)                 # The data is an object called train

summary(model_lm1) # Stats summary of the model
## 
## Call:
## lm(formula = price ~ carat, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18610.5   -802.5    -17.3    540.3  12730.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2261.06      14.48  -156.1   <2e-16 ***
## carat        7762.38      15.61   497.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1538 on 43150 degrees of freedom
## Multiple R-squared:  0.8514, Adjusted R-squared:  0.8514 
## F-statistic: 2.472e+05 on 1 and 43150 DF,  p-value: < 2.2e-16

The created formula is Price = -2261.0600631 + -2261.0600631 * Carat. Now, let’s use this model to predict the values in the test data, and then analyze its effectiveness using three metrics: MAE, RMSE, and Adjusted R Square.

# Create predictions for the test data using the simple linear model
predict_lm1 <- predict(model_lm1, test)

# calculate mean absolute error of model's predictions
mae_lm1 <- round(mae(test$price, predict_lm1), digits = 2)
mae_lm1
## [1] 1032.87
# calculate root mean squared error of model's predictions
rmse_lm1 <- round(rmse(test$price, predict_lm1), digits = 2)
rmse_lm1
## [1] 1591.2
# calculate r squared
SSR_lm1 <-sse(test$price, predict_lm1)      # residual sum of squares
SST_lm1 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_lm1 <- 1 - SSR_lm1/SST_lm1

# calculate adjusted r squared
RsqAdj_lm1 <- round(1 - (1-Rsq_lm1)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_lm1
## [1] 0.8411

This model predicted the test data with a mean absolute error of 1032.87, a root mean squared error of 1591.2, and a adjusted R square of 0.8411. Mean absolute error is the average of the absolute difference between actual values and the predictions. The root mean squared error is the square root of the squared difference between the predictions and actual values. The adjusted r square measures how well the actual outcomes are replicated by the regression and is adjusted (penalized) for how many independent variables are used for the model. Now let’s build a multivariate linear regression and compare the results.

3.1.2 Multivariate Linear Regression

3.1.2.1 Price vs All

The script below creates a multivariate linear regression model using all the potential variables in the data.

# Set the seed for reproducibility
set.seed(123)

# Create an object called model_lm1 by using the lm fucntion
model_lm2 <- lm(price ~ .,        # The formula is price as a function of all other variables
                train,                # The data is an object called train
                na.action = na.omit)  # Exclude NAs               

summary(model_lm2) # Stats summary of the model
## 
## Call:
## lm(formula = price ~ ., data = train, na.action = na.omit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21720.6   -579.1   -178.8    364.6  10784.7 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1188.181    744.231  -1.597 0.110380    
## carat       11629.283     57.356 202.754  < 2e-16 ***
## cut.L         568.775     24.948  22.798  < 2e-16 ***
## cut.Q        -272.345     20.071 -13.569  < 2e-16 ***
## cut.C         107.724     17.423   6.183 6.35e-10 ***
## cut^4         -18.884     13.768  -1.372 0.170209    
## color.L      1946.220     19.171 101.518  < 2e-16 ***
## color.Q      -670.107     17.428 -38.450  < 2e-16 ***
## color.C       169.228     16.288  10.390  < 2e-16 ***
## color^4        52.363     14.979   3.496 0.000473 ***
## color^5        96.591     14.159   6.822 9.10e-12 ***
## color^6       -38.716     12.837  -3.016 0.002563 ** 
## clarity.L    4025.203     33.855 118.896  < 2e-16 ***
## clarity.Q   -1928.536     31.603 -61.025  < 2e-16 ***
## clarity.C     955.613     26.991  35.405  < 2e-16 ***
## clarity^4    -367.537     21.478 -17.112  < 2e-16 ***
## clarity^5     223.968     17.469  12.821  < 2e-16 ***
## clarity^6       9.599     15.157   0.633 0.526534    
## clarity^7      85.135     13.369   6.368 1.93e-10 ***
## depth          59.744     10.992   5.435 5.50e-08 ***
## table         -27.361      3.222  -8.492  < 2e-16 ***
## x            -847.530     87.841  -9.648  < 2e-16 ***
## y            1034.730     85.048  12.166  < 2e-16 ***
## z           -2234.162    169.734 -13.163  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1117 on 43113 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared:  0.9215, Adjusted R-squared:  0.9215 
## F-statistic: 2.201e+04 on 23 and 43113 DF,  p-value: < 2.2e-16

Now let’s predict the testing data.

# Create predictions for the test data using the simple linear model
predict_lm2 <- predict(model_lm2, test,
                       na.action = na.omit) # exclude NAs

# calculate mean absolute error of model's predictions
mae_lm2 <- round(mae(test$price, predict_lm2), digits = 2)
mae_lm2
## [1] 1109.38
# calculate root mean squared error of model's predictions
rmse_lm2 <- round(rmse(test$price, predict_lm2), digits = 2)
rmse_lm2
## [1] 2238.97
# calculate r squared
SSR_lm2 <-sse(test$price, predict_lm2)      # residual sum of squares
SST_lm2 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_lm2 <- 1 - SSR_lm2/SST_lm2

# calculate adjusted r squared
RsqAdj_lm2 <- round(1 - (1-Rsq_lm2)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_lm2
## [1] 0.6854

Interesting! Adding more variables, led to a worse model. It has a higher MAE & RMSE, and a lower adjusted R Square. What if we only used the 4 C’s. #### Price vs 4 C’s

# Set the seed for reproducibility
set.seed(123)

# Create an object called model_lm1 by using the lm fucntion
model_lm3 <- lm(price ~ carat + cut + color + clarity,        # The formula is price as a function of all other variables
                train,                # The data is an object called train
                na.action = na.omit)  # Exclude NAs               

summary(model_lm3) # Stats summary of the model
## 
## Call:
## lm(formula = price ~ carat + cut + color + clarity, data = train, 
##     na.action = na.omit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16834.6   -678.1   -195.8    467.3  10424.9 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -3707.235     15.566 -238.163  < 2e-16 ***
## carat        8879.929     13.384  663.469  < 2e-16 ***
## cut.L         709.492     22.659   31.311  < 2e-16 ***
## cut.Q        -328.882     19.957  -16.480  < 2e-16 ***
## cut.C         170.384     17.325    9.835  < 2e-16 ***
## cut^4           6.604     13.861    0.476  0.63375    
## color.L      1892.293     19.704   96.035  < 2e-16 ***
## color.Q      -612.504     17.915  -34.189  < 2e-16 ***
## color.C       174.145     16.778   10.379  < 2e-16 ***
## color^4        31.726     15.426    2.057  0.03973 *  
## color^5        84.454     14.584    5.791 7.05e-09 ***
## color^6       -41.302     13.224   -3.123  0.00179 ** 
## clarity.L    4176.710     34.628  120.616  < 2e-16 ***
## clarity.Q   -1833.327     32.434  -56.525  < 2e-16 ***
## clarity.C     900.704     27.727   32.485  < 2e-16 ***
## clarity^4    -367.479     22.111  -16.620  < 2e-16 ***
## clarity^5     208.358     17.975   11.591  < 2e-16 ***
## clarity^6       6.884     15.614    0.441  0.65932    
## clarity^7     104.862     13.768    7.616 2.66e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1151 on 43133 degrees of freedom
## Multiple R-squared:  0.9168, Adjusted R-squared:  0.9167 
## F-statistic: 2.639e+04 on 18 and 43133 DF,  p-value: < 2.2e-16
# Create predictions for the test data using the simple linear model
predict_lm3 <- predict(model_lm3, test,
                       na.action = na.omit) # exclude NAs

# calculate mean absolute error of model's predictions
mae_lm3 <- round(mae(test$price, predict_lm3), digits = 2)
mae_lm3
## [1] 809.05
# calculate root mean squared error of model's predictions
rmse_lm3 <- round(rmse(test$price, predict_lm3), digits = 2)
rmse_lm3
## [1] 1180
# calculate r squared
SSR_lm3 <-sse(test$price, predict_lm3)      # residual sum of squares
SST_lm3 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_lm3 <- 1 - SSR_lm3/SST_lm3

# calculate adjusted r squared
RsqAdj_lm3 <- round(1 - (1-Rsq_lm3)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_lm3
## [1] 0.9126

This is looking better! What about just depth and table? I have a hunch these two variables are going to be the worst predictors.

3.1.2.2 Price vs Depth and Table

# Set the seed for reproducibility
set.seed(123)

# Create an object called model_lm1 by using the lm fucntion
model_lm4 <- lm(price ~ depth + table,        # The formula is price as a function of all other variables
                train,                # The data is an object called train
                na.action = na.omit)  # Exclude NAs               

summary(model_lm4) # Stats summary of the model
## 
## Call:
## lm(formula = price ~ depth + table, data = train, na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7305  -2746  -1485   1351  15742 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -14408.80    1124.59 -12.812  < 2e-16 ***
## depth           74.19      13.93   5.326 1.01e-07 ***
## table          239.38       8.92  26.835  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3956 on 43149 degrees of freedom
## Multiple R-squared:  0.01659,    Adjusted R-squared:  0.01654 
## F-statistic: 363.9 on 2 and 43149 DF,  p-value: < 2.2e-16
# Create predictions for the test data using the simple linear model
predict_lm4 <- predict(model_lm4, test,
                       na.action = na.omit) # exclude NAs

# calculate mean absolute error of model's predictions
mae_lm4 <- round(mae(test$price, predict_lm4), digits = 2)
mae_lm4
## [1] 2984.07
# calculate root mean squared error of model's predictions
rmse_lm4 <- round(rmse(test$price, predict_lm4), digits = 2)
rmse_lm4
## [1] 3955.46
# calculate r squared
SSR_lm4 <-sse(test$price, predict_lm4)      # residual sum of squares
SST_lm4 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_lm4 <- 1 - SSR_lm4/SST_lm4

# calculate adjusted r squared
RsqAdj_lm4 <- round(1 - (1-Rsq_lm4)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_lm4
## [1] 0.0182

Just as expected. This is not a good model at all! How about x, y, and z?

3.1.2.3 Price vs Dimensions (x, y, & z)

# Set the seed for reproducibility
set.seed(123)

# Create an object called model_lm1 by using the lm fucntion
model_lm5 <- lm(price ~ x + y + z,        # The formula is price as a function of all other variables
                train,                # The data is an object called train
                na.action = na.omit)  # Exclude NAs               

summary(model_lm5) # Stats summary of the model
## 
## Call:
## lm(formula = price ~ x + y + z, data = train, na.action = na.omit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18680.9  -1276.4   -183.8    997.9  12058.5 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -14282.28      46.16 -309.441  < 2e-16 ***
## x             1246.88     110.82   11.251  < 2e-16 ***
## y             1588.43     108.43   14.650  < 2e-16 ***
## z              553.47      96.83    5.716  1.1e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1826 on 43133 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared:  0.7901, Adjusted R-squared:  0.7901 
## F-statistic: 5.414e+04 on 3 and 43133 DF,  p-value: < 2.2e-16
# Create predictions for the test data using the simple linear model
predict_lm5 <- predict(model_lm5, test,
                       na.action = na.omit) # exclude NAs

# calculate mean absolute error of model's predictions
mae_lm5 <- round(mae(test$price, predict_lm5), digits = 2)
mae_lm5
## [1] 1724.47
# calculate root mean squared error of model's predictions
rmse_lm5 <- round(rmse(test$price, predict_lm5), digits = 2)
rmse_lm5
## [1] 2634.51
# calculate r squared
SSR_lm5 <-sse(test$price, predict_lm5)      # residual sum of squares
SST_lm5 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_lm5 <- 1 - SSR_lm5/SST_lm5

# calculate adjusted r squared
RsqAdj_lm5 <- round(1 - (1-Rsq_lm5)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_lm5
## [1] 0.5645

Not the best, but does it help or hurt if we add it to the 4 C’s?

3.1.2.4 Price vs 4 C’s and Dimensions

# Set the seed for reproducibility
set.seed(123)

# Create an object called model_lm1 by using the lm fucntion
model_lm6 <- lm(price ~ carat + cut + color + clarity + x + y + z,        # The formula is price as a function of all other variables
                train,                # The data is an object called train
                na.action = na.omit)  # Exclude NAs               

summary(model_lm6) # Stats summary of the model
## 
## Call:
## lm(formula = price ~ carat + cut + color + clarity + x + y + 
##     z, data = train, na.action = na.omit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21880.6   -581.5   -182.5    366.7  10844.4 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   926.4334    95.3963   9.711  < 2e-16 ***
## carat       11624.4260    57.3676 202.630  < 2e-16 ***
## cut.L         633.2127    23.1566  27.345  < 2e-16 ***
## cut.Q        -259.1068    20.0620 -12.915  < 2e-16 ***
## cut.C         139.1130    17.1617   8.106 5.37e-16 ***
## cut^4          -0.7948    13.6257  -0.058 0.953487    
## color.L      1944.6314    19.1945 101.312  < 2e-16 ***
## color.Q      -671.9649    17.4511 -38.506  < 2e-16 ***
## color.C       170.3164    16.3088  10.443  < 2e-16 ***
## color^4        53.5914    14.9982   3.573 0.000353 ***
## color^5        97.9259    14.1777   6.907 5.02e-12 ***
## color^6       -39.3243    12.8542  -3.059 0.002220 ** 
## clarity.L    4029.5104    33.8768 118.946  < 2e-16 ***
## clarity.Q   -1931.5841    31.6425 -61.044  < 2e-16 ***
## clarity.C     959.4578    27.0260  35.501  < 2e-16 ***
## clarity^4    -368.4433    21.5072 -17.131  < 2e-16 ***
## clarity^5     225.5876    17.4910  12.897  < 2e-16 ***
## clarity^6      10.1858    15.1763   0.671 0.502119    
## clarity^7      84.5724    13.3871   6.317 2.69e-10 ***
## x           -1204.6756    73.2082 -16.455  < 2e-16 ***
## y             717.0309    69.6271  10.298  < 2e-16 ***
## z           -1145.2097    65.4820 -17.489  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1119 on 43115 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared:  0.9213, Adjusted R-squared:  0.9213 
## F-statistic: 2.403e+04 on 21 and 43115 DF,  p-value: < 2.2e-16
# Create predictions for the test data using the simple linear model
predict_lm6 <- predict(model_lm6, test,
                       na.action = na.omit) # exclude NAs

# calculate mean absolute error of model's predictions
mae_lm6 <- round(mae(test$price, predict_lm6), digits = 2)
mae_lm6
## [1] 1111.51
# calculate root mean squared error of model's predictions
rmse_lm6 <- round(rmse(test$price, predict_lm6), digits = 2)
rmse_lm6
## [1] 2239.09
# calculate r squared
SSR_lm6 <-sse(test$price, predict_lm6)      # residual sum of squares
SST_lm6 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_lm6 <- 1 - SSR_lm6/SST_lm6

# calculate adjusted r squared
RsqAdj_lm6 <- round(1 - (1-Rsq_lm6)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_lm6
## [1] 0.6854

There’s our answer. It hurts, doesn’t help. It looks like the 4 C’s are the best predictors. Now let’s try a ML model other than linear regression.

3.2 Regression Decision Tree

3.2.1 Price vs 4 C’s

set.seed(123)

model_dt1 <- rpart(formula = price ~ cut + carat + color + clarity,
                   data = train,
                   method = "anova")
print(model_dt1)
## n= 43152 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 43152 686526200000  3926.589  
##    2) carat< 0.995 27934  34904930000  1631.023  
##      4) carat< 0.625 19871   5283100000  1050.602 *
##      5) carat>=0.625 8063   6429671000  3061.449 *
##    3) carat>=0.995 15218 234217800000  8140.306  
##      6) carat< 1.475 10245  47971710000  6129.327  
##       12) clarity=I1,SI2,SI1,VS2 7773  14496730000  5363.278 *
##       13) clarity=VS1,VVS2,VVS1,IF 2472  14570490000  8538.104 *
##      7) carat>=1.475 4973  59461550000 12283.170  
##       14) carat< 1.915 3210  27894050000 10867.790 *
##       15) carat>=1.915 1763  13428350000 14860.240 *
rpart.plot(x=model_dt1,
           yesno = 2, type = 0, extra = 0) # personal preference on how the plot is displayed

# Create predictions for the test data using the simple linear model
predict_dt1 <- predict(model_dt1, test)

# calculate mean absolute error of model's predictions
mae_dt1 <- round(mae(test$price, predict_dt1), digits = 2)
mae_dt1
## [1] 912.34
# calculate root mean squared error of model's predictions
rmse_dt1 <- round(rmse(test$price, predict_dt1), digits = 2)
rmse_dt1
## [1] 1433.26
# calculate r squared
SSR_dt1 <-sse(test$price, predict_dt1)      # residual sum of squares
SST_dt1 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_dt1 <- 1 - SSR_dt1/SST_dt1

# calculate adjusted r squared
RsqAdj_dt1 <- round(1 - (1-Rsq_dt1)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_dt1
## [1] 0.8711

3.2.2 Price vs All

set.seed(123)

model_dt2 <- rpart(formula = price ~ .,
                   data = train,
                   method = "anova")
print(model_dt2)
## n= 43152 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 43152 686526200000  3926.589  
##    2) carat< 0.995 27934  34904930000  1631.023  
##      4) y< 5.535 19999   5440120000  1056.695 *
##      5) y>=5.535 7935   6242051000  3078.531 *
##    3) carat>=0.995 15218 234217800000  8140.306  
##      6) y< 7.195 10296  47444790000  6123.230  
##       12) clarity=I1,SI2,SI1,VS2 7858  15736360000  5389.146 *
##       13) clarity=VS1,VVS2,VVS1,IF 2438  13825520000  8489.281 *
##      7) y>=7.195 4922  57255330000 12359.690  
##       14) y< 7.865 3204  27464010000 10959.290 *
##       15) y>=7.865 1718  11789580000 14971.380 *
rpart.plot(x=model_dt2,
           yesno = 2, type = 0, extra = 0) # personal preference on how the plot is displayed

# Create predictions for the test data using the simple linear model
predict_dt2 <- predict(model_dt2, test)

# calculate mean absolute error of model's predictions
mae_dt2 <- round(mae(test$price, predict_dt2), digits = 2)
mae_dt2
## [1] 915.34
# calculate root mean squared error of model's predictions
rmse_dt2 <- round(rmse(test$price, predict_dt2), digits = 2)
rmse_dt2
## [1] 1438.64
# calculate r squared
SSR_dt2 <-sse(test$price, predict_dt2)      # residual sum of squares
SST_dt2 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_dt2 <- 1 - SSR_dt2/SST_dt2

# calculate adjusted r squared
RsqAdj_dt2 <- round(1 - (1-Rsq_dt2)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_dt2
## [1] 0.8701

3.2.2.1 Tuning CP to 0.001

Changing the complexity parameter from 0.01 to 0.001.Any split that does not decrease the overall lack of fit by a factor of cp is not attempted. For instance, with anova splitting, this means that the overall R-squared must increase by cp at each step.

set.seed(123)

model_dt3 <- rpart(formula = price ~ .,
                   data = train,
                   method = "anova",
                   control = rpart.control(cp=0.001))
rpart.plot(x=model_dt3,
           yesno = 2, type = 0, extra = 0) # personal preference on how the plot is displayed

# Create predictions for the test data using the simple linear model
predict_dt3 <- predict(model_dt3, test)

# calculate mean absolute error of model's predictions
mae_dt3 <- round(mae(test$price, predict_dt3), digits = 2)
mae_dt3
## [1] 589.92
# calculate root mean squared error of model's predictions
rmse_dt3 <- round(rmse(test$price, predict_dt3), digits = 2)
rmse_dt3
## [1] 986.47
# calculate r squared
SSR_dt3 <-sse(test$price, predict_dt3)      # residual sum of squares
SST_dt3 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_dt3 <- 1 - SSR_dt3/SST_dt3

# calculate adjusted r squared
RsqAdj_dt3 <- round(1 - (1-Rsq_dt3)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_dt3
## [1] 0.9389

3.2.2.2 Tuning CP to 0.000001

Now let’s try an even lower complexity parameter that makes a decision tree too complex to plot.

set.seed(123)

model_dt4 <- rpart(formula = price ~ .,
                   data = train,
                   method = "anova",
                   control = rpart.control(cp=0.000001))
# Create predictions for the test data using the simple linear model
predict_dt4 <- predict(model_dt4, test)

# calculate mean absolute error of model's predictions
mae_dt4 <- round(mae(test$price, predict_dt4), digits = 2)
mae_dt4
## [1] 326.34
# calculate root mean squared error of model's predictions
rmse_dt4 <- round(rmse(test$price, predict_dt4), digits = 2)
rmse_dt4
## [1] 633.96
# calculate r squared
SSR_dt4 <-sse(test$price, predict_dt4)      # residual sum of squares
SST_dt4 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_dt4 <- 1 - SSR_dt4/SST_dt4

# calculate adjusted r squared
RsqAdj_dt4 <- round(1 - (1-Rsq_dt4)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_dt4
## [1] 0.9748

3.3 Random Forest

3.3.1 Price vs 4 C’s

Let’s create a random forest model with the 4 C’s.

set.seed(123)

model_rf1 <- randomForest(price ~ carat + cut + color + clarity, 
                          train)
model_rf1
## 
## Call:
##  randomForest(formula = price ~ carat + cut + color + clarity,      data = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 1982524
##                     % Var explained: 87.54
plot(model_rf1)


It looks like there 100 trees (iterations) are enough for this model.

# Create predictions for the test data using the simple linear model
predict_rf1 <- predict(model_rf1, test)

# calculate mean absolute error of model's predictions
mae_rf1 <- round(mae(test$price, predict_rf1), digits = 2)
mae_rf1
## [1] 934.28
# calculate root mean squared error of model's predictions
rmse_rf1 <- round(rmse(test$price, predict_rf1), digits = 2)
rmse_rf1
## [1] 1403.13
# calculate r squared
SSR_rf1 <-sse(test$price, predict_rf1)      # residual sum of squares
SST_rf1 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_rf1 <- 1 - SSR_rf1/SST_rf1

# calculate adjusted r squared
RsqAdj_rf1 <- round(1 - (1-Rsq_rf1)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_rf1
## [1] 0.8765

It’s good, but not as good as the linear regression. What if we use all the data?

3.3.2 Price vs All

set.seed(123)

model_rf2 <- randomForest(price ~ ., 
                          train,
                          na.action = na.omit, # remove NAs
                          ntree=300)           # only create 300 trees / iterations
model_rf2
## 
## Call:
##  randomForest(formula = price ~ ., data = train, ntree = 300,      na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 300
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 294825.6
##                     % Var explained: 98.14
plot(model_rf2)


It looks like there 100 trees (iterations) are enough for this model too.

# Create predictions for the test data using the simple linear model
predict_rf2 <- predict(model_rf2, test)

# Create new tibble for test that includes predictions
# and filters out NAs
test_rf2 <- test %>% 
  mutate(pred = predict_rf2) %>% 
  filter(is.na(pred)==F)

# calculate mean absolute error of model's predictions
mae_rf2 <- round(mae(test_rf2$price, test_rf2$pred), digits = 2)
mae_rf2
## [1] 275.41
# calculate root mean squared error of model's predictions
rmse_rf2 <- round(rmse(test_rf2$price, test_rf2$pred), digits = 2)
rmse_rf2
## [1] 564.83
# calculate r squared
SSR_rf2 <-sse(test_rf2$price, test_rf2$pred)      # residual sum of squares
SST_rf2 <-sse(test_rf2$price, mean(test_rf2$price)) # total sum of squares
Rsq_rf2 <- 1 - SSR_rf2/SST_rf2

# calculate adjusted r squared
RsqAdj_rf2 <- round(1 - (1-Rsq_rf2)*((nrow(test_rf2)-1)/(nrow(test_rf2)-1-1)), digits = 4)
RsqAdj_rf2
## [1] 0.98

This model out-performs the best linear model (Price vs The 4 C’s). The RMSE is 1180 for the linear model, and 564.83 for this random forest model. The adjusted R square is 0.9126361 for the linear model and 0.9799764 for the random forest model. However, the down side to the random forest model is that it can only use complete data. If there are NAs for x, y, and z, as we saw in the raw data, then this random forest model cannot predict price, and the linear model would have to be used.

3.4 Gradient Boosting Machine

3.4.1 Price vs 4 C’s

set.seed(123)

model_gbm1 <- gbm(formula = price ~ carat + cut + color + clarity,
                  data = train,
                  distribution = "gaussian",
                  n.trees = 1000)

Below shows the relative importance of each predictive variable.

summary(model_gbm1)

##             var    rel.inf
## carat     carat 95.2463663
## clarity clarity  3.0094951
## color     color  1.5262655
## cut         cut  0.2178731

It looks like carat has a lot more predictive power than clarity, color, and cut.

print(model_gbm1)
## gbm(formula = price ~ carat + cut + color + clarity, distribution = "gaussian", 
##     data = train, n.trees = 1000)
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## There were 4 predictors of which 4 had non-zero influence.

Although carat is the best predictor, all four features had a non-zero influence.

# Create predictions for the test data using the simple linear model
predict_gbm1 <- predict(model_gbm1, test)

# calculate mean absolute error of model's predictions
mae_gbm1 <- round(mae(test$price, predict_gbm1), digits = 2)
mae_gbm1
## [1] 694.97
# calculate root mean squared error of model's predictions
rmse_gbm1 <- round(rmse(test$price, predict_gbm1), digits = 2)
rmse_gbm1
## [1] 1065.85
# calculate r squared
SSR_gbm1 <-sse(test$price, predict_gbm1)      # residual sum of squares
SST_gbm1 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_gbm1 <- 1 - SSR_gbm1/SST_gbm1

# calculate adjusted r squared
RsqAdj_gbm1 <- round(1 - (1-Rsq_gbm1)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_gbm1
## [1] 0.9287

This isn’t too bad, but not as good as the Random Forest model.

3.4.2 Price vs All

set.seed(123)

model_gbm2 <- gbm(formula = price ~ .,
                  data = train,
                  distribution = "gaussian",
                  n.trees = 1000)
summary(model_gbm2)

##             var     rel.inf
## y             y 67.63546385
## carat     carat 15.46431048
## z             z  8.59241920
## x             x  4.03740092
## clarity clarity  2.66925283
## color     color  1.46180531
## depth     depth  0.10285109
## cut         cut  0.02739915
## table     table  0.00909717

y is the best predictor followed by carat, z, and then x. It looks like all the numeric values are the best predictors.

print(model_gbm2)
## gbm(formula = price ~ ., distribution = "gaussian", data = train, 
##     n.trees = 1000)
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## There were 9 predictors of which 9 had non-zero influence.

All features have a non-zero influence.

# Create predictions for the test data using the simple linear model
predict_gbm2 <- predict(model_gbm2, test)

# calculate mean absolute error of model's predictions
mae_gbm2 <- round(mae(test$price, predict_gbm2), digits = 2)
mae_gbm2
## [1] 679.36
# calculate root mean squared error of model's predictions
rmse_gbm2 <- round(rmse(test$price, predict_gbm2), digits = 2)
rmse_gbm2
## [1] 1062.52
# calculate r squared
SSR_gbm2 <-sse(test$price, predict_gbm2)      # residual sum of squares
SST_gbm2 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_gbm2 <- 1 - SSR_gbm2/SST_gbm2

# calculate adjusted r squared
RsqAdj_gbm2 <- round(1 - (1-Rsq_gbm2)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_gbm2
## [1] 0.9292

This model is only slightly better than the Price vs 4 C’s model.

3.4.2.1 Tuning

The following parameters can be tuned to increase the performance of a GBM model:
1. n.trees: number of trees
2. bag.fraction: proportion of observations to be sampled in each tree
3. n.minobsinnode: minimum number of observations in the trees terminal nodes
4. interaction.depth: maximum nodes per tree
5. shrinkage: learning rate

The script below determines the optimal ntree based on Out-of-Bag (OOB) error.

ntree_opt_oob <- gbm.perf(model_gbm2, method = "OOB", oobag.curve = T)

ntree_opt_oob
## [1] 335
## attr(,"smoother")
## Call:
## loess(formula = object$oobag.improve ~ x, enp.target = min(max(4, 
##     length(x)/10), 50))
## 
## Number of Observations: 1000 
## Equivalent Number of Parameters: 40 
## Residual Standard Error: 31340

The script below determines the optimal ntree based on the Cross Validation (CV) error with 5 cross validation folds.

set.seed(123)

model_gbm2_cv <- gbm(formula = price ~ .,
                  data = train,
                  distribution = "gaussian",
                  n.trees = 1000,
                  cv.folds = 5)

ntree_opt_cv <- gbm.perf(model_gbm2_cv, method = "cv")

ntree_opt_cv
## [1] 949

Now time to compare the predictions between the two options for ntree.

pred_oob <- predict(model_gbm2, test, n.trees = ntree_opt_oob)
rmse(test$price, pred_oob)
## [1] 1072.422
pred_cv <- predict(model_gbm2, test, n.trees = ntree_opt_cv)
rmse(test$price, pred_cv)
## [1] 1063.767

Based on the cross validation error, the optimal number of trees when predicting should be 949. However, this RMSE is less than the RMSE of the original model. Therfore, this ntree should be left as is.

3.5 Model Comparison

The script below combines combines all the metrics into one tibble.

model_compare <- tibble(
  ModelName = c(
    "model_dt1",
    "model_dt2",
    "model_dt3",
    "model_dt4",
    "model_gbm1",
    "model_gbm2",
    "model_lm1",
    "model_lm2",
    "model_lm3",
    "model_lm4",
    "model_lm5",
    "model_lm6",
    "model_rf1",
    "model_rf2"
  ),
  ModelType = c(
    "Decision Tree",
    "Decision Tree",
    "Decision Tree",
    "Decision Tree",
    "Gradient Boosting Machine",
    "Gradient Boosting Machine",
    "Linear Regression",
    "Linear Regression",
    "Linear Regression",
    "Linear Regression",
    "Linear Regression",
    "Linear Regression",
    "Random Forest",
    "Random Forest"
  ),
  RMSE = c(
    rmse_dt1,
    rmse_dt2,
    rmse_dt3,
    rmse_dt4,
    rmse_gbm1,
    rmse_gbm2,
    rmse_lm1,
    rmse_lm2,
    rmse_lm3,
    rmse_lm4,
    rmse_lm5,
    rmse_lm6,
    rmse_rf1,
    rmse_rf2
  ),
  MAE = c(
    mae_dt1,
    mae_dt2,
    mae_dt3,
    mae_dt4,
    mae_gbm1,
    mae_gbm2,
    mae_lm1,
    mae_lm2,
    mae_lm3,
    mae_lm4,
    mae_lm5,
    mae_lm6,
    mae_rf1,
    mae_rf2
  ),
  AdjRSquare = c(
    RsqAdj_dt1,
    RsqAdj_dt2,
    RsqAdj_dt3,
    RsqAdj_dt4,
    RsqAdj_gbm1,
    RsqAdj_gbm2,
    RsqAdj_lm1,
    RsqAdj_lm2,
    RsqAdj_lm3,
    RsqAdj_lm4,
    RsqAdj_lm5,
    RsqAdj_lm6,
    RsqAdj_rf1,
    RsqAdj_rf2
  )
) %>% 
  arrange(RMSE, MAE, AdjRSquare) %>% 
  mutate(Rank = c(1:nrow(.))) %>% 
  select(Rank, everything())

head(model_compare, n = nrow(model_compare))
## # A tibble: 14 x 6
##     Rank ModelName  ModelType                  RMSE   MAE AdjRSquare
##    <int> <chr>      <chr>                     <dbl> <dbl>      <dbl>
##  1     1 model_rf2  Random Forest              565.  275.     0.98  
##  2     2 model_dt4  Decision Tree              634.  326.     0.975 
##  3     3 model_dt3  Decision Tree              986.  590.     0.939 
##  4     4 model_gbm2 Gradient Boosting Machine 1063.  679.     0.929 
##  5     5 model_gbm1 Gradient Boosting Machine 1066.  695.     0.929 
##  6     6 model_lm3  Linear Regression         1180   809.     0.913 
##  7     7 model_rf1  Random Forest             1403.  934.     0.876 
##  8     8 model_dt1  Decision Tree             1433.  912.     0.871 
##  9     9 model_dt2  Decision Tree             1439.  915.     0.870 
## 10    10 model_lm1  Linear Regression         1591. 1033.     0.841 
## 11    11 model_lm2  Linear Regression         2239. 1109.     0.685 
## 12    12 model_lm6  Linear Regression         2239. 1112.     0.685 
## 13    13 model_lm5  Linear Regression         2635. 1724.     0.564 
## 14    14 model_lm4  Linear Regression         3955. 2984.     0.0182

The best model is model_rf2, a Random Forest model using the formula price vs carat + cut + color + clarity + depth + table + x + y + z. It has an RMSE of 564.83, a MAE of 275.41, and an adjusted R square of 0.98.

4 Discussion

Throughout this analysis some assumptions and data cleaning steps have been made. These decisions will impact how each predictive model will perform. A sensitivity analysis may have to be completed for the following data cleaning issues:

1. Issue: x, y, and z values labeled as zeros.
Action took in this analysis: Make all zeros NAs and ignore NAs in model exploration.

2. Issue: Potential outliers for y and z.
Action took in this analysis: Estimate y and z using the depth equation

5 Conclusion

The best model is model_rf2, a Random Forest model using the formula price vs carat + cut + color + clarity + depth + table + x + y + z. It has an RMSE of 564.83, a MAE of 275.41, and an adjusted R square of 0.98.