The purpose of this project is to create a machine learning model that can predict the price of a diamond based on its charateristics. This report will start with an initial exploratory data analysis (EDA). The EDA will be filled with data visualizations and inital data insights to help understand a data set. The EDA is then followed by a model exploration. Different machine learning alogrithms will be used to create predictive models. Models will then be compared against each other and the most accurate predictive model will be determined. The conclusion section will provide an executive summary of data insights and the selected predictive model.
This data set is provided in the ggplot2 package from CRAN. Let’s load the tidyverse collection of packages, which includes ggplot2 and some other packages (like dplyr) which are useful for data manipulation. Let’s also load the plotly package, which allows us to create advanced visualizations.
# Load libraries needed for EDA
library(tidyverse)
library(plotly)
# save diamonds data into an object called data
data <- ggplot2::diamonds
A sample of the data is provided below:
head(data)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Looks like there is a mix of numeric and categorical data. The numeric data (carat, depth, table, x, y, & z) are doubles (dbl), which means they are numerics with decimal places. Price is a numeric integer, so it will not have any decimal places. The categorical data (cut, color, and clarity) are ordered (ord), which means it has a specific order, such as best to worst. Similar to the sample above, the glimpse function provides a sample of the data in a different format that is more preferable when a data set is very large.
glimpse(data)
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
There are 53940 rows of data. Each row is one observation / one diamond. There are 10 columns of data. Each column is a variable / feature. Below is a quick statistical summary of each feature:
summary(data)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
A minimum, 1st quartile, median, mean, 3rd quartile, and maximum is provided for all the numeric data. For example, the median price of diamonds in this data set is 2401. A data count is provided for all categorical data. For example, there are 21551 diamonds that are Ideal cut in this data set. Since all the categorical data is ordered, the counts are in its pre-specified order. For the example, the order for color is: D, E, F, G, H, I, J. Below is a definition for each feature:
carat: weight of the diamond (0.2 to 5.01)
cut: quality of the cut (in order of worst to best; Fair, Good, Very Good, Premium, Ideal)
color: diamond color (in order of best to worst; D, E, F, G, H, I, J)
clarity: a measurement of how clear the diamond is (in order of worst to best; I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF)
depth: total depth percentage (total height divided by total width) = 2 * z / (x + y) (43 to 79)
table: width of the top of a diamond relative to its widest point (43 to 95)
price: price in US dollars (326 to 18823)
x: length in mm (0 to 10.74)
y: width in mm (0 to 58.9)
z: depth in mm (0 to 31.8)
My OCD is kicking in a little bit for color. It is in the order of best to worst, while cut and clarity are in order of worst to best. The script below reorders color from worst to best for consistency. This will have no effect what-so-ever on the analysis, it’s just my personal preference.
# Reordering color so it's from worst to best. Just like all the other ordered features.
# This is personal preference. It's not a necessary step.
data <- data %>% # Create a new object, data, from the old object, data
mutate( # Change its columns
color = ordered(color, # Make the color column and ordered column
levels = c( # The order/levels are as follows:
"J",
"I",
"H",
"G",
"F",
"E",
"D"
)))
Now we can get to the interesting stuff. Let’s look at how price is affected by each individual feature.
Below is a scatter plot of price vs carat:
ggplot(data, aes(x=carat, y=price)) + # Plot the object data, with carat mapped to the x axis and price mapped to the y axis
geom_point(alpha=0.1) # Make the plot a scatter plot. Make each point 10% transparent/alpha
Price increases at an increasing rate as carat increases.
Below is a box-plot of price by cut.
{ggplot(data, aes(x=cut, y=price)) + # Plot the object data, with cut mapped to the x axis and price mapped to the y axis
geom_boxplot()} %>% # Make the plot a box plot
ggplotly() # Make the plot interactive with the ggplotly function
Below is a box-plot of price by color.
{ggplot(data, aes(x=color, y=price))+
geom_boxplot()} %>%
ggplotly()
Below is a box-plot of price by clarity.
{ggplot(data, aes(x=clarity, y=price))+
geom_boxplot()} %>%
ggplotly()
Below is a scatter plot of price vs depth percentage.
ggplot(data, aes(x=depth, y=price))+
geom_point(alpha=0.1)
It looks like as the range of depth narrows, the price increases. It appears more expensive diamonds are roughly around 58% to 63% depth percentage.
Below is a scatter plot of price vs table percentage.
ggplot(data, aes(x=table, y=price))+
geom_point(alpha=0.1)
This looks similar to depth percentage. As the range of table percentage narrows, the price increases. Although, it doesn’t to appear to be as drastic as depth percentage.
Below is a scatter plot of price vs length (x).
ggplot(data, aes(x=x, y=price))+
geom_point(alpha=0.1)
This looks similar to carat. Price increases at an increasing rate as length increases. There’s also some data points that have a length of zero. These must be missing data points and should be converted to NAs. If they are not, this can mess up some predictive models down the line. First let’s find out how many zeros there are.
df <- data %>% # Create a new object df from the object data
filter(x==0) # show only the data that has a zero value for x
print(df) # Display the object df
## # A tibble: 8 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1.07 Ideal F SI2 61.6 56 4954 0 6.62 0
## 2 1 Very Good H VS2 63.3 53 5139 0 0 0
## 3 1.14 Fair G VS1 57.5 67 6381 0 0 0
## 4 1.56 Ideal G VS2 62.2 54 12800 0 0 0
## 5 1.2 Premium D VVS1 62.1 59 15686 0 0 0
## 6 2.25 Premium H SI2 62.8 59 18034 0 0 0
## 7 0.71 Good F SI2 64.1 60 2130 0 0 0
## 8 0.71 Good F SI2 64.1 60 2130 0 0 0
There are 8 rows of data with x labeled as zero and there are 53940 rows of data total. This is a small percent (1.483129410^{-4}) of the total data, but this should still be addressed. These values will be converted into NAs and what to do with them will be determined during model exploration. These rows of data could be removed from the data set completely, or a median length can be assumed, or potentially x can be calculated from depth and table percentage. Is it even worth addressing it all? This will be determined later in this project.
I also see zeros for y and z. The script below determines how many zeros there are for width (y):
df <- data %>%
filter(y==0)
nrow(df) # display how many rows are in object df
## [1] 7
And for depth (z):
df <- data %>%
filter(z==0)
nrow(df)
## [1] 20
There are 20 missing data points for z. That is 3.707823510^{-4} percent of the data. It’s slightly more that x and y, but still a small sample. The script below will convert all zeros for x, y, and z into NAs, specifically numeric NAs.
data <- data %>% # create a new object called data, from the old object called data
mutate( # change the columns in the old object
x = case_when( # Change the column x by the following conditions
x==0 ~ NA_real_, # When x is 0, make it a NA
TRUE ~ x), # For everything else, keep x as is
y = case_when( # same logic as x
y==0 ~ NA_real_,
TRUE ~ y),
z = case_when( # same logic as x
z==0 ~ NA_real_,
TRUE ~ z))
All zeros are now NAs.
Below is a scatter plot of price vs width (y).
ggplot(data, aes(x=y, y=price))+
geom_point(alpha=0.1)
First off, there are some outliers past a y of 20. Let’s see what these points are.
df <- data %>%
filter(y>20)
print(df)
## # A tibble: 2 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 2 Premium H SI2 58.9 57 12210 8.09 58.9 8.06
## 2 0.51 Ideal E VS1 61.8 55 2075 5.15 31.8 5.12
These values for y can’t be right. One way to double check is to use the equation for depth percentage: depth = 2 * z/(x + y). Using a little algebra we can solve for y: y = (2z/depth) - x. Before using these equations, we need a proof of concept. Depth and y will be calculated using these equations and we’ll check if they equal the depth and y provided in the data. If they do, the equations are true.
df <- data %>%
select(depth, table, x, y, z) %>% # Select only these columns from data
mutate(
depth_calc = round(((2*z)/(x+y))*100, # create a new column, depth_calc using the equation
digits = 1), # round to the nearest one digit
y_calc = round((2*z/(depth/100))-x, # create a new column, y_calc using the equation
digits = 2), # round to the nearest two digits
depth_check = depth==depth_calc, # create depth_check, which is true if depth equals depth_calc
y_check = y==y_calc) %>% # create y_check, which is true if y equals y_calc
select(depth, depth_calc, depth_check, y, y_calc, y_check) # Select only these columns
head(df, n = 20) # show the first 20 rows
## # A tibble: 20 x 6
## depth depth_calc depth_check y y_calc y_check
## <dbl> <dbl> <lgl> <dbl> <dbl> <lgl>
## 1 61.5 61.3 FALSE 3.98 3.95 FALSE
## 2 59.8 59.8 TRUE 3.84 3.84 TRUE
## 3 56.9 56.9 TRUE 4.07 4.07 TRUE
## 4 62.4 62.4 TRUE 4.23 4.23 TRUE
## 5 63.3 63.3 TRUE 4.35 4.35 TRUE
## 6 62.8 62.8 TRUE 3.96 3.96 TRUE
## 7 62.3 62.3 TRUE 3.98 3.98 TRUE
## 8 61.9 61.9 TRUE 4.11 4.1 FALSE
## 9 65.1 65.1 TRUE 3.78 3.78 TRUE
## 10 59.4 59.4 TRUE 4.05 4.05 TRUE
## 11 64 64 TRUE 4.28 4.28 TRUE
## 12 62.8 62.8 TRUE 3.9 3.9 TRUE
## 13 60.4 60.4 TRUE 3.84 3.84 TRUE
## 14 62.2 62.2 TRUE 4.37 4.36 FALSE
## 15 60.2 60.2 TRUE 3.75 3.75 TRUE
## 16 60.9 60.9 TRUE 4.42 4.42 TRUE
## 17 62 62 TRUE 4.34 4.34 TRUE
## 18 63.4 63.4 TRUE 4.29 4.29 TRUE
## 19 63.8 63.8 TRUE 4.26 4.27 FALSE
## 20 62.7 62.7 TRUE 4.27 4.27 TRUE
It looks like the equations work for the most part. In some instances they don’t match 100%, but they are in the ballpark. This may be due to rounding errors. Now that these equations are confirmed. Let’s calculate y for the two outliers.
df <- data %>%
filter(y>20) %>%
mutate(y_calc = round((2*z/(depth/100))-x,
digits = 2))
print(df)
## # A tibble: 2 x 11
## carat cut color clarity depth table price x y z y_calc
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2 Premium H SI2 58.9 57 12210 8.09 58.9 8.06 19.3
## 2 0.51 Ideal E VS1 61.8 55 2075 5.15 31.8 5.12 11.4
These values seem a lot more reasonable. Now let’s commit this change to the data and re-plot.
data <- data %>%
mutate(
y = case_when(
y>20 ~ round((2*z/(depth/100))-x,digits = 2),
TRUE ~ y))
ggplot(data, aes(x=y, y=price))+
geom_point(alpha=0.1)
Those data points still look like they may be outliers but they are more reasonable than before, and trend is easier to see in this chart. Price increases at an increasing rate as y increases.
Below is a scatter plot of price vs depth (z).
ggplot(data, aes(x=z, y=price))+
geom_point(alpha=0.1)
Looks like there is an outlier past a depth (z) of 30. This may be a similar case as seen with width (y). Below shows this data point:
df <- data %>%
filter(z>30)
print(df)
## # A tibble: 1 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.51 Very Good E VS1 61.8 54.7 1970 5.12 5.15 31.8
This doesn’t pass the common sense test. Using the depth percentage equation we can calcuate depth (z) by using: z = depth*(x+y)/2. The results are below:
df <- df %>%
mutate(
z_calc = round(((depth/100)*(x+y))/2, digits = 2)
)
print(df)
## # A tibble: 1 x 11
## carat cut color clarity depth table price x y z z_calc
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.51 Very Good E VS1 61.8 54.7 1970 5.12 5.15 31.8 3.17
The calculated z looks more reasonable. The script below commits this change to the data and re-plots.
data <- data %>%
mutate(
z = case_when(
z>30 ~ round(((depth/100)*(x+y))/2, digits = 2),
TRUE ~ z
)
)
ggplot(data, aes(x=z, y=price))+
geom_point(alpha=0.1)
That’s better! This trend appears to be the same as carat, x, and y. Now, onto some more detailed data exploration.
Best of the Best: Imagine someone wants to buy the best possible diamond, and money is no object. They only want to consider diamonds in the top categories of cut (Ideal), color (D), and clarity (IF). They want the most ideal range for depth (59-63) and table (54-57). Within the dataset, if we plot carat versus price can we fit a clean trendline? What’s the price of the largest carat, and is it the most expensive?
First we need to filter the data set to only the best cut, color, clarity, depth, and table.
df <- data %>%
filter(
cut=="Ideal",
color=="D",
clarity=="IF",
between(depth, 59, 63),
between(table, 54, 57))
print(df)
## # A tibble: 24 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.51 Ideal D IF 62 56 3446 5.14 5.18 3.2
## 2 0.51 Ideal D IF 62.1 55 3446 5.12 5.13 3.19
## 3 0.53 Ideal D IF 61.5 54 3517 5.27 5.21 3.22
## 4 0.53 Ideal D IF 62.2 55 3812 5.17 5.19 3.22
## 5 0.59 Ideal D IF 60.9 57 4208 5.4 5.43 3.3
## 6 0.56 Ideal D IF 62.4 56 4216 5.24 5.28 3.28
## 7 0.56 Ideal D IF 61.9 57 4293 5.28 5.31 3.28
## 8 0.63 Ideal D IF 62.5 55 6549 5.47 5.5 3.43
## 9 0.63 Ideal D IF 62.5 55 6607 5.5 5.47 3.43
## 10 1.04 Ideal D IF 61.8 57 14494 6.49 6.52 4.02
## # ... with 14 more rows
This filters the data to only 24 options. Let’s plot carat versus price.
ggplot(df, aes(x=carat, y=price)) +
geom_point()
Looks like a linear model could be viable. Let’s create one.
model_df_lm <- lm(price ~ carat, df)
summary(model_df_lm)
##
## Call:
## lm(formula = price ~ carat, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2123.5 -1264.7 245.6 1026.5 2344.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5621.9 623.1 -9.023 7.57e-09 ***
## carat 20259.9 911.8 22.221 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1288 on 22 degrees of freedom
## Multiple R-squared: 0.9573, Adjusted R-squared: 0.9554
## F-statistic: 493.8 on 1 and 22 DF, p-value: < 2.2e-16
Carat passes the T test (p-value) by a mile and the R squared isn’t too bad. Let’s plot this model now.
pred_df_lm <- predict(model_df_lm, df)
df <- df %>%
mutate(pred = pred_df_lm)
ggplot(df, aes(x=carat, y=price)) +
geom_point(color="blue") +
geom_line(color="red", aes(y=pred))
Looks like the most expensive diamond might not be the largest one. Let’s find these two data points.
df2 <- df %>%
filter(
carat == max(carat) |
price == max(price)
)
print(df2)
## # A tibble: 2 x 11
## carat cut color clarity depth table price x y z pred
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1.07 Ideal D IF 60.9 54 17042 6.66 6.73 4.08 16056.
## 2 1.03 Ideal D IF 62 56 17590 6.55 6.44 4.03 15246.
Interesting! The most expensive diamond is the not the largest diamond, but it has a larger depth and table compared to the largest carat diamond. This needs a closer look.
This next section will explore the data using 3-dimensional scatter plots.
First let’s create a 2D scatter plot of price vs carat.
ggplot(data, aes(x=carat, y=price)) +
geom_point()
Next, let’s create a scatter plot of price vs depth.
ggplot(data, aes(x=depth, y=price)) +
geom_point()
Now let’s create a scatter plot of price vs table.
ggplot(data, aes(x=table, y=price)) +
geom_point()
A 3D scatter plot of price vs carat vs depth. This chart is interactive. Click and drag your mouse to change the orientation of the chart.
plot_ly(data=data, x=~price, y=~carat, z=~depth, type = "scatter3d", marker = list(size=1)) %>%
layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="depth")))
plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~cut, type = "scatter3d", marker = list(size=1)) %>%
layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="depth")))
plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~color, type = "scatter3d", marker = list(size=1)) %>%
layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="depth")))
plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~clarity, type = "scatter3d", marker = list(size=1)) %>%
layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="depth")))
A 3D scatter plot of price vs carat vs table.
plot_ly(data=data, x=~price, y=~carat, z=~table, type = "scatter3d", marker = list(size=1)) %>%
layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Table")))
plot_ly(data=data, x=~price, y=~carat, z=~table, color=~cut, type = "scatter3d", marker = list(size=1)) %>%
layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="table")))
plot_ly(data=data, x=~price, y=~carat, z=~table, color=~color, type = "scatter3d", marker = list(size=1)) %>%
layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="table")))
plot_ly(data=data, x=~price, y=~carat, z=~table, color=~clarity, type = "scatter3d", marker = list(size=1)) %>%
layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="table")))
The script below summarizes all the data prep and cleaning steps in previous sections.
# Load libraries needed for model exploration
library(tidyverse) # for data manipulation
library(Metrics) # to quickly develop metrics to determine accuracy of models
library(rpart) # to create decision tree models
library(rpart.plot) # to plot decision tree models
library(randomForest) # to create random forest models
library(gbm) # to create gradient boosted models
library(broom) # to analyze model performance
# This script is to prep the data
# This way I don't have to re-run everything before this
data <- ggplot2::diamonds
# Reordering color so it's from worst to best. Just like all the other ordered features.
# This is personal preference. It's not a necessary step.
data <- data %>% # Create a new object, data, from the old object, data
mutate( # Change its columns
color = ordered(color, # Make the color column and ordered column
levels = c( # The order/levels are as follows:
"J",
"I",
"H",
"G",
"F",
"E",
"D"
)))
# Changing zeros into NAs
data <- data %>% # create a new object called data, from the old object called data
mutate( # change the columns in the old object
x = case_when( # Change the column x by the following conditions
x==0 ~ NA_real_, # When x is 0, make it a NA
TRUE ~ x), # For everything else, keep x as is
y = case_when( # same logic as x
y==0 ~ NA_real_,
TRUE ~ y),
z = case_when( # same logic as x
z==0 ~ NA_real_,
TRUE ~ z))
# Estimate y outliers
data <- data %>%
mutate(
y = case_when(
y>20 ~ round((2*z/(depth/100))-x,digits = 2),
TRUE ~ y))
# Estimate z outliers
data <- data %>%
mutate(
z = case_when(
z>30 ~ round(((depth/100)*(x+y))/2, digits = 2),
TRUE ~ z
)
)
One more step before we go into creating models. Let’s create two random samples from the data. One sample will train the ML model, this is called the training data. And another smaller sample will be used to test the accuracy of the ML, this is called the testing data. Splitting up the data this way will allow for a fair comparison between models that doesn’t allow an overfit model to have a competitive edge over other models.
# Set the seed for reproducibility
set.seed(123)
# Create a random number vector from 1 to the number of rows in the data set.
# Make the length of the vector equal to about 80% of the data.
train_index <- sample(1:nrow(data), round(nrow(data)*0.8, digits = 0))
# subset the data into training data (80% of the data)
train <- data[train_index,]
# subset the data into testing data (the remaining 20%)
test <- data[-train_index,]
Now let’s create some predictive machine learning models!
First, let’s use a classic linear regression model. We’ll start with a simple linear regression and then try some multivariate linear regression.
Simple linear regression is your classic y=b+mx. The y is your dependent variable; the variable we wish to predict. In this case, it is price. The x is you independent variable; the feature that will predict the dependent variable. In this case, we’ll start with carat which seemed to have a strong correlation with price. The m is your slope and the b is you y-intercept. The script below creates the simple linear model:
# Set the seed for reproducibility
set.seed(123)
# Create an object called model_lm1 by using the lm fucntion
model_lm1 <- lm(price ~ carat, # The formula is price as a function of carat
train) # The data is an object called train
summary(model_lm1) # Stats summary of the model
##
## Call:
## lm(formula = price ~ carat, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18610.5 -802.5 -17.3 540.3 12730.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2261.06 14.48 -156.1 <2e-16 ***
## carat 7762.38 15.61 497.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1538 on 43150 degrees of freedom
## Multiple R-squared: 0.8514, Adjusted R-squared: 0.8514
## F-statistic: 2.472e+05 on 1 and 43150 DF, p-value: < 2.2e-16
The created formula is Price = -2261.0600631 + -2261.0600631 * Carat. Now, let’s use this model to predict the values in the test data, and then analyze its effectiveness using three metrics: MAE, RMSE, and Adjusted R Square.
# Create predictions for the test data using the simple linear model
predict_lm1 <- predict(model_lm1, test)
# calculate mean absolute error of model's predictions
mae_lm1 <- round(mae(test$price, predict_lm1), digits = 2)
mae_lm1
## [1] 1032.87
# calculate root mean squared error of model's predictions
rmse_lm1 <- round(rmse(test$price, predict_lm1), digits = 2)
rmse_lm1
## [1] 1591.2
# calculate r squared
SSR_lm1 <-sse(test$price, predict_lm1) # residual sum of squares
SST_lm1 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_lm1 <- 1 - SSR_lm1/SST_lm1
# calculate adjusted r squared
RsqAdj_lm1 <- round(1 - (1-Rsq_lm1)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_lm1
## [1] 0.8411
This model predicted the test data with a mean absolute error of 1032.87, a root mean squared error of 1591.2, and a adjusted R square of 0.8411. Mean absolute error is the average of the absolute difference between actual values and the predictions. The root mean squared error is the square root of the squared difference between the predictions and actual values. The adjusted r square measures how well the actual outcomes are replicated by the regression and is adjusted (penalized) for how many independent variables are used for the model. Now let’s build a multivariate linear regression and compare the results.
The script below creates a multivariate linear regression model using all the potential variables in the data.
# Set the seed for reproducibility
set.seed(123)
# Create an object called model_lm1 by using the lm fucntion
model_lm2 <- lm(price ~ ., # The formula is price as a function of all other variables
train, # The data is an object called train
na.action = na.omit) # Exclude NAs
summary(model_lm2) # Stats summary of the model
##
## Call:
## lm(formula = price ~ ., data = train, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21720.6 -579.1 -178.8 364.6 10784.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1188.181 744.231 -1.597 0.110380
## carat 11629.283 57.356 202.754 < 2e-16 ***
## cut.L 568.775 24.948 22.798 < 2e-16 ***
## cut.Q -272.345 20.071 -13.569 < 2e-16 ***
## cut.C 107.724 17.423 6.183 6.35e-10 ***
## cut^4 -18.884 13.768 -1.372 0.170209
## color.L 1946.220 19.171 101.518 < 2e-16 ***
## color.Q -670.107 17.428 -38.450 < 2e-16 ***
## color.C 169.228 16.288 10.390 < 2e-16 ***
## color^4 52.363 14.979 3.496 0.000473 ***
## color^5 96.591 14.159 6.822 9.10e-12 ***
## color^6 -38.716 12.837 -3.016 0.002563 **
## clarity.L 4025.203 33.855 118.896 < 2e-16 ***
## clarity.Q -1928.536 31.603 -61.025 < 2e-16 ***
## clarity.C 955.613 26.991 35.405 < 2e-16 ***
## clarity^4 -367.537 21.478 -17.112 < 2e-16 ***
## clarity^5 223.968 17.469 12.821 < 2e-16 ***
## clarity^6 9.599 15.157 0.633 0.526534
## clarity^7 85.135 13.369 6.368 1.93e-10 ***
## depth 59.744 10.992 5.435 5.50e-08 ***
## table -27.361 3.222 -8.492 < 2e-16 ***
## x -847.530 87.841 -9.648 < 2e-16 ***
## y 1034.730 85.048 12.166 < 2e-16 ***
## z -2234.162 169.734 -13.163 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1117 on 43113 degrees of freedom
## (15 observations deleted due to missingness)
## Multiple R-squared: 0.9215, Adjusted R-squared: 0.9215
## F-statistic: 2.201e+04 on 23 and 43113 DF, p-value: < 2.2e-16
Now let’s predict the testing data.
# Create predictions for the test data using the simple linear model
predict_lm2 <- predict(model_lm2, test,
na.action = na.omit) # exclude NAs
# calculate mean absolute error of model's predictions
mae_lm2 <- round(mae(test$price, predict_lm2), digits = 2)
mae_lm2
## [1] 1109.38
# calculate root mean squared error of model's predictions
rmse_lm2 <- round(rmse(test$price, predict_lm2), digits = 2)
rmse_lm2
## [1] 2238.97
# calculate r squared
SSR_lm2 <-sse(test$price, predict_lm2) # residual sum of squares
SST_lm2 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_lm2 <- 1 - SSR_lm2/SST_lm2
# calculate adjusted r squared
RsqAdj_lm2 <- round(1 - (1-Rsq_lm2)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_lm2
## [1] 0.6854
Interesting! Adding more variables, led to a worse model. It has a higher MAE & RMSE, and a lower adjusted R Square. What if we only used the 4 C’s. #### Price vs 4 C’s
# Set the seed for reproducibility
set.seed(123)
# Create an object called model_lm1 by using the lm fucntion
model_lm3 <- lm(price ~ carat + cut + color + clarity, # The formula is price as a function of all other variables
train, # The data is an object called train
na.action = na.omit) # Exclude NAs
summary(model_lm3) # Stats summary of the model
##
## Call:
## lm(formula = price ~ carat + cut + color + clarity, data = train,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16834.6 -678.1 -195.8 467.3 10424.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3707.235 15.566 -238.163 < 2e-16 ***
## carat 8879.929 13.384 663.469 < 2e-16 ***
## cut.L 709.492 22.659 31.311 < 2e-16 ***
## cut.Q -328.882 19.957 -16.480 < 2e-16 ***
## cut.C 170.384 17.325 9.835 < 2e-16 ***
## cut^4 6.604 13.861 0.476 0.63375
## color.L 1892.293 19.704 96.035 < 2e-16 ***
## color.Q -612.504 17.915 -34.189 < 2e-16 ***
## color.C 174.145 16.778 10.379 < 2e-16 ***
## color^4 31.726 15.426 2.057 0.03973 *
## color^5 84.454 14.584 5.791 7.05e-09 ***
## color^6 -41.302 13.224 -3.123 0.00179 **
## clarity.L 4176.710 34.628 120.616 < 2e-16 ***
## clarity.Q -1833.327 32.434 -56.525 < 2e-16 ***
## clarity.C 900.704 27.727 32.485 < 2e-16 ***
## clarity^4 -367.479 22.111 -16.620 < 2e-16 ***
## clarity^5 208.358 17.975 11.591 < 2e-16 ***
## clarity^6 6.884 15.614 0.441 0.65932
## clarity^7 104.862 13.768 7.616 2.66e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1151 on 43133 degrees of freedom
## Multiple R-squared: 0.9168, Adjusted R-squared: 0.9167
## F-statistic: 2.639e+04 on 18 and 43133 DF, p-value: < 2.2e-16
# Create predictions for the test data using the simple linear model
predict_lm3 <- predict(model_lm3, test,
na.action = na.omit) # exclude NAs
# calculate mean absolute error of model's predictions
mae_lm3 <- round(mae(test$price, predict_lm3), digits = 2)
mae_lm3
## [1] 809.05
# calculate root mean squared error of model's predictions
rmse_lm3 <- round(rmse(test$price, predict_lm3), digits = 2)
rmse_lm3
## [1] 1180
# calculate r squared
SSR_lm3 <-sse(test$price, predict_lm3) # residual sum of squares
SST_lm3 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_lm3 <- 1 - SSR_lm3/SST_lm3
# calculate adjusted r squared
RsqAdj_lm3 <- round(1 - (1-Rsq_lm3)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_lm3
## [1] 0.9126
This is looking better! What about just depth and table? I have a hunch these two variables are going to be the worst predictors.
# Set the seed for reproducibility
set.seed(123)
# Create an object called model_lm1 by using the lm fucntion
model_lm4 <- lm(price ~ depth + table, # The formula is price as a function of all other variables
train, # The data is an object called train
na.action = na.omit) # Exclude NAs
summary(model_lm4) # Stats summary of the model
##
## Call:
## lm(formula = price ~ depth + table, data = train, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7305 -2746 -1485 1351 15742
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14408.80 1124.59 -12.812 < 2e-16 ***
## depth 74.19 13.93 5.326 1.01e-07 ***
## table 239.38 8.92 26.835 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3956 on 43149 degrees of freedom
## Multiple R-squared: 0.01659, Adjusted R-squared: 0.01654
## F-statistic: 363.9 on 2 and 43149 DF, p-value: < 2.2e-16
# Create predictions for the test data using the simple linear model
predict_lm4 <- predict(model_lm4, test,
na.action = na.omit) # exclude NAs
# calculate mean absolute error of model's predictions
mae_lm4 <- round(mae(test$price, predict_lm4), digits = 2)
mae_lm4
## [1] 2984.07
# calculate root mean squared error of model's predictions
rmse_lm4 <- round(rmse(test$price, predict_lm4), digits = 2)
rmse_lm4
## [1] 3955.46
# calculate r squared
SSR_lm4 <-sse(test$price, predict_lm4) # residual sum of squares
SST_lm4 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_lm4 <- 1 - SSR_lm4/SST_lm4
# calculate adjusted r squared
RsqAdj_lm4 <- round(1 - (1-Rsq_lm4)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_lm4
## [1] 0.0182
Just as expected. This is not a good model at all! How about x, y, and z?
# Set the seed for reproducibility
set.seed(123)
# Create an object called model_lm1 by using the lm fucntion
model_lm5 <- lm(price ~ x + y + z, # The formula is price as a function of all other variables
train, # The data is an object called train
na.action = na.omit) # Exclude NAs
summary(model_lm5) # Stats summary of the model
##
## Call:
## lm(formula = price ~ x + y + z, data = train, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18680.9 -1276.4 -183.8 997.9 12058.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14282.28 46.16 -309.441 < 2e-16 ***
## x 1246.88 110.82 11.251 < 2e-16 ***
## y 1588.43 108.43 14.650 < 2e-16 ***
## z 553.47 96.83 5.716 1.1e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1826 on 43133 degrees of freedom
## (15 observations deleted due to missingness)
## Multiple R-squared: 0.7901, Adjusted R-squared: 0.7901
## F-statistic: 5.414e+04 on 3 and 43133 DF, p-value: < 2.2e-16
# Create predictions for the test data using the simple linear model
predict_lm5 <- predict(model_lm5, test,
na.action = na.omit) # exclude NAs
# calculate mean absolute error of model's predictions
mae_lm5 <- round(mae(test$price, predict_lm5), digits = 2)
mae_lm5
## [1] 1724.47
# calculate root mean squared error of model's predictions
rmse_lm5 <- round(rmse(test$price, predict_lm5), digits = 2)
rmse_lm5
## [1] 2634.51
# calculate r squared
SSR_lm5 <-sse(test$price, predict_lm5) # residual sum of squares
SST_lm5 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_lm5 <- 1 - SSR_lm5/SST_lm5
# calculate adjusted r squared
RsqAdj_lm5 <- round(1 - (1-Rsq_lm5)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_lm5
## [1] 0.5645
Not the best, but does it help or hurt if we add it to the 4 C’s?
# Set the seed for reproducibility
set.seed(123)
# Create an object called model_lm1 by using the lm fucntion
model_lm6 <- lm(price ~ carat + cut + color + clarity + x + y + z, # The formula is price as a function of all other variables
train, # The data is an object called train
na.action = na.omit) # Exclude NAs
summary(model_lm6) # Stats summary of the model
##
## Call:
## lm(formula = price ~ carat + cut + color + clarity + x + y +
## z, data = train, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21880.6 -581.5 -182.5 366.7 10844.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 926.4334 95.3963 9.711 < 2e-16 ***
## carat 11624.4260 57.3676 202.630 < 2e-16 ***
## cut.L 633.2127 23.1566 27.345 < 2e-16 ***
## cut.Q -259.1068 20.0620 -12.915 < 2e-16 ***
## cut.C 139.1130 17.1617 8.106 5.37e-16 ***
## cut^4 -0.7948 13.6257 -0.058 0.953487
## color.L 1944.6314 19.1945 101.312 < 2e-16 ***
## color.Q -671.9649 17.4511 -38.506 < 2e-16 ***
## color.C 170.3164 16.3088 10.443 < 2e-16 ***
## color^4 53.5914 14.9982 3.573 0.000353 ***
## color^5 97.9259 14.1777 6.907 5.02e-12 ***
## color^6 -39.3243 12.8542 -3.059 0.002220 **
## clarity.L 4029.5104 33.8768 118.946 < 2e-16 ***
## clarity.Q -1931.5841 31.6425 -61.044 < 2e-16 ***
## clarity.C 959.4578 27.0260 35.501 < 2e-16 ***
## clarity^4 -368.4433 21.5072 -17.131 < 2e-16 ***
## clarity^5 225.5876 17.4910 12.897 < 2e-16 ***
## clarity^6 10.1858 15.1763 0.671 0.502119
## clarity^7 84.5724 13.3871 6.317 2.69e-10 ***
## x -1204.6756 73.2082 -16.455 < 2e-16 ***
## y 717.0309 69.6271 10.298 < 2e-16 ***
## z -1145.2097 65.4820 -17.489 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1119 on 43115 degrees of freedom
## (15 observations deleted due to missingness)
## Multiple R-squared: 0.9213, Adjusted R-squared: 0.9213
## F-statistic: 2.403e+04 on 21 and 43115 DF, p-value: < 2.2e-16
# Create predictions for the test data using the simple linear model
predict_lm6 <- predict(model_lm6, test,
na.action = na.omit) # exclude NAs
# calculate mean absolute error of model's predictions
mae_lm6 <- round(mae(test$price, predict_lm6), digits = 2)
mae_lm6
## [1] 1111.51
# calculate root mean squared error of model's predictions
rmse_lm6 <- round(rmse(test$price, predict_lm6), digits = 2)
rmse_lm6
## [1] 2239.09
# calculate r squared
SSR_lm6 <-sse(test$price, predict_lm6) # residual sum of squares
SST_lm6 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_lm6 <- 1 - SSR_lm6/SST_lm6
# calculate adjusted r squared
RsqAdj_lm6 <- round(1 - (1-Rsq_lm6)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_lm6
## [1] 0.6854
There’s our answer. It hurts, doesn’t help. It looks like the 4 C’s are the best predictors. Now let’s try a ML model other than linear regression.
set.seed(123)
model_dt1 <- rpart(formula = price ~ cut + carat + color + clarity,
data = train,
method = "anova")
print(model_dt1)
## n= 43152
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 43152 686526200000 3926.589
## 2) carat< 0.995 27934 34904930000 1631.023
## 4) carat< 0.625 19871 5283100000 1050.602 *
## 5) carat>=0.625 8063 6429671000 3061.449 *
## 3) carat>=0.995 15218 234217800000 8140.306
## 6) carat< 1.475 10245 47971710000 6129.327
## 12) clarity=I1,SI2,SI1,VS2 7773 14496730000 5363.278 *
## 13) clarity=VS1,VVS2,VVS1,IF 2472 14570490000 8538.104 *
## 7) carat>=1.475 4973 59461550000 12283.170
## 14) carat< 1.915 3210 27894050000 10867.790 *
## 15) carat>=1.915 1763 13428350000 14860.240 *
rpart.plot(x=model_dt1,
yesno = 2, type = 0, extra = 0) # personal preference on how the plot is displayed
# Create predictions for the test data using the simple linear model
predict_dt1 <- predict(model_dt1, test)
# calculate mean absolute error of model's predictions
mae_dt1 <- round(mae(test$price, predict_dt1), digits = 2)
mae_dt1
## [1] 912.34
# calculate root mean squared error of model's predictions
rmse_dt1 <- round(rmse(test$price, predict_dt1), digits = 2)
rmse_dt1
## [1] 1433.26
# calculate r squared
SSR_dt1 <-sse(test$price, predict_dt1) # residual sum of squares
SST_dt1 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_dt1 <- 1 - SSR_dt1/SST_dt1
# calculate adjusted r squared
RsqAdj_dt1 <- round(1 - (1-Rsq_dt1)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_dt1
## [1] 0.8711
set.seed(123)
model_dt2 <- rpart(formula = price ~ .,
data = train,
method = "anova")
print(model_dt2)
## n= 43152
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 43152 686526200000 3926.589
## 2) carat< 0.995 27934 34904930000 1631.023
## 4) y< 5.535 19999 5440120000 1056.695 *
## 5) y>=5.535 7935 6242051000 3078.531 *
## 3) carat>=0.995 15218 234217800000 8140.306
## 6) y< 7.195 10296 47444790000 6123.230
## 12) clarity=I1,SI2,SI1,VS2 7858 15736360000 5389.146 *
## 13) clarity=VS1,VVS2,VVS1,IF 2438 13825520000 8489.281 *
## 7) y>=7.195 4922 57255330000 12359.690
## 14) y< 7.865 3204 27464010000 10959.290 *
## 15) y>=7.865 1718 11789580000 14971.380 *
rpart.plot(x=model_dt2,
yesno = 2, type = 0, extra = 0) # personal preference on how the plot is displayed
# Create predictions for the test data using the simple linear model
predict_dt2 <- predict(model_dt2, test)
# calculate mean absolute error of model's predictions
mae_dt2 <- round(mae(test$price, predict_dt2), digits = 2)
mae_dt2
## [1] 915.34
# calculate root mean squared error of model's predictions
rmse_dt2 <- round(rmse(test$price, predict_dt2), digits = 2)
rmse_dt2
## [1] 1438.64
# calculate r squared
SSR_dt2 <-sse(test$price, predict_dt2) # residual sum of squares
SST_dt2 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_dt2 <- 1 - SSR_dt2/SST_dt2
# calculate adjusted r squared
RsqAdj_dt2 <- round(1 - (1-Rsq_dt2)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_dt2
## [1] 0.8701
Changing the complexity parameter from 0.01 to 0.001.Any split that does not decrease the overall lack of fit by a factor of cp is not attempted. For instance, with anova splitting, this means that the overall R-squared must increase by cp at each step.
set.seed(123)
model_dt3 <- rpart(formula = price ~ .,
data = train,
method = "anova",
control = rpart.control(cp=0.001))
rpart.plot(x=model_dt3,
yesno = 2, type = 0, extra = 0) # personal preference on how the plot is displayed
# Create predictions for the test data using the simple linear model
predict_dt3 <- predict(model_dt3, test)
# calculate mean absolute error of model's predictions
mae_dt3 <- round(mae(test$price, predict_dt3), digits = 2)
mae_dt3
## [1] 589.92
# calculate root mean squared error of model's predictions
rmse_dt3 <- round(rmse(test$price, predict_dt3), digits = 2)
rmse_dt3
## [1] 986.47
# calculate r squared
SSR_dt3 <-sse(test$price, predict_dt3) # residual sum of squares
SST_dt3 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_dt3 <- 1 - SSR_dt3/SST_dt3
# calculate adjusted r squared
RsqAdj_dt3 <- round(1 - (1-Rsq_dt3)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_dt3
## [1] 0.9389
Now let’s try an even lower complexity parameter that makes a decision tree too complex to plot.
set.seed(123)
model_dt4 <- rpart(formula = price ~ .,
data = train,
method = "anova",
control = rpart.control(cp=0.000001))
# Create predictions for the test data using the simple linear model
predict_dt4 <- predict(model_dt4, test)
# calculate mean absolute error of model's predictions
mae_dt4 <- round(mae(test$price, predict_dt4), digits = 2)
mae_dt4
## [1] 326.34
# calculate root mean squared error of model's predictions
rmse_dt4 <- round(rmse(test$price, predict_dt4), digits = 2)
rmse_dt4
## [1] 633.96
# calculate r squared
SSR_dt4 <-sse(test$price, predict_dt4) # residual sum of squares
SST_dt4 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_dt4 <- 1 - SSR_dt4/SST_dt4
# calculate adjusted r squared
RsqAdj_dt4 <- round(1 - (1-Rsq_dt4)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_dt4
## [1] 0.9748
Let’s create a random forest model with the 4 C’s.
set.seed(123)
model_rf1 <- randomForest(price ~ carat + cut + color + clarity,
train)
model_rf1
##
## Call:
## randomForest(formula = price ~ carat + cut + color + clarity, data = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 1982524
## % Var explained: 87.54
plot(model_rf1)
It looks like there 100 trees (iterations) are enough for this model.
# Create predictions for the test data using the simple linear model
predict_rf1 <- predict(model_rf1, test)
# calculate mean absolute error of model's predictions
mae_rf1 <- round(mae(test$price, predict_rf1), digits = 2)
mae_rf1
## [1] 934.28
# calculate root mean squared error of model's predictions
rmse_rf1 <- round(rmse(test$price, predict_rf1), digits = 2)
rmse_rf1
## [1] 1403.13
# calculate r squared
SSR_rf1 <-sse(test$price, predict_rf1) # residual sum of squares
SST_rf1 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_rf1 <- 1 - SSR_rf1/SST_rf1
# calculate adjusted r squared
RsqAdj_rf1 <- round(1 - (1-Rsq_rf1)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_rf1
## [1] 0.8765
It’s good, but not as good as the linear regression. What if we use all the data?
set.seed(123)
model_rf2 <- randomForest(price ~ .,
train,
na.action = na.omit, # remove NAs
ntree=300) # only create 300 trees / iterations
model_rf2
##
## Call:
## randomForest(formula = price ~ ., data = train, ntree = 300, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 300
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 294825.6
## % Var explained: 98.14
plot(model_rf2)
It looks like there 100 trees (iterations) are enough for this model too.
# Create predictions for the test data using the simple linear model
predict_rf2 <- predict(model_rf2, test)
# Create new tibble for test that includes predictions
# and filters out NAs
test_rf2 <- test %>%
mutate(pred = predict_rf2) %>%
filter(is.na(pred)==F)
# calculate mean absolute error of model's predictions
mae_rf2 <- round(mae(test_rf2$price, test_rf2$pred), digits = 2)
mae_rf2
## [1] 275.41
# calculate root mean squared error of model's predictions
rmse_rf2 <- round(rmse(test_rf2$price, test_rf2$pred), digits = 2)
rmse_rf2
## [1] 564.83
# calculate r squared
SSR_rf2 <-sse(test_rf2$price, test_rf2$pred) # residual sum of squares
SST_rf2 <-sse(test_rf2$price, mean(test_rf2$price)) # total sum of squares
Rsq_rf2 <- 1 - SSR_rf2/SST_rf2
# calculate adjusted r squared
RsqAdj_rf2 <- round(1 - (1-Rsq_rf2)*((nrow(test_rf2)-1)/(nrow(test_rf2)-1-1)), digits = 4)
RsqAdj_rf2
## [1] 0.98
This model out-performs the best linear model (Price vs The 4 C’s). The RMSE is 1180 for the linear model, and 564.83 for this random forest model. The adjusted R square is 0.9126361 for the linear model and 0.9799764 for the random forest model. However, the down side to the random forest model is that it can only use complete data. If there are NAs for x, y, and z, as we saw in the raw data, then this random forest model cannot predict price, and the linear model would have to be used.
set.seed(123)
model_gbm1 <- gbm(formula = price ~ carat + cut + color + clarity,
data = train,
distribution = "gaussian",
n.trees = 1000)
Below shows the relative importance of each predictive variable.
summary(model_gbm1)
## var rel.inf
## carat carat 95.2463663
## clarity clarity 3.0094951
## color color 1.5262655
## cut cut 0.2178731
It looks like carat has a lot more predictive power than clarity, color, and cut.
print(model_gbm1)
## gbm(formula = price ~ carat + cut + color + clarity, distribution = "gaussian",
## data = train, n.trees = 1000)
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## There were 4 predictors of which 4 had non-zero influence.
Although carat is the best predictor, all four features had a non-zero influence.
# Create predictions for the test data using the simple linear model
predict_gbm1 <- predict(model_gbm1, test)
# calculate mean absolute error of model's predictions
mae_gbm1 <- round(mae(test$price, predict_gbm1), digits = 2)
mae_gbm1
## [1] 694.97
# calculate root mean squared error of model's predictions
rmse_gbm1 <- round(rmse(test$price, predict_gbm1), digits = 2)
rmse_gbm1
## [1] 1065.85
# calculate r squared
SSR_gbm1 <-sse(test$price, predict_gbm1) # residual sum of squares
SST_gbm1 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_gbm1 <- 1 - SSR_gbm1/SST_gbm1
# calculate adjusted r squared
RsqAdj_gbm1 <- round(1 - (1-Rsq_gbm1)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_gbm1
## [1] 0.9287
This isn’t too bad, but not as good as the Random Forest model.
set.seed(123)
model_gbm2 <- gbm(formula = price ~ .,
data = train,
distribution = "gaussian",
n.trees = 1000)
summary(model_gbm2)
## var rel.inf
## y y 67.63546385
## carat carat 15.46431048
## z z 8.59241920
## x x 4.03740092
## clarity clarity 2.66925283
## color color 1.46180531
## depth depth 0.10285109
## cut cut 0.02739915
## table table 0.00909717
y is the best predictor followed by carat, z, and then x. It looks like all the numeric values are the best predictors.
print(model_gbm2)
## gbm(formula = price ~ ., distribution = "gaussian", data = train,
## n.trees = 1000)
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## There were 9 predictors of which 9 had non-zero influence.
All features have a non-zero influence.
# Create predictions for the test data using the simple linear model
predict_gbm2 <- predict(model_gbm2, test)
# calculate mean absolute error of model's predictions
mae_gbm2 <- round(mae(test$price, predict_gbm2), digits = 2)
mae_gbm2
## [1] 679.36
# calculate root mean squared error of model's predictions
rmse_gbm2 <- round(rmse(test$price, predict_gbm2), digits = 2)
rmse_gbm2
## [1] 1062.52
# calculate r squared
SSR_gbm2 <-sse(test$price, predict_gbm2) # residual sum of squares
SST_gbm2 <-sse(test$price, mean(test$price)) # total sum of squares
Rsq_gbm2 <- 1 - SSR_gbm2/SST_gbm2
# calculate adjusted r squared
RsqAdj_gbm2 <- round(1 - (1-Rsq_gbm2)*((nrow(test)-1)/(nrow(test)-1-1)), digits = 4)
RsqAdj_gbm2
## [1] 0.9292
This model is only slightly better than the Price vs 4 C’s model.
The following parameters can be tuned to increase the performance of a GBM model:
1. n.trees: number of trees
2. bag.fraction: proportion of observations to be sampled in each tree
3. n.minobsinnode: minimum number of observations in the trees terminal nodes
4. interaction.depth: maximum nodes per tree
5. shrinkage: learning rate
The script below determines the optimal ntree based on Out-of-Bag (OOB) error.
ntree_opt_oob <- gbm.perf(model_gbm2, method = "OOB", oobag.curve = T)
ntree_opt_oob
## [1] 335
## attr(,"smoother")
## Call:
## loess(formula = object$oobag.improve ~ x, enp.target = min(max(4,
## length(x)/10), 50))
##
## Number of Observations: 1000
## Equivalent Number of Parameters: 40
## Residual Standard Error: 31340
The script below determines the optimal ntree based on the Cross Validation (CV) error with 5 cross validation folds.
set.seed(123)
model_gbm2_cv <- gbm(formula = price ~ .,
data = train,
distribution = "gaussian",
n.trees = 1000,
cv.folds = 5)
ntree_opt_cv <- gbm.perf(model_gbm2_cv, method = "cv")
ntree_opt_cv
## [1] 949
Now time to compare the predictions between the two options for ntree.
pred_oob <- predict(model_gbm2, test, n.trees = ntree_opt_oob)
rmse(test$price, pred_oob)
## [1] 1072.422
pred_cv <- predict(model_gbm2, test, n.trees = ntree_opt_cv)
rmse(test$price, pred_cv)
## [1] 1063.767
Based on the cross validation error, the optimal number of trees when predicting should be 949. However, this RMSE is less than the RMSE of the original model. Therfore, this ntree should be left as is.
The script below combines combines all the metrics into one tibble.
model_compare <- tibble(
ModelName = c(
"model_dt1",
"model_dt2",
"model_dt3",
"model_dt4",
"model_gbm1",
"model_gbm2",
"model_lm1",
"model_lm2",
"model_lm3",
"model_lm4",
"model_lm5",
"model_lm6",
"model_rf1",
"model_rf2"
),
ModelType = c(
"Decision Tree",
"Decision Tree",
"Decision Tree",
"Decision Tree",
"Gradient Boosting Machine",
"Gradient Boosting Machine",
"Linear Regression",
"Linear Regression",
"Linear Regression",
"Linear Regression",
"Linear Regression",
"Linear Regression",
"Random Forest",
"Random Forest"
),
RMSE = c(
rmse_dt1,
rmse_dt2,
rmse_dt3,
rmse_dt4,
rmse_gbm1,
rmse_gbm2,
rmse_lm1,
rmse_lm2,
rmse_lm3,
rmse_lm4,
rmse_lm5,
rmse_lm6,
rmse_rf1,
rmse_rf2
),
MAE = c(
mae_dt1,
mae_dt2,
mae_dt3,
mae_dt4,
mae_gbm1,
mae_gbm2,
mae_lm1,
mae_lm2,
mae_lm3,
mae_lm4,
mae_lm5,
mae_lm6,
mae_rf1,
mae_rf2
),
AdjRSquare = c(
RsqAdj_dt1,
RsqAdj_dt2,
RsqAdj_dt3,
RsqAdj_dt4,
RsqAdj_gbm1,
RsqAdj_gbm2,
RsqAdj_lm1,
RsqAdj_lm2,
RsqAdj_lm3,
RsqAdj_lm4,
RsqAdj_lm5,
RsqAdj_lm6,
RsqAdj_rf1,
RsqAdj_rf2
)
) %>%
arrange(RMSE, MAE, AdjRSquare) %>%
mutate(Rank = c(1:nrow(.))) %>%
select(Rank, everything())
head(model_compare, n = nrow(model_compare))
## # A tibble: 14 x 6
## Rank ModelName ModelType RMSE MAE AdjRSquare
## <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 model_rf2 Random Forest 565. 275. 0.98
## 2 2 model_dt4 Decision Tree 634. 326. 0.975
## 3 3 model_dt3 Decision Tree 986. 590. 0.939
## 4 4 model_gbm2 Gradient Boosting Machine 1063. 679. 0.929
## 5 5 model_gbm1 Gradient Boosting Machine 1066. 695. 0.929
## 6 6 model_lm3 Linear Regression 1180 809. 0.913
## 7 7 model_rf1 Random Forest 1403. 934. 0.876
## 8 8 model_dt1 Decision Tree 1433. 912. 0.871
## 9 9 model_dt2 Decision Tree 1439. 915. 0.870
## 10 10 model_lm1 Linear Regression 1591. 1033. 0.841
## 11 11 model_lm2 Linear Regression 2239. 1109. 0.685
## 12 12 model_lm6 Linear Regression 2239. 1112. 0.685
## 13 13 model_lm5 Linear Regression 2635. 1724. 0.564
## 14 14 model_lm4 Linear Regression 3955. 2984. 0.0182
The best model is model_rf2, a Random Forest model using the formula price vs carat + cut + color + clarity + depth + table + x + y + z. It has an RMSE of 564.83, a MAE of 275.41, and an adjusted R square of 0.98.
Throughout this analysis some assumptions and data cleaning steps have been made. These decisions will impact how each predictive model will perform. A sensitivity analysis may have to be completed for the following data cleaning issues:
1. Issue: x, y, and z values labeled as zeros.
Action took in this analysis: Make all zeros NAs and ignore NAs in model exploration.
2. Issue: Potential outliers for y and z.
Action took in this analysis: Estimate y and z using the depth equation
The best model is model_rf2, a Random Forest model using the formula price vs carat + cut + color + clarity + depth + table + x + y + z. It has an RMSE of 564.83, a MAE of 275.41, and an adjusted R square of 0.98.