# Predicting large and imbalanced data set using the R package tidymodels

**Modeling with R**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## Introduction

The super easy way, at least for me, to deploy machine learning models is by making use of the R package **tidymodels**, which is a collection of many packages that makes the workflow steps for your project very smooth and tightly connected to each other and easily managable in a well-structured manner.
The core packages contained in tidymodels are:

- rsample: for data splitting and rsampling.
- parsnip: Unified interface to the most common machine learning models.
- recipes: unified interface to the most common pre-processing tools for feature engineering.
- workflows: bundle the workflow steps together.
- tune: for optimization of the hyperparameters.
- yardstick: provides the most common performance metrics.
- broom: converts the outputs into user friendly formats such as tibble.
- dials: provides tools for parameter grids.
- infer: provides tools for statistical inferences.

In addition to the above apackages tidymodels contains also some classical packages such as: dplyr, ggplot2, purrr, tibble. For more detail click here.

In order to widely explore and understand the tidymodels, we should look for a noisy dataset that has large number of variables with missing values. Fortunately, I found an open source dataset that fulfils these requirements and in addition, it is highly imbalanced. This data is about **scania trucks** and can be downloaded from UCI machine learning repository with an extra file for its description.

the target variable of this data is the air pressure system **APS** in the truck that generates the pressurized air that are utilized in various function in the truck. It has two classes: positive **pos** if a component failures due to a failures in the APS system, negative **neg** if a component failures are not related to the APS system. This means that we are dealing with binary classification problem.

## Data exploration

The data is already separated into training and testing set from the source, so let’s call the packages that we need and the data.

ssh <- suppressPackageStartupMessages ssh(library(readr)) ssh(library(caret)) ssh(library(themis)) ssh(library(tidymodels)) train <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_training_set.csv", skip = 20) test <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_test_set.csv", skip = 20)

Notice that the data is a tibble where the first twenty rows are a mix of rows that contain some text descritpion and empty rows. and the 21th row contains the column names. That is why we have set the **skip** argument equals to **20**, and for the 21th column by default has been read as colnames `col_names = TRUE`

.

### Summary of the variables

First let’s check the dimension of the two sets to be aware of what we are dealing with.

dim(train) [1] 60000 171 dim(test) [1] 16000 171

The training set has 60000 rows and 171 variables which is moderately large dataset. Inspecting thus this data by the usual functions such as **summary**, **str** would give heavy and not easily readable outputs. the best alternative However is by extracting the most important information that is required for building any machine learning model in aggregated way, for instance, what type of variables it has, some statistics about the variable values, the missing values..etc.

map_chr(train, typeof) %>% tibble() %>% table() . character double 170 1

Strangely, all the variables but one are characters, which is in contradiction with the description of this data from the file description. To figure out what is going on we display some few rows and some few columns.

train[1:5,1:7] # A tibble: 5 x 7 class aa_000 ab_000 ac_000 ad_000 ae_000 af_000 <chr> <dbl> <chr> <chr> <chr> <chr> <chr> 1 neg 76698 na 2130706438 280 0 0 2 neg 33058 na 0 na 0 0 3 neg 41040 na 228 100 0 0 4 neg 12 0 70 66 0 10 5 neg 60874 na 1368 458 0 0

I think, the problem is that the missing values in the data indicated by **na** are not recognized as missing values, instead they are treated as characters and this what makes the function **read_csv** coerces every variable that has this **na** values to character type. To fix this problem we can either go back and set the **na** argument to **“na”**, or we set the missing values by hand as follows.

train[-1] <- train[-1] %>% modify(~replace(., .=="na", NA)) %>% modify(., as.double)

Now let’s check again

map_chr(train, typeof) %>% tibble() %>% table() . character double 1 170

The first column excluded above is our target variable **class**. We should not forget to do the same transformation to the test set.

test[-1] <- test[-1] %>% modify(~replace(., .=="na", NA)) %>% modify(., as.double)

If we try to apply the **summary** function on the entire variables (170), we will spent a lot of time to read the summary of each variable without much gain. Instead, we try to get an automated way to get only the information needed to build efficiently our model. To decide whether we should normalize the data or not, for instance, we display the standard deviances of all the variable in decreasing order.

**Note**: with tree based models we do not need neither normalize the data nor converting factors to dummies.

map_dbl(train[-1], sd, na.rm=TRUE) %>% tibble(sd = .) %>% arrange(-sd) # A tibble: 170 x 1 sd <dbl> 1 794874918. 2 97484780. 3 42746746. 4 40404413. 5 40404412. 6 40404411. 7 11567771. 8 10886737. 9 10859905. 10 10859904. # ... with 160 more rows

We have very large variability, which means that the data should be normalized for any machine learning model that uses gradient descent or based on class distances.

Another thing we can check is if some variabels have small number of unique values which can hence be converted to factor type.

map(train[-1], unique) %>% lengths(.) %>% sort(.) %>% head(5) cd_000 ch_000 as_000 ef_000 ab_000 2 3 22 29 30

To make things simple we consider only the first two ones to be converted to factor type.

the first one is constant which is of type zero variance because its variance equals to zero, and the second one should be converted to factor type with two levels (for the two sets), but since it has large missing values we will decide about it later on. Notice that we do not apply theses transformations here because they will be combined at once with all the required transformations as what will be shown shortly.

### Missing values

The best way to deal with missing values depends on their number compared to the dataset size. if we have small number then it would be easier to completely remove them from the data, if in contrast we have large number then the best choice is to impute them using one of the common methods designed for this type of issue.

dim(train[!complete.cases(train),]) [1] 59409 171

As we see almost every row contains at least one missing value in some columns. Let’s check the distribution of missing values within columns.

df <- modify(train[-1], is.na) %>% colSums() %>% tibble(names = colnames(train[-1]),missing_values=.) %>% arrange(-missing_values) df # A tibble: 170 x 2 names missing_values <chr> <dbl> 1 br_000 49264 2 bq_000 48722 3 bp_000 47740 4 bo_000 46333 5 ab_000 46329 6 cr_000 46329 7 bn_000 44009 8 bm_000 39549 9 bl_000 27277 10 bk_000 23034 # ... with 160 more rows

I think the best strategy is to first remove columns that have a large number of missing values then we impute the rest, thereby we reduce the number of predictors and the number of missing values at ones. The following script keep the predictors that have a number of missing values less than **10000**.

names <- modify(train[-1], is.na) %>% colSums() %>% tibble(names = colnames(train[-1]), missing_values=.) %>% filter(missing_values < 10000) %>% select(1) train1 <- train[c("class",names$names)] test1 <- test[c("class",names$names)]

An important thing should be noted here is that, if we use imputation methods that use information from all other columns and/or rows to predict the current missing value, therefore the data must be first split between training and testing sets before any imputation, to abide by the crucial rule of machine learning: the test data should never be seen by the model during training process.
Fortunately, our data is already split so that the imputation can be done separately. However, the imputation methods will be implemented later on by the help of the **recipes** package where we bundle all the pre-processing steps together.
**Note**: the above **ch_000** was removed since it did not fulfill the required threshold.

### imbalanced data

Another important issue that we face when predicting this data is the **imbalanced** problem.

prop.table(table(train1$class)) neg pos 0.98333333 0.01666667

This data is highly imbalanced, which tends to make even the worst machine learning model gives very high accuracy rate. In other words, if we do not use any model and predict every class as the largest class label (in our case negative) the accuracy rate will be approximately equal to the proportion of the largest class (in our case 98%), which is very big misleading result. Moreover, this misleading result can be catastrophic if we are more interested to predict the small class (in our case positive) such as detecting fraudulent credit cards. If you would like to get more detail about how to deal with imbalanced data please check this article.

## building the recipe

Our initial model will be the **random forest** wich is the most popular one . So the first step to build our model is by defining our model with the **engine**, which is the method (or the package) used to fit this model, and the **mode** with two possible values **classification** or **regression**. In our case, for instance, there exists two available engines: **randomForest** or **ranger**. Notice that the **parsnip** package who provides these settings. For more detail about all the models available click here.

**Note**: To speed up the computation process we restrict the forest to **100** trees instead of the default **500**.

rf <- rand_forest(trees = 100) %>% set_engine("ranger", num.threads=3, seed = 123) %>% set_mode("classification")

Most machine learning models require pre-processed data with some feature engineering. Traditionally, R has (and some other packages such as dplyr and stringr) provides a wide range of functions such that we can do almost every kind of feature engineering. However, if we have many different transformations to perform then they will be done separately and it will be a little cumbersome to repeat the same scripts again for testing set for instance. Therefore, the **recipes** package provides an easy way to combine all the transformations and other features related to the model, such as selecting the predictors that should be included, identifiers, …etc, as a single block that can be used for any other subset of the data.

For our case we will apply the following transformations:

- Imputing the missing values by the median of the corresponding variable since we have only numeric variables (for simplicity).
- removing variables that have zero variance (variable that has one unique value).
- removing highly correlated predictor using threshold
**0.8**. - Normalizing the data (even we do not need it in this model but we add this step since this recipe will be used with other models that use gradient decent or distances calculations).
- using the subsampling method
**smote**to create a balanced data. Notice that the**smote**method is provided by the package**themis**

To combine all these operations together we call the function **recipe**.

data_rec <- recipe(class~., data=train1) %>% step_medianimpute(all_predictors() , seed_val = 111) %>% step_zv(all_predictors()) %>% step_corr(all_predictors(), threshold = 0.8) %>% step_normalize(all_predictors()) %>% step_smote(class)

As you see everything combined nicely and elegantly. However, this recipe transformed nothing yet, it just recorded the formula, the predictors and the transformations that should be applied. This means that we can update, at any time before fitting our model, the formula, add or remove some steps. The super interesting feature of recipe is that we can apply it to any other data (than that mentioned above, train) provided that has the same variable names. In case you want to apply these transformations to the training data use the **prep** function, and to retrieve the results use the function **juice**, and for other data use **bake** after **prep** to be able to apply some parameters from the training data, for instance, when we normalize the data this function lets us use the mean of predictors computed from the training data rather than from the testing data. However, in our case, we will combine everything until the model fitting step.

For more detail about all the steps available click here.

## Building the workflow

To well organize our workflow in a structured and smoother way, we use the **workflow** package that is one of the tidymodels collection.

rf_wf <- workflow() %>% add_model(rf) %>% add_recipe(data_rec) rf_wf == Workflow ====================================== Preprocessor: Recipe Model: rand_forest() -- Preprocessor ---------------------------------- 5 Recipe Steps * step_medianimpute() * step_zv() * step_corr() * step_normalize() * step_smote() -- Model ----------------------------------------- Random Forest Model Specification (classification) Main Arguments: trees = 100 Engine-Specific Arguments: num.threads = 3 seed = 123 Computational engine: ranger

## random forest model

Now we can run everything at once, the recipe and the model, notice that here we can also update, add or remove some elements before going ahead and fit the model.

### model training

Everything now is ready to run our model with the default values.

model_rf <- rf_wf %>% fit(data = train1)

We can extract the summary of this model as follows

model_rf %>% pull_workflow_fit() parsnip model object Fit time: 49.9s Ranger result Call: ranger::ranger(formula = formula, data = data, num.trees = ~100, num.threads = ~3, seed = ~123, verbose = FALSE, probability = TRUE) Type: Probability estimation Number of trees: 100 Sample size: 118000 Number of independent variables: 95 Mtry: 9 Target node size: 10 Variable importance mode: none Splitrule: gini OOB prediction error (Brier s.): 0.003974456

This model has created 100 trees and has chosen randomly 9 predictors with each tree. with these settings thus we do obtain very low oob error rate which is 0.4% (accuracy rate 99.6% ). However, be cautious with such high accuracy rate, since, in practice, This result may highly related to an overfitting problem. Last thing I want to mention about this output, by looking at the confusion matrix, is the fact that we have now balanced data.

### model evaluation

The best way to evaluate our model is by using the testing set. Notice that the **yardstick** provides bunch of metrics to use, but let’s use the most popular one for classification problems **accuracy**.

model_rf %>% predict( new_data = test1) %>% bind_cols(test1["class"]) %>% accuracy(truth= as.factor(class), .pred_class) # A tibble: 1 x 3 .metric .estimator .estimate <chr> <chr> <dbl> 1 accuracy binary 0.990

with this model we get high accuracy which is very closer to the previous one. However, we should not forget that we are dealing with imbalanced data, and even though we have used subsampling methods (like smote method used here), they do not completely solve this issue, they can only minimize it at certain level and this is the reason why we have many of these methods. Therefore, it is better to use the confusion matrix from the **caret** package since it gives more information.

caret::confusionMatrix(as.factor(test1$class), predict(model_rf, new_data = test1)$.pred_class) Confusion Matrix and Statistics Reference Prediction neg pos neg 15535 90 pos 62 313 Accuracy : 0.9905 95% CI : (0.9889, 0.9919) No Information Rate : 0.9748 P-Value [Acc > NIR] : < 2e-16 Kappa : 0.7998 Mcnemar's Test P-Value : 0.02853 Sensitivity : 0.9960 Specificity : 0.7767 Pos Pred Value : 0.9942 Neg Pred Value : 0.8347 Prevalence : 0.9748 Detection Rate : 0.9709 Detection Prevalence : 0.9766 Balanced Accuracy : 0.8863 'Positive' Class : neg

As said shortly, the specificity rate related to the minor class **77%** is very low compared to the major class **99%**, and You can think of this as a partial overfitting towards the major class. So if we are more interested to the minor class (which is often the case) then we have go back to our model and try tuning our model or try another subsampling method.

### Model tuning:

For model tuning we try other values for some arguments rather than the default vaues. and leave the tuning for some others to the **dials** package. So let’s try the following argument values:

- num.trees = 100. The default is 500.
- num.threads = 3. The default is 1.

And tune the following:

- mtry = tune(). The default is square root of the number of the variables.
- min_n = tune(). The default is 1.

First, we define the model with these new arguments.

model_tune <- rand_forest(trees= 100, mtry=tune(), min_n = tune()) %>% set_engine("ranger", num.threads=3, seed=123) %>% set_mode("classification")

Since in grid search the two arguments mtry and min_n are data dependent, then we should at least specify their ranges.

grid <- grid_regular(mtry(range = c(9,15)), min_n(range = c(5,40)), levels = 3) grid # A tibble: 9 x 2 mtry min_n <int> <int> 1 9 5 2 12 5 3 15 5 4 9 22 5 12 22 6 15 22 7 9 40 8 12 40 9 15 40

By setting the levels equal to 3 we get 9 combinations and hence 9 models will be trained. The above recipe has steps that should not be repeated many times when tuning the model, we apply therefore the recipe to the training data in order to get the transformed data, and do not forget to apply the recipe to the testing data.

train2 <- prep(data_rec) %>% juice() test2 <- prep(data_rec) %>% bake(test1)

To tune our model we use cross validation technique. since we have large data set we use only 3 folds.

set.seed(111) fold <- vfold_cv(train2, v = 3, strata = class)

Now we bundle our recipe with the specified model.

tune_wf <- workflow() %>% add_model(model_tune) %>% add_formula(class~.)

To fit these models across the folds we use the **tune_grid** function instead of **fit**.

tune_rf <- tune_wf %>% tune_grid(resamples = fold, grid = grid)

For classification problems this function uses two metrics: accuracy and area under the ROC curve. SO we can extract the metric values as follows.

tune_rf %>% collect_metrics() # A tibble: 18 x 7 mtry min_n .metric .estimator mean n std_err <int> <int> <chr> <chr> <dbl> <int> <dbl> 1 9 5 accuracy binary 0.995 3 0.000150 2 9 5 roc_auc binary 1.00 3 0.0000105 3 9 22 accuracy binary 0.994 3 0.000170 4 9 22 roc_auc binary 1.00 3 0.0000195 5 9 40 accuracy binary 0.993 3 0.000461 6 9 40 roc_auc binary 1.00 3 0.00000961 7 12 5 accuracy binary 0.995 3 0.000290 8 12 5 roc_auc binary 1.00 3 0.0000159 9 12 22 accuracy binary 0.994 3 0.000119 10 12 22 roc_auc binary 1.00 3 0.0000257 11 12 40 accuracy binary 0.993 3 0.000239 12 12 40 roc_auc binary 1.00 3 0.00000559 13 15 5 accuracy binary 0.995 3 0.000231 14 15 5 roc_auc binary 1.00 3 0.0000162 15 15 22 accuracy binary 0.994 3 0.000180 16 15 22 roc_auc binary 1.00 3 0.0000132 17 15 40 accuracy binary 0.993 3 0.000274 18 15 40 roc_auc binary 1.00 3 0.0000184

To get the best model we have to choose one of the two metrics, so let’s go ahead with the accuracy rate.

best_param <- tune_rf %>% select_best(metric = "accuracy") best_param # A tibble: 1 x 2 mtry min_n <int> <int> 1 12 5

we can finalize the workflow with the new parameter values.

tune_wf2 <- tune_wf %>% finalize_workflow(best_param) tune_wf2 == Workflow ====================================== Preprocessor: Formula Model: rand_forest() -- Preprocessor ---------------------------------- class ~ . -- Model ----------------------------------------- Random Forest Model Specification (classification) Main Arguments: mtry = 12 trees = 100 min_n = 5 Engine-Specific Arguments: num.threads = 3 seed = 123 Computational engine: ranger

Now we fit the model with the best parameter values to the entire training data.

best_model <- tune_wf2 %>% fit(train2) best_model == Workflow [trained] ============================ Preprocessor: Formula Model: rand_forest() -- Preprocessor ---------------------------------- class ~ . -- Model ----------------------------------------- Ranger result Call: ranger::ranger(formula = formula, data = data, mtry = ~12L, num.trees = ~100, min.node.size = ~5L, num.threads = ~3, seed = ~123, verbose = FALSE, probability = TRUE) Type: Probability estimation Number of trees: 100 Sample size: 118000 Number of independent variables: 95 Mtry: 12 Target node size: 5 Variable importance mode: none Splitrule: gini OOB prediction error (Brier s.): 0.003681062

Let’s get the confusion matrix

caret::confusionMatrix(as.factor(test2$class), predict(best_model, new_data = test2)$.pred_class) Confusion Matrix and Statistics Reference Prediction neg pos neg 15535 90 pos 65 310 Accuracy : 0.9903 95% CI : (0.9887, 0.9918) No Information Rate : 0.975 P-Value [Acc > NIR] : < 2e-16 Kappa : 0.795 Mcnemar's Test P-Value : 0.05389 Sensitivity : 0.9958 Specificity : 0.7750 Pos Pred Value : 0.9942 Neg Pred Value : 0.8267 Prevalence : 0.9750 Detection Rate : 0.9709 Detection Prevalence : 0.9766 Balanced Accuracy : 0.8854 'Positive' Class : neg

As we see we do not get any improvement for the specificity rate. so let’s try another subsampling method, say **Rose** method.

rf_rose <- rand_forest(trees = 100, mtry=15, min_n = 5) %>% set_engine("ranger", num.threads=3, seed = 123) %>% set_mode("classification") data_rec2 <- recipe(class~., data=train1) %>% step_medianimpute(all_predictors() , seed_val = 111) %>% step_zv(all_predictors()) %>% step_corr(all_predictors(), threshold = 0.8) %>% step_normalize(all_predictors()) %>% step_rose(class) rf_rose_wf <- workflow() %>% add_model(rf_rose) %>% add_recipe(data_rec2) model_rose_rf <- rf_rose_wf %>% fit(data = train1) caret::confusionMatrix(as.factor(test1$class), predict(model_rose_rf, new_data = test1)$.pred_class) Confusion Matrix and Statistics Reference Prediction neg pos neg 15328 297 pos 55 320 Accuracy : 0.978 95% CI : (0.9756, 0.9802) No Information Rate : 0.9614 P-Value [Acc > NIR] : < 2.2e-16 Kappa : 0.6345 Mcnemar's Test P-Value : < 2.2e-16 Sensitivity : 0.9964 Specificity : 0.5186 Pos Pred Value : 0.9810 Neg Pred Value : 0.8533 Prevalence : 0.9614 Detection Rate : 0.9580 Detection Prevalence : 0.9766 Balanced Accuracy : 0.7575 'Positive' Class : neg

The rose method is much worse than smote method since the specificity rate has doped down to 52%.

## logistic regression model

The logistic regression is another model to fit data with binary outcome. As before we use the first recipe with smote method.

logit <- logistic_reg() %>% set_engine("glm") %>% set_mode("classification") logit_wf <- workflow() %>% add_model(logit) %>% add_recipe(data_rec) set.seed(123) model_logit <- logit_wf %>% fit(data = train1) caret::confusionMatrix(as.factor(test1$class), predict(model_logit, new_data = test1)$.pred_class) Confusion Matrix and Statistics Reference Prediction neg pos neg 14813 812 pos 27 348 Accuracy : 0.9476 95% CI : (0.944, 0.951) No Information Rate : 0.9275 P-Value [Acc > NIR] : < 2.2e-16 Kappa : 0.4333 Mcnemar's Test P-Value : < 2.2e-16 Sensitivity : 0.9982 Specificity : 0.3000 Pos Pred Value : 0.9480 Neg Pred Value : 0.9280 Prevalence : 0.9275 Detection Rate : 0.9258 Detection Prevalence : 0.9766 Balanced Accuracy : 0.6491 'Positive' Class : neg

with this model we do not get better rate for minority class than random forest model.

## Session information

sessionInfo() R version 3.6.3 (2020-02-29) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 18362) Matrix products: default locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] yardstick_0.0.6 workflows_0.1.1 tune_0.1.0 tibble_2.1.3 [5] rsample_0.0.6 purrr_0.3.3 parsnip_0.1.0 infer_0.5.1 [9] dials_0.0.6 scales_1.1.0 broom_0.5.5 tidymodels_0.1.0 [13] themis_0.1.1 recipes_0.1.9 dplyr_0.8.3 caret_6.0-85 [17] ggplot2_3.3.0 lattice_0.20-40 readr_1.3.1 loaded via a namespace (and not attached): [1] mlr_2.17.1 backports_1.1.5 fastmatch_1.1-0 [4] tidytext_0.2.3 plyr_1.8.6 igraph_1.2.4.2 [7] splines_3.6.3 crosstalk_1.0.0 listenv_0.8.0 [10] SnowballC_0.6.0 rstantools_2.0.0 inline_0.3.15 [13] digest_0.6.23 foreach_1.4.8 htmltools_0.4.0 [16] rsconnect_0.8.16 fansi_0.4.1 magrittr_1.5 [19] checkmate_2.0.0 BBmisc_1.11 unbalanced_2.0 [22] doParallel_1.0.15 globals_0.12.5 gower_0.2.1 [25] matrixStats_0.55.0 xts_0.12-0 hardhat_0.1.2 [28] prettyunits_1.1.1 colorspace_1.4-1 xfun_0.12 [31] callr_3.4.2 crayon_1.3.4 lme4_1.1-21 [34] survival_3.1-8 zoo_1.8-7 iterators_1.0.12 [37] glue_1.3.1 gtable_0.3.0 ipred_0.9-9 [40] pkgbuild_1.0.6 rstan_2.19.3 miniUI_0.1.1.1 [43] Rcpp_1.0.4 xtable_1.8-4 GPfit_1.0-8 [46] stats4_3.6.3 lava_1.6.7 StanHeaders_2.21.0-1 [49] prodlim_2019.11.13 DT_0.12 htmlwidgets_1.5.1 [52] threejs_0.3.3 FNN_1.1.3 ellipsis_0.3.0 [55] pkgconfig_2.0.3 loo_2.2.0 ParamHelpers_1.14 [58] nnet_7.3-13 utf8_1.1.4 tidyselect_1.0.0 [61] rlang_0.4.5 DiceDesign_1.8-1 reshape2_1.4.3 [64] later_1.0.0 munsell_0.5.0 tools_3.6.3 [67] cli_2.0.2 generics_0.0.2 ranger_0.12.1 [70] ggridges_0.5.2 evaluate_0.14 stringr_1.4.0 [73] fastmap_1.0.1 yaml_2.2.1 ModelMetrics_1.2.2.1 [76] processx_3.4.2 knitr_1.28 RANN_2.6.1 [79] future_1.17.0 nlme_3.1-145 mime_0.9 [82] rstanarm_2.19.3 rstudioapi_0.11 tokenizers_0.2.1 [85] compiler_3.6.3 bayesplot_1.7.1 shinythemes_1.1.2 [88] curl_4.3 e1071_1.7-3 tidyposterior_0.0.2 [91] lhs_1.0.2 stringi_1.4.6 ps_1.3.2 [94] blogdown_0.18 Matrix_1.2-18 nloptr_1.2.2 [97] markdown_1.1 shinyjs_1.1 vctrs_0.2.4 [100] pillar_1.4.3 lifecycle_0.2.0 furrr_0.1.0 [103] data.table_1.12.8 httpuv_1.5.2 R6_2.4.1 [106] bookdown_0.18 promises_1.1.0 gridExtra_2.3 [109] janeaustenr_0.1.5 codetools_0.2-16 boot_1.3-24 [112] colourpicker_1.0 MASS_7.3-51.5 gtools_3.8.1 [115] assertthat_0.2.1 ROSE_0.0-3 withr_2.1.2 [118] shinystan_2.5.0 parallel_3.6.3 hms_0.5.3 [121] grid_3.6.3 rpart_4.1-15 timeDate_3043.102 [124] minqa_1.2.4 tidyr_1.0.0 class_7.3-15 [127] rmarkdown_2.1 parallelMap_1.5.0 pROC_1.16.1 [130] tidypredict_0.4.5 shiny_1.4.0 lubridate_1.7.4 [133] base64enc_0.1-3 dygraphs_1.1.1.6

**leave a comment**for the author, please follow the link and comment on their blog:

**Modeling with R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.