Channel: SCN : All Content - SAP BusinessObjects Predictive Analytics

Custom R Component - Random Forest Classification


This component adds Random Forest classification to SAP Predictive Analysis. A random forest is an ensemble technique that builds many decision trees, each with a random element, and combines them to achieve stronger predictive power than a single tree. This comes at the cost of reduced interpretability. The classification (target) variable in this component can contain two or more levels.

 

A confusion matrix is shown automatically when training or testing the model. When the model is applied to data for which the actual classification is not known, a frequency plot of the predicted classification is displayed instead.

 

confusion.JPG

 

Disclaimer

Please note that this component is provided as-is without any guarantee or support.

 

Prerequisites

- R libraries randomForest and ggplot2 have to be installed.

- The column names must not include a minus sign.

- To avoid the R error "New factor levels not present in the training data", add an empty "Filter" component right after the data source in the analytical workflow. This changes how the levels in the datasets are managed.
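
The first two prerequisites can be checked from a plain R session before the data reaches the component. The following is a minimal sketch and not part of the component itself; the data frame `adult` and its column names are made up for illustration.

```r
# Report which of the required packages are missing on this machine.
required <- c("randomForest", "ggplot2")
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]
if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}

# Hypothetical data frame whose column names contain minus signs.
adult <- data.frame(
  "marital-status" = c("Married", "Divorced"),
  "hours-per-week" = c(40, 35),
  check.names = FALSE  # keep the minus signs so the rename is visible
)

# Replace every minus sign in the column names with an underscore.
names(adult) <- gsub("-", "_", names(adult), fixed = TRUE)
print(names(adult))
```

Renaming the columns in the source data, before it is loaded into SAP Predictive Analysis, achieves the same result.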

 

Limitations

- The algorithm does not support categorical input variables with more than 32 levels. For instance, a country field with more than 32 different countries cannot be used as an input variable.

- The test and prediction datasets must contain the same levels for all input variables as the training dataset. For instance, the test or prediction dataset must not contain a country that does not appear in the training dataset.
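
Both limitations can be checked up front in base R. This is a rough sketch under assumed inputs: the data frames `train` and `test` and their columns are hypothetical, and the limit of 32 is simply the value stated above.

```r
# Hypothetical training and test sets; names and values are illustrative only.
train <- data.frame(
  Country = factor(c("Germany", "France", "Germany", "Spain")),
  Sex     = factor(c("F", "M", "M", "F"))
)
test <- data.frame(
  Country = factor(c("Germany", "Italy")),
  Sex     = factor(c("M", "F"))
)

max_levels <- 32  # limit stated in this article

# Categorical columns with more levels than the limit.
factor_cols  <- names(train)[vapply(train, is.factor, logical(1))]
level_counts <- vapply(train[factor_cols], nlevels, integer(1))
too_many     <- names(level_counts)[level_counts > max_levels]

# For each categorical column, the test values that never occur in training.
unseen <- lapply(factor_cols, function(col) {
  setdiff(as.character(unique(test[[col]])), levels(train[[col]]))
})
names(unseen) <- factor_cols
unseen <- unseen[lengths(unseen) > 0]

print(too_many)  # columns that exceed the level limit
print(unseen)    # columns with levels missing from the training data
```

In this toy data no column exceeds the limit, while "Italy" is reported for `Country` because it appears only in the test set.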

 

Usage

These parameters can be set by the user.

- Predictor Columns: Names of the predictor columns.
- Classifier Column: Name of the target column.
- Number of Trees to grow: Number of trees that will be calculated for the random forest. Larger values typically lead to stronger models, but the calculation time will be increased.
- Minimum size of terminal nodes: Minimum size of terminal/leaf nodes. Smaller values lead to more complex random forests.
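
The last two parameters resemble the `ntree` and `nodesize` arguments of `randomForest()` in the randomForest package. How the component passes its parameters internally is not shown in this article, so the sketch below is only an assumption about a plausible mapping. It uses R's built-in `iris` data rather than the component's own input.

```r
library(randomForest)

set.seed(42)  # the forest is random; fix the seed for repeatable results

predictors <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
classifier <- "Species"

model <- randomForest(
  x = iris[, predictors],
  y = iris[[classifier]],  # a factor target makes this a classification forest
  ntree = 500,             # assumed counterpart of "Number of Trees to grow"
  nodesize = 1             # assumed counterpart of "Minimum size of terminal nodes"
)

print(model$ntree)
print(model$confusion)  # out-of-bag confusion matrix
```

Raising `nodesize` stops the trees from splitting small groups of records, which gives shallower and simpler trees.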

 

Output Columns added by this Component

- PredictedValue: Value predicted by the random forest.

 

How to Implement

The component is attached to this article. Download and unzip the file; you will see a text file. Rename the file's .txt extension to .zip and unpack the new file as well. The content of that .zip file is the Custom R Component. These steps are needed because SCN does not allow attachments of the component's original file type.

 

Then deploy the component as described here. You just need to copy the attached content into the folder described in that article and restart SAP Predictive Analysis.

 

Example

If you want to try this random forest on some sample data, you can use the Adult dataset, as used in the article on the Naive Bayes Algorithm. Just remember that the column names must not include a minus sign.

rf01.JPG

 

Configure the component appropriately. In this case we want to predict a person's marital status. Remember not to use the "NativeCountry" column as a predictor, as it contains too many levels (country names).

rf02.JPG

 

Run the model and you can see the predicted values either as raw data or in the embedded confusion matrix. 88.81% of the records have been correctly classified.

rf03.JPG
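
The share of correctly classified records is the sum of the diagonal of the confusion matrix divided by the total number of records. A minimal base R sketch with made-up labels (not the Adult data) shows the calculation:

```r
# Made-up actual and predicted labels, for illustration only.
actual    <- factor(c("Married", "Married", "Single", "Single", "Divorced", "Single"))
predicted <- factor(c("Married", "Single",  "Single", "Single", "Divorced", "Married"),
                    levels = levels(actual))

# Rows are the actual classes, columns the predicted classes.
confusion <- table(Actual = actual, Predicted = predicted)

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(round(100 * accuracy, 2))  # 4 of 6 records correct: 66.67
```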

 

Now we want to determine how well the model can predict the marital status on data the model has not seen before. Save the trained model. Then add it as an additional component to the testing branch of the analytical flow.

rf04.JPG

 

Execute the component, then open the "Custom Chart" in the "Results" panel, and you will see that another confusion matrix has been created. The component was able to identify automatically that the true classification is already known. If the classifier column (that was specified when training the model) exists in the dataset, the component assumes that it is tested on already classified data. Therefore it displays the confusion matrix to help evaluate the model's performance.
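
The decision described above amounts to checking whether the classifier column is present in the incoming data. The component's actual source is not reproduced here, so the helper `choose_output` and the sample data frames below are hypothetical and only sketch the idea:

```r
# Hypothetical helper: decide which chart to show for a scored dataset.
# `classifier` is the target column name chosen when the model was trained.
choose_output <- function(data, classifier) {
  if (classifier %in% names(data)) {
    "confusion matrix"  # actual classes are available, so the model can be evaluated
  } else {
    "frequency plot"    # only predictions exist
  }
}

tested  <- data.frame(Age = c(30, 45), MaritalStatus = c("Married", "Single"))
applied <- data.frame(Age = c(30, 45))

print(choose_output(tested,  "MaritalStatus"))
print(choose_output(applied, "MaritalStatus"))
```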

 

The trained model was able to accurately predict 83.97% of the previously unseen cases!

rf05.JPG

 

When applying the model to new data for which the real classification is not known, the component will display a frequency plot of the predictions.

rf06.JPG
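
A comparable frequency plot can be drawn with ggplot2, one of the two required libraries. This sketch uses made-up predictions and is not the component's own plotting code; only the column name `PredictedValue` is taken from this article.

```r
library(ggplot2)

# Made-up predictions for illustration.
scored <- data.frame(
  PredictedValue = c("Married", "Single", "Married", "Divorced", "Married")
)

# Counts per predicted class.
freq <- table(scored$PredictedValue)
print(freq)

# Bar chart of the same counts; geom_bar() counts the rows per class by default.
p <- ggplot(scored, aes(x = PredictedValue)) +
  geom_bar() +
  labs(x = "Predicted classification", y = "Frequency")
print(p)
```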

