Overview
Big Data is an incredibly exciting topic – every day at SAP we speak to customers who are pushing the boundaries with the amount of data they generate. Some examples include machine and sensor data, biometric data from the quantified self, Customer Relationship Management (CRM) data, behavioral observations, and text processed with Natural Language Processing (NLP). It is our mission to help customers generate actionable insights from this data in an efficient and repeatable manner. This gives everyday business users the flexibility to make use of advanced statistical methods, while still allowing data scientists the power to use their expertise to refine predictive models.
The first challenge we need to overcome is size. Predictive models are created by taking a number of input variables and combining them to determine an output variable. As the number of data sources and the volume of information expand, so does the number of contributing variables. This results in very wide analytical records with hundreds of thousands of variables. To create and score models on these types of records, it makes sense to use Big Data technologies.
As a first example, we are going to take a 15,000-variable sample dataset describing Orange Telecom customers, which was used for the Knowledge Discovery and Data Mining (KDD) Cup challenge in 2009. We are using SAP InfiniteInsight to automatically create the predictive model from the dataset, and the Hive SQL framework on Apache Hadoop to store and process the data.
Our goal is to observe the technology working correctly, but also to see how it performs under this type of data load. The beauty of SAP InfiniteInsight is that it reduces this very wide dataset down to a narrow group of variables that provide the maximum impact quickly, with minimal data science expertise required. We look forward to continuing to work with our customers to expand the size and scope of the data they work with to provide new and interesting insights.
This sizing test measures the running performance of SAP InfiniteInsight 7.0 on different Apache Hadoop clusters (Native Apache Hadoop, Cloudera, and Hortonworks) with different data sizes. The goal is to demonstrate the modeling capabilities of SAP InfiniteInsight on large data sources in the Hadoop file system.
Data Preparation
Data Description
The test is based on two datasets provided by the telecommunications company Orange Telecom:
- Dataset 1 = orange_small_train: 230 variables and 50,000 rows
- Dataset 2 = orange_big_train: 15,000 variables and 50,000 rows
These datasets have been divided into different row segments for testing (one way to derive the segments is sketched after the download link below):
- 1,000 rows segment
- 3,000 rows segment
- 5,000 rows segment
- 10,000 rows segment
- 15,000 rows segment
- 30,000 rows segment
- 50,000 rows segment
You can download the datasets from the Data section here: http://www.sigkdd.org/kdd-cup-2009-customer-relationship-prediction
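The KDD Cup site provides only the full 50,000-row tables, so the row segments have to be derived. The following HiveQL sketch shows one possible way to do this once the full datasets have been loaded into Hive (see the Loading Data section below); the table names are illustrative assumptions, not part of the original test setup.

```sql
-- Illustrative sketch: derive the row segments from the full Hive table.
-- Repeat the same pattern for the 15,000-variable dataset.
-- LIMIT without ORDER BY returns an arbitrary subset, which is sufficient
-- for a sizing test.
CREATE TABLE orange_small_train_1k  AS SELECT * FROM orange_small_train LIMIT 1000;
CREATE TABLE orange_small_train_3k  AS SELECT * FROM orange_small_train LIMIT 3000;
CREATE TABLE orange_small_train_5k  AS SELECT * FROM orange_small_train LIMIT 5000;
CREATE TABLE orange_small_train_10k AS SELECT * FROM orange_small_train LIMIT 10000;
CREATE TABLE orange_small_train_15k AS SELECT * FROM orange_small_train LIMIT 15000;
CREATE TABLE orange_small_train_30k AS SELECT * FROM orange_small_train LIMIT 30000;
-- The 50,000-row segment is simply the full table.
```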
Testbed Description
The testing is performed on three testbeds based on different distributions of Apache Hadoop and Hive.
- Cloudera cluster with Hive 12.
- Hortonworks cluster with Hive 13.
- Native Apache Hadoop cluster with Hive 11, Hive 12, and Hive 13.
This section outlines the recommended hardware and software configurations for each cluster.
Experimental Cluster 1: Cloudera Cluster
Hardware Configuration
We recommend using 5 different servers for the testing with the following configuration:
Server | Node Type | CPU | Memory | Hard Disk |
Server 1 | Master Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Server 2 | Slave Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Server 3 | Slave Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Server 4 | Slave Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Server 5 | Slave Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Computation capability in total: 20 CPU cores, 80 GB memory (5 × 16 GB), and 1 TB hard disk
Software Configuration
The Apache Hadoop installation is based on the Cloudera (CDH 5.1.0) release:
Component | Version | Description |
Apache Hadoop | hadoop-2.3.0+cdh5.1.0+795 | Cloudera distribution of Apache Hadoop 2.3, which adopts the MapReduce paradigm to enable large-scale calculation. |
Apache Hive | hive-0.12.0+cdh5.1.0+369 | Apache Hive 0.12.0, which allows you to query data in the Hadoop Distributed File System (HDFS) via HiveQL (Hive Query Language). |
Experimental Cluster 2: Hortonworks Cluster
Hardware Configuration
We recommend using 5 different servers for the testing with the following configuration:
Server | Node Type | CPU | Memory | Hard Disk |
Server 1 | Master Node = SUSE11.03 | 4-core | 16 GB | 200 GB |
Server 2 | Slave Node = SUSE11.03 | 4-core | 16 GB | 200 GB |
Server 3 | Slave Node = SUSE11.03 | 4-core | 16 GB | 200 GB |
Server 4 | Slave Node = SUSE11.03 | 4-core | 16 GB | 200 GB |
Server 5 | Slave Node = SUSE11.03 | 4-core | 16 GB | 200 GB |
Computation capability in total: 20 CPU cores, 80 GB memory (5 × 16 GB), and 1 TB hard disk
Software Configuration
The Apache Hadoop installation is based on Hortonworks Hadoop 2.1.1 release.
Component | Version | Description |
Apache Hadoop | Hortonworks distribution (HDP 2.1.1) | Hortonworks distribution of Apache Hadoop, which adopts the MapReduce paradigm to enable large-scale calculation. |
Apache Hive | hive-0.13.0 | Apache Hive 0.13.0, which allows you to query data in the Hadoop Distributed File System (HDFS) via HiveQL (Hive Query Language). |
Experimental Cluster 3: Apache Hadoop Cluster
Hardware Configuration
We recommend using 5 different servers for the testing with the following configuration:
Server | Node Type | CPU | Memory | Hard Disk |
Server 1 | Master Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Server 2 | Slave Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Server 3 | Slave Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Server 4 | Slave Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Server 5 | Slave Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Computation capability in total: 20 CPU cores, 80 GB memory (5 × 16 GB), and 1 TB hard disk
Software Configuration
The Apache Hadoop installation is based on the native Apache Hadoop release: http://hadoop.apache.org/
Component | Version | Description |
Apache Hadoop | hadoop-2.4.0 | Native Apache Hadoop 2.4.0, which adopts the MapReduce paradigm to enable large-scale calculation. |
Apache Hive | hive-0.11.0, hive-0.12.0, hive-0.13.0 | Native Apache Hive 0.11, 0.12, and 0.13, which allow you to query data in the Hadoop Distributed File System (HDFS) via HiveQL (Hive Query Language). |
Loading Data
This testing focuses on the running performance of SAP InfiniteInsight with Hive, so all datasets should be loaded into the Hadoop Distributed File System (HDFS) on the clusters, and a corresponding table created for each dataset in the Hive Metastore.
This can be done by loading the data into Hive via the console, or by using Hue on the Cloudera cluster.
Loading Data into Hive via Console:
- Connect to the Hadoop cluster using an SSH client such as PuTTY.
- Once connected, change directory to the folder where Hive is installed on the Hadoop cluster and start the Hive shell.
- Create a table in Hive for each of the large and small datasets. Note: tab characters between the columns in the 'CREATE TABLE' statement produce an error in the Hive shell, so use a statement file without tab separation.
- Ensure the table has been created by using the 'SHOW TABLES' command in Hive.
- Put the data file into HDFS. Note: errors can occur when the data is loaded with a header row that contains column names, or with blank key fields.
- Load the data into the table using the 'LOAD DATA INPATH' statement.
- Use a 'SELECT' statement in Hive to ensure the data has been loaded into the respective table. Add 'LIMIT <no. of rows>' to the SELECT statement to restrict the output to a few rows, as illustrated in the sketch below.
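The following HiveQL sketch illustrates the console steps above for the small dataset. It is not the exact script used in the test: the HDFS path, the abbreviated column list (the KDD Cup files name their columns Var1 through Var230), and the tab-delimited file format are assumptions made for illustration.

```sql
-- Create a table for the small dataset. Only three of the 230 columns are
-- shown here; the real statement lists Var1 through Var230. The statement
-- itself must not contain tab characters.
CREATE TABLE orange_small_train (
  Var1   DOUBLE,
  Var2   DOUBLE,
  Var230 STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Confirm that the table was created.
SHOW TABLES;

-- Load the data file, previously copied into HDFS with the header row
-- removed, into the table (illustrative HDFS path).
LOAD DATA INPATH '/user/hive/data/orange_small_train.data'
INTO TABLE orange_small_train;

-- Spot-check a few rows to confirm the load.
SELECT * FROM orange_small_train LIMIT 10;
```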
Test Execution
Test Cases
Performance and Feasibility tests are conducted on all three clusters:
- Cloudera
- Hortonworks
- Native Apache Hadoop
Cloudera and Hortonworks are widely used third-party distributions of Hadoop, so these have been included in the testing to allow for comparison. The following drivers are used:
- DataDirect (DD) driver: Default driver released with SAP InfiniteInsight, which is installed automatically when you install the application. This driver does not require database client libraries, which improves performance.
- Hortonworks (HW) driver: Third-party Open Database Connectivity (ODBC) driver that can be installed separately. This driver allows you to access data in the Hortonworks Data Platform from Business Intelligence (BI) applications.
Performance Test
This test should be applied to all three clusters (Cloudera, Hortonworks, and Native Apache Hadoop) using all segments of both datasets (230 variables and 15,000 variables). The test is run 5 times for each dataset.
Clusters | Hive Version | Driver: Hortonworks (HW) / DataDirect (DD) | Dataset 1 (230 variables) | Dataset 2 (15,000 variables) |
Cloudera | Hive 12 (8 GB)* | DD, HW | Yes | Yes |
Hortonworks | Hive 13 (8 GB) | DD, HW | Yes | Yes |
Native | Hive 11 (8 GB) | DD, HW | Yes | Yes |
Native | Hive 12 (8 GB) | DD, HW | Yes | Yes |
Native | Hive 13 (8 GB) | DD, HW | Yes | Yes |
*When using Hive 12 with a wide dataset, ensure that you increase the heap size of the Java Virtual Machine (JVM) for the Hive Metastore service.
Feasibility Test
Clusters | Driver: Hortonworks (HW) / DataDirect (DD) | Feasibility (15,000 columns, 50,000 rows) | Feasibility (230 columns, 50,000 rows) |
Cloudera | DD | No | Yes |
Cloudera | HW | Yes | Yes |
Hortonworks | DD | No | Yes |
Hortonworks | HW | Yes | Yes |
Native | DD | No | Yes |
Native | HW | Yes | Yes |
Guide to Results:
Yes means that no exceptions occurred and it was possible to generate a model from the test. No means that exceptions occurred and it was not possible to generate a model.
Running Tests
- Create an ODBC source on the Windows system for the DataDirect (DD) and Hortonworks (HW) drivers, pointing to the Hadoop clusters. The default Rows per block limit is 10,000. For the HW driver, go to Advanced Options and set the Rows per block limit to 200. The DD driver does not support configuring the Rows per block limit.
- Start SAP InfiniteInsight 7.0 and, using the Modeler, connect to the ODBC source under Database that points to the Hadoop cluster. Select the table that was created with the 'CREATE TABLE' statement when you loaded data into Hive via the console.
- Run the Classification algorithm and take note of the running time and any exceptions raised.
Note: When testing with large datasets (15,000 columns), you can follow these guidelines:
- Instead of analyzing data in SAP InfiniteInsight, use a separate data description file and set the values for upselling, churn, and appetency to nominal.
- Set upselling as the only target variable.
- Deselect the Enable Auto-selection checkbox. This simplifies the test and reduces the computation involved.
Capturing Test Results
Record the test results (running time and any exceptions raised) for each test case.
Test Analysis
The following graphs demonstrate the results of the sizing test. Note the following key points:
- Linearity displayed for each test case
- Scalability of running the tests on large datasets
- Variation between the different versions of Hive
[Graph: modeling time across clusters, Hive versions, and dataset sizes]
You can see that the results are more stable with the larger dataset (15,000 variables, or columns). The performance does not depend on the version of Apache Hadoop being used; it is consistent throughout.
[Graph: query cost of Hive compared with a local file]
The query cost includes the execution time of the Hive query engine plus the communication time, that is, the time it takes to fetch data from the cluster. The graph demonstrates the cost difference between running the test on Hive and running it on a local file. Reducing the query cost would reduce the overall modeling cost.
The sizing tests in this guide were performed to observe how SAP InfiniteInsight works with very wide datasets. The following conclusions can be derived from the results of the sizing tests:
Firstly, the tests show that SAP InfiniteInsight is able to build a model on a very wide dataset (15,000 columns), with a cost that grows linearly as the size of the data increases. By using Apache Hadoop with Hive, it overcomes the column-count limits that conventional databases impose when processing datasets with a large number of columns. It also supports popular third-party Hadoop distributions, such as Cloudera and Hortonworks, with the necessary configuration. The linear trend of the time cost indicates that SAP InfiniteInsight is scalable for large volumes of data.
Secondly, the performance of predictive analysis is affected by the Hive version: in some tests, Hive 12 runs faster than Hive 13. This is not strictly an evaluation of the performance of different Hive versions; rather, it indicates that the Hive version can significantly affect the performance of SAP InfiniteInsight.
Thirdly, a noticeable query cost is observed with every version of Hive. The query cost can be improved by tuning the cluster, for example by increasing the heap size for the Hive Metastore. This also shows the potential to improve results by using more efficient big data technologies.
The test results in this sizing guide can be used as a benchmark for big data features that SAP InfiniteInsight will deliver in future releases.