Overview
Big Data is an incredibly exciting topic – every day at SAP we speak to customers who are pushing the boundaries with the amount of data they generate. Some examples include machine and sensor data, biometric data from the quantified self, Customer Relationship Management (CRM) data, behavioral observations, and text processed with Natural Language Processing (NLP). It is our mission to help customers generate actionable insights from this data in an efficient and repeatable manner. This gives everyday business users the flexibility to make use of advanced statistical methods, while still allowing data scientists the power to use their expertise to refine predictive models.
The first challenge we need to overcome is size. Predictive models are created by taking a number of input variables and combining them to determine an output variable. As the number of data sources and the volume of information expand, so does the number of contributing variables. This results in very wide analytical records with hundreds of thousands of variables. To create and score models on these types of records, it makes sense to use Big Data technologies.
As a first example, we are going to take a 15,000-variable sample dataset describing Orange Telecom customers, which was used for the Knowledge Discovery and Data Mining (KDD) Cup challenge in 2009. We are using SAP InfiniteInsight to automatically create the predictive model from the dataset, and the Hive SQL framework on Apache Hadoop to store and process the data.
Our goal is to observe the technology working correctly, but also to see how it performs under this type of data load. The beauty of SAP InfiniteInsight is that it reduces this very wide dataset down to a narrow group of variables that provide the maximum impact quickly, with minimal data science expertise required. We look forward to continuing to work with our customers to expand the size and scope of the data they work with to provide new and interesting insights.
This sizing test measures the running performance of SAP InfiniteInsight 7.0 on different Apache Hadoop clusters (Native Apache Hadoop, Cloudera, and Hortonworks) with different data sizes. The goal is to demonstrate the modeling capabilities of SAP InfiniteInsight on large data sources in the Hadoop file system.
Data Preparation
Data Description
The test is based on two datasets provided by the telecommunications company Orange Telecom:
- Dataset 1 = orange_small_train: 230 variables and 50,000 rows
- Dataset 2 = orange_big_train: 15,000 variables and 50,000 rows
These datasets have been divided into different row segments for testing (one way to derive the segments is sketched after the download link below):
- 1,000 rows segment
- 3,000 rows segment
- 5,000 rows segment
- 10,000 rows segment
- 15,000 rows segment
- 30,000 rows segment
- 50,000 rows segment
You can download the datasets from the Data section here: http://www.sigkdd.org/kdd-cup-2009-customer-relationship-prediction
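The KDD Cup site provides only the full 50,000-row tables, so the row segments have to be derived. The following HiveQL sketch shows one possible way to do this once the full datasets have been loaded into Hive (see the Loading Data section below); the table names are illustrative assumptions, not part of the original test setup.

```sql
-- Illustrative sketch: derive the row segments from the full Hive table.
-- Repeat the same pattern for the 15,000-variable dataset.
-- LIMIT without ORDER BY returns an arbitrary subset, which is sufficient
-- for a sizing test.
CREATE TABLE orange_small_train_1k  AS SELECT * FROM orange_small_train LIMIT 1000;
CREATE TABLE orange_small_train_3k  AS SELECT * FROM orange_small_train LIMIT 3000;
CREATE TABLE orange_small_train_5k  AS SELECT * FROM orange_small_train LIMIT 5000;
CREATE TABLE orange_small_train_10k AS SELECT * FROM orange_small_train LIMIT 10000;
CREATE TABLE orange_small_train_15k AS SELECT * FROM orange_small_train LIMIT 15000;
CREATE TABLE orange_small_train_30k AS SELECT * FROM orange_small_train LIMIT 30000;
-- The 50,000-row segment is simply the full table.
```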
Testbed Description
The testing is performed on three testbeds based on different distributions of Apache Hadoop and Hive.
- Cloudera cluster with Hive 12.
- Hortonworks cluster with Hive 13.
- Native Apache Hadoop cluster with Hive 11, Hive 12, and Hive 13.
This section outlines the recommended hardware and software configurations for each cluster.
Experimental Cluster 1: Cloudera Cluster
Hardware Configuration
We recommend using 5 different servers for the testing with the following configuration:
Server | Node Type | CPU | Memory | Hard Disk |
Server 1 | Master Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Server 2 | Slave Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Server 3 | Slave Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Server 4 | Slave Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Server 5 | Slave Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Computation capability in total: 20 CPU cores, 80 GB memory (5 × 16 GB), and 1 TB hard disk
Software Configuration
The Apache Hadoop installation is based on the Cloudera (CDH 5.1.0) release:
Component | Version | Description |
Apache Hadoop | hadoop-2.3.0+cdh5.1.0+795 | Cloudera distribution of Apache Hadoop 2.3, which adopts the MapReduce paradigm to enable large-scale calculation. |
Apache Hive | hive-0.12.0+cdh5.1.0+369 | Apache Hive 0.12.0, which allows you to query data in the Hadoop Distributed File System (HDFS) via HiveQL (Hive Query Language). |
Experimental Cluster 2: Hortonworks Cluster
Hardware Configuration
We recommend using 5 different servers for the testing with the following configuration:
Server | Node Type | CPU | Memory | Hard Disk |
Server 1 | Master Node = SUSE11.03 | 4-core | 16 GB | 200 GB |
Server 2 | Slave Node = SUSE11.03 | 4-core | 16 GB | 200 GB |
Server 3 | Slave Node = SUSE11.03 | 4-core | 16 GB | 200 GB |
Server 4 | Slave Node = SUSE11.03 | 4-core | 16 GB | 200 GB |
Server 5 | Slave Node = SUSE11.03 | 4-core | 16 GB | 200 GB |
Computation capability in total: 20 CPU cores, 80 GB memory (5 × 16 GB), and 1 TB hard disk
Software Configuration
The Apache Hadoop installation is based on Hortonworks Hadoop 2.1.1 release.
Component | Version | Description |
Apache Hadoop | Hortonworks distribution (HDP 2.1.1) | Hortonworks distribution of Apache Hadoop, which adopts the MapReduce paradigm to enable large-scale calculation. |
Apache Hive | hive-0.13.0 | Apache Hive 0.13.0, which allows you to query data in the Hadoop Distributed File System (HDFS) via HiveQL (Hive Query Language). |
Experimental Cluster 3: Apache Hadoop Cluster
Hardware Configuration
We recommend using 5 different servers for the testing with the following configuration:
Server | Node Type | CPU | Memory | Hard Disk |
Server 1 | Master Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Server 2 | Slave Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Server 3 | Slave Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Server 4 | Slave Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Server 5 | Slave Node = Ubuntu 12.04 | 4-core | 16 GB | 200 GB |
Computation capability in total: 20 CPU cores, 80 GB memory (5 × 16 GB), and 1 TB hard disk
Software Configuration
The Apache Hadoop installation is based on the native Apache Hadoop release: http://hadoop.apache.org/
Component | Version | Description |
Apache Hadoop | hadoop-2.4.0 | Native Apache Hadoop 2.4.0, which adopts the MapReduce paradigm to enable large-scale calculation. |
Apache Hive | hive-0.11.0, hive-0.12.0, hive-0.13.0 | Native Apache Hive 0.11, 0.12, and 0.13, which allow you to query data in the Hadoop Distributed File System (HDFS) via HiveQL (Hive Query Language). |
Loading Data
This testing focuses on the running performance of SAP InfiniteInsight with Hive, so all datasets should be loaded into the Hadoop Distributed File System (HDFS) on the clusters, and a corresponding table created for each dataset in the Hive Metastore.
This can be done by loading the data into Hive via the console, or by using Hue on the Cloudera cluster.
Loading Data into Hive via Console:
- Connect to the Hadoop cluster using an SSH client such as PuTTY.
- Once connected, change directory to the folder where Hive is installed on the Hadoop cluster and start the Hive shell.
- Create a table in Hive for each of the large and small datasets. Note: tab characters between the columns in the 'CREATE TABLE' statement produce an error in the Hive shell, so use a statement file without tab separation.
- Ensure the table has been created by using the 'SHOW TABLES' command in Hive.
- Put the data file into HDFS. Note: errors can occur when the data is loaded with a header row that contains column names, or with blank key fields.
- Load the data into the table using the 'LOAD DATA INPATH' statement.
- Use a 'SELECT' statement in Hive to ensure the data has been loaded into the respective table. Add 'LIMIT <no. of rows>' to the SELECT statement to restrict the output to a few rows, as illustrated in the sketch below.
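The following HiveQL sketch illustrates the console steps above for the small dataset. It is not the exact script used in the test: the HDFS path, the abbreviated column list (the KDD Cup files name their columns Var1 through Var230), and the tab-delimited file format are assumptions made for illustration.

```sql
-- Create a table for the small dataset. Only three of the 230 columns are
-- shown here; the real statement lists Var1 through Var230. The statement
-- itself must not contain tab characters.
CREATE TABLE orange_small_train (
  Var1   DOUBLE,
  Var2   DOUBLE,
  Var230 STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Confirm that the table was created.
SHOW TABLES;

-- Load the data file, previously copied into HDFS with the header row
-- removed, into the table (illustrative HDFS path).
LOAD DATA INPATH '/user/hive/data/orange_small_train.data'
INTO TABLE orange_small_train;

-- Spot-check a few rows to confirm the load.
SELECT * FROM orange_small_train LIMIT 10;
```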
Test Execution
Test Cases
Performance and Feasibility tests are conducted on all three clusters:
- Cloudera
- Hortonworks
- Native Apache Hadoop
Cloudera and Hortonworks are widely used third-party distributions of Hadoop, so these have been included in the testing to allow for comparison. The following drivers are used:
- DataDirect (DD) driver: Default driver released with SAP InfiniteInsight, which is installed automatically when you install the application. This driver does not require database client libraries, which improves performance.
- Hortonworks (HW) driver: Third-party Open Database Connectivity (ODBC) driver that can be installed separately. This driver allows you to access data in the Hortonworks Data Platform from Business Intelligence (BI) applications.
Performance Test
This test should be applied to all three clusters (Cloudera, Hortonworks, and Native Apache Hadoop) using all segments of both datasets (230 variables and 15,000 variables). The test is run 5 times for each dataset.
Clusters | Hive Version | Driver: Hortonworks (HW) / DataDirect (DD) | Dataset 1 (230 variables) | Dataset 2 (15,000 variables) |
Cloudera | Hive 12 (8 GB)* | DD, HW | Yes | Yes |
Hortonworks | Hive 13 (8 GB) | DD, HW | Yes | Yes |
Native | Hive 11 (8 GB) | DD, HW | Yes | Yes |
Native | Hive 12 (8 GB) | DD, HW | Yes | Yes |
Native | Hive 13 (8 GB) | DD, HW | Yes | Yes |
*When using Hive 12 with a wide dataset, ensure that you increase the heap size of the Java Virtual Machine (JVM) for the Hive Metastore service.
Feasibility Test
Clusters | Driver: Hortonworks (HW) / DataDirect (DD) | Feasibility (15,000 columns, 50,000 rows) | Feasibility (230 columns, 50,000 rows) |
Cloudera | DD | No | Yes |
Cloudera | HW | Yes | Yes |
Hortonworks | DD | No | Yes |
Hortonworks | HW | Yes | Yes |
Native | DD | No | Yes |
Native | HW | Yes | Yes |
Guide to Results:
Yes means that no exceptions occurred and it was possible to generate a model from the test. No means that exceptions occurred and it was not possible to generate a model.
Running Tests
- Create an ODBC source on the Windows system for the DataDirect (DD) and Hortonworks (HW) drivers, pointing to the Hadoop clusters. The default Rows per block limit is 10,000. For the HW driver, go to Advanced Options and set the Rows per block limit to 200. The DD driver does not support configuring the Rows per block limit.
- Start SAP InfiniteInsight 7.0 and, using the Modeler, connect to the ODBC source under Database that points to the Hadoop cluster. Select the table that was created with the 'CREATE TABLE' statement when you loaded data into Hive via the console.
- Run the Classification algorithm and take note of the running time and any exceptions raised.
Note: When testing with large datasets (15,000 columns), you can follow these guidelines:
- Instead of analyzing data in SAP InfiniteInsight, use a separate data description file and set the values for upselling, churn, and appetency to nominal.
- Set upselling as the only target variable.
- Deselect the Enable Auto-selection checkbox. This simplifies the test and reduces the computation involved.
Capturing Test Results
Record the test results (running time and any exceptions raised) for each test case.
Test Analysis
The following graphs demonstrate the results of the sizing test. Note the following key points:
- Linearity displayed for each test case
- Scalability of running the tests on large datasets
- Variation between the different versions of Hive
[Graph: modeling time across clusters, Hive versions, and dataset sizes]
You can see that the results are more stable with the larger dataset (15,000 variables, or columns). The performance does not depend on the version of Apache Hadoop being used; it is consistent throughout.
[Graph: query cost of Hive compared with a local file]
The query cost includes the execution time of the Hive query engine plus the communication time, that is, the time it takes to fetch data from the cluster. The graph demonstrates the cost difference between running the test on Hive and running it on a local file. Reducing the query cost would reduce the overall modeling cost.
The sizing tests in this guide were performed to observe how SAP InfiniteInsight works with very wide datasets. The following conclusions can be derived from the results of the sizing tests:
Firstly, the tests show that SAP InfiniteInsight is able to build a model on a very wide dataset (15,000 columns), with a cost that grows linearly as the size of the data increases. By using Apache Hadoop with Hive, it overcomes the column-count limits that conventional databases impose when processing datasets with a large number of columns. It also supports popular third-party Hadoop distributions, such as Cloudera and Hortonworks, with the necessary configuration. The linear trend of the time cost indicates that SAP InfiniteInsight is scalable for large volumes of data.
Secondly, the performance of predictive analysis is affected by the Hive version: in some tests, Hive 12 runs faster than Hive 13. This is not strictly an evaluation of the performance of different Hive versions; rather, it indicates that the Hive version can significantly affect the performance of SAP InfiniteInsight.
Thirdly, a noticeable query cost is observed with every version of Hive. The query cost can be improved by tuning the cluster, for example by increasing the heap size for the Hive Metastore. This also shows the potential to improve results by using more efficient big data technologies.
The test results in this sizing guide can be used as a benchmark for big data features that SAP InfiniteInsight will deliver in future releases.