This blog describes issues that can occur when installing, configuring, or running Native Spark Modeling. It explains the root causes of those issues and, where possible, provides solutions or workarounds.
The official SAP Predictive Analytics documentation, including the "Connecting to your Database Management System on Windows" and "Connecting to your Database Management System on Unix" guides, can be found on the SAP Predictive Analytics 2.5 – SAP Help Portal page.
What is Native Spark Modeling?
Native Spark Modeling builds Automated predictive models by leveraging the combined data storage and processing power of Apache Spark and Hadoop.
Native Spark Modeling was introduced in SAP Predictive Analytics 2.5. The concept is also sometimes called in-database processing or modeling. Note that both Data Manager and Model Apply (scoring) already support in-database functionality. For more details on Native Spark Modeling, have a look at:
Big Data : Native Spark Modeling in SAP Predictive Analytics 2.5
Troubleshooting
Configuration
Issue
Native Spark Modeling does not start. When it is configured correctly, you should see the "Negotiating resource allocation with YARN" progress message in the Desktop client.
Solution
Check that the Native Spark Modeling checkbox is enabled in the preferences (under Preferences -> Model Training Delegation).
Check that you have at least the minimum properties in the configuration files: the hadoopConfigDir and hadoopUserName entries in the SparkConnections.ini file for the Hive DSN, and the Hadoop client XML files in the folder referenced by the hadoopConfigDir property. A minimal sketch is shown below.
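As a minimal sketch, assuming an ODBC DSN called MY_HIVE_DSN and a hypothetical Hadoop user name (adjust both to your environment), the entries would look like:
In SparkConnections.ini
SparkConnection.MY_HIVE_DSN.hadoopConfigDir=../../../SparkConnector/hadoopConfig/MY_HIVE_DSN
SparkConnection.MY_HIVE_DSN.hadoopUserName=hdfs_user
The folder referenced by hadoopConfigDir should contain the Hadoop client XML files (hive-site.xml, yarn-site.xml and core-site.xml).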
Issue
The SparkConnections.ini file has limited support for full path names containing spaces on Windows.
Solution
Prefer relative paths instead.
e.g. for an ODBC DSN called MY_HIVE_DSN, use the following relative path instead of the full path for the hadoopConfigDir parameter:
SparkConnection.MY_HIVE_DSN.hadoopConfigDir=../../../SparkConnector/hadoopConfig/MY_HIVE_DSN
Issue
Error message includes "Connection specific Hadoop config folder doesn't exist".
Solution
Check the SparkConnections.ini file contains a valid path to the configuration folder.
Issue
Error message contains "For Input String". For example "Unexpected Java internal error...For Input String "5s"".
Solution
Check the hive-site.xml file for the DSN and remove the property that is causing the issue (search for the string quoted in the error message).
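The error typically means that a Java integer parse failed on a value carrying a time-unit suffix. As a hypothetical illustration (the actual property is whichever one matches the string in your error message), an entry such as the following in hive-site.xml would trigger it:
<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>5s</value>
</property>
Removing the property, or replacing the value with a plain number, resolves the parse error.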
Issue
Error message "JNI doesn't find class".
Solution
This can be a JNI (Java Native Interface) classpath issue. Restart the Desktop client and double-check the settings in the KJWizard.ini file.
Monitoring and Logging
Issue
The logs in native_spark_log.log can be limited.
Solution
Refer to the logs on the Hadoop cluster for additional logging and troubleshooting information.
For example use the YARN Resource Manager web UI to monitor the Spark and Hive logs to help troubleshoot Hadoop specific issues. The Resource Manager web UI URL is normally http://{resourcemanager-hostname}:8088/cluster/apps.
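In addition to the web UI, the aggregated logs for a completed application can be fetched from a cluster node with the YARN command line (assuming log aggregation is enabled on the cluster; the application ID below is a placeholder, copy the real one from the Resource Manager UI):
yarn logs -applicationId application_1463000000000_0001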
Support for Multiple Spark versions
Issue
There is a restriction that only one Spark version (jar file) can be used at a time with Native Spark Modeling.
Hortonworks HDP and Cloudera CDH run Spark 1.4.1 and Spark 1.5.0 respectively.
Solution
It is possible to switch the configuration to one or the other Spark version as appropriate before modeling.
See the “Connecting to your Database Management System” guide in the official documentation (SAP Predictive Analytics 2.5 – SAP Help Portal Page) for more information on switching between cluster types.
Restart the server or Desktop client after making this change.
Training Data Content Advice
Issue
There is a limitation that the training data set cannot contain commas in the data values, for example a field containing the value "Dublin, Ireland".
Solution
Ensure that, when creating the table in Hive, the data does not contain a header row, as the Hive "create table" statement may otherwise include the header information as a data row (see the sketch below). Also pre-process the data to cleanse commas, or disable Native Spark Modeling for such data sets.
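One way to keep a header row out of the data is the skip.header.line.count table property. A minimal sketch, assuming a comma-delimited source file and hypothetical table and column names:
CREATE TABLE training_data (customer_id INT, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");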
KxIndex Inclusion
Issue
Crash occurs when including KxIndex as an input variable. By default, the KxIndex variable is added by Automated Analytics to the training data set description, but it is normally an excluded variable. There is a limitation that the KxIndex column cannot be added to the included variable list with Native Spark Modeling.
Solution
Exclude the KxIndex variable (this is the default behaviour).
HadoopConfigDir Subfolder Creation
Issue
The configuration property HadoopConfigDir in Spark.cfg by default uses the temporary directory of the operating system.
This property is used to specify where to copy the Hadoop client configuration XML files (hive-site.xml, yarn-site.xml and core-site.xml).
If this is changed to use a subdirectory (e.g. \tmp\PA_HADOOP_FILES), it is possible to hit a race condition in which the files are copied before the subdirectory has been created.
Solution
Manually create the subdirectory.
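For example, create the folder before starting the client (using the subdirectory from the issue above; adjust the path to match your HadoopConfigDir setting):
mkdir \tmp\PA_HADOOP_FILES (Windows)
mkdir -p /tmp/PA_HADOOP_FILES (UNIX)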
Memory Configuration Tuning (Desktop only)
Issue
The Automated Desktop user interface shares the same Java (JVM) process memory with the Spark connection component (Spark Driver).
It is possible to misconfigure one or the other, but no specific warnings will be issued in this case.
Solution
Modify the configuration parameters to get the correct memory balance for the Desktop user interface and the Spark Driver.
The KJWizard.ini configuration file contains the total memory available to the Automated Desktop user interface and SparkDriver.
The Spark.cfg configuration file contains the optional property DriverMemory. This should be configured to be approximately 25% less than the total JVM memory set in KJWizard.ini, leaving the remainder for the user interface (in the example below, 6144 MB of driver memory against an 8096 MB maximum heap).
The SparkConnections.ini configuration file can be further configured to tune the Spark memory.
Please restart the Desktop client after making configuration changes.
e.g. Automated Desktop memory and Spark configuration settings:
In Spark.cfg
Spark.DriverMemory=6144
In KJWizard.ini
vmarg.1=-Xmx8096m
In SparkConnections.ini
SparkConnection.MY_HIVE_DSN.native."spark.driver.maxResultSize"="4g"
Spark/YARN Connectivity
Issue
Virtual Private Network (VPN) connection issue (mainly Desktop).
Native Spark Modeling uses YARN for the connection to the Hadoop cluster. There is a limitation that the connectivity does not work over VPN.
Solution
Revert to a non-VPN connection, or connect to a terminal/virtual machine that can reach the cluster without the VPN.
Issue
Single SparkContext issue (Desktop only).
A SparkContext is the main entry point for Spark functionality. There is a known limitation in Spark that there can be only one SparkContext per JVM.
For more information see https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/SparkContext.html
This issue may appear when a connection to Spark cannot be created correctly (e.g. due to a configuration issue) and subsequently the SparkContext cannot be restarted. This is an issue that only affects the Desktop installation.
Solution
Restart the Desktop client.
Issue
The following error message appears:
Unexpected Spark internal error. Error detail: Cannot call methods on a stopped SparkContext
Solution
Troubleshoot by looking in the diagnostic messages or logs on the cluster (for example using the web UI).
One possible cause is over-committing CPU resources in the SparkConnections.ini configuration file.
For example, the Hadoop web UI error diagnostics may show an over-commit of resources when the SparkConnections.ini file specifies more cores than the cluster can provide.
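As an illustration (hypothetical DSN name and values, following the same native."..." property pattern shown earlier), such an over-committed configuration might look like:
In SparkConnections.ini
SparkConnection.MY_HIVE_DSN.native."spark.executor.cores"="16"
SparkConnection.MY_HIVE_DSN.native."spark.executor.instances"="50"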
Hive
Issue
Hive on Tez execution memory issue
Scope: Hortonworks clusters only (with Hive on Tez) and Data Manager functionality.
Hortonworks HDP uses Hive on Tez to greatly improve SQL execution performance. The SQL generated by the Data Manager functionality for Analytical Data Sets (ADS) can be complex. There is a possibility that the Tez engine will run out of memory with the default settings.
Solution
Increase the memory available to Tez through the Ambari web administrator console.
Go to the Tez configs under Hive and change the tez.counters.max setting to 16000. It is also recommended to increase the tez.task.resource.memory.mb setting. Restart the Hive and Tez services after these changes. If this still does not work, it is possible to switch the execution engine back to MapReduce, again through Ambari, as sketched below.
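Switching the engine to MapReduce corresponds to the hive.execution.engine property, shown here as it would appear in hive-site.xml (in Ambari the same setting is edited under the Hive configs):
<property>
  <name>hive.execution.engine</name>
  <value>mr</value>
</property>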
Issue
It is possible to set the database name in the ODBC Hive driver connection configuration. For example, instead of using the "default" database, it is possible to configure a different database in the ODBC Administrator dialog on Windows or the ODBC connection file for the UNIX operating system.
Native Spark Modeling requires the default database for the Hive connection.
Solution
Keep the database setting as "default" for the Hive DSN connection. It is still possible to use a Hive table/view in a database other than default.
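For example (hypothetical database and table names), with the DSN left on the default database, a table in another database can still be referenced by its qualified name:
SELECT * FROM sales_db.customers;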
Data Manager
Issue
A user-defined target field in a time-stamped population is not contained in a Temporal ADS (Analytical Data Set). That is, when you train your model using Data Manager with a "Time-stamped Population" that has a target variable, the target variable may not be visible in the list of variables in the modeler.
Solution
If you want to include the target field, you can either have it as part of the original data set or define a variable (with the relevant equation) in the "Analytical Record" instead.
Metadata Repository
Issue
The metadata repository cannot be in Hive. Also, output results cannot be written directly into Hive from In-database Apply (model scoring) or Model Manager.
Solution
Write the results to the local filesystem instead.