How to set up H2O Context in YARN-based Spark (CDH 6)?

ragung · February 2023

Hi, reposting this (I think I accidentally deleted the draft; if it's double-posted, please delete this one).
I'm trying to deploy H2O on YARN-based Spark in CDH (using PySpark/PySparkling).

Below is the script I run in my Jupyter notebook:

Establishing the SparkSession:
sparkSession01 = SparkSession.builder.appName(appName01) \
    .config("spark.yarn.queue", "redacted") \
    .config("spark.driver.memory", "32g") \
    .config("spark.executor.memory", "32g") \
    .config("spark.executor.cores", "4") \
    .config("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec") \
    .config("spark.sql.sources.partitionOverwriteMode","dynamic") \
    .config("spark.submit.deployMode", "client") \
    .config("spark.blacklist.enabled","false") \
    .config("spark.port.maxRetries","100") \
    .config("spark.ui.port","4300") \
    .config("spark.debug.maxToStringFields", "100") \
    .config("spark.driver.maxResultSize", "100g") \
    .config("spark.network.timeout", "10000000") \
    .config("spark.ext.h2o.network.ip.ping.timeout", "10000000") \
    .config("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true") \
    .config("spark.hadoop.parquet.overwrite.output.file","true") \
    .config("spark.sql.shuffle.partitions", "500") \
    .config("spark.sql.autoBroadcastJoinThreshold", "-1") \
    .config('spark.yarn.keytab', 'redacted') \
    .config("spark.ext.h2o.fail.on.unsupported.spark.param","false") \
    .config("spark.executor.extraClassPath=-Dhdp.version","current") \
    .config("spark.ext.h2o.client.language","python") \
    .config("spark.yarn.dist.archives","/path/to/env/pyspark_conda_env.tar.gz#environment") \
    .config("spark.jars","/path/to/assembly/sparkling-water-assembly_2.11-3.38.0.4-1-2.4-all.jar") \
    .config('spark.yarn.principal', 'redacted') \
    .enableHiveSupport().getOrCreate()

At this point, the SparkSession is established and working.

Now I want to start an H2O context in this SparkSession by running the following:
conf = H2OConf() \
    .setExternalClusterMode() \
    .useAutoClusterStart() \
    .setClusterSize(10) \
    .setExternalMemory("32G") \
    .setYARNQueue("redacted") \
    .setClusterStartTimeout(10000000)
hc = H2OContext.getOrCreate()

But I get the error below:

`Py4JJavaError: An error occurred while calling o231.getOrCreate.: org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index 104 out of bounds for length 5
Serialization trace:
timeoutProp (org.apache.spark.rpc.RpcTimeout)`

Am I missing something?

By the way, I need to run this from a notebook; for security reasons, running spark-submit directly is not possible.

Installation details:
PySparkling for Spark 2.4, installed from pip (inside a conda env activated just for this; somehow the conda install keeps disconnecting)
dependencies pulled in when pip installed pysparkling
jar deployed via spark.jars, using the Sparkling Water assembly jar downloaded from this site
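
One thing that may be worth ruling out is a version mismatch between the assembly jar and the Spark version the cluster reports. Below is a minimal sketch for that check. It assumes the jar name follows the pattern `sparkling-water-assembly_<scala>-<h2o>-<patch>-<spark>-all.jar`, which I inferred from the file name above; `parse_assembly_jar` and `spark_matches` are hypothetical helpers of my own, not part of pysparkling.

```python
import re

# Assumed naming pattern, inferred from the jar name in this post:
# sparkling-water-assembly_<scala>-<h2o>-<patch>-<spark major.minor>-all.jar
JAR_PATTERN = re.compile(
    r"sparkling-water-assembly_(?P<scala>\d+\.\d+)-"
    r"(?P<h2o>\d+(?:\.\d+){3})-(?P<patch>\d+)-(?P<spark>\d+\.\d+)-all\.jar$"
)


def parse_assembly_jar(path):
    """Extract the versions encoded in a Sparkling Water assembly jar name."""
    match = JAR_PATTERN.search(path)
    if match is None:
        raise ValueError("unrecognized assembly jar name: %s" % path)
    return match.groupdict()


def spark_matches(jar_path, spark_version):
    """Return True if the Spark major.minor in the jar name equals that of spark_version."""
    jar_spark = parse_assembly_jar(jar_path)["spark"]
    return spark_version.split(".")[:2] == jar_spark.split(".")
```

In the notebook this could be called as `spark_matches("/path/to/assembly/sparkling-water-assembly_2.11-3.38.0.4-1-2.4-all.jar", sparkSession01.version)`. It only compares major.minor versions, so it would not catch differences between a vendor build and the upstream release.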

Thank you. I'd really appreciate it if somebody could share some pointers.
