在Apache Spark中.如何设置worker/executor的环境变量?

tri*_*oid 9 distributed-computing amazon-s3 amazon-web-services apache-spark

我在EMR上的火花程序经常出现这个错误:

Caused by: javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated
    at sun.security.ssl.SSLSessionImpl.getPeerCertificates(SSLSessionImpl.java:421)
    at org.apache.http.conn.ssl.AbstractVerifier.verify(AbstractVerifier.java:128)
    at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:397)
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:148)
    at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:149)
    at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121)
    at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:573)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:425)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
    at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334)
    at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281)
    at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:942)
    at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2148)
    at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2075)
    at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1093)
    at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:548)
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:172)
    at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
    at org.apache.hadoop.fs.s3native.$Proxy8.retrieveMetadata(Unknown Source)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:414)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.create(NativeS3FileSystem.java:341)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)
Run Code Online (Sandbox Code Playgroud)

我做了一些研究,发现通过设置环境变量,可以在低安全性情况下禁用此身份验证:

com.amazonaws.sdk.disableCertChecking=true
Run Code Online (Sandbox Code Playgroud)

但是我只能用spark-submit.sh --conf设置它,它只影响驱动程序,而大多数错误都在工作者身上.

有没有办法将它们传播给工人?

非常感谢.

sth*_*lzm 8

刚刚在Spark文档中偶然发现了一些事情:

spark.executorEnv.[EnvironmentVariableName]

将EnvironmentVariableName指定的环境变量添加到Executor进程.用户可以指定其中的多个来设置多个环境变量.

所以在你的情况下,我将Spark配置选项设置spark.executorEnv.com.amazonaws.sdk.disableCertCheckingtrue,看看是否有帮助.

  • 我需要同样的东西,但是在“worker”上:看看是否可用.. (2认同)

Ami*_*aha 7

在现有答案的基础上添加更多内容。

import pyspark


def get_spark_context(app_name):
    # configure
    conf = pyspark.SparkConf()
    conf.set('spark.app.name', app_name)

    # init & return
    sc = pyspark.SparkContext.getOrCreate(conf=conf)

    # Configure your application specific setting

    # Set environment value for the executors
    conf.set(f'spark.executorEnv.SOME_ENVIRONMENT_VALUE', 'I_AM_PRESENT')

    return pyspark.SQLContext(sparkContext=sc)
Run Code Online (Sandbox Code Playgroud)

SOME_ENVIRONMENT_VALUE环境变量将在执行者/工作人员中可用。

在您的 Spark 应用程序中,您可以像这样访问它们:

import os
some_environment_value = os.environ.get('SOME_ENVIRONMENT_VALUE')
Run Code Online (Sandbox Code Playgroud)