PyCharm: Java gateway process exited before sending its port number


I am trying to run some examples of a self-contained Spark application in Python (using PyCharm).

I installed pyspark with:

pip install pyspark 

According to the web page with the examples, it should be enough to run it with:

python nameofthefile.py
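
For reference, the script is essentially the quick-start boilerplate, reconstructed here from the traceback below (the original file is not shown, so treat this as a sketch):

from pyspark.sql import SparkSession

# Creating the session is the step that fails with the error below.
spark = SparkSession.builder.appName("Databricks").getOrCreate()
print(spark.version)
spark.stop()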

But I get this error:

Exception in thread "main" java.lang.ExceptionInInitializerError
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
    at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
    at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
    at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
    at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:359)
    at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
    at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:366)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
    at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
    at java.base/java.lang.String.substring(String.java:1874)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
    ... 23 more
Traceback (most recent call last):
  File "C:/Users/.../PycharmProjects/PoC/Databricks.py", line 4, in <module>
    spark = SparkSession.builder.appName("Databricks").getOrCreate()
  File "C:\Users\...\Desktop\env\lib\site-packages\pyspark\sql\session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "C:\Users\...\Desktop\env\lib\site-packages\pyspark\context.py", line 349, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "C:\Users\...\Desktop\env\lib\site-packages\pyspark\context.py", line 115, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "C:\Users\...\Desktop\env\lib\site-packages\pyspark\context.py", line 298, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "C:\Users\...\Desktop\env\lib\site-packages\pyspark\java_gateway.py", line 94, in launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number

What could be the problem?


EXTRA

Following the post where you can find the solution: in my case, I had to switch from jdk-11 to jdk1.8.
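
If you want to confirm which JDK PySpark will actually launch, a quick check like this can help (PySpark uses JAVA_HOME if it is set and otherwise falls back to the java found on PATH):

import os
import subprocess

# Show the JDK that PySpark's gateway launcher will pick up.
print("JAVA_HOME =", os.environ.get("JAVA_HOME"))
subprocess.run(["java", "-version"])  # java prints its version to stderr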

Now I can run the sample code, but an error appears (it does not prevent it from running):

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
    at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
    at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
    at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
    at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:359)
    at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
    at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:366)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2019-01-24 08:46:16 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

The solution to this problem is here: Could not locate executable null\bin\winutils.exe

Following up: to solve this second problem, you just need to define the HADOOP_HOME and PATH environment variables in the Control Panel so that any Windows program can use them.
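
Alternatively, you can set them from Python before the SparkSession is created. A minimal sketch, assuming winutils.exe was placed in C:\hadoop\bin (a hypothetical location; adjust it to wherever you put the file):

import os

# HADOOP_HOME must point at the folder that contains bin\winutils.exe.
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] = os.environ["HADOOP_HOME"] + r"\bin;" + os.environ["PATH"]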


Short answer:

I had a similar problem and solved it by changing my JAVA_HOME environment variable configuration. You can manually add a new user environment variable JAVA_HOME pointing to the path of your Java Development Kit (for example "C:/Progra~1/Java/jdk1.8.0_121", or "C:/Progra~2/Java/jdk1.8.0_121" if it is installed under "Program Files (x86)" on Windows).

You can also try something like this at the beginning of your Python code:

import os
os.environ["JAVA_HOME"] = "C:/Progra~1/Java/jdk1.8.0_121"

(or again "C:/Progra~2/Java/jdk1.8.0_121" if your JDK is installed under "Program Files (x86)").


Longer answer: independently of PySpark, have you installed the Spark binaries (which include Hadoop)? You also need a compatible Java Development Kit (JDK) installed (Java 8+ for Spark 2.3.0), and you need to configure user environment variables such as:

- JAVA_HOME with the path to the Java Development Kit
- SPARK_HOME with the path to the Spark binaries
- HADOOP_HOME with the path to the Hadoop binaries

You can do something like this from Python:

import os

# Adjust these to your own install locations.
os.environ["JAVA_HOME"] = "C:/Progra~2/Java/jdk1.8.0_121"
os.environ["SPARK_HOME"] = "/path/to/spark-2.3.1-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = "/path/to/hadoop"  # folder containing bin\winutils.exe

Then I suggest using findspark (which you can install with pip install findspark): https://github.com/minrk/findspark

Then you can use it like this:

import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

In particular, if you are on Windows, JAVA_HOME should look like this:

C:\Progra~1\Java\jdk1.8.0_121

And, "replace the Progra~1 part with Progra~2 if the JDK is installed under \Program Files (x86)."
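
If you are unsure what the 8.3 short name is on your machine, a small Windows-only sketch like this (my own addition, using the ctypes standard library, not something from the guide below) can print it:

import ctypes

def short_path(long_path, buflen=260):
    # Ask Windows for the 8.3 short form (e.g. C:\Progra~1\...), which
    # avoids the space in "Program Files" that breaks path handling.
    buf = ctypes.create_unicode_buffer(buflen)
    ctypes.windll.kernel32.GetShortPathNameW(long_path, buf, buflen)
    return buf.value

print(short_path(r"C:\Program Files\Java\jdk1.8.0_121"))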

Details of the installation on Windows can be found here (it is written for Jupyter, but the Spark and PySpark installation is the same): https://changhsinlee.com/install-pyspark-windows-jupyter/

I hope it helps. Good luck, and have a nice day/evening!