I am deploying an Apache Spark application using the standalone cluster manager. My setup uses two Windows machines: one acts as the master and the other as a slave (worker).

Master: on it I run:

\bin>spark-class org.apache.spark.deploy.master.Master

This is what the web UI shows:

Slave: on it I run:

\bin>spark-class org.apache.spark.deploy.worker.Worker spark://192.*.*.186:7077

This is what the web UI shows:

The problem is that the worker node cannot connect to the master node, and it shows the following error:
17/09/26 16:05:17 INFO Worker: Connecting to master 192.*.*.186:7077...
17/09/26 16:05:22 WARN Worker: Failed to connect to master 192.*.*.186:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to /192.*.*.186:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at …

I have a DataFrame df with the following structure:
+-----+-----+-----+-------+
| s |col_1|col_2|col_...|
+-----+-----+-----+-------+
| f1 | 0.0| 0.6| ... |
| f2 | 0.6| 0.7| ... |
| f3 | 0.5| 0.9| ... |
| ...| ...| ...| ... |
I want to compute the transpose of this DataFrame, so that it looks like this:
+-------+-----+-----+-------+------+
| s | f1 | f2 | f3 | ...|
+-------+-----+-----+-------+------+
|col_1 | 0.0| 0.6| 0.5 | ...|
|col_2 | 0.6| 0.7| 0.9 | ...|
|col_...| ...| ...| ... | ...|
I tried the two solutions below, but each one returns an error saying the DataFrame does not have the method used:
Method 1:
for x in df.columns:
    df = df.pivot(x)
Method 2:
df = …
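For reference, Spark DataFrames have no built-in transpose (pivot lives on GroupedData, not on DataFrame), so a direct method call like the above cannot work. A minimal workaround sketch, assuming the frame is small enough to collect to the driver and that an active SparkSession named spark is available (both are assumptions, not part of the question):

pdf = df.toPandas().set_index("s").transpose().reset_index()  # transpose on the driver with pandas
pdf = pdf.rename(columns={"index": "s"})                      # keep the old column names in a column "s"
transposed = spark.createDataFrame(pdf)                       # back to a Spark DataFrame
transposed.show()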
I am running PySpark with an IPython notebook, simply by adding the following to the notebook:

import os
os.chdir('../data_files')
import sys
import pandas as pd
%pylab inline
from IPython.display import Image
os.environ['SPARK_HOME']="spark-1.3.1-bin-hadoop2.6"
sys.path.append( os.path.join(os.environ['SPARK_HOME'], 'python') )
sys.path.append( os.path.join(os.environ['SPARK_HOME'], 'bin') )
sys.path.append( os.path.join(os.environ['SPARK_HOME'], 'python/lib/py4j-0.8.2.1-src.zip') )
from pyspark import SparkContext
sc = SparkContext('local')
This works fine for one project. But in my second project, after running a few lines (not the same ones every time), I get the following error:
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.py", line 425, in start
self.socket.connect((self.address, self.port))
File "/usr/lib/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
---------------------------------------------------------------------------
Py4JNetworkError Traceback (most recent call …
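A guess at what is going on, not a confirmed diagnosis: a py4j "Connection refused" usually means the JVM gateway behind the SparkContext is no longer running (for example it crashed, was killed, or a stale context from an earlier run is in the way). A defensive sketch, with the local[*] master and the app name being assumptions, is to stop any leftover context before creating a new one:

from pyspark import SparkConf, SparkContext

try:
    sc.stop()        # stop a context left over from an earlier run, if any
except NameError:
    pass             # no context defined yet in this notebook

conf = SparkConf().setMaster("local[*]").setAppName("notebook")
sc = SparkContext(conf=conf)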
In a PySpark application, I am trying to transpose a DataFrame by converting it to pandas, and then I want to write the result to a CSV file. This is how I do it:

df = df.toPandas().set_index("s").transpose()
df.coalesce(1).write.option("header", True).option("delimiter", ",").csv('dataframe')
When I execute this script, I get the following error:
'DataFrame' object has no attribute 'coalesce'
What is the problem? How can I fix it?
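A likely explanation, offered as a guess: toPandas() returns a pandas DataFrame, and coalesce/write belong to the Spark DataFrame API, so the pandas object simply has no such attribute. A minimal sketch of two ways around it (the SparkSession name spark and the output paths are assumptions):

pdf = df.toPandas().set_index("s").transpose()    # pandas DataFrame from this point on
pdf.to_csv("dataframe.csv", header=True)          # pandas writer, no coalesce needed

# or, to keep the Spark writer, convert back to a Spark DataFrame first
sdf = spark.createDataFrame(pdf.reset_index())
sdf.coalesce(1).write.option("header", True).option("delimiter", ",").csv("dataframe")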
I want to run some code in IronPython from C#. In this code I need to use numpy, so I tried to install it with the following command:
ipy -X:Frames -m pip install -U numpy
Unfortunately, I get an error and a return message telling me that the installation was not successful. The error message is as follows:
Using cached https://files.pythonhosted.org/packages/3a/20/c81632328b1a4e1db65f45c0a1350a9c5341fd4bbb8ea66cdd98da56fe2e/numpy-1.15.0.zip
Installing collected packages: numpy
Running setup.py install for numpy ... error
Complete output from command "C:\Program Files\IronPython 2.7\ipy.exe" -u -c "import setuptools, tokenize;__file__='c:\\users\\mbhamida\\appdata\\local\\temp\\pip-build-t61kxu\\numpy\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record c:\users\mbhamida\appdata\local\temp\pip-y91bz0-record\install-record.txt --single-version-externally-managed --compile:
Running from numpy source directory.
Note: if you need reliable uninstall behavior, then install
with pip instead of using `setup.py install`:
- `pip install .` (from a git repo or …

I have a DataFrame defined with some null values. Some columns are entirely null.
>> df.show()
+---+---+---+----+
| A| B| C| D|
+---+---+---+----+
|1.0|4.0|7.0|null|
|2.0|5.0|7.0|null|
|3.0|6.0|5.0|null|
+---+---+---+----+
In my case, I want to return a list of the names of the columns that are filled with nulls. My idea was to detect the constant columns (since the whole column contains the same null value).
This is how I did it:
nullCoulumns = [c for c, const in df.select([(min(c) == max(c)).alias(c) for c in df.columns]).first().asDict().items() if const]
But this does not treat null columns as constant; it only works for actual values. How can I do it, then?
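One possible approach, sketched rather than definitive: count() in pyspark.sql.functions ignores nulls, so a column whose non-null count is 0 is entirely null. The variable names below are illustrative:

from pyspark.sql import functions as F

# count the non-null values of every column in a single pass
counts = df.select([F.count(F.col(c)).alias(c) for c in df.columns]).first().asDict()
null_columns = [c for c, n in counts.items() if n == 0]
# for the example frame above this would give ['D']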
apache-spark apache-spark-sql pyspark spark-dataframe pyspark-sql