Given that both are streaming frameworks that process one event at a time, what are the core architectural differences between these two technologies/streaming frameworks?
Also, what are the specific use cases in which one is a better fit than the other?
I have a very strange problem here. I have two functions: one reads an HDF5 file created with h5py, and the other creates a new HDF5 file that concatenates the contents returned by the first function.
def read_file(filename):
    with h5py.File(filename + ".hdf5", 'r') as hf:
        group1 = hf.get('group1')
        group2 = hf.get('group2')
        dataset1 = hf.get('dataset1')
        dataset2 = hf.get('dataset2')
        print group1.attrs['w'] # Works here
        return dataset1, dataset2, group1, group2
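The `<Closed HDF5 group>` symptom comes from the `with` block: when it exits, the file is closed and every `Group`/`Dataset` handle obtained from it becomes invalid, even though the handle was valid at the `print` inside the block. A sketch of a variant that copies the needed data into plain in-memory objects before the file closes (the function name `read_file_into_memory` is mine, and it assumes the datasets live at the file root as in the original code):

```python
import h5py


def read_file_into_memory(filename):
    """Variant of read_file that returns plain in-memory objects.

    Group and Dataset handles become invalid ("<Closed HDF5 group>")
    once the `with` block exits, so everything needed later is
    copied out of the file first.
    """
    with h5py.File(filename + ".hdf5", "r") as hf:
        # [...] reads the full dataset contents into a numpy array
        dataset1 = hf["dataset1"][...]
        dataset2 = hf["dataset2"][...]
        # dict(...) copies the attributes out of the file
        group1_attrs = dict(hf["group1"].attrs)
        group2_attrs = dict(hf["group2"].attrs)
    # The file is closed here, but the copies remain usable.
    return dataset1, dataset2, group1_attrs, group2_attrs
```

The returned numpy arrays and attribute dicts can then be written into the merged output file without keeping the source file open.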
And the function that creates the file:
def create_chunk(start_index, end_index):
    for i in range(start_index, end_index):
        if i == start_index:
            mergedhf = h5py.File("output.hdf5", 'w')
            mergedhf.create_dataset("dataset1", dtype='float64')
            mergedhf.create_dataset("dataset2", dtype='float64')
            g1 = mergedhf.create_group('group1')
            g2 = mergedhf.create_group('group2')
        rd1, rd2, rg1, rg2 = read_file(filename)
        print rg1.attrs['w'] # gives me <Closed HDF5 group> message
        g1.attrs['w'] = "content"
        g1.attrs['x'] = "content"
        g2.attrs['y'] = "content"
        g2.attrs['z'] …

I am trying to run a Spark job through spark-submit, but because of certain dependencies (specifically jackson-databind), Spark does not pick up the newer version of the dependency and uses its own version instead. This results in a NoSuchMethodError, since the method is evidently not available in the older version that ships with the Spark distribution.
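For reference, a common way to avoid this kind of conflict without touching Spark's classpath is to shade (relocate) the conflicting dependency inside the application jar, so it can no longer collide with the older copy Spark ships. A sketch using the Maven Shade plugin (the `myapp.shaded` relocation prefix is an arbitrary placeholder; the `com.fasterxml.jackson` pattern is the standard package for jackson-databind):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <!-- Move jackson classes to a private package so they
             cannot clash with the older copy bundled with Spark -->
        <pattern>com.fasterxml.jackson</pattern>
        <shadedPattern>myapp.shaded.com.fasterxml.jackson</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

With the relocation in place, the application loads its own jackson classes under the shaded package name and the userClassPathFirst flags are unnecessary.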
To get around the NoSuchMethodError, I passed the dependency's jar file via spark-submit and, as suggested in some other answers, added the following to my spark-submit invocation:
--conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true
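Put together, the submission looked something like this (a sketch only; the application class, jar names, and paths are hypothetical placeholders, not taken from the original post):

```shell
# Hypothetical invocation: jar names and the main class are placeholders.
spark-submit \
  --master yarn \
  --class com.example.MyApp \
  --jars jackson-databind-2.8.8.jar \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  myapp.jar
```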
But when I try to run it, I get the following exception in the logs from the YARN cluster:
17/05/10 11:21:27 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 5822357421500897232
java.lang.ClassCastException: cannot assign instance of scala.concurrent.duration.FiniteDuration to field org.apache.spark.rpc.RpcTimeout.duration of type scala.concurrent.duration.FiniteDuration in instance of org.apache.spark.rpc.RpcTimeout
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2237)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1$$anonfun$apply$1.apply(NettyRpcEnv.scala:262)
	at …

apache-apex ×1
apache-flink ×1
apache-spark ×1
h5py ×1
hdf5 ×1
java ×1
python ×1
python-2.7 ×1