I want to convert an RDD to a DataFrame, and I want to cache the result of the RDD:
from pyspark.sql import *
from pyspark.sql.types import *
import pyspark.sql.functions as fn
schema = StructType([StructField('t', DoubleType()), StructField('value', DoubleType())])
df = spark.createDataFrame(
    # i / 10.0: avoid Python 2 integer division, which would truncate t
    sc.parallelize([Row(t=i / 10.0, value=float(i * i)) for i in range(1000)], 4),  # .cache(),
    schema=schema,
    verifySchema=False
).orderBy("t")  # .cache()
Why does cache generate a job in this case? How can I avoid the job that cache generates (i.e. cache the DataFrame without the RDD)?

Edit: I investigated the problem some more and found that orderBy("t") does not generate any job. Why?
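For comparison, here is a sketch (reusing the same spark, sc, schema and Row as above) that caches only the DataFrame. DataFrame.cache() is lazy, so on its own it should not launch a job; the data is materialized and cached by the first action:

df = spark.createDataFrame(
    sc.parallelize([Row(t=i / 10.0, value=float(i * i)) for i in range(1000)], 4),
    schema=schema,
    verifySchema=False
).orderBy("t").cache()  # lazy: only marks the DataFrame as cacheable
df.count()  # the first action computes, sorts and caches the DataFrame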
I would like to use the type-safe variadic functions introduced in C++11, but with all arguments of the same type rather than of different types. An example:
template<typename T>
T maxv(T first, T second) {
    return first > second ? first : second;
}

template<typename T, typename... Rest>
T maxv(T first, T second, Rest... rest) {
    // recurse: fold the remaining arguments pairwise
    return maxv(first, maxv(second, rest...));
}
All the arguments have the same type, so it should be possible to write something like this:
struct Point { int x, y; };

template<>
Point maxv(Point first, Point second) {
    return first.x > second.x ? first : second;
}

maxv({1, 2}, {3, 4});         // no problem
maxv({1, 2}, {3, 4}, {5, 6}); // compile error
On MinGW g++ it …
If the size of a child inside a wx.BoxSizer changes, the sizer does not re-lay out its children:
import wx

class MyButton(wx.Button):
    def __init__(self, parent):
        wx.Button.__init__(self, parent, -1, style=wx.SUNKEN_BORDER, label="ABC")
        self.Bind(wx.EVT_BUTTON, self.OnClick)

    def OnClick(self, event):
        self.SetSize((200, 200))
        self.SetSizeHints(200, 200)

class MyFrame(wx.Frame):
    def __init__(self, parent, ID, title):
        wx.Frame.__init__(self, parent, ID, title, size=(300, 250))
        self.button = MyButton(self)
        button2 = wx.Button(self, -1, style=wx.SUNKEN_BORDER, label="DEF")
        # self.button.Bind(wx.EVT_SIZE, self.OnButtonResize)
        box = wx.BoxSizer(wx.HORIZONTAL)
        box.Add(self.button, 1, wx.EXPAND)
        box.Add(button2, 1, wx.EXPAND)
        self.SetAutoLayout(True)
        self.SetSizer(box)
        self.Layout()

    def OnButtonResize(self, event):
        event.Skip()
        self.Layout()
app = wx.App()
frame = MyFrame(None, -1, "Sizer Test")
frame.Show()
…
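For what it's worth, a minimal sketch of one possible workaround (my assumption, not a verified fix): update the size hints in the click handler, then explicitly ask the parent to re-run its sizer. The class name GrowButton is a hypothetical stand-in for MyButton above:

import wx

class GrowButton(wx.Button):  # hypothetical variant of MyButton
    def __init__(self, parent):
        wx.Button.__init__(self, parent, -1, label="ABC")
        self.Bind(wx.EVT_BUTTON, self.OnClick)

    def OnClick(self, event):
        self.SetSizeHints(200, 200)  # raise the minimum size the sizer honours
        self.GetParent().Layout()    # force the parent's BoxSizer to re-layout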
I have a table with a binary column of type BinaryType:

>>> df.show(3)
+--------+--------------------+
| t| bytes|
+--------+--------------------+
|0.145533|[10 50 04 89 00 3...|
|0.345572|[60 94 05 89 80 9...|
|0.545574|[99 50 68 89 00 7...|
+--------+--------------------+
only showing top 3 rows
>>> df.schema
StructType(List(StructField(t,DoubleType,true),StructField(bytes,BinaryType,true)))
If I try to extract the first byte of the binary value, I get an exception from Spark:
>>> df.select(n["t"], df["bytes"].getItem(0)).show(3)
AnalysisException: u"Can't extract value from bytes#477;"
A cast to ArrayType(ByteType) does not work either:
>>> df.select(n["t"], df["bytes"].cast(ArrayType(ByteType())).getItem(0)).show(3)
AnalysisException: u"cannot resolve '`bytes`' due to data type mismatch: cannot cast BinaryType to ArrayType(ByteType,true) ..."
How can I extract the bytes?
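One possible workaround, sketched under the assumption that a plain Python UDF is acceptable: PySpark hands BinaryType values to Python as bytearray objects, and indexing a bytearray yields an int in both Python 2 and 3:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# indexing a bytearray returns an int, so the UDF can return IntegerType
first_byte = udf(lambda b: b[0] if b else None, IntegerType())
df.select(df["t"], first_byte(df["bytes"]).alias("first_byte")).show(3)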