I tried a simple example:
data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/databricks-datasets/samples/population-vs-price/data_geo.csv")
data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values
data = data.select("2014 Population estimate", "2015 median sales price").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()
It works fine, but when I try something very similar:
data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load('/mnt/%s/OnlineNewsTrainingAndValidation.csv' % MOUNT_NAME)
data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values
data = data.select("timedelta", "shares").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()
display(data)
It raises an error: AnalysisException: u"cannot resolve 'timedelta' given input columns: [data_channel_is_tech,...
I did, of course, import LabeledPoint and LinearRegression.
What could be wrong?
Even the simpler case
df_cleaned = df_cleaned.select("shares")
raises the same AnalysisException.
*Note: df_cleaned.printSchema() works fine.
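One possible cause, offered as an assumption rather than a confirmed diagnosis: `printSchema()` succeeding while `select("shares")` fails is consistent with header names that carry stray whitespace (for example `" shares"`), because the CSV reader typically keeps header text as written. The following is a minimal sketch for inspecting and normalizing the names. `stripped_names` and `with_stripped_names` are hypothetical helpers, not part of the Spark API, and the sample header is invented for illustration.

```python
def stripped_names(columns):
    """Return the column names with leading/trailing whitespace removed."""
    return [name.strip() for name in columns]


def with_stripped_names(df):
    """Rename every column of a Spark DataFrame to its stripped form.

    Relies on DataFrame.columns and DataFrame.toDF(*names), which returns
    a new DataFrame carrying the given column names.
    """
    return df.toDF(*stripped_names(df.columns))


# Hypothetical header with hidden leading spaces (an assumption about the CSV).
raw = [" timedelta", " shares"]
print([repr(name) for name in raw])  # repr() makes stray spaces visible
print(stripped_names(raw))
```

With a real DataFrame, calling `data = with_stripped_names(data)` before `data.select("timedelta", "shares")` should let the names resolve, if whitespace is indeed the problem. Printing `[repr(c) for c in data.columns]` first is a cheap way to confirm or rule this out.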
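I am looking for a way to take the last character of a string in a DataFrame column and put it in another column.
I have a Spark DataFrame that looks like this: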
animal
======
cat
mouse
snake
I want something like this:
lastchar
========
t
e
e
Right now I can do this with a UDF that looks like this:
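from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Return the last character of the input string
def get_last_letter(animal):
    return animal[-1]
get_last_letter_udf = udf(get_last_letter, StringType())
df.select(get_last_letter_udf("animal").alias("lastchar")).show()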
I'm curious whether there is a better way to do this without a UDF. Thanks!
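One UDF-free option is the built-in `substring` function in `pyspark.sql.functions`, which accepts a negative start position counted from the end of the string. The following is a minimal sketch, assuming the DataFrame `df` and the `animal` column shown above. `add_last_char` is a hypothetical helper name, and the import is done inside the function only so the helper can be defined on a machine without PySpark.

```python
def add_last_char(df, src="animal", dst="lastchar"):
    """Return df with a new column `dst` holding the last character of `src`."""
    from pyspark.sql import functions as F  # PySpark is needed only at call time

    # substring(col, pos, len): pos = -1 starts at the last character,
    # and len = 1 keeps exactly one character.
    return df.withColumn(dst, F.substring(F.col(src), -1, 1))
```

Usage would look like `add_last_char(df).select("lastchar").show()`. Because `substring` is evaluated inside Spark rather than in a Python worker, it avoids the serialization cost a Python UDF incurs.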
I'm new to PySpark. I loaded a CSV file with pandas and created a temporary table using the registerTempTable function.
from pyspark.sql import SQLContext
from pyspark.sql import Row
import pandas as pd
sqlc = SQLContext(sc)
aa1 = pd.read_csv(r"D:\mck1.csv")  # raw string so the backslash is not read as an escape
aa2 = sqlc.createDataFrame(aa1)
aa2.show()
+--------+-------+----------+------------+---------+------------+-------------------+
|    City|     id|First_Name|Phone_Number| new_date|    new code|           New_date|
+--------+-------+----------+------------+---------+------------+-------------------+
|KOLKATTA|9000007|       AAA|  1111119411| 20080714|          13|2016-08-16 00:00:00|
|KOLKATTA|9000007|       BBB|  1111119421| 20080714|          13|2016-08-06 00:00:00|
|KOLKATTA|9000007|       CCC|  1111119461| 20080714|          13|2016-08-13 00:00:00|
|KOLKATTA|9000007|       DDD|  1111119471| 20080714|          13|2016-08-27 00:00:00|
|KOLKATTA|9000007|       EEE|  1111119491| 20080714|          13|2016-08-15 00:00:00|
|KOLKATTA|9111147|       FFF|  1111119401| 20080714|          13|2016-08-24 00:00:00|
|KOLKATTA|9585458|   FORMULA|  1111110112| 19990930|          13|2016-08-16 00:00:00|
|KOLKATTA|9569878|   APPLEII|  1111110132| 19990930|          13|2016-08-06 00:00:00|
aa3 = …