I have a PySpark DataFrame with two ArrayType fields:
>>> df
DataFrame[id: string, tokens: array<string>, bigrams: array<string>]
>>> df.take(1)
[Row(id='ID1', tokens=['one', 'two', 'two'], bigrams=['one two', 'two two'])]
I want to combine them into a single ArrayType field:
>>> df2
DataFrame[id: string, tokens_bigrams: array<string>]
>>> df2.take(1)
[Row(id='ID1', tokens_bigrams=['one', 'two', 'two', 'one two', 'two two'])]
The syntax that works for strings doesn't seem to work here:
df2 = df.withColumn('tokens_bigrams', df.tokens + df.bigrams)
Thanks!
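A minimal sketch of two possible approaches, not from the question itself: on Spark 2.4+ the built-in concat also accepts array columns, and on older versions a small UDF can join the lists.

from pyspark.sql.functions import concat, udf
from pyspark.sql.types import ArrayType, StringType

# Spark >= 2.4: concat() concatenates array columns as well as strings
df2 = df.withColumn('tokens_bigrams', concat(df.tokens, df.bigrams))

# Older Spark versions: a UDF that joins the two lists (treating null as empty)
combine = udf(lambda a, b: (a or []) + (b or []), ArrayType(StringType()))
df2 = df.withColumn('tokens_bigrams', combine(df.tokens, df.bigrams))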
I have a 2D NumPy array that I want to put into a pandas Series (not a DataFrame):
>>> import pandas as pd
>>> import numpy as np
>>> a = np.zeros((5, 2))
>>> a
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])
But this raises an error:
>>> s = pd.Series(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/miniconda/envs/pyspark/lib/python3.4/site-packages/pandas/core/series.py", line 227, in __init__
    raise_cast_failure=True)
  File "/miniconda/envs/pyspark/lib/python3.4/site-packages/pandas/core/series.py", line 2920, in _sanitize_array
    raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional
One possible hack:
>>> s = pd.Series(map(lambda x:[x], a)).apply(lambda …
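A minimal sketch of a simpler alternative, assuming the goal is one array row per Series element: pass pd.Series a plain Python list, which is 1-dimensional even though its elements are not.

import numpy as np
import pandas as pd

a = np.zeros((5, 2))

# list(a) is a 1-D sequence whose elements are the 2-element rows
s = pd.Series(list(a))     # elements are 1-D ndarrays
s = pd.Series(a.tolist())  # elements are plain Python lists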
I have a DataFrame with a MapType field.
>>> from pyspark.sql.functions import *
>>> from pyspark.sql.types import *
>>> fields = StructType([
...     StructField('timestamp', TimestampType(), True),
...     StructField('other_field', StringType(), True),
...     StructField('payload', MapType(
...         keyType=StringType(),
...         valueType=StringType()),
...         True), ])
>>> import datetime
>>> rdd = sc.parallelize([[datetime.datetime.now(), 'this should be in', {'akey': 'aValue'}]])
>>> df = rdd.toDF(fields)
>>> df.show()
+--------------------+-----------------+-------------------+
| timestamp| other_field| payload|
+--------------------+-----------------+-------------------+
|2018-01-10 12:56:...|this should be in|Map(akey -> aValue)|
+--------------------+-----------------+-------------------+
I want to add other_field as a key in the payload field.
I know I can use a udf:
>>> def _add_to_map(name, value, map_field): …
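A minimal sketch of how such a UDF might look; the body is an assumption, since the question truncates it. It copies the existing map and inserts the column name as a new key:

from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import MapType, StringType

def _add_to_map(name, value, map_field):
    # Copy the incoming map (None-safe) and add the new key/value pair
    out = dict(map_field or {})
    out[name] = value
    return out

add_to_map = udf(_add_to_map, MapType(StringType(), StringType()))

df2 = df.withColumn(
    'payload',
    add_to_map(lit('other_field'), col('other_field'), col('payload')))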
I want to export data from HDFS to a SQL Server table in the schema my_schema.
I tried --schema, as in the import command:
sqoop export \
--libjars /opt/mapr/sqoop/sqoop-1.4.6/lib/sqljdbc4.jar \
--connect "jdbc:sqlserver://MY-SERVER-DNS;database=my_db;" \
--schema "myschema" \
--table "my_table" \
--export-dir /path/to/my/hdfs/dir
ERROR tool.BaseSqoopTool: Unrecognized argument: --schema
and --table "schema.table":
sqoop export \
--libjars /opt/mapr/sqoop/sqoop-1.4.6/lib/sqljdbc4.jar \
--connect "jdbc:sqlserver://MY-SERVER-DNS;database=my_db;" \
--table "my_schema.my_table" \
--export-dir /path/to/my/hdfs/dir
INFO manager.SqlManager:
Executing SQL statement: SELECT t.* FROM [my_schema.my_table] AS t WHERE 1=0
ERROR manager.SqlManager: Error executing statement:
com.microsoft.sqlserver.jdbc.SQLServerException:
Invalid object name 'my_schema.my_table'.
Is there a way to do this with sqoop? Or with another tool?
Edit:
sqoop export \
--libjars /opt/mapr/sqoop/sqoop-1.4.6/lib/sqljdbc4.jar \
--connect "jdbc:sqlserver://MY-SERVER-DNS;database=my_db;schema=my_schema;" \ …
Run Code Online (Sandbox Code Playgroud) 我想将汉字转换为unicode格式,例如\'\\uXXXX\'\n但是当我使用str.encode(\'utf-16be\')时,它会显示:
b'\xOO\xOO'
So I wrote some code to do what I want, as follows:
data="index=索引?"
print(data.encode('UTF-16LE'))

def convert(s):
    returnCode=[]
    temp=''
    for n in s.encode('utf-16be'):
        if temp=='':
            if str.replace(hex(n),'0x','')=='0':
                temp='00'
                continue
            temp+=str.replace(hex(n),'0x','')
        else:
            returnCode.append(temp+str.replace(hex(n),'0x',''))
            temp=''

    return returnCode

print(convert(data))
Can someone suggest how to do this conversion in Python 3.x?
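A minimal sketch of a more direct route, assuming the goal is \uXXXX escapes for the non-ASCII characters: format each code point with ord(), or use the built-in unicode_escape codec.

data = "index=索引?"

# Format each non-ASCII code point as \uXXXX, leave ASCII as-is
escaped = ''.join(
    '\\u{:04x}'.format(ord(ch)) if ord(ch) > 127 else ch
    for ch in data)
print(escaped)                        # index=\u7d22\u5f15?

# Or let the built-in codec do the escaping (returns bytes)
print(data.encode('unicode_escape'))  # b'index=\\u7d22\\u5f15?'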
python ×4
pyspark ×2
apache-spark ×1
dataframe ×1
hadoop ×1
hdfs ×1
numpy ×1
pandas ×1
python-3.x ×1
sql-server ×1
sqoop ×1
unicode ×1
utf-16 ×1