Posts by zem*_*eng

Combining PySpark DataFrame ArrayType fields into a single ArrayType field

I have a PySpark DataFrame with two ArrayType fields:

>>> df
DataFrame[id: string, tokens: array<string>, bigrams: array<string>]
>>> df.take(1)
[Row(id='ID1', tokens=['one', 'two', 'two'], bigrams=['one two', 'two two'])]

I would like to combine them into a single ArrayType field:

>>> df2
DataFrame[id: string, tokens_bigrams: array<string>]
>>> df2.take(1)
[Row(id='ID1', tokens_bigrams=['one', 'two', 'two', 'one two', 'two two'])]

The syntax that works for string columns does not seem to work here:

df2 = df.withColumn('tokens_bigrams', df.tokens + df.bigrams)

Thanks!
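For context (not part of the original question): since Spark 2.4, `pyspark.sql.functions.concat` also accepts array columns, so `concat('tokens', 'bigrams')` does this natively. On earlier versions a UDF works; the row-level logic it would apply is just plain list concatenation, sketched here without a Spark session:

```python
# Row-level logic a PySpark UDF would apply to each row; in Spark the
# equivalent is  udf(combine_arrays, ArrayType(StringType()))  on old
# versions, or  F.concat('tokens', 'bigrams')  on Spark 2.4+.
def combine_arrays(tokens, bigrams):
    # Treat a null column value as an empty list
    return (tokens or []) + (bigrams or [])

print(combine_arrays(['one', 'two', 'two'], ['one two', 'two two']))
# → ['one', 'two', 'two', 'one two', 'two two']
```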

python dataframe apache-spark apache-spark-sql pyspark

12 votes · 2 answers · 20k views

Putting a 2D array into a pandas Series

I have a 2D NumPy array that I want to put into a pandas Series (not a DataFrame):

>>> import pandas as pd
>>> import numpy as np
>>> a = np.zeros((5, 2))
>>> a
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])

But this raises an error:

>>> s = pd.Series(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/miniconda/envs/pyspark/lib/python3.4/site-packages/pandas/core/series.py", line 227, in __init__
    raise_cast_failure=True)
  File "/miniconda/envs/pyspark/lib/python3.4/site-packages/pandas/core/series.py", line 2920, in _sanitize_array
    raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional

A possible hack:

>>> s = pd.Series(map(lambda x:[x], a)).apply(lambda …
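For context: `pd.Series` only accepts 1-D data, so one workaround is to split the array into its rows first and store one row-array per Series element. A minimal sketch:

```python
import numpy as np
import pandas as pd

a = np.zeros((5, 2))

# pd.Series rejects 2-D input ("Data must be 1-dimensional"), but a
# list of 1-D row arrays is accepted as an object-dtype Series.
s = pd.Series(list(a))

print(len(s))     # 5 elements, one per row of `a`
print(s.iloc[0])  # each element is a 1-D array of length 2
```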

python numpy pandas

9 votes · 1 answer · 8907 views

Adding a new key/value pair to a Spark MapType column

I have a dataframe with a MapType field.

>>> from pyspark.sql.functions import *
>>> from pyspark.sql.types import *
>>> fields = StructType([
...     StructField('timestamp',   TimestampType(), True),
...     StructField('other_field', StringType(), True),
...     StructField('payload',     MapType(keyType=StringType(),
...                                        valueType=StringType()), True),
... ])
>>> import datetime
>>> rdd = sc.parallelize([[datetime.datetime.now(), 'this should be in', {'akey': 'aValue'}]])
>>> df = rdd.toDF(fields)
>>> df.show()
+--------------------+-----------------+-------------------+
|           timestamp|      other_field|            payload|
+--------------------+-----------------+-------------------+
|2018-01-10 12:56:...|this should be in|Map(akey -> aValue)|
+--------------------+-----------------+-------------------+

I would like to add `other_field` as a key in the `payload` field.

I know I can use a udf:

>>> def _add_to_map(name, value, map_field): …

python apache-spark-sql pyspark

7 votes · 2 answers · 9854 views

Sqoop export to SQL Server: schemas?

I want to export data from HDFS to a SQL Server table in the schema `my_schema`.

I tried `--schema`, as in the import command:

sqoop export \
--libjars /opt/mapr/sqoop/sqoop-1.4.6/lib/sqljdbc4.jar \
--connect "jdbc:sqlserver://MY-SERVER-DNS;database=my_db;" \
--schema "myschema" \
--table "my_table" \
--export-dir /path/to/my/hdfs/dir

ERROR tool.BaseSqoopTool: Unrecognized argument: --schema

I also tried qualifying the table name with `--table "schema.table"`:

sqoop export \
--libjars /opt/mapr/sqoop/sqoop-1.4.6/lib/sqljdbc4.jar \
--connect "jdbc:sqlserver://MY-SERVER-DNS;database=my_db;" \
--table "my_schema.my_table" \
--export-dir /path/to/my/hdfs/dir

INFO manager.SqlManager: 
Executing SQL statement: SELECT t.* FROM [my_schema.my_table] AS t WHERE 1=0 

ERROR manager.SqlManager: Error executing statement: 
com.microsoft.sqlserver.jdbc.SQLServerException:
Invalid object name 'my_schema.my_table'.

Is there a way to do this with sqoop? Or with another technique?

Edit:

sqoop export \
--libjars /opt/mapr/sqoop/sqoop-1.4.6/lib/sqljdbc4.jar \
--connect "jdbc:sqlserver://MY-SERVER-DNS;database=my_db;schema=my_schema;" \ …
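For context: Sqoop passes connector-specific options after a bare `--` separator, and the built-in SQL Server connector accepts `--schema` there (added in SQOOP-601). A sketch of the invocation, untested and dependent on the Sqoop version in use:

```shell
sqoop export \
  --libjars /opt/mapr/sqoop/sqoop-1.4.6/lib/sqljdbc4.jar \
  --connect "jdbc:sqlserver://MY-SERVER-DNS;database=my_db;" \
  --table "my_table" \
  --export-dir /path/to/my/hdfs/dir \
  -- --schema my_schema
```

Note that everything after the bare `--` goes to the connector, so `--schema` must come last.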

sql-server hadoop hdfs sqoop

3 votes · 1 answer · 5422 views

Converting utf-8 to utf-16

I want to convert Chinese characters to Unicode escapes like '\uXXXX', but when I use str.encode('utf-16be') it shows:

b'\x00\x00'

So I wrote some code to do what I want, as follows:

data = "index=\xe7\xb4\xa2\xe5\xbc\x95?"
print(data.encode('UTF-16LE'))

def convert(s):
    returnCode = []
    temp = ''
    for n in s.encode('utf-16be'):
        if temp == '':
            if str.replace(hex(n), '0x', '') == '0':
                temp = '00'
                continue
            temp += str.replace(hex(n), '0x', '')
        else:
            returnCode.append(temp + str.replace(hex(n), '0x', ''))
            temp = ''

    return returnCode

print(convert(data))

Can someone suggest how to do this conversion in Python 3.x?

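For context, a more direct sketch of the same conversion (the function name is mine): encode to UTF-16BE and group the bytes into 4-hex-digit code units, which sidesteps the manual zero-padding the loop above does by hand.

```python
def to_utf16be_hex(s):
    # Encode to UTF-16BE, then pair up bytes into 4-hex-digit code units
    raw = s.encode('utf-16be')
    return ['{:02x}{:02x}'.format(raw[i], raw[i + 1])
            for i in range(0, len(raw), 2)]

print(to_utf16be_hex('索引'))  # → ['7d22', '5f15']
print(to_utf16be_hex('A'))    # → ['0041']
```

Note this covers BMP characters; code points above U+FFFF come out as UTF-16 surrogate pairs (two 4-digit units).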

python unicode utf-16 python-3.x

1 vote · 1 answer · 10k views