Oct*_*nWR 2 python string split bigdata dask
I have 34 million rows with only one column. I want to split each string into 4 columns.
Here is my sample dataset (df):
Log
0 Apr 4 20:30:33 100.51.100.254 dns,packet user: --- got query from 10.5.14.243:30648:
1 Apr 4 20:30:33 100.51.100.254 dns,packet user: id:78a4 rd:1 tc:0 aa:0 qr:0 ra:0 QUERY 'no error'
2 Apr 4 20:30:33 100.51.100.254 dns,packet user: question: tracking.intl.miui.com:A:IN
3 Apr 4 20:30:33 dns user: query from 9.5.10.243: #4746190 tracking.intl.miui.com. A
I want to split it into four columns using the following code:
df1 = df['Log'].str.split(n=3, expand=True)
df1.columns=['Month','Date','Time','Log']
df1.head()
This is the result I expect:
Month Date Time Log
0 Apr 4 20:30:33 100.51.100.254 dns,packet user: --- go...
1 Apr 4 20:30:33 100.51.100.254 dns,packet user: id:78a...
2 Apr 4 20:30:33 100.51.100.254 dns,packet user: questi...
3 Apr 4 20:30:33 dns transjakarta: query from 9.5.10.243: #474...
4 Apr 4 20:30:33 100.51.100.254 dns,packet user: --- se...
But this is the response I get instead:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-36-c9b2023fbf3e> in <module>
----> 1 df1 = df['Log'].str.split(n=3, expand=True)
2 df1.columns=['Month','Date','Time','Log']
3 df1.head()
TypeError: split() got an unexpected keyword argument 'expand'
Is there a way to split strings like this with dask?
Dask dataframes do support the expand= keyword of the str.split method, provided you also pass an n= keyword to tell it how many splits to expect.
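If your version of dask raises the TypeError shown above, its str.split method does not implement the expand= keyword. You might raise an issue if one does not already exist.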
As a short-term workaround, you can write a Pandas function and then use the map_partitions method to apply it across your dask dataframe:
import pandas

def f(df: pandas.DataFrame) -> pandas.DataFrame:
    """This is your code from above, as a function."""
    df1 = df['Log'].str.split(n=3, expand=True)
    df1.columns = ['Month', 'Date', 'Time', 'Log']
    return df1  # return the split frame, not the untouched input

ddf = ddf.map_partitions(f)  # apply to every pandas dataframe within the dask dataframe
Because a Dask dataframe is just a collection of Pandas dataframes, it is relatively easy to build things yourself when Dask dataframe does not support them.