How do I call unique on a dask DataFrame?
If I try to call it the same way I would on a regular pandas DataFrame, I get the following error:
In [27]: len(np.unique(ddf[['col1','col2']].values))
AttributeError Traceback (most recent call last)
<ipython-input-27-34c0d3097aab> in <module>()
----> 1 len(np.unique(ddf[['col1','col2']].values))
/dir/anaconda2/lib/python2.7/site-packages/dask/dataframe/core.pyc in __getattr__(self, key)
1924 return self._constructor_sliced(merge(self.dask, dsk), name,
1925 meta, self.divisions)
-> 1926 raise AttributeError("'DataFrame' object has no attribute %r" % key)
1927
1928 def __dir__(self):
AttributeError: 'DataFrame' object has no attribute 'values'
For both pandas and dask.dataframe, you should use the drop_duplicates method:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 1, 2], 'y': [10, 10, 20]})
In [3]: df.drop_duplicates()
Out[3]:
   x   y
0  1  10
2  2  20
In [4]: import dask.dataframe as dd
In [5]: ddf = dd.from_pandas(df, npartitions=2)
In [6]: ddf.drop_duplicates().compute()
Out[6]:
   x   y
0  1  10
2  2  20