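Operation: Cluster points by distance and label using connected components.
Problem: Data has to move back and forth between NetworkX, whose nodes store the attributes, and a Pandas DataFrame.
Attempt: I tried other functions, such as scikit-learn's NearestNeighbors, but they lead to the same back-and-forth movement of data.
Question: Is there a simpler way to perform this connected-components operation?
Example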
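One possible way to avoid the NetworkX round trip, offered as a sketch under assumptions rather than as the asker's method: build the neighbour pairs with `cKDTree.query_pairs`, drop pairs whose labels differ, and pass the result to `scipy.sparse.csgraph.connected_components`. The function name `cluster_by_distance_and_label`, the `max_dist` parameter, and the use of `object_id` as the output column name are hypothetical. The sketch works on a single in-memory Pandas DataFrame; applied per Dask partition it would miss neighbours that fall in different partitions.

```python
import numpy as np
import pandas as pd
from scipy import spatial
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components


def cluster_by_distance_and_label(pdf, max_dist):
    """Return one component id per row (hypothetical helper, not from the question)."""
    coords = pdf[['x', 'y', 'z']].to_numpy()
    labels = pdf['label'].to_numpy()

    # All positional index pairs (i, j) with i < j and distance <= max_dist.
    tree = spatial.cKDTree(coords)
    pairs = tree.query_pairs(r=max_dist, output_type='ndarray')

    # Keep only the pairs whose two endpoints carry the same label.
    pairs = pairs[labels[pairs[:, 0]] == labels[pairs[:, 1]]]

    # Sparse adjacency matrix over row positions; no node attributes needed.
    n = len(pdf)
    adjacency = coo_matrix(
        (np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
        shape=(n, n),
    )
    _, component = connected_components(adjacency, directed=False)

    # Re-attach the original index so the result aligns with the DataFrame.
    return pd.Series(component, index=pdf.index, name='object_id')


# Usage with the same columns as the question's example data.
pdf = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'y': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'z': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'label': [1, 2, 1, 2, 1]},
                   index=[1, 2, 3, 4, 5])
components = cluster_by_distance_and_label(pdf, max_dist=3.5)
pdf = pdf.assign(object_id=components)
```

Because the component ids come back as a Series on the original index, they can be assigned straight to the DataFrame without copying attributes into a graph object.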
import numpy as np
import pandas as pd
import dask.dataframe as dd
import networkx as nx
from scipy import spatial
# generate example dataframe
pdf = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'y': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'z': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'label': [1, 2, 1, 2, 1]},
                   index=[1, 2, 3, 4, 5])
df = dd.from_pandas(pdf, npartitions=2)
object_id = 0
def cluster(df, object_id=object_id):
    # create kdtree
    tree = spatial.cKDTree(df[['x', 'y', 'z']])
    # get neighbours within distance for every point, store …

Operation: Read two CSVs (data.csv and label.csv) into a single dataframe.
df = dd.read_csv(data_files, delimiter=' ', header=None, names=['x', 'y', 'z', 'intensity', 'r', 'g', 'b'])
df_label = dd.read_csv(label_files, delimiter=' ', header=None, names=['label'])
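Problem: Concatenating columns requires known divisions. However, setting an index sorts the data, which I explicitly do not want, because the rows in the two files are already in matching order.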
df = dd.concat([df, df_label], axis=1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-e6c2e1bdde55> in <module>()
----> 1 df = dd.concat([df, df_label], axis=1)
/uhome/hemmest/.local/lib/python3.5/site-packages/dask/dataframe/multi.py in concat(dfs, axis, join, interleave_partitions)
573 return concat_unindexed_dataframes(dfs)
574 else:
--> 575 raise ValueError('Unable to concatenate DataFrame with unknown '
576 'division specifying axis=1')
577 else:
ValueError: Unable to concatenate DataFrame with unknown …