Posts by and*_*yan

DBSCAN: removing noise from a plot

Using DBSCAN,

DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine')


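I have clustered a list of latitude/longitude pairs and then plotted them with matplotlib. The plot includes the "noise" coordinates, i.e. the points that were not assigned to one of the 270 clusters created. I would like to remove the noise from the plot and show only the clusters that meet the specified requirements, but I am not sure how to go about it. How can I exclude the noise (again, the points not assigned to a cluster)?

Below is the code I use for clustering and plotting: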

import pandas as pd

df = pd.read_csv('xxx.csv')

# define the number of kilometers in one radian
# which will be used to convert eps from km to radians
kms_per_rad = 6371.0088

# define a function to calculate the geographic coordinate
# centroid of a cluster of geographic points
# it will be used later to calculate the centroids of the DBSCAN clusters
# because the scikit-learn DBSCAN class does not come with centroid …

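One way to approach this, sketched under assumptions: scikit-learn's DBSCAN assigns the label -1 to noise points, so a boolean mask on the label array keeps only the clustered points. The `coords` and `labels` arrays below are hypothetical stand-ins for the coordinate array and for `labels_` on the fitted model; they are not taken from the truncated code above.

```python
import numpy as np

# Hypothetical stand-ins: in the real script `labels` would be the
# `labels_` attribute of the fitted DBSCAN model and `coords` the
# array of [latitude, longitude] pairs that was clustered.
coords = np.array([
    [38.9875, -76.94],
    [38.9171, -77.0244],
    [38.8991, -77.029],
    [38.8893, -77.0486],
])
labels = np.array([0, -1, 1, 1])

# scikit-learn's DBSCAN marks noise points with the label -1.
clustered = labels != -1

coords_no_noise = coords[clustered]
labels_no_noise = labels[clustered]

# Only the clustered points would then be handed to matplotlib, e.g.
# plt.scatter(coords_no_noise[:, 1], coords_no_noise[:, 0], c=labels_no_noise)
```

The same mask can be applied to a DataFrame column of labels, which keeps the filtered coordinates and their cluster labels aligned.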

python cluster-analysis matplotlib dbscan

and*_*yan

2017-04-04

5
Score

1
Answers

3821
Views

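Trajectory clustering/aggregation with Python

I am working with geolocated social media posts and clustering their locations (latitude/longitude) using DBSCAN. In my dataset, many users have posted multiple times, which lets me derive their trajectories (a time-ordered sequence of positions from place to place). For example: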

3945641 [[38.9875, -76.94], [38.91711157, -77.02435118], [38.8991, -77.029], [38.8991, -77.029], [38.88927534, -77.04858468]]


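I have derived the trajectories for the entire dataset, and my next step is to cluster or aggregate them in order to identify areas with dense movement between locations. Any ideas on how to tackle trajectory clustering/aggregation in Python?

Here is some code I have been using to create the trajectories as line strings / JSON dicts: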

import pandas as pd
import numpy as np
import ujson as json
import time

# Import data
data = pd.read_csv('filepath.csv', delimiter=',', engine='python')
#print(len(data), "rows")
#print(data)

# Create DataFrame
df = pd.DataFrame(data, columns=['user_id','timestamp','latitude','longitude','cluster_labels'])
#print(data.head())

# Get a list of unique user_id values
uniqueIds = np.unique(data['user_id'].values)

# Get the ordered (by timestamp) coordinates for each user_id
# (`uid` avoids shadowing the built-in `id`)
output = [[uid, data.loc[data['user_id']==uid].sort_values(by='timestamp')[['latitude','longitude']].values.tolist()] for uid in uniqueIds]

# …


python gps graph cluster-analysis

and*_*yan

2017-04-08

5
Score

1
Answers

805
Views

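Python / NetworkX: add weights to edges by frequency of edge occurrence

I have created a MultiDiGraph in networkx. I am trying to add weights to its edges and then assign new weights based on how often each edge occurs. I used the following code to create the graph and add weights, but I am not sure how to approach the count-based weight assignment: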

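import ast
import csv

import networkx as nx
import pandas as pd

g = nx.MultiDiGraph()

# raw strings keep the backslashes in Windows paths literal
df = pd.read_csv(r'G:\cluster_centroids.csv', delimiter=',')
df['pos'] = list(zip(df.longitude, df.latitude))
dict_pos = dict(zip(df.cluster_label, df.pos))
#print(dict_pos)


with open(r'G:\edges.csv', 'r') as f:
    for row in csv.reader(f):
        if '[' in row[1]:       # only rows whose edges column holds a list
            # literal_eval parses the list text without executing arbitrary code
            g.add_edges_from(ast.literal_eval(row[1]))

for u, v, d in g.edges(data=True):
    d['weight'] = 1
for u, v, d in g.edges(data=True):
    print(u, v, d)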


Edit

I was able to assign the weights, which solves the first part of my original question, with the following:

for u, v, d in g.edges(data=True):
    d['weight'] = 1
for u, v, d in g.edges(data=True):
    print(u, v, d)


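However, I am still unable to reassign the weights based on the number of times an edge occurs (an edge can occur multiple times in the graph). I need this in order to visualize edges with a higher count versus edges with a lower count (using edge color or width). I am not sure how to reassign weights based on count; please advise. Below are sample data and a link to my full dataset.

Data

Sample centroids (nodes):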

cluster_label,latitude,longitude
0,39.18193382,-77.51885109
1,39.18,-77.27
2,39.17917928,-76.6688633
3,39.1782,-77.2617
4,39.1765,-77.1927
5,39.1762375,-76.8675441
6,39.17468,-76.8204499
7,39.17457332,-77.2807235
8,39.17406072,-77.274685
9,39.1731621,-77.2716502
10,39.17,-77.27


Sample edges:

user_id,edges
11011,"[[340, 269], …

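One possible way to get count-based weights, offered as a sketch rather than the accepted solution: count each directed (source, target) pair before building the graph, then use the count as the weight. The `edges` list below is hypothetical sample data in the same shape as the `edges` column.

```python
from collections import Counter

# Hypothetical edge list; in the question these pairs come from the
# `edges` column of the CSV, one [source, target] pair per movement.
edges = [[340, 269], [269, 340], [340, 269], [12, 7], [340, 269]]

# Count how often each directed (source, target) pair occurs.
edge_counts = Counter((u, v) for u, v in edges)

# One (source, target, weight) triple per distinct edge.
weighted_edges = [(u, v, n) for (u, v), n in edge_counts.items()]
```

These triples can be passed to `add_weighted_edges_from` on an `nx.DiGraph`, which yields one edge per pair with a `weight` attribute that can drive edge width or color when drawing. Using a `DiGraph` instead of the `MultiDiGraph` is a design choice of this sketch: the parallel edges are collapsed into a single weighted edge.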

python graph networkx

and*_*yan

2017-04-28

5
Score

1
Answers

2391
Views

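Filter a Pandas df by the count of a column and write the data

I have a dataset of geolocated social media posts that I am trying to filter by user_id frequency greater than 1 (users who have posted 2 or more times). I would like to filter on this so that I can further clean the trajectory data I am creating.

Sample code: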

import numpy as np
import pandas as pd

# Import data
data = pd.read_csv('path', delimiter=',', engine='python')
#print(len(data), "rows")
#print(data)

# Create DataFrame
df = pd.DataFrame(data, columns=['user_id','timestamp','latitude','longitude'])
#print(data.head())

# Get a list of unique user_id values
uniqueIds = np.unique(data['user_id'].values)

# Get the ordered (by timestamp) coordinates for each user_id
# (double brackets are needed to select two columns)
output = [[uid, data.loc[data['user_id']==uid].sort_values(by='timestamp')[['latitude','longitude']].values.tolist()] for uid in uniqueIds]

# Save outputs
outputs = pd.DataFrame(output)
#print(outputs)
outputs.to_csv('path', index=False, header=False)


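I tried using df[].value_counts() to get the counts of user_id and then passing > 1 in the line output = [[......data['user_id']==id>1]....., but that did not work. Is it possible to add the frequency of user_id as an additional argument in the code and extract the information for only those users?

Sample data: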

user_id, timestamp, latitude, …

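A minimal sketch of one way to filter by posting frequency, assuming the column names from the question: compute the number of rows per user with a grouped `transform`, then keep only the rows whose user has more than one post. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows using the column names from the question.
data = pd.DataFrame({
    'user_id': [1, 2, 1, 3, 2, 1],
    'timestamp': ['2017-03-09 10:00', '2017-03-09 11:00', '2017-03-09 12:00',
                  '2017-03-10 08:00', '2017-03-10 09:00', '2017-03-11 07:00'],
    'latitude': [38.99, 38.92, 38.90, 38.89, 38.88, 38.87],
    'longitude': [-76.94, -77.02, -77.03, -77.05, -77.01, -77.00],
})

# Number of posts per user, aligned row by row with the original frame.
posts_per_user = data.groupby('user_id')['user_id'].transform('size')

# Keep only rows belonging to users who posted two or more times.
frequent = data[posts_per_user > 1]
```

The filtered frame can then replace `data` in the trajectory-building step, so that single-post users never produce a one-point trajectory.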

python social-media geolocation pandas

and*_*yan


3
Score

1
Answers

4504
Views

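Write df.value_counts to a new file

I have a data frame of cluster labels generated with DBSCAN, and I am counting the frequency of the cluster labels. I can print the frequencies with df['cluster_labels'].value_counts(), but when I write them to a new file I only get the counts of the clusters, not their corresponding labels. How can I write both the cluster labels and the frequencies to a new file? Screenshots and code are below.

When printing:

When writing: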

df['cluster_labels'] = cluster_labels
cluster_counts = df['cluster_labels'].value_counts()
print(cluster_counts)
# raw strings keep the backslashes in Windows paths literal
cluster_counts.to_csv(r'G:\Programming Projects\GGS 681\dmv_tweets_20170309_20170314_cluster_counts.csv', index=False, header=True)

df_filtered = df[cluster_labels>-1]
cluster_outputs = pd.DataFrame(df_filtered)
#cluster_outputs.to_csv(r'G:\Programming Projects\GGS 681\dmv_tweets_20170309_20170314_cluster_outputs.csv', index=False, header=True)


Error when passing a new header to the file
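A likely explanation, sketched with hypothetical labels: `value_counts()` stores the cluster labels in the index of the returned Series, so writing with `index=False` drops them. Moving the index into an ordinary column first keeps both values. The column names `cluster_label` and `count` are illustrative choices.

```python
import pandas as pd

# Hypothetical labels standing in for the DBSCAN output in the question.
df = pd.DataFrame({'cluster_labels': [0, 0, 1, -1, 1, 0]})

cluster_counts = df['cluster_labels'].value_counts()

# value_counts() keeps the labels in the index, so index=False drops them.
# Moving the index into a regular column keeps both label and count.
counts_table = (
    cluster_counts
    .rename_axis('cluster_label')
    .reset_index(name='count')
)

# With no path argument, to_csv returns the CSV text; pass a path to write a file.
csv_text = counts_table.to_csv(index=False)
```

Because the labels are now a regular column, the header row is written as `cluster_label,count` without any extra header argument.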

python pandas

and*_*yan

2017-04-09

2
Score

1
Answers

3645
Views

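I have a CSV containing locations (latitude, longitude) for a given user (denoted by the id field) at a given time (timestamp). I need to calculate the distance and velocity between consecutive points for each user. For example, for ID 1 I need the distance and velocity between point 1 and point 2, point 2 and point 3, point 3 and point 4, and so on. Since I am working with coordinates on the Earth, I understand the Haversine metric would be used for the distance calculation; however, I am not sure how to iterate through my file given the ordering by time and user. Using python, how can I iterate through the file to sort events by user and time, and then calculate the distance and velocity between each pair of events?

Ideally, the output would be a second CSV that looks like this: ID#, start_time, start_location, end_time, end_location, distance, velocity.

Sample data below: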

ID,timestamp,latitude,longitude
3,6/9/2017 22:20,38.7953326,77.0088833  
1,5/5/2017 13:10,38.8890106,77.0500613
2,2/10/2017 16:23,40.7482494,73.9841913
1,5/5/2017 12:35,38.9206015,77.2223287
3,6/10/2017 10:00,42.3662109,71.0209426
1,5/5/2017 20:00,38.8974155,77.0368333
2,2/10/2017 7:30,38.8514261,77.0422981
3,6/9/2017 10:20,38.9173461,77.2225527
2,2/10/2017 19:51,40.7828687,73.9675438
3,6/10/2017 6:42,38.9542676,77.4496951
1,5/5/2017 16:35,38.8728748,77.0077629
2,2/10/2017 10:00,40.7769311,73.8761546

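A minimal sketch of one approach with pandas, under stated assumptions: timestamps are read as month/day/year (the sample is ambiguous on this point), distances are in kilometers using a mean Earth radius of 6371.0088 km, and the output column names are illustrative. Rows are sorted by user and time, each point is paired with the same user's next point via a grouped `shift`, and the Haversine formula is applied to each pair.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed value)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# A few rows copied from the sample data in the question.
df = pd.DataFrame({
    'ID': [1, 2, 1, 1, 2],
    'timestamp': ['5/5/2017 13:10', '2/10/2017 16:23', '5/5/2017 12:35',
                  '5/5/2017 20:00', '2/10/2017 7:30'],
    'latitude': [38.8890106, 40.7482494, 38.9206015, 38.8974155, 38.8514261],
    'longitude': [77.0500613, 73.9841913, 77.2223287, 77.0368333, 77.0422981],
})

# Assumption: dates are month/day/year.
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%m/%d/%Y %H:%M')

# Order events by user, then by time.
df = df.sort_values(['ID', 'timestamp'])

# Pair each point with the next point of the same user.
nxt = df.groupby('ID')[['timestamp', 'latitude', 'longitude']].shift(-1)

legs = pd.DataFrame({
    'ID': df['ID'],
    'start_time': df['timestamp'],
    'end_time': nxt['timestamp'],
    'distance_km': haversine_km(df['latitude'], df['longitude'],
                                nxt['latitude'], nxt['longitude']),
}).dropna(subset=['end_time'])  # each user's last point has no successor

# Velocity as distance over elapsed time, in km/h.
hours = (legs['end_time'] - legs['start_time']).dt.total_seconds() / 3600
legs['velocity_kmh'] = legs['distance_km'] / hours
```

Start and end coordinates can be added to `legs` in the same way as the times, and `legs.to_csv(...)` would produce the second CSV described above.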

python gps distance haversine

and*_*yan

2018-01-25

2
Score

1
Answers

2872
Views