I have two lists as follows.
mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]
I want to get the average of the elements that are common to both lists, like this.
myoutput = [["chocolate", 0.5], ["egg", 0.45]]
My current code is as follows.
import numpy as np

for item1 in mylist1:
    for item2 in mylist2:
        if item1[0] == item2[0]:
            print(np.mean([item1[1], item2[1]]))
However, because there are two for loops (O(n^2) complexity), this is very inefficient for long lists. I would like to know whether there is a more standard/efficient way to do this in Python.
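One linear-time alternative (a sketch, not necessarily the only idiomatic way) is to turn one list into a dict so that membership tests and lookups become O(1):

# build a lookup table from mylist2, then make a single pass over mylist1 -- O(n + m) overall
lookup = dict(mylist2)
myoutput = [[name, (value + lookup[name]) / 2]
            for name, value in mylist1 if name in lookup]
# roughly [['egg', 0.45], ['chocolate', 0.5]] -- order follows mylist1, values subject to float rounding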
I have a list of concepts (myconcepts) and a list of sentences (sentences) as follows.
concepts = [['natural language processing', 'text mining', 'texts', 'nlp'], ['advanced data mining', 'data mining', 'data'], ['discourse analysis', 'learning analytics', 'mooc']]
sentences = ['data mining and text mining', 'nlp is mainly used by discourse analysis community', 'data mining in python is fun', 'mooc data analysis involves texts', 'data and data mining are both very interesting']
Broadly speaking, I want to find the concepts in the sentences. More specifically, given a list in concepts (e.g. ['natural language processing', 'text mining', 'texts', 'nlp']), I want to identify these concepts in the sentences and replace each of them with the list's first element (i.e. natural language processing).
Example: So, if we consider the sentence …
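Although the worked example above is cut off, one common efficient approach is to map every phrase to the first element of its concept list and compile a single regex, longest phrases first so that 'data mining' wins over 'data'. A minimal sketch, assuming the concepts and sentences defined above:

import re

# map every surface phrase to the first element of its concept list
replacement = {phrase: group[0] for group in concepts for phrase in group}

# one alternation pattern, longest phrases first so multi-word concepts match before their substrings
pattern = re.compile(r'\b(' + '|'.join(
    re.escape(p) for p in sorted(replacement, key=len, reverse=True)) + r')\b')

replaced = [pattern.sub(lambda m: replacement[m.group(0)], s) for s in sentences]
# e.g. 'data mining and text mining' -> 'advanced data mining and natural language processing'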
I have the following three lists.
mylist = [[5274919, ["my cat", "little dog", "fish", "rat"]],
          [5274920, ["my cat", "parrot", "little dog"]],
          [5274991, ["little dog", "fish", "duck"]]]
myconcepts = ["my cat", "little dog"]
hatedconcepts = ["rat", "parrot"]
For each concept in myconcepts, I want to count the other concepts that co-occur with it, using mylist. I then want to remove the hatedconcepts from the result. So my output should look as follows.
{"my cat": [("my cat", 2), ("little dog", 2), ("fish", 1)],
"little dog": [("little dog", 3), ("my cat", 2), ("fish", 2), ("duck", 1)]}
I am using this code to do it.
import collections
myoutput = []
for concept in myconcepts:
    mykeywords = []
    for item in mylist:
        if …
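Since the code above is cut off, here is a self-contained sketch of one way to produce that output with collections.Counter (a set for the hated concepts keeps the filtering cheap):

from collections import Counter

hated = set(hatedconcepts)
myoutput = {}
for concept in myconcepts:
    counts = Counter()
    for _, keywords in mylist:
        if concept in keywords:
            counts.update(k for k in keywords if k not in hated)
    # most_common() returns (keyword, count) pairs sorted by count, matching the expected output
    myoutput[concept] = counts.most_common()
print(myoutput)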
For each concept in my dataset, I have stored the corresponding Wikipedia categories. For example, consider the following five concepts and their corresponding Wikipedia categories.

['Category:Lipid metabolism disorders', 'Category:Medical conditions related to obesity']
['Category:Enzyme inhibitors', 'Category:Medicinal chemistry', 'Category:Metabolism']
['Category:Surgery stubs', 'Category:Surgical procedures and techniques']
['Category:1829 establishments in Australia', 'Category:Australian capital cities', 'Category:Metropolitan areas of Australia', 'Category:Perth, Western Australia', 'Category:Populated places established in 1829']
['Category:Climate', 'Category:Climatology', 'Category:Meteorological concepts']

As you can see, the first three concepts belong to the medical domain (while the remaining two terms are not medical terms).
More precisely, I want to classify my concepts as medical or non-medical. However, it is very difficult to do that using the categories alone. For example, even though the two concepts enzyme inhibitor and bypass surgery are both in the medical domain, their categories are very different from each other.
Therefore, I would like to know whether there is a way to obtain the parent category of these categories (for example, the categories of enzyme inhibitor and bypass surgery would belong to a medical parent category).
I am currently using pymediawiki and pywikibot. However, I am not limited to those two libraries, and I am happy to use other libraries to solve the problem as well.
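For reference, the immediate parent categories of a category can also be fetched straight from the MediaWiki API (prop=categories on the category page itself); a minimal sketch using requests rather than the two libraries above:

import requests

API = "https://en.wikipedia.org/w/api.php"

def parent_categories(category_title):
    """Return the categories that a category page itself belongs to, e.g. 'Category:Enzyme inhibitors'."""
    params = {
        "action": "query",
        "prop": "categories",
        "titles": category_title,
        "cllimit": "max",
        "format": "json",
    }
    page = next(iter(requests.get(API, params=params).json()["query"]["pages"].values()))
    return [c["title"] for c in page.get("categories", [])]

print(parent_categories("Category:Enzyme inhibitors"))

Walking further up the hierarchy would mean repeating this call on each returned category; the category graph is not a tree, so some stopping condition is needed.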
EDIT
As suggested by @IlmariKaronen, I also used the categories of categories, and the results I got are as follows (the small font close to …
I am calculating the triad census as follows for my undirected network.
import networkx as nx
G = nx.Graph()
G.add_edges_from(
    [('A', 'B'), ('A', 'C'), ('D', 'B'), ('E', 'C'), ('E', 'F'),
     ('B', 'H'), ('B', 'G'), ('B', 'F'), ('C', 'G')])
from itertools import combinations
#print(len(list(combinations(G.nodes, 3))))
triad_class = {}
for nodes in combinations(G.nodes, 3):
    n_edges = G.subgraph(nodes).number_of_edges()
    triad_class.setdefault(n_edges, []).append(nodes)
print(triad_class)
It works fine with small networks. However, now I have a bigger network with approximately 4000-8000 nodes. When I try to run …
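If only the counts per class (0, 1, 2 or 3 internal edges) are needed, rather than the actual node triples, the census can be derived from degrees and triangle counts instead of enumerating all C(n, 3) combinations. A sketch of that combinatorial shortcut (assuming Python 3.8+ for math.comb):

from math import comb
import networkx as nx

def triad_census_counts(G):
    """Count undirected triads by number of internal edges without enumerating every triple."""
    n, m = G.number_of_nodes(), G.number_of_edges()
    c3 = sum(nx.triangles(G).values()) // 3           # nx.triangles counts each triangle at all 3 vertices
    paths2 = sum(comb(d, 2) for _, d in G.degree())    # 2-paths centred at each node
    c2 = paths2 - 3 * c3                               # 2-paths that do not close into a triangle
    c1 = m * (n - 2) - 2 * c2 - 3 * c3                 # one edge plus a third node not attached to it
    c0 = comb(n, 3) - c1 - c2 - c3
    return {0: c0, 1: c1, 2: c2, 3: c3}

print(triad_census_counts(G))   # for the example graph above: {0: 19, 1: 20, 2: 17, 3: 0}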
OK, I am trying to get information about movies from Wikidata, using this movie as an example: https://www.wikidata.org/wiki/Q24871

On that page the data is displayed clearly in a readable format, but when you try to extract it through the API you get this: https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q24871

Here is one part of it:
"P272": [
{
"id": "q24871$4721C959-0FCF-49D4-9265-E4FAC217CB6E",
"mainsnak": {
"snaktype": "value",
"property": "P272",
"datatype": "wikibase-item",
"datavalue": {
"value": {
"entity-type": "item",
"numeric-id": 775450
},
"type": "wikibase-entityid"
}
},
"type": "statement",
"rank": "normal"
},
{
"id": "q24871$31777445-1068-4C38-9B4B-96362577C442",
"mainsnak": {
"snaktype": "value",
"property": "P272",
"datatype": "wikibase-item",
"datavalue": {
"value": {
"entity-type": "item",
"numeric-id": 3041294
},
"type": "wikibase-entityid"
}
},
"type": "statement",
"rank": "normal"
},
{
"id": "q24871$08009F7A-8E54-48C3-92D9-75DEF4CF3E8D",
"mainsnak": {
"snaktype": …Run Code Online (Sandbox Code Playgroud) 我正在使用 DBSCAN 进行聚类。然而,现在我想从每个簇中选取一个点来代表它,但我意识到 DBSCAN 没有像 kmeans 中那样具有质心。
I am using DBSCAN for clustering. However, I now want to pick one point from each cluster to represent it, and I realize that DBSCAN does not have centroids the way kmeans does. I have observed, though, that DBSCAN has something called core points. I am wondering whether I can use these core points, or any other alternative, to obtain a representative point from each cluster.

I have included the code I am using below.
import numpy as np
from math import pi
from sklearn.cluster import DBSCAN
#points containing time value in minutes
points = [100, 200, 600, 659, 700]
def convert_to_radian(x):
    return((x / (24 * 60)) * 2 * pi)
rad_function = np.vectorize(convert_to_radian)
points_rad = rad_function(points)
#generate distance matrix from each point
dist = points_rad[None,:] - points_rad[:, None]
#Assign shortest distances from each point
dist[((dist > pi) & (dist …
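Since the snippet is cut off before the clustering step, here is a hedged sketch of picking one representative per cluster from the core samples. The DBSCAN call below fits on the radian values directly, and its eps/min_samples are illustrative assumptions, not the author's actual parameters:

from sklearn.cluster import DBSCAN

# illustrative fit on the radian values computed above; eps and min_samples are made up
db = DBSCAN(eps=0.5, min_samples=2).fit(points_rad.reshape(-1, 1))

representatives = {}
for idx in db.core_sample_indices_:          # indices of the core samples only
    label = db.labels_[idx]
    # keep the first core point seen in each cluster as its representative
    representatives.setdefault(label, points[idx])

print(representatives)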
mystring = "my friend's new new new new and old old cats are running running in the street"
My output should look like this.
myoutput = "my friend's new and old cats are running in the street"
I am using the following Python code to do this.
mylist = []
for i, w in enumerate(mystring.split()):
    for n, l in enumerate(mystring.split()):
        if l != w and i == n-1:
            mylist.append(w)
mylist.append(mystring.split()[-1])
myoutput = " ".join(mylist)
However, my code is O(n²) and really inefficient, since I have a huge dataset. I would like to know whether there is a more efficient way to do this in Python.
I am happy to provide more details if needed.
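A single O(n) pass with itertools.groupby collapses each run of identical consecutive words, for example:

from itertools import groupby

myoutput = " ".join(word for word, _ in groupby(mystring.split()))
# -> "my friend's new and old cats are running in the street"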
I have a time-series dataset with two labels (0 and 1). I am using Dynamic Time Warping (DTW) as a similarity measure for classification using k-nearest neighbour (kNN), as described in these two wonderful blog posts:
http://alexminnaar.com/2014/04/16/Time-Series-Classification-and-Clustering-with-Python.html
Arguments
---------
n_neighbors : int, optional (default = 5)
    Number of neighbors to use by default for KNN

max_warping_window : int, optional (default = infinity)
    Maximum warping window allowed by the DTW dynamic
    programming function

subsample_step : int, optional (default …

I have a list of lists as follows.
sentences = [
    ["my", "first", "question", "in", "stackoverflow", "is", "my", "favorite"],
    ["my", "favorite", "language", "is", "python"]
]
I want to get the count of each word in the sentences list. So my output should look like this.
{
    'stackoverflow': 1,
    'question': 1,
    'is': 2,
    'language': 1,
    'first': 1,
    'in': 1,
    'favorite': 2,
    'python': 1,
    'my': 3
}
I am currently doing it as follows.
frequency_input = [item for sublist in sentences for item in sublist]
frequency_output = dict(
    (x, frequency_input.count(x))
    for x in set(frequency_input)
)
However, this is not at all efficient for long lists. My list is really long, with about one million sentences in it. It has already taken two days to run, and it is still running.

In this situation, I would like to make the program more efficient. My current first line of code is O(n^2) and the second line is O(n). Please let me know if there is a more efficient way to do this in Python. It would be ideal if I could run it in less time than it currently takes. I am not worried about space complexity.

I am happy to provide more details if needed.
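A linear-time alternative is collections.Counter over a flattened iterator, which avoids calling .count() once per unique word; a minimal sketch:

from collections import Counter
from itertools import chain

# one pass over all words in all sentences -- O(total number of words)
frequency_output = dict(Counter(chain.from_iterable(sentences)))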