I have a set, sentences, which contains English sentences as strings. I want to create a subset of sentences, sentences2, which contains only sentences built from 20 unique words. Of course there are many, many such subsets, but I am looking for the "best" one, where by "best" I mean the subset in which all words have the highest possible representation in sentences2.
The following example should clarify what I mean by "best":
If I were to filter sentences with this set of words:
(i,you,do,think,yes,dont,can,it,good,cant,but,am,why,where,now,no,know,here,feel,are)
I would get the following:
sentences2 = set(("where are you now", "here i am", "can you do it", "yes i can", "but can i do it", "no you cant", "do you feel good", "yes i do", "why are you here", "i dont know", "i think i know why", "you dont think", "yes i do", "no you dont", "i dont think you think", "i feel good", "but i am good", "i cant do it now", "yes you can", "but i cant", "where do you think i am"))
Here every word is represented at least twice, as we can see if we run a Counter over sentences2:
c = collections.Counter({'i': 13, 'you': 10, 'do': 6, 'think': 5, 'dont': 4, 'can': 4, 'good': 3, 'but': 3, 'am': 3, 'it': 3, 'cant': 3, 'yes': 3, 'know': 2, 'no': 2, 'here': 2, 'why': 2, 'feel': 2, 'are': 2, 'now': 2, 'where': 2})
If every word is represented at least twice, we can say that this set of 20 words has a score of 2.
score = min(c.values())
However, the following set:
(i,you,he,do,think,yes,dont,can,it,good,cant,but,am,why,where,now,no,here,she,are)
has a score of 5, because if I use it to filter sentences, I get a sentences2 in which every word is represented at least five times.
So I am after the highest possible score over all possible combinations of 20 words.
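To make the scoring concrete, here is a minimal sketch of a score function for a candidate word set under the definition above (the helper name, and the assumption that each sentence is a whitespace-separated string, are mine and not part of the question):
from collections import Counter

def score(word_set, sentences):
    # keep only the sentences built entirely from words in word_set
    kept = [s.split() for s in sentences if set(s.split()) <= set(word_set)]
    c = Counter(w for words in kept for w in words)
    return min(c.values()) if c else 0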
Here is my attempt at solving this problem:
import itertools
from collections import Counter

sentences = ... # all the sentences in my text (each sentence as a sequence of words)
common_words = ... # the hundred most common words in the text
result_size = 20
highest_score = 0
for sample in itertools.combinations(common_words, result_size):
    sentences2 = list(filter(lambda s: set(s).issubset(sample), sentences))
    c = Counter([j for i in sentences2 for j in i])
    if len(c.values()) and min(c.values()) > highest_score:
        # this is the set with the highest score to date
        print(c)
        highest_score = min(c.values())
However, if I am not mistaken, this algorithm will take forever to run, with 5.3598337040381E+20 combinations. Can you suggest how to solve this with a faster algorithm?
Note that the resulting set may contain fewer than 20 words, and that is perfectly fine. For example, c.values() in my algorithm does not have to match the size of result_size.
Also note that the words in the resulting set should be found among the top hundred words (common_words contains 100 values). That too is by design.
Disclaimer: you haven't specified the characteristics of your data, so my answer assumes it is not too large (no more than 1,000,000 sentences, each with at most 1,000 words). The description is also a bit complicated, and I may not have understood the problem completely.
Solution:
Why not create a hashMap (a dict in Python) for your 100 most frequent words, then iterate over every word of every sentence, incrementing its corresponding value (if it is already in the dict)?
At the end, just sort this hashMap by the number of occurrences (value) of each word (key) and take the 20 most frequent ones.
Complexity:
A quick look at the algorithm gives:
Iterate over the N sentences, iterating over each of their M words and incrementing the hashMap value. At the end, sort the set of (word, occurrences) pairs, which is negligible (the hashMap size is constant: the 100 frequently used words), and extract the first 20.
Time complexity: O(N*M)
Space complexity: O(1) (we don't need to store the sentences, we only need the hashMap)
Sample code:
Here is some quick pseudocode:
word_occur_dict = {} #initialized with frequent words as keys, and zero as value for all
for sentence in sentences:               #for each sentence
    sentence_words = sentence.split(" ") #construct the word list
    for word in sentence_words:          #for each word
        if word in word_occur_dict:      #if it is a frequent word, increase value
            word_occur_dict[word]++
final_result = sort_dict(word_occur_dict)[:20] #returns list of tuples
Python code:
import operator

common_words = ["do","think","yes","dont","can","it","good","cant","but","am","why","where","now","no","know","here","feel","are","i","you","he","she"]
common_words_dict = {}
sentences = ["where are you now", "here i am", "can you do it", "yes i can", "but can i do it", "no you cant", "do you feel good", "yes i do", "why are you here", "i dont know", "i think i know why", "you dont think", "yes i do", "no you dont", "i dont think you think", "i feel good", "but i am good", "i cant do it now", "yes you can", "but i cant", "where do you think i am"]

for w in common_words: #initialize the dict
    common_words_dict[w] = 0

for sentence in sentences: #for each sentence
    sentence_words = sentence.split(" ") #construct the word list
    for word in sentence_words: #for each word
        if word in common_words_dict: #if it is a frequent word, increase value
            common_words_dict[word] = common_words_dict[word] + 1

sorted_word_dict = sorted(common_words_dict.items(), key=operator.itemgetter(1))
print(sorted_word_dict[::-1][:20])
By the way, 'he' and 'she' do not appear anywhere in the sentences, yet you said the following combination of words has a score of 5:
(i, you, he, do, think, yes, dont, can, it, good, cant, but, am, why, where, now, no, here, she, are)
Have I misunderstood the question?
Credit where credit is due: StackOverflow: Sort a Python dictionary by value
Step 1 should be to create a data structure that holds only the words from the sentences that appear in common_words. The structure can also hold how many times each word occurs and a set of integers referencing the sentences the word is found in.
counts = [..., {
    word: string,
    count: number,
    ids: Set<number>
}, ...]
Some pseudocode:
countsMap = Map()
For i = 0 To sentences.Size - 1
    sentence = sentences[i]
    For Each word in sentence
        If Not countsMap.Contains(word) Then
            countsMap.Add(word, {word:word, count:0, ids:Set()})
        End If
        value = countsMap.Get(word)
        If Not value.ids.Contains(i) Then
            value.count++
            value.ids.Add(i)
            countsMap[word] = value
        End If
    Next
Next
counts = countsMap.Values
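A minimal Python sketch of Step 1, under the assumption that each sentence is a whitespace-separated string and that common_words is available as in the question (the helper name build_counts is mine):
from collections import defaultdict

def build_counts(sentences, common_words):
    # word -> {"count": number of sentences containing it, "ids": set of sentence indices}
    frequent = set(common_words)
    counts = defaultdict(lambda: {"count": 0, "ids": set()})
    for i, sentence in enumerate(sentences):
        for word in set(sentence.split(" ")):  # each sentence counted at most once per word
            if word in frequent:               # keep only words that appear in common_words
                counts[word]["count"] += 1
                counts[word]["ids"].add(i)
    return dict(counts)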
Idealistic Step 2: If you are lucky enough that your counts data type contains fewer than 40 entries, you can do an exhaustive search of the C(n, 20) combinations in a reasonable time on a single machine, C(38, 20) ≈ 33 billion. This would involve iterating over the combinations and intersecting the sets of ids; the size of the final set is your minimum score.
Some pseudocode:
bestScore = 0
bestCombo = null
For Each combo in Combinations(counts, 20)
    score = combo.Reduce((prev, curr) => prev.ids.Intersect(curr.ids)).Size
    If bestScore < score Then
        bestScore = score
        bestCombo = combo
    End If
Next
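A hedged Python sketch of this exhaustive search, reusing the hypothetical counts mapping from the build_counts sketch above (the names are illustrative, not from the answer):
from itertools import combinations

def best_combo_exhaustive(counts, k=20):
    # counts: word -> {"count": ..., "ids": set of sentence indices}
    best_score, best_combo = 0, None
    for combo in combinations(counts, k):
        # sentences that contain every word of the combination
        common_ids = set.intersection(*(counts[w]["ids"] for w in combo))
        if len(common_ids) > best_score:
            best_score, best_combo = len(common_ids), combo
    return best_score, best_combo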
Realistic Step 2: In most cases your counts will contain far more than 40 unique words, in which case you will have to settle for a best guess / approximation. I would probably do something like: use the code above, but instead of Pick 20 use Pick 2, sort the results descending by score and take 10.
Some pseudocode:
list = []
For Each combo in Combinations(counts, 2)
    score = combo[0].ids.Intersect(combo[1].ids).Size
    list.Add( { score:score, words:[ combo[0].word, combo[1].word ] } )
Next
// sort descending by score
list.Sort((a, b) => b.score - a.score)
// grab the 20 best words
result = Set()
i = 0
While result.Size < 20
    result.Add(list[i].words[0])
    result.Add(list[i].words[1])
    i = i + 1
End While
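A minimal Python sketch of this pairwise approximation, again assuming the hypothetical counts mapping from the Step 1 sketch (helper names are mine):
from itertools import combinations

def best_words_pairwise(counts, k=20):
    # score every pair of words by how many sentences contain both of them
    pairs = []
    for w1, w2 in combinations(counts, 2):
        pairs.append((len(counts[w1]["ids"] & counts[w2]["ids"]), w1, w2))
    pairs.sort(reverse=True)        # best-scoring pairs first
    result = set()
    for _, w1, w2 in pairs:         # greedily collect words from the best pairs
        result.update((w1, w2))
        if len(result) >= k:
            break
    return result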
Will you end up with a final score greater than 1? Statistically that will depend on how many unique words and sentences there are, but probably not.
EDIT Implementation note and correction. Intersecting the sets of sentence ids in which the words occur gives you the minimum score minus one (zero indexed). For example, "Dog" is in sentences 1 and 2; "Cat" is in sentences 2 and 3; "Frog" is in sentence 4. The intersection [1,2] ∩ [2,3] ∩ [4] = [], but the minimum score is 1, i.e. the result of .Size() + 1. Likewise, for just "Dog" and "Cat", [1,2] ∩ [2,3] = [2] has a set size of 1, but the minimum score is 2.
This is NP-hard by a reduction from CLIQUE (assuming we replace 20 with a parameter). Given a graph in which we are looking for a k-clique, assign each vertex a unique word, make a two-word sentence for each edge, and try to choose C(k, 2) sentences that contain each chosen word k - 1 times.
It would be worth considering whether there is an algorithm with reasonable parameterized complexity.
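A small illustrative Python sketch of this reduction (the graph, the helpers, and the convention of taking the minimum count over the chosen words are my own assumptions): each vertex becomes a word, each edge becomes a two-word sentence, and a set of k words scores k - 1 exactly when the corresponding vertices form a k-clique.
from collections import Counter

def clique_instance_to_sentences(edges):
    # each edge (u, v) of the graph becomes the two-word sentence "u v"
    return [f"{u} {v}" for u, v in edges]

def min_count(words, sentences):
    # minimum count over the chosen words, in the sentences built only from them
    chosen = set(words)
    kept = [s.split() for s in sentences if set(s.split()) <= chosen]
    c = Counter(w for s in kept for w in s)
    return min((c[w] for w in chosen), default=0)

# a triangle {a, b, c} plus a pendant vertex d
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
sentences = clique_instance_to_sentences(edges)
print(min_count(["a", "b", "c"], sentences))  # 2 == k - 1, so {a, b, c} is a 3-clique
print(min_count(["a", "b", "d"], sentences))  # 0, since d shares no sentence with a or b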