Related problem solutions (0)

import fileinput

text = "sample file.txt"
fields = {"pattern 1": "replacement text 1", "pattern 2": "replacement text 2"}

# With inplace=True, anything printed is written back to the file.
for line in fileinput.input(text, inplace=True):
    line = line.rstrip()
    # Apply every replacement to the line, then write the line out once.
    for field, field_value in fields.items():
        if field in line:
            line = line.replace(field, field_value)
    print(line)


python dictionary in-place python-2.7

Dav*_*ing

2017 05-23

6
votes

1
answer

4940
views

Filtering a list of strings using a list comprehension

>>> li = ["a b self", "mpilgrim", "foo c", "b", "c", "b", "d", "d"]
>>> condition = ["b", "c", "d"]
>>> [elem for elem in li if elem in condition]
['b', 'c', 'b', 'd', 'd']


But is there a way to return

['a b self','foo c','b', 'c', 'b', 'd', 'd']


Since b and c are contained in 'a b self' and 'foo c', I want the code to return those two as well.
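One possible way to get that result, sketched under the assumption that "contained" means some whitespace-separated word of the element is in condition (not an arbitrary substring match). Turning condition into a set is also an assumption made here for faster lookups:

```python
li = ["a b self", "mpilgrim", "foo c", "b", "c", "b", "d", "d"]
condition = {"b", "c", "d"}  # assumption: a set, for fast membership tests

# Keep an element when any of its whitespace-separated words is in condition.
result = [elem for elem in li if any(word in condition for word in elem.split())]
print(result)  # ['a b self', 'foo c', 'b', 'c', 'b', 'd', 'd']
```

If plain substring matching is what is wanted instead, the test inside the comprehension would become `any(c in elem for c in condition)`.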

python list-comprehension

son*_*089

lucky-day

6
votes

1
answer

3007
views

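Finding many strings in a text - Python

I am looking for the best algorithm for this problem: given a list (or a dict, or a set) of short sentences, find all occurrences of those sentences in a longer text. The list (or dict, or set) holds about 600k sentences, each about 3 words long on average. The text is about 25 words long on average. I have already normalized the text (removed punctuation, lowercased everything, and so on).

Here is what I have tried (Python):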

to_find_sentences = [
    'bla bla',
    'have a tea',
    'hy i m luca',
    'i love android',
    'i love ios',
    # .....
]

text = 'i love android and i think i will have a tea with john'

def find_sentence(to_find_sentences, text):
    words = text.split()
    res = []
    w = len(words)
    # Test every contiguous run of words against the known sentences.
    for i in range(w):
        for j in range(i+1, w+1):
            tmp = ' '.join(words[i:j])
            if tmp in to_find_sentences:
                res.append(tmp)
    return res


print(find_sentence(to_find_sentences, text))


Output:

['i love android', 'have a tea']


In my case, I used a set to speed up the in …
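The set-based lookup mentioned here can be sketched roughly as follows. This is only an illustration, not the asker's actual code: the function name `find_sentences` and the idea of capping the window at the longest known sentence are assumptions added for the example.

```python
def find_sentences(sentences, text):
    """Return every known sentence found in text, in order of appearance."""
    lookup = set(sentences)  # average O(1) membership test
    # No window needs to be longer than the longest known sentence.
    max_words = max((len(s.split()) for s in lookup), default=0)
    words = text.split()
    found = []
    for start in range(len(words)):
        stop = min(start + max_words, len(words))
        for end in range(start + 1, stop + 1):
            candidate = " ".join(words[start:end])
            if candidate in lookup:
                found.append(candidate)
    return found


sentences = ["bla bla", "have a tea", "hy i m luca", "i love android", "i love ios"]
text = "i love android and i think i will have a tea with john"
result = find_sentences(sentences, text)
print(result)  # ['i love android', 'have a tea']
```

With sentences of about 3 words and texts of about 25 words, the cap keeps the number of candidate windows per text small, and each candidate costs one set lookup.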

python string

Luc*_*llo

2017 04-26

5
votes

1
answer

220
views

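Pythonic way to count occurrences from a list of strings

What is the best way to count how many times the strings in a list occur in a target string? Specifically, I have a list: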

string_list = [
    "foo",
    "bar",
    "baz"
]

target_string = "foo bar baz bar"

# Trying to write this function!
count = occurrence_counter(target_string) # should return 4


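I would like to optimize for speed and low memory use, if that makes a difference. In terms of size, I expect string_list may eventually contain several hundred substrings.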
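A minimal sketch of one way to write such a function, assuming non-overlapping matches are what should be counted. Unlike the call shown in the question, this version takes `string_list` as a second parameter. That signature is an assumption made so the example is self-contained.

```python
import re


def occurrence_counter(target_string, string_list):
    """Count non-overlapping occurrences of any listed substring."""
    if not string_list:
        return 0
    # Longest alternatives first, so a longer substring wins over its own prefix.
    alternatives = sorted(string_list, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in alternatives))
    # finditer yields matches lazily, so no list of matches is built in memory.
    return sum(1 for _ in pattern.finditer(target_string))


string_list = ["foo", "bar", "baz"]
target_string = "foo bar baz bar"
count = occurrence_counter(target_string, string_list)
print(count)  # 4
```

If the same list is used against many target strings, compiling the pattern once outside the function avoids repeating that work.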

python algorithm

Kev*_*ell

lucky-day

5
votes

2
answers

1600
views

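Efficient way to check whether any of a large set of words appears in millions of search queries

I have a list of strings containing 50 million search queries [1 to 500+ words per query].
I also have a list of strings containing 500 words and phrases, and I need to return the indices of the search queries (1) that contain any of the words or phrases (2).

The goal is to keep only the queries related to a certain topic (movies) and then cluster the filtered queries using NLP (stemming -> tf_idf -> pca -> kmeans).

I tried filtering the queries with nested loops, but it takes more than 10 hours to finish.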

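# key_words is the list of ~500 words and phrases (defined elsewhere).
filtered = []
with open('search_logs.txt', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        query, timestamp = line.strip().split('\t')
        for word in key_words:
            if word in query:
                filtered.append(i)
                break  # record each query index once, then move to the next line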


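I looked into solutions that use a regular expression (word1|word2|...|wordN), but the problem is that I cannot combine the queries into one large string, because I need to filter out the irrelevant queries.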
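The alternation regex does not require joining the queries: it can be compiled once and applied to each query separately. The sketch below illustrates that idea under stated assumptions. The function name `matching_indices`, the sample keywords, and the use of `\b` for whole-word matching are not from the question, whose loop tests plain substrings.

```python
import re


def matching_indices(lines, key_words):
    """Return the indices of lines whose query contains any keyword."""
    if not key_words:
        return []
    # Longest alternatives first, so phrases win over their own prefixes.
    alternatives = sorted(key_words, key=len, reverse=True)
    # Assumption: whole-word matching via \b, which expects keywords that
    # start and end with a letter, digit or underscore.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in alternatives) + r")\b"
    )
    indices = []
    for i, line in enumerate(lines):
        query = line.rstrip("\n").split("\t")[0]
        if pattern.search(query):  # one regex scan per query
            indices.append(i)
    return indices


lines = [
    "the dark knight\t2019-02-17 19:05:12\n",
    "how to do a barrel roll\t2019-02-17 19:05:13\n",
    "watch movies\t2019-02-17 19:05:13\n",
]
key_words = ["dark knight", "movies"]  # hypothetical keywords
result = matching_indices(lines, key_words)
print(result)  # [0, 2]
```

Because `lines` can be any iterable, an open file object can be passed directly, so the 50 million queries never need to be held in memory at once.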

Update: samples of the logs and keywords

search_logs.txt
'query  timestamp\n'
'the dark knight    2019-02-17 19:05:12\n'
'how to do a barrel roll    2019-02-17 19:05:13\n'
'watch movies   2019-02-17 19:05:13\n'
'porn   2019-02-17 19:05:13\n'
'news   2019-02-17 …


python regex nlp

Sup*_*man

2019 04-22

5
votes

1
answer

154
views

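Matching a whole word in a string using a dynamic regex

I want to check whether a word occurs in a sentence using a regex. Words are separated by spaces, but may have punctuation on either side. If the word is in the middle of the string, the following match works (it prevents partial-word matches and allows punctuation on either side of the word).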

match_middle_words = r" [^a-zA-Z\d ]{0,}" + word + r"[^a-zA-Z\d ]{0,} "


However, this does not match the first or last word, because there is no leading/trailing space. So for those cases I have also been using:

match_starting_word = r"^[^a-zA-Z\d]{0,}" + word + r"[^a-zA-Z\d ]{0,} "
match_end_word = r" [^a-zA-Z\d ]{0,}" + word + r"[^a-zA-Z\d]{0,}$"


and then combining them:

match_string = match_middle_words + "|" + match_starting_word + "|" + match_end_word


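Is there a simple way to avoid needing three match terms? Specifically, is there a way to specify "either a space or the start of the file (i.e. "^")" and, similarly, "either a space or the end of the file (i.e. "$")"?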
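One way to express "a space or the start" and "a space or the end" in a single pattern is a non-capturing alternation group on each side. The sketch below is an illustration under assumptions: the helper name `whole_word_pattern` is made up for the example, it uses `\s` (any whitespace) where the question uses a literal space, and it escapes the word with `re.escape`.

```python
import re


def whole_word_pattern(word):
    """Build one regex that matches word as a whole word, punctuation allowed."""
    punct = r"[^a-zA-Z\d\s]*"  # any run of non-alphanumeric, non-space characters
    return re.compile(
        r"(?:^|\s)" + punct + re.escape(word) + punct + r"(?:\s|$)"
    )


pattern = whole_word_pattern("cat")
print(bool(pattern.search("a cat, sat")))   # True: word in the middle
print(bool(pattern.search("cat!")))         # True: first and last word
print(bool(pattern.search("concatenate")))  # False: partial word only
```

These groups consume the surrounding whitespace, which matters when collecting several adjacent matches. The lookarounds `(?<!\S)` and `(?!\S)` express the same two conditions without consuming anything.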

python regex python-2.7

kyr*_*nia

2018 08-25

3
votes

1
answer

6651
views

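Performance of regex match groups in a pandas DataFrame

I have a pandas Series of ~350k rows, and I want to apply the pandas.Series.str.extract function with a regex composed of ~100 substrings, for example:

The extraction is too slow: it takes 1 minute in my Jupyter notebook (Python 3.9). Why is it so slow, and how can I speed it up?

Edit 1: I used "itemX" as an example, but it can be replaced by any substring. The regex could look something like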

'(carrageenan|dihydro|basketball|etc...)'


Edit 2: answers to some comments:

I am looking for exact matches
I already use a precompiled regex via re.compile()

python regex performance pandas

Bra*_*ess

2021 06-28

2
votes

1
answer

69
views

Tag statistics

python ×9

regex ×5

python-2.7 ×2

algorithm ×1

capturing-group ×1

concurrency ×1

dictionary ×1

in-place ×1

javascript ×1

list-comprehension ×1

multithreading ×1

nlp ×1

pandas ×1

performance ×1

python-multithreading ×1

regex-group ×1

string ×1

token ×1
