e-s*_*tis 8 python memory-leaks
此方法迭代数据库中的术语列表,检查术语是否在作为参数传递的文本中,如果是,则将其替换为带有术语作为参数的搜索页面的链接.
术语数量很高(大约100000),所以这个过程非常慢,但这是好的,因为它是作为一个cron作业执行的.但是,它导致脚本内存消耗飙升,我无法找到原因:
class SearchedTerm(models.Model):
[...]
@classmethod
def add_search_links_to_text(cls, string, count=3, queryset=None):
"""
Take a list of all researched terms and search them in the
text. If they exist, turn them into links to the search
page.
This process is limited to `count` replacements maximum.
WARNING: because the sites got different URLS schemas, we don't
provides direct links, but we inject the {% url %} tag
so it must be rendered before display. You can use the `eval`
tag from `libs` for this. Since they got different namespace as
well, we enter a generic 'namespace' and delegate to the
template to change it with the proper one as well.
If you have a batch process to do, you can pass a query set
that will be used instead of getting all searched term at
each calls.
"""
found = 0
terms = queryset or cls.on_site.all()
# to avoid duplicate searched terms to be replaced twice
# keep a list of already linkified content
# added words we are going to insert with the link so they won't match
# in case of multi passes
processed = set((u'video', u'streaming', u'title',
u'search', u'namespace', u'href', u'title',
u'url'))
for term in terms:
text = term.text.lower()
# no small word and make
# quick check to avoid all the rest of the matching
if len(text) < 3 or text not in string:
continue
if found and cls._is_processed(text, processed):
continue
# match the search word with accent, for any case
# ensure this is not part of a word by including
# two 'non-letter' character on both ends of the word
pattern = re.compile(ur'([^\w]|^)(%s)([^\w]|$)' % text,
re.UNICODE|re.IGNORECASE)
if re.search(pattern, string):
found += 1
# create the link string
# replace the word in the description
# use back references (\1, \2, etc) to preserve the original
# formatin
# use raw unicode strings (ur"string" notation) to avoid
# problems with accents and escaping
query = '-'.join(term.text.split())
url = ur'{%% url namespace:static-search "%s" %%}' % query
replace_with = ur'\1<a title="\2 video streaming" href="%s">\2</a>\3' % url
string = re.sub(pattern, replace_with, string)
processed.add(text)
if found >= 3:
break
return string
Run Code Online (Sandbox Code Playgroud)
你可能也想要这个代码:
class SearchedTerm(models.Model):
[...]
@classmethod
def _is_processed(cls, text, processed):
"""
Check if the text if part of the already processed string
we don't use `in` the set, but `in ` each strings of the set
to avoid subtring matching that will destroy the tags.
This is mainly an utility function so you probably won't use
it directly.
"""
if text in processed:
return True
return any(((text in string) for string in processed))
Run Code Online (Sandbox Code Playgroud)
我真的只有两个参考对象可能是这里的嫌疑人:terms和processed.但我看不出任何理由让他们不被垃圾收集.
编辑:
我想我应该说这个方法是在Django模型方法本身内部调用的.我不知道它是否相关,但这里是代码:
class Video(models.Model):
[...]
def update_html_description(self, links=3, queryset=None):
"""
Take a list of all researched terms and search them in the
description. If they exist, turn them into links to the search
engine. Put the reset into `html_description`.
This use `add_search_link_to_text` and has therefor, the same
limitations.
It DOESN'T call save().
"""
queryset = queryset or SearchedTerm.objects.filter(sites__in=self.sites.all())
text = self.description or self.title
self.html_description = SearchedTerm.add_search_links_to_text(text,
links,
queryset)
Run Code Online (Sandbox Code Playgroud)
我可以想象自动Python正则表达式缓存占用了一些内存.但它应该只做一次,每次调用时内存消耗都会增加update_html_description.
问题不仅在于它消耗了大量内存,问题在于它不会释放它:每次调用占用ram的大约3%,最终填满它并使用"无法分配内存"来崩溃脚本.
| 归档时间: |
|
| 查看次数: |
1010 次 |
| 最近记录: |