Extracting CVE information with Python 3 regular expressions

Bri*_*ian 1 regex python-3.x cve

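I often need the list of CVEs published on a vendor's security bulletin page. Sometimes it is easy to copy, but usually the CVE IDs are mixed in with a lot of other text.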


I haven't touched Python in a while, so I figured working out how to extract that information would be good practice, especially since I keep finding myself doing it by hand.


Here is my current code:

#!/usr/bin/env python3

# REQUIREMENTS
#   python3
#   BeautifulSoup (pip3 install beautifulsoup4)
#   python 3 certificates (Applications/Python 3.x/ Install Certificates.command) <-- this one took me forever to figure out!

import sys
if sys.version_info[0] < 3:
    raise Exception("Use Python 3:  python3 " + sys.argv[0])
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

#specify/get the url to scrape
#url ='https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
#url = 'https://source.android.com/security/bulletin/2020-02-01.html'
url = input("What is the URL?  ") or 'https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
print("Checking URL: " + url)

# CVE regular expression
cve_pattern = r'CVE-\d{4}-\d{4,7}'

# query the website and return the html
page = urlopen(url).read()

# parse the html returned using beautiful soup
soup = BeautifulSoup(page, 'html.parser')

count = 0

############################################################
# ANDROID === search for CVE references within <td> tags ===

# find all <td> tags
all_tds = soup.find_all("td")

#print all_tds

for td in all_tds:
    if "cve" in td.text.lower():
        print(td.text)


############################################################
# CHROME === search for CVE reference within <span> tags ===

# find all <span> tags
all_spans = soup.find_all("span")

for span in all_spans:
    # this code returns results in triplicate
    for i in re.finditer(cve_pattern, span.text):
        count += 1
        print(count, i.group())


    # this code works, but only returns the first match
#   match = re.search(cve_pattern,span.text)
#   if match:
#       print(match.group(0))

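What I did for the Android URL works fine; the problem I'm having is with the Chrome URL. Their CVE information is inside <span> tags, and I'm trying to use a regular expression to pull it out.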


Using re.finditer, I end up with every result in triplicate.
Using re.search, it misses CVE-2019-19925; they listed two CVEs on the same line.
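To make the difference between the two calls concrete, here is a minimal sketch on a made-up line of text. The sample string and its CVE IDs are invented for illustration and do not come from a real bulletin.

```python
import re

cve_pattern = r'CVE-\d{4}-\d{4,7}'

# Made-up sample line with two CVE IDs on it (not from a real bulletin)
line = "Fixes for CVE-2020-0001 and CVE-2020-0002 in the parser"

# re.search stops at the first match, so the second ID is never seen
first = re.search(cve_pattern, line)
print(first.group(0))

# re.findall returns every non-overlapping match, in order of appearance
all_matches = re.findall(cve_pattern, line)
print(all_matches)
```

This accounts for the missed second CVE with re.search. The triplicate output from re.finditer may come from nested <span> elements: find_all("span") returns every nesting level, and .text on an outer span includes the text of the spans inside it. That is an assumption about the page's markup, not something verified here.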


Can you offer any advice on the best way to get this working?


Bri*_*ian 5

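I finally figured it out myself. BeautifulSoup isn't needed; it's all regex now. To fix the duplicate/triplicate results I was seeing earlier, I convert the list returned by re.findall to a dictionary (which keeps the unique values in their original order) and then back to a list.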

import sys
if sys.version_info[0] < 3:
    raise Exception("Use Python 3:  python3 " + sys.argv[0])
import requests
import re

# Specify/get the url to scrape (included a default for easier testing)
### there is no input validation taking place here ###
url = input("What is the URL?  ") #or 'https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
print()

# CVE regular expression
cve_pattern = r'CVE-\d{4}-\d{4,7}'

# query the website and return the html
page = requests.get(url)

# initialize count to 0
count = 0

#search for CVE references using RegEx
cves = re.findall(cve_pattern, page.text)

# after several days of fiddling, I was still getting double and sometimes triple results on certain pages.  This next line
# converts the list of objects returned from re.findall to a dictionary (which retains order) to get unique values, then back to a list.
# (thanks to /sf/answers/3361964581/)
# I found order to be important sometimes, as the most severely rated CVEs are often listed first on the page
cves = list(dict.fromkeys(cves))

# print the results to the screen
for cve in cves:
    print(cve)
    count += 1

print()
print(str(count) + " CVEs found at " + url)
print()
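The de-duplication step can also be tried without any network access. The sketch below wraps it in a helper; the name extract_cves and the sample markup are hypothetical and used only for illustration. It relies on dict preserving insertion order, which the language guarantees from Python 3.7 onward.

```python
import re

CVE_PATTERN = re.compile(r'CVE-\d{4}-\d{4,7}')

def extract_cves(text):
    """Return the unique CVE IDs found in text, in order of first appearance."""
    # dict.fromkeys keeps the first occurrence of each key and drops repeats
    return list(dict.fromkeys(CVE_PATTERN.findall(text)))

# Made-up fragment in which the same ID appears several times
sample = (
    '<span>CVE-2020-0002 <span>CVE-2020-0002</span></span> '
    '<td>CVE-2020-0001</td> CVE-2020-0002'
)
print(extract_cves(sample))
```

Because the pattern runs over the raw page text, it does not matter which tag a CVE ID sits in, so one code path may cover both the <td> and <span> layouts from the question.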