Bri*_*ian 1 regex python-3.x cve
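I often need the list of CVEs shown on a vendor's security bulletin page. Sometimes it is easy to copy, but usually the CVE IDs are mixed in with a lot of other text.

I haven't touched Python in a while, so I thought working out how to extract that information would be good practice, especially since I keep finding myself doing it by hand.

Here is my current code: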

#!/usr/bin/env python3

# REQUIREMENTS
# python3
# BeautifulSoup (pip3 install beautifulsoup4)
# python 3 certificates (Applications/Python 3.x/ Install Certificates.command) <-- this one took me forever to figure out!

import sys
if sys.version_info[0] < 3:
    raise Exception("Use Python 3: python3 " + sys.argv[0])
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

#specify/get the url to scrape
#url ='https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
#url = 'https://source.android.com/security/bulletin/2020-02-01.html'
url = input("What is the URL? ") or 'https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
print("Checking URL: " + url)

# CVE regular expression
cve_pattern = r'CVE-\d{4}-\d{4,7}'

# query the website and return the html
page = urlopen(url).read()

# parse the html returned using beautiful soup
soup = BeautifulSoup(page, 'html.parser')

count = 0

############################################################
# ANDROID === search for CVE references within <td> tags ===

# find all <td> tags
all_tds = soup.find_all("td")

#print(all_tds)

for td in all_tds:
    if "cve" in td.text.lower():
        print(td.text)


############################################################
# CHROME === search for CVE reference within <span> tags ===

# find all <span> tags
all_spans = soup.find_all("span")

for span in all_spans:
    # this code returns results in triplicate
    for i in re.finditer(cve_pattern, span.text):
        count += 1
        print(count, i.group())


    # this code works, but only returns the first match
#    match = re.search(cve_pattern,span.text)
#    if match:
#        print(match.group(0))

What I have for the Android URL works fine; the problem I'm having is with the Chrome URL. They put the CVE information inside <span> tags, and I'm trying to use a regular expression to pull it out.
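With the re.finditer approach, I end up with every result in triplicate.
With the re.search approach, it misses CVE-2019-19925, because they list two CVEs on the same line.

Can you offer any advice on the best way to get this working?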
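
The difference between the two approaches in the question can be reproduced on a single string. This is a minimal sketch: the `line` text is made up for illustration (it is not copied from the Chrome page), and only the pattern comes from the question. `re.search` stops at the first match, while `re.finditer` keeps going through the rest of the string.

```python
import re

# Same pattern as in the question
cve_pattern = r'CVE-\d{4}-\d{4,7}'

# Hypothetical line of text with two CVE IDs on it (illustrative only)
line = "Multiple vulnerabilities in SQLite: CVE-2019-19923, CVE-2019-19925"

# re.search returns only the first match in the string (or None)
first = re.search(cve_pattern, line)
first_only = first.group(0) if first else None

# re.finditer yields a match object for every non-overlapping match
all_matches = [m.group() for m in re.finditer(cve_pattern, line)]

print(first_only)
print(all_matches)
```

This explains why the second ID on a line disappears with `re.search`. It does not explain the triplicates; those would come from the same text being scanned more than once, which is an assumption about the page structure rather than something shown here.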
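
I finally solved it myself. BeautifulSoup isn't needed; everything is regex now. To get rid of the duplicate/triplicate results I was seeing earlier, I convert the list returned by re.findall to a dictionary (which keeps the unique values in their original order) and then back to a list.
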
import sys
if sys.version_info[0] < 3:
    raise Exception("Use Python 3: python3 " + sys.argv[0])
import requests
import re
# Specify/get the url to scrape (included a default for easier testing)
### there is no input validation taking place here ###
url = input("What is the URL? ") #or 'https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
print()
# CVE regular expression
cve_pattern = r'CVE-\d{4}-\d{4,7}'
# query the website and return the html
page = requests.get(url)
# initialize count to 0
count = 0
#search for CVE references using RegEx
cves = re.findall(cve_pattern, page.text)
# after several days of fiddling, I was still getting double and sometimes triple results on certain pages. This next line
# converts the list of objects returned from re.findall to a dictionary (which retains order) to get unique values, then back to a list.
# (thanks to /sf/answers/3361964581/)
# I found order to be important sometimes, as the most severely rated CVEs are often listed first on the page
cves = list(dict.fromkeys(cves))
# print the results to the screen
for cve in cves:
    print(cve)
    count += 1
print()
print(str(count) + " CVEs found at " + url)
print()
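
The deduplication step can be tried without fetching a page. The sketch below wraps it in a helper; the name `extract_unique_cves` and the `sample` string are made up for illustration and are not part of the answer's script. It relies on dictionaries preserving insertion order, which the language guarantees from Python 3.7 onward.

```python
import re

# Same pattern as the script above, compiled once
CVE_PATTERN = re.compile(r'CVE-\d{4}-\d{4,7}')


def extract_unique_cves(text):
    """Return the CVE IDs found in text, in first-seen order, without duplicates."""
    # dict.fromkeys keeps the first occurrence of each key and preserves order
    return list(dict.fromkeys(CVE_PATTERN.findall(text)))


# Hypothetical page text in which some IDs appear more than once
sample = (
    "CVE-2020-6381 fixed. See CVE-2020-6382. "
    "Again CVE-2020-6381, plus CVE-2019-19925 and CVE-2020-6382."
)

unique = extract_unique_cves(sample)
print(unique)
print(len(unique), "CVEs found")
```

Using `set()` would also remove the duplicates, but it would not keep the page order, which the answer notes can matter because the most severe CVEs are often listed first.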