E19*_*504 5 python csv selenium web-scraping pandas
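I have this small program that visits a vocabulary list, prints all the words on the page, then clicks a button to go to the next page and prints all the words on that page as well.
I use a loop to repeat this process and iterate over all the words, which are spread across multiple pages.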
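import csv
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Create csv
outfile = open("Vocab.csv", "w", newline='')
writer = csv.writer(outfile)

# Define the dataframe
df = pd.DataFrame(columns=['rating'])

# Raw string so the backslashes in the Windows path are taken literally
PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://sq.m.wiktionary.org/w/index.php?title=Kategoria:Shqip&pagefrom=agall%C3%ABk#mw-pages")

for x in range(3):
    # Wait for the word list on the current page and print its text
    rating_element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "#mw-pages > div > div > div > ul"))
    )
    rating = rating_element.text
    print(rating)

    # Click the "next page" link
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "faqja pasardhëse"))
    )
    element.click()

    df2 = pd.DataFrame([rating], columns=['rating'])
    df = df.append(df2, ignore_index=True)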
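The code itself runs fine, but when I tried to add functionality to parse all of the data into a DataFrame, I only got an empty CSV file. I am trying to end up with a single column containing thousands of words.
You can iterate over each word and append it to the column: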
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import selenium.common.exceptions
import os
import pandas as pd

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--window-size=1920x1080")
# chrome_options.add_argument("--headless")
chrome_driver = os.getcwd() + "\\chromedriver.exe"
# Note: executable_path is deprecated in Selenium 4
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver)

# Define the dataframe
df = pd.DataFrame(columns=['rating'])

driver.get("https://sq.m.wiktionary.org/w/index.php?title=Kategoria:Shqip&pagefrom=agall%C3%ABk#mw-pages")

for x in range(200):
    rating_element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "#mw-pages > div > div > div > ul"))
    )
    rating = rating_element.text

    # One row per word instead of one row per page
    for word in rating.split('\n'):
        df2 = pd.DataFrame([word], columns=['rating'])
        # Note: DataFrame.append was removed in pandas 2.0
        df = df.append(df2, ignore_index=True)

    try:
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.LINK_TEXT, "faqja pasardhëse"))
        )
        element.click()

    except selenium.common.exceptions.TimeoutException:
        # No "next page" link appeared, so this was the last page
        break

print(df)
df.to_csv('word_list.csv', encoding='utf-8', index=False)
Output:

      rating
0    agallëk
1       agar
2      agave
3       agde
4     ageshë
..       ...
595    ankim
596  ankimor
597  ankohem
598    ankoj
599   ankojë

[600 rows x 1 columns]
Added the option to write to a file.
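On pandas 2.0 and later, `DataFrame.append` is no longer available, so the per-word append above will fail there. One possible alternative is to gather the words in a plain list and build the DataFrame once at the end. The following is a minimal sketch of that idea, not the original answer's method: `words_from_pages` and `sample_pages` are hypothetical names, and the sample strings stand in for the `rating_element.text` values collected in the Selenium loop.

```python
import io

import pandas as pd


def words_from_pages(page_texts):
    """Split each page's newline-separated text into individual words."""
    words = []
    for text in page_texts:
        # One word per line; strip whitespace and skip blank lines.
        words.extend(line.strip() for line in text.splitlines() if line.strip())
    return words


# Hypothetical stand-in for the text scraped from two consecutive pages.
sample_pages = ["agallëk\nagar\nagave", "agde\nageshë"]

# Build the single-column DataFrame in one step instead of appending row by row.
df = pd.DataFrame({"rating": words_from_pages(sample_pages)})

# Write to an in-memory buffer here; pass a filename such as
# 'word_list.csv' to df.to_csv() to write to disk instead.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

In the scraping loop, this would amount to appending each page's `rating_element.text` to a list and calling `words_from_pages` once after the loop ends. Building the frame once also avoids copying the whole DataFrame on every word.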