far*_*een 0 python csv json python-3.x pandas
I have a JSON file containing more than 46k tweets in English and other languages, and I want to save it as a CSV file. Here is part of the JSON file.

    [{"user_id": 938118866135343104, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @PTISPOfficial: \xd9\xbe\xd8\xa7\xda\xa9\xd8\xb3\xd8\xaa\xd8\xa7\xd9\x86 \xd8\xaa\xd8\xad\xd8\xb1\xdb\x8c\xda\xa9 \xd8\xa7\xd9\x86\xd8\xb5\xd8\xa7\xd9\x81 \xda\xa9\xdb\x92 \xd9\x88\xd8\xa7\xd8\xa6\xd8\xb3 \xda\x86\xdb\x8c\xd8\xa6\xd8\xb1\xd9\x85\xdb\x8c\xd9\x86 \xd8\xb4\xd8\xa7\xdb\x81 \xd9\x85\xd8\xad\xd9\x85\xd9\x88\xd8\xaf \xd9\x82\xd8\xb1\xdb\x8c\xd8\xb4\xdb\x8c \xd8\xa8\xd8\xba\xdb\x8c\xd8\xb1 \xda\xa9\xd8\xb3\xdb\x8c \xd9\xbe\xd8\xb1\xd9\x88\xd9\xb9\xd9\x88\xda\xa9\xd9\x88\xd9\x84 \xda\xa9\xdb\x92 \xd9\xbe\xd8\xa7\xda\xa9\xd8\xb3\xd8\xaa\xd8\xa7\xd9\x86 \xd8\xb3\xd9\xbe\xd8\xb1 \xd9\x84\xdb\x8c\xda\xaf \xda\xa9\xd8\xa7 \xd9\x85\xdb\x8c\xda\x86 \xd8\xaf\xdb\x8c\xda\xa9\xda\xbe\xd9\x86\xdb\x92 \xda\xa9\xdb\x92 \xd9\x84\xd8\xa6\xdb\x92 \xd8\xa7\xd8\xb3\xd9\xb9\xdb\x8c\xda\x88\xdb\x8c\xd9\x85 \xd9\x85\xe2\x80\xa6", "tweet_id": 976166125502427136}
    {"user_id": 959235642, "date_time": "03/20/2018 18:38:35", "tweet_content": "At last, Pakistan Have Witnessed The Most Thrilling Match Of Cricket In Pakistan, The Home. \\n\\n#PZvQG \\n#ABC", "tweet_id": 976166125535973378}
    {"user_id": 395163528, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @thePSLt20: SIX! 19.4 Liam Dawson to Anwar Ali\\nWatch ball by ball highlights at (link removed)\\n\\n#PZvQG #HBLPSL #PSL2018 @_crici\xe2\x80\xa6", "tweet_id": 976166126202839040}
    {"user_id": 3117825702, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @JeremyMcLellan: Rumor has it Amir Liaquat isn\xe2\x80\x99t allowed to play in #PSL2018 because he keeps switching teams every week.", "tweet_id": 976166126483902466}
    {"user_id": 3310967346, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @daniel86cricket: Peshawar beat Quetta by 1 run in one of the best T20 thrillers. 
    PSL played in front of full house in Lahore Pakistan i\xe2\x80\xa6", "tweet_id": 976166126559354880}
    {"user_id": 701494826194354179, "date_time": "03/20/2018 18:38:35", "tweet_content": "I wanted a super over\\n#PZvQG", "tweet_id": 976166126836178944}
    {"user_id": 347132028, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @hinaparvezbutt: Congratulations Peshawar Zalmi over great win but Quetta Gladiators won our hearts \xe2\x99\xa5\xef\xb8\x8f #PZvQG", "tweet_id": 976166126685171713}
    {"user_id": 3461853618, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @walterMiitty: It's harder than I thought to tell the truth\\nIt's gonna leave you in pieces\\nAll alone with your demons\\nAnd I know that we\xe2\x80\xa6", "tweet_id": 976166126924201986}]

I followed this solution to convert it to CSV, but I got an invalid syntax error on the Urdu tweets.
I also tried this:

    import json

    with open("PeshVsQuetta.json") as f:
        all_tweets = []
        for line in f:
            text_dict = json.loads(line)
            all_tweets.append(text_dict)

    print(all_tweets[0]['tweet_content'])

This gives me the following error.

    UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 148: character maps to <undefined>

I even saved the JSON file as a txt file and tried the following:

    import pandas as pd
    from ast import literal_eval

    columns = ['Tweet ID','Author ID','Tweet','Time']
    df1 = pd.DataFrame(columns = columns)
    f = open('PeshvsQuetta.txt',encoding = 'utf-8')
    counter = 1
    for line in f:
        if(counter != 1):
            s1 = literal_eval(line)
            ser = pd.Series([s1['tweet_id'],s1['user_id'],s1['tweet_content'],s1["date_time"]],index=['Tweet ID','Author ID','Tweet','Time'])
            df1 = df1.append(ser,ignore_index=True)
        counter = counter + 1
    df1.to_csv('PeshVsQuetta1.csv', encoding='utf-8',index=False,columns = columns)

But the resulting CSV file stores each series in a single cell, contains many empty rows, and splits some tweets across multiple lines. Below is the image (not included here).


You should be able to use Pandas as follows:

    import pandas as pd

    with open('PeshVsQuetta.json', encoding='utf-8-sig') as f_input:
        df = pd.read_json(f_input)

    df.to_csv('PeshVsQuetta.csv', encoding='utf-8', index=False)

This assumes your JSON file starts with a BOM (the `utf-8-sig` codec strips it if present and otherwise behaves like plain `utf-8`). For the data you gave above, this produces the following CSV file:

    date_time,tweet_content,tweet_id,user_id
    2018-03-20 18:38:35,RT @PTISPOfficial: \xd9\xbe\xd8\xa7\xda\xa9\xd8\xb3\xd8\xaa\xd8\xa7\xd9\x86 \xd8\xaa\xd8\xad\xd8\xb1\xdb\x8c\xda\xa9 \xd8\xa7\xd9\x86\xd8\xb5\xd8\xa7\xd9\x81 \xda\xa9\xdb\x92 \xd9\x88\xd8\xa7\xd8\xa6\xd8\xb3 \xda\x86\xdb\x8c\xd8\xa6\xd8\xb1\xd9\x85\xdb\x8c\xd9\x86 \xd8\xb4\xd8\xa7\xdb\x81 \xd9\x85\xd8\xad\xd9\x85\xd9\x88\xd8\xaf \xd9\x82\xd8\xb1\xdb\x8c\xd8\xb4\xdb\x8c \xd8\xa8\xd8\xba\xdb\x8c\xd8\xb1 \xda\xa9\xd8\xb3\xdb\x8c \xd9\xbe\xd8\xb1\xd9\x88\xd9\xb9\xd9\x88\xda\xa9\xd9\x88\xd9\x84 \xda\xa9\xdb\x92 \xd9\xbe\xd8\xa7\xda\xa9\xd8\xb3\xd8\xaa\xd8\xa7\xd9\x86 \xd8\xb3\xd9\xbe\xd8\xb1 \xd9\x84\xdb\x8c\xda\xaf \xda\xa9\xd8\xa7 \xd9\x85\xdb\x8c\xda\x86 \xd8\xaf\xdb\x8c\xda\xa9\xda\xbe\xd9\x86\xdb\x92 \xda\xa9\xdb\x92 \xd9\x84\xd8\xa6\xdb\x92 \xd8\xa7\xd8\xb3\xd9\xb9\xdb\x8c\xda\x88\xdb\x8c\xd9\x85 \xd9\x85\xe2\x80\xa6,976166125502427136,938118866135343104
    2018-03-20 18:38:35,"At last, Pakistan Have Witnessed The Most Thrilling Match Of Cricket In Pakistan, The Home. 

    #PZvQG 
    #ABC",976166125535973378,959235642
    2018-03-20 18:38:35,"RT @thePSLt20: SIX! 19.4 Liam Dawson to Anwar Ali
    Watch ball by ball highlights at (link removed)

    #PZvQG #HBLPSL #PSL2018 @_crici\xe2\x80\xa6",976166126202839040,395163528
    2018-03-20 18:38:35,RT @JeremyMcLellan: Rumor has it Amir Liaquat isn\xe2\x80\x99t allowed to play in #PSL2018 because he keeps switching teams every week.,976166126483902466,3117825702
    2018-03-20 18:38:35,RT @daniel86cricket: Peshawar beat Quetta by 1 run in one of the best T20 thrillers. 
    PSL played in front of full house in Lahore Pakistan i\xe2\x80\xa6,976166126559354880,3310967346
    2018-03-20 18:38:35,"I wanted a super over
    #PZvQG",976166126836178944,701494826194354179
    2018-03-20 18:38:35,RT @hinaparvezbutt: Congratulations Peshawar Zalmi over great win but Quetta Gladiators won our hearts \xe2\x99\xa5\xef\xb8\x8f #PZvQG,976166126685171713,347132028
    2018-03-20 18:38:35,"RT @walterMiitty: It's harder than I thought to tell the truth
    It's gonna leave you in pieces
    All alone with your demons
    And I know that we\xe2\x80\xa6",976166126924201986,3461853618

Note: some fields contain newline characters, so the output may look a little odd. An application reading this file will still handle it correctly, as long as you tell it on import that the encoding is UTF-8.
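
As a rough illustration of the two points above (an explicit encoding on read, and newlines inside quoted CSV fields), the sketch below uses only the standard-library `json` and `csv` modules instead of Pandas. It is not the method from the answer. The two sample records, the `demo.json` and `demo.csv` file names, and the temporary directory are assumptions made up for the example. The `'charmap'` error in the question most likely arises because `open()` without `encoding=` falls back to the platform default (a legacy code page on many Windows setups), so passing the encoding explicitly is the relevant change.

```python
import csv
import json
import os
import tempfile

# Illustrative records modelled on the excerpt in the question (not the real file).
records = [
    {"user_id": 701494826194354179, "date_time": "03/20/2018 18:38:35",
     "tweet_content": "I wanted a super over\n#PZvQG",
     "tweet_id": 976166126836178944},
    {"user_id": 347132028, "date_time": "03/20/2018 18:38:35",
     "tweet_content": "Quetta Gladiators won our hearts \u2665\ufe0f #PZvQG",
     "tweet_id": 976166126685171713},
]
fieldnames = ["date_time", "tweet_content", "tweet_id", "user_id"]

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "demo.json")  # hypothetical file name
    csv_path = os.path.join(tmp_dir, "demo.csv")    # hypothetical file name

    # Create a UTF-8 input file; ensure_ascii=False stores the raw non-ASCII characters.
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)

    # Pass the encoding explicitly. utf-8-sig also accepts files without a BOM.
    with open(json_path, encoding="utf-8-sig") as f:
        tweets = json.load(f)

    # newline="" lets the csv module manage line endings and quoted newlines.
    # Writing with utf-8-sig adds a BOM, which some spreadsheet applications
    # use to recognise the file as UTF-8.
    with open(csv_path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(tweets)

    # Read the CSV back: the multi-line tweet comes back as a single field.
    with open(csv_path, encoding="utf-8-sig", newline="") as f:
        rows = list(csv.DictReader(f))

print(len(rows))
print(rows[0]["tweet_content"] == records[0]["tweet_content"])
```

Opening the CSV with `newline=""`, as the `csv` module documentation recommends, is what keeps the embedded line breaks intact on both write and read. Note that `csv.DictReader` returns every value as a string, so the IDs come back as text rather than integers.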