Tags: python, regex, string, dataframe, pandas
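I want to strip all the symbols from my dataframe so that each value ends up in one of two formats: 100-200 or 200.
So if a salary range is given, there should be a hyphen between the two salaries; otherwise the value should be a clean single number.

I have the following data: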

    import pandas as pd
    import re

    df = {'salary': ['£26,768 - £30,136/annum Attractive benefits package',
                     '£26,000 - £28,000/annum plus bonus',
                     '£21,000/annum',
                     '£26,768 - £30,136/annum Attractive benefits package',
                     '£33/hour',
                     '£18,500 - £20,500/annum Inc Bonus - Study Support + Bens',
                     '£27,500 - £30,000/annum £27,500 to £30,000 + Study',
                     '£35,000 - £40,000/annum',
                     '£24,000 - £27,000/annum Study Support (ACCA / CIMA)',
                     '£19,000 - £24,000/annum Study Support',
                     '£30,000 - £35,000/annum',
                     '£44,000 - £66,000/annum + 15% Bonus + Excellent Benefits. L',
                     '£75 - £90/day £75-£90 Per Day']}
    data = pd.DataFrame(df)

Here is my attempt at removing some of the symbols:

    salary = []
    for i in data.salary:
        space = re.sub(" ", '', i)
        lower = re.sub("[a-z]", '', space)
        upper = re.sub("[A-Z]", '', lower)
        bracket = re.sub("/", '', upper)
        comma = re.sub(",", '', bracket)
        plus = re.sub("\\+", '', comma)
        percentage = re.sub("\\%", '', plus)
        dot = re.sub("\\.", '', percentage)
        bracket1 = re.sub("\\(", '', dot)
        bracket2 = re.sub("\\)", '', bracket1)
        salary.append(bracket2)

This gives me:

    '£26768-£30136',
    '£26000-£28000',
    '£21000',
    '£26768-£30136',
    '£33',
    '£18500-£20500-',
    '£27500-£30000£27500£30000',
    '£35000-£40000',
    '£24000-£27000',
    '£19000-£24000',
    '£30000-£35000',
    '£44000-£6600015',
    '£75-£90£75-£90'

However, some of the numbers are duplicated. Essentially, I want to remove everything after the first range of values, as well as every symbol other than the hyphen between the two numbers.

Expected output:

    '26768-30136',
    '26000-28000',
    '21000',
    '26768-30136',
    '33',
    '18500-20500',
    '27500-30000',
    '35000-40000',
    '24000-27000',
    '19000-24000',
    '30000-35000',
    '44000-66000',
    '75-90'

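The two target formats described in the question (a single number, or two numbers joined by a hyphen) can be written as one pattern, which is handy for checking whether a cleaned value is acceptable. This is only a sketch: the names `TARGET` and `is_clean` are hypothetical and do not appear in the original post.

```python
import re

# One or more digits, optionally followed by a hyphen and more digits.
TARGET = re.compile(r"\d+(?:-\d+)?")


def is_clean(value):
    """Return True if value is a single number or two numbers joined by a hyphen."""
    return TARGET.fullmatch(value) is not None


print(is_clean("26768-30136"))     # True
print(is_clean("£18500-£20500-"))  # False: pound signs and a trailing hyphen remain
```
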
Another approach, using pandas.Series.str.partition with replace:

    data["salary"].str.partition("/")[0].str.replace(r"[^\d-]+", "", regex=True)

Output:

    0     26768-30136
    1     26000-28000
    2           21000
    3     26768-30136
    4              33
    5     18500-20500
    6     27500-30000
    7     35000-40000
    8     24000-27000
    9     19000-24000
    10    30000-35000
    11    44000-66000
    12          75-90
    Name: 0, dtype: object

Explanation:
This assumes you are only interested in the part before the first /. It takes everything up to that /, then removes every character that is not a digit or a hyphen.
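The same two steps can be tried on plain strings without pandas. The following is a minimal sketch using only the standard library; `clean_salary` is a hypothetical helper name introduced for illustration, and like the answer it assumes the salary figures always come before the first `/`.

```python
import re


def clean_salary(text):
    """Keep the part before the first '/', then drop everything except digits and hyphens."""
    head = text.partition("/")[0]        # e.g. '£26,768 - £30,136'
    return re.sub(r"[^\d-]+", "", head)  # e.g. '26768-30136'


samples = [
    "£26,768 - £30,136/annum Attractive benefits package",
    "£21,000/annum",
    "£75 - £90/day £75-£90 Per Day",
]
cleaned = [clean_salary(s) for s in samples]
print(cleaned)  # ['26768-30136', '21000', '75-90']
```

On the DataFrame from the question this could be applied with `data["salary"].map(clean_salary)`. Note that a row containing no `/` is cleaned in full, so any digits or hyphens in its trailing text would survive.
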