Remove symbols and repeated numbers

me.*_*mes 5 python regex string dataframe pandas

I want to remove all the symbols from my dataframe, leaving each value in one of two formats: 100-200 or 200

So if a salary range is given, there should be a hyphen between the two salaries; otherwise the value should be a clean single number.

I have the following data:

import pandas as pd
import re
df = {'salary': ['£26,768 - £30,136/annum Attractive benefits package',
                 '£26,000 - £28,000/annum plus bonus',
                 '£21,000/annum',
                 '£26,768 - £30,136/annum Attractive benefits package',
                 '£33/hour',
                 '£18,500 - £20,500/annum Inc Bonus - Study Support + Bens',
                 '£27,500 - £30,000/annum £27,500 to £30,000 + Study',
                 '£35,000 - £40,000/annum',
                 '£24,000 - £27,000/annum Study Support (ACCA / CIMA)',
                 '£19,000 - £24,000/annum Study Support',
                 '£30,000 - £35,000/annum',
                 '£44,000 - £66,000/annum + 15% Bonus + Excellent Benefits. L',
                 '£75 - £90/day £75-£90 Per Day']}
data = pd.DataFrame(df)

Here is my attempt at removing some of the symbols:

salary = []
for i in data.salary:
    # Strip one class of unwanted characters per step
    space = re.sub(" ", '', i)
    lower = re.sub("[a-z]", '', space)
    upper = re.sub("[A-Z]", '', lower)
    bracket = re.sub("/", '', upper)
    comma = re.sub(",", '', bracket)
    plus = re.sub(r"\+", '', comma)
    percentage = re.sub(r"\%", '', plus)
    dot = re.sub(r"\.", '', percentage)
    bracket1 = re.sub(r"\(", '', dot)
    bracket2 = re.sub(r"\)", '', bracket1)
    salary.append(bracket2)

This gives me:

'£26768-£30136',
 '£26000-£28000',
 '£21000',
 '£26768-£30136',
 '£33',
 '£18500-£20500-',
 '£27500-£30000£27500£30000',
 '£35000-£40000',
 '£24000-£27000',
 '£19000-£24000',
 '£30000-£35000',
 '£44000-£6600015',
 '£75-£90£75-£90'

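However, some numbers are repeated. Essentially, I want to remove anything after the first range of values, as well as any symbol other than the hyphen between the two numbers.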

Expected output:

'26768-30136',
 '26000-28000',
 '21000',
 '26768-30136',
 '33',
 '18500-20500',
 '27500-30000',
 '35000-40000',
 '24000-27000',
 '19000-24000',
 '30000-35000',
 '44000-66000',
 '75-90'

Chr*_*ris 5

Another approach, using pandas.Series.str.partition with replace:

data["salary"].str.partition("/")[0].str.replace(r"[^\d-]+", "", regex=True)

Output:

0     26768-30136
1     26000-28000
2           21000
3     26768-30136
4              33
5     18500-20500
6     27500-30000
7     35000-40000
8     24000-27000
9     19000-24000
10    30000-35000
11    44000-66000
12          75-90
Name: 0, dtype: object

Explanation:

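It assumes you are only interested in the part before the first /. It takes everything up to that /, then removes anything that is not a digit or a hyphen.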
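
To make the two steps easier to follow, here is a minimal plain-Python sketch of the same idea applied to one string at a time. The helper name `clean_salary` and the slash-free input are hypothetical, added only for illustration; the other sample strings come from the question.

```python
import re


def clean_salary(text: str) -> str:
    """Keep the part before the first "/" and strip everything but digits and hyphens."""
    # str.partition returns (head, separator, tail); if "/" is absent,
    # head is the whole string and the other two parts are empty.
    head, _sep, _tail = text.partition("/")
    # Remove every run of characters that is neither a digit nor a hyphen.
    return re.sub(r"[^\d-]+", "", head)


samples = [
    "£26,768 - £30,136/annum Attractive benefits package",
    "£21,000/annum",
    "£75 - £90/day £75-£90 Per Day",
]
cleaned = [clean_salary(s) for s in samples]
print(cleaned)  # ['26768-30136', '21000', '75-90']

# Hypothetical input without a "/": nothing is cut off, so the hyphen
# from the trailing text survives. This is the assumption noted above.
no_slash = clean_salary("£18,500 - £20,500 Inc Bonus - Study Support")
print(no_slash)  # 18500-20500-
```

On the DataFrame from the question, the same function could be applied with `data["salary"].map(clean_salary)`, though the vectorized `.str` chain above is the more compact option.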