ham*_*x0r 8 python replace pandas
I have two dataframes: one (A) with whitelisted hostnames in regex form (e.g. (.*)microsoft.com, (.*)go.microsoft.com, ...) and another (B) with the actual full hostnames of sites. I want to add a new column to the second dataframe containing the matching regex text from the whitelist (first) dataframe. However, it appears that pandas' .replace() method ignores the order of the items in its to_replace and value arguments.
My data looks like this:
In [1]: A
Out[1]:
wildcards \
42 (.*)activation.playready.microsoft.com
35 (.*)v10.vortex-win.data.microsoft.com
40 (.*)settings-win.data.microsoft.com
43 (.*)smartscreen.microsoft.com
39 (.*).playready.microsoft.com
38 (.*)go.microsoft.com
240 (.*)i.microsoft.com
238 (.*)microsoft.com
regex
42 re.compile('^(.*)activation.playready.microsof...
35 re.compile('^(.*)v10.vortex-win.data.microsoft...
40 re.compile('^(.*)settings-win.data.microsoft.c...
43 re.compile('^(.*)smartscreen.microsoft.com$')
39 re.compile('^(.*).playready.microsoft.com$')
38 re.compile('^(.*)go.microsoft.com$')
240 re.compile('^(.*)i.microsoft.com$')
238 re.compile('^(.*)microsoft.com$')
In [2]: B.head()
Out[2]:
server_hostname
146 mobile.pipe.aria.microsoft.com
205 settings-win.data.microsoft.com
341 nav.smartscreen.microsoft.com
406 v10.vortex-win.data.microsoft.com
667 www.microsoft.com
Notice that A has a column of compiled regexes in similar form to the wildcards column. I want to add a wildcard column to B like this:
B.loc[:,'wildcards'] = B['server_hostname'].replace(A['regex'].tolist(), A['wildcards'].tolist())
But the problem is that all of B's wildcards values become (.*)microsoft.com, regardless of the order of A's wildcards values. It appears that .replace() applies the to_replace regexes shortest first rather than in the order provided.
How can I provide a list of to_replace values so that each server_hostname in B ends up with the most detailed (most specific) matching wildcards value?
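One possible explanation, offered as an assumption rather than a statement about pandas internals, is that the replacements are applied one after another. The text written by a specific rule still ends in microsoft.com, so the broadest rule would match it again and overwrite it. The short check below uses only the re module and two values copied from A to show that the broad pattern does match a specific wildcard string:

```python
import re

# Pattern and wildcard label copied from the question's dataframe A
broad = re.compile('^(.*)microsoft.com$')
specific_label = '(.*)v10.vortex-win.data.microsoft.com'

# The label written by a specific rule still ends in "microsoft.com",
# so the broad pattern matches that label too.
matches_label = broad.match(specific_label) is not None
print(matches_label)

# Applying the broad rule to the label rewrites it to the broad label.
rewritten = broad.sub('(.*)microsoft.com', specific_label)
print(rewritten)
```

If that is what happens, the final value depends on which rules match the intermediate text, not only on the order of the lists.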
Here is a way to do this using a double list comprehension and the re.sub() function:
import re

import pandas as pd
A = pd.DataFrame({'wildcards' : ['(.*)activation.playready.microsoft.com',
'(.*)v10.vortex-win.data.microsoft.com',
'(.*)i.microsoft.com', '(.*)microsoft.com'],
'regex' : [re.compile('^(.*)activation.playready.microsoft.com$'),
re.compile('^(.*)v10.vortex-win.data.microsoft.com$'),
re.compile('^(.*)i.microsoft.com$'),
re.compile('^(.*)microsoft.com$')]})
B = pd.DataFrame({'server_hostname' : ['v10.vortex-win.data.microsoft.com',
'www.microsoft.com']})
# For each server_hostname we try each regex and keep the longest matching one
B['wildcards'] = [max([re.sub(to_replace, value, x) for to_replace, value
in A[['regex', 'wildcards']].values
if re.sub(to_replace, value, x)!=x], key=len)
for x in B['server_hostname']]
Output :
server_hostname wildcards
0 v10.vortex-win.data.microsoft.com (.*)v10.vortex-win.data.microsoft.com
1 www.microsoft.com (.*)microsoft.com
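Note that max() raises a ValueError on an empty list, so the comprehension above fails for any hostname that matches none of the regexes. A minimal sketch of the same "keep the longest match" idea that tolerates that case is shown below. The helper name most_specific_wildcard, the rules list, and the example.org hostname are hypothetical and introduced only for illustration.

```python
import re

def most_specific_wildcard(hostname, rules, default=None):
    """Return the longest wildcard whose regex matches hostname, else default."""
    matches = [wildcard for pattern, wildcard in rules if pattern.match(hostname)]
    # default= avoids the ValueError that max() raises on an empty list
    return max(matches, key=len, default=default)

# (compiled regex, wildcard text) pairs, copied from the example dataframe A
rules = [
    (re.compile('^(.*)v10.vortex-win.data.microsoft.com$'),
     '(.*)v10.vortex-win.data.microsoft.com'),
    (re.compile('^(.*)i.microsoft.com$'), '(.*)i.microsoft.com'),
    (re.compile('^(.*)microsoft.com$'), '(.*)microsoft.com'),
]

print(most_specific_wildcard('v10.vortex-win.data.microsoft.com', rules))
print(most_specific_wildcard('www.microsoft.com', rules))
print(most_specific_wildcard('example.org', rules))  # no rule matches
```

With the dataframes above, the pairs could be built with list(zip(A['regex'], A['wildcards'])) and the helper applied with B['server_hostname'].map(lambda h: most_specific_wildcard(h, rules)). This also tests each regex once per hostname instead of calling re.sub() twice.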