How to use pandas .replace() with a list of regexes while honoring list order?

Question

I have two DataFrames: one (A) with whitelisted hostnames in regex form (e.g. (.*)microsoft.com, (.*)go.microsoft.com, ...) and another (B) with the actual full hostnames of sites. I want to add a new column to B containing the regex text of the matching whitelist entry from A. However, it appears that pandas' .replace() method ignores the order of the items in its to_replace and value arguments.

My data looks like this:

In [1]: A
Out[1]: 
                                  wildcards  \
42   (.*)activation.playready.microsoft.com   
35    (.*)v10.vortex-win.data.microsoft.com   
40      (.*)settings-win.data.microsoft.com   
43            (.*)smartscreen.microsoft.com   
39             (.*).playready.microsoft.com   
38                     (.*)go.microsoft.com   
240                     (.*)i.microsoft.com   
238                       (.*)microsoft.com   
                                                 regex  
42   re.compile('^(.*)activation.playready.microsof...  
35   re.compile('^(.*)v10.vortex-win.data.microsoft...  
40   re.compile('^(.*)settings-win.data.microsoft.c...  
43       re.compile('^(.*)smartscreen.microsoft.com$')  
39        re.compile('^(.*).playready.microsoft.com$')  
38                re.compile('^(.*)go.microsoft.com$')  
240                re.compile('^(.*)i.microsoft.com$')  
238                  re.compile('^(.*)microsoft.com$')  


In [2]: B.head()
Out[2]: 
                       server_hostname
146     mobile.pipe.aria.microsoft.com
205    settings-win.data.microsoft.com
341      nav.smartscreen.microsoft.com
406  v10.vortex-win.data.microsoft.com
667                  www.microsoft.com


Notice that A has a column of compiled regexes in similar form to the wildcards column. I want to add a wildcard column to B like this:

B.loc[:,'wildcards'] = B['server_hostname'].replace(A['regex'].tolist(), A['wildcards'].tolist())

But the problem is that all of B's wildcards values become (.*)microsoft.com. This happens no matter the order of A's wildcards values. It appears that .replace() applies the to_replace regexes shortest first rather than in the order provided.
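
One plausible explanation for this behaviour, offered as an assumption rather than something confirmed against pandas internals, is that the substitutions are applied one after another. In that case the broad pattern ^(.*)microsoft.com$ also matches a string that a more specific substitution has already produced, so the broad wildcard ends up in the column whichever order is used. The sketch below reproduces that effect with plain re only; apply_all is a hypothetical helper, not a pandas function.

```python
import re

# (pattern, replacement) pairs taken from the question, most specific first
rules = [
    (re.compile(r'^(.*)v10.vortex-win.data.microsoft.com$'),
     '(.*)v10.vortex-win.data.microsoft.com'),
    (re.compile(r'^(.*)microsoft.com$'),
     '(.*)microsoft.com'),
]


def apply_all(hostname, rules):
    """Apply every substitution in turn, feeding each result to the next one."""
    for pattern, replacement in rules:
        hostname = pattern.sub(replacement, hostname)
    return hostname


host = 'v10.vortex-win.data.microsoft.com'

# Specific first: the specific wildcard is written, then overwritten by the broad one.
specific_first = apply_all(host, rules)

# Broad first: the broad wildcard is written, and the specific pattern no longer matches.
general_first = apply_all(host, rules[::-1])

print(specific_first)
print(general_first)
```

Both orders end with the broad wildcard, which matches the symptom described above.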

How can I provide a list of to_replace values so that each server_hostname in B ends up with the most detailed (most specific) matching wildcards value?

Answer 1

vle*_*tre 0

Here is a way to do this using a double list comprehension and the re.sub() function:

import re

import pandas as pd

A = pd.DataFrame({'wildcards' : ['(.*)activation.playready.microsoft.com',
                                 '(.*)v10.vortex-win.data.microsoft.com',
                                 '(.*)i.microsoft.com', '(.*)microsoft.com'],
                  'regex' : [re.compile('^(.*)activation.playready.microsoft.com$'),
                             re.compile('^(.*)v10.vortex-win.data.microsoft.com$'), 
                             re.compile('^(.*)i.microsoft.com$'), 
                             re.compile('^(.*)microsoft.com$')]})

B = pd.DataFrame({'server_hostname' : ['v10.vortex-win.data.microsoft.com',
                                       'www.microsoft.com']})
# For each server_hostname, try every regex and keep the longest replacement
# among those that matched (default=None covers hostnames that match nothing)
B['wildcards'] = [max([re.sub(to_replace, value, x) for to_replace, value
                       in A[['regex', 'wildcards']].values
                       if re.sub(to_replace, value, x)!=x], key=len, default=None)
                  for x in B['server_hostname']]


Output:
                     server_hostname                              wildcards
0  v10.vortex-win.data.microsoft.com  (.*)v10.vortex-win.data.microsoft.com
1                  www.microsoft.com                      (.*)microsoft.com
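
If the goal is to honour the order of the list rather than to pick the longest result, a first-match lookup is one possible alternative. The sketch below is a minimal illustration using plain re; first_match and its default parameter are hypothetical names, and it assumes the wildcards are already sorted from most specific to most general.

```python
import re

# Wildcards ordered from most specific to most general (assumed pre-sorted)
wildcards = [
    '(.*)v10.vortex-win.data.microsoft.com',
    '(.*)i.microsoft.com',
    '(.*)microsoft.com',
]

# Pair each compiled, anchored regex with its wildcard text
rules = [(re.compile('^' + w + '$'), w) for w in wildcards]


def first_match(hostname, rules, default=None):
    """Return the wildcard of the first pattern that matches, in list order."""
    for pattern, wildcard in rules:
        if pattern.match(hostname):
            return wildcard
    return default


hosts = ['v10.vortex-win.data.microsoft.com', 'www.microsoft.com', 'example.org']
result = {h: first_match(h, rules) for h in hosts}
print(result)
```

With the DataFrames from this answer, the rules could be built with list(zip(A['regex'], A['wildcards'])) and the helper applied with B['server_hostname'].map(lambda h: first_match(h, rules)). Because the lookup stops at the first hit, an earlier replacement is never overwritten by a broader pattern.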

