在多个模式之前查找数字序列的正则表达式，放入新列（Python、Pandas）

Question

在多个模式之前查找数字序列的正则表达式，放入新列（Python、Pandas）

这是我的示例数据：

import pandas as pd
import re
  
cars = pd.DataFrame({'Engine Information': {0: 'Honda 2.4L 4 cylinder 190 hp 162 ft-lbs',
          1: 'Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs',
          2: 'Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs',
          3: 'MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs',
          4: 'Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV',
          5: 'GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs'},
         'HP': {0: None, 1: None, 2: None, 3: None, 4: None, 5: None}})

Run Code Online (Sandbox Code Playgroud)

这是我想要的输出：

我创建了一个名为“HP”的新列，我想在其中从原始列（“发动机信息”）中提取马力数字

这是我尝试执行此操作的代码：

cars['HP'] = cars['Engine Information'].apply(lambda x: re.match(r'\\d+(?=\\shp|hp)', str(x)))

Run Code Online (Sandbox Code Playgroud)

这个想法是我想正则表达式匹配模式：'在'hp'或'hp'之前出现的数字序列。这是因为某些单元格在数字和“hp”之间没有“空格”，如我的示例所示。

我确定正则表达式是正确的，因为我已经在 R 中成功完成了类似的过程。但是，我尝试过诸如str.extract, re.findall, re.search, 之类的函数re.match。返回错误或“无”值（如示例中所示）。所以在这里我有点失落。

谢谢！

Answer 1

Wik*_*żew 4

您可以使用str.extract：

cars['HP'] = cars['Engine Information'].str.extract(r'(\d+)\s*hp\b', flags=re.I)

Run Code Online (Sandbox Code Playgroud)

细节

(\d+)\s*hp\b- 匹配并捕获第 1 组中的一个或多个数字，然后仅匹配 0 个或多个空格 ( \s*) 并hp（由于flags=re.I）作为整个单词（以不区分大小写的方式）（因为\b标记了单词边界）
str.extract仅当模式中存在捕获组时才返回捕获的值，因此hp和空格不是结果的一部分。

Python演示结果：

>>> cars
                               Engine Information   HP
0         Honda 2.4L 4 cylinder 190 hp 162 ft-lbs  190
1  Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs  420
2          Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs  390
3          MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs  118
4       Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV  360
5           GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs  352

Run Code Online (Sandbox Code Playgroud)

归档时间：	5 年，4 月前
查看次数：	300 次
最近记录：	5 年，4 月前