How to extract the year (or a datetime) from a text column in a pandas DataFrame

Question

Mat*_*zar 1 python regex datetime parsing pandas

Suppose I have a pandas DataFrame:

Id    Book                      
1     Harry Potter (1997)
2     Of Mice and Men (1937)
3     Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story


How can I extract the year from the column?

The output should be:

Id    Book Title               Year
1     Harry Potter             1997
2     Of Mice and Men          1937
3     Babe Ruth Story, The     1948


So far I have tried:

movies['year'] = movies['title'].str.extract('([0-9(0-9)]+)', expand=False).str.strip()


and

books['year'] = books['title'].str[-5:-1]


I have tried a few other things as well, but I haven't gotten any of them to work. Any suggestions?

Answer 1

Ste*_*ley 5

How about a simple regular expression:

import re

text = 'Harry Potter (1997)'
re.findall(r'\((\d{4})\)', text)  # raw string keeps the backslashes intact
# ['1997'] Note that findall returns a list of all the occurrences.

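If you only need one year per string rather than a list, re.search is an alternative. The following is a minimal sketch, not part of the original answer; the helper name first_year is made up for illustration, and it assumes the year always appears as four digits in parentheses.

```python
import re

# Compile once so the pattern can be reused across many strings.
YEAR_PATTERN = re.compile(r'\((\d{4})\)')

def first_year(text):
    """Hypothetical helper: return the first '(dddd)' year as an int, or None."""
    match = YEAR_PATTERN.search(text)
    return int(match.group(1)) if match else None

print(first_year('Harry Potter (1997)'))  # 1997
print(first_year('No year here'))         # None
```

Returning None for strings without a match avoids the IndexError you would get from indexing an empty findall result.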

With a DataFrame, it can be done like this:

import pandas as pd

text = 'Harry Potter (1997)'
df = pd.DataFrame({'Book': text}, index=[1])
pattern = r'\((\d{4})\)'  # raw string: one capture group holding the four digits
df['year'] = df.Book.str.extract(pattern, expand=False)  # expand=False returns a Series

df
#                  Book   year
# 1  Harry Potter (1997)  1997

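Note that str.extract gives you strings. Since the question also mentions datetimes, here is a rough sketch of converting the result afterwards. The column names year and year_dt and the extra 'No year here' row are my own additions for illustration, not from the question.

```python
import pandas as pd

df = pd.DataFrame({'Book': ['Harry Potter (1997)',
                            'Of Mice and Men (1937)',
                            'No year here']})

# str.extract yields strings, with NaN where the pattern does not match.
year_text = df['Book'].str.extract(r'\((\d{4})\)', expand=False)

# Nullable integer column: rows without a year become <NA>.
df['year'] = pd.to_numeric(year_text).astype('Int64')

# Datetime column: each year maps to 1 January of that year, missing rows become NaT.
df['year_dt'] = pd.to_datetime(year_text, format='%Y')

print(df)
```

The nullable Int64 dtype is one way to keep whole-number years when some rows have no match; a plain astype(int) would raise on the missing values.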

Finally, if you really want to separate the title from the year (borrowing the DataFrame reconstruction from Philip's answer):

df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])

sep = df['Book'].str.extract(r'(.*)\((\d{4})\)', expand=False)  # two groups, so a DataFrame comes back

sep # A new df, separated into title and year
#                       0      1                           
# 0          Harry Potter   1997 
# 1       Of Mice and Men   1937
# 2  Babe Ruth Story, The   1948

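A possible variation on the split above, in case you want labelled columns and no trailing whitespace in the titles: named groups become the column names, and a lazy title group followed by \s* drops the space before the parenthesis. This is only a sketch; the group names title and year are my choice, and the astype(int) step assumes every row contains a year.

```python
import pandas as pd

df = pd.DataFrame({'Book': [
    'Harry Potter (1997)',
    'Of Mice and Men (1937)',
    'Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story',
]})

# Named groups become the column names of the DataFrame that str.extract returns.
# The lazy (.*?) plus \s* keeps the whitespace before '(' out of the title.
pattern = r'(?P<title>.*?)\s*\((?P<year>\d{4})\)'
result = df['Book'].str.extract(pattern)

# Safe here because every row matched; otherwise NaN would make this raise.
result['year'] = result['year'].astype(int)

print(result)
```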
