Regex: adding a new column to a pandas DataFrame based on another column

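I am trying to add a new column to a pandas DataFrame.
The new column, df['Year_Prod'], is derived from another column, df['title'], from which I extract the year.

Sample data: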

country    designation     title
Italy      Vulkà Bianco    Nicosia 2013 Vulkà Bianco (Etna)         
Portugal   Avidagos        Quinta dos Avidagos 2011 Avidagos Red (Douro)

Code:

import re

import pandas as pd

df=pd.read_csv(r'test.csv',index_col=0)

df['Year_Prod']=re.findall('\\d+',df['title'])

print(df.head(10))

I get the following error:

  File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 3119, in __setitem__
    self._set_item(key, value)

  File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 3194, in _set_item
    value = self._sanitize_column(key,line 3391,in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)

  File "C:\Python37\lib\site-packages\pandas\core\series.py", line 4001, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')

**ValueError: Length of values does not match length of index**

Please let me know what you think. Thanks.

Solution

You can use pandas str.extract:

# expand=False returns a Series instead of a one-column DataFrame
df['Year_Prod'] = df['title'].str.extract(r'(\d{4})', expand=False)

    country     designation     title                                          Year_Prod
0   Italy       Vulkà Bianco    Nicosia 2013 Vulkà Bianco (Etna)                2013
1   Portugal    Avidagos        Quinta dos Avidagos 2011 Avidagos Red (Douro)   2011
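
To try this without test.csv, here is a minimal sketch that rebuilds the two sample rows from the question in memory. The in-memory DataFrame and the final conversion to a nullable integer dtype are assumptions added for illustration; the original answer leaves the year as a string.

import pandas as pd

# Sample rows copied from the question; the real data comes from test.csv.
df = pd.DataFrame({
    "country": ["Italy", "Portugal"],
    "designation": ["Vulkà Bianco", "Avidagos"],
    "title": [
        "Nicosia 2013 Vulkà Bianco (Etna)",
        "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
    ],
})

# Capture the first run of four digits in each title.
# expand=False yields a Series, which assigns cleanly to one column.
df["Year_Prod"] = df["title"].str.extract(r"(\d{4})", expand=False)

# Optional step (an assumption, not part of the original answer):
# store the year as a nullable integer so missing years stay missing.
df["Year_Prod"] = pd.to_numeric(df["Year_Prod"]).astype("Int64")

print(df[["title", "Year_Prod"]])

Titles that contain no four-digit number get a missing value rather than raising an error, which is one practical reason to prefer str.extract here.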

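Edit: as @Paul H. suggested in the comments, your code does not work because re.findall expects a string, but you are passing a Series. It can be done with apply, where the value passed for each row is a string, but there is little point because str.extract is more efficient.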

# Raises IndexError for any title that contains no four-digit number
df.title.apply(lambda x: re.findall(r'\d{4}', x)[0])
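
If you do go the apply route, a slightly more defensive variant might look like the sketch below. The helper name first_year and the third, made-up title are hypothetical and only there to show what happens when a title has no year; re.search is used in place of re.findall(...)[0] so a missing match returns None instead of raising.

import re

import pandas as pd

YEAR_RE = re.compile(r"\d{4}")


def first_year(title):
    """Return the first run of four digits in title, or None if there is none."""
    match = YEAR_RE.search(title)
    return match.group(0) if match else None


titles = pd.Series([
    "Nicosia 2013 Vulkà Bianco (Etna)",
    "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
    "Hypothetical title without a year",  # made-up row, not from the question
])

# apply calls first_year once per row, so each call receives a plain string.
years = titles.apply(first_year)
print(years)

The first two rows give the strings '2013' and '2011', and the made-up row ends up as a missing value.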
