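I ran into this bug while parsing a few dates through the parse_dates argument of pandas.read_csv(). In the snippet below I am trying to parse dates in dd/mm/yy format, and the conversion comes out wrong: in some cases the day field is treated as the month and vice versa.
Put simply, in some cases dd/mm/yy is converted to yyyy-dd-mm instead of yyyy-mm-dd.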
Case 1:
04/10/96 is parsed as 1996-04-10, which is wrong.
Case 2:
15/07/97 is parsed as 1997-07-15, which is correct.
Case 3:
10/12/97 is parsed as 1997-10-12, which is wrong.
Code sample

    import pandas as pd

    # Read the column as-is: it stays a plain string (object) column.
    df = pd.read_csv('date_time.csv')
    print('Data in csv:')
    print(df)
    print(df['start_date'].dtypes)
    print('----------------------------------------------')

    # Let read_csv infer the date format via parse_dates.
    df = pd.read_csv('date_time.csv', parse_dates=['start_date'])
    print('Data after parsing:')
    print(df)
    print(df['start_date'].dtypes)
Current output

    ----------------------
    Data in csv:
    ----------------------
      start_date
    0   04/10/96
    1   15/07/97
    2   10/12/97
    3   06/03/99
    4     //1994
    5   /02/1967
    object
    ----------------------
    Data after parsing:
    ----------------------
      start_date
    0 1996-04-10
    1 1997-07-15
    2 1997-10-12
    3 1999-06-03
    4 1994-01-01
    5 1967-02-01
    datetime64[ns]
Expected output

    ----------------------
    Data in csv:
    ----------------------
      start_date
    0   04/10/96
    1   15/07/97
    2   10/12/97
    3   06/03/99
    4     //1994
    5   /02/1967
    object
    ----------------------
    Data after parsing:
    ----------------------
      start_date
    0 1996-10-04
    1 1997-07-15
    2 1997-12-10
    3 1999-03-06
    4 1994-01-01
    5 1967-02-01
    datetime64[ns]
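More comments:
I could use date_parser or pandas.to_datetime() to specify the correct date format. But in my case I have some date fields like ['//1997', '/02/1967'] that I need converted to ['01/01/1997', '01/02/1967']. parse_dates converts these kinds of fields into the expected format without making me write extra lines of code.
Is there any solution for this?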
Bug link on GitHub: https://github.com/pydata/pandas/issues/13063
Best answer
In pandas 0.18.0 you can add the parameter dayfirst=True, and then it works:

    import pandas as pd
    import io

    temp = u"""start_date
    04/10/96
    15/07/97
    10/12/97
    06/03/99
    //1994
    /02/1967
    """
    # after testing, replace io.StringIO(temp) with the filename
    df = pd.read_csv(io.StringIO(temp), parse_dates=['start_date'], dayfirst=True)
    print(df)

      start_date
    0 1996-10-04
    1 1997-07-15
    2 1997-12-10
    3 1999-03-06
    4 1994-01-01
    5 1967-02-01
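The wrong results in the question are consistent with a parser that reads ambiguous dates month-first by default and only swaps the fields when the first number cannot be a month (as in 15/07/97). The sketch below illustrates that behaviour with the standard library only. guess_date is a hypothetical helper written for this explanation, not the code pandas or dateutil actually runs.

```python
from datetime import datetime

def guess_date(text, dayfirst=False):
    """Illustrative only: prefer one field order and fall back to the
    other when the preferred order gives an impossible date."""
    month_first, day_first = "%m/%d/%y", "%d/%m/%y"
    formats = [day_first, month_first] if dayfirst else [month_first, day_first]
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # this order does not fit, try the other one
    raise ValueError("unparseable date: %r" % text)

for raw in ["04/10/96", "15/07/97", "10/12/97"]:
    # left: month-first guess, right: day-first guess
    print(raw, guess_date(raw), guess_date(raw, dayfirst=True))
```

With the month-first default only 15/07/97 comes out as intended, because 15 cannot be a month. With dayfirst=True all three dates are read as dd/mm/yy.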
Another solution:
You can call to_datetime several times with a different format parameter and errors='coerce', then merge the results with combine_first:

    date1 = pd.to_datetime(df['start_date'], format='%d/%m/%y', errors='coerce')
    print(date1)

    0   1996-10-04
    1   1997-07-15
    2   1997-12-10
    3   1999-03-06
    4          NaT
    5          NaT
    Name: start_date, dtype: datetime64[ns]

    date2 = pd.to_datetime(df['start_date'], format='/%m/%Y', errors='coerce')
    print(date2)

    0          NaT
    1          NaT
    2          NaT
    3          NaT
    4          NaT
    5   1967-02-01
    Name: start_date, dtype: datetime64[ns]

    date3 = pd.to_datetime(df['start_date'], format='//%Y', errors='coerce')
    print(date3)

    0          NaT
    1          NaT
    2          NaT
    3          NaT
    4   1994-01-01
    5          NaT
    Name: start_date, dtype: datetime64[ns]

    print(date1.combine_first(date2).combine_first(date3))

    0   1996-10-04
    1   1997-07-15
    2   1997-12-10
    3   1999-03-06
    4   1994-01-01
    5   1967-02-01
    Name: start_date, dtype: datetime64[ns]
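If more formats turn up, the chain of combine_first calls can be written as a loop. This is a minimal sketch that assumes the same sample column as above. parse_mixed_dates is a hypothetical helper name, not a pandas function; it uses only pd.to_datetime and Series.combine_first.

```python
import pandas as pd

def parse_mixed_dates(series, formats):
    """Try each format in turn; formats listed earlier take priority."""
    result = None
    for fmt in formats:
        # Values that do not match fmt become NaT instead of raising.
        parsed = pd.to_datetime(series, format=fmt, errors="coerce")
        # Fill only the positions that are still NaT.
        result = parsed if result is None else result.combine_first(parsed)
    return result

raw = pd.Series(
    ["04/10/96", "15/07/97", "10/12/97", "06/03/99", "//1994", "/02/1967"],
    name="start_date",
)
dates = parse_mixed_dates(raw, ["%d/%m/%y", "/%m/%Y", "//%Y"])
print(dates)
```

The order of the formats matters only when a value could match more than one of them. In this sample each value matches exactly one format, so the result should be the same six dates as the chained version.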