给定一个机场在特定时间的交易数据:
> dt t_id airport thisTime 1: 1 a 5.1 2: 3 a 5.1 3: 2 a 6.2
(注意t_ids 1和3有相同的机场和时间)
以及从机场出发的航班查询表:
> dt_lookup f_id airport thisTime 1: 1 a 6 2: 2 a 6 3: 1 b 7 4: 1 c 8 5: 2 d 7 6: 1 d 9 7: 2 e 8 > tables() NAME NROW NCOL MB COLS KEY [1,] dt 3 3 1 t_id,airport,thisTime airport,thisTime [2,] dt_lookup 7 3 1 f_id,thisTime
我想将所有交易与从该机场出发的所有下一个航班进行匹配,以便:
t_id airport thisTime f_id 1 a 6 1 1 a 6 2 3 a 6 1 3 a 6 2
所以我以为这样会工作:
> dt[dt_lookup,nomatch=0,roll=Inf] t_id airport thisTime f_id 1: 3 a 6 1 2: 3 a 6 2
但是它没有返回事务t_id == 1.
Usually,there should be no duplicates in x’s key,…
但是,我的’x键'(即机场& thisTime)中有重复的,并且看不到/理解t_id = 1是否从输出中移除.
任何人都可以看出,为什么t_id = 1没有返回,当我有重复的时候怎么能让加入工作?
数据
library(data.table) dt <- data.table(t_id = seq(1:3),airport = c("a","a","a"),thisTime = c(5.1,6.2,5.1),key=c( "airport","thisTime")) dt_lookup <- data.table(f_id = c(rep(1,4),rep(2,3)),"b","c","d","e"),thisTime = c(6,7,8,9,6,8),key=c("airport","thisTime"))
解决方法
Applies to the last join column,generally a date but can be any
ordered variable,irregular and including gaps. If roll=TRUE and i’s
row matches to all but the last x join column,and its value in the
last i join column falls in a gap (including after the last
observation in x for that group),then the prevailing value in x is
rolled forward. This operation is particularly fast using a modified
binary search. The operation is also known as last observation carried
forward (LOCF).
我们考虑一些更大的数据集:
> DT t_id airport thisTime 1: 1 a 5.1 2: 4 a 5.1 3: 3 a 5.1 4: 2 d 6.2 5: 5 d 6.2 > DT_LU f_id airport thisTime 1: 1 a 6 2: 2 a 6 3: 2 a 8 4: 1 b 7 5: 1 c 8 6: 2 d 7 7: 1 d 9
当您按照您的问题进行滚动加盟:
DT[DT_LU,roll=Inf]
你得到:
t_id airport thisTime f_id 1: 3 a 6 1 2: 3 a 6 2 3: 3 a 8 2 4: 5 d 7 2 5: 5 d 9 1
如您所见,从组合键a,5.1和d,6.2中,最后一行用于连接的数据.因为您使用Inf作为滚动值,所有将来的值都将并入到生成的数据表中.使用时:
DT[DT_LU,roll=1]
您将看到仅包含将来的第一个值:
t_id airport thisTime f_id 1: 3 a 6 1 2: 3 a 6 2 3: 5 d 7 2
如果你想要f_id为机场& thisTime其中DT $thisTime低于DT_LU $thisTime,您可以通过使用ceiling函数创建一个新变量(或替换现有的thisTime)来实现.一个例子,我创建一个新的变量thisTime2,然后用DT_LU做正常的连接:
DT[,thisTime2 := ceiling(thisTime)] setkey(DT,thisTime2)[DT_LU,nomatch=0]
这使:
t_id airport thisTime thisTime2 f_id 1: 1 a 5.1 6 1 2: 4 a 5.1 6 1 3: 3 a 5.1 6 1 4: 1 a 5.1 6 2 5: 4 a 5.1 6 2 6: 3 a 5.1 6 2 7: 2 d 6.2 7 2 8: 5 d 6.2 7 2
适用于您提供的数据:
> dt[,thisTime2 := ceiling(thisTime)] > setkey(dt,thisTime2)[dt_lookup,nomatch=0] t_id airport thisTime thisTime2 f_id 1: 1 a 5.1 6 1 2: 3 a 5.1 6 1 3: 1 a 5.1 6 2 4: 3 a 5.1 6 2
当您想要包含未来的值而不是仅包含第一个值时,您需要一种稍微不同的方法,您将需要使用i.col功能(尚未记录):
1:首先设置机场专栏的关键:
setkey(DT,airport) setkey(DT_LU,airport)
2:使用j中的i.col功能(它还没有记录下来)得到你想要的内容如下:
DT1 <- DT_LU[DT,.(tid = i.t_id,tTime = i.thisTime,fTime = thisTime[i.thisTime < thisTime],fid = f_id[i.thisTime < thisTime]),by=.EACHI]
这给你:
> DT1 airport tid tTime fTime fid 1: a 1 5.1 6 1 2: a 1 5.1 6 2 3: a 1 5.1 8 2 4: a 4 5.1 6 1 5: a 4 5.1 6 2 6: a 4 5.1 8 2 7: a 3 5.1 6 1 8: a 3 5.1 6 2 9: a 3 5.1 8 2 10: d 2 6.2 7 2 11: d 2 6.2 9 1 12: d 5 6.2 7 2 13: d 5 6.2 9 1
一些解释:如果你加入了使用相同列名的两个数据表,你可以在i中引用列的名字之间的i可以引用datatable中的列.现在可以将DT中的thisTime与DT_LU的thisTime进行比较.使用by = .EACHI,您可以确保使用条件成立的所有组合都包含在生成的数据表中.
或者,您可以实现相同的:
DT2 <- DT_LU[DT,.(airport=i.airport,tid=i.t_id,tTime=i.thisTime,fTime=thisTime[i.thisTime < thisTime],fid=f_id[i.thisTime < thisTime]),allow.cartesian=TRUE]
其结果相同:
> identical(DT1,DT2) [1] TRUE
当您只想在某个边界内包含未来的值时,可以使用:
DT1 <- DT_LU[DT,{ idx = i.thisTime < thisTime & thisTime - i.thisTime < 2 .(tid = i.t_id,fTime = thisTime[idx],fid = f_id[idx]) },by=.EACHI]
这使:
> DT1 airport tid tTime fTime fid 1: a 1 5.1 6 1 2: a 1 5.1 6 2 3: a 4 5.1 6 1 4: a 4 5.1 6 2 5: a 3 5.1 6 1 6: a 3 5.1 6 2 7: d 2 6.2 7 2 8: d 5 6.2 7 2
当您与之前的结果进行比较时,您会看到现在已经删除了行3,10和12.
数据:
DT <- data.table(t_id = c(1,4,2,3,5),"d"),5.1,6.2),"thisTime")) DT_LU <- data.table(f_id = c(rep(1,"thisTime"))