如何在R数据帧上执行Oracle ROW_NUMBER(),RANK()或DENSE_RANK()函数(见
http://www.orafaq.com/node/55)等分析函数? CRAN包“plyr”非常接近但仍然不同.
解决方法
data.table包,特别是从1.8.1版开始,提供了sql术语中分区的大部分功能. R中的rank(x,ties.method =“min”)类似于Oracle RANK(),并且有一种使用因子(如下所述)来模拟DENSE_RANK()函数的方法.一个模仿ROW_NUMBER的方法应该是显而易见的.
以下是一个示例:从R-Forge加载最新版本的data.table:
install.packages("data.table",repos= c("http://R-Forge.R-project.org",getOption("repos"))) library(data.table)
创建一些示例数据:
set.seed(10) DT<-data.table(ID=seq_len(4*3),group=rep(1:4,each=3),value=rnorm(4*3),info=c(sample(c("a","b"),4*2,replace=TRUE),sample(c("c","d"),4,replace=TRUE)),key="ID") > DT ID group value info 1: 1 1 0.01874617 a 2: 2 1 -0.18425254 b 3: 3 1 -1.37133055 b 4: 4 2 -0.59916772 a 5: 5 2 0.29454513 b 6: 6 2 0.38979430 a 7: 7 3 -1.20807618 b 8: 8 3 -0.36367602 a 9: 9 3 -1.62667268 c 10: 10 4 -0.25647839 d 11: 11 4 1.10177950 c 12: 12 4 0.75578151 d
通过减小组内的值排序每个ID(注意 – 在值前面,表示递减顺序):
> DT[,valRank:=rank(-value),by="group"] ID group value info valRank 1: 1 1 0.01874617 a 1 2: 2 1 -0.18425254 b 2 3: 3 1 -1.37133055 b 3 4: 4 2 -0.59916772 a 3 5: 5 2 0.29454513 b 2 6: 6 2 0.38979430 a 1 7: 7 3 -1.20807618 b 2 8: 8 3 -0.36367602 a 1 9: 9 3 -1.62667268 c 3 10: 10 4 -0.25647839 d 3 11: 11 4 1.10177950 c 1 12: 12 4 0.75578151 d 2
对于DENSE_RANK(),该值的值被排序,您可以将该值转换为一个因子,然后返回基础整数值.例如,根据组内的信息(比较infoRank与infoRankDense)对每个ID排序:
DT[,infoRank:=rank(info,ties.method="min"),by="group"] DT[,infoRankDense:=as.integer(factor(info)),by="group"] R> DT ID group value info valRank infoRank infoRankDense 1: 1 1 0.01874617 a 1 1 1 2: 2 1 -0.18425254 b 2 2 2 3: 3 1 -1.37133055 b 3 2 2 4: 4 2 -0.59916772 a 3 1 1 5: 5 2 0.29454513 b 2 3 2 6: 6 2 0.38979430 a 1 1 1 7: 7 3 -1.20807618 b 2 2 2 8: 8 3 -0.36367602 a 1 1 1 9: 9 3 -1.62667268 c 3 3 3 10: 10 4 -0.25647839 d 3 2 2 11: 11 4 1.10177950 c 1 1 1 12: 12 4 0.75578151 d 2 2 2
附:马修·道尔
LEAD和LAG
为了模仿LEAD和LAG,从here提供的答案开始.我将根据组内的ID顺序创建一个等级变量.这对于假冒数据并不是必需的,但是如果ID在组内不是顺序的,那么这会使生活更困难.所以这里有一些新的假数据与非顺序ID:
set.seed(10) DT<-data.table(ID=sample(seq_len(4*3)),key="ID") DT[,idRank:=rank(ID),by="group"] setkey(DT,group,idRank) > DT ID group value info idRank 1: 4 1 -0.36367602 b 1 2: 5 1 -1.62667268 b 2 3: 7 1 -1.20807618 b 3 4: 1 2 1.10177950 a 1 5: 2 2 0.75578151 a 2 6: 12 2 -0.25647839 b 3 7: 3 3 0.74139013 c 1 8: 6 3 0.98744470 b 2 9: 9 3 -0.23823356 a 3 10: 8 4 -0.19515038 c 1 11: 10 4 0.08934727 c 2 12: 11 4 -0.95494386 c 3
然后获取前一个记录的值,使用组和idRank变量,并从idRank中减去1,并使用multi =’last’参数.要从上面记录的两个条目获取值,减去2.
DT[,prev:=DT[J(group,idRank-1),value,mult='last']] DT[,prev2:=DT[J(group,idRank-2),mult='last']] ID group value info idRank prev prev2 1: 4 1 -0.36367602 b 1 NA NA 2: 5 1 -1.62667268 b 2 -0.36367602 NA 3: 7 1 -1.20807618 b 3 -1.62667268 -0.3636760 4: 1 2 1.10177950 a 1 NA NA 5: 2 2 0.75578151 a 2 1.10177950 NA 6: 12 2 -0.25647839 b 3 0.75578151 1.1017795 7: 3 3 0.74139013 c 1 NA NA 8: 6 3 0.98744470 b 2 0.74139013 NA 9: 9 3 -0.23823356 a 3 0.98744470 0.7413901 10: 8 4 -0.19515038 c 1 NA NA 11: 10 4 0.08934727 c 2 -0.19515038 NA 12: 11 4 -0.95494386 c 3 0.08934727 -0.1951504
对于LEAD,将适当的偏移添加到idRank变量并切换到multi =’first’:
DT[,nex:=DT[J(group,idRank+1),mult='first']] DT[,nex2:=DT[J(group,idRank+2),mult='first']] ID group value info idRank prev prev2 nex nex2 1: 4 1 -0.36367602 b 1 NA NA -1.62667268 -1.2080762 2: 5 1 -1.62667268 b 2 -0.36367602 NA -1.20807618 NA 3: 7 1 -1.20807618 b 3 -1.62667268 -0.3636760 NA NA 4: 1 2 1.10177950 a 1 NA NA 0.75578151 -0.2564784 5: 2 2 0.75578151 a 2 1.10177950 NA -0.25647839 NA 6: 12 2 -0.25647839 b 3 0.75578151 1.1017795 NA NA 7: 3 3 0.74139013 c 1 NA NA 0.98744470 -0.2382336 8: 6 3 0.98744470 b 2 0.74139013 NA -0.23823356 NA 9: 9 3 -0.23823356 a 3 0.98744470 0.7413901 NA NA 10: 8 4 -0.19515038 c 1 NA NA 0.08934727 -0.9549439 11: 10 4 0.08934727 c 2 -0.19515038 NA -0.95494386 NA 12: 11 4 -0.95494386 c 3 0.08934727 -0.1951504 NA NA