library(data.table)
d = data.table(a = 1:5, value = 2:6, key = "a")

d[J(3), value]
#    a value
#    3     4

d[J(3)][, value]
# 4
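I would expect both to produce the same output (the second one), and I believe they should.
To make it clear that this is not an issue with the J syntax, the same expectation applies to the following expressions (equivalent to the ones above):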
t = data.table(a = 3, key = "a")
d[t, value]
d[t][, value]
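I would expect both of the above to return exactly the same output.
So let me rephrase the question: why (by design of data.table) is the key column printed automatically in d[t, value]?
Update (based on the answers and comments below): Thanks @Arun et al., I now understand the reasoning behind the design. The key is printed above because every data.table merge done through the X[Y] syntax carries a hidden by, and that by is on the key. It seems to have been designed this way for the following reason: since a by operation has to be performed during the merge anyway, one can take advantage of it rather than perform another by on the key of the merge.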
That said, I believe this is a syntax design flaw. The way I read the data.table syntax d[i, j, by = b] is: take d, apply the i operation (be that subsetting or merging or whatnot), and then do the j expression "by" b.
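The implicit by breaks this reading and introduces cases one has to think about specifically (am I merging on i, is the by just the key of the merge, and so on). I believe this should be the job of data.table: the commendable effort to make data.table faster in the one particular merge case where the by equals the key should be achieved in a different way (for example, by checking internally whether the by expression is actually the key of the merge).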
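The two readings contrasted above can be put side by side in a small sketch. It assumes data.table 1.9.4 or later, where (to my knowledge) grouping by each row of i is requested explicitly with by = .EACHI rather than applied implicitly; the key value is written as 3L here, instead of 3, only to keep the join column types identical.

```r
library(data.table)

d <- data.table(a = 1:5, value = 2:6, key = "a")
t <- data.table(a = 3L, key = "a")

# Step-by-step reading: perform the join first, then evaluate j on the result.
staged <- d[t][, value]

# Grouping by each row of i, requested explicitly.
# Assumption: by = .EACHI is available (data.table >= 1.9.4).
grouped <- d[t, list(value = value), by = .EACHI]
```

Here staged is a plain vector, while grouped is a data.table that also carries the key column a.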
Solution
1.12 What is the difference between X[Y] and merge(X,Y)?
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index. Y[X] is a join, looking up Y's rows using X (or X's key if it has one) as an index. merge(X,Y) does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ; whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same. BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards?
You may suggest merge(X[,ColsNeeded1],Y[,ColsNeeded2]), but that takes copies of the subsets of data, and it requires the programmer to work out which columns are needed. X[Y,j] in data.table does all that in one step for you. When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses. It will only subset those columns; the others are ignored. Memory is only created for the columns the j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y,sum(foo*bar)] quicker to program and quicker to run than a merge followed by a subset?
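A minimal sketch of the comparison the FAQ makes, using hypothetical tables: the names X, Y, id, foo, bar, unused1 and unused2 are invented for illustration. It assumes data.table 1.9.4 or later, where X[Y, j] without by evaluates j once over the whole join result; older versions would evaluate j per row of Y instead.

```r
library(data.table)

# Hypothetical tables: foo lives in X, bar lives in Y next to unused columns.
X <- data.table(id = 1:4, foo = c(10, 20, 30, 40), key = "id")
Y <- data.table(id = c(2L, 4L), bar = c(0.5, 2),
                unused1 = c("p", "q"), unused2 = c(TRUE, FALSE), key = "id")

# Join and compute in one step; j only refers to foo and bar.
via_join <- X[Y, sum(foo * bar)]

# Merge all columns first, then compute on the merged table.
via_merge <- merge(X, Y, by = "id")[, sum(foo * bar)]
```

Both routes give the same number on this toy data; the difference the FAQ points to is that the merge route materialises unused1 and unused2 before discarding them.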
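Old answer, which did not answer the OP's question (according to the OP's comment); retained here because I believe it does.
When you supply a value for j such as d[, 4] or d[, value] in a data.table, j is evaluated as an expression. From data.table FAQ 1.1, on accessing DT[, 5] (the very first FAQ):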
Because, by default, unlike a data.frame, the 2nd argument is an expression which is evaluated within the scope of DT. 5 evaluates to 5.
So the first thing to understand is that, in your case:
d[, value] # produces a "vector"
# [1] 2 3 4 5 6
This is no different when the query in i is a basic index:
d[3, value] # produces a vector of length 1
# [1] 4
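As a small sketch of the point about j being an expression: a bare column name in j evaluates to the column itself (a vector), while wrapping it in list() yields a data.table. The variable names as_vector, as_table and one_value are only illustrative.

```r
library(data.table)

d <- data.table(a = 1:5, value = 2:6, key = "a")

# A bare symbol in j evaluates to the column itself, i.e. a vector.
as_vector <- d[, value]

# Wrapping the column in list() returns a one-column data.table instead.
as_table <- d[, list(value)]

# Basic row indexing in i does not change that: j is still a plain vector.
one_value <- d[3, value]
```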
However, it is different when i is itself a data.table. From the data.table introduction (page 6):
d[J(3)] # is equivalent to d[data.table(a = 3)]
Here, you are performing a join. If you just do d[J(3)], you get all the columns corresponding to that join. If instead you do,
d[J(3),value] # which is equivalent to d[J(3),list(value)]
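(Since you say this answer does not answer your question, this is where I believe the answer to your "rephrased" question lies:) then you get only that column, but because you are performing a join, the key column is output as well (it is a join between two tables based on the key column).
Edit: Following your second edit, if your question is "why so?", then I would reluctantly (or rather, ignorantly) answer that Matthew Dowle designed it this way to differentiate between join-based subsetting and index-based subsetting in data.table.
Your second syntax is equivalent to: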
d[J(3)][, value] # is equivalent to:
dd <- d[J(3)]
dd[, value]
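Again, in dd[, value], j is evaluated as an expression, and therefore you get a vector.
To answer your third modified question: for the third time, it is because this is a JOIN between two data.tables based on the key column. If I join two data.tables, I expect a data.table.
From the data.table introduction, once again: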
Passing a data.table into a data.table subset is analogous to A[B] syntax in base R where A is a matrix and B is a 2-column matrix. In fact, the A[B] syntax in base R inspired the data.table package.
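For readers unfamiliar with the base R idiom the quote refers to, here is a brief sketch with made-up matrices A and B. Each row of B is read as a (row, column) pair, and A[B] returns the matching elements of A.

```r
# A is a 3 x 4 matrix filled column by column with 1..12.
A <- matrix(1:12, nrow = 3)

# B is a 2-column matrix: each row is a (row, column) position in A.
B <- cbind(c(1, 3), c(2, 4))

# Looks up A[1, 2] and A[3, 4] in a single call.
picked <- A[B]
```

The analogy is that B plays the role of the data.table passed in i: it names which entries of A to look up rather than giving a plain row index.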