Suppose I have two character vectors, a and b:

    set.seed(123)
    categ <- c("Control", "Gr", "Or", "PMT", "P450")
    genes <- paste(categ, rep(1:40, each = length(categ)), sep = "_")
    a0 <- paste(genes, "_", rep(1:50, each = length(genes)), sep = "")
    b0 <- paste(a0, "1", sep = "")
    ite <- 200
    lg <- 2000
    b <- b0[1:lg]
    a <- (a0[1:lg])[sample(seq(lg), ite)]
I want to apply the grep function to find, for each value of a, its matches in b.
Of course, I can do this:

    sapply(a, grep, b)
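But I wonder whether there is something more efficient, because I have to run this many times on much larger vectors in a simulation (note that I don't want to use mclapply, since I already use it to run each iteration of my simulation):

    system.time(lapply(seq(100000), function(x) sapply(a, grep, b)))
    library(parallel)
    system.time(mclapply(seq(100000), function(x) sapply(a, grep, b), mc.cores = 8))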
Since you are not using regular expressions but want to find substrings within longer strings, you can use fixed = TRUE. It is much faster.
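As a minimal sketch of why this substitution is safe here, the toy vectors below (hypothetical, not taken from the question) show that literal matching returns the same indices as regex matching when the patterns contain no regex metacharacters, and that the two diverge once a metacharacter such as "." appears:

```r
# Toy data, made up for illustration only
patterns <- c("Gr_1", "P450_2")
targets  <- c("Gr_1_11", "Gr_10_11", "P450_2_11", "Or_1_11")

# Default: each pattern is interpreted as a regular expression
regex_hits <- lapply(patterns, grep, targets)

# fixed = TRUE: each pattern is matched as a literal substring
fixed_hits <- lapply(patterns, grep, targets, fixed = TRUE)

# No metacharacters in the patterns, so both give the same indices
same_result <- identical(regex_hits, fixed_hits)

# With a metacharacter the results differ: "." is a wildcard in a regex
regex_dot <- grep("a.c", c("abc", "a.c"))                # both elements match
fixed_dot <- grep("a.c", c("abc", "a.c"), fixed = TRUE)  # only the literal "a.c"
```

If your real patterns may contain characters such as ".", "[" or "(", check that literal matching is really what you want before switching.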
    library(microbenchmark)
    microbenchmark(lapply(a, grep, b),                # original
                   lapply(paste0("^", a), grep, b),   # @flodel
                   lapply(a, grep, b, fixed = TRUE))

    Unit: microseconds
                                 expr     min       lq   median       uq     max neval
                   lapply(a, grep, b) 112.633 114.2340 114.9390 116.0990 326.857   100
      lapply(paste0("^", a), grep, b) 119.949 121.7380 122.7425 123.9775 191.851   100
     lapply(a, grep, b, fixed = TRUE)  21.004  22.5885  23.8580  24.6110  33.608   100
Testing with longer vectors (1000 times the original length):

    ar <- rep(a, 1000)
    br <- rep(b, 1000)
    library(microbenchmark)
    microbenchmark(lapply(ar, grep, br),                # original
                   lapply(paste0("^", ar), grep, br),   # @flodel
                   lapply(ar, grep, br, fixed = TRUE))

    Unit: seconds
                                   expr       min        lq    median       uq       max neval
                   lapply(ar, grep, br) 32.288139 32.564223 32.726149 32.97529 37.818299   100
      lapply(paste0("^", ar), grep, br) 24.997339 25.343401 25.531138 25.71615 28.238802   100
     lapply(ar, grep, br, fixed = TRUE)  2.461934  2.494759  2.513931  2.55375  4.194093   100
(This took a long time to run...)
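If the full benchmark is too slow to repeat, a quicker check is possible with base R's system.time on a smaller replication factor. The sketch below rebuilds the question's data and assumes a factor of 5 instead of 1000 (the variable k is introduced here for illustration). It also confirms that the regex and fixed variants return the same indices on this data. Timings will vary by machine, so none are shown.

```r
# Rebuild the data from the question
set.seed(123)
categ <- c("Control", "Gr", "Or", "PMT", "P450")
genes <- paste(categ, rep(1:40, each = length(categ)), sep = "_")
a0 <- paste(genes, "_", rep(1:50, each = length(genes)), sep = "")
b0 <- paste(a0, "1", sep = "")
b <- b0[1:2000]
a <- (a0[1:2000])[sample(seq(2000), 200)]

# Assumed smaller replication factor (the benchmark above uses 1000)
k <- 5
ar <- rep(a, k)
br <- rep(b, k)

# Time one run of each variant and keep the results for comparison
t_regex <- system.time(res_regex <- lapply(ar, grep, br))
t_fixed <- system.time(res_fixed <- lapply(ar, grep, br, fixed = TRUE))

# The patterns contain no regex metacharacters, so the indices should agree
results_agree <- identical(res_regex, res_fixed)

# Elapsed seconds for each variant, side by side
elapsed <- c(regex = t_regex[["elapsed"]], fixed = t_fixed[["elapsed"]])
```

A single system.time call is much noisier than microbenchmark's repeated runs, so treat the comparison as a rough indication only.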