Is there a way to get the number of lines in a file without importing it?
So far this is what I am doing:
    myfiles <- list.files(pattern="*.dat")
    myfilesContent <- lapply(myfiles, read.delim, header=F, quote="\"")
    test <- list()  # must exist before it is filled in the loop
    for (i in 1:length(myfiles)){
      test[[i]] <- length(myfilesContent[[i]]$V1)
    }
But since each file is quite large, this is too time consuming.
Solution
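If you:
>still want to avoid the system call that system2("wc" … would incur
>are on BSD/Linux or OS X (I have not tested the following on Windows)
>don't mind using the full file path
>are comfortable using the inline package
then the following should be about as fast as you can get (it is pretty much the 'line count' portion of wc in an inline R C function):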
    library(inline)

    # Note: f_iosize is a BSD/OS X field of struct statfs. On Linux this
    # probably needs adapting (assumption: f_bsize, declared in <sys/vfs.h>).
    wc.code <- "
    uintmax_t linect = 0;
    uintmax_t tlinect = 0;
    int fd, len;
    u_char *p;
    struct statfs fsb;
    static off_t buf_size = SMALL_BUF_SIZE;
    static u_char small_buf[SMALL_BUF_SIZE];
    static u_char *buf = small_buf;

    PROTECT(f = AS_CHARACTER(f));

    if ((fd = open(CHAR(STRING_ELT(f, 0)), O_RDONLY, 0)) >= 0) {

      /* ask the file system for its preferred I/O size */
      if (fstatfs(fd, &fsb)) {
        fsb.f_iosize = SMALL_BUF_SIZE;
      }

      /* resize the read buffer if that size differs from the current one */
      if (fsb.f_iosize != buf_size) {
        if (buf != small_buf) {
          free(buf);
        }
        if (fsb.f_iosize == SMALL_BUF_SIZE || !(buf = malloc(fsb.f_iosize))) {
          buf = small_buf;
          buf_size = SMALL_BUF_SIZE;
        } else {
          buf_size = fsb.f_iosize;
        }
      }

      /* read the file chunk by chunk and count newline bytes */
      while ((len = read(fd, buf, buf_size))) {
        if (len == -1) {
          (void)close(fd);
          break;
        }
        for (p = buf; len--; ++p)
          if (*p == '\\n')
            ++linect;
      }

      tlinect += linect;
      (void)close(fd);
    }

    SEXP result;
    PROTECT(result = NEW_INTEGER(1));
    INTEGER(result)[0] = tlinect;
    UNPROTECT(2);
    return(result);
    ";

    setCMethod("wc",
      signature(f="character"),
      wc.code,
      includes=c("#include <stdlib.h>",
                 "#include <stdio.h>",
                 "#include <sys/param.h>",
                 "#include <sys/mount.h>",
                 "#include <sys/stat.h>",
                 "#include <ctype.h>",
                 "#include <err.h>",
                 "#include <errno.h>",
                 "#include <fcntl.h>",
                 "#include <locale.h>",
                 "#include <stdint.h>",
                 "#include <string.h>",
                 "#include <unistd.h>",
                 "#include <wchar.h>",
                 "#include <wctype.h>",
                 "#define SMALL_BUF_SIZE (1024 * 8)"),
      language="C",
      convention=".Call")

    wc("FULLPATHTOFILE")
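It would be better as a package, since it has to be compiled the first time it is used. But it is here for reference if you really do need "speed". For a 189,955-line file I had lying around, I get (mean values from a bunch of runs):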
       user  system elapsed
      0.007   0.003   0.010
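For comparison, here is a rough pure-R sketch of the same idea (read the file in raw chunks and count newline bytes) for cases where compiling C is not an option. It is not the inline function above and I have not benchmarked it against it. The names count_newlines and chunk_size are made up for illustration, and the 8192-byte default simply mirrors SMALL_BUF_SIZE.

```r
# Hypothetical helper (not part of the answer's code): count newline bytes
# in a file by reading it in raw chunks, without parsing it into a data frame.
count_newlines <- function(path, chunk_size = 8192L) {
  con <- file(path, open = "rb")   # binary mode: no encoding or parsing work
  on.exit(close(con))
  total <- 0                       # double, so very large counts do not overflow
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break # end of file
    total <- total + sum(chunk == as.raw(10L))  # 0x0A is the newline byte
  }
  total
}

# Usage on a small temporary file
tmp <- tempfile(fileext = ".dat")
writeLines(c("a", "b", "c"), tmp)
n <- count_newlines(tmp)
print(n)
print(system.time(count_newlines(tmp)))
unlink(tmp)
```

Like wc -l, this counts newline characters, so a last line without a trailing newline is not counted. A larger chunk_size will probably help on big files, since each readBin call has some overhead in R.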