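I am trying to use R to parse files of this kind, extract some of the information, and put the data into a data frame. The files are formatted like this: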
last_run                        current_run                     seconds
------------------------------- ------------------------------- -----------
Jul 4 2016 7:17AM               Jul 4 2016 7:21AM               226

Engine Utilization (Tick %)   User Busy   System Busy    I/O Busy        Idle
------------------------- ------------ ------------ ---------- ----------
ThreadPool : syb_default_pool
Engine 0                         5.0 %        0.4 %     22.4 %     72.1 %
Engine 1                         3.9 %        0.5 %     22.8 %     72.8 %
Engine 2                         5.6 %        0.3 %     22.5 %     71.6 %
Engine 3                         5.1 %        0.4 %     22.7 %     71.8 %
------------------------- ------------ ------------ ---------- ----------
Pool Summary        Total      336.1 %       25.6 %   1834.6 %   5803.8 %
                  Average        4.2 %        0.3 %     22.9 %     72.5 %
------------------------- ------------ ------------ ---------- ----------
Server Summary      Total      336.1 %       25.6 %   1834.6 %   5803.8 %
                  Average        4.2 %        0.3 %     22.9 %     72.5 %

Transaction Profile
-------------------

Transaction Summary            per sec     per xact      count % of total
------------------------- ------------ ------------ ---------- ----------
Committed Xacts                  137.3          n/a      41198        n/a

Average Runnable Tasks           1 min        5 min     15 min % of total
------------------------- ------------ ------------ ---------- ----------
ThreadPool : syb_default_pool
Global Queue                       0.0          0.0        0.0      0.0 %
Engine 0                           0.0          0.1        0.1      0.6 %
Engine 1                           0.0          0.0        0.0      0.0 %
Engine 2                           0.2          0.1        0.1      2.6 %
------------------------- ------------ ------------ ----------
Pool Summary        Total          7.2          5.9        6.1
                  Average          0.1          0.1        0.1
------------------------- ------------ ------------ ----------
Server Summary      Total          7.2          5.9        6.1
                  Average          0.1          0.1        0.1

Device Activity Detail
----------------------

Device:
/dev/vx/rdsk/sybaserdatadg/datadev_125
datadev_125                    per sec     per xact      count % of total
------------------------- ------------ ------------ ---------- ----------
Total I/Os                         0.0          0.0          0        n/a
------------------------- ------------ ------------ ---------- ----------
Total I/Os                         0.0          0.0          0      0.0 %
-----------------------------------------------------------------------------

Device:
/dev/vx/rdsk/sybaserdatadg/datadev_126
datadev_126                    per sec     per xact      count % of total
------------------------- ------------ ------------ ---------- ----------
Total I/Os                         0.0          0.0          0        n/a
------------------------- ------------ ------------ ---------- ----------
Total I/Os                         0.0          0.0          0      0.0 %
-----------------------------------------------------------------------------

Device:
/dev/vx/rdsk/sybaserdatadg/datadev_127
datadev_127                    per sec     per xact      count % of total
------------------------- ------------ ------------ ---------- ----------
Reads
  APF                              0.0          0.0          5      0.4 %
  Non-APF                          0.0          0.0          1      0.1 %
Writes                             3.8          0.0       1128     99.5 %
------------------------- ------------ ------------ ---------- ----------
Total I/Os                         3.8          0.0       1134      0.1 %
Mirror Semaphore Granted           3.8          0.0       1134    100.0 %
Mirror Semaphore Waited            0.0          0.0          0      0.0 %
-----------------------------------------------------------------------------

Device:
/sybaser/database/sybaseR/dev/sybaseR.datadev_000
GPS_datadev_000                per sec     per xact      count % of total
------------------------- ------------ ------------ ---------- ----------
Reads
  APF                              7.9          0.0       2372     55.9 %
  Non-APF                          5.5          0.0       1635     38.6 %
Writes                             0.8          0.0        233      5.5 %
------------------------- ------------ ------------ ---------- ----------
Total I/Os                        14.1          0.0       4240      0.3 %
Mirror Semaphore Granted          14.1          0.0       4239    100.0 %
Mirror Semaphore Waited            0.0          0.0          2      0.0 %
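I need to capture "Jul 4 2016 7:21AM" as the date,
and, from the "Engine Utilization (Tick %)" section, the Server Summary -> Average value "4.2 %".
So my data frame should look like this (Count is the Committed Xacts count from the Transaction Summary):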
Date               cpu  Count
Jul 4 2016 7:21AM  4.2  41198
I tried something like this:
read.table(text=readLines("file.txt")[count.fields("file.txt",blank.lines.skip=FALSE) == 9])
to get this line:
Average 4.2 % 0.3 % 22.9 % 72.5 %
But I want to extract only the Average that follows Engine Utilization (Tick %), because there may be many lines that start with Average. The Average line that appears right after Engine Utilization (Tick %) is the one I want.
Can I use grep inside that read.table call to search for specific text?
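One possible way to restrict the match to the right section, sketched under the assumption that the Average row sits on the line right after the first "Server Summary" line that follows the "Engine Utilization (Tick %)" header. The helper `first_after` and the shortened sample lines are made up for illustration; a real file would be read with readLines.

```r
## Hypothetical, shortened sample shaped like the report above.
txt <- c(
  "Engine Utilization (Tick %)   User Busy   System Busy    I/O Busy        Idle",
  "  Pool Summary        Total     336.1 %        25.6 %    1834.6 %    5803.8 %",
  "                    Average       4.2 %         0.3 %      22.9 %      72.5 %",
  "  Server Summary      Total     336.1 %        25.6 %    1834.6 %    5803.8 %",
  "                    Average       4.2 %         0.3 %      22.9 %      72.5 %",
  "Average Runnable Tasks            1 min         5 min      15 min  % of total",
  "  Server Summary      Total         7.2           5.9         6.1",
  "                    Average         0.1           0.1         0.1"
)

## first_after() is a made-up helper: index of the first line matching
## `pattern` (literal match) that comes strictly after line `start`.
first_after <- function(pattern, lines, start) {
  hits <- grep(pattern, lines, fixed = TRUE)
  hits <- hits[hits > start]
  if (length(hits) == 0) NA_integer_ else hits[1]
}

header <- grep("Engine Utilization (Tick %)", txt, fixed = TRUE)[1]
srv    <- first_after("Server Summary", txt, header)

## Only the selected line is handed to read.table; field 2 is the CPU value.
avg <- read.table(text = txt[srv + 1])
cpu <- avg[[2]]
```

The grep calls do the searching on the character vector, and read.table only sees the one line that was selected, so the later Average rows under "Average Runnable Tasks" are never considered.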
%%%% Shot 1 - got something working
extract <- function(filenam="file.txt"){
    txt <- readLines(filenam)

    ## date of current run:
    ## assumed to be on 2nd line following the first line matching "current_run"
    ii <- 2 + grep("current_run",txt,fixed=TRUE)[1]
    line_current_run <- Filter(function(v) v!="",strsplit(txt[ii]," ")[[1]])
    date_current_run <- paste(line_current_run[5:8],collapse=" ")

    ## cpu:
    ## assumed to be on line following the first line matching "Server Summary"
    ## which comes after the first line matching "Engine Utilization ..."
    jj <- grep("Engine Utilization (Tick %)",txt,fixed=TRUE)[1]
    ii <- grep("Server Summary",txt,fixed=TRUE)
    ii <- 1 + min(ii[ii>jj])
    line_cpu <- Filter(function(v) v!="",strsplit(txt[ii]," ")[[1]])
    cpu <- line_cpu[2]

    ## Count:
    ## assumed to be on 2nd line following the first line matching "Transaction Summary"
    ii <- 2 + grep("Transaction Summary",txt,fixed=TRUE)[1]
    line_count <- Filter(function(v) v!="",strsplit(txt[ii]," ")[[1]])
    count <- line_count[5]

    ## all three columns are returned as character strings
    data.frame(Date=date_current_run,cpu=cpu,Count=count,stringsAsFactors=FALSE)
}

print(extract("file.txt"))

## merge the results from several files into one data frame
##file.list <- dir("./")
file.list <- rep("file.txt",3)
merged <- do.call("rbind",lapply(file.list,extract))
print(merged)

## timing test on 2000 files
file.list <- rep("file.txt",2000)
print(system.time(merged <- do.call("rbind",lapply(file.list,extract))))
## runs in about 2.5 secs on my laptop
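Since the function above returns every column as a character string, a follow-up conversion may be useful before plotting or aggregating. The sketch below is an assumption-laden illustration rather than part of the original answer: it assumes the timestamp always looks like "Jul 4 2016 7:21AM", and it picks UTC as the time zone because the report does not state one.

```r
## One row shaped like the output of extract(); values copied from the sample.
res <- data.frame(Date = "Jul 4 2016 7:21AM", cpu = "4.2", Count = "41198",
                  stringsAsFactors = FALSE)

## %b (month name) and %p (AM/PM) are locale-dependent, so parse under the
## C locale and restore the previous setting afterwards.
old_locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")
res$Date <- as.POSIXct(strptime(res$Date, format = "%b %d %Y %I:%M%p",
                                tz = "UTC"))   # UTC is an assumption
Sys.setlocale("LC_TIME", old_locale)

res$cpu   <- as.numeric(res$cpu)
res$Count <- as.integer(res$Count)
str(res)
```

Doing the conversion after the merge, once for the whole data frame, keeps the per-file parsing function simple.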
%%% Shot 2: first attempt at extracting a (possibly variable) number of device columns
extractv2 <- function(filenam="file2.txt"){
    txt <- readLines(filenam)

    ## date of current run:
    ## assumed to be on 2nd line following the first line matching "current_run"
    ii <- 2 + grep("current_run",txt,fixed=TRUE)[1]
    line_current_run <- Filter(function(v) v!="",strsplit(txt[ii]," ")[[1]])
    date_current_run <- paste(line_current_run[5:8],collapse=" ")

    ## cpu:
    ## assumed to be on line following the first line matching "Server Summary"
    ## which comes after the first line matching "Engine Utilization ..."
    jj <- grep("Engine Utilization (Tick %)",txt,fixed=TRUE)[1]
    ii <- grep("Server Summary",txt,fixed=TRUE)
    ii <- 1 + min(ii[ii>jj])
    line_cpu <- Filter(function(v) v!="",strsplit(txt[ii]," ")[[1]])
    cpu <- line_cpu[2]

    ## Count:
    ## assumed to be on 2nd line following the first line matching "Transaction Summary"
    ii <- 2 + grep("Transaction Summary",txt,fixed=TRUE)[1]
    line_count <- Filter(function(v) v!="",strsplit(txt[ii]," ")[[1]])
    count <- line_count[5]

    ## Total I/Os
    ## 1. Each line "Device:" is assumed to be the header of a block of lines
    ##    containing info about a single device (there are 4 such blocks
    ##    in your example);
    ## 2. each block is assumed to contain one or more lines matching
    ##    "Total I/Os";
    ## 3. the relevant count data is assumed to be contained in the last
    ##    of such lines (at column 4, i.e. the 5th space-separated token),
    ##    for each block.
    ## Approach: loop on the line numbers of those lines matching "Device:"
    ## to get: A. counts; B. device names
    ii_block_dev <- grep("Device:",txt,fixed=TRUE)
    ii_lines_IOs <- grep("Total I/Os",txt,fixed=TRUE)
    nblocks <- length(ii_block_dev)

    ## A. get counts for each device
    ## for each block, select *last* line matching "Total I/Os"
    ii_block_dev_aux <- c(ii_block_dev,Inf) ## just a hack to get a clean code
    ii_lines_IOs_dev <- sapply(1:nblocks,function(block){
        ## select lines matching "Total I/Os" within each block
        IOs_per_block <- ii_lines_IOs[ ii_lines_IOs > ii_block_dev_aux[block ] &
                                       ii_lines_IOs < ii_block_dev_aux[block+1] ]
        tail(IOs_per_block,1) ## get the last line of each block (if more than one match)
    })
    lines_IOs <- lapply(txt[ii_lines_IOs_dev],function(strng){
        Filter(function(v) v!="",strsplit(strng," ")[[1]])
    })
    IOs_counts <- sapply(lines_IOs,function(v) v[5])

    ## B. get device names:
    ## assumed to be on lines following each "Device:" match
    ii_devices <- 1 + ii_block_dev
    device_names <- sapply(ii_devices,function(ii){
        Filter(function(v) v!="",strsplit(txt[ii]," ")[[1]])
    })

    ## Create a data.frame with "device_names" as column names and "IOs_counts" as
    ## the values of a single row.
    ## Sorting the device names by order() will help produce the same column names
    ## if different sysmon files list the devices in different order
    ord <- order(device_names)
    devices <- as.data.frame(structure(as.list(IOs_counts[ord]),names=device_names[ord]),
                             check.names=FALSE) ## Prevent R from messing with our device names
    data.frame(stringsAsFactors=FALSE,check.names=FALSE,
               Date=date_current_run,cpu=cpu,Count=count,devices)
}

print(extractv2("file2.txt"))

## WATCH OUT:
## merging will ONLY work if all devices have the same names across sysmon files!!
file.list <- rep("file2.txt",3)
merged <- do.call("rbind",lapply(file.list,extractv2))
print(merged)
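The warning above matters because rbind on data frames refuses inputs whose column names differ. If the sysmon files may list different devices, one workaround is to pad each one-row result with NA for the devices it lacks before binding. The sketch below uses base R only; `rbind_fill` is a hypothetical helper name, and the device paths and counts are invented for the demonstration.

```r
## rbind_fill() is a hypothetical helper, not a base R function.
## It pads every data frame with NA columns so that all share the same names.
rbind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    absent <- setdiff(all_cols, names(d))
    for (col in absent) d[[col]] <- NA   # add each absent column as NA
    d[all_cols]                          # same column order in every frame
  })
  do.call("rbind", padded)
}

## Two made-up one-row results with partly different device columns.
a <- data.frame(Date = "Jul 4 2016 7:21AM", `/dev/a` = "0", `/dev/b` = "1134",
                check.names = FALSE, stringsAsFactors = FALSE)
b <- data.frame(Date = "Jul 4 2016 7:25AM", `/dev/b` = "12", `/dev/c` = "7",
                check.names = FALSE, stringsAsFactors = FALSE)

merged <- rbind_fill(list(a, b))
print(merged)
```

A device that is absent from a file then shows up as NA in that file's row instead of stopping the merge with an error.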
%%%%%%% Shot 3: extract two tables, one with a single row and one with a variable number of rows (depending on the devices listed in each sysmon file).
extractv3 <- function(filenam="file2.txt"){
    txt <- readLines(filenam)

    ## date of current run:
    ## assumed to be on 2nd line following the first line matching "current_run"
    ii <- 2 + grep("current_run",txt,fixed=TRUE)[1]
    line_current_run <- Filter(function(v) v!="",strsplit(txt[ii]," ")[[1]])
    date_current_run <- paste(line_current_run[5:8],collapse=" ")

    ## cpu:
    ## assumed to be on line following the first line matching "Server Summary"
    ## which comes after the first line matching "Engine Utilization ..."
    jj <- grep("Engine Utilization (Tick %)",txt,fixed=TRUE)[1]
    ii <- grep("Server Summary",txt,fixed=TRUE)
    ii <- 1 + min(ii[ii>jj])
    line_cpu <- Filter(function(v) v!="",strsplit(txt[ii]," ")[[1]])
    cpu <- line_cpu[2]

    ## Count:
    ## assumed to be on 2nd line following the first line matching "Transaction Summary"
    ii <- 2 + grep("Transaction Summary",txt,fixed=TRUE)[1]
    line_count <- Filter(function(v) v!="",strsplit(txt[ii]," ")[[1]])
    count <- line_count[5]

    ## first part of output: fixed three-column structure
    fixed <- data.frame(stringsAsFactors=FALSE,
                        Date=date_current_run,cpu=cpu,Count=count)

    ## Total I/Os
    ## 1. Each line "Device:" is assumed to be the header of a block of lines
    ##    containing info about a single device (there are 4 such blocks
    ##    in your example);
    ## 2. each block is assumed to contain one or more lines matching
    ##    "Total I/Os";
    ## 3. the relevant count data is assumed to be contained in the last
    ##    of such lines (at column 4, i.e. the 5th space-separated token),
    ##    for each block.
    ## Approach: loop on the line numbers of those lines matching "Device:"
    ## to get: A. counts; B. device names
    ii_block_dev <- grep("Device:",txt,fixed=TRUE)
    if(length(ii_block_dev)==0){
        ## no devices listed: return a single placeholder row
        variable <- data.frame(stringsAsFactors=FALSE,
                               date_current_run=date_current_run,
                               device_names=NA,IOs_counts=NA)
    }else{
        ii_lines_IOs <- grep("Total I/Os",txt,fixed=TRUE)
        nblocks <- length(ii_block_dev)
        if(length(ii_lines_IOs)==0){
            warning(sprintf("WEIRD datapoint at date %s: I have %d devices but 0 I/O lines??",
                            date_current_run,nblocks))
            ##stop()
        }

        ## A. get counts for each device
        ## for each block, select *last* line matching "Total I/Os"
        ii_block_dev_aux <- c(ii_block_dev,Inf) ## just a hack to get a clean code
        ii_lines_IOs_dev <- sapply(1:nblocks,function(block){
            ## select lines matching "Total I/Os" within each block
            IOs_per_block <- ii_lines_IOs[ ii_lines_IOs > ii_block_dev_aux[block ] &
                                           ii_lines_IOs < ii_block_dev_aux[block+1] ]
            tail(IOs_per_block,1) ## get the last line of each block (if more than one match)
        })
        lines_IOs <- lapply(txt[ii_lines_IOs_dev],function(strng){
            Filter(function(v) v!="",strsplit(strng," ")[[1]])
        })
        IOs_counts <- sapply(lines_IOs,function(v) v[5])

        ## B. get device names:
        ## assumed to be on lines following each "Device:" match
        ii_devices <- 1 + ii_block_dev
        device_names <- sapply(ii_devices,function(ii){
            Filter(function(v) v!="",strsplit(txt[ii]," ")[[1]])
        })

        ## Create a data.frame with three columns: date, device, counts
        variable <- data.frame(stringsAsFactors=FALSE,
                               date_current_run=rep(date_current_run,length(IOs_counts)),
                               device_names=device_names,
                               IOs_counts=IOs_counts)
    }
    list(fixed=fixed,variable=variable)
}

print(extractv3("file2.txt"))

file.list <- c("file.txt","file2.txt","file3.txt")
res <- lapply(file.list,extractv3)
fixed.merged <- do.call("rbind",lapply(res,function(r) r$fixed))
print(fixed.merged)
variable.merged <- do.call("rbind",lapply(res,function(r) r$variable))
print(variable.merged)
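If a single table is needed later, the two merged tables can be joined on the run date, assuming each sysmon file has a distinct "current_run" timestamp. The sketch below shows that join with base R's merge; the second run and all device paths and counts are invented so that the example is self-contained.

```r
## Made-up stand-ins for fixed.merged and variable.merged.
fixed.merged <- data.frame(
  Date  = c("Jul 4 2016 7:21AM", "Jul 4 2016 7:25AM"),
  cpu   = c("4.2", "5.0"),
  Count = c("41198", "40000"),
  stringsAsFactors = FALSE)

variable.merged <- data.frame(
  date_current_run = c("Jul 4 2016 7:21AM", "Jul 4 2016 7:21AM",
                       "Jul 4 2016 7:25AM"),
  device_names = c("/dev/a", "/dev/b", "/dev/a"),
  IOs_counts   = c("0", "1134", "12"),
  stringsAsFactors = FALSE)

## One row per (run, device), with the run-level columns repeated.
combined <- merge(fixed.merged, variable.merged,
                  by.x = "Date", by.y = "date_current_run")
print(combined)
```

Keeping the two tables separate until this point avoids the column-name problem of Shot 2, because new devices only add rows, never columns.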