Consider the following hypothetical data:
    x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please"
    y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row,the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please"
    z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please"
    df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = FALSE)
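Notice that the ":" appears at a different position in each text. For example:
- In 'x' the ":" comes after the first sentence.
- In 'y' the ":" comes after the fourth sentence.
- In 'z' the ":" comes after the sixth sentence.
- In addition, every text has another ":" just before its last sentence.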
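What I want to do is create two columns such that:
- Only the first ":" is considered, not the last one.
- If there is a ":" within the first three sentences, the whole text is split into two columns at that ":". Otherwise, all of the text goes into the second column and "NA" goes into the first column.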
Desired output for 'x':
Col1: There is a horror movie running in the iNox theater.
Col2: If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
Desired output for 'y' (because no ":" is found within the first three sentences):
Col1: NA
Col2: There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row,the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
As with 'y' above, the desired output for 'z' is:
Col1: NA
Col2: all of the text from 'z'
What I have tried:
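    # Assumption: the replacement argument "\\1" and the Col2 patterns that are
    # missing from some calls are filled in by analogy with the first call.
    resX <- data.frame(Col1 = gsub("\\s\\:.*$", "\\1", df$Text[[1]]),
                       Col2 = gsub("^[^:]+(?:).\\s", "\\1", df$Text[[1]]))
    resY <- data.frame(Col1 = gsub("\\s\\:.*$", "\\1", df$Text[[2]]),
                       Col2 = gsub("^[^:]+(?:).\\s", "\\1", df$Text[[2]]))
    resZ <- data.frame(Col1 = gsub("\\s\\:.*$", "\\1", df$Text[[3]]),
                       Col2 = gsub("^[^:]+(?:).\\s", "\\1", df$Text[[3]]))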
Then I combine the pieces above into a result data frame "resDF" using rbind.
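The problems are:
- Can the above be done with a for() loop or some other method that makes the code simpler?
- The results for the 'y' and 'z' texts are not what I want (as shown above).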
Solution
You can try this regex, which uses a negative lookahead:
^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$
Regex Demo and Detailed explanation of the regex
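To see what the lookahead does, here is a minimal sketch on three short made-up strings (early, late and dotty are hypothetical inputs, not the data from the question). As far as I can tell, the pattern refuses to match as soon as three or more literal periods come before the first ":". It counts periods rather than real sentences, so the dot inside "row.names" counts too.

```r
# Same pattern as above, written as an R string (backslashes doubled).
patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"

# Hypothetical short inputs, chosen only to illustrate the lookahead.
early <- "One. : Two. Three. Four : end"   # 1 period before the first ':'
late  <- "One. Two. Three. : Four : end"   # 3 periods before the first ':'
dotty <- "See row.names. Done. : rest"     # 3 periods, one inside "row.names"

# perl = TRUE is needed because lookaheads are a Perl-style feature.
matched <- grepl(patt, c(early, late, dotty), perl = TRUE)
print(matched)
# [1]  TRUE FALSE FALSE
```

Only the first string matches, so only that one would be split. The other two fall through to the "NA" branch.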
Updated:
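If your condition is met, the regex matches and you get two capture groups.
Group 1 contains the text before the first ":" and group 2 contains the text after it.
If the condition is not met, copy the whole string into column 2 and put whatever you need into column 1.
The updated sample snippet, which contains a function named processData, does this for you. If the condition is met, it splits the data and puts the parts into col1 and col2. If the condition is not met, as for the inputs y and z, it puts NA into col1 and the whole value into col2.
Runnable sample source (ideone):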
    library(stringr)

    x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please"
    y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row,the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please"
    z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please"
    df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = FALSE)

    # Empty result frame that the loop below fills row by row.
    resDF <- data.frame("Col1" = character(), "Col2" = character(), stringsAsFactors = FALSE)

    # Returns c(col1, col2) for a single text.
    processData <- function(a) {
      patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"
      if (grepl(patt, a, perl = TRUE)) {
        # str_match() returns a matrix: full match, group 1, group 2.
        result <- str_match(a, patt)
        col1 <- result[2]
        col2 <- result[3]
      } else {
        # Note: this is the string "NA", not a missing value.
        col1 <- "NA"
        col2 <- a
      }
      return(c(col1, col2))
    }

    for (i in seq_len(nrow(df))) {
      tmp <- df[i, ]
      resDF[nrow(resDF) + 1, ] <- processData(tmp)
    }
    print(resDF)
Sample output:
Col1
1 There is a horror movie running in the iNox theater.
2 NA
3 NA

Col2
1 If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
2 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row,the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
3 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
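Since the question also asks for something simpler than a loop, here is a possible loop-free variant in base R. Treat it as a sketch under a few assumptions rather than the definitive way: split_on_early_colon is a hypothetical helper name, the short demo strings are made up, and two things are my own choices that differ from the snippet above. It trims surrounding whitespace with trimws(), and it uses a real missing value (NA_character_) instead of the string "NA".

```r
# Hypothetical helper: splits each text at its first ':' when the pattern
# matches, otherwise leaves Col1 missing and keeps the whole text in Col2.
split_on_early_colon <- function(text) {
  patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"
  # For each element, regmatches() gives c(full match, group 1, group 2),
  # or character(0) when the pattern does not match.
  parts <- regmatches(text, regexec(patt, text, perl = TRUE))
  hit <- lengths(parts) == 3

  col1 <- rep(NA_character_, length(text))
  col2 <- text
  # trimws() is an addition of this sketch, not part of the snippet above.
  col1[hit] <- trimws(vapply(parts[hit], `[`, character(1), 2))
  col2[hit] <- trimws(vapply(parts[hit], `[`, character(1), 3))

  data.frame(Col1 = col1, Col2 = col2, stringsAsFactors = FALSE)
}

# Made-up inputs: the first has 1 period before the first ':', the second has 3.
demo <- c("One. : Two. Three. Four : end", "One. Two. Three. : Four : end")
res <- split_on_early_colon(demo)
print(res)
```

With the data from the question you would call it as split_on_early_colon(df$Text), which should give the whole result frame in one step with no rbind.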