I have transcripts of conversations between two arbitrary people, A and B.
    c1 <- "Person A: blabla...something Person B: blabla something else Person A: OK blabla"
    c2 <- "Person A: again blabla Person B: blabla something else Person A: thanks blabla"
The data frame looks like this:
    df <- data.frame(id = rbind(123, 345), conversation = rbind(c1, c2))
    df

        id conversation
    c1 123 Person A: blabla...something Person B: blabla something else Person A: OK blabla
    c2 345 Person A: again blabla Person B: blabla something else Person A: thanks blabla
Now I want to extract only Person A's parts and put them in a data frame. The result should be:
       id person_A
    1 123 blabla...something OK blabla
    2 345 again blabla thanks blabla
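I'm a big fan of solving this kind of problem in a way that keeps all of the data accessible (including Person B's utterances). I like tidyr's extract() for this kind of column splitting. I used to use the do.call(rbind, strsplit()) approach, but I prefer how clean the extract() approach is.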
    c1 <- "Person A: blabla...something Person B: blabla something else Person A: OK blabla"
    c2 <- "Person A: again blabla Person B: blabla something else Person A: thanks blabla"
    c3 <- "Person A: again blabla Person B: blabla something else"
    df <- data.frame(id = rbind(123, 345, 567), conversation = rbind(c1, c2, c3))

    if (!require("pacman")) install.packages("pacman")
    pacman::p_load(dplyr, tidyr)

    # Split each conversation in front of every "Person " label
    conv <- strsplit(as.character(df[["conversation"]]), "\\s+(?=Person\\s)", perl = TRUE)

    # Repeat each row once per turn, then fill in the individual turns
    df2 <- df[rep(1:nrow(df), sapply(conv, length)), , drop = FALSE]
    rownames(df2) <- NULL
    df2[["conversation"]] <- unlist(conv)

    # Separate the speaker label from what was said
    df2 %>%
        extract(conversation, c("Person", "Conversation"), "([^:]+):\\s+(.+)")

    ##    id   Person Conversation
    ## 1 123 Person A blabla...something
    ## 2 123 Person B blabla something else
    ## 3 123 Person A OK blabla
    ## 4 345 Person A again blabla
    ## 5 345 Person B blabla something else
    ## 6 345 Person A thanks blabla
    ## 7 567 Person A again blabla
    ## 8 567 Person B blabla something else

    # Keep only Person A's turns
    df2 %>%
        extract(conversation, c("Person", "Conversation"), "([^:]+):\\s+(.+)") %>%
        filter(Person == "Person A")

    ##    id   Person Conversation
    ## 1 123 Person A blabla...something
    ## 2 123 Person A OK blabla
    ## 3 345 Person A again blabla
    ## 4 345 Person A thanks blabla
    ## 5 567 Person A again blabla
    # Collapse Person A's turns into one string per id
    df2 %>%
        extract(conversation, c("Person", "Conversation"), "([^:]+):\\s+(.+)") %>%
        filter(Person == "Person A") %>%
        group_by(id) %>%
        select(-Person) %>%
        summarise(Person_A = paste(Conversation, collapse = " "))

    ##    id Person_A
    ## 1 123 blabla...something OK blabla
    ## 2 345 again blabla thanks blabla
    ## 3 567 again blabla
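For comparison, the same split-then-filter idea can be sketched in base R alone, without dplyr or tidyr. The helper name `speaker_text` and its arguments are made up for illustration, and the sketch assumes every turn starts with a "Person X:" label, as in the sample data.

```r
# Hypothetical helper (not from any package): collapse one speaker's turns
# per conversation, using only base R.
speaker_text <- function(conversation, speaker = "Person A") {
  # Split each conversation in front of every "Person " label.
  turns <- strsplit(as.character(conversation), "\\s+(?=Person\\s)", perl = TRUE)
  vapply(turns, function(x) {
    # Separate "label: text" into the label and the utterance.
    label <- sub(":.*$", "", x)
    text  <- sub("^[^:]+:\\s+", "", x)
    # Keep only the requested speaker and join their turns with a space.
    paste(text[label == speaker], collapse = " ")
  }, character(1))
}

c1 <- "Person A: blabla...something Person B: blabla something else Person A: OK blabla"
c2 <- "Person A: again blabla Person B: blabla something else Person A: thanks blabla"
df <- data.frame(id = c(123, 345), conversation = c(c1, c2))

result <- data.frame(id = df$id, person_A = speaker_text(df$conversation))
result
```

Passing `speaker = "Person B"` returns the other side of the conversation, so nothing is thrown away by the split itself.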
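Edit: Actually, I suspect your data has real names such as "John Smith" rather than "Person A". If that is the case, this initial regex split will pick up a first and last name, each starting with a capital letter, followed by a colon: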
    c1 <- "Greg Smith: blabla...something Sue Williams: blabla something else Greg Smith: OK blabla"
    c2 <- "Greg Smith: again blabla Sue Williams: blabla something else Greg Smith: thanks blabla"
    c3 <- "Greg Smith: again blabla Sue Williams: blabla something else"
    df <- data.frame(id = rbind(123, 345, 567), conversation = rbind(c1, c2, c3))

    # Split in front of every "First Last:" label
    conv <- strsplit(as.character(df[["conversation"]]), "\\s+(?=([A-Z][a-z]+\\s+[A-Z][a-z]+:))", perl = TRUE)

    df2 <- df[rep(1:nrow(df), sapply(conv, length)), , drop = FALSE]
    rownames(df2) <- NULL
    df2[["conversation"]] <- unlist(conv)

    df2 %>%
        extract(conversation, c("Person", "Conversation"), "([^:]+):\\s+(.+)")

    ##    id       Person Conversation
    ## 1 123   Greg Smith blabla...something
    ## 2 123 Sue Williams blabla something else
    ## 3 123   Greg Smith OK blabla
    ## 4 345   Greg Smith again blabla
    ## 5 345 Sue Williams blabla something else
    ## 6 345   Greg Smith thanks blabla
    ## 7 567   Greg Smith again blabla
    ## 8 567 Sue Williams blabla something else
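To see what the name-based split does on its own, here is a small check on a single string. It is only a sketch: the pattern assumes every speaker label is exactly two capitalised words followed by a colon, so names like "McDonald" or "O'Brien", or single-word names, would not match and would need a looser pattern.

```r
x <- "Greg Smith: blabla...something Sue Williams: blabla something else Greg Smith: OK blabla"

# Split on whitespace that is immediately followed by "First Last:".
# The lookahead (?=...) checks for the label without consuming it,
# so the label stays attached to the turn that follows.
pattern <- "\\s+(?=([A-Z][a-z]+\\s+[A-Z][a-z]+:))"
turns <- strsplit(x, pattern, perl = TRUE)[[1]]
turns

# Pull the speaker label off the front of each turn.
speakers <- sub(":.*$", "", turns)
unique(speakers)
```

Because "OK" has no lowercase letter after its capital, the space in front of "OK blabla" is not treated as a turn boundary.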