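I am trying to read a web page's source into R and process it as a string. I want to extract the paragraphs and remove the HTML tags from the paragraph text. I am running into the following problem: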
library(stringr)
cleanFun=function(fullStr)
{
#find location of tags and citations
tagLoc=cbind(str_locate_all(fullStr,"<")[[1]][,2],str_locate_all(fullStr,">")[[1]][,1]);
#create storage for tag strings
tagStrings=list()
#extract and store tag strings
for(i in 1:dim(tagLoc)[1])
{
tagStrings[i]=substr(fullStr,tagLoc[i,1],tagLoc[i,2]);
}
#remove tag strings from paragraph
newStr=fullStr
for(i in 1:length(tagStrings))
{
newStr=str_replace_all(newStr,tagStrings[[i]][1],"")
}
return(newStr)
};
This works for some tags but not for all of them. An example where it fails is the following string:
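test="junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
The goal is to get:
cleanFun(test)="junk junk junk junk"
However, this does not seem to work. I think it may have something to do with the string length or with escape characters, but I can't find a solution that handles strings like these.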
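
A likely cause, offered as a guess rather than a confirmed diagnosis: str_replace_all treats its pattern as a regular expression, so the parentheses in "(mathematics)" are read as a grouping construct instead of literal characters, and the extracted tag never matches itself. One way around that is to skip the extract-then-replace loop and remove everything between "<" and ">" with a single pattern. The sketch below uses base R's gsub; the name stripTags is made up for illustration and is not part of the original code.

```r
# Minimal sketch: delete every "<...>" run with one regular expression.
# "<[^>]*>" matches "<", then any characters that are not ">", then ">".
# stripTags is a hypothetical helper name, not from the original post.
stripTags <- function(htmlString) {
  gsub("<[^>]*>", "", htmlString)
}

test <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
stripTags(test)
# [1] "junk junk junk junk"
```

This is only a pattern-based approximation: it will misbehave if an attribute value contains ">" and it does not treat comments or script blocks specially, so a real HTML parser is probably the safer choice for whole pages. If you would rather keep the original loop, wrapping the pattern in stringr's fixed(), as in str_replace_all(newStr, fixed(tagStrings[[i]][1]), ""), should make the match literal.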