Keeping document IDs with an R corpus

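I have searched Stack Overflow and the web and could only find partial solutions, or ones that no longer work because of changes in tm or qdap. The problem is as follows:

I have a data frame with two columns: ID and text (a simple document ID/name, followed by some text).

I have two questions:

Part 1: How can I create a tdm or dtm and retain the document names/IDs? inspect(tdm) only shows "character(0)".
Part 2: I want to keep only a specific list of terms, i.e. the opposite of removing custom stopwords. I want this to happen in the corpus, not in the tdm/dtm.

For Part 2, I used a solution I found here: How to implement proximity rules in tm dictionary for counting words?
But that one happens at the tdm stage! Is there a better solution for Part 2, something like "tm_map(my.corpus, keepOnlyWords, customlist)"?

Any help would be greatly appreciated.
Many thanks!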

Solution

First, here is a sample data.frame

dd<-data.frame(
    id=10:13,
    text=c("No wonder, then, that ever gathering volume from the mere transit ",
      "So that in many cases such a panic did he finally strike, that few ",
      "But there were still other and more vital practical influences at work",
      "Not even at the present day has the original prestige of the Sperm Whale"),
    stringsAsFactors=FALSE
 )

Now, in order to read special attributes from a data.frame, we will use the readTabular function to make our own custom data.frame reader. This is all we need to do

library(tm)
myReader <- readTabular(mapping=list(content="text",id="id"))

We simply specify which columns of the data.frame to use for the content and for the id. Now we read it in with DataframeSource, but using our custom reader.

tm <- VCorpus(DataframeSource(dd),readerControl=list(reader=myReader))
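The question mentions breakage caused by changes in tm, and the reader-based approach above may be affected too. As far as I know, from tm 0.7 onward DataframeSource expects the data.frame to contain columns named `doc_id` and `text`, and readTabular is no longer provided. A minimal sketch under that assumption (the names `dd2` and `corp` are only for illustration and are not part of the original answer):

```r
library(tm)

# Same example data, but with the column names newer tm versions expect.
# doc_id is stored as character so the IDs are kept as plain strings.
dd2 <- data.frame(
  doc_id = as.character(10:13),
  text = c("No wonder, then, that ever gathering volume from the mere transit ",
           "So that in many cases such a panic did he finally strike, that few ",
           "But there were still other and more vital practical influences at work",
           "Not even at the present day has the original prestige of the Sperm Whale"),
  stringsAsFactors = FALSE
)

# No custom reader is needed in this variant.
corp <- VCorpus(DataframeSource(dd2))

# Look up the id stored in the first document's metadata.
meta(corp[[1]], "id")
```

If this works on your installed version, the rest of the steps (tm_map, DocumentTermMatrix) should apply to `corp` in the same way as to the corpus built with the custom reader.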

Now, if we want to keep only a certain set of words, we can create our own content_transformer function. One way to do this is

keepOnlyWords<-content_transformer(function(x,words) {
    regmatches(x,gregexpr(paste0("\\b(",paste(words,collapse="|"),"\\b)"),x),invert=T)<-" "
    x
})

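This replaces everything that is not in the word list with a space. Note that you will probably want to run stripWhitespace afterwards. So our transformations look like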

keep<-c("wonder","then","that","the")

tm<-tm_map(tm,content_transformer(tolower))
tm<-tm_map(tm,keepOnlyWords,keep)
tm<-tm_map(tm,stripWhitespace)
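The keep-only-words idea can also be tried on plain strings, without tm, to see what the pattern matches. The following is a base R sketch of a closely related variant, not the function from the answer: `keep_words` is a hypothetical helper that extracts the matching words and pastes them back together instead of blanking out the non-matches. It also places the closing `\\b` outside the group so that the word boundary applies to every alternative, and it uses `perl = TRUE`, which is my own choice here.

```r
# Hypothetical helper: keep only whole-word occurrences of `words` in each string.
keep_words <- function(x, words) {
  # Build a pattern such as \b(wonder|then|that|the)\b
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  # Find every match in every element of x
  m <- gregexpr(pattern, x, perl = TRUE)
  # Extract the matched words and join them with single spaces;
  # a string with no matches becomes ""
  vapply(regmatches(x, m), paste, character(1), collapse = " ")
}

x <- c("no wonder, then, that ever gathering",
       "but there were still other")
kept <- keep_words(x, c("wonder", "then", "that", "the"))
print(kept)
```

Because only whole words are matched, the "the" inside "there", "other" or "gathering" is not kept. A function like this could be wrapped in content_transformer and passed to tm_map with the word list as the extra argument, in the same way as keepOnlyWords.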

Then we can convert it to a document-term matrix

dtm<-DocumentTermMatrix(tm)
inspect(dtm)

# <<DocumentTermMatrix (documents: 4, terms: 4)>>
# Non-/sparse entries: 7/9
# Sparsity           : 56%
# Maximal term length: 6
# Weighting          : term frequency (tf)

#     Terms
# Docs that the then wonder
#   10    1   1    1      1
#   11    2   0    0      0
#   12    0   1    0      0
#   13    0   3    0      0

You can see that it contains only the words from our list and has the correct document IDs from the data.frame.
