In NLTK, how do I traverse a parsed sentence to return a list of noun phrase strings?
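I have two goals:
(1) Build a list of noun phrases instead of printing them with the 'traverse()' method. I currently use StringIO to capture the output of the existing traverse() method. That is not an acceptable solution.
(2) Un-parse the noun phrase strings, so that '(NP Michael/NNP Jackson/NNP)' becomes 'Michael Jackson'. Is there a method in NLTK for un-parsing?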
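The NLTK documentation suggests using traverse() to view the noun phrases, but how do I capture 't' inside this recursive method so that it produces a list of noun phrase strings?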
import nltk
from nltk.tag import pos_tag

def traverse(t):
    try:
        t.label()
    except AttributeError:
        return
    else:
        if t.label() == 'NP':
            print(t)  # or do something else
        else:
            for child in t:
                traverse(child)

def nounPhrase(tagged_sent):
    # tagged_sent is a list of tuples: [(Word, PartOfSpeech)]
    # Define several tag patterns
    grammar = r"""
      NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
          {<NNP>+}                # chunk sequences of proper nouns
          {<NN>+}                 # chunk consecutive nouns
    """
    cp = nltk.RegexpParser(grammar)  # Define parser
    SentenceTree = cp.parse(tagged_sent)
    NounPhrases = traverse(SentenceTree)  # collect noun phrases
    return NounPhrases

sentence = "Michael Jackson likes to eat at McDonalds"
tagged_sent = pos_tag(sentence.split())  # Tag sentence for part of speech
NP = nounPhrase(tagged_sent)
print(NP)
This currently prints:
(NP Michael/NNP Jackson/NNP)
(NP McDonalds/NNP)
and stores None in NP.
Solution
def extract_np(psent):
    for subtree in psent.subtrees():
        if subtree.label() == 'NP':
            yield ' '.join(word for word, tag in subtree.leaves())

# grammar and tagged_sent are the ones defined in the question
cp = nltk.RegexpParser(grammar)
parsed_sent = cp.parse(tagged_sent)
for npstr in extract_np(parsed_sent):
    print(npstr)
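If you want an actual list back rather than a generator, or want to keep the recursive traverse() shape from the question, something like the following may work. It is only a sketch under a few assumptions: the names noun_phrases, GRAMMAR and the found accumulator are made up for illustration, and the sentence is tagged by hand so the snippet runs without downloading a tagger model (pos_tag output may differ from these tags).

```python
import nltk

# Same chunk grammar as in the question.
GRAMMAR = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # determiner/possessive, adjectives and noun
      {<NNP>+}                # sequences of proper nouns
      {<NN>+}                 # consecutive nouns
"""


def noun_phrases(tagged_sent):
    """Return the NP chunks of a POS-tagged sentence as plain strings."""
    tree = nltk.RegexpParser(GRAMMAR).parse(tagged_sent)
    return [
        # leaves() yields (word, tag) tuples; keep only the words
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
    ]


def traverse(t, found=None):
    """Recursive variant: collect NP strings in a list instead of printing."""
    if found is None:
        found = []
    if isinstance(t, nltk.Tree):
        if t.label() == "NP":
            found.append(" ".join(word for word, tag in t.leaves()))
        else:
            for child in t:
                traverse(child, found)
    return found


# Hand-tagged tokens (an assumption, standing in for pos_tag output).
tagged = [("Michael", "NNP"), ("Jackson", "NNP"), ("likes", "VBZ"),
          ("to", "TO"), ("eat", "VB"), ("at", "IN"), ("McDonalds", "NNP")]

phrases = noun_phrases(tagged)
print(phrases)  # ['Michael Jackson', 'McDonalds']
```

Both versions handle goal (2) the same way: the leaves of an NP subtree are (word, tag) tuples, so joining the words drops the tags and the bracketing. The original traverse() stored None because it only printed and never returned anything.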