I have:
from __future__ import division
import nltk,re,pprint
f = open('/home/a/Desktop/Projects/FinnegansWake/JamesJoyce-FinnegansWake.txt')
raw = f.read()
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
words = [w.lower() for w in text]
f2 = open('/home/a/Desktop/Projects/FinnegansWake/catted-several-long-Russian-novels-and-the-NYT.txt')
englishraw = f2.read()
englishtokens = nltk.wordpunct_tokenize(englishraw)
englishtext = nltk.Text(englishtokens)
englishwords = [w.lower() for w in englishtext]
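This is straight from the NLTK manual. What I want to do next is compare the vocabulary against an exhaustive set of English words, such as the OED, and extract the difference: the set of Finnegans Wake words that are not, and probably never will be, in the OED. I'm more of a verbal person than a mathematically oriented one, so I haven't figured out how to do this yet, and the manual goes into too much detail about things I don't actually want to do. I'm assuming it's just one or two more lines of code, though.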
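The comparison described here amounts to a set difference. Below is a minimal sketch of that idea, assuming both inputs are lists of lowercased tokens like `words` and `englishwords`. The helper name `words_not_in_reference` and the two short sample lists are hypothetical stand-ins for illustration, not part of NLTK, and the reference corpus is only a rough proxy for a dictionary like the OED.

```python
def words_not_in_reference(words, reference_words):
    """Return, sorted, the alphabetic words that never occur in reference_words."""
    # Keep only purely alphabetic tokens so punctuation and numbers drop out.
    candidates = set(w for w in words if w.isalpha())
    reference = set(reference_words)
    # Set difference: everything in candidates that is absent from reference.
    return sorted(candidates - reference)


# Hypothetical sample data standing in for the two tokenized files.
wake_sample = ["riverrun", ",", "past", "eve", "and", "adam", "'", "s",
               "bababadalgharaghtakamminarronnkonn", "past"]
english_sample = ["past", "eve", "and", "adam", "s", "the"]

novel = words_not_in_reference(wake_sample, english_sample)
print(novel)
```

With the real data the call would presumably be `words_not_in_reference(words, englishwords)`. Converting both lists to sets first keeps the lookup fast even for very large corpora, and it removes duplicates as a side effect.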