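I'm processing a fairly large collection of tweets, and for each tweet I'd like to get its mentions (other users' names, prefixed with @), but only when the mentioned user also appears in the file: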

    users = new Dictionary()
    for each line in file:
        username = get_username(line)
        userid = get_userid(line)
        # keyed by username, so that mentioned names can be looked up below
        users.add(key = username, value = userid)

    for each line in file:
        mentioned_names = get_mentioned_names(line)
        mentioned_ids = mentioned_names.map(x => if x in users: users[x] else null)
        print "$line | $mentioned_ids"

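I was already using GAWK to process the file, so rather than processing it again in Python or C, I decided to try adding this to my AWK script. However, I can't find a way to make two passes over the same file, executing different code on each pass. Most solutions imply calling AWK several times, but then I would lose the associative array I built in the first pass.
I could do it in very hacky ways (like cat-ing the file twice and piping each copy through sed to add a different prefix to all of its lines), but I'd like to be able to understand this code in a couple of months without hating myself.
What would be the AWK way to do this?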
PS:
The least horrible way I've found so far:

    function rewind(    i)
    {
        # from https://www.gnu.org/software/gawk/manual/html_node/Rewind-Function.html
        # shift remaining arguments up
        for (i = ARGC; i > ARGIND; i--)
            ARGV[i] = ARGV[i-1]

        # make sure gawk knows to keep going
        ARGC++

        # make current file next to get done
        ARGV[ARGIND+1] = FILENAME

        # do it
        nextfile
    }

    BEGIN {
        count = 1;
    }

    count == 1 {
        # first pass, fills an associative array
    }

    count == 2 {
        # second pass, uses the array
    }

    FNR == 30 {
        # handcoded length, horrible
        # could also be automated calling wc -l, passing as parameter
        if (count == 1) {
            count = 2;
            rewind(1)
        }
    }
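For comparison, here is a minimal sketch of an alternative that avoids the hand-coded line count. It names the same file twice on the command line and uses `NR == FNR` to detect the first pass. The input layout (tab-separated `userid`, `username`, tweet text), the mention pattern, and the `resolve_mentions` wrapper name are all assumptions made for illustration, not details from my real data. Unlike the pseudocode, unknown mentions are skipped rather than mapped to null.

```shell
#!/bin/sh
# Two-pass sketch: the same file is passed to awk twice.
# ASSUMED line format (hypothetical): userid <TAB> username <TAB> tweet text
# resolve_mentions is a hypothetical wrapper name used only for this example.

resolve_mentions() {
    awk -F'\t' '
        # Pass 1: NR == FNR is true only while the first file argument is read
        # (assuming that file is not empty).
        NR == FNR {
            ids[$2] = $1            # map username -> userid
            next
        }

        # Pass 2: the array filled in pass 1 is still in memory.
        {
            text = $3
            out = ""
            # Pull out every @name; keep only names seen in pass 1.
            while (match(text, /@[A-Za-z0-9_]+/)) {
                name = substr(text, RSTART + 1, RLENGTH - 1)   # drop the "@"
                if (name in ids)
                    out = (out == "" ? "" : out ",") ids[name]
                text = substr(text, RSTART + RLENGTH)
            }
            print $0 " | " out
        }
    ' "$1" "$1"
}
```

It would be called as `resolve_mentions tweets.tsv`, where `tweets.tsv` is a placeholder file name. Because both passes run inside a single awk process, the associative array survives between them, and no `rewind` or fixed `FNR` value is needed.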