regex – sed – Removing quotes within quotes in a large CSV file

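I am using the stream editor sed to convert a large set of text file data (400MB) into CSV format.

I am very close to finishing, but the outstanding problem is quotes within quotes, for data like this: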

1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
3,word3,"description for "word3"","more text and more"

The desired output is:

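1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
3,word3,"description for word3","more text and more"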

I have been looking for help but have not gotten close to a solution. I tried the following sed commands with these regex patterns:

sed -i 's/(?<!^\s*|,)""(?!,""|\s*$)//g' *.txt
sed -i 's/(?<=[^,])"(?=[^,])//g' *.txt

These come from the following questions, but they do not seem to work with sed:

Related question for perl

Related question for SISS

The original files are *.txt, and I am trying to edit them with sed.

Solution

Here is one way using GNU awk and the FPAT variable:
gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" } { for (i=1;i<=NF;i++) if ($i ~ /^\".*\"$/) { gsub(/\"/,"",$i); $i=N $i N } }1' file

Result:

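1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
3,word3,"description for word3","more text and more"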

Explanation:

Using FPAT, a field is defined as either "anything that is not a comma" or "a double quote, anything that is not a double quote, and a closing double quote". Then, on every line of input, loop through each field, and if the field starts and ends with a double quote, remove all quotes from the field. Finally, add double quotes surrounding the field.
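Since the question asked about sed specifically: the two patterns tried there rely on lookarounds such as (?<=...) and (?!...), which are Perl-style constructs that sed's POSIX basic and extended regular expressions do not support. The following is only a rough sketch of a sed alternative, not part of the original answer. It swaps the lookarounds for capture groups and assumes that an inner quote never sits directly next to a comma or at the very start or end of a line. The file name sample.txt and the function name strip_inner_quotes are made up for illustration.

```shell
# Hypothetical sample file reproducing the rows from the question.
cat > sample.txt <<'EOF'
1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
3,word3,"description for "word3"","more text and more"
EOF

# Delete any double quote that has a non-comma character on both sides.
# Each pass removes one such quote; the loop (:a ... ta) repeats until no
# substitution succeeds, which also handles adjacent quotes such as 3"".
# Assumption: quotes that delimit a field always touch a comma or a line edge.
strip_inner_quotes() {
  sed -E -e ':a' -e 's/([^,])"([^,])/\1\2/' -e 'ta' "$@"
}

strip_inner_quotes sample.txt
```

On the sample rows this should leave row 2 untouched and strip the inner quotes from rows 1 and 3. With GNU sed, adding -i to the same expressions would edit the *.txt files in place, as the question intended. An inner quote that directly follows or precedes a comma is left alone by this pattern, and files with CRLF line endings would need the trailing carriage returns removed first, so the gawk FPAT approach may be the safer choice for messy data.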
