我们一直在命令行使用grep,cut,sort,uniq和join来进行数据分析.虽然有缺点,但它们效果很好.例如,您必须为每个工具提供列号.我们经常有宽文件(很多列)和列标题,它们给出了列名.实际上,我们的文件看起来很像sql表.我确定有一个驱动程序(ODBC?)将操作分隔的文本文件,以及一些将使用该驱动程序的查询引擎,因此我们可以在我们的文本文件上使用SQL查询.由于进行分析通常是临时的,因此查询新文件(仅使用我在此目录中指定的文件)而不是在某些配置中声明特定表必须是最小化设置.
解决方法
重复其他人的建议,这里是sqlite3的
Python脚本.有点冗长,但它的确有效.
我不喜欢完全复制文件以删除标题行,但我不知道如何说服sqlite3的.import跳过它.我可以创建INSERT语句,但是如果不是更糟的话,这看起来也差一点.
示例调用:
代码:
- #!/usr/bin/env python
- """Run a sql statement on a text file"""
- import os
- import sys
- import getopt
- import tempfile
- import re
- class Usage(Exception):
- def __init__(self,msg):
- self.msg = msg
- def runCmd(cmd):
- if os.system(cmd):
- print "Error running " + cmd
- sys.exit(1)
- # TODO(dan): Return actual exit code
- def usage():
- print >>sys.stderr,"Usage: sql.py --file file --sql sql"
- def main(argv=None):
- if argv is None:
- argv = sys.argv
- try:
- try:
- opts,args = getopt.getopt(argv[1:],"h",["help","file=","sql="])
- except getopt.error,msg:
- raise Usage(msg)
- except Usage,err:
- print >>sys.stderr,err.msg
- print >>sys.stderr,"for help use --help"
- return 2
- filename = None
- sql = None
- for o,a in opts:
- if o in ("-h","--help"):
- usage()
- return 0
- elif o in ("--file"):
- filename = a
- elif o in ("--sql"):
- sql = a
- else:
- print "Found unexpected option " + o
- if not filename:
- print >>sys.stderr,"Must give --file"
- sys.exit(1)
- if not sql:
- print >>sys.stderr,"Must give --sql"
- sys.exit(1)
- # Get the first line of the file to make a CREATE statement
- #
- # Copy the rest of the lines into a new file (datafile) so that
- # sqlite3 can import data without header. If sqlite3 could skip
- # the first line with .import,this copy would be unnecessary.
- foo = open(filename)
- datafile = tempfile.NamedTemporaryFile()
- first = True
- for line in foo.readlines():
- if first:
- headers = line.rstrip().split()
- first = False
- else:
- print >>datafile,line,datafile.flush()
- #print datafile.name
- #runCmd("cat %s" % datafile.name)
- # Create columns with NUMERIC affinity so that if they are numbers,# sql queries will treat them as such.
- create_statement = "CREATE TABLE data (" + ",".join(
- map(lambda x: "`%s` NUMERIC" % x,headers)) + ");"
- cmdfile = tempfile.NamedTemporaryFile()
- #print cmdfile.name
- print >>cmdfile,create_statement
- print >>cmdfile,".separator ' '"
- print >>cmdfile,".import '" + datafile.name + "' data"
- print >>cmdfile,sql + ";"
- cmdfile.flush()
- #runCmd("cat %s" % cmdfile.name)
- runCmd("cat %s | sqlite3" % cmdfile.name)
- if __name__ == "__main__":
- sys.exit(main())