I am looking for the fastest Python library to read a CSV file (with 1 or 3 columns, all integers or floats, if it matters) into a Python array (or some object that I can access in a similar way, with similar access times). It should be free, work on Windows 7 and Ubuntu 12.04, and with Python 2.7 x64.

Sample of the 1-column file:

350
750
252
138
125
125
125
112
95
196
105
101
101
101
102
101
101
102
202
104



Sample of the 3-column file:

9,52,1
52,91,0
91,135,0
135,174,0
174,218,0
218,260,0
260,301,0
301,341,0
341,383,0
383,423,0
423,466,0
466,503,0
503,547,0
547,583,0
583,629,0
629,667,0
667,713,0
713,754,0
754,796,0
796,839,1


Comments

Coincidentally, a very similar question was posted on Super User two hours ago: superuser.com/q/775893/137286. The first answer suggests a fast library.

Have you tested fastcsv as I suggested recently? I would be curious to hear how it handles your data. Cheers, Daniel

#1

So I eventually wrote a small benchmark using the libraries Steve Barnes pointed to. I found the same ones when researching the question, so I guess these are the main candidates. Other ideas I have not tried yet: HDF5 for Python, PyTables, IOPro (non-free).

Data (I should have generated them in the benchmark, but I am running out of time right now)

Code:



import csv
import os
import cProfile
import time
import numpy
import pandas
import warnings

# Make sure those files are in the same folder as benchmark_python.py
# As the names indicate:
# - '1col.csv' is a CSV file with 1 column
# - '3col.csv' is a CSV file with 3 columns
filename1 = '1col.csv'
filename3 = '3col.csv'
csv_delimiter = ' '
debug = False

def open_with_python_csv(filename):
    '''
    https://docs.python.org/2/library/csv.html
    '''
    data =[]
    with open(filename, 'rb') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=csv_delimiter, quotechar='|')
        for row in csvreader:
            data.append(row)    
    return data

def open_with_python_csv_cast_as_float(filename):
    '''
    https://docs.python.org/2/library/csv.html
    '''
    data =[]
    with open(filename, 'rb') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=csv_delimiter, quotechar='|')
        for row in csvreader:
            data.append(map(float, row))    
    return data

def open_with_python_csv_list(filename):
    '''
    https://docs.python.org/2/library/csv.html
    '''
    data =[]
    with open(filename, 'rb') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=csv_delimiter, quotechar='|')
        data = list(csvreader)    
    return data


def open_with_numpy_loadtxt(filename):
    '''
    http://stackoverflow.com/questions/4315506/load-csv-into-2d-matrix-with-numpy-for-plotting
    '''
    data = numpy.loadtxt(open(filename,'rb'),delimiter=csv_delimiter,skiprows=0)
    return data

def open_with_pandas_read_csv(filename):
    df = pandas.read_csv(filename, sep=csv_delimiter)
    data = df.values
    return data    


def benchmark(function_name):
    start_time = time.clock()
    data = function_name(filename1)
    if debug: print data[0]
    data = function_name(filename3)
    if debug: print data[0]
    elapsed_time = time.clock() - start_time
    print function_name.__name__ + ': ' + str(elapsed_time), "seconds"
    return elapsed_time


def benchmark_numpy_fromfile():
    '''
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html
    Do not rely on the combination of tofile and fromfile for data storage, 
    as the binary files generated are not platform independent.
    In particular, no byte-order or data-type information is saved.
    Data can be stored in the platform independent .npy format using
    save and load instead.

    Note that fromfile will create a one-dimensional array containing your data,
    so you might need to reshape it afterward.
    '''
    #ignore the 'tmpnam is a potential security risk to your program' warning
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', RuntimeWarning)
        fname1 = os.tmpnam()
        fname3 = os.tmpnam()

    data = open_with_numpy_loadtxt(filename1)
    if debug: print data[0]
    data.tofile(fname1)
    data = open_with_numpy_loadtxt(filename3)
    if debug: print data[0]
    data.tofile(fname3)
    if debug: print data.shape
    fname3shape = data.shape
    start_time = time.clock()
    data = numpy.fromfile(fname1, dtype=numpy.float64) # you might need to switch to float32. List of types: http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
    if debug: print len(data), data[0], data.shape
    data = numpy.fromfile(fname3, dtype=numpy.float64)
    data = data.reshape(fname3shape)
    if debug: print len(data), data[0], data.shape
    elapsed_time = time.clock() - start_time
    print 'Numpy fromfile: ' + str(elapsed_time), "seconds"
    return elapsed_time

def benchmark_numpy_save_load():
    '''
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html
    Do not rely on the combination of tofile and fromfile for data storage, 
    as the binary files generated are not platform independent.
    In particular, no byte-order or data-type information is saved.
    Data can be stored in the platform independent .npy format using
    save and load instead.

    Note that fromfile will create a one-dimensional array containing your data,
    so you might need to reshape it afterward.
    '''
    #ignore the 'tmpnam is a potential security risk to your program' warning
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', RuntimeWarning)
        fname1 = os.tmpnam()
        fname3 = os.tmpnam()

    data = open_with_numpy_loadtxt(filename1)
    if debug: print data[0]    
    numpy.save(fname1, data)    
    data = open_with_numpy_loadtxt(filename3)
    if debug: print data[0]    
    numpy.save(fname3, data)    
    if debug: print data.shape
    fname3shape = data.shape
    start_time = time.clock()
    data = numpy.load(fname1 + '.npy')
    if debug: print len(data), data[0], data.shape
    data = numpy.load(fname3 + '.npy')
    #data = data.reshape(fname3shape)
    if debug: print len(data), data[0], data.shape
    elapsed_time = time.clock() - start_time
    print 'Numpy load: ' + str(elapsed_time), "seconds"
    return elapsed_time


def main():
    number_of_runs = 20
    results = []

    benchmark_functions = ['benchmark(open_with_python_csv)', 
                           'benchmark(open_with_python_csv_list)',
                           'benchmark(open_with_python_csv_cast_as_float)',
                           'benchmark(open_with_numpy_loadtxt)',
                           'benchmark(open_with_pandas_read_csv)',
                           'benchmark_numpy_fromfile()',
                           'benchmark_numpy_save_load()']
    # Compute benchmark
    for run_number in range(number_of_runs):
        run_results = []
        for benchmark_function in benchmark_functions:
            run_results.append(eval(benchmark_function))
        results.append(run_results)

    # Display benchmark's results
    print results
    results = numpy.array(results)
    numpy.set_printoptions(precision=10) # http://stackoverflow.com/questions/2891790/pretty-printing-of-numpy-array
    numpy.set_printoptions(suppress=True)  # suppress suppresses the use of scientific notation for small numbers:
    print numpy.mean(results, axis=0)
    print numpy.std(results, axis=0)    

    #Another library, but not free: https://store.continuum.io/cshop/iopro/

if __name__ == "__main__":
    #cProfile.run('main()') # if you want to do some profiling
    main()  



Windows 7:

Output:

open_with_python_csv: 1.57318865672 seconds
open_with_python_csv_list: 1.35567931732 seconds
open_with_python_csv_cast_as_float: 3.0801260484 seconds
open_with_numpy_loadtxt: 14.4942111801 seconds
open_with_pandas_read_csv: 0.371965476805 seconds
Numpy fromfile: 0.0130216095713 seconds
Numpy load: 0.0245501650124 seconds


To install all the libraries: Unofficial Windows Binaries for Python Extension Packages

Windows configuration:


Windows 7 SP1 x64 Ultimate
Python 2.7.6 x64
NumPy 1.7.1
Pandas 0.13.1

MSI Computer Corp. Notebook Computer GE70 0ND-033US; 9S7-175611-033 (with SSD Crucial M5)


Ubuntu 12.04:

Output:

open_with_python_csv: 1.93 seconds
open_with_python_csv_list: 1.52 seconds
open_with_python_csv_cast_as_float: 3.19 seconds
open_with_numpy_loadtxt: 7.47 seconds
open_with_pandas_read_csv: 0.35 seconds
Numpy fromfile: 0.01 seconds
Numpy load: 0.02 seconds


To install all the libraries:

sudo apt-get install python-pip
sudo pip install numpy
sudo pip install pandas


If the libraries are already installed but need to be upgraded:

sudo apt-get install python-pip
sudo pip install numpy --upgrade
sudo pip install pandas --upgrade


Ubuntu configuration:


Ubuntu 12.04 x64
Python 2.7.3
NumPy 1.8.1
Pandas 0.14.0


The benchmark can obviously be improved (comments/edits/etc. are welcome); I am sure there is plenty of room for improvement:


Make sure the current loading functions are well optimized
Try new functions/libraries: HDF5 for Python, PyTables, IOPro (non-free)
Generate the CSV files in the benchmark itself (so that one does not have to download them)
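As a sketch of the last point, the files could be generated inside the benchmark; the row count, value range, and delimiter below are arbitrary assumptions:

```python
import random

def generate_csv(filename, num_columns, num_rows=10000, delimiter=' '):
    # Write num_rows lines of random floats so the benchmark
    # no longer depends on downloading the CSV files.
    with open(filename, 'w') as f:
        for _ in range(num_rows):
            row = [str(random.uniform(0, 1000)) for _ in range(num_columns)]
            f.write(delimiter.join(row) + '\n')

generate_csv('1col.csv', 1)
generate_csv('3col.csv', 3)
```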


Comments


Interesting comparison of the results.

– Steve Barnes
Jul 3 '14 at 18:17

You saved me from the dark, lonely world of numpy.loadtxt.

– zkurtz
Jan 5 '15 at 16:32

Good answer. FYI, I ran the benchmark on a 2 GB csv file: open_with_python_csv: 153.789445 seconds, open_with_python_csv_list: 146.709768 seconds, open_with_python_csv_cast_as_float: 86.957046 seconds, open_with_numpy_loadtxt: 157.669637 seconds, open_with_pandas_read 77.

– M.R.
Jun 15 '15 at 19:38



Great, I learned a lot from this post tonight. However, your np.fromfile seems extremely fast. In my tests I found it only reads the first line, so to read the whole file you have to do a loop, which brings a 20k-row csv file back up to 60 s, while pandas takes 0.75 s. Are you reading the whole file with np.fromfile? I was also able to use np.fromstring, by loading the whole csv file, replacing the '\n's, and running np.fromstring. The string manipulation is fast, but the conversion to numbers is slow; this method took 2.6 seconds.

– Sonicsmooth
Jan 1 '17 at 6:50
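A minimal sketch of the np.fromstring approach the comment above describes; the sample file, the choice of comma as the join character, and the reshape width are all assumptions:

```python
import numpy

# Tiny stand-in file with 3 comma-separated columns
with open('tiny.csv', 'w') as f:
    f.write('9,52,1\n52,91,0\n')

# Read the whole file as one string and let numpy parse the flat number list
with open('tiny.csv') as f:
    text = f.read().replace('\n', ',').rstrip(',')

flat = numpy.fromstring(text, dtype=numpy.float64, sep=',')
data = flat.reshape(-1, 3)  # back to rows of 3 columns
```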

Trying this in 2018 on Windows 10 with Python 2.7.13, on a file with 100000 rows and 19 columns, testing only open_with_python_csv, open_with_python_csv_list and open_with_pandas_read_csv: the pandas method was not faster. The built-in csv took about 0.35, while pandas took about 0.43.

– Davos
Mar 19 '18 at 13:24

#2

I want to contribute another library here, which I stumbled upon when looking into a similar problem. I tested it with Franck Dernoncourt's benchmark code, and it beats Python's standard csv and pandas by a wide margin. I could not test it with numpy, because my test data was a 24,000-row csv with both number and string values. The library returns proper unicode strings.

It is called fastcsv and was developed by Masaya Suzuki. You can clone it from GitHub or install it from PyPI. The simplest way is:

pip install fastcsv


At http://pythonhosted.org/fastcsv/ you can see more benchmark results; for reading a csv only, let me repeat their result here:



Would be interesting to hear how it handles your data.

Comments


"TLDR: Deprecated. Use Python 3 and the standard csv module. So I do not recommend using this library." - a note from the author?

– Sebastian Palma
Mar 10 '18 at 13:39



@SebastianPalma原始问题指定python 2.7

– Davos
Mar 19 '18 at 12:56

#3

You have a number of options depending on the size and complexity of your data, and on what you will do with the resulting data:


The csv library that comes with Python by default.
NumPy - the numpy.fromfile function - reads into a NumPy array, so it is very powerful.
pandas - the pandas.io.parsers.read_csv function - reads into a pandas DataFrame, is very powerful, and can handle huge data sets.

The first is probably faster to import, while the others are more powerful. All are free and cross-platform. The first one is already part of your Python installation if you have a default install.
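A minimal sketch of the three options side by side; the filename, sample data, and comma delimiter below are made up for illustration:

```python
import csv
import numpy
import pandas

# A small comma-separated sample standing in for the question's 3col.csv
with open('sample3.csv', 'w') as f:
    f.write('9,52,1\n52,91,0\n91,135,0\n')

# 1. stdlib csv: rows come back as lists of strings, so cast manually
with open('sample3.csv') as f:
    rows = [[float(x) for x in row] for row in csv.reader(f)]

# 2. NumPy: parses straight into a numeric ndarray
arr = numpy.loadtxt('sample3.csv', delimiter=',')

# 3. pandas: parses into a DataFrame; .values yields the ndarray
df = pandas.read_csv('sample3.csv', header=None)
```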

#4

There is a new pydatatable package with a very fast csv reader based on the R data.table fread implementation. Once the file is read, you can simply run

import datatable as dt

pandas_dataframe = dt.fread(srcfile).to_pandas()


#5

I suggest keeping an eye on the official pandas documentation on IO. The options keep changing along the development cycle, new formats are added all the time, and benchmarks are published there as well.

#6

There is a new Python package for data mining called DaPy. It has a simple I/O API that is fast enough for you. According to the author's performance test, DaPy took 12.5 seconds to load a csv file with 2 million records, while pandas took 4 seconds. However, DaPy is built on some native Python data structures and is easier to use.

cmd: pip install DaPy 
>>> import DaPy as dp
>>> data = dp.read('file.csv')
>>> data.show()