Can't load large file (~2 GB) using Pandas or Blaze in Python
I have a file with >5 million rows and 20 fields. When I open it in pandas, I get an out-of-memory error:

pandas.parser.CParserError: Error tokenizing data. C error: out of memory

I have read posts on similar issues and discovered Blaze, then tried the following three methods (.data, .csv, .table); apparently none of them worked.
# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import re
import numpy as np
import sys
import blaze as bz

reload(sys)
sys.setdefaultencoding('utf-8')

# gave out of memory error
'''data = pd.read_csv('file.csv', header=0, encoding='utf-8', low_memory=False)
df = DataFrame(data)
print df.shape
print df.head'''

data = bz.data('file.csv')

# tried the following too, no luck
'''data = bz.csv('file.csv')
data = bz.table('file.csv')'''

print data
print data.head(5)
Output:

_1
_1.head(5)
[Finished in 1.0s]
Blaze
For the bz.data(...) object you'll have a lazy result; it loads data as needed. If you are at a terminal and type in
>>> data
you will get the head repr-ed out to the screen. If you need to use the print function, try
>>> print(bz.compute(data.head(5)))
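Putting the pieces together, here is a minimal sketch of the lazy workflow described above, assuming Blaze is installed and file.csv is the same file (the column names are unknown, so only head() is shown):

>>> import blaze as bz
>>> data = bz.data('file.csv')   # builds a symbolic expression; nothing is read yet
>>> expr = data.head(5)          # still symbolic, no I/O happens here
>>> print(bz.compute(expr))      # compute() actually reads the file and evaluates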
dask.dataframe
You might also consider using dask.dataframe, which has a similar (though subsetted) API to pandas:
>>> import dask.dataframe as dd
>>> data = dd.read_csv('file.csv', header=0, encoding='utf-8')
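Like Blaze, dask.dataframe is lazy: read_csv scans the file in blocks, and most operations build a task graph that only runs when you ask for a concrete result. A small usage sketch under that assumption (head() computes eagerly for convenience; len() forces a streaming pass over the whole file rather than loading it into memory):

>>> data.head(5)   # computes eagerly and returns a small pandas DataFrame
>>> len(data)      # counts the rows, processing the file block by block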