Faster parsing with Python
I'm trying to parse data from one web page. The site allows (according to its robots.txt) sending 2000 requests per minute.
The problem is that everything I've tried is too slow, even though the server's response itself is quite quick.
    from multiprocessing.pool import ThreadPool as Pool
    import datetime
    import lxml.html as lh
    from bs4 import BeautifulSoup
    import requests

    with open('products.txt') as f:
        lines = f.readlines()

    def update(url):
        html = requests.get(url).content        # 3 seconds
        doc = lh.fromstring(html)               # 12 seconds (with the line below commented out)
        soup = BeautifulSoup(html)              # 12 seconds (with the line above commented out)

    pool = Pool(10)
    for line in lines[0:100]:
        pool.apply_async(update, args=(line[:-1],))
    pool.close()
    now = datetime.datetime.now()
    pool.join()
    print(datetime.datetime.now() - now)
As the comments in the code show: when I only run html = requests.get(url) for 100 URLs, the time is great - under 3 seconds.
The problem appears when I want to use a parser - the preprocessing of the HTML costs 10 seconds or even more.
What would you recommend to lower that time?
EDIT: I tried using SoupStrainer - it is faster, but nothing noticeable - about 9 seconds:
    from bs4 import BeautifulSoup, SoupStrainer

    html = requests.get(url).content
    product = SoupStrainer('div', {'class': ['shopspr', 'bottom']})
    soup = BeautifulSoup(html, 'lxml', parse_only=product)
Depending on what you need to extract from the pages, you may not need the full DOM. You could perhaps get away with HTMLParser (html.parser in Python 3), which should be faster.
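Not from the answer itself, but as a rough illustration of that idea: a stdlib HTMLParser subclass that only collects text from divs whose class is 'shopspr' or 'bottom' (borrowing those class names from the SoupStrainer snippet in the question), instead of building a full tree. Everything else here is an assumption for the sketch.

    # Sketch: extract text from the target <div>s without building a full DOM.
    from html.parser import HTMLParser  # Python 3; on Python 2: from HTMLParser import HTMLParser

    class ProductParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.depth = 0     # > 0 while inside a matching <div>
            self.chunks = []   # text collected inside matching <div>s

        def handle_starttag(self, tag, attrs):
            if self.depth:
                if tag == 'div':          # track nested divs so we know when we leave
                    self.depth += 1
                return
            if tag == 'div':
                classes = (dict(attrs).get('class') or '').split()
                if 'shopspr' in classes or 'bottom' in classes:
                    self.depth = 1

        def handle_endtag(self, tag):
            if self.depth and tag == 'div':
                self.depth -= 1

        def handle_data(self, data):
            if self.depth:
                self.chunks.append(data.strip())

    def extract(html_text):
        # html_text must be a decoded str, e.g. requests.get(url).text
        parser = ProductParser()
        parser.feed(html_text)
        return [c for c in parser.chunks if c]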
I would also decouple fetching the pages from parsing them, e.g. use two pools: one fetching the pages and filling a queue, the other taking pages from the queue and parsing them. This uses the available resources a bit better, but it won't be a big speed-up. As a side effect, should the server start serving pages with a bigger delay, a big queue would still keep the parsing workers busy.
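A rough sketch of that two-pool setup, under a few assumptions of my own: the bounded queue size, the pool sizes, and the parse() placeholder are all illustrative, not part of the answer.

    # Sketch: one thread pool fetches pages into a queue, another parses them.
    from multiprocessing.pool import ThreadPool
    from queue import Queue   # on Python 2: from Queue import Queue
    import requests

    with open('products.txt') as f:
        urls = [line.strip() for line in f]

    page_queue = Queue(maxsize=200)   # bounded, so fetchers can't run far ahead of parsers

    def parse(html):
        # placeholder for the real parsing step (lxml, BeautifulSoup,
        # or the HTMLParser sketch above)
        pass

    def fetch(url):
        page_queue.put(requests.get(url).content)

    def parse_worker(_):
        while True:
            html = page_queue.get()
            if html is None:          # sentinel: no more pages coming
                break
            parse(html)

    NUM_PARSERS = 4
    fetch_pool = ThreadPool(10)           # network-bound: many threads
    parse_pool = ThreadPool(NUM_PARSERS)  # parsing side

    parsers = parse_pool.map_async(parse_worker, range(NUM_PARSERS))
    fetch_pool.map(fetch, urls)           # blocks until all pages are fetched
    for _ in range(NUM_PARSERS):
        page_queue.put(None)              # tell each parse worker to stop
    parsers.wait()
    fetch_pool.close()
    parse_pool.close()

Note that in CPython the parsing threads still share the GIL, so if the parsing itself is the bottleneck, a process pool on the parsing side may give a bigger win than threads.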