Faster parsing with Python
I'm trying to parse data from one web page. The site allows (according to its robots.txt) sending 2000 requests per minute.
The problem is that everything I've tried is too slow, even though the server's response itself is quite quick.
    from multiprocessing.pool import ThreadPool as Pool
    import datetime
    import lxml.html as lh
    from bs4 import BeautifulSoup
    import requests

    with open('products.txt') as f:
        lines = f.readlines()

    def update(url):
        html = requests.get(url).content        # 3 seconds
        doc = lh.fromstring(html)               # 12 seconds (with the line below commented out)
        soup = BeautifulSoup(html)              # 12 seconds (with the line above commented out)

    pool = Pool(10)
    for line in lines[0:100]:
        pool.apply_async(update, args=(line[:-1],))
    pool.close()
    now = datetime.datetime.now()
    pool.join()
    print(datetime.datetime.now() - now)
As the comments in the code show: when I only run html = requests.get(url) for 100 URLs, the time is great - under 3 seconds.
The problem appears when I want to use a parser - the preprocessing of the HTML costs 10 seconds or even more.
What would you recommend to lower that time?
EDIT: I tried using SoupStrainer - it is faster, but nothing noticeable - about 9 seconds:
    from bs4 import BeautifulSoup, SoupStrainer

    html = requests.get(url).content
    product = SoupStrainer('div', {'class': ['shopspr', 'bottom']})
    soup = BeautifulSoup(html, 'lxml', parse_only=product)
Depending on what you need to extract from the pages, you may not need the full DOM. You could perhaps get away with HTMLParser (html.parser in Python 3), which should be faster.
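Not from the answer itself, but as a rough illustration of that idea: a stdlib HTMLParser subclass that only collects text from divs whose class is 'shopspr' or 'bottom' (borrowing those class names from the SoupStrainer snippet in the question), instead of building a full tree. Everything else here is an assumption for the sketch.

    # Sketch: extract text from the target <div>s without building a full DOM.
    from html.parser import HTMLParser  # Python 3; on Python 2: from HTMLParser import HTMLParser

    class ProductParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.depth = 0     # > 0 while inside a matching <div>
            self.chunks = []   # text collected inside matching <div>s

        def handle_starttag(self, tag, attrs):
            if self.depth:
                if tag == 'div':          # track nested divs so we know when we leave
                    self.depth += 1
                return
            if tag == 'div':
                classes = (dict(attrs).get('class') or '').split()
                if 'shopspr' in classes or 'bottom' in classes:
                    self.depth = 1

        def handle_endtag(self, tag):
            if self.depth and tag == 'div':
                self.depth -= 1

        def handle_data(self, data):
            if self.depth:
                self.chunks.append(data.strip())

    def extract(html_text):
        # html_text must be a decoded str, e.g. requests.get(url).text
        parser = ProductParser()
        parser.feed(html_text)
        return [c for c in parser.chunks if c]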
I would also decouple fetching the pages from parsing them, e.g. use two pools: one fetching the pages and filling a queue, the other taking pages from the queue and parsing them. This uses the available resources a bit better, but it won't be a big speed-up. As a side effect, should the server start serving pages with a bigger delay, a big queue would still keep the parsing workers busy.
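A rough sketch of that two-pool setup, under a few assumptions of my own: the bounded queue size, the pool sizes, and the parse() placeholder are all illustrative, not part of the answer.

    # Sketch: one thread pool fetches pages into a queue, another parses them.
    from multiprocessing.pool import ThreadPool
    from queue import Queue   # on Python 2: from Queue import Queue
    import requests

    with open('products.txt') as f:
        urls = [line.strip() for line in f]

    page_queue = Queue(maxsize=200)   # bounded, so fetchers can't run far ahead of parsers

    def parse(html):
        # placeholder for the real parsing step (lxml, BeautifulSoup,
        # or the HTMLParser sketch above)
        pass

    def fetch(url):
        page_queue.put(requests.get(url).content)

    def parse_worker(_):
        while True:
            html = page_queue.get()
            if html is None:          # sentinel: no more pages coming
                break
            parse(html)

    NUM_PARSERS = 4
    fetch_pool = ThreadPool(10)           # network-bound: many threads
    parse_pool = ThreadPool(NUM_PARSERS)  # parsing side

    parsers = parse_pool.map_async(parse_worker, range(NUM_PARSERS))
    fetch_pool.map(fetch, urls)           # blocks until all pages are fetched
    for _ in range(NUM_PARSERS):
        page_queue.put(None)              # tell each parse worker to stop
    parsers.wait()
    fetch_pool.close()
    parse_pool.close()

Note that in CPython the parsing threads still share the GIL, so if the parsing itself is the bottleneck, a process pool on the parsing side may give a bigger win than threads.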