html - Python Scrapy, parsing multiple child objects into the same item?

- August 15, 2014

For a non-profit college assignment I'm trying to scrape the website www.rateyourmusic.com. I've been able to scrape some things, but I have encountered a problem when trying to scrape multiple children of an HTML element.

Specifically, I'm trying to scrape the genre of an artist. Many artists have multiple genres, and I can't scrape all of them. Here is my parsing method:

    def parse_dir_contents(self, response):
        item = rateyourmusicartist()

        # get genres of artist
        for sel in response.xpath('//a[@class="genre"]'):
            item['genre'] = sel.xpath('text()').extract()

        yield item

There are multiple //a[@class="genre"] matches, each representing a genre, and I'd like to put them all in one string separated by ', '.

Is there an easy way to do this? Here is a sample URL from the site I'm scraping: http://rateyourmusic.com/artist/kanye_west.

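A simple str.join() trick works here. Select the text nodes of every matching link in a single XPath, then join the extracted list: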

", ".join(response.xpath('//a[@class="genre"]/text()').extract())

Demo (from the Scrapy shell):

    $ scrapy shell http://rateyourmusic.com/artist/kanye_west
    In [1]: ", ".join(response.xpath('//a[@class="genre"]/text()').extract())
    Out[1]: u'hip hop, pop rap, experimental hip hop, hardcore hip hop, electropop, synthpop'
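To see how the join might fit into the callback from the question, here is a minimal sketch that runs without network access. `FakeResponse` and `FakeSelectorList` are hypothetical stand-ins invented for this example (they are not part of Scrapy), a plain dict replaces the question's item class, and the whitespace stripping is an extra assumption that the original answer does not include. In a real spider the function would be a method taking `self`.

```python
class FakeSelectorList:
    """Hypothetical stand-in for the list-like object response.xpath() returns."""

    def __init__(self, values):
        self._values = values

    def extract(self):
        # Mirrors the idea of .extract(): return the matches as a list of strings.
        return list(self._values)


class FakeResponse:
    """Hypothetical stand-in for a Scrapy response, so the sketch runs offline."""

    def __init__(self, genres):
        self._genres = genres

    def xpath(self, query):
        # Ignores the query and returns canned genre strings.
        return FakeSelectorList(self._genres)


def parse_dir_contents(response):
    # A plain dict stands in for the question's item class.
    item = {}
    genres = response.xpath('//a[@class="genre"]/text()').extract()
    # Assign once, after collecting every match, instead of overwriting
    # item['genre'] on each loop iteration as the question's code does.
    item['genre'] = ", ".join(g.strip() for g in genres if g.strip())
    yield item


items = list(parse_dir_contents(FakeResponse(["hip hop", " pop rap ", ""])))
print(items[0]['genre'])  # hip hop, pop rap
```

The question's loop kept only the last match because each iteration replaced the field. Extracting all text nodes first and assigning the joined string once avoids that.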

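Note that if you use Item Loaders, you can make this cleaner:

    # Newer Scrapy releases expose the same processor as itemloaders.processors.Join.
    from scrapy.loader.processors import Join

    loader = myitemloader(response=response)
    # Join(", ") collapses every extracted genre into one comma-separated string.
    loader.add_xpath("genre", '//a[@class="genre"]/text()', Join(", "))

    yield loader.load_item()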
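For intuition about what such a processor does with the extracted values, here is a rough sketch. `JoinSketch` is a hypothetical, simplified stand-in written for illustration, not Scrapy's own class, and it assumes the processor is simply called with the list of extracted strings.

```python
class JoinSketch:
    """Simplified illustration of a join-style processor (not Scrapy's code)."""

    def __init__(self, separator=" "):
        self.separator = separator

    def __call__(self, values):
        # Collapse the list of extracted strings into a single string.
        return self.separator.join(values)


join_genres = JoinSketch(", ")
genre = join_genres(["hip hop", "pop rap", "electropop"])
print(genre)  # hip hop, pop rap, electropop
```

Because the processor is just a callable, the separator is chosen once when the processor is constructed and then reused for every value list passed through it.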
