html - Python Scrapy, parsing multiple child objects into the same item?
For a non-profit college assignment I'm trying to scrape the website www.rateyourmusic.com. I've been able to scrape some things, but I have encountered a problem when trying to scrape multiple children of an HTML element.
Specifically, I'm trying to scrape the genre of an artist. Many artists have multiple genres, and I can't scrape all of them. Here is my parsing method:
def parse_dir_contents(self, response):
    item = rateyourmusicartist()
    # get the genres of the artist
    for sel in response.xpath('//a[@class="genre"]'):
        item['genre'] = sel.xpath('text()').extract()
    yield item
There are multiple //a[@class="genre"] XPath matches, each representing a genre, and I want to put them all into one string separated by ', '.
Is there an easy way to do this? Here is a sample URL from the site I'm scraping: http://rateyourmusic.com/artist/kanye_west.
A simple str.join() would do the trick:
", ".join(response.xpath('//a[@class="genre"]/text()').extract())
Demo (from the Scrapy shell):
$ scrapy shell http://rateyourmusic.com/artist/kanye_west
In [1]: ", ".join(response.xpath('//a[@class="genre"]/text()').extract())
Out[1]: u'hip hop, pop rap, experimental hip hop, hardcore hip hop, electropop, synthpop'
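The same select-then-join idea can be tried without Scrapy or network access. The following is a minimal sketch that swaps in the standard library's xml.etree.ElementTree (which supports a small XPath subset) in place of Scrapy's selectors; the SAMPLE markup and the join_genres helper are made up for illustration and are not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the artist page markup.
SAMPLE = """
<div>
  <a class="genre" href="/genre/a">Hip Hop</a>
  <a class="artist" href="/artist/x">Someone</a>
  <a class="genre" href="/genre/b">Pop Rap</a>
</div>
"""


def join_genres(markup, sep=", "):
    """Collect the text of every <a class="genre"> and join it into one string."""
    root = ET.fromstring(markup)
    # ElementTree's XPath subset supports the [@attr='value'] predicate.
    texts = [a.text.strip() for a in root.findall(".//a[@class='genre']") if a.text]
    return sep.join(texts)


genres = join_genres(SAMPLE)
print(genres)
```

The key point is the same as in the answer: select all matching nodes in one query instead of looping and overwriting the field, then join the resulting list once.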
Note that if you use Item Loaders, you can make this cleaner:
from scrapy.loader.processors import Join

loader = myitemloader(response=response)
loader.add_xpath("genre", '//a[@class="genre"]/text()', Join(", "))
yield loader.load_item()
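To make it clearer what a join-style processor contributes here, below is a rough standard-library sketch. JoinSketch and apply_processors are hypothetical stand-ins written for illustration, not Scrapy's own classes; the assumption is only that a processor is a callable that receives the list of extracted values and returns the processed result.

```python
class JoinSketch:
    """Hypothetical stand-in for a join-style processor (not Scrapy's class)."""

    def __init__(self, separator=" "):
        self.separator = separator

    def __call__(self, values):
        # Receives the list of extracted strings, returns one joined string.
        return self.separator.join(values)


def apply_processors(values, *processors):
    """Feed the extracted values through each processor in order."""
    for processor in processors:
        values = processor(values)
    return values


# Pretend these strings came back from the XPath query.
extracted = ["Hip Hop", "Pop Rap", "Electropop"]
genre = apply_processors(extracted, JoinSketch(", "))
print(genre)
```

Seen this way, passing a join processor to the loader is just a declarative way of writing the same ", ".join(...) call, kept out of the parsing method.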