python - Following hyperlink and "Filtered offsite request"
I know there are several related threads out there, and they have helped me a lot, but I still can't get all the way. At this point, running the code doesn't result in errors, but nothing appears in the csv file. I have the following Scrapy spider that starts on one webpage, follows a hyperlink, and scrapes the linked page:
from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class BbrItem(Item):
    year = Field()
    appraisaldate = Field()
    propertyvalue = Field()
    landvalue = Field()
    usage = Field()
    landsize = Field()
    address = Field()

class SpiderBbrTest(BaseSpider):
    name = 'spiderbbrtest'
    allowed_domains = ["http://boliga.dk"]
    start_urls = ['http://www.boliga.dk/bbr/resultater?sort=hus_nr_sort-a,etage-a,side-a&gade=septembervej&hus_nr=29&ipostnr=2730']

    def parse2(self, response):
        hxs = HtmlXPathSelector(response)
        bbrs2 = hxs.select("id('evaluationcontrol')/div[2]/div")
        bbrs = iter(bbrs2)
        next(bbrs)  # skip the first div
        for bbr in bbrs:
            item = BbrItem()
            item['year'] = bbr.select("table/tbody/tr[1]/td[2]/text()").extract()
            item['appraisaldate'] = bbr.select("table/tbody/tr[2]/td[2]/text()").extract()
            item['propertyvalue'] = bbr.select("table/tbody/tr[3]/td[2]/text()").extract()
            item['landvalue'] = bbr.select("table/tbody/tr[4]/td[2]/text()").extract()
            item['usage'] = bbr.select("table/tbody/tr[5]/td[2]/text()").extract()
            item['landsize'] = bbr.select("table/tbody/tr[6]/td[2]/text()").extract()
            item['address'] = response.meta['address']
            yield item

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        parturl = ''.join(hxs.select("id('searchresult')/tr/td[1]/a/@href").extract())
        url2 = ''.join(["http://www.boliga.dk", parturl])
        yield Request(url=url2,
                      meta={'address': hxs.select("id('searchresult')/tr/td[1]/a[@href]/text()").extract()},
                      callback=self.parse2)
I am trying to export the results to a csv file, but nothing ends up in the file. Running the code, however, doesn't produce any errors. I know it's a simplified example with just one URL, but it illustrates the problem.
I think the problem is that I am not telling Scrapy that I want to save the data in the parse2 method.
BTW, I run the spider with: scrapy crawl spiderbbr -o scraped_data.csv -t csv
You need to modify your yielded Request in parse to use parse2 as its callback.
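To see why the callback matters, here is a toy, Scrapy-free sketch of how the engine routes responses: items are only collected from whatever callback handles each response, so if parse never routes the follow-up request to parse2, the items built in parse2 never reach the exporter. (The tuple-based "request" and the crawl function below are invented for illustration; they are not Scrapy's API.)

```python
def parse(response):
    # stands in for: yield Request(url=..., callback=parse2)
    yield ("http://www.boliga.dk/some-page", parse2)

def parse2(response):
    # stands in for building and yielding a BbrItem
    yield {"address": response}

def crawl(start_callback, start_response):
    """Toy engine: follow requests, collect items from callbacks."""
    items = []
    pending = [(start_response, start_callback)]
    while pending:
        response, callback = pending.pop()
        for result in callback(response):
            if isinstance(result, tuple):   # a follow-up request
                url, cb = result
                pending.append((url, cb))
            else:                           # a scraped item
                items.append(result)
    return items

print(crawl(parse, "start-page"))
# [{'address': 'http://www.boliga.dk/some-page'}]
```

If parse yielded no request pointing at parse2, crawl would return an empty list, which mirrors the empty csv file you are seeing.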
EDIT: allowed_domains shouldn't include the http prefix, e.g.:

allowed_domains = ["boliga.dk"]

Try that and see if your spider still runs correctly, instead of leaving allowed_domains blank.
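The "Filtered offsite request" message comes from the offsite middleware comparing each request's hostname against allowed_domains. Here is a simplified illustration of that comparison (not Scrapy's actual implementation, just the idea): with the scheme included, no hostname can ever match, so every request is filtered.

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    # Roughly mimics an offsite check: the request's hostname must equal
    # an allowed domain or be a subdomain of one. Simplified sketch only.
    host = urlparse(url).hostname or ''
    return not any(host == d or host.endswith('.' + d) for d in allowed_domains)

# 'www.boliga.dk' never matches 'http://boliga.dk', so the request is filtered:
print(is_offsite("http://www.boliga.dk/bbr/x", ["http://boliga.dk"]))  # True
# Without the scheme, www.boliga.dk is a subdomain of boliga.dk and is allowed:
print(is_offsite("http://www.boliga.dk/bbr/x", ["boliga.dk"]))         # False
```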