python - Following a hyperlink and "Filtered offsite request" -


I know there are several related threads out there, and they have helped me a lot, but I still can't get all the way. At this point, running the code doesn't result in errors, but nothing ends up in my csv file. I have the following Scrapy spider that starts on one webpage, follows a hyperlink, and scrapes the linked page:

from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class BbrItem(Item):
    year = Field()
    appraisaldate = Field()
    propertyvalue = Field()
    landvalue = Field()
    usage = Field()
    landsize = Field()
    address = Field()

class SpiderBBRTest(BaseSpider):
    name = 'spiderbbrtest'
    allowed_domains = ["http://boliga.dk"]
    start_urls = ['http://www.boliga.dk/bbr/resultater?sort=hus_nr_sort-a,etage-a,side-a&gade=septembervej&hus_nr=29&ipostnr=2730']

    def parse2(self, response):
        hxs = HtmlXPathSelector(response)
        bbrs2 = hxs.select("id('evaluationcontrol')/div[2]/div")
        bbrs = iter(bbrs2)
        next(bbrs)  # skip the first div
        for bbr in bbrs:
            item = BbrItem()
            item['year'] = bbr.select("table/tbody/tr[1]/td[2]/text()").extract()
            item['appraisaldate'] = bbr.select("table/tbody/tr[2]/td[2]/text()").extract()
            item['propertyvalue'] = bbr.select("table/tbody/tr[3]/td[2]/text()").extract()
            item['landvalue'] = bbr.select("table/tbody/tr[4]/td[2]/text()").extract()
            item['usage'] = bbr.select("table/tbody/tr[5]/td[2]/text()").extract()
            item['landsize'] = bbr.select("table/tbody/tr[6]/td[2]/text()").extract()
            item['address'] = response.meta['address']
            yield item

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        parturl = ''.join(hxs.select("id('searchresult')/tr/td[1]/a/@href").extract())
        url2 = ''.join(["http://www.boliga.dk", parturl])
        yield Request(url=url2, meta={'address': hxs.select("id('searchresult')/tr/td[1]/a[@href]/text()").extract()}, callback=self.parse2)

I am trying to export the results to a csv file, but nothing appears in the file. Running the code, however, doesn't produce any errors. I know this is a simplified example with only one URL, but it illustrates the problem.

I think the problem is that I am not telling Scrapy that I want it to save the data produced in the parse2 method.

BTW, I run the spider with: scrapy crawl spiderbbr -o scraped_data.csv -t csv

You need to modify the yielded Request in parse to use parse2 as its callback.
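For reference, a minimal sketch of what that yield looks like, reusing the XPath and URL-building lines from your own parse method:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    parturl = ''.join(hxs.select("id('searchresult')/tr/td[1]/a/@href").extract())
    url2 = ''.join(["http://www.boliga.dk", parturl])
    # callback=self.parse2 hands the linked page to parse2, which yields the items
    yield Request(url=url2,
                  meta={'address': hxs.select("id('searchresult')/tr/td[1]/a[@href]/text()").extract()},
                  callback=self.parse2)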

Edit: allowed_domains shouldn't include the http prefix, e.g.:

allowed_domains = ["boliga.dk"] 

Try that and see if your spider still runs correctly, instead of leaving allowed_domains blank.
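Putting the fix in context, here is a minimal sketch of the spider header (names taken from your code; only allowed_domains changes). Scrapy's offsite middleware compares the entries in allowed_domains against each request's hostname, so an entry like "http://boliga.dk" never matches www.boliga.dk and the follow-up requests get dropped as "Filtered offsite request":

from scrapy.spider import BaseSpider

class SpiderBBRTest(BaseSpider):
    name = 'spiderbbrtest'
    # domain only, no scheme -- with "http://boliga.dk" the offsite
    # middleware never matches www.boliga.dk and filters the requests
    allowed_domains = ["boliga.dk"]
    start_urls = ['http://www.boliga.dk/bbr/resultater?sort=hus_nr_sort-a,etage-a,side-a&gade=septembervej&hus_nr=29&ipostnr=2730']

Then export as before with -o scraped_data.csv -t csv.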

