Scrapy (Python): Unicode links error


link encoding

When scraping a site, Scrapy extracts links containing &amd and throws an exception: "Do not instantiate Link objects with unicode urls. Assuming utf-8 encoding (which could be wrong)". How can I fix this error?

I had the same problem with a character inserted into links. I found a related commit on GitHub and, following its advice, wrote a file link_extractors.py with:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.response import get_base_url


class CustomLinkExtractor(SgmlLinkExtractor):
    """Need this to fix the encoding error."""

    def extract_links(self, response):
        base_url = None
        if self.restrict_xpaths:
            hxs = HtmlXPathSelector(response)
            base_url = get_base_url(response)
            # Join the HTML of every node matched by restrict_xpaths.
            body = u''.join(f for x in self.restrict_xpaths
                            for f in hxs.select(x).extract())
            # Encode to bytes; fall back to UTF-8 if the response
            # encoding cannot represent a character.
            try:
                body = body.encode(response.encoding)
            except UnicodeEncodeError:
                body = body.encode('utf-8')
        else:
            body = response.body

        links = self._extract_links(body, response.url, response.encoding, base_url)
        links = self._process_links(links)
        return links
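The part of extract_links that actually avoids the error is the encode-with-fallback step. As a minimal sketch, here it is in isolation using only the standard library; the helper name encode_with_fallback and the sample strings are made up for illustration and are not part of Scrapy:

```python
def encode_with_fallback(text, encoding):
    """Encode text with the declared encoding, falling back to UTF-8.

    Hypothetical helper that mirrors the try/except in extract_links.
    """
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        # The declared encoding cannot represent some character.
        return text.encode("utf-8")


# Latin-1 can represent U+00E9, so the declared encoding is used.
latin_ok = encode_with_fallback(u"/caf\u00e9", "latin-1")

# The euro sign (U+20AC) is not in Latin-1, so UTF-8 is used instead.
needs_fallback = encode_with_fallback(u"/price\u20ac", "latin-1")

print(repr(latin_ok))
print(repr(needs_fallback))
```

Either way the result is a byte string rather than a unicode string, which is what the link extraction step receives in the class above.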

Afterwards I used the custom extractor in spiders.py:

rules = (
    Rule(CustomLinkExtractor(allow=('/gp/offer-listing*', ),
                             restrict_xpaths=("//li[contains(@class,'a-last')]/a", )),
         callback='parse_start_url', follow=True,
         ),
)
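One detail of the rule worth checking is the allow pattern. Assuming the extractor treats each allow entry as a regular expression searched against the URL, the trailing * in '/gp/offer-listing*' applies to the final g and is not a shell-style wildcard. A small standard-library sketch, with hypothetical example.com URLs:

```python
import re

# Same pattern as in the rule; as a regex, "g*" means "zero or more g".
ALLOW = re.compile(r"/gp/offer-listing*")


def is_allowed(url):
    """Return True if the pattern is found anywhere in the URL."""
    return ALLOW.search(url) is not None


# Hypothetical URLs, used only to illustrate the matching behaviour.
for url in (
    "https://example.com/gp/offer-listing/B000000000",
    "https://example.com/gp/offer-listin",
    "https://example.com/gp/product/B000000000",
):
    print(url, is_allowed(url))
```

The pattern still matches the offer-listing pages, so the rule works as written; a plain '/gp/offer-listing' would match the same pages with less surprise.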
