Scrapy python : unicode links error

June 15, 2013

link encoding

When scraping a site, Scrapy extracts links containing &amd and throws an exception: "Do not instantiate Link objects with unicode urls. Assuming utf-8 encoding (which could be wrong)". How can I fix this error?

I had the same problem with the character → inserted in links. I found this related commit on GitHub and used its advice to write a file link_extractors.py with:

    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.utils.response import get_base_url


    class CustomLinkExtractor(SgmlLinkExtractor):
        """Need this to fix the encoding error."""

        def extract_links(self, response):
            base_url = None
            if self.restrict_xpaths:
                hxs = HtmlXPathSelector(response)
                base_url = get_base_url(response)
                # Join the HTML fragments selected by every restrict_xpaths expression.
                body = u''.join(f for x in self.restrict_xpaths
                                for f in hxs.select(x).extract())
                try:
                    # Prefer the encoding declared by the response...
                    body = body.encode(response.encoding)
                except UnicodeEncodeError:
                    # ...and fall back to UTF-8 when it cannot represent a character.
                    body = body.encode('utf-8')
            else:
                body = response.body

            links = self._extract_links(body, response.url, response.encoding, base_url)
            links = self._process_links(links)
            return links
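The part of the extractor that avoids the error is the encode-with-fallback step. Here is a minimal standalone sketch of that step alone, written for Python 3 and independent of Scrapy. The function name `encode_with_fallback` and the sample strings are made up for illustration and are not part of the original answer.

```python
def encode_with_fallback(text, encoding):
    """Encode text with the declared encoding, falling back to UTF-8.

    Hypothetical helper: it mirrors the try/except in extract_links,
    but is not part of Scrapy itself.
    """
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        # The declared encoding cannot represent some character (e.g. "→"),
        # so use UTF-8, which can encode any Unicode character.
        return text.encode("utf-8")


# Plain ASCII survives the declared encoding unchanged.
plain = encode_with_fallback("/offers?page=2", "latin-1")

# "→" (U+2192) does not exist in Latin-1, so the UTF-8 fallback is used.
arrow = encode_with_fallback("/offers\u2192next", "latin-1")
```

The fallback only prevents the encoding failure. If the page really uses some other encoding, the resulting bytes may still not match what the server expects.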

Afterwards I used it in spiders.py:

    rules = (
        Rule(CustomLinkExtractor(allow=('/gp/offer-listing*', ),
                                 restrict_xpaths=("//li[contains(@class,'a-last')]/a", )),
             callback='parse_start_url',
             follow=True,
             ),
    )
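One detail of the rule is easy to misread: Scrapy's link extractors treat each `allow` entry as a regular expression searched against the URL, not as a shell-style glob. The sketch below uses only the standard `re` module to show what the pattern from the rule matches. The `is_allowed` helper and the example.com URLs are hypothetical and used only for illustration.

```python
import re

# The pattern from the rule above. As a regex, the trailing "*" applies
# to the final "g" ("zero or more g"), not to "anything after this prefix".
pattern = re.compile(r"/gp/offer-listing*")


def is_allowed(url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


listing = is_allowed("http://www.example.com/gp/offer-listing/B000TEST")
truncated = is_allowed("http://www.example.com/gp/offer-listin")
product = is_allowed("http://www.example.com/gp/product/B000TEST")
```

Because the pattern is searched rather than anchored, `'/gp/offer-listing'` without the `*` would select the same offer-listing links in practice.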
