Scrapy Python: unicode links error - link encoding
When scraping a site, Scrapy extracts links containing &amp; and throws an exception: "Do not instantiate Link objects with unicode urls. Assuming utf-8 encoding (which could be wrong)". How can I fix this error?
I had the same problem with the character → inserted in links. I found this related commit on GitHub and used its advice to write a file link_extractors.py with:
    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.utils.response import get_base_url


    class CustomLinkExtractor(SgmlLinkExtractor):
        """Need this to fix the encoding error."""

        def extract_links(self, response):
            base_url = None
            if self.restrict_xpaths:
                hxs = HtmlXPathSelector(response)
                base_url = get_base_url(response)
                # Join the HTML fragments selected by restrict_xpaths
                body = u''.join(f
                                for x in self.restrict_xpaths
                                for f in hxs.select(x).extract())
                try:
                    body = body.encode(response.encoding)
                except UnicodeEncodeError:
                    # The declared encoding cannot represent every
                    # character (e.g. →), so fall back to UTF-8
                    body = body.encode('utf-8')
            else:
                body = response.body

            links = self._extract_links(body, response.url, response.encoding, base_url)
            links = self._process_links(links)
            return links
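The key step in the extractor above is the try/except around the encode call. It can be sketched without Scrapy as follows. This is only an illustration in Python 3 syntax (the answer's code targets the older Python 2 Scrapy API); the helper name `encode_with_fallback` and the sample HTML fragment are made up for the example and are not part of Scrapy.

```python
def encode_with_fallback(text, declared_encoding):
    """Encode text with the declared encoding, falling back to UTF-8.

    Hypothetical helper that mirrors the try/except in the custom
    link extractor; it is not a Scrapy function.
    """
    try:
        return text.encode(declared_encoding)
    except UnicodeEncodeError:
        # The declared encoding has no byte sequence for some character,
        # so use UTF-8, which can represent any Unicode code point.
        return text.encode("utf-8")


# Made-up fragment containing the arrow character (U+2192) from the answer.
fragment = '<a href="/page?x=1">next \u2192</a>'

# latin-1 cannot represent U+2192, so the UTF-8 fallback is used here.
encoded_arrow = encode_with_fallback(fragment, "latin-1")

# Text that latin-1 can represent is encoded with latin-1 as declared.
encoded_plain = encode_with_fallback("caf\u00e9", "latin-1")
```

With the fallback, the arrow ends up as its UTF-8 bytes instead of raising an error, while pages whose text fits the declared encoding are left alone.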
Afterwards I used it in spiders.py:
    rules = (
        Rule(CustomLinkExtractor(allow=('/gp/offer-listing*', ),
                                 restrict_xpaths=("//li[contains(@class,'a-last')]/a", )),
             callback='parse_start_url',
             follow=True,
             ),
    )
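One detail worth checking in the rule above: the `allow` patterns are regular expressions, not shell globs, so the trailing `*` applies to the final `g` rather than acting as a wildcard. Here is a small sketch of how that pattern behaves, assuming the extractor tests each pattern against the URL with a regex search. The example.com URLs and the `is_allowed` helper are hypothetical.

```python
import re

# The same pattern string that is passed to allow= in the rule.
allow_pattern = re.compile(r'/gp/offer-listing*')


def is_allowed(url):
    """Hypothetical helper: True if the pattern is found anywhere in url."""
    return allow_pattern.search(url) is not None


# Made-up URLs, for illustration only.
listing_url = "http://example.com/gp/offer-listing/B000TEST"
truncated_url = "http://example.com/gp/offer-listin"
product_url = "http://example.com/gp/product/B000TEST"

results = {
    url: is_allowed(url)
    for url in (listing_url, truncated_url, product_url)
}
```

Since a search matches anywhere in the URL, `'/gp/offer-listing'` without the `*` would select the same offer-listing pages; the `*` only makes the last `g` optional or repeatable.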