python - BeautifulSoup returns unexpected extra spaces
I am trying to grab text from HTML documents with BeautifulSoup. In a case that is very relevant to me, it produces a strange and interesting result: after a certain point, the soup is full of extra spaces within the text (a space separates every letter from the following one). I searched the web for a reason, but I only found reports of the opposite bug (no spaces at all).
Do you have any suggestion or hint on why this happens, and how to solve the problem?
This is the basic code I created:
    from bs4 import BeautifulSoup
    import urllib2

    # Download the page and parse it with the default parser
    html = urllib2.urlopen("http://www.beppegrillo.it")
    prova = html.read()
    soup = BeautifulSoup(prova)
    print soup
And this is a line taken from the results, the line where the problem starts to appear:
value=\"giuseppe labbate ogm? non vorremmo nuovi uccelli chiamati lontre\"><input onmouseover=\"tip('<cen t e r c l a s s = \ \ ' t i t l e _ v i d e o \ \ ' > < b > g i u s e p p e l a b b a t e o g m ? n o n v o r r e m m o n u o v i u c c e l l i c h i a m a t i l o n t r e <
I believe this is a bug in lxml's HTML parser. Try:
    from bs4 import BeautifulSoup
    import urllib2

    html = urllib2.urlopen("http://www.beppegrillo.it")
    prova = html.read()
    # Rewrite the declared charset before parsing
    soup = BeautifulSoup(prova.replace('iso-8859-1', 'utf-8'))
    print soup
This is a workaround for the problem. I believe the issue was fixed in lxml 3.0 alpha 2 and lxml 2.3.6, so it may be worth checking whether you need to upgrade to a newer version.
If you want more info on the bug, it was filed here:
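To check which lxml you have, something like the following should work. It is only a rough sketch: `has_fix` is a helper name I made up (it is not part of lxml or BeautifulSoup), the 2.3.6 threshold is the one I mentioned above, and it does not try to tell 3.0 pre-releases apart.

```python
# Sketch: report whether the installed lxml is at least 2.3.6.
# has_fix() is a hypothetical helper, not an lxml or bs4 function.

def has_fix(version):
    """Return True if an lxml version tuple is 2.3.6 or newer."""
    return tuple(version[:3]) >= (2, 3, 6)

try:
    from lxml import etree
    installed = etree.LXML_VERSION  # version as a tuple of integers
except ImportError:
    installed = None  # lxml is not installed at all

if installed is None:
    print("lxml is not installed")
elif has_fix(installed):
    print("lxml %s should include the fix" % (installed,))
else:
    print("lxml %s predates the fix, consider upgrading" % (installed,))
```

If it reports an older version, upgrading lxml is probably cleaner than keeping the `replace` workaround.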
https://bugs.launchpad.net/beautifulsoup/+bug/972466
Hope this helps,
Hayden