python - BeautifulSoup returns unexpected extra spaces


I am trying to grab text from HTML documents with BeautifulSoup. In a case that is very relevant to me, it produces a strange and interesting result: after a certain point, the soup is full of extra spaces within the text (a space separates every letter from the following one). I searched the web for a reason, but I only found reports of the opposite bug (no spaces at all).

Do you have any suggestion or hint on why this happens, and how to solve the problem?

This is the basic code I created:

    from bs4 import BeautifulSoup
    import urllib2

    html = urllib2.urlopen("http://www.beppegrillo.it")
    prova = html.read()
    soup = BeautifulSoup(prova)
    print soup

And this is a line taken from the results, the line where the problem starts to appear:

value=\"giuseppe labbate ogm? non vorremmo nuovi uccelli chiamati lontre\"><input onmouseover=\"tip('<cen t e r c l a s s = \ \ ' t i t l e _ v i d e o \ \ ' > < b > g i u s e p p e l a b b a t e o g m ? n o n v o r r e m m o n u o v i u c c e l l i c h i a m a t i l o n t r e <

I believe this is a bug in lxml's HTML parser. Try:

    from bs4 import BeautifulSoup
    import urllib2

    html = urllib2.urlopen("http://www.beppegrillo.it")
    prova = html.read()
    # Rewrite the declared charset before the markup reaches the parser
    soup = BeautifulSoup(prova.replace('iso-8859-1', 'utf-8'))
    print soup

This is a workaround for the problem. I believe the issue was fixed in lxml 3.0 alpha 2 and lxml 2.3.6, so it is worth checking whether you need to upgrade to a newer version.
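On Python 3, `read()` returns bytes, so the replacement has to be done on bytes as well. The following is only a minimal sketch of the same idea: the helper name `relabel_charset` is hypothetical, and unlike the plain `replace` call it matches the label case-insensitively. It only rewrites the declared label and does not transcode the bytes, so it is worth checking accented characters in the result.

```python
import re

# Matches the charset label wherever it appears in the raw markup
# (case-insensitive, which is an assumption beyond the original workaround).
_DECLARED = re.compile(rb"iso-8859-1", re.IGNORECASE)


def relabel_charset(raw: bytes, new_label: bytes = b"utf-8") -> bytes:
    """Return raw with every 'iso-8859-1' label replaced by new_label.

    Hypothetical helper; it changes only the label, not the encoded bytes.
    """
    return _DECLARED.sub(new_label, raw)


if __name__ == "__main__":
    sample = b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
    print(relabel_charset(sample))
```

The bytes returned by `urllib.request.urlopen(url).read()` could be passed through `relabel_charset` before being handed to `BeautifulSoup`, in the same way as the Python 2 snippet does.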

If you want more information, the bug is filed here:

https://bugs.launchpad.net/beautifulsoup/+bug/972466
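To see whether an installed lxml predates the 2.3.6 release mentioned above, a quick check might look like the sketch below. The function name `predates_fix` is hypothetical, and the comparison is a simplification: it only looks at the first three version numbers and does not handle the 3.0 pre-release builds.

```python
def predates_fix(version, fixed_in=(2, 3, 6)):
    """Return True if the first three version numbers are older than fixed_in.

    Simplification: 3.0 alpha builds are not treated specially here.
    """
    return tuple(version[:3]) < tuple(fixed_in)


if __name__ == "__main__":
    try:
        from lxml import etree  # only needed to read the installed version
    except ImportError:
        print("lxml is not installed")
    else:
        # LXML_VERSION is a tuple of integers describing the installed release
        print(etree.LXML_VERSION, predates_fix(etree.LXML_VERSION))
```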

Hope this helps,

Hayden

