python - BeautifulSoup returns unexpected extra spaces

May 15, 2011

I am trying to grab some text from HTML documents with BeautifulSoup. In a case that is very relevant to me, it produces a strange and interesting result: after a certain point, the soup is full of extra spaces within the text (a space separates every letter from the following one). I tried searching the web to find the reason for that, but I only found reports of the opposite bug (no spaces at all).

Do you have any suggestion or hint on why this happens, and how to solve the problem?

This is the basic code I created:

from bs4 import BeautifulSoup
import urllib2

# Download the page and parse it with BeautifulSoup's default parser
html = urllib2.urlopen("http://www.beppegrillo.it")
prova = html.read()
soup = BeautifulSoup(prova)
print soup

And this is a line taken from the results, the line where the problem starts to appear:

value=\"giuseppe labbate ogm? non vorremmo nuovi uccelli chiamati lontre\"><input onmouseover=\"tip('<cen t e r c l a s s = \ \ ' t i t l e _ v i d e o \ \ ' > < b > g i u s e p p e l a b b a t e o g m ? n o n v o r r e m m o n u o v i u c c e l l i c h i a m a t i l o n t r e <
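To find where the output starts to go wrong, one option is to search for a long run of single characters separated by single spaces. This is only a rough sketch in Python 3 syntax: the helper name `first_spaced_run` and the threshold of six characters are assumptions of mine, not something from the question, and the pattern can in principle match legitimate text.

```python
import re

# Six or more single non-space characters, each separated by exactly one space.
# The threshold is an arbitrary choice for illustration.
SPACED_RUN = re.compile(r"(?:\S ){5,}\S")


def first_spaced_run(text):
    """Return the index where the first letter-by-letter run starts, or -1."""
    match = SPACED_RUN.search(text)
    return match.start() if match else -1


normal = "giuseppe labbate ogm? non vorremmo nuovi uccelli"
broken = "tip('<cen t e r c l a s s = ' t i t l e"

print(first_spaced_run(normal))
print(first_spaced_run(broken))
```

Running it on `str(soup)` would give the offset at which the extra spaces begin, which makes it easier to see what precedes them in the source.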

I believe this is a bug in lxml's HTML parser. Try:

from bs4 import BeautifulSoup
import urllib2

html = urllib2.urlopen("http://www.beppegrillo.it")
prova = html.read()
# Rewrite the declared charset before parsing, so the parser sees utf-8
soup = BeautifulSoup(prova.replace('iso-8859-1', 'utf-8'))
print soup
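The key step in that snippet is the `replace` call, which rewrites the charset name declared in the page source before the parser sees it. Here is that step on its own, as a minimal sketch in Python 3 syntax (where `read()` returns bytes rather than a `str`). The helper name `override_declared_charset` and the sample markup are made up for illustration.

```python
def override_declared_charset(raw, old=b"iso-8859-1", new=b"utf-8"):
    """Replace every occurrence of the old charset name in the raw page bytes.

    Note: bytes.replace is case-sensitive and touches all occurrences,
    not only the one inside the <meta> declaration.
    """
    return raw.replace(old, new)


# Made-up sample standing in for the downloaded page
sample = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
patched = override_declared_charset(sample)
print(patched.decode("ascii"))
```

Because the replacement is case-sensitive, a page that declares `ISO-8859-1` in upper case would pass through unchanged, so it is worth checking how the charset is actually written in the source.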

This is a workaround for the problem. I believe the issue was fixed in lxml 3.0 alpha 2 and lxml 2.3.6, so it is worth checking whether you need to upgrade to a newer version.
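To check whether the installed lxml predates those releases, something like the following could be used. It is only a sketch, written in Python 3 syntax: `likely_has_fix` is a hypothetical helper, it compares against 2.3.6 only, and it does not distinguish 3.0 pre-releases earlier than alpha 2. `etree.LXML_VERSION` is the version tuple that lxml exposes.

```python
# First 2.x release said to contain the fix
FIXED_IN_2X = (2, 3, 6)


def likely_has_fix(version):
    """Return True if a (major, minor, patch, ...) tuple is at or past 2.3.6.

    Simplification: 3.0 pre-releases before alpha 2 are not handled.
    """
    return tuple(version[:3]) >= FIXED_IN_2X


if __name__ == "__main__":
    try:
        from lxml import etree
    except ImportError:
        print("lxml is not installed")
    else:
        print(etree.LXML_VERSION, likely_has_fix(etree.LXML_VERSION))
```

If this reports `False`, upgrading lxml is probably a cleaner fix than rewriting the charset by hand.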
