javascript - How to force the browser to stop parsing dynamically inserted code as HTML 4?
I need to convert an old HTML file to PDF. I have a JAR for this, but it only accepts valid XHTML, so I have to convert the old HTML code into something the JAR will accept. My idea is to use the HTML parser by John Resig to turn the void tags (img, br, meta) into proper XML, which has the needed effect on them (mostly closing the tags).
My current attempt looks like this:
function fixTags() {
    // Void elements that have to be closed to be valid XHTML
    var tagsToParse = new Array("br", "img", "input", "meta");
    for (var i = 0; i < tagsToParse.length; i++) {
        var elements = document.getElementsByTagName(tagsToParse[i]);
        for (var j = 0; j < elements.length; j++) {
            // HTMLtoXML comes from John Resig's HTML parser
            elements[j].outerHTML = HTMLtoXML(elements[j].outerHTML);
        }
    }
}
The problem here is that the browser interprets the new code of the element as HTML 4, which leads to it reverting exactly the things I wanted to change. For example, <br> becomes <br/> after going through the parser, but the browser then interprets it as HTML 4, and the outerHTML property of the element is <br> again.
My first attempt to solve this was to force the document to be XHTML temporarily:
var root = document.getElementsByTagName("html")[0];
root.setAttribute("xml", "http://www.w3.org/1999/xhtml");
But this doesn't seem to affect the browser's behaviour at all.
The "obvious" solution of building a string tree out of the DOM, replacing the strings there, and traversing the tree to get the string I want seems a bit heavy and complex for this "little" problem, which is why I'm asking you.
So if anyone has an idea for an easier solution, I would be happy. The application is IE-only, so IE-exclusive solutions are accepted as well.
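For reference, a minimal sketch of that tree-walking approach, assuming only the standard DOM node properties (nodeType, nodeName, nodeValue, attributes, childNodes). The names toXhtml, escapeText, escapeAttribute and VOID_TAGS are made up for this illustration, and the sketch ignores special cases such as script and style contents, comments, and the doctype:

```javascript
// Void elements that must be written as self-closing tags in XHTML.
var VOID_TAGS = { br: true, img: true, input: true, meta: true, link: true, hr: true };

// Escape characters that are not allowed verbatim in XML text.
function escapeText(text) {
    return String(text)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
}

// Attribute values are double-quoted, so quotes must be escaped too.
function escapeAttribute(value) {
    return escapeText(value).replace(/"/g, '&quot;');
}

// Hypothetical helper: recursively turns a DOM-like node into an XHTML string.
function toXhtml(node) {
    if (node.nodeType === 3) {          // text node
        return escapeText(node.nodeValue);
    }
    if (node.nodeType !== 1) {          // skip everything that is not an element
        return '';
    }
    var name = node.nodeName.toLowerCase();
    var out = '<' + name;
    var attrs = node.attributes || [];
    for (var i = 0; i < attrs.length; i++) {
        // Old IE lists every possible attribute; keep only the specified ones.
        if (attrs[i].specified === false) {
            continue;
        }
        out += ' ' + attrs[i].name.toLowerCase() +
               '="' + escapeAttribute(attrs[i].value) + '"';
    }
    if (VOID_TAGS[name] === true) {
        return out + ' />';             // close void elements explicitly
    }
    out += '>';
    var children = node.childNodes || [];
    for (var j = 0; j < children.length; j++) {
        out += toXhtml(children[j]);
    }
    return out + '</' + name + '>';
}
```

In the browser this would be called as toXhtml(document.documentElement). Because the string is built by hand and never assigned back to the DOM, the browser gets no chance to re-parse it as HTML 4.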
For your use case, it's probably easiest to use an existing HTML -> XHTML converter, for example: http://www.it.uc3m.es/jaf/html2xhtml/simple-form.html
If you want to do it in the browser, a naive solution would be to try this, using naive regexes (you shouldn't use regexps to parse XML) and XMLSerializer.
var serializer = new XMLSerializer();
// The pattern includes the closing ">" so that it is replaced by " />"
var xml = serializer.serializeToString(document).replace(
    /<(img|meta|input|br|link)([^>]*)>/gi,
    function (ignore, tagName, attributes) {
        return '<' + tagName + attributes + ' />';
    }
);
You can use a less naive regex if this doesn't work, but I think that for a document that can be converted to PDF in the first place, this should do the trick.
Edit: note that the regex assumes none of the tags are self-closing before the operation.
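If some of the tags might already be self-closing, a slightly less naive variant could look like the sketch below. The function name closeVoidTags is made up for this example, and the pattern still assumes that no attribute value contains a literal ">":

```javascript
// Hypothetical helper: closes void elements in a markup string and
// normalises tags that are already self-closed instead of doubling the slash.
function closeVoidTags(markup) {
    // \b keeps longer tag names that merely start with "br", "img", etc. from matching.
    // The lazy attribute group plus the optional "/" absorbs an existing " />".
    var pattern = /<(img|meta|input|br|link)\b([^>]*?)\s*(\/?)>/gi;
    return markup.replace(pattern, function (match, tagName, attributes) {
        // XHTML element names are case-sensitive and must be lowercase.
        return '<' + tagName.toLowerCase() + attributes + ' />';
    });
}
```

It could replace the inline regex above, for example closeVoidTags(serializer.serializeToString(document)).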