PHP - Parse a page with a different encoding
I made a parser for WordPress. WP and its DB use UTF-8, but some of the pages I parse are in a different encoding, so the parsed text comes out as gibberish. I use cURL to fetch content from outside URLs, then match and replace with regex.
Any suggestions on how to solve this problem?
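I used Joni's suggestion below and it solved the problem. Here is the sample code I used, for future queries on this problem (it is the body of a charset-detection function):
// Look for charset=... in the raw response (HTTP headers or meta tag).
preg_match("/charset=(.*?)(\n|'|\"|>)/ism", $content, $charset);
// Strip the leading HTTP header block, up to the first tag.
// The "i" flag is added so an uppercase "HTTP/1.1 ..." status line also matches.
$content = preg_replace('/^http+[^<]+</i', '<', $content);
// Avoid a notice when no charset was found.
$charset = isset($charset[1]) ? trim($charset[1]) : '';
// Normalize the detected name; other charsets are not handled in this snippet.
if (preg_match("~(windows-1251|1251)~i", $charset)) return 'windows-1251';
elseif (preg_match("~iso-8859-7~i", $charset)) return 'iso-8859-7';
elseif (preg_match("~(koi8|iso-ir-111)~i", $charset)) return 'koi8-r';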
Detect the correct encoding from the Content-Type header (or from the HTML meta tag if the header is missing), and use it when you parse the document.
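A minimal sketch of that approach, assuming PHP 7.1+ with the iconv extension available. The function names detect_page_charset and to_utf8 are hypothetical, chosen only for illustration; the idea is to read the charset from the header first, fall back to the meta tag, and convert the body to UTF-8 before running any regex on it:

```php
<?php
// Hypothetical helper: read the charset from the Content-Type header value,
// fall back to a <meta> tag in the HTML, then to a default.
function detect_page_charset(?string $contentTypeHeader, string $html, string $default = 'UTF-8'): string
{
    $name = '["\']?([A-Za-z0-9._:-]+)';

    // 1. Header, e.g. "text/html; charset=windows-1251"
    if ($contentTypeHeader !== null
        && preg_match('/charset\s*=\s*' . $name . '/i', $contentTypeHeader, $m)) {
        return strtoupper($m[1]);
    }
    // 2. Meta tag: <meta charset="..."> or
    //    <meta http-equiv="Content-Type" content="...; charset=...">
    if (preg_match('/<meta[^>]+charset\s*=\s*' . $name . '/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    // 3. Nothing declared: assume the default.
    return $default;
}

// Hypothetical helper: convert the page body to UTF-8.
// Returns the input unchanged if iconv does not know the charset.
function to_utf8(string $html, string $charset): string
{
    if (strcasecmp($charset, 'UTF-8') === 0) {
        return $html;
    }
    $converted = @iconv($charset, 'UTF-8', $html);
    return $converted === false ? $html : $converted;
}

// Example: a windows-1251 body declared only in the header.
$body    = "<p>\xCF\xF0\xE8\xE2\xE5\xF2</p>";
$charset = detect_page_charset('text/html; charset=windows-1251', $body);
$utf8    = to_utf8($body, $charset);
```

With cURL, the header value can be read after the request with curl_getinfo($ch, CURLINFO_CONTENT_TYPE), so the headers do not have to be mixed into the body and stripped out again. Converting once, right after the fetch, means every later match and replace works on UTF-8, which is what WordPress and its database expect.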