php - Parse page with different encoding


I wrote a parser for WordPress. WordPress and its database use UTF-8, but the pages I parse are in different encodings, so the parsed text comes out as gibberish. I use cURL to fetch the content from outside URLs, then match and replace with regex.

Any suggestions on how to solve this problem?

I used Joni's suggestion below and it solved the problem. Here is the sample code I used, for future queries on this problem:

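    // Fragment from inside a function that returns the detected charset name.
    // Capture the charset declared in the response (header line or meta tag).
    preg_match("/charset=(.*?)(\n|'|\"|>)/ism", $content, $charset);
    // If the content starts with "http", strip everything before the first "<".
    $content = preg_replace('/^http+[^<]+</', '<', $content);
    // Empty string when no charset declaration was found.
    $charset = isset($charset[1]) ? trim($charset[1]) : '';
    if (preg_match("~(windows-1251|1251)~i", $charset)) {
        return 'windows-1251';
    } elseif (preg_match("~iso-8859-7~i", $charset)) {
        return 'iso-8859-7';
    } elseif (preg_match("~(koi8|iso-ir-111)~i", $charset)) {
        return 'koi8-r';
    }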

Detect the correct encoding from the Content-Type header (or from the HTML meta tag if the header is missing), and use it when you parse the document.
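The approach above can be sketched roughly as follows. This is a minimal illustration, not the code from the question: the helper names `detect_charset` and `to_utf8` are hypothetical, the fallback to UTF-8 is an assumption, and the conversion relies on PHP's `iconv` extension supporting the detected charset name.

```php
<?php
// Hypothetical helper: take the charset from the Content-Type header value,
// fall back to a <meta> declaration in the HTML, then to a default.
function detect_charset(?string $contentType, string $html, string $default = 'UTF-8'): string
{
    // 1. Header value, e.g. "text/html; charset=windows-1251"
    if ($contentType !== null
        && preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    // 2. <meta charset="..."> or <meta http-equiv="Content-Type" content="...; charset=...">
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return strtolower($m[1]);
    }
    // 3. Nothing declared: assume the default.
    return strtolower($default);
}

// Hypothetical helper: convert the fetched page to UTF-8 so it matches
// the encoding used by WordPress and its database.
function to_utf8(string $html, string $charset): string
{
    if (strcasecmp($charset, 'utf-8') === 0) {
        return $html; // already UTF-8, nothing to do
    }
    $converted = @iconv($charset, 'UTF-8', $html);
    // If iconv does not know the charset, return the input unchanged.
    return $converted === false ? $html : $converted;
}
```

With cURL, the header value can be read with `curl_getinfo($ch, CURLINFO_CONTENT_TYPE)` after `curl_exec($ch)`, then passed as the first argument to `detect_charset` before running any regex over the converted content.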

