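I want to work with UTF-8 only. The problem is that I don't know the character set of each web page. How can I detect it and convert the page to UTF-8?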
<?php
$url = "http://vkontakte.ru";
$ch = curl_init($url);
$options = array(
    CURLOPT_RETURNTRANSFER => true,
);
curl_setopt_array($ch, $options);
$data = curl_exec($ch);
// $data = magic($data);
print $data;
See: http://paulisageek.com/tmp/curl-utf8
What should magic() be?
Following Gumbo's and Pekka's suggestions, I wrote curl_exec_utf8:
/** The same as curl_exec, except it tries its best to convert the output to UTF-8. */
function curl_exec_utf8($ch) {
    $data = curl_exec($ch);
    if (!is_string($data)) return $data;

    unset($charset);
    // Cast to string: curl_getinfo() may not return a Content-Type at all.
    $content_type = (string) curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    /* 1: HTTP Content-Type: header */
    preg_match('@([\w/+]+)(;\s*charset=(\S+))?@i', $content_type, $matches);
    if (isset($matches[3])) $charset = $matches[3];

    /* 2: <meta> element in the page */
    if (!isset($charset)) {
        preg_match('@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s*charset=([^\s"]+))?@i', $data, $matches);
        if (isset($matches[3])) $charset = $matches[3];
    }

    /* 3: XML declaration in the page */
    if (!isset($charset)) {
        // The subject string ($data) must be passed as the second argument.
        preg_match('@<\?xml.+encoding="([^\s"]+)@si', $data, $matches);
        if (isset($matches[1])) $charset = $matches[1];
    }

    /* 4: PHP's heuristic detection */
    if (!isset($charset)) {
        $encoding = mb_detect_encoding($data);
        if ($encoding) $charset = $encoding;
    }

    /* 5: Default for HTML */
    if (!isset($charset)) {
        // strpos(), not strstr(): strstr() returns a string or false, never 0.
        if (strpos($content_type, "text/html") === 0) $charset = "ISO-8859-1";
    }

    /* Convert it if it is anything but UTF-8 */
    /* You can change "UTF-8" to "UTF-8//IGNORE" to ignore conversion errors
       and still output something reasonable */
    if (isset($charset) && strtoupper($charset) != "UTF-8")
        $data = iconv($charset, 'UTF-8', $data);

    return $data;
}
The regular expressions mostly come from http://nadeausoftware.com/articles/2007/06/php_tip_how_get_web_page_content_type
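To check the detection order without hitting the network, the first three steps can be pulled out into a pure function. The following is only a sketch: detect_declared_charset() is a hypothetical helper name, not part of the code above, and its patterns are simplified variants of the ones used in curl_exec_utf8. It assumes the iconv extension is available.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// steps 1-3 of the detection order, separated from cURL so they can be
// exercised offline. Returns the declared charset, or null if none is found.
function detect_declared_charset($content_type, $data) {
    // 1: charset parameter of the HTTP Content-Type header
    if (preg_match('@;\s*charset=["\']?([^\s;"\']+)@i', (string) $content_type, $m)) {
        return $m[1];
    }
    // 2: <meta http-equiv="Content-Type" ...> element in the page
    if (preg_match('@<meta\s+http-equiv="Content-Type"\s+content="[\w/]+;\s*charset=([^\s"]+)@i', $data, $m)) {
        return $m[1];
    }
    // 3: encoding attribute of an XML declaration
    if (preg_match('@<\?xml[^>]+encoding="([^\s"]+)@i', $data, $m)) {
        return $m[1];
    }
    return null;
}

// "é" encoded as ISO-8859-1 is the single byte 0xE9.
$latin1  = "<html><body>caf\xE9</body></html>";
$charset = detect_declared_charset('text/html; charset=ISO-8859-1', $latin1);
$utf8    = iconv($charset, 'UTF-8', $latin1);

echo $charset, "\n";                                          // ISO-8859-1
var_dump($utf8 === "<html><body>caf\xC3\xA9</body></html>");  // bool(true)
```

Keeping the header and markup checks separate from the cURL call makes it easy to feed in saved pages with known encodings and see which step picks up the charset.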