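I have a website that receives a CSV file via FTP once a month. For years it was an ASCII file. Now I receive UTF-8 one month, UTF-16BE the next, and UTF-16LE the month after that. Maybe next month I will get UTF-32. fgets returns the byte order mark (BOM) at the start of the UTF files. How can I get PHP to recognize the character encoding automatically? I tried mb_detect_encoding, but it returns ASCII regardless of the file type. I changed my code to read the BOM and pass the character encoding explicitly to mb_convert_encoding. That worked until the latest file, which is UTF-16LE. In this file the first line is read correctly, but all subsequent lines appear as question marks ("?"). What am I doing wrong?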
    $fhandle = fopen( $file_in, "r" );
    if ( $fhandle === false ) {
        echo "<p class=redbold>Error opening file $file_in.</p>";
        die();
    }
    $i = 0;
    while ( ( $line = fgets( $fhandle ) ) !== false ) {
        $i++;
        // Detect encoding on first line. Actual text always begins with string "Document"
        if ( $i == 1 ) {
            // First four bytes: a possible BOM plus the start of "Document"
            $line_start = substr( $line, 0, 4 );
            $line_start_hex = bin2hex( $line_start );
            $utf16_start = 'fffe4400';
            $utf8_start = 'efbbbf44';
            if ( strcmp( $line_start, 'Docu' ) == 0 ) {
                $char_encoding = 'ASCII';
            } elseif ( strcmp( $line_start_hex, 'efbbbf44' ) == 0 ) {
                $char_encoding = 'UTF-8';
                $line = substr( $line, 3 );
            } elseif ( strcmp( $line_start_hex, 'fffe4400' ) == 0 ) {
                $char_encoding = 'UTF-16LE';
                $line = substr( $line, 2 );
            } elseif ( strcmp( $line_start_hex, 'feff4400' ) == 0 ) {
                $char_encoding = 'UTF-16BE';
                $line = substr( $line, 2 );
            } else {
                echo "<p class=redbold>Error, unknown character encoding. Line =<br>", $line_start_hex, '</p>';
                require( '../footer.PHP' );
                die();
            }
            echo "<p>char_encoding = $char_encoding</p>";
        }
        // Convert UTF
        if ( $char_encoding != 'ASCII' ) {
            $line = mb_convert_encoding( $line, 'ASCII', $char_encoding );
        }
        echo '<p>';
        var_dump( $line );
        echo '</p>';
    }
Output:
    char_encoding = UTF-16LE
    string(101) "DocumentNumber,RecordedTS,Title,PageCount,City,TransTaxAccountCode,TotalTransferTax,Description,Name "
    string(83) "???????????????????????????????????????????????????????????????????????????????????"
    string(88) "????????????????????????????????????????????????????????????????????????????????????????"
    string(84) "????????????????????????????????????????????????????????????????????????????????????"
    string(80) "????????????????????????????????????????????????????????????????????????????????"
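Explicitly pass the order and the list of candidate encodings to the detection, and use the strict parameter. Also, use file_get_contents: if the file is UTF-16LE, fgets will mess it up for you. fgets splits at the byte 0x0A, but in UTF-16LE a newline is the two bytes 0A 00, so every line after the first starts with a stray 00 byte and all of its byte pairs are misaligned.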
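To see the fgets problem in isolation, here is a minimal sketch that reads a tiny UTF-16LE sample through fgets and dumps the raw bytes of each chunk. The sample text and the php://memory stream are stand-ins I chose for illustration, not part of the question's setup.

```php
<?php
// "A\nB\n" encoded as UTF-16LE: every character is two bytes, low byte first.
$utf16le = mb_convert_encoding("A\nB\n", 'UTF-16LE', 'UTF-8');

// Put the bytes into an in-memory stream so fgets() can read them like a file.
$stream = fopen('php://memory', 'w+b');
fwrite($stream, $utf16le);
rewind($stream);

// fgets() stops right after each 0x0A byte, which is only the first
// half of the UTF-16LE newline (0A 00).
$chunks = [];
while (($chunk = fgets($stream)) !== false) {
    $chunks[] = bin2hex($chunk);
}
fclose($stream);

print_r($chunks); // 41000a, 0042000a, 00
```

The first chunk still starts on a character boundary, which is why the first line of the file decodes correctly. The second chunk begins with the leftover 00, so its byte pairs read as 00 42 and 00 0A, which in little-endian order are U+4200 and U+0A00 rather than "B" and a newline. Converting those to ASCII yields question marks.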
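    <?php
    header( "Content-Type: text/html; charset=utf-8" );
    // Read the whole file at once so multi-byte line endings stay intact.
    $input = file_get_contents( $file_in );
    // Candidates are tried in the order given; TRUE turns on strict detection.
    $encoding = mb_detect_encoding(
        $input,
        array( "UTF-8", "UTF-32", "UTF-32BE", "UTF-32LE", "UTF-16", "UTF-16BE", "UTF-16LE" ),
        TRUE
    );
    if ( $encoding !== "UTF-8" ) {
        $input = mb_convert_encoding( $input, "UTF-8", $encoding );
    }
    echo "<p>$encoding</p>";
    // PHP_EOL is the server's line ending; adjust if the file uses a different one.
    foreach ( explode( PHP_EOL, $input ) as $line ) {
        var_dump( $line );
    }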
The order matters because UTF-8 and UTF-32 are more restrictive, while UTF-16 is very permissive: almost any random even-length sequence of bytes is valid UTF-16.
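Since these files do arrive with a BOM, checking it before guessing is one way around how permissive UTF-16 is. The following is only a sketch: detectBom and toUtf8 are hypothetical helper names, and the fallback to UTF-8 for BOM-less input is my assumption (plain ASCII is valid UTF-8, so the old files would pass through unchanged).

```php
<?php
/**
 * Hypothetical helper: returns [encoding name, BOM length in bytes],
 * or null when the data does not start with a known BOM.
 */
function detectBom(string $bytes): ?array
{
    // The 4-byte UTF-32 BOMs come first: the UTF-32LE BOM (FF FE 00 00)
    // begins with the UTF-16LE BOM (FF FE).
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return [$encoding, strlen($bom)];
        }
    }
    return null;
}

/**
 * Hypothetical helper: strips the BOM and converts the rest to UTF-8.
 * Input without a BOM is treated as $fallback (an assumption).
 */
function toUtf8(string $bytes, string $fallback = 'UTF-8'): string
{
    $found = detectBom($bytes);
    if ($found === null) {
        return mb_convert_encoding($bytes, 'UTF-8', $fallback);
    }
    [$encoding, $bomLength] = $found;
    return mb_convert_encoding((string) substr($bytes, $bomLength), 'UTF-8', $encoding);
}

// Usage with an in-memory sample standing in for file_get_contents( $file_in ):
$sample = "\xFF\xFE" . mb_convert_encoding("Document,\u{00E9}\n", 'UTF-16LE', 'UTF-8');
var_dump(detectBom($sample));
var_dump(toUtf8($sample));
```

One caveat: a UTF-16LE file whose first character is U+0000 would be taken for UTF-32LE by this check. That should not happen with CSV text that always begins with "Document".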
The only way to preserve all of the information is to convert it to a Unicode encoding such as UTF-8, not to ASCII.
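As a small illustration of that loss, the sketch below converts the same UTF-16LE bytes once to ASCII and once to UTF-8. The sample string is made up. mbstring replaces characters it cannot represent in the target encoding with its substitute character, which is "?" unless configured otherwise.

```php
<?php
// Made-up sample with non-ASCII letters ("Zürich,Ærø"), written with
// escapes so the script does not depend on the source file's encoding.
$original = "Z\u{00FC}rich,\u{00C6}r\u{00F8}";
$utf16le  = mb_convert_encoding($original, 'UTF-16LE', 'UTF-8');

// ASCII cannot represent ü, Æ or ø, so they are substituted and lost.
$asAscii = mb_convert_encoding($utf16le, 'ASCII', 'UTF-16LE');

// UTF-8 can represent every Unicode character, so nothing is lost.
$asUtf8 = mb_convert_encoding($utf16le, 'UTF-8', 'UTF-16LE');

var_dump($asAscii);
var_dump($asUtf8 === $original);
```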