php – Trying to get the main image on an Amazon page using HTML DOM Parser

I am trying to use HTML DOM Parser to get the image source of the "main" product image, whichever product page the parser is pointed at.

On every page, that image appears to have the id "landingImage".
You would think this should do the trick:

$finalarray[$i][2] = $html->find('img[id="landingImage"]',0)->src;

But no such luck.

I have also tried:

foreach($html->find('img') as $e)
    if (strpos($e,'landingImage') !== false) { 
        $finalarray[$i][2] = $e->src;
    }

I noticed that the image source usually contains SY300 or SX300, so I did this:

foreach($html->find('img') as $e)
    if (strpos($e,'SX300') !== false) { 
        $finalarray[$i][2] = $e->src;
    }
    else if (strpos($e,'SY300') !== false) { 
        $finalarray[$i][2] = $e->src;
    }
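Both branches above perform the same assignment, so the two checks can be folded into a single test on the src string. A minimal sketch, assuming the size token always appears literally as SX300 or SY300; the helper name hasSizeToken() and the first sample URL are made up for illustration:

```php
<?php
/**
 * Hypothetical helper: true if the image URL contains SX300 or SY300.
 */
function hasSizeToken(string $src): bool
{
    // [XY] matches either letter, so one pattern covers both cases.
    return preg_match('/S[XY]300/', $src) === 1;
}

// Made-up sample URL containing the token.
var_dump(hasSizeToken('http://example.com/images/I/abc._SX300_.jpg'));
// URL taken from the answer's comment; it contains neither token.
var_dump(hasSizeToken('http://ecx.images-amazon.com/images/I/21JzKZ9%2BYGL.jpg'));
```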

Unfortunately, on some pages the image source contains neither, for example:

http://www.amazon.com/gp/product/B001O21H00/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B001O21H00&linkCode=as2&tag=bmref-20

Using the Amazon API might be a better solution, but that is not the question.

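When I downloaded the HTML of the example page (without anything that runs JavaScript), I could not find any tag with id="landingImage" [1]. I could, however, find an image tag with id="main-image". Trying to extract this tag with DOMDocument was unsuccessful: for some reason neither loadHTML() nor loadHTMLFile() could parse the HTML.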
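One possible reason DOMDocument appears to fail on real-world pages is that libxml raises warnings on malformed markup. That is an assumption here, not something confirmed for the Amazon page. A minimal sketch of a tolerant lookup, using inline sample HTML and a hypothetical helper findImageSrcById():

```php
<?php
/**
 * Hypothetical helper: return the src of the first <img> with the given id,
 * or null if no such image is found.
 */
function findImageSrcById(string $html, string $id): ?string
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($doc);
    // $id is a fixed literal in this sketch, not untrusted input.
    $nodes = $xpath->query(sprintf('//img[@id="%s"]', $id));
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    $src = $nodes->item(0)->getAttribute('src');
    return $src !== '' ? $src : null;
}

// Made-up markup, deliberately not well-formed.
$sample = '<div><img id="main-image" src="http://example.com/a.jpg"><p>unclosed';
var_dump(findImageSrcById($sample, 'main-image'));
```

Whether this works on the actual page depends on what the server returns, so treat it as something to try rather than a confirmed fix.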

But the interesting part can be extracted with a regular expression. The following code gives you the image source:

$url = 'http://www.amazon.com/gp/product/B001O21H00/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B001O21H00&linkCode=as2&tag=bmref-20';
$html = file_get_contents($url);

$src = null; // stays null if the pattern does not match
$matches = array();
if (preg_match('#<img[^>]*id="main-image"[^>]*src="(.*?)"[^>]*>#',$html,$matches)) {
    $src = $matches[1];
}

// The source of the image is
// $src: 'http://ecx.images-amazon.com/images/I/21JzKZ9%2BYGL.jpg'
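The pattern above only matches when id comes before src and both use double quotes. A minimal sketch of a slightly more tolerant variant, which first isolates the tag carrying the id and then reads src from that tag alone; the helper name extractImgSrcById() and the sample markup are hypothetical:

```php
<?php
/**
 * Hypothetical helper: find the <img> tag with the given id and return its
 * src, regardless of attribute order or quote style. Returns null if absent.
 */
function extractImgSrcById(string $html, string $id): ?string
{
    $idPattern = preg_quote($id, '#');

    // Step 1: isolate the whole <img ...> tag that has the wanted id.
    // \s before "id" avoids matching attributes such as data-id.
    if (!preg_match('#<img\b[^>]*\sid=(["\'])' . $idPattern . '\1[^>]*>#i', $html, $tag)) {
        return null;
    }

    // Step 2: read src from that tag only.
    if (!preg_match('#\ssrc=(["\'])(.*?)\1#i', $tag[0], $m)) {
        return null;
    }

    // Attribute values may contain entities such as &amp;.
    return html_entity_decode($m[2], ENT_QUOTES);
}

// Made-up sample with src before id.
var_dump(extractImgSrcById('<img src="http://example.com/b.jpg" id="main-image">', 'main-image'));
```

Regular expressions remain fragile against markup changes, so this is still a workaround rather than a real parser.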

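[1] The HTML source was downloaded in PHP with file_get_contents. Downloading the HTML source with Firefox yields different HTML. In that case you will find an image tag with the id attribute "landingImage" (with JavaScript disabled!). It seems the downloaded HTML depends on the client (the headers in the HTTP request).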
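If the response really does depend on the request headers, one thing to try is sending browser-like headers from PHP through a stream context. A minimal sketch; the helper name buildBrowserLikeContext() and the User-Agent string are arbitrary example values, and there is no guarantee that the server then returns the "landingImage" variant:

```php
<?php
/**
 * Hypothetical helper: build a stream context that sends browser-like
 * headers with file_get_contents().
 */
function buildBrowserLikeContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'header'     => "Accept: text/html\r\nAccept-Language: en-US,en;q=0.8\r\n",
            'user_agent' => $userAgent,
            'timeout'    => 10,
        ],
    ]);
}

$context = buildBrowserLikeContext(
    'Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0'
);

// Usage (performs a network request, so it is left commented out):
// $html = file_get_contents($url, false, $context);
```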
