How to index .html files in Solr

The files I want to index are stored on the server (I do not need to crawl them), under /path/to/files/.
A sample HTML file looks like this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<meta name="product_id" content="11"/>
<meta name="assetid" content="10001"/>
<meta name="title" content="title of the article"/>
<meta name="type" content="0xyzb"/>
<meta name="category" content="article category"/>
<meta name="first" content="details of the article"/>

<h4>title of the article</h4>
<p class="link"><a href="#link">How cite the Article</a></p>
<p class="list">
  <span class="listterm">Length: </span>13 to 15 feet<br>
  <span class="listterm">Height to Top of Head: </span>up to 18 feet<br>
  <span class="listterm">Weight: </span>1,200 to 4,300 pounds<br>
  <span class="listterm">Diet: </span>leaves and branches of trees<br>
  <span class="listterm">Number of Young: </span>1<br>
  <span class="listterm">Home: </span>Sahara<br>

</p>
</p>

I added a request handler to the solrconfig.xml file:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
  <str name="config">/path/to/data-config.xml</str>
</lst>
</requestHandler>

My data-config.xml looks like this:

<dataConfig>
<dataSource type="FileDataSource" />
<document>
    <entity name="f" processor="FileListEntityProcessor" baseDir="/path/to html/files/" fileName=".*html" recursive="true" rootEntity="false" dataSource="null">
        <field column="plainText" name="text"/>
    </entity>
</document>
</dataConfig>

I kept the default schema.xml file and added the following to it:

<field name="product_id" type="string" indexed="true" stored="true"/>
 <field name="assetid" type="string" indexed="true" stored="true" required="true" />
 <field name="title" type="string" indexed="true" stored="true"/>
 <field name="type" type="string" indexed="true" stored="true"/>
 <field name="category" type="string" indexed="true" stored="true"/>
 <field name="first" type="text_general" indexed="true" stored="true"/>

 <uniqueKey>assetid</uniqueKey>

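After setting this up, when I run a full import, it reports that all the HTML files were fetched. But when I search in Solr, no results come back. Does anyone know what the cause might be?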

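My understanding is that all the files are fetched correctly but are not indexed in Solr. Does anyone know how to index those meta tags and the content of the HTML files in Solr?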

Any reply would be appreciated.

Solution

You can use the Solr Extracting Request Handler to feed the HTML files to Solr and have it extract their content, e.g. at link
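A minimal sketch of what such a handler definition might look like in solrconfig.xml. It assumes the extraction contrib (Solr Cell) and its Tika jars are already on the classpath, and that the schema has a "text" field and an ignored_* dynamic field, as the stock example schema does. The field mappings are assumptions chosen to match the schema in the question, not settings taken from the original answer.

```xml
<!-- solrconfig.xml: sketch of an extracting handler (Solr Cell) -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Lowercase metadata names so they can line up with schema field names -->
    <str name="lowernames">true</str>
    <!-- Assumption: route the extracted body text into the "text" field -->
    <str name="fmap.content">text</str>
    <!-- Assumption: metadata with no matching schema field is prefixed
         so that an ignored_* dynamic field can absorb it -->
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```

Each HTML file would then be posted to /update/extract. If the assetid meta tag does not come through as metadata, the unique key can be supplied per request with the literal.assetid parameter.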

Solr uses Apache Tika to extract content from the uploaded HTML file.
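The same Tika extraction can also be driven from a DataImportHandler setup like the one in the question. FileListEntityProcessor on its own only emits file attributes such as fileAbsolutePath, so no document content reaches the index unless a nested entity actually reads each file. The following is a sketch only, assuming the dataimporthandler-extras contrib is installed. The entity name "tika", the data source name "bin", and the placeholder baseDir are illustrative, and it is an assumption that Tika exposes the meta tag names (assetid, product_id, and so on) as metadata keys.

```xml
<dataConfig>
  <!-- TikaEntityProcessor reads raw bytes, so it needs a binary data source -->
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- Outer entity: lists the files; it produces no documents itself -->
    <entity name="f" processor="FileListEntityProcessor"
            baseDir="/path/to/html/files/" fileName=".*html"
            recursive="true" rootEntity="false" dataSource="null">
      <!-- Inner entity: parses each listed file with Tika -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text" dataSource="bin">
        <!-- "text" is the column holding the extracted body -->
        <field column="text" name="text"/>
        <!-- meta="true" pulls a value from Tika metadata (assumed key names) -->
        <field column="assetid" name="assetid" meta="true"/>
        <field column="product_id" name="product_id" meta="true"/>
        <field column="category" name="category" meta="true"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Because assetid is the required uniqueKey in the schema above, any file whose assetid is not extracted would be rejected, so it is worth checking the import status output for skipped or failed documents.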

If you want to crawl websites and index them, Nutch with Solr is a more comprehensive solution.
The Nutch with Solr Tutorial will help you get started.
