Understanding Solr's schema.xml configuration file

schema.xml mainly defines the fields of the index and their field types.

<?xml version="1.0" encoding="UTF-8" ?>
 (omitted)...
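<!-- 
schema.xml lives in the solr/conf/ directory and plays a role similar to a database table definition.

For details on customizing this file, see: http://wiki.apache.org/solr/SchemaXml

Performance note: this file includes many optional settings that a real application does not need. To improve performance, you can:

 - set stored="false" on every field that is only searched and never returned;
 - set indexed="false" on every field that is only returned and never searched;
 - remove every copyField statement you do not need;
 - for the best index size and search performance, set indexed="false" on all general text fields,
 use copyField to copy them into the catchall field name="text", and search against that field;
 - run the JVM in server mode and raise the log level so that not every request is logged.
-->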

<schema name="example" version="1.5">
 (omitted)...

<fields>
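 <!-- Attributes of a field:
 name: required. The field name.
 type: required. The name of a field type defined in <types>.
 indexed: true if the field should be indexed (so it can be searched or sorted on).
 stored: true if the field's value should be retrievable in results.
 docValues: true if the field should have doc values. Doc values are useful for faceting, grouping, sorting and function queries. They are not required and make the index larger and slower to build, but they make the index faster to load, more NRT-friendly and more memory-efficient. There are restrictions: currently only StrField, UUIDField and all Trie*Fields are supported and, depending on the field type, the field may need to be single-valued, required or have a default value.

 multiValued: true if the field may hold several values per document.
 termVectors: [false] true to store term vectors for the field.
         When using MoreLikeThis, fields used for similarity should be stored for best performance.
 termPositions: stores position information with the term vector; increases storage cost.
 termOffsets: stores offset information with the term vector; increases storage cost.
 required: the field must have a value; otherwise an exception is thrown.
 default: a default value assigned when a document is added without a value for the field.

         sortMissingLast: documents without a value for this field sort after documents that have one.
         sortMissingFirst: documents without a value for this field sort before documents that have one.
         omitNorms: true when field length should not affect scoring and no index-time boost is applied. Full-text fields normally leave this false.
         compressed: stores the field compressed. This may slow indexing and searching but saves storage space. Only StrField and TextField can be compressed, and it is usually worthwhile for values longer than 200 characters.
         positionIncrementGap: used with multiValued; the size of the virtual position gap placed between successive values.
 -->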

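 <!-- Field names should consist of letters, digits and underscores and must not start with a digit.
      Names that both begin and end with an underscore are reserved, for example _version_.
 -->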
 
 <field name="id" type="string" indexed="true" stored="true" 
   required="true" multiValued="false" /> 

 <field name="title" type="text_general" indexed="true" 
   stored="true" multiValued="true"/>
 <field name="description" type="text_general" indexed="true" stored="true"/>
 <field name="author" type="text_general" indexed="true" stored="true"/>
 <field name="keywords" type="text_general" indexed="true" stored="true"/>
 <field name="category" type="text_general" indexed="true" stored="true"/>
 <field name="url" type="text_general" indexed="true" stored="true"/>
 <field name="last_modified" type="date" indexed="true" stored="true"/>
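 <!-- Note: to save space this field is not indexed by default, because copyField copies it
      into the field named text. It is used for returning and highlighting content; search against the text field.
 -->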
 <field name="content" type="text_general" indexed="false" 
   stored="true" multiValued="true"/>
 
 <!-- Catchall field containing the other searchable fields (implemented via copyField) -->
 <field name="text" type="text_general" indexed="true" 
   stored="false" multiValued="true"/>

 <!-- Reserved field; do not remove it, or Solr reports an error -->
 <field name="_version_" type="long" indexed="true" stored="true"/>
 
</fields>


<!-- Unique identifier of a document, comparable to a primary key. The field is required unless it is marked required="false". -->
<uniqueKey>id</uniqueKey>

 <!-- Copy the fields that should be searchable into the catchall field -->
 <copyField source="title" dest="text"/>
 <copyField source="author" dest="text"/>
 <copyField source="description" dest="text"/>
 <copyField source="keywords" dest="text"/>
 <copyField source="content" dest="text"/>
 <copyField source="url" dest="text"/>

 <types>
 <!-- Field type definitions -->
 <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
 <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
 <fieldType name="int" class="solr.TrieIntField" precisionStep="0" 
  positionIncrementGap="0"/>
 <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" 
  positionIncrementGap="0"/>
 <fieldType name="long" class="solr.TrieLongField" precisionStep="0" 
  positionIncrementGap="0"/>
 <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" 
  positionIncrementGap="0"/>
 <fieldType name="date" class="solr.TrieDateField" precisionStep="0" 
  positionIncrementGap="0"/>

 
 <fieldType name="text_ansj" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="org.ansj.solr5.AnsjTokenizerFactory"
        query="false" pstemming="true" stopwordsDir="stopwords/stopwords.dic"/>
      <filter class="org.apache.lucene.analysis.pinyin.solr5.PinyinTokenFilterFactory"
        pinyinAll="true" minTermLenght="2" maxTermLenght="15"/>
      <filter class="org.apache.lucene.analysis.pinyin.solr5.PinyinEdgeNGramTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="org.ansj.solr5.AnsjTokenizerFactory"
        query="true" pstemming="false" stopwordsDir="stopwords/stopwords.dic"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    </analyzer>
  </fieldType>
 <fieldType name="text_ansj_nopin" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="org.ansj.solr5.AnsjTokenizerFactory"
        query="false" pstemming="true" stopwordsDir="stopwords/stopwords.dic"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="org.ansj.solr5.AnsjTokenizerFactory"
        query="true" pstemming="false" stopwordsDir="stopwords/stopwords.dic"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    </analyzer>
  </fieldType>

 
</types>
 
 <!-- Similarity is the routine that scores each document against a query. A custom Similarity or SimilarityFactory can be specified here, but the default suits most applications. See: http://wiki.apache.org/solr/SchemaXml#Similarity
 -->
 <!--
 <similarity class="com.example.solr.CustomSimilarityFactory">
 <str name="paramkey">param value</str>
 </similarity>
 -->

</schema>
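
The performance notes in the schema come down to the indexed and stored flags of each field. As a rough sketch (not part of Solr), the following Python reads a trimmed copy of the schema with the standard library and reports those two flags per field. The SCHEMA string and the field_flags helper are made up for illustration, and a missing attribute is treated as false here, whereas Solr would fall back to the defaults of the field type.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative copy of the schema above (three fields only).
SCHEMA = """<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="content" type="text_general" indexed="false" stored="true"/>
    <field name="text" type="text_general" indexed="true" stored="false"/>
  </fields>
</schema>"""


def field_flags(schema_xml):
    """Map each field name to an (indexed, stored) pair of booleans."""
    root = ET.fromstring(schema_xml)
    flags = {}
    for field in root.iter("field"):
        # Assumption: a missing attribute counts as false in this sketch.
        indexed = field.get("indexed") == "true"
        stored = field.get("stored") == "true"
        flags[field.get("name")] = (indexed, stored)
    return flags


for name, (indexed, stored) in field_flags(SCHEMA).items():
    print(name, "searchable" if indexed else "not searchable",
          "returnable" if stored else "not returnable")
```

In this trimmed schema, content is returned but not searched, and text is searched but not returned, which is the split the performance notes recommend.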

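<analyzer type="index">
<!-- This tokenizer splits on whitespace. When text is added to a field of this type, Solr first
splits it on whitespace, then passes the tokens through the listed filters in order; only what
remains is written to the index for searching.
Note: none of the analysis components used here handles Chinese word segmentation; a Chinese tokenizer has to be added separately.
-->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
</analyzer>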

<copyField source="features" dest="text"/>

<copyField source="includes" dest="text"/>

When a document is indexed, the data in every copied field (such as cat) is copied into the text field.

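Purpose:

Putting the data of several fields into one field so they can be searched together, which improves speed.
Copying one field's data into another so that the same data can be indexed in two different ways.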
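
To make the copying step concrete, here is a small Python model of what copyField does at index time. It is only a sketch: COPY_FIELDS mirrors the copyField rules of the schema listing, while apply_copy_fields and the sample document are hypothetical names introduced for illustration and are not Solr APIs.

```python
# (source, dest) pairs taken from the copyField rules in the schema listing.
COPY_FIELDS = [
    ("title", "text"), ("author", "text"), ("description", "text"),
    ("keywords", "text"), ("content", "text"), ("url", "text"),
]


def as_list(value):
    """Treat single values and multi-valued fields uniformly."""
    return list(value) if isinstance(value, list) else [value]


def apply_copy_fields(doc, copy_fields=COPY_FIELDS):
    """Return a new document in which each source value is also added to its dest field."""
    result = {name: as_list(value) for name, value in doc.items()}
    for source, dest in copy_fields:
        # Copy from the original input only, so copies are never chained.
        if source in doc:
            result.setdefault(dest, []).extend(as_list(doc[source]))
    return result


doc = {"id": "1", "title": ["Solr"], "author": "Ann"}
print(apply_copy_fields(doc)["text"])
```

A search against text then sees the title and the author together, while the original fields remain available for their own analysis or for display.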

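Dynamic fields are fields without a fixed name; they are declared with dynamicField.

For example, if name is *_i and its type is int, any field whose name ends in _i is treated as matching this definition, such as name_i or school_i.

<dynamicField name="*_i" type="int" indexed="true" stored="true"/>

If a field name does not match any explicitly defined field, Solr tries to match it against the dynamic field patterns.

"*" may appear only at the very start or the very end of a pattern.
Longer patterns are tried first.
If two patterns both match, the one defined first wins.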
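
The three matching rules can be sketched in a few lines of Python. This is an illustration of the rules as stated here, not Solr's actual implementation; resolve_field, the pattern list and the type names in it are hypothetical.

```python
def check_pattern(pattern):
    """Rule 1: exactly one '*', and only at the very start or very end."""
    if pattern.count("*") != 1 or not (pattern.startswith("*") or pattern.endswith("*")):
        raise ValueError(f"invalid dynamic field pattern: {pattern!r}")


def pattern_matches(pattern, name):
    """A leading '*' matches on the suffix, a trailing '*' on the prefix."""
    if pattern.startswith("*"):
        return name.endswith(pattern[1:])
    return name.startswith(pattern[:-1])


def resolve_field(name, explicit, dynamic):
    """Return the type for name: explicit fields first, then dynamic patterns."""
    if name in explicit:
        return explicit[name]
    for pattern, _ in dynamic:
        check_pattern(pattern)
    # Rule 2: longer patterns first. sorted() is stable, so patterns of equal
    # length keep their definition order, which gives rule 3.
    for pattern, field_type in sorted(dynamic, key=lambda item: -len(item[0])):
        if pattern_matches(pattern, name):
            return field_type
    return None


explicit = {"id": "string"}
dynamic = [("*_i", "int"), ("attr_*", "text_general"), ("*_txt_i", "long")]
print(resolve_field("school_i", explicit, dynamic))
print(resolve_field("body_txt_i", explicit, dynamic))
```

Here school_i falls through to *_i, while body_txt_i is claimed by the longer *_txt_i pattern even though *_i was defined first.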

<tokenizer class="solr.WhitespaceTokenizerFactory"/>

Splits on whitespace only, for exact matching.

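<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

Takes hyphens, letter/digit boundaries and non-alphanumeric characters into account during tokenization and matching, so that "wifi" or "wi fi" both match "Wi-Fi".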
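
Why both spellings match can be seen from a heavily simplified Python approximation of word-delimiter splitting. It is not the real filter: it only splits on non-alphanumeric characters and letter/digit boundaries, joins all parts into one extra token, ignores case-change splitting, and lowercases inline (in Solr that is a separate filter). The function name word_delimiter is hypothetical.

```python
import re


def word_delimiter(token, catenate=True):
    """Split a token into letter runs and digit runs, optionally adding the joined form."""
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    terms = list(parts)
    if catenate and len(parts) > 1:
        # Index the parts joined back together as one additional term.
        terms.append("".join(parts))
    # Lowercasing is folded in here only to keep the sketch short.
    return [term.lower() for term in terms]


indexed_terms = set(word_delimiter("Wi-Fi"))
print(sorted(indexed_terms))

# Both query spellings find their terms among the indexed ones.
print(all(term in indexed_terms for term in word_delimiter("wifi")))
print(all(term in indexed_terms for q in "wi fi".split() for term in word_delimiter(q)))
```

Because the indexed side holds the individual parts and the joined form, either way of typing the query has something to match against.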

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

Synonyms.

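<filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true"/>

After stopwords are removed, a position gap is left where they stood, so the remaining terms keep their original spacing.

stopword: a word that is ignored both when indexing and when searching, such as the common words "is" and "this". The list is maintained in conf/stopwords.txt.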
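
The effect of keeping position gaps can be illustrated with a short Python sketch. It is a simplified model rather than Lucene's token stream: positions are simply counted from 1, and stop_filter and the sample stopword set are hypothetical.

```python
def stop_filter(tokens, stopwords, ignore_case=True):
    """Drop stopwords but keep each surviving term's original position."""
    kept = []
    for position, term in enumerate(tokens, start=1):
        key = term.lower() if ignore_case else term
        if key in stopwords:
            # The position is consumed, which leaves a gap in the output.
            continue
        kept.append((term, position))
    return kept


stopwords = {"is", "this"}
print(stop_filter("this is solr search".split(), stopwords))
print(stop_filter("solr is this search".split(), stopwords))
```

In the first sentence solr and search end up at adjacent positions, in the second they are three positions apart, so a phrase query can still tell the two apart after the stopwords are gone.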

Other elements

<uniqueKey>id</uniqueKey>

The document's unique identifier. This field must be supplied (unless it is marked required="false"); otherwise Solr reports an error when indexing.

<defaultSearchField>text</defaultSearchField>

The field that is searched by default when the query does not name a specific field.

<solrQueryParser defaultOperator="OR"/>

Sets the logic applied between the terms of a query; the value is either AND or OR.
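
How the default field and the default operator combine can be sketched as follows. This is a toy expansion of a whitespace-separated query, not Solr's query parser: it does not handle quoted phrases, explicit AND/OR keywords or any other query syntax, and expand_query is a hypothetical helper.

```python
def expand_query(query, default_field="text", default_operator="OR"):
    """Show which field and operator a bare query would implicitly use."""
    if default_operator not in ("AND", "OR"):
        raise ValueError("default_operator must be AND or OR")
    clauses = []
    for term in query.split():
        # Terms that already name a field keep it; bare terms get the default field.
        clauses.append(term if ":" in term else f"{default_field}:{term}")
    return f" {default_operator} ".join(clauses)


print(expand_query("solr schema"))
print(expand_query("title:solr schema", default_operator="AND"))
```

With OR, a document containing either term matches; switching the default to AND requires every term to be present, which usually returns fewer but more precise results.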
