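Elasticsearch as I Understand It (Part 3)

match query

A match query is a hash map containing the field to search and the string to search for. By default, match uses boolean behavior with the OR operator, so a document matches if it contains any one of the query terms. To require all of the terms at once, set the operator field to and.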

curl 'localhost:9200/get-together/_search' -d ' {
"query" : {
    "match" : {
        "name" : {
            "query" : "elasticsearch",
            "operator" : "and"
        }
    }
}
}'
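The difference between the two operators can be sketched in a few lines of Python. This is only an illustration of the matching rule, not how Elasticsearch implements it; the `tokenize` and `match` helpers and the sample strings are made up for the example, and the regex split is a rough stand-in for real analysis.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a rough stand-in for analysis)."""
    return re.findall(r"\w+", text.lower())

def match(field_value, query, operator="or"):
    """Return True if the field matches the query under the given operator."""
    field_terms = set(tokenize(field_value))
    hits = [term in field_terms for term in tokenize(query)]
    if operator == "and":
        return all(hits)  # every query term must be present
    return any(hits)      # default OR: a single term is enough

print(match("Elasticsearch Denver", "elasticsearch san francisco"))         # True
print(match("Elasticsearch Denver", "elasticsearch san francisco", "and"))  # False
```

With a single-word query such as the one in the request above, both operators behave the same; the operator only matters once the query string is analyzed into several terms.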

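phrase query

A phrase query matches the words as a phrase and lets you allow a gap between them: slop is the number of positions by which the tokens of the phrase may be separated.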

curl 'localhost:9200/get-together/group/_search' -d ' {
"query" : {
    "match" : {
        "name" : {
            "type" : "phrase",
            "query": "land row",
            "slop" : 1
         }
    }
},
"_source" : ["name" , "description"]
}'
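To make slop more concrete, here is a simplified Python sketch. It assumes the phrase terms must appear in order and counts only the extra positions between them; real sloppy phrase matching in Lucene is more general (it can also match reordered terms at a higher cost). The `phrase_match` helper and the sample text are hypothetical.

```python
def phrase_match(doc_tokens, phrase_tokens, slop=0):
    """Return True if the phrase terms occur in order with at most `slop` extra positions."""
    if not phrase_tokens:
        return False
    # Positions at which each phrase term occurs in the document.
    positions = [
        [i for i, token in enumerate(doc_tokens) if token == term]
        for term in phrase_tokens
    ]

    def search(index, previous, first):
        if index == len(phrase_tokens):
            # Extra positions spanned beyond an exact, adjacent phrase.
            gaps = (previous - first) - (len(phrase_tokens) - 1)
            return gaps <= slop
        return any(
            search(index + 1, pos, pos if index == 0 else first)
            for pos in positions[index]
            if pos > previous
        )

    return search(0, -1, -1)

doc = "land of row houses".split()
print(phrase_match(doc, ["land", "row"], slop=0))  # False
print(phrase_match(doc, ["land", "row"], slop=1))  # True
```

With slop 0 the two words have to be adjacent; with slop 1 one intervening token is tolerated.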

multi_match query

Matches the query terms against several fields at once.

curl 'localhost:9200/get-together/_search' -d ' {
"query" : {
    "multi_match" : {
        "query" : "elasticsearch",
        "fields" : [ "name" , "description" ]   
    }
}
}'

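bool query

A bool query lets you combine any number of queries in a single query, qualifying each with one of the clauses must, should, or must_not. With must, only documents that match these queries are returned. With should, a document has to match at least as many should clauses as minimum_should_match requires (the example below sets it to 1). With must_not, matching documents are removed from the result set.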

curl 'localhost:9200/get-together/_search' -d ' {
"query" : {
    "bool" : {
        "must" : [
            {
            "term" : {
                "attendees" : "david"
            }
            }
        ],
         "should" : [
            {
             "term" : {
                "attendees" : "clint"
            }
            },
            {
             "term" : {
                 "attendees" : "andy"
            }
            }
        ],
        "must_not" : [
            {
             "range" : {
                 "date" : {
                     "lt" : "2019-01-08"
                  }
              }
             }
         ],
         "minimum_should_match" : 1
        }
    }
}'

The bool filter works the same way as the query version, except that it combines filters instead of queries.
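The way the three clause types combine can be sketched as plain boolean logic in Python. This is a minimal illustration that ignores scoring; `term`, `bool_match`, and the sample documents are hypothetical names introduced for the example.

```python
def term(field, value):
    """Build a predicate that is true when `value` is among the document's field values."""
    return lambda doc: value in doc.get(field, [])

def bool_match(doc, must=(), should=(), must_not=(), minimum_should_match=0):
    """Decide whether a document passes a bool-style combination of clauses."""
    if not all(clause(doc) for clause in must):
        return False  # every must clause has to match
    if any(clause(doc) for clause in must_not):
        return False  # any must_not match excludes the document
    should_hits = sum(1 for clause in should if clause(doc))
    return should_hits >= minimum_should_match

# ISO-formatted date strings compare correctly as plain strings.
before_cutoff = lambda doc: doc["date"] < "2019-01-08"

clauses = dict(
    must=[term("attendees", "david")],
    should=[term("attendees", "clint"), term("attendees", "andy")],
    must_not=[before_cutoff],
    minimum_should_match=1,
)

print(bool_match({"attendees": ["david", "andy"], "date": "2019-03-01"}, **clauses))  # True
print(bool_match({"attendees": ["david"], "date": "2019-03-01"}, **clauses))          # False
print(bool_match({"attendees": ["david", "andy"], "date": "2019-01-01"}, **clauses))  # False
```

The second document fails because no should clause matches, the third because the must_not range excludes it.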

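Analyzing data

Analysis is the processing Elasticsearch performs on the content of a document before the document is added to the inverted index. Before indexing, every analyzed field goes through a series of steps:

Character filtering: transforms characters using character filters,

Tokenization: splits the text into one or more tokens,

Token filtering: transforms each token using token filters,

Token indexing: stores the tokens in the index.

An analyzer is made up of zero or more character filters, one tokenizer, and zero or more token filters.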
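The steps above can be sketched as a small Python pipeline. It only imitates the idea of a mapping character filter, a letter tokenizer, and a lowercase token filter; the function names and the sample sentence are assumptions made for the illustration, and the behavior is an approximation of what Elasticsearch does.

```python
import re

def mapping_char_filter(text, mappings):
    """Character filter: replace each source string with its target."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text

def letter_tokenizer(text):
    """Tokenizer: split on anything that is not a letter."""
    return re.findall(r"[^\W\d_]+", text)

def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]

def analyze(text):
    """Run the chain: character filter -> tokenizer -> token filter."""
    text = mapping_char_filter(text, {"ph": "f"})
    tokens = letter_tokenizer(text)
    return lowercase_filter(tokens)

print(analyze("Share your photos, PLEASE"))  # ['share', 'your', 'fotos', 'please']
```

The resulting list is what would be stored in the inverted index in the final step.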

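Using analyzers:

There are two ways to specify the analyzers that fields can use:

(1) When creating an index, as settings of that particular index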

curl 'localhost:9200/newindex' -d '
{
    "settings" : {                       <--- settings
        "number_of_shards" : 2,          <--- primary shards
        "number_of_replicas" : 1,        <--- replica shards
        "index" : {                      <--- index settings
            "analysis" : {               <--- analysis settings of the index
                "analyzer" : {
                    "myCustomAnalyzer" : {    <--- custom analyzer defined in the analyzer object
                        "type" : "custom",
                        "tokenizer" : "myCustomTokenizer",
                        "filter" : ["myCustomFilter1", "myCustomFilter2"],
                        "char_filter" : ["myCustomCharFilter"]
                    }
                },
                "tokenizer" : {          <--- custom tokenizer
                    "myCustomTokenizer" : {
                        "type" : "letter"
                    }
                },
                "filter" : {             <--- custom token filters
                    "myCustomFilter1" : {
                        "type" : "lowercase"
                    },
                    "myCustomFilter2" : {
                        "type" : "kstem"
                    }
                },
                "char_filter" : {        <--- custom character filter
                    "myCustomCharFilter" : {
                        "type" : "mapping",
                        "mappings" : ["ph=>f", "u=>you"]
                    }
                }
            }
        }
    },
    "mappings" : {                       <--- index mappings
            ...
    }
}'

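(2) In the Elasticsearch configuration file, as global analyzers. These are set in elasticsearch.yml. The content is no different from the above: copy the part from index: up to (but not including) the mappings into elasticsearch.yml, written as YAML.

Specifying the analyzer of a field in the mapping: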

{
    "mappings" : {
        "document" : {
            "properties" : {
                "description" : {
                    "type" : "string",
                    "analyzer" : "myCustomAnalyzer"
                 }
            }
         }
     }  
}

If a field should not be analyzed at all, set its index setting to not_analyzed.

Using the analyze API:

Analyzing the text "hello ,everyone, this is a good boy" with the standard analyzer:

curl -XPOST 'localhost:9200/_analyze?analyzer=standard' -d 'hello ,everyone, this is a good boy '

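The analyzer parameter selects which analyzer to use. An analyzer that was customized when the index was created can also be referred to by name, but not through the bare /_analyze endpoint: the index has to be specified first.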

curl -XPOST 'localhost:9200/get-together/_analyze?analyzer=myCustomAnalyzer' -d  'share your experience with Nosql'

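Elasticsearch ships with many built-in analyzers, for example:

Standard analyzer: the default analyzer for text, made up of the standard tokenizer, the standard token filter, the lowercase token filter, and the stop token filter;

Simple analyzer: uses only the lowercase tokenizer, which splits on non-letters and lowercases the tokens;

Whitespace analyzer: splits the text into tokens on whitespace;

Stop analyzer: behaves like the simple analyzer and additionally filters out stop words;

Keyword analyzer: treats the whole field as a single token;

Others include the pattern analyzer and the snowball analyzer.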
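To see how these analyzers differ, the following Python sketch approximates four of them on the sample sentence used earlier. These are rough imitations written for this post, not the Elasticsearch implementations, and the stop word list is a tiny made-up set rather than the real default list.

```python
import re

STOP_WORDS = {"a", "is", "the", "this"}  # illustrative only, not Elasticsearch's default list

def whitespace_analyzer(text):
    """Split on whitespace and leave the tokens untouched."""
    return text.split()

def simple_analyzer(text):
    """Split on non-letters and lowercase the tokens."""
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]

def stop_analyzer(text):
    """Like the simple analyzer, but drop stop words."""
    return [token for token in simple_analyzer(text) if token not in STOP_WORDS]

def keyword_analyzer(text):
    """Keep the whole input as one token."""
    return [text]

text = "hello ,everyone, this is a good boy"
print(whitespace_analyzer(text))  # ['hello', ',everyone,', 'this', 'is', 'a', 'good', 'boy']
print(simple_analyzer(text))      # ['hello', 'everyone', 'this', 'is', 'a', 'good', 'boy']
print(stop_analyzer(text))        # ['hello', 'everyone', 'good', 'boy']
print(keyword_analyzer(text))     # ['hello ,everyone, this is a good boy']
```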

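Tokenizers include: the standard, keyword, letter, lowercase, whitespace, and pattern tokenizers.

Token filters include: the standard, lowercase, length, stop, and ASCII folding token filters, among many others.

A brief introduction to scoring in Elasticsearch:

Scoring in Elasticsearch is a formula that takes the document under consideration as input and uses several factors to determine that document's score. More relevant documents are returned first, and in Elasticsearch this measure of relevance is called the score. To compute it, Elasticsearch needs information about the terms being searched for: term frequency and inverse document frequency.

Term frequency:

The number of times a term occurs in a document. The more often a term occurs in a document, the higher that document's score.

Inverse document frequency:

The more documents of the index a term occurs in, the less important it is. Inverse document frequency only checks whether a term occurs in a document, not how many times it occurs. For example, "the" occurs in almost every document, which shows that it carries little weight.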
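The two factors can be tried out on a toy corpus in Python. The formulas below follow Lucene's classic TF-IDF similarity as I understand it (square root of the raw frequency, and 1 + ln(numDocs / (docFreq + 1))); the real score involves further factors such as field norms, so treat this as a sketch. The corpus and function names are invented for the example.

```python
import math

# A made-up corpus of three tiny documents, already split into tokens.
docs = [
    "the elasticsearch meetup in the city".split(),
    "the weather in denver".split(),
    "the big data meetup".split(),
]

def term_frequency(term, tokens):
    """Square root of how often the term occurs in one document."""
    return math.sqrt(tokens.count(term))

def inverse_document_frequency(term, all_docs):
    """Higher for terms that occur in few documents; occurrence counts are ignored."""
    doc_freq = sum(1 for tokens in all_docs if term in tokens)
    return 1 + math.log(len(all_docs) / (doc_freq + 1))

for term in ("the", "elasticsearch"):
    tf = term_frequency(term, docs[0])
    idf = inverse_document_frequency(term, docs)
    print(term, round(tf, 3), round(idf, 3))
```

In this corpus "the" has the higher term frequency in the first document but the lower inverse document frequency, since it appears in every document.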

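Other scoring methods include: Okapi BM25, divergence from randomness (DFR), information-based (IB), LM Dirichlet similarity, and LM Jelinek-Mercer similarity.

Reference: "Elasticsearch in Action" (《Elasticsearch 实战》). If you find any mistakes, please point them out. Some of the code examples come from this book.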