`ElasticSearch` Advanced Operations

`ElasticSearch` Advanced Operations: Search

[yangqi@xiaoer ~]$ curl -XGET 'http://xiaoer:9200/library/books/_search?pretty'

# Return only the specified fields
[yangqi@xiaoer ~]$ curl -XGET 'http://xiaoer:9200/library/books/_search?pretty&_source=name,price'

# Start at offset 0 and return 2 documents
[yangqi@xiaoer ~]$ curl -XGET 'http://xiaoer:9200/library/books/_search?pretty&from=0&size=2'

# Equivalent to: select * from books where name.first = 'li'
[yangqi@xiaoer ~]$ curl -XGET 'http://xiaoer:9200/library/books/_search?pretty&q=name.first:li'

# Equivalent to: select * from books where price = 78.3 or price = 2000
[yangqi@xiaoer ~]$ curl -XGET 'http://xiaoer:9200/library/books/_search?pretty&q=price:(78.3%20OR%202000)'

# Equivalent to: select * from books where name.first = 'sun' and price = 78.3
[yangqi@xiaoer ~]$ curl -XGET 'http://xiaoer:9200/library/books/_search?pretty&q=name.first:sun%20AND%20price:78.3'

# Equivalent to: select * from books where name.first like 'l%'
[yangqi@xiaoer ~]$ curl -XGET 'http://xiaoer:9200/library/books/_search?pretty&q=name.first:l*'

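# Additional notes on search parameters
q: the query string, e.g. q=aa or q=user:aa
df: the default field to search when q does not name a field; if omitted, es searches all fields
sort: sort order, asc for ascending, desc for descending
timeout: the search timeout; by default there is no timeout
from, size: used for pagination
term: a b is treated as separate terms ==> a or b
phrase: a b is matched as one whole phrase ==> "a b"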
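
As a rough illustration of how these parameters combine into one request URL, here is a minimal Python sketch. The helper name `build_search_url` is hypothetical, and the host and index are simply taken from the examples above; it only assembles the URL and does not contact a server.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumption: host, index and type are copied from the curl examples above.
BASE_URL = "http://xiaoer:9200/library/books/_search"


def build_search_url(q, df=None, sort=None, from_=0, size=10, timeout=None):
    """Assemble a URI search URL from the parameters listed above."""
    params = {"q": q, "from": from_, "size": size}
    if df is not None:
        params["df"] = df            # default field when q names none
    if sort is not None:
        params["sort"] = sort        # e.g. "price:desc"
    if timeout is not None:
        params["timeout"] = timeout  # e.g. "1s"
    # urlencode percent-encodes reserved characters in every value
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("hadoop", df="title", sort="price:desc", from_=0, size=2)
print(url)
```

Building the query string with `urlencode` rather than by hand avoids the escaping mistakes that are easy to make when typing URLs directly.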

# Generic query: return every document in the library index that contains the given term in any field
[yangqi@xiaoer ~]$ curl -XGET 'http://xiaoer:9200/library/_search?pretty&q=es'

# Query profile (execution plan)
[yangqi@xiaoer ~]$ curl -H 'Content-Type: application/json' -XGET 'http://xiaoer:9200/library/_search?pretty&q=hadoop' -d '{ "profile": true }'

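# Boolean operators
(1) AND (&&), OR (||), NOT (!)
Example: name:(tom NOT lee)
# The name field contains tom and must not contain lee
(2) + and - correspond to must and must_not
Example: name:(tom +lee -alfred)
# The name field must contain lee, must not contain alfred, and may contain tom
Note: in a URL, + is decoded as a space, so it must be percent-encoded as %2B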
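
The encoding issue can be checked locally. The sketch below uses only the Python standard library; the query string is the example from above, and nothing is sent to a server.

```python
from urllib.parse import quote, unquote_plus

query = "name:(tom +lee -alfred)"

# quote() with safe="" percent-encodes every reserved character,
# so "+" becomes %2B and a space becomes %20.
encoded = quote(query, safe="")
print(encoded)  # name%3A%28tom%20%2Blee%20-alfred%29

# Decoding the encoded form restores the original query, "+" included.
print(unquote_plus(encoded))

# Decoding the raw query shows the problem: the "+" turns into a space.
print(unquote_plus(query))
```

The last line prints the query with the `+` replaced by a space, so the must operator is silently lost when the query is not encoded.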

# Range (comparison) operators
age:>=1
age:(>=1 && <=10) or age:(+>=1 +<=10)

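# Wildcard queries
# ?: matches exactly 1 character
# *: matches 0 or more characters
Note: wildcard matching is slow and uses a lot of memory, so avoid it where possible; unless there is a specific need, do not put ? or * at the start of a term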
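
To get a feel for the two wildcards, the following sketch imitates them with Python's `fnmatch` module on a hypothetical list of terms. This is only an approximation: `fnmatch` also gives `[...]` a special meaning, and a real wildcard query runs against the terms stored in the index.

```python
from fnmatch import fnmatchcase

# Hypothetical terms, loosely based on the name.first values used in these notes.
terms = ["li", "l", "sun", "yang", "zhang"]


def wildcard_matches(pattern, candidates):
    """Return the candidates that the whole pattern matches."""
    return [t for t in candidates if fnmatchcase(t, pattern)]


print(wildcard_matches("l*", terms))    # "*" matches zero or more characters
print(wildcard_matches("l?", terms))    # "?" matches exactly one character
print(wildcard_matches("*ang", terms))  # leading wildcard: every term must be checked
```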

# Regular expressions
name:/[mb]oat/
# Fuzzy matching (fuzzy query)
name:roam~1
Matches terms within 1 character edit of roam, such as foam and roams
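
The "1 character edit" above is an edit distance. As a sketch, the function below computes the plain Levenshtein distance (insert, delete, substitute) and filters a hypothetical list of candidate terms. Note that ElasticSearch by default also counts a swap of two adjacent characters as a single edit, which this simplified version does not model.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


# Hypothetical candidate terms; a real fuzzy query checks the indexed terms.
candidates = ["foam", "roams", "boat", "roam"]
within_one = [t for t in candidates if edit_distance("roam", t) <= 1]
print(within_one)  # ['foam', 'roams', 'roam']
```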
# Proximity search
"fox quick"~5
Differences are measured in whole terms, so this matches e.g. "quick fox" and "quick brown fox"

# match query
# Using the Kibana Dev Tools console
GET library/_search
{
  "profile": "true",
  "query": {
    "match": {
      "name.first": "yang"
    }
  }
}

# The operator parameter controls how the query terms are combined; the options are or and and
GET library/_search
{
  "query": {
    "match": {
      "name.first": {
        "query": "hadoop",
        "operator": "and"
      }
    }
  }
}
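
The effect of `operator` is easiest to see with a query of more than one word. The sketch below is a deliberately simplified model, not how ElasticSearch scores or analyzes text: it splits on whitespace and checks term membership. It uses the sample titles from these notes as assumed data.

```python
def matches(doc_text, query_text, operator="or"):
    """Simplified match semantics: 'or' needs any term, 'and' needs all terms."""
    doc_terms = set(doc_text.lower().split())
    query_terms = query_text.lower().split()
    combine = all if operator == "and" else any
    return combine(term in doc_terms for term in query_terms)


# Assumed sample data: titles taken from the bulk examples in these notes.
titles = ["this is hadoop book", "this is php book"]

or_hits = [t for t in titles if matches(t, "hadoop book", "or")]
and_hits = [t for t in titles if matches(t, "hadoop book", "and")]
print(or_hits)   # both titles contain "book"
print(and_hits)  # only the hadoop title contains both terms
```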

`ElasticSearch` Advanced Operations: Bulk

# Using the Kibana Dev Tools console
# Insert multiple documents
POST library/books/_bulk?pretty
{"index":{"_id":5}}
{"title":"this is hadoop book","name":{"first":"sun","last":"yong"},"publish_date":"2009-03-04","price":400}
{"index":{"_id":6}}
{"title":"this is hadoop book","name":{"first":"yang","last":"yong"},"publish_date":"2010-09-07","price":40}

# Save the following content to a file named books.json
{"index":{"_id":7}}
{"title":"this is php book","name":{"first":"sun","last":"zhao"},"publish_date":"2011-03-03","price":30.0}
{"index":{"_id":8}}
{"title":"this is php book","name":{"first":"yang","last":"wang"},"publish_date":"2010-10-03","price":50.0}
{"index":{"_id":9}}
{"title":"this is php book","name":{"first":"zhang","last":"san"},"publish_date":"2019-10-10","price":40.0}

# Import an existing file into the index (the file must end with a newline)
[yangqi@xiaoer ~]$ curl -H 'Content-Type: application/json' -XPOST 'http://xiaoer:9200/library/books/_bulk?pretty' --data-binary @/home/yangqi/books.json

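# Bulk request notes
	a) A bulk request can name its target in the URL as /{index}/_bulk or /{index}/{type}/_bulk
	b) How much data a single bulk request can handle
		Bulk loads the data to be processed into memory, so the amount of data is limited
		There is no single optimal size; it depends on your hardware, your document size and complexity, and your indexing and search load
		A common recommendation is 1000~5000 documents per request; if your documents are large, reduce the count. A request size of 5~15 MB is suggested. By default a request cannot exceed 100 MB; this limit can be changed in the es configuration file
		http.max_content_length: 100mb
	c) Using bulk operations flexibly can greatly improve program throughput, but the batch size has a practical ceiling and cannot grow without limit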
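
One way to respect both the document-count and size recommendations above is to split the data into several bulk bodies on the client side. The generator below is a minimal sketch under that assumption: `bulk_batches` is a hypothetical helper, the default limits are illustrative values within the ranges mentioned above, and it only builds the NDJSON text without sending anything.

```python
import json


def bulk_batches(docs, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield NDJSON bulk bodies with at most max_docs documents each.

    A batch is also closed before it would exceed max_bytes, unless it
    holds a single oversized document. docs is an iterable of
    (doc_id, source_dict) pairs.
    """
    lines, size, count = [], 0, 0
    for doc_id, source in docs:
        action = json.dumps({"index": {"_id": doc_id}})
        body = json.dumps(source)
        # +2 accounts for the newline that follows each of the two lines
        pair_size = len(action.encode("utf-8")) + len(body.encode("utf-8")) + 2
        if count and (count >= max_docs or size + pair_size > max_bytes):
            yield "\n".join(lines) + "\n"
            lines, size, count = [], 0, 0
        lines += [action, body]
        size += pair_size
        count += 1
    if lines:
        yield "\n".join(lines) + "\n"  # a bulk body must end with a newline


docs = [(i, {"title": "this is php book", "price": 10.0 * i}) for i in range(5)]
batches = list(bulk_batches(docs, max_docs=2))
print(len(batches))  # 3 batches: 2 + 2 + 1 documents
```

Each yielded string could then be posted to the `_bulk` endpoint as one request body.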

`ElasticSearch` Advanced Operations: Aggregations

# Count how many times each price value occurs
[yangqi@xiaoer ~]$ curl -H 'Content-Type: application/json' -XGET 'http://xiaoer:9200/library/books/_search?pretty' -d '{ "aggs": { "ALL_NAMES": { "terms": {"field": "price"} } } }'

# Run the same aggregation over only the documents that match the query
[yangqi@xiaoer ~]$ curl -H 'Content-Type: application/json' -XGET 'http://xiaoer:9200/library/books/_search?pretty' -d '{ "query": { "match": { "price": "40" } }, "aggs": { "ALL_NAMES": { "terms": {"field": "price"} } } }'
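
Conceptually, a terms aggregation groups documents by a field value and counts each group. The sketch below reproduces that idea in plain Python with `collections.Counter`. The price list is an assumption: it contains only the five sample documents from the bulk section (ids 5 to 9), while a real index would likely hold more.

```python
from collections import Counter

# Assumption: prices of the five sample documents with ids 5 to 9.
prices = [400, 40, 30.0, 50.0, 40.0]

# Terms aggregation: one bucket per distinct value, largest doc_count first.
buckets = [{"key": key, "doc_count": n} for key, n in Counter(prices).most_common()]
print(buckets[0])  # {'key': 40, 'doc_count': 2}

# Query plus aggregation: filter first, then count only the matching documents.
matching = [p for p in prices if p == 40]
filtered_buckets = [{"key": key, "doc_count": n} for key, n in Counter(matching).most_common()]
print(filtered_buckets)  # [{'key': 40, 'doc_count': 2}]
```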

`ElasticSearch` Advanced Operations: Analysis (Tokenization)

[yangqi@xiaoer ~]$ curl -H 'Content-Type: application/json' -XPOST 'http://xiaoer:9200/_analyze?pretty' -d '{ "analyzer": "standard", "text": "hello world" }'

[yangqi@xiaoer ~]$ curl -H 'Content-Type: application/json' -XPOST 'http://xiaoer:9200/_analyze?pretty' -d '{ "analyzer": "standard", "text": "我是一名工程师" }'

# Using the Kibana Dev Tools console
# Custom analyzer (a tokenizer plus token filters)
POST _analyze?pretty
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Hello WORLD"
}
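
The request above chains a tokenizer with a lowercase token filter. A rough Python stand-in for that pipeline is sketched below. The `\w+` regular expression is only an approximation of the standard tokenizer, which follows Unicode text segmentation rules, so results can differ on real text.

```python
import re


def analyze(text):
    """Approximate pipeline: split into word tokens, then lowercase each one."""
    tokens = re.findall(r"\w+", text)   # rough stand-in for the standard tokenizer
    return [t.lower() for t in tokens]  # lowercase token filter


print(analyze("Hello WORLD"))  # ['hello', 'world']
```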

Built-in ElasticSearch analyzers

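Analyzer	Characteristics
Standard (es default)	Supports multiple languages; splits on word boundaries and lowercases
Simple	Splits on non-letter characters and lowercases
Whitespace	Splits on whitespace
Stop	Removes stop words such as the, an, 的, 这
Keyword	No tokenization; the whole input is kept as a single token
Pattern	Regex-based splitting; the default pattern is \W+, i.e. non-word characters act as separators
Language	Analyzers for common languages (30+)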

Chinese analyzers

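Name	Description	Features	URL
IK	Tokenizes Chinese and English words	Custom dictionaries	https://github.com/medcl/elasticsearch-analysis-ik
Jieba	A popular Python word segmenter with segmentation and part-of-speech tagging	Supports Traditional Chinese, custom dictionaries, and parallel segmentation	http://github.com/sing1ee/elasticsearch-jieba-plugin
Hanlp	A Java toolkit made up of a set of models and algorithms	Aims to bring natural language processing into production use	https://github.com/hankcs/HanLP
THULAC	Lexical analysis toolkit from Tsinghua University	Chinese word segmentation and part-of-speech tagging	https://github.com/microbun/elasticsearch-thulac-plugin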

# Character Filters
POST _analyze?pretty
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<div><hi>B (+)Trees</hi></div>"
}

The ik analyzer

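# The ik analyzer provides two modes: ik_smart and ik_max_word
# ik_smart: coarsest segmentation, producing the fewest tokens
# ik_max_word: finest-grained segmentation, producing the most tokens
# Using the Kibana Dev Tools console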
POST _analyze?pretty
{
  "analyzer": "ik_max_word",
  "text": "我是一名大数据工程师"
}

POST _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "我是一名大数据工程师"
}

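# Stop words: words that carry no meaning for search and are left out of the token output, e.g. 是, 的, is, a, an
# Extension dictionary: an extension word can be a whole term or a substring of one, e.g. 中华人民共和国: 中华人
# Synonyms: mappings between words of similar meaning, e.g. 中 ==> 好, 晓得 ==> 知道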
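
To make the stop-word and synonym ideas concrete, here is a small sketch that applies both to a list of tokens that has already been produced by an analyzer. The word lists are illustrative assumptions drawn from the examples above, not the dictionaries shipped with any plugin.

```python
# Assumed, illustrative word lists based on the examples above.
STOP_WORDS = {"is", "a", "an", "是", "的"}
SYNONYMS = {"晓得": "知道"}


def filter_tokens(tokens):
    """Drop stop words, then map each remaining token to its synonym if one exists."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [SYNONYMS.get(t, t) for t in kept]


print(filter_tokens(["this", "is", "a", "hadoop", "book"]))  # ['this', 'hadoop', 'book']
print(filter_tokens(["我", "晓得", "的"]))                    # ['我', '知道']
```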
