ClickHouse主键探讨[译文+补充]

http://www.clickhou***.cn/topic/5a366e48828d76d75ab5d59e

本文背景

Our guest author is ClickHouse evangelist Yegor Andreenko. Follow him on Twitter @f1yegor.
This is a cross-post from: ([https://medium.com/@f1yegor/clickhouse-primary-keys-2cf2a45d7324](https://medium.com/@f1yegor/clickhouse-primary-keys-2cf2a45d7324))
Special thanks to Alexey Milovidov, ClickHouse developer, for providing material for this article.

译文

我们的客座嘉宾，是ClickHouse的布道师 Yegor Andreenko，这是他的Twitter：@f1yegor
这篇文章是从这里转载过来的： [https://medium.com/@f1yegor/clickhouse-primary-keys-2cf2a45d7324](https://medium.com/@f1yegor/clickhouse-primary-keys-2cf2a45d7324)
特别感谢ClickHouse的开发者Alexey Milovidov为本文提供的原始素材。

正文

Recently I dived deep into ClickHouse . ClickHouse is column-store database by Yandex with great performance for analytical queries. For example check benchmark and post of Mark Litwintschik.
This is the translation of answer given by Alexey Milovidov (creator of ClickHouse) about composite primary key .

译文

最近我开始深入研究ClickHouse。ClickHouse是Yandex开源的高性能的列式分析数据库。关于性能测试，可以看看这两篇：
- 官方Benchmark
- Mark Litwintschik的压测文章.
这篇文章，其实是对ClickHouse的开发者Alexey Milovidov，在Google网上论坛，对ClickHouse主键问题回复的翻译和总结（原贴为俄文）。

原问题

How many columns primary key could have? And what is layout of data on storage drive? Are there any theoretical/practical limits?
Could columns with missing data at some rows be part of the primary key?

CK的主键最多可以有多少个列组成？主键在磁盘上的存储形式是什么样的？理论或者实际使用过程中，有什么约束和限制么？

如果某些列缺省了某些值，可以作为主键么？

ClickHouse的主键

Data in table of MergeTree type stored in set of multiple parts. On average you could expect little number of parts (units-tens per month).
In every part data stored sorted lexicographically by primary key. For example, if your primary key — (CounterID, Date), than rows would be located sorted by CounterID, and for rows with the same CounterID — sorted by Date.
Data structure of primary key looks like an array of marks — it’s values of primary key every index_granularity rows.
index_granularity — settings of MergeTree engine, default to 8192.

译文

MergeTree 系列的引擎，数据是由多组part文件组成的，一般来说，每个月（译者注：CK目前最小分区单元是月）会有几个part文件（这里的part就是block）。
每一个part的数据，是按照主键进行字典序排列。例如，如果你有一个主键是(CounterID, Date)，数据行会首先按照CounterID排序，如果CounterID相同，按照Date排序。
主键的数据结构，看起来像是标记文件组成的矩阵，这个标记文件就是每间隔index_granularity（索引粒度）行的主键值。
MergeTree引擎中，默认的index_granularity设置是8192。

举例

We say that primary key is sparse index of sorted data. Let’s visualise it with only one part. (I should have equal length between marks, but it’s a bit imperfect to draw asci-art here):
It’s convenient to represent marks as marks of ruler. Primary key allows effectively read range of data. For select ClickHouse chooses set of mark ranges that could contain target data.

主键是有序数据的稀疏索引。我们用图的方式看一部分的数据（原则上，图中应该保持标记的平均长度，但是用ASCI码的方式不太方便）。
mark文件，就像一把尺子一样。主键对于范围查询的过滤效率非常高。对于查询操作，CK会读取一组可能包含目标数据的mark文件。

This way, if you select CounterID IN (‘a’, ‘h’), server reads data with mark ranges [0, 3) and [6, 8).
例如，如果你的查询条件是CounterID IN (‘a’, ‘h’)，服务器将会读取标记文件为[0, 3]和[6, 8]之间对应的数据文件。
if you select CounterID IN (‘a’, ‘h’) AND Date = 3, server reads data with mark ranges [1, 3) and [7, 8).
如果你的查询条件是CounterID IN (‘a’, ‘h’)并且指定了Date = 3，服务器将会读取标记文件为[1, 3)和[7, 8)之间对应的数据文件。

Sometimes primary key works even if only the second column condition presents in select:
有时，主键的过滤效果并不是很好，比如，只有第二列出现在查询条件中：
if you select Date = 3, server reads data with mark ranges [1, 10).
如果查询条件只是Date = 3，服务器讲读取 [1, 10)之间对应的数据文件。
In our example it’s all marks except 0 — this is 90% of data. In this case index isn’t really effective, but still allows to skip part of data.
在上述例子中，mark文件除了0，其他90%的数据都需要扫描，虽然索引过滤效果不好，但是，仍然是可以跳过一些数据的。
On the other hand, if we have more data for one CounterID, index allows to skip wider ranges of Date in data.
另一方面，如果每个CounterID对应多条数据，索引将会跳过更多的Date数据。（？？？）
In any case, usage of index never could be less efficient than full scan.
综合来讲，使用索引，总是会比全表扫描要高效一些的。

Sparse index could read unnecessary rows: during read of one range of primary key, index_granularity * 2 unnecessary rows in every part. It’s normal and you shouldn’t try to reduce index_granularity. ClickHouse designed to work effective with data by large batches of rows, that’s why a bit of additional column during read isn’t hurt the performance. index_granularity = 8192 — good value for most cases.
稀疏索引会读取很多不必要的数据：读取primary key的每一个part，会多读取index_granularity * 2的数据。这对于稀疏索引来说很正常，也没有必要减少index_granularity的值。ClickHouse的设计，致力于高效的处理海量数据，这就是为什么一些多余的读取并不会有损性能。index_granularity=8192对于大多数场景都是比较好的选择。
Sparse index allows to work with tables that have enormous number of rows. And it always fits in RAM.
稀疏索引适用于海量数据表，并且，稀疏索引文件本身，放到内存是没有问题的。

Primary key isn’t unique. You can insert many rows with the same value of primary key.
主键并不是唯一的，可以插入主键相同的数据行。
Primary key can also contain functional expressions.
- Example: (CounterID, EventDate, intHash32(UserID))
主键的构成，同样可以存在函数表达式。
- 如，(CounterID, EventDate, intHash32(UserID))
Above it’s used to mix up the data of particular UserID for every tuple CounterID, EventDate. By-turn it’s used in sampling ( https://clickhouse.yandex/reference_en.html#SAMPLE clause).
上述例子中，通过使用Hash函数，把特定的UserID对应的CounterID和EventDate做了聚合，顺便，这种聚合方式，可以在SAMPLE这个功能中利用到。

总结

Let’s sum up what choice of primary key affects:
1. The most important and obvious: primary key allows to read less data during SELECT queries.
  As shown in examples above it’s usually doesn’t make sense to include many columns into primary key for this purpose. Let’s say you have primary key (a, b). By adding one more column c: (a, b, c) makes sense only if it conforms with both conditions:
  - if you have queries with filter for this column;
  - in your data could be quite long (several time bigger than index_granularity) ranges of data with the same values of (a, b).
  In other words when adding one more column will allow to skip big enough ranges of data.
2. Data is sorted by primary key. That way data is more compressable. Sometimes it happens that by adding one more column into primary key data could be compressed better.
3. When you use different kinds of MergeTree with additional logic in merge: CollapsingMergeTree, SummingMergeTree and etc., primary key affects merge of data. For this reason it might be necessary to use more columns in primary key even when it’s not necessary for point 1.

总结一下，主键改变会带来哪些影响：
1. 显而易见，主键最重要的作用，就是减少了查询操作中过多的数据读取。如上面的例子，给主键增加过多的列，并没有太大的意义。
  举例来说，你有一个主键是(a, b)，增加一个列c，主键变成(a, b, c)，这种变动，只在下列情况有作用：
  1. 你的查询使用到了c列
  2. 以(a, b)组合，去重后的值，要远大于index_granularity的值。
换句话说，增加一个列到主键中去，可以跳过很多的数据。
1. 数据是按照主键排序的，这样更容易做数据压缩。有时，主键增加一列后，会带来更好的数据压缩效果。
2. 当使用MergeTree家族的其他引擎，比如CollapsingMergeTree, SummingMergeTree等，主键影响数据的merge操作。考虑到这个原因，有必要增加主键的宽度，而忽略第一种情况。

Number of columns into primary key isn’t limited explicitly. Long primary key is usually useless. In real use case the maximum that I saw was ~20 columns (for SummingMergeTree), but I don’t recommend this variant.
Long primary key will negatively affect insert performance and memory usage.
Long primary key will not negatively affect the performance of SELECT queries.

组成主键的列的数量，并没有明确规定。过长的主键通常来说没啥用。在日常经验中，我见过最长的主键是的是20个列（在SummingMergeTree引擎中场景下），但我并不建议这样的方式。
过长的主键，会拖慢写入性能，并且会造成过多的内存占用。
过长的主键，并不会影响对查询性能有太大的影响。

During insert, missing values of all columns will be replaced with default values and written to table.
插入过程中，如果部分字段没有指定确切的values，将会被默认值替代写入表中。

译者注

关于最左原则

了解MySQL的人，都知道一个重要的索引概念，最左原则，即对于(a, b, c)这样的索引，如果仅仅指定一个查询条件b或者c，是无法用到索引的，同样需要全表扫描，原因在于MySQL使用的是B树索引
在CK里，主键索引用的并不是B树，而是稀疏索引。
如图，在一个表的分区里，包括如下数据文件：
1. 校验值
2. 列名
3. 列.mrk
4. 列.bin
5. 主键
它们的关系如下：
1. 每隔8192行数据，是1个block
- 主键会每隔8192，取一行主键列的数据，同时记录这是第几个block
- 查询的时候，如果有索引，就通过索引定位到是哪个block，然后找到这个block对应的mrk文件
- mrk文件里记录的是某个block的数据集，在整列bin文件的哪个物理偏移位置
- 加载数据到内存，之后并行化过滤

这里分别举例看看是否依赖最左原则：
- 在使用半主键的时候，如果出现了索引跳跃，即直接用了y字段，跳过x，实际上也是可以有一些数据过滤的，但是效果并不是很好
- 所以你说这是最左原则还是不是呢？
索引使用建议
- 线上使用务必指定date字段，限制过滤的数据行
- 并不是所有的字段都要加入到主键中
- (date, hour, min, ts)大多数情况下会是比较好的选择，但是请在SQL里用上这几个条件
- 实际生产中，用上date后，即使再全扫其他字段，在百亿以下的效率依然很高

CK为啥这么高效？

答案就是全表扫描
那么问题来了，为什么全表扫描性能还这么快？
在计算机系统里，有一个概念叫SIMD，即单指令流多数据流（Single Instruction Multiple Data），是一种采用一个控制器来控制多个处理器，同时对一组数据（又称“数据向量”）中的每一个分别执行相同的操作从而实现空间上的并行性的技术。这种方式，极大的提升了数据的查询效率，因此可以做到即使是全表扫，也能达到很高的性能。
关于CK的高效
1. CK不能简单看成一个数据库，它用了大量的并行计算方式，把单机性能压榨到极限
- Hadoop生态非常依赖集群的数量，通过scale out的方式，让计算发生在本地，分而治之，通过M后再R的方式，提高执行效率。
- 在实际的使用过程中，很明显的感觉到，10台规模的Hadoop和100台规模的Hadoop无法同日而语，原因就在于数据打的不够散
- CK的方式，可以理解为，通过列式存储的方式，本身查询的时候就做了Map化，再对每一列做操作的时候，又使用向量化操作，等于是又增加了并发，因此，单机效率极高
- 理解有误请指正