Ticket Servers: Distributed Unique Primary Keys on the Cheap

This is the first post in the Using, Abusing and Scaling MySQL at Flickr series.

Ticket servers aren’t inherently interesting, but they’re an important building block at Flickr. They are core to topics we’ll be talking about later, like sharding and master-master. Ticket servers give us globally (Flickr-wide) unique integers to serve as primary keys in our distributed setup.

Why?

Sharding (aka data partitioning) is how we scale Flickr’s datastore. Instead of storing all our data on one really big database, we have lots of databases, each with some of the data, and spread the load between them. Sometimes we need to migrate data between databases, so we need our primary keys to be globally unique. Additionally our MySQL shards are built as master-master replicant pairs for resiliency. This means we need to be able to guarantee uniqueness within a shard in order to avoid key collisions. We’d love to go on using MySQL auto-incrementing columns for primary keys like everyone else, but MySQL can’t guarantee uniqueness across physical and logical databases.

GUIDs?

Given the need for globally unique ids the obvious question is, why not use GUIDs? Mostly because GUIDs are big, and they index badly in MySQL. One of the ways we keep MySQL fast is we index everything we want to query on, and we only query on indexes. So index size is a key consideration. If you can’t keep your indexes in memory, you can’t keep your database fast. Additionally ticket servers give us sequentiality which has some really nice properties including making reporting and debugging more straightforward, and enabling some caching hacks.

Translator’s note: a Globally Unique Identifier (GUID) is an algorithmically generated unique identifier, usually written as a string of 32 hexadecimal digits (0-9, A-F), e.g. {21EC2020-3AEA-1069-A2DD-08002B30309D}. It is essentially a 128-bit binary integer.
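To put the size argument in numbers, here is a back-of-the-envelope sketch (added for illustration, not from the original post): a GUID occupies 16 bytes per key, while a MySQL unsigned bigint occupies 8, and since every index entry carries the key, the difference compounds across the whole schema.

```python
import uuid

# A GUID/UUID is 128 bits = 16 bytes; a 64-bit integer key is 8 bytes.
guid_bytes = len(uuid.uuid4().bytes)
bigint_bytes = 8

# Rough key-only footprint of an index over one billion rows
# (ignoring per-entry overhead, which is similar for both key types).
rows = 1_000_000_000
print(guid_bytes * rows // 2**30, "GiB of GUID keys")      # 14 GiB
print(bigint_bytes * rows // 2**30, "GiB of bigint keys")  # 7 GiB
```

Once an index no longer fits in memory, reads start hitting disk, which is exactly the slowdown the post warns about.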

Consistent Hashing?

Some projects like Amazon’s Dynamo provide a consistent hashing ring on top of the datastore to handle the GUID/sharding issue. This is better suited for write-cheap environments (e.g. LSMTs), while MySQL is optimized for fast random reads.

Centralizing Auto-Increments

If we can’t make MySQL auto-increments work across multiple databases, what if we just used one database? If we inserted a new row into this one database every time someone uploaded a photo we could then just use the auto-incrementing ID from that table as the primary key for all of our databases.

Of course at 60+ photos a second that table is going to get pretty big. We can get rid of all the extra data about the photo, and just have the ID in the centralized database. Even then the table gets unmanageably big quickly. And there are comments, and favorites, and group postings, and tags, and so on, and those all need IDs too.

REPLACE INTO

A little over a decade ago MySQL shipped with a non-standard extension to the ANSI SQL spec, “REPLACE INTO”. Later “INSERT ON DUPLICATE KEY UPDATE” came along and solved the original problem much better. However REPLACE INTO is still supported.

REPLACE works exactly like INSERT, except that if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted.

This allows us to atomically update in place a single row in a database, and get a new auto-incremented primary ID.

Putting It All Together

A Flickr ticket server is a dedicated database server, with a single database on it, and in that database there are tables like Tickets32 for 32-bit IDs, and Tickets64 for 64-bit IDs.

The Tickets64 schema looks like:

CREATE TABLE `Tickets64` (
`id` bigint(20) unsigned NOT NULL auto_increment,
`stub` char(1) NOT NULL default '',
PRIMARY KEY (`id`),
UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM

SELECT * from Tickets64 returns a single row that looks something like:

+-------------------+------+
| id | stub |
+-------------------+------+
| 72157623227190423 | a |
+-------------------+------+

When I need a new globally unique 64-bit ID I issue the following SQL:

REPLACE INTO Tickets64 (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
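The round trip can be sketched outside MySQL as well; the following small simulation uses Python’s sqlite3 (SQLite also supports REPLACE INTO, and its AUTOINCREMENT keyword gives the same never-reused, monotonically increasing ids). This is an illustration of the mechanism, not Flickr’s actual code:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# AUTOINCREMENT makes SQLite hand out strictly increasing ids even after
# REPLACE deletes the previous row, mimicking MySQL's persistent counter.
con.execute("""CREATE TABLE Tickets64 (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,
    stub CHAR(1) NOT NULL UNIQUE
)""")

def next_ticket():
    # REPLACE deletes the old 'a' row (if any) and inserts a fresh one,
    # so the table always holds exactly one row but the id keeps growing.
    return con.execute("REPLACE INTO Tickets64 (stub) VALUES ('a')").lastrowid

ids = [next_ticket() for _ in range(5)]
print(ids)   # [1, 2, 3, 4, 5]
print(con.execute("SELECT COUNT(*) FROM Tickets64").fetchone()[0])   # 1
```

The key property is that each call is a single atomic statement, so concurrent clients can never read the same id.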

SPOFs

You really really don’t want provisioning your IDs to be a single point of failure. We achieve “high availability” by running two ticket servers. At this write/update volume, replicating between the boxes would be problematic, and locking would kill the performance of the site. We divide responsibility between the two boxes by dividing the ID space down the middle, evens and odds, using:

TicketServer1:
auto-increment-increment = 2
auto-increment-offset = 1
TicketServer2:
auto-increment-increment = 2
auto-increment-offset = 2

We round robin between the two servers to load balance and deal with down time. The sides do drift a bit out of sync; I think we have a few hundred thousand more odd-numbered objects than even-numbered objects at the moment, but this hurts no one.

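The even/odd split can be sketched in a few lines; the following simulation (illustrative class and method names, not Flickr’s code) shows how the two auto-increment settings guarantee collision-free ids even when clients round robin between the boxes:

```python
import itertools

class TicketServer:
    """Simulated ticket server using MySQL-style auto-increment settings."""
    def __init__(self, offset, increment):
        self.next_id = offset         # auto-increment-offset
        self.increment = increment    # auto-increment-increment

    def get_ticket(self):
        ticket, self.next_id = self.next_id, self.next_id + self.increment
        return ticket

# increment = 2 with offsets 1 and 2 splits the id space into odds and evens.
server1 = TicketServer(offset=1, increment=2)
server2 = TicketServer(offset=2, increment=2)

# Round robin between the two boxes, as the post describes.
servers = itertools.cycle([server1, server2])
ids = [next(servers).get_ticket() for _ in range(8)]
print(ids)   # [1, 2, 3, 4, 5, 6, 7, 8] -- server1 odds, server2 evens
```

Because the two sequences never overlap, either box can be taken down and the other keeps issuing valid ids; the only cost is the drift between odds and evens mentioned above.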

More Sequences

We actually have more tables than just Tickets32 and Tickets64 on the ticket servers. We have sequences for Photos, for Accounts, for OfflineTasks, and for Groups, etc. OfflineTasks get their own sequence because we burn through so many of them we don’t want to unnecessarily run up the counts on other things. Groups and Accounts get their own sequence because we get comparatively so few of them. Photos have their own sequence that we made sure to sync to our old auto-increment table when we cut over, because it’s nice to know how many photos we’ve had uploaded, and we use the ID as a shorthand for keeping track.

So There’s That

It’s not particularly elegant, but it works shockingly well for us, having been in production since Friday the 13th, January 2006, and is a great example of the Flickr engineering “dumbest possible thing that will work” design principle.

More soon.

Belorussian translation provided by PC.

Translator’s note: this was originally an English article, which I have tried to translate. It mainly describes how to use MySQL’s auto-increment feature to build globally unique IDs.
In the SPOFs section, two MySQL servers are deployed to avoid a single point of failure. auto-increment-offset sets the starting value of the counter, and auto-increment-increment sets the increment, which you can think of as the step size. With a single server the ID grows by 1 (one step) each time; with two servers, to avoid duplicate IDs, the total step becomes 2, so within its own sequence each server still effectively advances by 1. Then, by giving the two servers different starting values (1, 2), the ID space is split between them into odds and evens. So if you wanted to deploy 3 servers, you could set the total step to 3, keep each server’s effective step at 1, and use starting values of (1, 2, 3).
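Following that reasoning, a hypothetical three-server configuration (an extension of the post’s example, not something the article describes) would look like:

TicketServer1:
auto-increment-increment = 3
auto-increment-offset = 1
TicketServer2:
auto-increment-increment = 3
auto-increment-offset = 2
TicketServer3:
auto-increment-increment = 3
auto-increment-offset = 3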
To summarize the pros and cons of this approach. The pros:

  1. Simple to implement. Very little code is needed to generate globally unique IDs.
  2. Ordered IDs. The IDs are not only unique but also increasing; this keeps them ordered, which helps later queries.

Its drawback is as obvious as its advantages: it requires dedicated database servers to run the ID-generation service.

An Aside

There are a few more points in the article worth digging into, such as:
[ ] load-balancing solutions
[ ] sharding solutions
[ ] database indexes
Any service that wants high availability has to consider load balancing; sharding splits a database apart, mainly to speed up queries; and indexes are the main means of speeding up queries, which is why the article notes that GUIDs are unfriendly to indexes.