Cap’n Proto

Introduction

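Cap’n Proto is an extremely fast data interchange format and capability-based RPC system. It has no encoding/decoding step: the encoded data has the same layout as the in-memory representation, so the bytes of an encoded structure can be written directly to disk.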

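Cap’n Proto's encoding is platform-independent, but it performs best on today's (little-endian) CPUs. Data is arranged the way a compiler arranges a struct: fixed widths, fixed offsets, and proper alignment. Variable-length elements such as lists are embedded through pointers, and pointers are stored as offsets rather than absolute addresses. Integers are little-endian because most modern CPUs are little-endian.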

In fact, if you are familiar with structs in C or C++, Cap’n Proto's encoding will look much like a struct's memory layout.
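To make the "struct-like layout" idea concrete, here is a minimal Python sketch using only the standard struct module. It is not the real Cap’n Proto wire format (no segments, pointers, or framing). It only illustrates fixed-width, little-endian fields at fixed offsets, and the offsets chosen below are an assumption made for illustration.

```python
import struct

# Hypothetical fixed layout for a Date-like record (illustration only):
#   offset 0: year  (Int16, little-endian)
#   offset 2: month (UInt8)
#   offset 3: day   (UInt8)
DATE_LAYOUT = struct.Struct("<hBB")


def write_date(year, month, day):
    """Return the 4 bytes that would sit in memory for this record."""
    return DATE_LAYOUT.pack(year, month, day)


def read_month(buf):
    """Read one field straight from its fixed offset, with no decode pass."""
    return buf[2]


buf = write_date(2015, 3, 14)
month = read_month(buf)
```

Because every field lives at a known offset, a reader can pick out month without touching year or day, which is the property that makes random access and mmap practical.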

Features

Incremental reads
Random access
mmap
Inter-language communication (for example, with C++)
Arena allocation
Tiny generated code
Tiny runtime library
Time-traveling RPC

How it works

Example

Like Protobuf, Cap’n Proto requires a schema file, which the capnp compiler turns into objects for a specific language. A simple schema file looks like this:

@0xdbb9ad1f14bf0b36;  # unique file ID, generated by `capnp id`

struct Person {
  name @0 :Text;
  birthdate @3 :Date;

  email @1 :Text;
  phones @2 :List(PhoneNumber);

  struct PhoneNumber {
    number @0 :Text;
    type @1 :Type;

    enum Type {
      mobile @0;
      home @1;
      work @2;
    }
  }
}

struct Date {
  year @0 :Int16;
  month @1 :UInt8;
  day @2 :UInt8;
}

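A few things to note:

The type comes after the name. For a variable, what we usually care about most is its name, and a good name tells everyone what it is for. For example, name above obviously holds the person's name. This is the reverse of C, which puts the type first and the variable name second, though many later languages such as Go and Rust put the name first and the type second.
@N numbers the fields in a struct. Numbering starts at 0 and must be consecutive (unlike Protobuf). Although birthdate appears before email and phones above, it has a larger number, which shows it was added to the struct later. The number, not the position in the file, determines where the field ends up in the encoding.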
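The numbering rule can be checked mechanically. The helper below is a hypothetical sketch (ordinals_are_valid is not part of any Cap’n Proto tool). It only tests that a struct's field numbers are exactly 0..n-1, regardless of the order in which they are declared.

```python
def ordinals_are_valid(ordinals):
    """True if the numbers are exactly 0..n-1, in any declaration order."""
    return sorted(ordinals) == list(range(len(ordinals)))


# Person from the example: name @0, birthdate @3, email @1, phones @2
person_ok = ordinals_are_valid([0, 3, 1, 2])

# A struct that skips @2 leaves a gap and would be rejected.
gap_ok = ordinals_are_valid([0, 1, 3])
```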

Reference

Comments

Comments start with #. A comment should follow the definition it describes, or start on a new line:

struct Date {
  # A standard Gregorian calendar date.

  year @0 :Int16;
  # The year.  Must include the century.
  # Negative value indicates BC.

  month @1 :UInt8;   # Month number, 1-12.
  day @2 :UInt8;     # Day number, 1-31.
}

Built-in types

The following types are supported natively:

Void: Void
Boolean: Bool
Integers: Int8, Int16, Int32, Int64
Unsigned integers: UInt8, UInt16, UInt32, UInt64
Floating-point: Float32, Float64
Blobs: Text, Data
Lists: List(T)

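Notes:

Void has exactly one possible value and is encoded in 0 bits. It is rarely used on its own but is useful as a union member.
Text is always UTF-8 encoded and NUL-terminated.
Data is arbitrary binary data.
List is a generic type that is specialized with an element type; for example, List(Int32) is a list of Int32.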
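The difference between Text and Data can be sketched in a few lines of Python. This is only an illustration of the byte content (UTF-8 plus a trailing NUL for Text, raw bytes for Data); the helper names are made up here and the surrounding list pointer and padding of the real encoding are left out.

```python
def text_bytes(s):
    """Bytes of a Text value: UTF-8 content followed by one NUL byte."""
    return s.encode("utf-8") + b"\x00"


def data_bytes(b):
    """Bytes of a Data value: stored as-is, with no terminator."""
    return bytes(b)


# "h\u00e9llo" contains one non-ASCII character, which takes 2 bytes in UTF-8.
encoded = text_bytes("h\u00e9llo")
raw = data_bytes(b"\xa1\x40\x33")
```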

Structs

A struct is similar to a C struct: each field has a name and a type, and must also be numbered:

struct Person {
  name @0 :Text;
  email @1 :Text;
}

Fields can also have default values:

foo @0 :Int32 = 123;
bar @1 :Text = "blah";
baz @2 :List(Bool) = [ true, false, false, true ];
qux @3 :Person = (name = "Bob", email = "bob@example.com");
corge @4 :Void = void;
grault @5 :Data = 0x"a1 40 33";
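One consequence of defaults is worth a sketch. In the Cap’n Proto encoding, primitive fields are stored XORed with their default value, so an all-zero struct reads back as all defaults. The Python below illustrates that idea for foo @0 :Int32 = 123 only. The function names are hypothetical and the real layout (offsets, struct sections) is ignored.

```python
import struct

FOO_DEFAULT = 123  # from `foo @0 :Int32 = 123`


def store_foo(value):
    """Encode an Int32 by XORing it with the default (little-endian)."""
    raw = (value ^ FOO_DEFAULT) & 0xFFFFFFFF
    return struct.pack("<I", raw)


def load_foo(buf):
    """Decode by XORing the stored bits with the default again."""
    (raw,) = struct.unpack("<I", buf)
    value = raw ^ FOO_DEFAULT
    # Reinterpret the 32-bit pattern as a signed integer.
    return value - (1 << 32) if value >= (1 << 31) else value


untouched = load_foo(b"\x00\x00\x00\x00")  # zeroed memory yields the default
round_trip = load_foo(store_foo(-5))
```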

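Unions

A union is a set of fields defined at the same location in a struct. Only one of them can be set at a time, and a separate tag records which one is currently set. Unlike a C union, it is not a type, just a grouping of fields.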

struct Person {
  # ...

  employment :union {
    unemployed @4 :Void;
    employer @5 :Company;
    school @6 :School;
    selfEmployed @7 :Void;
    # We assume that a person is only one of these.
  }
}

A union may be unnamed, but a struct can contain at most one unnamed union:

struct Shape {
  area @0 :Float64;

  union {
    circle @1 :Float64;      # radius
    square @2 :Float64;      # width
  }
}

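A few things to note about unions:

Fields inside a union are numbered together with the other fields of the struct.
The union above uses the Void type. It carries no extra information and exists only to distinguish that state from the others.
By default, when a struct is initialized, the lowest-numbered field in the union is the one that is set. If you do not want any field to be set by default, declare a field named unset with the lowest number in the union.
An existing field can be moved into a new union without breaking compatibility with existing data, as long as all the other fields in the union are new.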
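The "one tag, one active field" behaviour can be modelled in plain Python. This is a sketch of the semantics only, not generated Cap’n Proto code. The class, its method names, and the employer value "Acme" are all hypothetical.

```python
class Employment:
    """Sketch of a union: one tag plus at most one active value."""

    MEMBERS = ("unemployed", "employer", "school", "selfEmployed")

    def __init__(self):
        # The lowest-numbered member (unemployed @4) is active initially.
        self.which = "unemployed"
        self.value = None  # Void members carry no data

    def set(self, member, value=None):
        if member not in self.MEMBERS:
            raise ValueError(f"unknown union member: {member}")
        # Setting one member replaces whatever was active before.
        self.which = member
        self.value = value

    def get(self, member):
        if member != self.which:
            raise ValueError(f"{member} is not the active member")
        return self.value


e = Employment()
initial = e.which
e.set("employer", "Acme")
```

Reading a member other than the active one raises an error here; the point is that the tag, not the reader's guess, decides which field is meaningful.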

Groups

A group wraps a set of fields in its own scope:

struct Person {
  # ...

  # Note:  This is a terrible way to use groups, and meant
  #   only to demonstrate the syntax.
  address :group {
    houseNumber @8 :UInt32;
    street @9 :Text;
    city @10 :Text;
    country @11 :Text;
  }
}

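A group is not a separate object inside the struct. Its fields are still fields of the struct and must be numbered together with the struct's other fields.

Using a group directly in a struct is usually not very interesting, but groups become more useful inside a union: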

struct Shape {
  area @0 :Float64;

  union {
    circle :group {
      radius @1 :Float64;
    }
    rectangle :group {
      width @2 :Float64;
      height @3 :Float64;
    }
  }
}

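With groups inside the union, the fields document themselves: on seeing radius we know it belongs to circle, with no extra comment needed.

Groups also help when the protocol evolves. Suppose the shape was originally a square and we now want to support rectangles. Without groups we would have to tack on an extra field; with groups, we simply add a new group.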

Dynamically-typed fields

A struct can declare a field of type AnyPointer, which is similar to void* in C.

Enums

An enum is a set of symbolic values:

enum Rfc3092Variable {
  foo @0;
  bar @1;
  baz @2;
  qux @3;
  # ...
}

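Enumerants must be numbered starting from 0. In C an enum is essentially a number, but a Cap’n Proto enum should generally not be treated as numeric: the number only identifies the enumerant.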
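As a rough illustration, the sketch below mirrors Rfc3092Variable with Python's enum module and stores the enumerant number as a 16-bit little-endian integer, which is how the Cap’n Proto encoding represents enums. The helper names are assumptions, and returning the bare number for an unknown enumerant is just one possible way to cope with values added by a newer schema.

```python
import enum
import struct


class Rfc3092Variable(enum.Enum):
    FOO = 0
    BAR = 1
    BAZ = 2
    QUX = 3


def encode_enum(member):
    """Store the enumerant's number as an unsigned 16-bit integer."""
    return struct.pack("<H", member.value)


def decode_enum(buf):
    """Return the enumerant, or the raw number if this schema lacks it."""
    (number,) = struct.unpack("<H", buf)
    try:
        return Rfc3092Variable(number)
    except ValueError:
        return number  # e.g. an enumerant added by a newer schema


wire = encode_enum(Rfc3092Variable.BAZ)
```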

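Interfaces

An interface is a set of methods. Each method can take parameters and return results, and methods must also be numbered starting from 0. Interfaces support inheritance, including multiple inheritance.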

interface Node {
  isDirectory @0 () -> (result :Bool);
}

interface Directory extends(Node) {
  list @0 () -> (list: List(Entry));
  struct Entry {
    name @0 :Text;
    node @1 :Node;
  }

  create @1 (name :Text) -> (file :File);
  mkdir @2 (name :Text) -> (directory :Directory);
  open @3 (name :Text) -> (node :Node);
  delete @4 (name :Text);
  link @5 (name :Text, node :Node);
}

interface File extends(Node) {
  size @0 () -> (size: UInt64);
  read @1 (startAt :UInt64 = 0, amount :UInt64 = 0xffffffffffffffff)
       -> (data: Data);
  # Default params = read entire file.

  write @2 (startAt :UInt64, data :Data);
  truncate @3 (size :UInt64);
}
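To show what implementing such an interface might look like, here is a Python sketch that mirrors Node and File with abstract base classes. It is not generated code and uses no Cap’n Proto library. InMemoryFile and the snake_case method names are hypothetical stand-ins for whatever a real language binding would produce.

```python
from abc import ABC, abstractmethod

MAX_UINT64 = 0xFFFFFFFFFFFFFFFF  # default `amount`: read the entire file


class Node(ABC):
    @abstractmethod
    def is_directory(self) -> bool: ...


class File(Node):
    """Mirrors `interface File extends(Node)`."""

    def is_directory(self) -> bool:
        return False

    @abstractmethod
    def size(self) -> int: ...

    @abstractmethod
    def read(self, start_at=0, amount=MAX_UINT64) -> bytes: ...

    @abstractmethod
    def write(self, start_at, data) -> None: ...

    @abstractmethod
    def truncate(self, size) -> None: ...


class InMemoryFile(File):
    def __init__(self):
        self._buf = bytearray()

    def size(self):
        return len(self._buf)

    def read(self, start_at=0, amount=MAX_UINT64):
        return bytes(self._buf[start_at:start_at + amount])

    def write(self, start_at, data):
        # Pad with zeros if writing past the current end.
        if start_at > len(self._buf):
            self._buf.extend(b"\x00" * (start_at - len(self._buf)))
        self._buf[start_at:start_at + len(data)] = data

    def truncate(self, size):
        del self._buf[size:]


f = InMemoryFile()
f.write(0, b"hello")
```

Calling f.read() with no arguments relies on the same defaults as the schema: start at 0 and read up to the maximum UInt64 amount.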

Generics

Structs and interfaces can be generic:

struct Map(Key, Value) {
  entries @0 :List(Entry);
  struct Entry {
    key @0 :Key;
    value @1 :Value;
  }
}

struct People {
  byName @0 :Map(Text, Person);
  # Maps names to Person instances.
}

In the example above we define a generic Map and then specialize it in People with Text and Person as parameters. If you know C++ templates, this will look familiar.
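For readers more at home with Python's typing module than with C++ templates, a rough analogue of Map(Key, Value) might look like the sketch below. It is only an analogy: the get helper is an addition for illustration, and a plain string stands in for the Person struct.

```python
from dataclasses import dataclass, field
from typing import Generic, List, Optional, TypeVar

K = TypeVar("K")
V = TypeVar("V")


@dataclass
class Entry(Generic[K, V]):
    key: K
    value: V


@dataclass
class Map(Generic[K, V]):
    entries: List[Entry[K, V]] = field(default_factory=list)

    def get(self, key: K) -> Optional[V]:
        # Linear scan: the schema stores entries as a plain list.
        for entry in self.entries:
            if entry.key == key:
                return entry.value
        return None


# Roughly `byName @0 :Map(Text, Person)`, with str standing in for Person.
by_name: Map[str, str] = Map()
by_name.entries.append(Entry("Bob", "bob@example.com"))
```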

Generic methods

Interfaces can also provide generic methods:

interface Assignable(T) {
  # A generic interface, with non-generic methods.
  get @0 () -> (value :T);
  set @1 (value :T) -> ();
}

interface AssignableFactory {
  newAssignable @0 [T] (initialValue :T)
      -> (assignable :Assignable(T));
  # A generic method.
}

We first define a generic interface; the factory method that creates instances of that interface is then a generic method.
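The distinction between a generic interface and a generic method can be sketched in Python as follows. This is an analogy using the typing module, not Cap’n Proto generated code, and the snake_case names are assumptions.

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class Assignable(Generic[T]):
    """Generic class: T is fixed for the lifetime of each instance."""

    def __init__(self, value: T):
        self._value = value

    def get(self) -> T:
        return self._value

    def set(self, value: T) -> None:
        self._value = value


class AssignableFactory:
    """Non-generic class with a generic method."""

    def new_assignable(self, initial_value: T) -> Assignable[T]:
        # T is bound per call, not per factory instance.
        return Assignable(initial_value)


factory = AssignableFactory()
number = factory.new_assignable(42)
text = factory.new_assignable("hello")
number.set(7)
```

The same factory object produces an Assignable of integers and an Assignable of strings, which is what the [T] on newAssignable expresses in the schema.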

Constants

Constants are defined with const:

const pi :Float32 = 3.14159;
const bob :Person = (name = "Bob", email = "bob@example.com");
const secret :Data = 0x"9f98739c2b53835e 6720a00907abd42f";

Constants can be referenced directly from other values:

const foo :Int32 = 123;
const bar :Text = "Hello";
const baz :SomeStruct = (id = .foo, message = .bar);

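Constants are usually defined at global scope, and a leading . refers to a name in that scope.

Nesting, scope, and aliases

Constants, aliases, and new type definitions can be nested inside a struct or interface.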

struct Foo {
  struct Bar {
    #...
  }
  bar @0 :Bar;
}

struct Baz {
  bar @0 :Foo.Bar;
}

In Baz above, the nested type is referred to as Foo.Bar.

A using declaration gives a type an alias.

struct Qux {
  using Foo.Bar;
  bar @0 :Bar;
}

struct Corge {
  using T = Foo.Bar;
  bar @0 :T;
}

Imports

An import expression brings in type definitions from another file:

struct Foo {
  # Use type "Baz" defined in bar.capnp.
  baz @0 :import "bar.capnp".Baz;
}

You can also combine it with using to set an alias:

using Bar = import "bar.capnp";

struct Foo {
  # Use type "Baz" defined in bar.capnp.
  baz @0 :Bar.Baz;
}

Or like this:

using import "bar.capnp".Baz;

struct Foo {
  baz @0 :Baz;
}

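Annotations

Sometimes we need to attach extra, application-specific information to a Cap’n Proto schema that is not part of Cap’n Proto itself. That is what annotations are for. Whether they are really necessary is debatable, so I will skip them for now.

Unique IDs

Every Cap’n Proto file must have a unique 64-bit ID, generated with capnp id. For example, the file ID from the first example: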

# file ID
@0xdbb9ad1f14bf0b36;

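Structs, enums, and other types also need IDs, but by default these are generated automatically.

A 64-bit ID can collide in theory, but in practice this is not worth worrying about. Misuse (for example, copying an example without changing its file ID) is a far more likely cause of collisions.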
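A back-of-the-envelope estimate supports this. The sketch below uses the standard birthday-bound approximation and assumes IDs are drawn uniformly at random from the full 64-bit space, which is a simplification rather than a statement about how capnp id actually works.

```python
import math

ID_SPACE = 2 ** 64  # assumed upper bound on the number of distinct IDs


def collision_probability(n):
    """Birthday-bound estimate that n random IDs contain a duplicate."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-n * (n - 1) / (2 * ID_SPACE))


p_million = collision_probability(1_000_000)
```

Even with a million randomly generated IDs, the estimated chance of any collision stays far below one in a million, so accidental reuse by copy and paste is the more realistic risk.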

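Evolving the protocol

When changing a protocol that is already in use, keep the following in mind:

New types, constants, and aliases can be added anywhere; they do not affect existing types.
New fields, enumerants, and methods must use numbers larger than all existing ones.
New parameters added to a method must go at the end of the parameter list and must have default values.
Members can be freely reordered within the file as long as their numbers stay the same.
Symbol names can be changed freely as long as IDs and numbers stay the same. Note, however, that auto-generated IDs are derived from the parent ID and the name, so you need to run capnp compile -ocapnp myschema.capnp to find the ID associated with the name and declare it explicitly after renaming.
Type definitions can be moved to any scope as long as their IDs are declared explicitly.
A field can be moved into a union or group; in effect the old field in the struct is replaced by a new group or union that contains it.
A non-generic type can be made generic. (I will look into generics later; I can't help feeling it doesn't need to be this complicated.)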
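The numbering rules in this list lend themselves to a simple automated check. The function below is a hypothetical sketch, not a Cap’n Proto tool. It compares only field numbers, and treating a disappearing number as a problem is a conservative assumption that goes beyond the list above.

```python
def check_upgrade(old_numbers, new_numbers):
    """Return a list of problems; an empty list means the numbering looks safe."""
    old, new = set(old_numbers), set(new_numbers)
    problems = []

    # Existing numbers must survive; names may change, numbers may not.
    for n in sorted(old - new):
        problems.append(f"@{n} no longer exists")

    # New members must use numbers larger than every existing one.
    highest_old = max(old, default=-1)
    for n in sorted(new - old):
        if n <= highest_old:
            problems.append(f"new @{n} is not larger than existing @{highest_old}")

    return problems


added_field = check_upgrade([0, 1, 2, 3], [0, 1, 2, 3, 4])
dropped_field = check_upgrade([0, 1, 2, 3], [0, 1, 2])
```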

Some changes are not safe: