Serialization and Encoders

There’s a nicely designed Serializer component within Symfony which allows you to convert structured object data into a transportable or storable string (or binary) format.

The nice thing about the symfony/serializer design is that it separates two major concerns of serialization: 1) extracting the data from the objects and 2) encoding it into a string.

The extraction part is called normalization: the structured object data is converted into a common format that is easier to encode and supported by all encoders; typically that format is an associative array.

The encoding part takes the normalized data and creates a string (or binary) representation of it ready to be transported or stored on disk.
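
A minimal sketch of the two steps using the component's stock ObjectNormalizer and JsonEncoder (the Invoice class and the $invoice object are hypothetical stand-ins):

    <?php
    use Symfony\Component\Serializer\Serializer;
    use Symfony\Component\Serializer\Normalizer\ObjectNormalizer;
    use Symfony\Component\Serializer\Encoder\JsonEncoder;

    $serializer = new Serializer([new ObjectNormalizer()], [new JsonEncoder()]);

    // step 1 only: object -> associative array
    $normalized = $serializer->normalize($invoice);

    // steps 1 + 2: object -> associative array -> JSON string
    $json = $serializer->serialize($invoice, 'json');

    // and back again: JSON string -> array -> object (Invoice is hypothetical)
    $invoice = $serializer->deserialize($json, Invoice::class, 'json');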

The extra encoders I bundled together

The bundle is a collection of general-purpose serialization encoders I scavenged while investigating what options exist in this field, what purposes they serve, and how efficient they are in use (from multiple perspectives).

Fully working PHP encoders: Bencode, BSON, CBOR, Export, IGBinary, MsgPack, Serialize, Tnetstring, UBJSON and YAML.

Partial PHP implementations: Sereal, Smile and PList.

No PHP encoders found: BINN, BJSON, JSON5, HOCON, HJSON and CSON.

Of which:

  • Bencode does not support floats.
  • PList has a full PHP encoder, but its API requires encoding each scalar node individually (instead of receiving one multilevel array).
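
Plugging any of these into the Symfony Serializer only takes a thin adapter around the format's encode/decode functions. Below is a minimal sketch wrapping the igbinary extension; the exact interface signatures vary between Symfony versions:

    <?php
    use Symfony\Component\Serializer\Encoder\DecoderInterface;
    use Symfony\Component\Serializer\Encoder\EncoderInterface;

    class IgbinaryEncoder implements EncoderInterface, DecoderInterface
    {
        const FORMAT = 'igbinary';

        // normalized array -> binary string
        public function encode($data, $format, array $context = [])
        {
            return igbinary_serialize($data);
        }

        // binary string -> normalized array
        public function decode($data, $format, array $context = [])
        {
            return igbinary_unserialize($data);
        }

        public function supportsEncoding($format)
        {
            return self::FORMAT === $format;
        }

        public function supportsDecoding($format)
        {
            return self::FORMAT === $format;
        }
    }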

How to judge an encoder

Reference points:

  1. Raw initial data, discounting the data structure overheads
    A PHP array composed of key/value pairs of information (an invoice containing a vendor, a client and a number of products, each with its specific details);
  2. Access time, walking over and copying all raw data
    Using array_reduce to extract all key/value pairs and sum up their binary lengths (sketched right after this list).
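
A minimal sketch of that baseline, assuming $invoice holds the raw PHP array (the rawSize name is mine):

    <?php
    // Recursively walk the array with array_reduce, summing the binary
    // length of every key and of every scalar value.
    function rawSize(array $data)
    {
        return array_reduce(
            array_keys($data),
            function ($carry, $key) use ($data) {
                $value = $data[$key];

                return $carry
                    + strlen((string) $key)
                    + (is_array($value) ? rawSize($value) : strlen((string) $value));
            },
            0
        );
    }

    $rawDataLength = rawSize($invoice); // reference point for disk usage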

Objectives:

  1. Read speed
    In most applications decoding the data is a more frequent operation than encoding it. Is it fast enough?
  2. Write speed
    If the data was supposed to be transported/communicated from endpoint to endpoint then writing speed should be the second highest concern. If it’s supposed to be stored (semi)persistently then perhaps memory/disk usage should gain higher priority.
  3. Disk space usage
    Compared to the initial data, how much more metadata do you need?
  4. Compression yield
    Is the compressed string significantly smaller than the uncompressed one?
  5. Compression overhead
    How much time does the compression algorithm add to the process? (A measurement sketch follows this list.)
  6. Memory usage
    Is the memory allocated when reading/writing data from/to the serialization blob comparable to the raw data?
  7. Easy to read by humans
  8. Easy to write by humans
  9. Community and support
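
A minimal sketch of how yield (4) and overhead (5) could be measured, assuming $blob holds an already-encoded string:

    <?php
    // gzip: yield = compressed length vs. original, overhead = time added
    $t0 = microtime(true);
    $gz = gzencode($blob);
    $gzTime = microtime(true) - $t0;

    // bzip2, for comparison
    $t0 = microtime(true);
    $bz = bzcompress($blob);
    $bzTime = microtime(true) - $t0;

    printf("gzip:  %d -> %d bytes, +%.6fs\n", strlen($blob), strlen($gz), $gzTime);
    printf("bzip2: %d -> %d bytes, +%.6fs\n", strlen($blob), strlen($bz), $bzTime);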

Analysis of the benchmark data (tables below):

Time is expressed as a percentage (e.g. decoder read time divided by raw PHP access time): igbinary’s read time of 4.41 means decoding took only 4.41% of the time needed to walk the raw array.
Disk usage is expressed as a percentage (e.g. encoded data length divided by raw data length): a value above 100 means the encoded blob is larger than the raw data.
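
A minimal sketch of how one row of the tables could be produced, here for msgpack; $invoice, $rawAccessTime (the time of the array_reduce walk) and $rawDataLength are the reference points from above:

    <?php
    $t0 = microtime(true);
    $blob = msgpack_pack($invoice);      // write
    $writeTime = microtime(true) - $t0;

    $t0 = microtime(true);
    $decoded = msgpack_unpack($blob);    // read
    $readTime = microtime(true) - $t0;

    printf("read:  %.4f%%\n", 100 * $readTime / $rawAccessTime);
    printf("write: %.4f%%\n", 100 * $writeTime / $rawAccessTime);
    printf("disk:  %.4f%%\n", 100 * strlen($blob) / $rawDataLength);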

  1. IGBinary, MsgPack and BSON seem to win across the board (read, write, disk usage).
  2. Serialize, JSON and YAML are pretty good at reading and writing but have higher disk usages.
  3. All of the PHP extensions (msgpack, igbinary, bson, serialize, json, export, yaml) are much faster than any of the pure PHP implementations.
  4. All of the PHP extensions are much faster even than using array_reduce recursively on the raw array data (wth?).
  5. GZipping encoded data makes the disk usage almost the same as that of the raw data — sweet.
  6. BZipping compresses marginally worse (~10%) than GZipping but takes much more time.
  7. The time required for GZipping is nearly equal to the encoding time of the fastest encoders.
  8. The fastest human-readable/writable formats (JSON and YAML, when using the PHP extensions) are still 2x/7x slower than their binary counterparts.
  9. BSON and MsgPack seem to have very active communities and are used in important projects such as MongoDB and Redis (respectively).
  10. JSON is by far the most popular and ubiquitous of the encoders and is used for all sorts of purposes: communication, storage, logging, configuration. Its human readability/writability is what makes half of those uses possible.

Benchmark data:

Encoding the data

Format       Read time (%)   Write time (%)   Disk usage (%)
igbinary     4.4137          5.5693           126.625
bson         4.4225          3.6207           162.075
msgpack      5.0974          3.1087           135.915
serialize    6.196           4.2387           198.18
json         13.3619         7.7793           154.12
export       15.0877         9.8206           311.725
yaml         26.5641         21.2818          171.57
tnetstring   181.053         142.24           160.635
xml          182.4433        243.1294         194.39
bencode      261.363         110.4493         148.705
cbor         296.5037        200.4747         136.04
ubjson       346.5415        241.0281         153.615

Encoding + GZipping the data

Format       Read time (%)   Write time (%)   Disk usage (%)
igbinary     9.5062          16.2105          99.245
bson         9.9311          15.5781          105.86
msgpack      10.3767         14.1905          96.21
serialize    12.0267         16.6106          108.44
json         18.759          18.8141          94.905
export       21.6112         23.7916          108.83
yaml         32.2774         33.3446          98.75
tnetstring   186.9992        153.6501         101.23
xml          187.9654        254.5289         106.205
bencode      266.7813        121.1408         95.455
cbor         301.6611        210.844          92.725
ubjson       351.9681        252.3896         100.985

Encoding + BZipping the data

Format       Read time (%)   Write time (%)   Disk usage (%)
igbinary     18.8168         64.4219          106.71
bson         20.3083         73.1522          111.38
msgpack      20.4046         66.812           107.625
serialize    23.8041         79.7363          114.46
json         28.1694         69.9902          102.09
export       34.218          111.2053         114.56
yaml         42.3537         88.8431          104.155
tnetstring   198.5296        210.8589         109.465
xml          199.7003        315.5523         114.08
bencode      276.3642        169.8964         104.1
cbor         310.9098        257.9502         103.005
ubjson       362.4339        309.8795         108.735