Serialization and Encoders
There’s a nicely designed Serializer component within Symfony which allows you to convert structured object data into a transportable or storable string (or binary) format.
The nice thing about the symfony/serializer design is that it separates the two major concerns of serialization: 1) extracting the data from the objects and 2) encoding it into a string.
The extraction part is called normalization: the structured object data is converted into a common format that is easier to encode and supported by all encoders; for example, that format could be an associative array.
The encoding part takes the normalized data and creates a string (or binary) representation of it ready to be transported or stored on disk.
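To make that split concrete, here is a minimal sketch of the component using the stock ObjectNormalizer and JsonEncoder; the Invoice class is a hypothetical data class standing in for your own objects:

```php
<?php

use Symfony\Component\Serializer\Encoder\JsonEncoder;
use Symfony\Component\Serializer\Normalizer\ObjectNormalizer;
use Symfony\Component\Serializer\Serializer;

// The Serializer chains normalizers (object <-> array) with encoders (array <-> string).
$serializer = new Serializer([new ObjectNormalizer()], [new JsonEncoder()]);

// Normalization + encoding in one call ($invoice is a hypothetical Invoice instance).
$json = $serializer->serialize($invoice, 'json');

// Decoding + denormalization on the way back.
$invoice = $serializer->deserialize($json, Invoice::class, 'json');
```

Because the two concerns are decoupled, swapping JSON for any of the encoders below only means registering a different encoder; the normalization step stays the same.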
The extra encoders I bundled together
The bundle is a collection of general-purpose serialization encoders I scavenged while investigating what options exist in this field, what purposes they serve and how efficient they are in use (from multiple perspectives).
Fully working PHP encoders: Bencode, BSON, CBOR, Export, IGBinary, MsgPack, Serialize, Tnetstring, UBJSON and YAML.
Partial PHP implementations: Sereal, Smile and PList.
No PHP encoders found: BINN, BJSON, JSON5, HOCON, HJSON and CSON.
Of which:
- Bencode does not support floats.
- PList has a full PHP encoder, but its API requires encoding each scalar node individually (instead of accepting one multilevel array).
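For a sense of how such encoders plug into the Serializer, here is a sketch of one wrapping the msgpack PECL extension behind Symfony's EncoderInterface/DecoderInterface. Treat it as illustrative rather than the bundle's actual code; the interface signatures shown match the untyped variants of those interfaces:

```php
<?php

use Symfony\Component\Serializer\Encoder\DecoderInterface;
use Symfony\Component\Serializer\Encoder\EncoderInterface;

// Illustrative sketch: exposes the msgpack extension as a Serializer encoder.
class MsgPackEncoder implements EncoderInterface, DecoderInterface
{
    const FORMAT = 'msgpack';

    public function encode($data, $format, array $context = [])
    {
        return msgpack_pack($data);
    }

    public function decode($data, $format, array $context = [])
    {
        return msgpack_unpack($data);
    }

    public function supportsEncoding($format)
    {
        return self::FORMAT === $format;
    }

    public function supportsDecoding($format)
    {
        return self::FORMAT === $format;
    }
}
```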
How to judge an encoder
Reference points:
- Raw initial data, discounting the data-structure overheads: a PHP array composed of key/value pairs of information (an invoice containing a vendor, a client and a number of products, each with their specific details).
- Access time walking over and copying all raw data: using array_reduce to extract all key/value pairs and evaluating their respective binary lengths (a sketch follows this list).
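A simplified sketch of what such a walk can look like (rawLength is a name I'm introducing here for illustration, not the benchmark's actual helper):

```php
<?php

// Recursively walk an array with array_reduce, summing the byte length
// of every key and value: a rough proxy for the raw data size.
function rawLength(array $data)
{
    return array_reduce(
        array_keys($data),
        function ($carry, $key) use ($data) {
            $value = $data[$key];
            $carry += strlen((string) $key);

            return $carry + (is_array($value) ? rawLength($value) : strlen((string) $value));
        },
        0
    );
}
```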
Objectives:
- Read speed: in most applications decoding the data is a more frequent operation than encoding it. Is it fast enough?
- Write speed: if the data is meant to be transported/communicated from endpoint to endpoint, then writing speed should be the second-highest concern. If it is meant to be stored (semi)persistently, then perhaps memory/disk usage should gain higher priority.
- Disk space usage: compared to the initial data, how much more meta-data do you need?
- Compression yield: is the compressed version of the string significantly smaller than the uncompressed version?
- Compression overhead: how much time does the compression algorithm add to the process?
- Memory usage: is the memory allocated when reading/writing data from/to the serialization blob comparable to the raw data?
- Easy to read by humans.
- Easy to write by humans.
- Community and support.
Analysis of the benchmark data (tables below):
Time is expressed as a percentage (e.g. decoder read time divided by raw PHP access time).
Disk usage is expressed as a percentage (e.g. encoded data length divided by raw data length).
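As an illustration of how such percentages can be obtained (a sketch, not the benchmark's actual harness; timeIt is a hypothetical helper and rawLength is the walk sketched above):

```php
<?php

// Time a callable over $iterations runs and return the average in seconds.
function timeIt(callable $fn, $iterations = 1000)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }

    return (microtime(true) - $start) / $iterations;
}

$rawTime   = timeIt(function () use ($data) { rawLength($data); });
$writeTime = timeIt(function () use ($data) { msgpack_pack($data); });
$blob      = msgpack_pack($data);
$readTime  = timeIt(function () use ($blob) { msgpack_unpack($blob); });

// Express everything relative to the raw-access baseline, as in the tables.
printf("read: %.4f%%, write: %.4f%%\n", 100 * $readTime / $rawTime, 100 * $writeTime / $rawTime);
```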
- IGBinary, MsgPack and BSON seem to win across the board (read, write, disk usage).
- Serialize, JSON and YAML are pretty good at reading and writing but have higher disk usage.
- All of the PHP extensions are much faster than any of the pure PHP implementations (msgpack, igbinary, bson, serialize, json, export, yaml).
- All of the PHP extensions are even faster than using array_reduce recursively on the raw array data (wth?).
- GZipping the encoded data brings the disk usage down to almost that of the raw data; sweet (see the compression sketch after this list).
- BZipping yields marginally less compression (~10% worse) but takes much more time to compress.
- The time required for GZipping is nearly equal to the encoding time of the fastest encoders.
- The fastest human-readable/writable formats (JSON and YAML, when using the PHP extensions) are still 2x/7x slower than their binary counterparts.
- BSON and MsgPack seem to have very active communities and are used in important projects such as MongoDB and Redis (respectively).
- JSON is by far the most popular and ubiquitous of the encoders and is used for all sorts of purposes: communication, storage, logging, configuration; its human readability/writability is what permits half of those purposes to work.
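The compression comparison can be reproduced with PHP's stock zlib and bzip2 bindings; a quick sketch (gzencode and bzcompress are the built-in functions, msgpack stands in for any of the encoders):

```php
<?php

$encoded = msgpack_pack($data);

// GZip: fast, and typically shrinks the blob close to the raw-data size.
$gzStart = microtime(true);
$gzipped = gzencode($encoded);
$gzTime  = microtime(true) - $gzStart;

// BZip2: here slightly worse compression, and noticeably slower.
$bzStart = microtime(true);
$bzipped = bzcompress($encoded);
$bzTime  = microtime(true) - $bzStart;

printf("gzip: %d bytes in %.6fs, bzip2: %d bytes in %.6fs\n",
    strlen($gzipped), $gzTime, strlen($bzipped), $bzTime);
```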
Benchmark data:
Encoding the data
Format | Read time (%) | Write time (%) | Disk usage (%) |
---|---|---|---|
igbinary | 4.4137 | 5.5693 | 126.625 |
bson | 4.4225 | 3.6207 | 162.075 |
msgpack | 5.0974 | 3.1087 | 135.915 |
serialize | 6.196 | 4.2387 | 198.18 |
json | 13.3619 | 7.7793 | 154.12 |
export | 15.0877 | 9.8206 | 311.725 |
yaml | 26.5641 | 21.2818 | 171.57 |
tnetstring | 181.053 | 142.24 | 160.635 |
xml | 182.4433 | 243.1294 | 194.39 |
bencode | 261.363 | 110.4493 | 148.705 |
cbor | 296.5037 | 200.4747 | 136.04 |
ubjson | 346.5415 | 241.0281 | 153.615 |
Encoding + GZipping the data
Format | Read time (%) | Write time (%) | Disk usage (%) |
---|---|---|---|
igbinary | 9.5062 | 16.2105 | 99.245 |
bson | 9.9311 | 15.5781 | 105.86 |
msgpack | 10.3767 | 14.1905 | 96.21 |
serialize | 12.0267 | 16.6106 | 108.44 |
json | 18.759 | 18.8141 | 94.905 |
export | 21.6112 | 23.7916 | 108.83 |
yaml | 32.2774 | 33.3446 | 98.75 |
tnetstring | 186.9992 | 153.6501 | 101.23 |
xml | 187.9654 | 254.5289 | 106.205 |
bencode | 266.7813 | 121.1408 | 95.455 |
cbor | 301.6611 | 210.844 | 92.725 |
ubjson | 351.9681 | 252.3896 | 100.985 |
Encoding + BZipping the data
Format | Read time (%) | Write time (%) | Disk usage (%) |
---|---|---|---|
igbinary | 18.8168 | 64.4219 | 106.71 |
bson | 20.3083 | 73.1522 | 111.38 |
msgpack | 20.4046 | 66.812 | 107.625 |
serialize | 23.8041 | 79.7363 | 114.46 |
json | 28.1694 | 69.9902 | 102.09 |
export | 34.218 | 111.2053 | 114.56 |
yaml | 42.3537 | 88.8431 | 104.155 |
tnetstring | 198.5296 | 210.8589 | 109.465 |
xml | 199.7003 | 315.5523 | 114.08 |
bencode | 276.3642 | 169.8964 | 104.1 |
cbor | 310.9098 | 257.9502 | 103.005 |
ubjson | 362.4339 | 309.8795 | 108.735 |