Mihai Stancu

Notes & Rants

OrientDB @ eMAG TechLabs — 2016-01-21

I wrote a piece on OrientDB for eMAG TechLabs.

It’s an analysis of features and functionality vs. driver implementation availability from the PHP + Symfony2 + Doctrine2 developer’s perspective.

I also (re)wrote a driver to scavenge and integrate the functionality of other partially implemented drivers I found and analyzed.

JSON logformat and analysis — 2016-01-13

Setup log format

So you’re tired of reading Apache logs with column -t, or you need to process them with external tools, maybe push them into Logstash? Say no more:

# Inside your virtual host definition

# Declaring your custom log format as a JSON structure
LogFormat '{"time":"%{%FT%T%z}t","response":{"status":"%>s","duration":"%D","length":"%B"},"request":{"method":"%m","host":"%V","port":"%p","url":"%U","query":"%q"},"client":{"ip":"%a","agent":"%{User-agent}i","referer":"%{Referer}i"}}' json_log

# Declaring an environment variable based on the type of file requested
SetEnvIf Request_URI "(\.gif|\.png|\.jpg|\.ico|\.css|\.js|\.eot|\.ttf|\.woff2?)$" request_static=1

# Declaring separate log files (one for static content, one for dynamic pages) with the new log format
CustomLog /path/to/log/access_static.log  json_log env=request_static
CustomLog /path/to/log/access_dynamic.log json_log env=!request_static
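Once every log line is a JSON document, any language with a JSON parser can consume the files. Below is a minimal Python sketch, assuming lines shaped like the LogFormat above. The helper name `summarize` and the sample lines are made up for illustration.

```python
import json
from collections import Counter


def summarize(lines):
    """Count responses per HTTP status and sum the duration per status.

    `lines` is any iterable of strings, one JSON document per line,
    shaped like the json_log LogFormat (hypothetical helper).
    """
    counts = Counter()
    total_duration = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Apache escapes some bytes as \xHH, which is not valid JSON;
            # skip such lines instead of aborting the whole run.
            continue
        status = entry["response"]["status"]
        counts[status] += 1
        # %D is logged as a string holding microseconds.
        total_duration[status] += int(entry["response"]["duration"])
    return counts, total_duration


sample = [
    '{"time":"2016-01-13T10:00:00+0200","response":{"status":"200","duration":"1500","length":"512"},'
    '"request":{"method":"GET","host":"example.com","port":"80","url":"/","query":""},'
    '"client":{"ip":"127.0.0.1","agent":"curl","referer":"-"}}',
    '{"time":"2016-01-13T10:00:01+0200","response":{"status":"404","duration":"300","length":"0"},'
    '"request":{"method":"GET","host":"example.com","port":"80","url":"/missing","query":""},'
    '"client":{"ip":"127.0.0.1","agent":"curl","referer":"-"}}',
    "not json",
]
counts, durations = summarize(sample)
print(dict(counts), dict(durations))
```

To run it against a real file, pass an open file object instead of the sample list, for example `summarize(open("/path/to/log/access_dynamic.log"))`.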

Tool to read/analyse the logs (manually)

jq is a small tool which reads its input as a stream of JSON values (one object per line, in the case of these logs) and outputs them pretty-printed.

The package itself has hardly any dependencies and is readily available in Linux repos.

Minimal usage:

echo '{"a": 1, "b": 2, "c": 3, "d": [{"e": 4}]}' | jq .
{
  "a": 1,
  "b": 2,
  "c": 3,
  "d": [
    {
      "e": 4
    }
  ]
}

Object restructuring:

echo '{"a": 1, "b": 2, "c": 3, "d": [{"e": 4}]}' | jq '{"c": .a, "e": .d[0].e}'
{
  "c": 1,
  "e": 4
}

Parsing string content as JSON:

echo '{"a":1,"b":"[{\"c\":2,\"d\":\"3\"}, {\"c\":3,\"e\":\"5\"}]"}' | jq '.["b"]|fromjson'
[
  {
    "c": 2,
    "d": "3"
  },
  {
    "c": 3,
    "e": "5"
  }
]


echo '{"a":1,"b":"[{\"c\":2,\"d\":\"3\"}, {\"c\":3,\"e\":\"5\"}]"}' | jq '.["b"]|fromjson|.[]|select(.c == 2)'
{
  "c": 2,
  "d": "3"
}

Object Oriented Databases — 2015-12-14

When I said what I said about SQL and RDBMSs and that other thing I said about tree structures and hierarchies in RDBMSs this is what I meant (mostly).

Throwback Sunday

I was browsing DB Engines (valuable resource this is), looking at popularity rankings for various database systems and comparing features in search of hot new tech, when I started digging into Object Oriented Databases again.

I searched for (open source?) Object Oriented Database engines more intensely nearly 6 years ago (DB Engines wasn’t around back then) and I was disappointed to find that there was little to no popular demand for pure OODBMSs. Every damn Google search or wiki lookup spat out the RDBMS’s conjoined-twin little brother, the ORDBMS (Object-Relational Database), but that didn’t fit the bill I had in mind.

At that time I did find one open source, pure OODBMS, EyeDB, which currently looks like a dead project (damn!).

I might have missed (read: disregarded) some niche products (read: proprietary).

I don’t remember reading about InterSystems Caché or its underlying technology, MUMPS, which looks well ahead of its time.

But I do remember some important players on the market: Versant and Objectivity, which were (and still are) proprietary, as well as another intriguing approach, JADE, a proprietary full-stack system including a DB.

But why all the fuss? Why not just use an RDBMS like everyone else (freak)?

It felt very strange to me that developers would go gentle into that good night. Developers are inherently lazy creatures who would rather spend 20h automating a 22h long repetitive task than blankly toil away at the repetitive task.

Why would they ever agree to learn an entirely new set of concepts about handling data, read about the mathematics behind it, and mentally bridge the gap between one concept and the other every damn day of the rest of their careers (a repetitive task)?

Why jump through all of these hoops when an OODBMS can achieve the same performance as any RDBMS (or better) and also do away with the impedance mismatch that comes with using an ORM? Not to mention all the work of building and maintaining an ORM, having to debug it, or having your access to DBMS features limited by the ORM.

Why bother writing a CREATE TABLE which contains virtually the same thing as your class declaration? …and then burden yourself with manually keeping every future change to the table or the class in perfect sync with the other? …DRY, anyone?

Versant Object Database, for example, describes an awesome schema versioning capability which allows you to simply give the DB the newly compiled class structure, and VOD will handle updating old entries to the new schema (eagerly or lazily depending on your requirements).

Multiple apps in one repo with Symfony2 — 2015-10-03

My requirements:

  • Moving application specific configurations into separate application bundles (not separate app/ folders)
  • Retaining common configurations in the app/config/config_*.yml files
  • Retaining common practices such as calling app/console just adding a parameter to specify the application


  1. Change your apache2 vhost to add a (conditional?) environment variable
    # ...
    # The RegEx below matches subdomains
    SetEnvIf Host nth\..+? SYMFONY_APP=nth
    # ...
  2. Create app/NthKernel.php which extends AppKernel
  3. Override NthKernel::$name, setting it to 'nth'
  4. Override NthKernel::serialize and NthKernel::unserialize to ensure the correct name is kept after serialization/deserialization
  5. Override NthKernel::getCacheDir to ensure the cache dirs are split based on the application name:

    public function getCacheDir()
    {
        return $this->rootDir.'/cache/'.$this->name.'/'.$this->environment;
    }

  6. Override NthKernel::registerContainerConfiguration to load configurations based on the application name and environment. In my case I loaded all config.yml files from any installed bundle:
    public function registerContainerConfiguration(LoaderInterface $loader)
    {
        $env = $this->getEnvironment();
        foreach ($this->bundles as $bundle) {
            $dir = $bundle->getPath() . '/Resources/config/';
            if (file_exists($path = $dir . 'config_'.$env.'.yml')) {
                $loader->load($path);
            } elseif (file_exists($path = $dir . 'config.yml')) {
                $loader->load($path);
            }
        }
        $dir = __DIR__.'/config/';
        if (file_exists($path = $dir . 'config_'.$env.'.yml')) {
            $loader->load($path);
        } elseif (file_exists($path = $dir . 'config.yml')) {
            $loader->load($path);
        }
    }
  7. Change web/app.php and web/app_dev.php to ensure they instantiate NthKernel and NthCache based on the environment variable Apache is providing (SYMFONY_APP):
    $app = ucfirst(getenv('SYMFONY_APP'));
    require_once __DIR__.'/../app/AppKernel.php';
    require_once __DIR__.'/../app/'.$app.'Kernel.php';
    //require_once __DIR__.'/../app/AppCache.php';
    //require_once __DIR__.'/../app/'.$app.'Cache.php';
    $kernel = $app.'Kernel';
    $kernel = new $kernel('dev', true);
    //$cache = $app.'Cache';
    //$kernel = new $cache($kernel);
  8. Change app/console to allow you to specify which application you need to use
    $app = ucfirst($input->getParameterOption(array('--app', '-a'), getenv('SYMFONY_APP')));
    $env = $input->getParameterOption(array('--env', '-e'), getenv('SYMFONY_ENV') ?: 'dev');
    // ...
    /* Move require_once after you initialized the `$app` variable */
    require_once __DIR__.'/AppKernel.php';
    require_once __DIR__.'/'.$app.'Kernel.php';
    $kernel = $app.'Kernel';
    $kernel = new $kernel($env, $debug);
    $application = new Application($kernel);
    $application->getDefinition()->addOption(
        new InputOption(
            '--app',
            '-a',
            InputOption::VALUE_REQUIRED,
            'The Application name.'
        )
    );
  9. Use app/console by specifying the application you need to use
    app/console --app=nth --env=dev debug:router
    app/console --app=nth --env=dev debug:container

Other resources:

JoliCode wrote this article on the topic.

Their approach to the problem seems more idiomatic: creating a structure of application-specific subfolders (apps/nth), each with its own AppKernel, apps/nth/cache, apps/nth/config etc.

A collection of thoroughly random encoders — 2015-10-01

Serialization and Encoders

There’s a nicely designed Serializer component within Symfony which allows you to convert structured object data into a transportable or storeable string (or binary) format.

The nice thing about the symfony/serializer design is that it separates two major concerns of serialization: 1) extracting the data from the objects and 2) encoding it into a string.

The extraction part is called normalization wherein the structured object data is converted into a common format — usually easier to encode / supported by all encoders — for example that format could be an associative array.

The encoding part takes the normalized data and creates a string (or binary) representation of it ready to be transported or stored on disk.
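The two-step split is not specific to PHP. As a rough illustration (a Python sketch, not the Symfony API; `Invoice`, `Product`, `normalize` and `encode` are hypothetical names), normalization produces plain dictionaries and lists, and the encoder only ever sees that plain structure:

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Product:
    name: str
    price: float


@dataclass
class Invoice:
    vendor: str
    client: str
    products: List[Product] = field(default_factory=list)


def normalize(obj):
    """Step 1: turn the object graph into plain dicts, lists and scalars."""
    return asdict(obj)


def encode(data, fmt="json"):
    """Step 2: turn the normalized data into a string.

    Only JSON is wired in here; another encoder would plug in at this
    point without touching normalize().
    """
    if fmt == "json":
        return json.dumps(data, sort_keys=True)
    raise ValueError("unsupported format: " + fmt)


invoice = Invoice("ACME", "Bob", [Product("widget", 9.5)])
normalized = normalize(invoice)
print(encode(normalized))
```

Because the encoder never touches the objects themselves, swapping JSON for a binary format only changes the second step.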

The extra encoders I bundled together

The bundle is a collection of general purpose serialization encoders I scavenged while investigating what options there are in this field, what purposes they serve, how efficient they are in usage (from multiple perspectives).

Fully working PHP encoders: Bencode, BSON, CBOR, Export, IGBinary, MsgPack, Serialize, Tnetstring, UBJSON and YAML.

Partial PHP implementations: Sereal, Smile and PList.

No PHP encoders found: BINN, BJSON, JSON5, HOCON, HJSON and CSON.

Of which:

  • bencode does not support floats.
  • PList has a full PHP encoder but the API requires encoding each scalar-node individually (instead of receiving one multilevel array).

How to judge an encoder

Reference points:

  1. Raw initial data discounting the data structure overheads
    A PHP array composed of key/value pairs of information (an invoice containing a vendor, a client and a number of products each with their specific details);
  2. Access time walking over and copying all raw data
    Using array_reduce to extract all key/value pairs and evaluating their respective binary lengths.


  1. Read speed
    In most applications decoding the data is a more frequent operation than encoding it. Is it fast enough?
  2. Write speed
    If the data was supposed to be transported/communicated from endpoint to endpoint then writing speed should be the second highest concern. If it’s supposed to be stored (semi)persistently then perhaps memory/disk usage should gain higher priority.
  3. Disk space usage
    Compared to the initial data how much more meta-data do you need?
  4. Compression yield
    Is the compressed version of the string significantly lower than the uncompressed version?
  5. Compression overhead
    How much time does the compression algorithm add to the process?
  6. Memory usage
    Is the memory allocated when reading/writing data from/to the serialization blob comparable to the raw data?
  7. Easy to read by humans
  8. Easy to write by humans
  9. Community and support

Analysis of the benchmark data (tables below):

Time is expressed as percent (ex.: decoder read time divided by raw php access time).
Disk usage is expressed as percent (ex.: encoded data length divided by raw data length).
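To make the two ratios concrete, here is a rough Python sketch of how such percentages could be computed. It uses Python's json and pickle modules as stand-ins, not the PHP encoders benchmarked here, so its numbers are not comparable to the tables. The helper names `raw_length`, `elapsed` and `percent` are hypothetical.

```python
import gzip
import json
import pickle
import time


def raw_length(data):
    """Sum the byte lengths of all keys and scalar values, ignoring structure."""
    if isinstance(data, dict):
        return sum(len(str(k).encode()) + raw_length(v) for k, v in data.items())
    if isinstance(data, (list, tuple)):
        return sum(raw_length(v) for v in data)
    return len(str(data).encode())


def elapsed(fn, repeat=200):
    """Wall-clock time of calling fn() `repeat` times."""
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    return time.perf_counter() - start


def percent(value, reference):
    """Express `value` as a percentage of `reference`."""
    return 100.0 * value / reference


invoice = {
    "vendor": "ACME",
    "client": "Bob",
    "products": [{"name": "widget %d" % i, "price": i * 1.5} for i in range(50)],
}
baseline_time = elapsed(lambda: raw_length(invoice))  # raw access time
baseline_size = raw_length(invoice)                   # raw data length

codecs = {
    "json": (lambda d: json.dumps(d).encode(), json.loads),
    "pickle": (pickle.dumps, pickle.loads),
}
for name, (encode, decode) in codecs.items():
    blob = encode(invoice)
    read = percent(elapsed(lambda: decode(blob)), baseline_time)
    size = percent(len(blob), baseline_size)
    gz_size = percent(len(gzip.compress(blob)), baseline_size)
    print("%-6s read %6.1f%%  size %6.1f%%  gzipped %6.1f%%" % (name, read, size, gz_size))
```

A value below 100% in the read column means decoding was faster than walking the raw structure in the host language.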

  1. IGBinary, MsgPack and BSON lead on read and write speed. IGBinary and MsgPack also have the lowest disk usage, while BSON (162%) takes more space than JSON (154%).
  2. Serialize, JSON and YAML are pretty good at reading and writing but have higher disk usages.
  3. All of the php extensions are much faster than any of the pure php implementations (msgpack, igbinary, bson, serialize, json, export, yaml).
  4. All of the php extensions are much faster even than using array_reduce recursively on the raw array data (wth?).
  5. GZipping encoded data makes the disk usage almost the same as that of the raw data — sweet.
  6. BZipping yields (marginally) worse compression (~10%) and takes much more time to compress.
  7. The time required for GZipping is nearly equal to the encoding time of the fastest of the encoders.
  8. The fastest human readable/writable formats (JSON and YAML when using the php extensions) are still 2x/7x slower than their binary counterparts.
  9. BSON and MsgPack seem to have very active communities and are used in important projects such as MongoDB and Redis (respectively).
  10. JSON is by far the most popular and ubiquitous of the encoders and is used for all sorts of purposes: communication, storage, logging, configuration; its human readability/writeability is what permits half of those purposes to work.

Benchmark data:

Encoding the data

Format       Read time (%)   Write time (%)   Disk usage (%)
igbinary     4.4137          5.5693           126.625
bson         4.4225          3.6207           162.075
msgpack      5.0974          3.1087           135.915
serialize    6.196           4.2387           198.18
json         13.3619         7.7793           154.12
export       15.0877         9.8206           311.725
yaml         26.5641         21.2818          171.57
tnetstring   181.053         142.24           160.635
xml          182.4433        243.1294         194.39
bencode      261.363         110.4493         148.705
cbor         296.5037        200.4747         136.04
ubjson       346.5415        241.0281         153.615

Encoding + GZipping the data

Format       Read time (%)   Write time (%)   Disk usage (%)
igbinary     9.5062          16.2105          99.245
bson         9.9311          15.5781          105.86
msgpack      10.3767         14.1905          96.21
serialize    12.0267         16.6106          108.44
json         18.759          18.8141          94.905
export       21.6112         23.7916          108.83
yaml         32.2774         33.3446          98.75
tnetstring   186.9992        153.6501         101.23
xml          187.9654        254.5289         106.205
bencode      266.7813        121.1408         95.455
cbor         301.6611        210.844          92.725
ubjson       351.9681        252.3896         100.985

Encoding + BZipping the data

Format       Read time (%)   Write time (%)   Disk usage (%)
igbinary     18.8168         64.4219          106.71
bson         20.3083         73.1522          111.38
msgpack      20.4046         66.812           107.625
serialize    23.8041         79.7363          114.46
json         28.1694         69.9902          102.09
export       34.218          111.2053         114.56
yaml         42.3537         88.8431          104.155
tnetstring   198.5296        210.8589         109.465
xml          199.7003        315.5523         114.08
bencode      276.3642        169.8964         104.1
cbor         310.9098        257.9502         103.005
ubjson       362.4339        309.8795         108.735

Everything is a file — 2015-08-26

UNIX invented it, BSD and Linux gave it to the world

Everything is a file is a very successful paradigm in the UNIX/Linux communities. It has allowed the kernel to simplify and unify how it uses devices, which are exposed to the user as files. All files are treated as a bag of bytes. Reading/writing from a file is straightforward.

Besides actual data storage, a lot of fruitful exaptation has been derived from this paradigm and from the UNIX/Linux file system conventions:

  • Files, folders, symlinks, hardlinks, named pipes (fifo), network pipes, devices
  • Applications which handle readable files and can work together well (ex.: lines separated with \n, columns separated with \t): less/more, tail, head, sort, split, join, fold, par, grep, awk, column, wc, sed, tee
  • Configuration management
  • Application storage
  • Library registry
  • Disk cloning
    • Disk images for backup (dd)
    • Smaller than disk size images (skip unused space)
    • Compress disk images on the fly without storing the uncompressed version (dd | gzip)
    • Restoring disk images from backups
    • Disk recovery — HDDs, CDs, DVDs, USB sticks etc. — when they have bad sectors or scratches
    • Creating bootable USB sticks from a raw image file or an ISO (dd again)
  • Virtual filesystems
    • Mounting a raw image file or ISO as a filesystem
    • Mounting archives and compressed archives as a filesystem (tar, gz, bz, zip, rar)
    • Network filesystems look just like normal folders: Samba, NFS
    • Using various network protocols as filesystems: HTTP, FTP, SSH
  • Searching everywhere (find, grep, sed)
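The "compress on the fly" item in the list above works because both ends of the pipe are plain byte streams. A minimal Python sketch of the same idea follows. The function name `copy_compressed` is hypothetical, and an in-memory buffer stands in for a real device file, which would normally require root access to read.

```python
import gzip
import io
import shutil


def copy_compressed(source, target, chunk_size=1024 * 1024):
    """Stream bytes from `source` into `target`, gzip-compressing on the way.

    Both arguments are binary file objects; a block device opened with
    open(path, "rb") is read exactly like a regular file.
    """
    with gzip.GzipFile(fileobj=target, mode="wb") as compressed:
        shutil.copyfileobj(source, compressed, chunk_size)


# An in-memory stand-in for a disk with mostly unused (zeroed) space.
fake_disk = io.BytesIO(b"\x00" * 1_000_000 + b"some data")
image = io.BytesIO()
copy_compressed(fake_disk, image)
print(len(fake_disk.getvalue()), "->", len(image.getvalue()), "bytes")
```

The uncompressed image never exists as a whole: only one chunk at a time is held in memory.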

Plan9 from Bell Labs made it better

Current UNIX/Linux distros don’t implement this paradigm fully — ex.: network devices aren’t files — but some less known systems do (such as UNIX successor plan9 / inferno and their Linux correspondent glendix).

The plan9 project went onward in applying the paradigm for:

  • Processes
    • Process management
    • Inter process communication
    • Client-Server network communication
  • Network related issues:
    • Network interfaces are files
    • Access rights to network interfaces are based on filesystem access rights to symlinks pointing to interface files
    • The filesystem (9P) extends over the network as a network communication protocol
  • Graphics interfaces and mouse IO

Other innovations it brought us (which got implemented in UNIX/Linux):

  • UTF-8 / Unicode
  • Filesystem snapshotting
  • Union filesystems
  • Lightweight threads

How-to put a rabbit in the browser — 2015-08-23

A rationale for why you would want to do this is here.

I couldn’t find an AMQP library written in browser-side JavaScript, but I did find a STOMP JavaScript library which works nicely with a server-side STOMP plugin, so we use STOMP as our message protocol.

STOMP is a simpler, more text-oriented messaging protocol which includes HTTP-like headers. RabbitMQ has a plugin for implementing STOMP, and that plugin allows access to AMQP based exchanges and queues by mapping STOMP destinations to exchange/queue names.
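To show how text-oriented the protocol is, here is a small Python sketch of the wire format: a command line, header lines, a blank line, the body and a terminating NUL byte. The helpers `build_frame` and `parse_frame` are hypothetical and skip details such as header escaping and content-length, so treat this as an illustration and not as a client.

```python
def build_frame(command, headers, body=""):
    """Serialize a STOMP frame: command, headers, blank line, body, NUL."""
    lines = [command]
    lines += ["%s:%s" % (key, value) for key, value in headers.items()]
    return "\n".join(lines) + "\n\n" + body + "\x00"


def parse_frame(raw):
    """Inverse of build_frame for simple frames (no header escaping handled)."""
    head, _, body = raw.rstrip("\x00").partition("\n\n")
    command, *header_lines = head.split("\n")
    headers = dict(line.split(":", 1) for line in header_lines)
    return command, headers, body


# The destination header is what the RabbitMQ plugin maps to an exchange.
frame = build_frame("SEND", {"destination": "/exchange/test"}, "hello rabbit")
print(repr(frame))
```

In practice a library such as the STOMP.js client builds these frames for you; the sketch only shows what travels over the socket.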

You will need:

  • Native WebSockets or SockJS on the browser side
  • The STOMP.js browser library
  • A SockJS server which is conveniently available in the rabbitmq_web_stomp plugin for rabbitmq (bundled with rabbitmq)
  • A STOMP compliant message broker which is available via the rabbitmq_stomp plugin for rabbitmq (bundled with rabbitmq)

Install rabbitmq and enable the plugins:

sudo apt-get install rabbitmq-server
sudo rabbitmq-plugins enable rabbitmq_management
sudo rabbitmq-plugins enable rabbitmq_web_stomp
sudo service rabbitmq-server restart

Configure exchanges/queues:

  1. Log into http://localhost:15672/ with user guest and password guest.
  2. Create a new test exchange (called test in my code) and a new test queue (called test in my code).
  3. Create bindings between the new exchange and queue.
  4. Manually publish a message in the exchange and ensure it arrives in the queue.

Copy the following two files:

<!-- index.html -->

<!DOCTYPE html>
<html>
    <head>
        <title>Rabbits in the front-end</title>

        <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.js"></script>
        <script src="https://cdnjs.cloudflare.com/ajax/libs/stomp.js/2.3.3/stomp.js"></script>
        <script src="script.js"></script>
    </head>
    <body>
        <form>
            <input type="text" value="">
            <button id="send">Send</button>
        </form>
    </body>
</html>
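/** script.js */

var RabbitMQ = {
    hostname: window.location.hostname,
    port: 15674,
    path: "stomp/websocket",

    username: "guest",
    password: "guest",

    exchange: "/exchange/test",
    queue: "/queue/test",

    // Called for every message delivered from the subscribed queue
    onMessage: function(message) {
        console.log(message.body);
    },
    // Called once connected; `this` is the STOMP client
    onSuccess: function(message) {
        this.subscribe(RabbitMQ.queue, RabbitMQ.onMessage);
    },
    onError: function() {
        console.log("STOMP connection error", arguments);
    }
};

var ws = new WebSocket("ws://" + RabbitMQ.hostname + ":"+RabbitMQ.port+"/"+RabbitMQ.path);

var qc = Stomp.over(ws);
qc.heartbeat.outgoing = 0;
qc.heartbeat.incoming = 0;
qc.connect(RabbitMQ.username, RabbitMQ.password, RabbitMQ.onSuccess, RabbitMQ.onError);

$(window).load(function() {
    $("form button#send").click(function(e) {

        var parent = $(this).parent();

        if ($("input", parent).val()) {
            // Publish the input's content to the exchange, then clear the field
            qc.send(RabbitMQ.exchange, null, $("input", parent).val());

            $("input", parent).val("");
        }

        return false;
    });
});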

And finally visit http://localhost/index.html to check the results.

Rabbits in the browser — 2015-08-21

eCommerce applications are usually read-intensive, due to the number of products and category listings, and tend to optimize their scaling for a higher number of reads, for example by using replication and letting the slaves handle reads.

Writing performance often bottlenecks in the checkout phase of the application, where new orders are registered, stocks are balanced etc.

This type of bottleneck is all the more visible in highly cached applications where most of the read-intensive information is served from memory while the checkout still needs a lot of concurrent write access on a single master database.

Replacing synchronous on demand processing with asynchronous message passing and processing should:

  1. Allow more simultaneous connections, since the connections are simple TCP sockets
  2. Decrease the number of processes used — no nginx, no php-fpm, just tcp kernel-threads and rabbit worker threads
  3. Decrease the memory use — based on the number of consumers used to process the inbound data
  4. Decrease DB concurrency — based on the number of consumers doing the work rather than the number of buyers placing an order (orders of magnitude lower)

Messages with reply-queues could allow asynchronous responses to be received later on when the processor has finished its task. A TCP proxy in front of a cluster of RabbitMQ machines and a few PHP worker machines behind them should scale much better than receiving all the heavy-processing traffic in nginx + php-fpm processes.

Message brokers such as RabbitMQ can dynamically generate reply-queues when asked to do so and those queues are session-based only so their content is only accessible to the user that sent the request.
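The reply-queue idea can be sketched without a broker. The Python example below uses in-process queues and a thread as stand-ins for RabbitMQ and a PHP worker. All names are hypothetical and nothing here talks to a real broker: it only shows the request carrying its own private reply queue.

```python
import queue
import threading


def worker(requests):
    """Consume orders and answer on the reply queue carried by each message."""
    while True:
        message = requests.get()
        if message is None:  # sentinel: stop the worker
            break
        reply_to, order = message
        reply_to.put({"order": order, "status": "registered"})


requests = queue.Queue()
consumer = threading.Thread(target=worker, args=(requests,))
consumer.start()

# Each client creates a private reply queue and sends it with the request.
reply_queue = queue.Queue()
requests.put((reply_queue, "order-1"))

# The client is free to do other things and pick the reply up later.
reply = reply_queue.get(timeout=5)
print(reply)

requests.put(None)
consumer.join()
```

The number of worker threads, not the number of clients, decides how many orders are processed concurrently, which is the property the list above relies on.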

Security-wise, message brokers support TLS over the socket, but extra security measures can be envisioned (ex.: security tokens, message-digest checks etc.).

A short example of the above principles is here.