Mihai Stancu

Notes & Rants

OrientDB @ eMAG TechLabs — 2016-01-21

I wrote a piece on OrientDB for eMAG TechLabs.

It’s an analysis of features and functionality vs. driver implementation availability from the PHP + Symfony2 + Doctrine2 developer’s perspective.

I also (re)wrote a driver to scavenge and integrate the functionality of other partially implemented drivers I found and analyzed.

JSON logformat and analysis — 2016-01-13

Setup log format

So you’re tired of reading Apache logs with column -t, or you need to process them with external tools, maybe push them into Logstash? Say no more:

#
# Inside your virtual host definition
#

# Declaring your custom log format as a JSON structure
LogFormat "{\"time\":\"%{%FT%T%z}t\",\"response\":{\"status\":\"%>s\",\"duration\":\"%D\",\"length\":\"%B\"},\"request\":{\"method\":\"%m\",\"host\":\"%V\",\"port\":\"%p\",\"url\":\"%U\",\"query\":\"%q\"},\"client\":{\"ip\":\"%a\",\"agent\":\"%{User-agent}i\",\"referer\":\"%{Referer}i\"}}" json_log

# Declaring an environment variable based on the type of file requested
SetEnvIf Request_URI "(\.gif|\.png|\.jpg|\.ico|\.css|\.js|\.eot|\.ttf|\.woff2?)$" request_static=1

# Declaring separate log files (one for static content, one for dynamic pages) with the new log format
CustomLog /path/to/log/access_static.log  json_log env=request_static
CustomLog /path/to/log/access_dynamic.log json_log env=!request_static
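
A request logged with this format comes out as a single JSON line, something like this made-up sample:

{"time":"2016-01-13T10:15:30+0200","response":{"status":"200","duration":"5155","length":"2326"},"request":{"method":"GET","host":"www.example.com","port":"80","url":"/index.php","query":"?page=2"},"client":{"ip":"93.184.216.34","agent":"Mozilla/5.0","referer":"http://www.example.com/"}}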

Tool to read/analyse the logs (manually)

A small tool called jq reads each line of input, treats it as a JSON object and outputs it pretty-printed.

The package itself doesn’t have any dependencies and is readily available in Linux repos.

Minimal usage:

echo '{"a": 1, "b": 2, "c": 3, "d": [{"e": 4}]}' | jq .
{
  "a": 1,
  "b": 2,
  "c": 3,
  "d": [
    {
      "e": 4
    }
  ]
}

Object restructuring:

echo '{"a": 1, "b": 2, "c": 3, "d": [{"e": 4}]}' | jq '{"c": .a, "e": .d[0].e}'
{
  "c": 1,
  "e": 4
}

Parsing string content as JSON:

echo '{"a":1,"b":"[{\"c\":2,\"d\":\"3\"}, {\"c\":3,\"e\":\"5\"}]"}' | jq '.["b"]|fromjson'
[
  {
    "c": 2,
    "d": "3"
  },
  {
    "c": 3,
    "e": "5"
  }
]

Filtering:

echo '{"a":1,"b":"[{\"c\":2,\"d\":\"3\"}, {\"c\":3,\"e\":\"5\"}]"}' | jq '.["b"]|fromjson|.[]|select(.c == 2)'
{
  "c": 2,
  "d": "3"
}
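
Combined with the log format above, ad-hoc log analysis turns into one-liners. A couple of sketches, assuming the log paths from the vhost config:

# requests per status code, most frequent first
jq -r '.response.status' /path/to/log/access_dynamic.log | sort | uniq -c | sort -rn

# URLs of requests slower than 1 second (%D logs microseconds)
jq -r 'select((.response.duration|tonumber) > 1000000) | .request.url' /path/to/log/access_dynamic.log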

Object Oriented Databases — 2015-12-14

When I said what I said about SQL and RDBMSs, and that other thing I said about tree structures and hierarchies in RDBMSs, this is what I meant (mostly).

Throwback Sunday

I was browsing DB Engines (valuable resource, this is), looking at popularity rankings for various database systems and comparing features in search of hot new tech, when I started digging into Object Oriented Databases again.

I searched for (open source?) Object Oriented Database engines more intensely nearly 6 years ago (DB Engines wasn’t around back then) and I was disappointed to find that there was little to no popular demand for pure OODBMSs. Every damn Google search or wiki lookup spat out the RDBMS’s conjoined twin little brother, the ORDBMS (Object-Relational Database), but that didn’t fit the bill I had in mind.

At that time I did find one open-source, pure OODBMS, EyeDB, which currently looks like a dead project (damn!).

I might have missed (read: disregarded) some niche products (read: proprietary)

I don’t remember reading about InterSystems Caché or its underlying technology MUMPS, which looks way ahead of its time.

But I do remember some important players on the market: Versant and Objectivity, which were (and still are) proprietary, as well as another intriguing approach, JADE, a proprietary full-stack system that includes a DB.

But why all the fuss? Why not just use an RDBMS like everyone else (freak)?

It felt very strange to me that developers would go gentle into that good night. Developers are inherently lazy creatures who would rather spend 20h automating a 22h-long repetitive task than blankly toil away at the repetitive task.

Why would they ever accept learning an entirely new set of concepts about handling data, reading about the mathematics behind it, and mentally bridging the gap between one concept and the other every damn day for the rest of their careers (a repetitive task)?

Why jump through all of these hoops when an OODBMS can achieve the same performance as any RDBMS (or better) and also do away with the impedance mismatch of using an ORM? Not to mention all the work of building and maintaining an ORM, having to debug it, or having to limit your access to DBMS features because of it.

Why bother writing a CREATE TABLE which contains virtually the same thing as your class declaration? …and then endeavor to keep every future change to the TABLE or the class in perfect sync with the other? …DRY, anyone?
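
To make the duplication concrete, here’s a hypothetical example (names are made up): the same shape is spelled out twice, once for the language and once for the DB, and every future migration has to change both in lockstep.

class User
{
    /** @var int */
    public $id;

    /** @var string */
    public $email;

    /** @var \DateTime */
    public $createdAt;
}

CREATE TABLE user (
    id         INT PRIMARY KEY,
    email      VARCHAR(255),
    created_at DATETIME
);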

Versant Object Database, for example, describes an awesome schema-versioning capability in their product: you simply give the DB the newly compiled class structure and VOD handles updating old entries to the new schema (eagerly or lazily, depending on your requirements).

MultiKeyMaps for Tuples — 2015-09-14

Understanding MultiKeyMaps of Tuples

A database table is an on-disk Vector of Records, while its Indexes are key/value Maps of String to Record. You can filter, sort and group it by any Indexed column (or actually any column within the Record).

Take this concept from RDBMSs (an on-disk environment), move it into a purely in-memory environment, add all the possible optimizations for memory efficiency and speed, and that’s a MultiKeyMap of Records.

Basically it’s just another container type that I think should be in the standard libraries of (m)any languages.

What kinds of problems would it fix?

I’ve seen programmers insert stuff into temporary/in-memory tables in order to perform filtering, sorting and grouping on multiple keys (sometimes for relatively small data-sets).

Those programmers could have written their own filtering and sorting logic, but it would have been far more error-prone, it would have required testing, it would have been more time-consuming and quite possibly less memory/speed efficient (in an interpreted language, compared to a compiled library).

I would assume that these kinds of data structures are already present in RDBMS libraries — granted, they would tend to be very purpose-written, including the on-diskyness aspect they would naturally inherit.

Standard container libraries rarely make assumptions about the data they carry, in order to provide the most abstract, least presumptive and most flexible tool they can. This is likely why this concept wouldn’t easily take root in a statically typed language, but even without dynamic reflection / runtime type information it could still be implemented abstractly enough to do its job well.

How do we currently bridge this gap?

Doctrine’s ArrayCollection with Criteria-based filtering is an ORM’s approach to filtering a Collection of Records. The presumption is that the Collection is composed of some kind of logical Records whose fields are accessible.

It is used in conjunction with mapped relationships (1:n, m:1, m:n) between entities, which are sometimes big enough to require sub-filtering.

    $expr = Criteria::expr();
    $criteria = Criteria::create();
    $criteria->where($expr->gte('start', $start));
    $criteria->andWhere($expr->lte('end', $end));
    $subselection = $collection->matching($criteria);

The above code would either amend the ORM query created to fetch the data from the RDBMS or, if the data is already present, iterate over the collection and find the elements matching the specified criteria. The filtering would obviously (and unnecessarily) occur every time.

How would I do it?

class IndexedCollection extends ArrayCollection {
    /* indexing logic: https://github.com/mihai-stancu/collections/blob/master/IndexedCollection.php */ 
}

class Record {
    /* Simple structured class containing an email property */
}

$records = new IndexedCollection(
    array(
        $record1,
        $record2,
        $record3,
        $record4
    ),
    array(
        'id' => true,
        'email' => true,
        'lastname' => false,
    )
);

if ($records->containsKeyBy('email', 'contact@example.com')) {
    $record = $records->getBy('email', 'contact@example.com');
    $records->removeBy('email', 'contact@example.com');
}

if (isset($records['lastname:Smith'])) {
    $record = $records['lastname:Smith'];
    unset($records['lastname:Smith']);
}

Admittedly, the above is missing some syntactic sugar one could only get if the host language were cooperative in this endeavor, but it gets the point across.

Databases don’t do trees — 2015-09-11

Convoluted solutions

Storing tree-like data structures (such as webpages, categories, blog comments, forum topics, nested menus, organization charts) in relational databases is a real challenge and even when you’re using ready-made optimized libraries it’s still convoluted.

  • adjacency lists are recursive (see the sketch below) and don’t scale well;
  • materialized path uses unanchored text comparison (LIKE '%this%') — better scaling, but no cigar;
  • nested sets (aka modified preorder tree traversal) are heavy maintenance: easy for reads, since one SELECT query can fetch a lot of what you may need, but some update scenarios can touch half or more of the records in the table;
  • closure trees duplicate lots of record-to-record relationships, and updates can remove/reinsert lots of them;

All of the above patterns usually also use an adjacency list in parallel to allow for consistency checks — in case the application-level implementation is bad or transactions aren’t used (or available), the trees can end up in incoherent states which need to be corrected (with more fallible code).

Would you try saying all of that in one single breath? It all seems too damn convoluted to work with.
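
To make the first bullet concrete: with an adjacency list each record only knows its parent, so fetching a whole subtree forces a recursive query. A minimal sketch in standard SQL (table and column names are made up; WITH RECURSIVE is available in PostgreSQL and others, while Oracle spells it CONNECT BY):

WITH RECURSIVE subtree AS (
    SELECT id, parent_id, name
    FROM category
    WHERE id = 42                 -- the subtree root
  UNION ALL
    SELECT c.id, c.parent_id, c.name
    FROM category c
    JOIN subtree s ON c.parent_id = s.id
)
SELECT * FROM subtree;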

Hats off to some

Oracle has some (syntactic) support for selecting hierarchical data from adjacency lists. It’s still searching for data recursively — all other patterns try to avoid this — but at least it’s being done without round-trip TCP communication from the application to the database and with possible query planner optimizations included.

But unfortunately there is no indexing support included with this Oracle feature, which is exactly where the support would matter most on this topic.

PostgreSQL offers syntactic support for querying materialized-path columns, as well as a custom data type for them, with support for indexing said data type (using GiST).
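
A sketch of what that looks like with PostgreSQL’s ltree extension (names are made up):

CREATE EXTENSION ltree;

CREATE TABLE category (
    id   SERIAL PRIMARY KEY,
    path ltree NOT NULL
);

-- GiST index supporting the ancestor/descendant operators
CREATE INDEX category_path ON category USING GIST (path);

-- all descendants of Top.Electronics (uses the index, no '%text%' scans)
SELECT * FROM category WHERE path <@ 'Top.Electronics';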

Can you spell dysfunctional?

We need the ease of use of adjacency lists — only specify a parent — with the syntactic support Oracle offers and the data-type and indexing support PostgreSQL offers.

To say this in one breath: why doesn’t CREATE INDEX category_parent ON category(parent) USING HIERARCHICAL just work after 40 years of SQL evolution?

Everything is a file — 2015-08-26

UNIX invented it, BSD and Linux gave it to the world

Everything is a file is a very successful paradigm in the UNIX/Linux communities, one that has allowed the kernel to simplify and uniformize how it uses devices, which are exposed to the user as files. Every file is treated as a bag of bytes, and reading from or writing to a file is straightforward.

Besides actual data storage, a lot of fruitful exaptation has been derived from this paradigm and from the UNIX/Linux file system conventions:

  • Files, folders, symlinks, hardlinks, named pipes (fifo), network pipes, devices
  • Applications which handle readable files and can work together well (ex.: lines separated with \n, columns separated with \t): less/more, tail, head, sort, split, join, fold, par, grep, awk, column, wc, sed, tee
  • Configuration management
  • Application storage
  • Library registry
  • Disk cloning
    • Disk images for backup (dd)
    • Smaller than disk size images (skip unused space)
    • Compress disk images on the fly without storing the uncompressed version (dd | gz; see the sketch after this list)
    • Restoring disk images from backups
    • Disk recovery — HDDs, CDs, DVDs, USB sticks etc. — when they have bad sectors or scratches
    • Creating bootable USB sticks from a raw image file or an ISO (dd again)
  • Virtual filesystems
    • Mounting a raw image file or ISO as a filesystem
    • Mounting archives and compressed archives as a filesystem (tar, gz, bz, zip, rar)
    • Network filesystems look just like normal folders (SAMBA, NFS)
    • Using various network protocols as filesystems: HTTP, FTP, SSH
  • Searching everywhere (find, grep, sed)
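
A few of the disk-cloning items above as shell sketches (device and file names are made up; double-check the of= target before running anything like this):

# clone a disk into a compressed image without storing the raw copy
dd if=/dev/sda bs=4M | gzip > sda.img.gz

# restore the image later
gunzip -c sda.img.gz | dd of=/dev/sda bs=4M

# write an ISO to a USB stick to make it bootable
dd if=image.iso of=/dev/sdb bs=4M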

Plan9 from Bell Labs made it better

Current UNIX/Linux distros don’t implement this paradigm fully — ex.: network devices aren’t files — but some lesser-known systems do (such as UNIX’s successor plan9 / inferno and their Linux correspondent glendix).

The plan9 project went further in applying the paradigm to:

  • Processes
    • Process management
    • Inter process communication
    • Client-Server network communication
  • Network related issues:
    • Network interfaces are files
    • Access rights to network interfaces are based on filesystem access rights to symlinks pointing to interface files
    • The filesystem (9P) extends over the network as a network communication protocol
  • Graphics interfaces and mouse IO

Other innovations it brought us (which got implemented in UNIX/Linux):

  • UTF-8 / Unicode
  • Filesystem snapshotting
  • Union filesystems
  • Lightweight threads

Rabbits in the browser — 2015-08-21

eCommerce applications are usually read-intensive — due to the number of products and category listings — and tend to optimize their scaling for a higher number of reads, for example by using replication and letting the slaves handle reads.

Write performance often bottlenecks in the checkout phase of the application, where new orders are registered, stocks are balanced etc.

This type of bottleneck is all the more visible in highly cached applications where most of the read-intensive information is served from memory while the checkout still needs a lot of concurrent write access on a single master database.

Replacing synchronous on demand processing with asynchronous message passing and processing should:

  1. Allow more simultaneous connections — since the connections are simple TCP sockets
  2. Decrease the number of processes used — no nginx, no php-fpm, just TCP kernel threads and RabbitMQ worker threads
  3. Decrease memory use — based on the number of consumers used to process the inbound data
  4. Decrease DB concurrency — based on the number of consumers doing the work rather than the number of buyers placing an order (orders of magnitude lower)

Messages with reply-queues could allow asynchronous responses to be received later, when the processor has finished its task. A TCP proxy in front of a cluster of RabbitMQ machines, with a few PHP worker machines behind them, should scale much better than receiving all the heavy-processing traffic in nginx + php-fpm processes.

Message brokers such as RabbitMQ can dynamically generate reply-queues when asked to do so and those queues are session-based only so their content is only accessible to the user that sent the request.
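
A minimal sketch of that request/reply flow using the php-amqplib client (queue names and credentials are made up):

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();

// Ask the broker for an exclusive, auto-generated reply queue for this session
list($replyQueue, ,) = $channel->queue_declare('', false, false, true, true);

// Publish the checkout payload, telling the worker where to reply
$message = new AMQPMessage(
    json_encode(array('order_id' => 42)),
    array('reply_to' => $replyQueue, 'correlation_id' => uniqid())
);
$channel->basic_publish($message, '', 'checkout');

// Block until the worker's asynchronous response arrives on the reply queue
$channel->basic_consume($replyQueue, '', false, true, false, false, function ($response) {
    echo $response->body, "\n";
});
while (count($channel->callbacks)) {
    $channel->wait();
}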

Security-wise, message brokers support TLS over the socket, but extra security measures can be envisioned — ex.: security tokens, message-digest checks etc.

A short example of the above principles is here.