Mihai Stancu

Notes & Rants

Object Oriented Databases — 2015-12-14

When I said what I said about SQL and RDBMSs, and that other thing I said about tree structures and hierarchies in RDBMSs, this is what I meant (mostly).

Throwback Sunday

I was browsing DB Engines (valuable resource, this) looking at popularity rankings for various database systems and comparing features in search of hot new tech, when I started digging into Object Oriented Databases again.

I searched for (open source?) Object Oriented Database engines more intensely nearly 6 years ago (DB Engines wasn’t around back then) and I was disappointed to find that there was little to no popular demand for pure OODBMSs. Every damn Google search or wiki lookup spat out the RDBMS’s conjoined twin little brother, the ORDBMS (Object-Relational DBMS), but that didn’t fit the bill I had in mind.

At that time I did find one open-source, pure OODBMS: EyeDB, which currently looks like a dead project (damn!).

I might have missed (read: disregarded) some niche products (read: proprietary).

I don’t remember reading about InterSystems Caché or its underlying technology, MUMPS, which looks way ahead of its time.

But I do remember some important players on the market: Versant and Objectivity, which were (and still are) proprietary, as well as another intriguing approach, JADE, a proprietary full-stack system that includes a DB.

But why all the fuss? Why not just use an RDBMS like everyone else (freak)?

It felt very strange to me that developers would go gentle into that good night. Developers are inherently lazy creatures who would rather spend 20h automating a 22h-long repetitive task than blankly toil away at the repetitive task.

Why would they ever agree to learn an entirely new set of concepts about handling data, read about the mathematics behind it, and mentally bridge the gap between one concept and the other every damn day for the rest of their careers (a repetitive task)?

Why jump through all of these hoops when an OODBMS can achieve the same performance as any RDBMS (or better) and also do away with the impedance mismatch that using an ORM entails? Not to mention all the work of building and maintaining an ORM, having to debug through it, or having to limit your access to DBMS features because of it.

Why bother writing a CREATE TABLE which contains virtually the same thing as your class declaration? …and then burden yourself with manually keeping every future change to the TABLE and the class in perfect sync with one another? …DRY, anyone?
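To make the duplication concrete, here’s a minimal sketch (hypothetical User entity, generic SQL, not from any particular project) of the very same structure declared twice:

```php
<?php

// Hypothetical User entity — the structure as the application sees it.
class User
{
    public int $id;
    public string $email;
    public ?string $lastname = null;
}

// ...and the CREATE TABLE restating that exact same structure, which must
// now be kept in sync with the class by hand on every future change.
$ddl = <<<'SQL'
CREATE TABLE user (
    id       INT PRIMARY KEY,
    email    VARCHAR(255) NOT NULL,
    lastname VARCHAR(255) NULL
);
SQL;
```

Every field exists twice; rename one and you must remember to rename the other.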

Versant Object Database, for example, describes an awesome schema versioning capability in their product: you simply give the DB the newly compiled class structure and VOD handles updating old entries to the new schema (eagerly or lazily, depending on your requirements).

MultiKeyMaps for Tuples — 2015-09-14

Understanding MultiKeyMaps of Tuples

A database table is an on-disk Vector of Records, while its Indexes are key/value Maps from String to Record. You can filter, sort and group it by any Indexed column (or actually any column within the Record).

Take this concept from RDBMSs (an on-disk environment), move it into a purely in-memory environment, add all the possible optimizations for memory efficiency and speed, and you get a MultiKeyMap of Records.
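A minimal sketch of the idea, using plain arrays rather than any real container API: build one index per field, mapping field value to record, so a lookup by any indexed field avoids a full scan.

```php
<?php

// Two toy records; in practice these would be objects or rows.
$records = [
    ['id' => 1, 'email' => 'a@example.com', 'lastname' => 'Doe'],
    ['id' => 2, 'email' => 'b@example.com', 'lastname' => 'Smith'],
];

// Build one index per field: field => value => record.
$indexes = [];
foreach (['id', 'email', 'lastname'] as $field) {
    foreach ($records as $record) {
        $indexes[$field][$record[$field]] = $record;
    }
}

// Constant-time lookup by any indexed column, no iteration needed.
$match = $indexes['email']['b@example.com'];
```

A real container would keep these indexes up to date on insert/remove and handle non-unique keys; the sketch only shows the core trade of memory for lookup speed.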

Basically it’s just another container type that I think should be in the standard libraries of (m)any languages.

What kinds of problems would it fix?

I’ve seen programmers insert stuff into temporary/in-memory tables in order to perform filtering, sorting and grouping on multiple keys (sometimes for relatively small data-sets).

Those programmers could have written their own filtering and sorting logic, but it would have been far more error prone, would have required testing, would have been more time consuming, and quite possibly less memory/speed efficient (in an interpreted language, compared to a compiled library).

I would assume that those kinds of data structures are already present inside RDBMS codebases — granted, they would tend to be very purpose-written, including the on-diskyness aspect they naturally inherit.

Standard container libraries rarely make assumptions about the data they carry, in order to provide the most abstract, least presumptive and most flexible tool they can. This is likely why this concept wouldn’t easily take root in a statically typed language, but even without the use of dynamic reflection / runtime type information it could still be implemented abstractly enough to do its job well.

How do we currently bridge this gap?

Doctrine’s ArrayCollection with Criteria-based filtering is an ORM’s approach to filtering a Collection of Records. The presumption is that the Collection is composed of some kind of logical Records whose fields are accessible.

It is used in conjunction with mapped relationships (1:n, m:1, m:n) between entities which are sometimes big enough to require subfiltering.

    $expr = Criteria::expr();
    $criteria = Criteria::create();
    $criteria->where($expr->gte('start', $start));
    $criteria->andWhere($expr->lte('end', $end));
    $subselection = $collection->matching($criteria);

The above code would either amend the ORM query used to fetch the data from the RDBMS or, if the data is already loaded, iterate over the collection and find the elements matching the specified criteria. The filtering would obviously (and unnecessarily) occur on every call.

How would I do it?

class IndexedCollection extends ArrayCollection {
    /* indexing logic: https://github.com/mihai-stancu/collections/blob/master/IndexedCollection.php */ 
}

class Record {
    /* Simple structured class containing an email property */
}

$records = new IndexedCollection(
    array(
        $record1,
        $record2,
        $record3,
        $record4
    ),
    array(
        'id' => true,
        'email' => true,
        'lastname' => false,
    )
);

if ($records->containsKeyBy('email', 'contact@example.com')) {
    $record = $records->getBy('email', 'contact@example.com');
    $records->removeBy('email', 'contact@example.com');
}

if (isset($records['lastname:Smith'])) {
    $record = $records['lastname:Smith'];
    unset($records['lastname:Smith']);
}

Admittedly the above is missing some syntactic sugar one could only get if the host language were cooperative in this endeavor, but it gets the point across.

Databases don’t do trees — 2015-09-11

Convoluted solutions

Storing tree-like data structures (such as webpages, categories, blog comments, forum topics, nested menus, organization charts) in relational databases is a real challenge and even when you’re using ready-made optimized libraries it’s still convoluted.

  • adjacency lists are recursive and don’t scale well;
  • materialized paths use string prefix comparison (LIKE 'this/%') — better scaling but no cigar;
  • nested sets (aka modified preorder tree traversal) are heavy maintenance: easy on reads (one SELECT query can fetch a lot of what you may need), but some update scenarios can touch half or more of the records in that table;
  • closure tables duplicate lots of record-to-record relationships, and updates can remove/reinsert lots of those relationships;

All of the above patterns usually also maintain an adjacency list in parallel to allow for consistency checks — if the application-level implementation is bad, or transactions aren’t used (or available), the trees can end up in incoherent states which need to be corrected (with more fallible code).
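For reference, fetching a whole subtree out of a plain adjacency list is exactly the recursion the other patterns exist to avoid. On engines that support recursive CTEs (standard SQL), it looks roughly like this — a sketch only, assuming a hypothetical category(id, parent) table:

```php
<?php

// Sketch: a recursive CTE walking an adjacency list downwards from a
// given root, assuming a hypothetical category(id, parent) table.
$sql = <<<'SQL'
WITH RECURSIVE subtree AS (
    SELECT id, parent FROM category WHERE id = :root
    UNION ALL
    SELECT c.id, c.parent
    FROM category c
    JOIN subtree s ON c.parent = s.id
)
SELECT id, parent FROM subtree;
SQL;
```

One statement instead of one round trip per tree level, but the engine is still recursing internally.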

Would you try saying all of that in one single breath? It all seems too damn convoluted to work with.

Hats off to some

Oracle has some (syntactic) support for selecting hierarchical data from adjacency lists. It still searches for data recursively — something all the other patterns try to avoid — but at least it’s done without round-trip TCP communication between the application and the database, and with possible query-planner optimizations included.
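The Oracle syntax in question, sketched against the same hypothetical category(id, parent) table:

```php
<?php

// Sketch of Oracle's hierarchical query syntax (START WITH / CONNECT BY)
// for selecting a whole subtree from an adjacency list in one statement.
$sql = <<<'SQL'
SELECT id, parent, LEVEL
FROM category
START WITH id = :root
CONNECT BY PRIOR id = parent;
SQL;
```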

But unfortunately there is no indexing support included with this Oracle feature, which is exactly where the support is needed on this topic.

PostgreSQL offers syntactic support for querying materialized path columns, as well as a custom data type for them (the ltree extension) with support for indexing said data type (using GiST).
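A sketch of what that looks like with ltree — hypothetical table and path values, but real PostgreSQL operators (`<@` meaning “is a descendant of”):

```php
<?php

// Sketch: PostgreSQL's ltree data type for materialized paths, with a
// GiST index so subtree queries can actually use an index.
$sql = <<<'SQL'
CREATE EXTENSION IF NOT EXISTS ltree;

CREATE TABLE category (
    id   SERIAL PRIMARY KEY,
    path LTREE NOT NULL
);

CREATE INDEX category_path_gist ON category USING GIST (path);

-- All descendants of 'top.science', answered via the GiST index:
SELECT id, path FROM category WHERE path <@ 'top.science';
SQL;
```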

Can you spell dysfunctional?

We need the ease of use of adjacency lists — only specify a parent — with the syntactic support Oracle offers and the data-type and indexing support PostgreSQL provides.

To say this in one breath: CREATE INDEX category_parent ON category(parent) USING HIERARCHICAL. Why doesn’t this just work after 40 years of SQL evolution?

SQL survived way too long — 2015-08-15

SQL currently forces two computer programs to discuss what they want to do in an English-based textual representation (wtf?), and both computer programs jump through hoops to adapt to that.

The database itself uses lexers, parsers, optimizers and interpreters to understand and execute the requested SQL statements. The consumer application uses DBALs, ActiveRecords, ORMs or other manners of SQL statement construction in order to convey them to the database.

The fact that two computer programs have to talk to each other in a language foreign to both, and that entire third-party projects are dedicated to helping them cross this communication boundary more easily, should give us pause (once again, wtf?).

Some of the disadvantages of forcing two natively binary machines to speak broken English:

  • computational overhead on both sides — either in generating queries or in parsing them
  • representational overhead on both sides — data communicated between the two entities is needlessly converted into text (including integers, floats, enums, sets) or it is at least syntactically escaped (strings, blobs)
  • structural duplication — a logical entity is declared once in the application logic and once as a (set of) table(s) in the database
  • added vulnerabilities — the entire spectrum of SQL injection attacks is caused by the textual nature of the communication
  • feature restriction — new features need to be added in database server logic, consumer application logic as well as SQL syntax logic (buzz words: standardization, vendor acceptance, uniformity) which ripples into implementations at the DBAL and ActiveRecord or ORM level in order for the consumer application logic to have access to it

SEQUEL, the “Structured English Query Language”, was born in the 70s and was meant to be used directly by (not necessarily technical) humans; that’s why it was English-based. SQL is now a mature technology provided by many vendors with very high performance yields, which made it a de facto standard for every data-driven development project — this is a classic case of exaptation.

The very same technology backbone without the “English” part and the “textual” part would mean:

  1. Less code to maintain — less effort and fewer errors
  2. Less market fragmentation from vendors in terms of behavior — or at least in terms of syntax
  3. No need for DBALs, ActiveRecords or ORMs — but complex client libraries and APIs would be needed to expose the same functionality
  4. Deduplication of structural code — consumer applications would provide authoritative structure of database objects
  5. Access to the entire consumer application library in the database — complex mathematical or logical functions would sometimes need to be implemented at the database level in the (more limited) SQL or PL/SQL languages even when already available at the application level; ex.: hashing functions, geo-positioning arithmetic, complex string manipulation, output format manipulation etc.
  6. More permeability of features between database server and consumer application — for example C-level datatypes such as structs, enums and unions, with all their nested structuring and functionality, which are currently not available in MySQL (unions, nested structs) but partially are in PostgreSQL
  7. Added permeability allows shifting responsibilities when necessary — moving some of the processing from an overburdened database server to an easier-to-scale application level