Mihai Stancu

Notes & Rants

OrientDB @ eMAG TechLabs — 2016-01-21

OrientDB @ eMAG TechLabs

I wrote a piece on OrientDB for eMAG TechLabs.

It’s an analysis of features and functionality vs. driver implementation availability from the PHP + Symfony2 + Doctrine2 developer’s perspective.

I also (re)wrote a driver to scavenge and integrate the functionality of other partially implemented drivers I found and analyzed.

Advertisements
Object Oriented Databases — 2015-12-14

Object Oriented Databases

When I said what I said about SQL and RDBMSs and that other thing I said about tree structures and hierarchies in RDBMSs this is what I meant (mostly).

Throwback Sunday

I was browsing DB Engines (valuable resource this is) looking at popularity rankings for various database systems and comparing features looking for hot new tech when I started digging into Object Oriented Databases again.

I searched for (open source?) Object Oriented Database engines more intensely nearly 6 years ago (DB Engines wasn’t around back then) and I was disappointed to find that there was little to no popular demand for pure OODBMSs. Every damn google search or wiki lookup spat out RDBMSs conjoint twin little brother, the ORDBMSs (Object-Relational Databases) but that didn’t fit the bill I had in mind.

At that time I did find one open source and pure OODBMS EyeDB which currently looks like a dead project (damn!).

I might have missed (read disregarded) some niche products (read proprietary)

I don’t remember reading about InterSystems Caché or the underlying technology MUMPS which looks very ahead of its time.

But I do remember some important players on the market: Versant and Objectivity which were (and still are) proprietary, as well as another intriguing approach JADE a proprietary full-stack system including a DB.

But why all the fuss? Why not just RDBMS like every one else (freak)?

It felt very strange to me that developers would go gentle into that good night. Developers are inherently lazy creatures which would rather spend 20h automating a 22h long repetitive task than blankly toil away at the repetitive task.

Why would they ever accept to learn an entirely new set of concepts about handling data, read about the mathematics behind it, and mentally bridge the gap between one concept and the other every damn day of the rest of their careers (a repetitive task)?

Why jump through all of these hoops when an OODBMS can achieve the same performance as any RDBMS (or better) and also do away with the systems’ impedance mismatch of using an ORM? Not to mention all the work of building and maintaining an ORM having to debug for it or to limit your access to DBMS features because of the ORM.

Why bother writing a CREATE TABLE which contains virtually the same thing as your class declaration? …and then endeavor to burden yourself with manually keeping every future change from the TABLE or the class in perfect sync with one another? ..DRY anyone?

Versant Object Database for example describes an awesome schema versioning capacity in their product which allows you to simply give the DB the newly compiled class structure and VOD will handle updating old entries to the new schema (eagerly or lazily depending on your requirements).

Databases don’t do trees — 2015-09-11

Databases don’t do trees

Convoluted solutions

Storing tree-like data structures (such as webpages, categories, blog comments, forum topics, nested menus, organization charts) in relational databases is a real challenge and even when you’re using ready-made optimized libraries it’s still convoluted.

  • adjacency list is recursive and don’t scale well;
  • materialized path uses unanchored text-comparison (like '%this%') — better scaling but no cigar;
  • nested sets (aka modified preorder tree traversal) are heavy maintenance: easy for data-reads one select query can fetch a lot of what you may need, but some update scenarios can touch half or more of the records in that table;
  • closure trees duplicate lots of record-to-record relationships and updates can remove/reinsert lots of record-to-record relationships;

All of the above patterns usually also use adjacency list in parallel to allow for consistency checks — in case the application-level implementation is bad or transactions aren’t used (or available) the trees can end up in incoherent states which need to be corrected (with more fallible code).

Would you try saying all of that in one single breath? It all seems too damn convoluted to work with.

Hats of to some

Oracle has some (syntactic) support for selecting hierarchical data from adjacency lists. It’s still searching for data recursively — all other patterns try to avoid this — but at least it’s being done without round-trip TCP communication from the application to the database and with possible query planner optimizations included.

But unfortunately there is not indexing support included with this Oracle feature which is where the support should be on this topic.

PostgreSQL offers syntactic support for querying materialized path columns as well as a custom data type for them with support for indexing said data type (using GiST).

Can you spell dysfunctional?

We need the ease of use of adjacency lists — only specify a parent — with the syntactical support Oracle offers them and the data type and indexing support PostgreSQL is offering.

To say this in one breath: CREATE INDEX category_parent ON category(parent) USING HIERARCHICAL, why doesn’t this just work after 40 years of SQL evolution?

SQL survived way too long — 2015-08-15

SQL survived way too long

SQL currently forces two computer programs to discuss what they want to do in an English based textual representation (wtf?) and both computer programs jump through hoops to adapt to that.

The database itself uses lexers, parsers, optimizers and interpretors to understand and execute the requested SQL statements. The consumer application uses DBALs, ActiveRecords or ORMs or other manners of SQL statement construction in order to convey them to the database.

The fact that two computer programs have to talk to each other in a language foreign to them and entire third party projects are dedicated to helping them cross these communication boundaries easier should give us enough pause (once again, wtf?).

Some of the disadvantages of forcing two natively binary machines to speak broken English:

  • computational overhead on both sides — either in generating queries or in parsing them
  • representational overhead on both sides — data communicated between the two entities is needlessly converted into text (including integers, floats, enums, sets) or it is at least syntactically escaped (strings, blobs)
  • structural duplication — a logical entity is declared once in the application logic and once as a (set of) table(s) in the database
  • added vulnerabilities — the entire spectrum of SQL injection attacks is caused by the textual nature of the communication
  • feature restriction — new features need to be added in database server logic, consumer application logic as well as SQL syntax logic (buzz words: standardization, vendor acceptance, uniformity) which ripples into implementations at the DBAL and ActiveRecord or ORM level in order for the consumer application logic to have access to it

SEQUEL the “Structured English Query Language” was born in the 70s and was used by (not necessarily technical) humans directly, that’s why it was English based. SQL is now a mature technology provided by many vendors with very high performance yields which made it a de facto standard for every data-driven development project — this is a classic case of exaptation.

The very same technology backbone without the “English” part and the “textual” part would mean:

  1. Less code to maintain — less effort and less errors
  2. Less market fragmentation from vendors in terms of behavior — or at least in terms of syntax
  3. No need for DBALs, ActiveRecords or ORMs — but complex client libraries and APIs would be needed to expose the same functionality
  4. Deduplication of structural code — consumer applications would provide authoritative structure of database objects
  5. Access to the entire consumer application library in the database — complex mathematical or logical functions would sometimes need to be implemented at the database level in the (more limited) SQL or PL/SQL languages even when already available at the application level; ex.: hashing functions, geo-positioning arithmetic, complex string manipulation, output format manipulation etc.
  6. More permeability of features between database server and consumer application — such as C-level datatypes such as structs, enums and unions with all their nested structuring and functionality which are currently not available in MySQL (unions, nested structs) but are partially in PostgreSQL
  7. Added permeability allows shifting responsibilities when necessary — moving some of the processing from an overburdened database server to an easier-to-scale application level