SQL or NoSQL – Why not to use both (in PostgreSQL)

NoSQL databases have become very popular in last years and there is a plenty of various options available. It looks like traditional relational databases (RDBMs) are almost not needed any more. NoSQL solutions are advertised as faster, more scalable and easier to use. So who would care about relations, joins, foreign keys and similar stuff (not talking about ACID properties, transactions, transaction isolation)? Who would,  if NoSQLs can make your life much easier. But there is a key insight about NoSQL databases – their wonderful achievements are possible because they made their life easier too is some aspects. But that comes with some price – would you be happy, if your bank will store your saving in MongoDb?

However there are many environments, where NoSQL databases shine – especially when there are huge amounts of simple data structures, which need to be scaled massively across the globe and where these data are not of much value – solutions like social networks, instant messaging etc. are not so much concerned about data consistency or data loss, because these data are basically valueless. (Their business model is just based on sharing absolutely trivial data, where one piece can be easily replaced with another and it does not matter if some pieces are lost. Consider – what will happen if whole Facebook will go away in one minute? Nothing! Few people will be pissed off because they think their online profile was cool, few sad that they cannot share their meaningless achievements with so called ‘friends’, but generally considered nothing special will happen and no real value will be lost. People will just switch to another provider and fill it’s database with tons of trivialities and will easily forget about data in their previous account).

I don’t want to create impression that NoSQL databases are useless, they are very good for certain scenarios (and we need to remember that NoSQL is rather broad category, it includes structured documents stores, key-value stores, object databases etc. – each one has it’s particular niche, where it excels), but relational databases are also good, actually very good. Relational model is fairly good abstraction of very many real world situations, data structures, entities, however we call them. And relational databases provide solid tools to works with them. So it make sense to use them in many cases. It might bit more difficult to start with relational database then with schema-less document store, but  in the long run it should pay off. And what is really nice it’s not about one or another solution, but we can use both and combine them smartly and inventively.
So enough of general mumbo jumbo – let’s get to my particular case – I’ve been looking for data store for my new project and considered to try MongoDb this time ( while in past I stuck to relational DBs), however finally decided for PostgreSQL (again) – and I’d like to share some tests, findings and thoughts.

For data I’ll use this relational model (simplified):

erd

This is basically commonly referred author – book schema (used in many database examples),  but here to make it more real life like the relationship between authors and books is many-to-many.  Data are taken from my previous project mybookshelf – approximately 30k+ books and ~10k authors – so relatively small database – but with the real data.

I’ve played with these 3 set-ups:

  • classical relationship data store – PostgreSQL 9.5, accessed from python with psycopg2 and SQLAlchemy ORM.
  • MongoDB 3.2.5 – in order to leverage capabilities of document based approach, data schema is flattened into one document. Mongo is accessed from pymongo.
  • Use PostgreSQL 9.5 JSONB  type for structured documents –  same flattened schema as in Mongo.

By no means these tests should be considered as benchmarks, for that purpose they are too simplistic.  Rather it should be considered as quick peek into particular data stores capabilities.

Basic stuff

Let’s start with something simple (all examples are from IPython notebook running Python 3.4):

Get all ebooks via SQLAlchemy ORM:

It seems to be be quite slow, what about Mongo?

Yes, Mongo is significantly faster, however we should not forget that ORM is doing much much more then just loading data – it maps data to objects, observes objects changes, maintains local ‘unit of work’ context etc. This extra functionality costs us many cycles.  Especially if we work with many objects like here.  However we can always step back to plain SQL:

Here performance is even slightly better then in Mongo. But it does not retrieve details for each book – if we want all details we have to join ebook table with other tables, this is the query ORM executes in background:

OK joins take  time – we know that – but it’s not that bad – still roughly comparable with Mongo.

And now let’s look at PostgreSQL document solution – selecting from one table with JSONB column ( same JSON data as in Mongo):

Interesting – why it takes longer then relational model?

In web applications usually we work with fewer records (paginated, sorted) – so let’s see how it looks if we limit number of rows and sort them (no index yet):

Last result looks bit weird – why joined select takes longer then ORM fetch?  It’s  because this query is  suboptimal –  it is better first to limit ebook rows and then join – this is actually what happens in ORM fetch:

That’s better – Mongo is still faster, but not so much.

Complex query

Now let’s try a complex search across several attributes –  returns records where all words are found either in book title,  series or authors name:

and Mongo equivalent:

Mongo is again faster, but PostgreSQL solution is better, because it searches without diacritics (something very important for us living in the middle Europe).  I did not find any easy way how to do it in Mongo (only in full text search – which is discussed below).

Again ORM has some extra costs, but even direct SQL query is slower then Mongo:

We can do same query also for PostgreSQL JSONB documents:

So even faster then Mongo and with diacritics insensitive match. But  note about caching in result would be probably true, because if  I try similar query without limit (there are approximately 50 matches in total), query takes much longer:

while Mongo is faster for all results:

 Full Text Search

Both Mongo and PostgreSQL offer full text search. In PostgreSQL a column (full_text) was added to ebook table containing full text search data (TSVECTOR), concatenation of  title, series title and all authors names. Both PostgreSQL and Mongo use full text index and support case insensitive, diacritics insensitive search.

Again we can assume there is significant ORM overhead, let’s look at pure sql query:

And Mongo:

In fulltext search PosgreSQL is faster.

Updates

I was focused mostly on queries – because my use case will be mostly read orientated, but could not resist to test updating author name, where I expected relational database to  really excel due to it’s data representation.

Nor so fast, what about Mongo?

Ooops – why PostgreSQL is so slow – maybe it’s again ORM part. Let’s look at pure SQL update:

OK it looks like update time is spent in the database. Well done Mongo,  I did not really expect so fast updates.

PostgreSQL JSONB Fun

PostgreSQL offers many useful  operators and functions, with which we can create different queries – functionality is roughly comparable with Mongo.  Few examples below:

@> operator find document containing given json:

That’s superfast, compare with Mongo:

->> extracts text and can be used for instance  in where clause:

we can use join on JSONB array to get distinct authors:

or use it to get count of books per author:

And we can do much more …

Conclusion

There are many different databases and they differ much in functionality and targeted usage. Despite of current  popularity of NoSQL databases, traditional relational databases (RDBMs) are viable solutions for many applications storage needs including dynamic web applications. RDBMs are rock solid and proven in many battles. They are know to be reliable and scalable.
Just performance should not be the main selection criteria,  unless such performance is really and truly needed. RDBMs can offer much in area of reliability and data consistency. Neither ease of use should be overestimated – there is a good reason why relational databases are more complex.
I personally would prefer RDMS like PostgreSQL as a default choice, unless there is very good reason to use other (NoSQL) database.  And even in these cases, not all data must go to other database – we can use it for instance as a ‘data cache’   for fast reads from replicated servers.
And when looking for document based stores or key-value stores, we do not have to leave PostgreSQL environment, because it provides very solid solutions also in this area (JSONB and HSTORE).

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">