These days there are many “buzzwords” around and many different ideas about what the best tool for “my big data” is.
We can hear and read a lot about how new, old, or sexy various SQL or NoSQL tools are. Names like “Hadoop”, “Hive”, “Cassandra”, “MongoDB” etc. are flying around.
But based on my experience, this is not the point at all. The only things that should interest us are the use cases:
- how big our data really is – several millions of records per day is almost nothing. I worked with a PostgreSQL database that collected billions of records per day from a telecommunication network – all still on one machine. And even that is not exactly “really big data”,
- how well the data can be compressed – if you have really big data with many repeated values, then a proper tool or a file system with compression can save you a lot of disk space and therefore also I/O operations,
- how quick responses we need + which HW we can afford – modern NoSQL databases like Redis or MongoDB really need a huge amount of memory. They can give you “real time” responses, but everything has its price. If waiting minutes for a report is acceptable and you want to use one “usual” machine, then MySQL or PostgreSQL is the best choice, because on “usual” HW they will typically outperform NoSQL systems,
- how quick responses we need + which HW we can afford (2) – customers often say: “Google can give us answers in seconds.” Our answer to that statement is: Google has big datacenters for it. Do you want to build such a datacenter to get responses as quick as Google’s?
- how quick responses we need + which HW we can afford (3) – if we really need real-time responses to a high volume of queries, then (and only then) do we really need a modern in-memory NoSQL database with reliable horizontal sharding. Which one is best for us depends on other factors, which I will discuss later in another text.
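The compression point above is easy to demonstrate. The following is a minimal sketch, not tied to any particular database: it compresses a made-up, highly repetitive log-like column (the record format is invented for illustration) and, for contrast, the same amount of random bytes, using Python's standard zlib module. Repeated values shrink dramatically; random data does not.

```python
import os
import zlib

# Hypothetical telecom-style records: the same few values repeated many times.
repetitive = ("2024-01-01;CALL;OK\n" * 50_000).encode()

# Random bytes of the same size, as a worst case for compression.
random_data = os.urandom(len(repetitive))

print("original size:", len(repetitive))
print("repetitive compressed:", len(zlib.compress(repetitive)))
print("random compressed:", len(zlib.compress(random_data)))
```

On repetitive data the compressed output is a tiny fraction of the original, which is exactly why compression saves both disk space and I/O for big datasets with many repeated values.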