As part of my role, I need to perform aggregates (group by operation) on some medium size raw text file data, in the region of 100s to 200s of Gb, as a one-off exercise.
I have tried using R is performing such aggregates, however, it is slow for a relative small size file. Also, there is memory concern as the biggest data file will not fit in memory.
Therefore I hesitate to carry on with this approach.
I have also tried Mysql. Mysql will perform a sort when doing a group by (or use suitable
index if available), which is a cause of slowness with our data size. Table partitioning helps, but there is limit on how many partitions are allowed on a mysql table.
I was interested in Apache Hadoop Hive initially, as it is SQL-like, can handle big data, and have partition table as a feature. However, by trial and error, I have found that without a partition table, it is already very fast using group by directly. It is perhaps to do with the MapReduce paradigm, and its implicit hashing, which is akin to partitioning. (It will be good if there are some feedback on the above ‘analysis’. )
At the end, I moved on from Hive to Twitter Scalding. If the data is homogeneous and uniform,
Hive will be a good tool to complete the task, ie, put all files in an external tables, use a suitable plugin called Serde, and just run a group by SQL.
However, the data files I am working on are not exactly of the same format. To perform it in Hive will require customization with extensions written in Java. On the other hand, Scalding is a library for Scala by Twitter, a general programming language, and I feel is more suited for this task.