Some Thoughts on Aggregating Moderate Amount of Data

As part of my role, I need to perform aggregates (group by operation) on some medium size raw text file data, in the region of 100s to 200s of Gb, as a one-off exercise.

I have tried using R is performing such aggregates, however, it is slow for a relative small size file. Also, there is memory concern as the biggest data file will not fit in memory.
Therefore I hesitate to carry on with this approach.

I have also tried Mysql. Mysql will perform a sort when doing a group by (or use suitable
index if available), which is a cause of slowness with our data size. Table partitioning helps, but there is limit on how many partitions are allowed on a mysql table.

I was interested in Apache Hadoop Hive initially, as it is SQL-like, can handle big data, and have partition table as a feature. However, by trial and error, I have found that without a partition table, it is already very fast using group by directly. It is perhaps to do with the MapReduce paradigm, and its implicit hashing, which is akin to partitioning. (It will be good if there are some feedback on the above ‘analysis’. )

At the end, I moved on from Hive to Twitter Scalding. If the data is homogeneous and uniform,
Hive will be a good tool to complete the task, ie, put all files in an external tables, use a suitable plugin called Serde, and just run a group by SQL.

However, the data files I am working on are not exactly of the same format. To perform it in Hive will require customization with extensions written in Java. On the other hand, Scalding is a library for Scala by Twitter, a general programming language, and I feel is more suited for this task.

Caveat of Partitioning with Timestamp on Mysql

On a project I am working on, I have a fair amount of data to be processed. I wanted to load the data into a huge table so I can analyze the data quality better. While the data set is not Petabytes in size, it is not small either. And as I needed some index on the table, and to alleviate the growth of average insertion speed which is of $$O(n log(n))$$, partitioning comes in mind.

I think the most road traveled is to use TO_DAYS(date) on a Datetime field. However, if for whatever reason, you wanted to use timestamp instead of datetime, one need to be careful how the timestamp field is declared, otherwise it will be defaulted to use CURRENT_TIMESTAMP on insertion and update. I had to default it to NULL to override the default timestamp behaviour(and hence have a unfortunately side effect of allowing nulls).

Below is an example.