As I mentioned before, I need to run aggregations over some medium-sized raw text files. To speed up the process, I experimented with Twitter's Scalding, a big data tool from the Hadoop family that targets the Scala programming language.
Apart from the setup, the actual code I need to write is not rocket science (which is precisely why I chose it). However, for troubleshooting, I do want to keep the original file name as one of the data points.
One way to do this is to open each file with TextLine, creating one Pipe per file, and merge those pipes at the end. However, opening many files with TextLine and merging them afterwards turns out to be very slow compared to just using MultipleTextLineFiles, a class specialised in opening multiple files.
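For illustration, the one-pipe-per-file approach could look roughly like the sketch below. This is not the code from my job, only a minimal outline assuming Scalding's fields-based API and Scala 2 symbol literals. The job name `TagLinesWithFileName`, the `--input` and `--output` arguments, and the `'file` field name are all made up for the example.

```scala
import com.twitter.scalding._

// Hypothetical job: reads every path given via --input and tags each
// line with the path it came from (the slow, one-pipe-per-file way).
class TagLinesWithFileName(args: Args) extends Job(args) {

  // e.g. --input data/a.txt data/b.txt (assumed argument name)
  val paths: List[String] = args.list("input")

  val tagged = paths
    .map { path =>
      // TextLine yields the fields 'offset and 'line;
      // insert adds a constant 'file field holding the path.
      TextLine(path).read.insert('file, path)
    }
    // ++ merges two pipes that carry the same fields.
    // reduce assumes at least one input path was given.
    .reduce(_ ++ _)

  tagged
    .project('file, 'line)
    .write(Tsv(args("output")))
}
```

Each input file becomes its own source and its own pipe here, so the flow grows with the number of files, which is consistent with the slowdown I saw once there were many of them.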
I ended up extending MultipleTextLineFiles from Scalding, altering it so that it emits file names as well. If this is of use to anyone, my implementation can be found in a Gist. (I tried to keep everything in Scala, but I ran into difficulties extending one of the classes from Cascading, so unfortunately I had to fall back to Java for that one class.)