Tuesday, August 6, 2013

How to build Hive from source

Step 1: Download the Hadoop and Hive sources from their Git repositories.
             git clone hadoop.git hadoop-trunk
             git clone hive.git hive-trunk

Step 2:  Build Hive against a specific version of Hadoop; I want to build for hadoop-2.0.5-alpha
              cd hadoop-trunk
              git checkout release-2.0.5-alpha  (check out the release tag)
              cd  ..
 Step 3: Now we can build Hive for hadoop-2.0.5-alpha
            cd hive-trunk
            ant clean
            ant -Dhadoop.version=2.0.5-alpha -Dhadoop.root=<HADOOP_SOURCE_TREE_PATH>  \
                  -Dtarget.dir=./build-target package
         HADOOP_SOURCE_TREE_PATH = ../hadoop-trunk (the Hadoop source tree checked out above)
         target.dir: where the package will be built
               

Tuesday, July 30, 2013

Building Hadoop from source

Hadoop building process from Source Tree

STEP 1:  Install all the required packages for your OS; I am using Ubuntu 13.04:
                 * Unix System
                 * JDK 1.6
                 * Maven 3.0
                 * Findbugs 1.3.9 (if running findbugs)
                 * ProtocolBuffer 2.4.1+ (for MapReduce and HDFS)
                 * CMake 2.6 or newer (if compiling native code) 
                 * Internet connection for first build (to fetch all Maven and Hadoop dependencies)

STEP 2: Download the source from the Git repo, as it is the most suitable for nightly builds.
              Git Repo: git://git.apache.org/hadoop-common.git

STEP 3: Build hadoop-maven-plugin
               cd hadoop-maven-plugin
               mvn install
               cd ..
STEP 4:  (Optional) Build the native libs:
               mvn compile -Pnative
STEP 5:  (Optional) Run FindBugs (needed if you want to create the site):
                mvn findbugs:findbugs
STEP 6:  Building distributions:

               Create binary distribution without native code and without documentation:
               $ mvn package -Pdist -DskipTests -Dtar

               Create binary distribution with native code and with documentation:
               $ mvn package -Pdist,native,docs -DskipTests -Dtar

               Create source distribution:
               $ mvn package -Psrc -DskipTests

               Create source and binary distributions with native code and documentation:
               $ mvn package -Pdist,native,docs,src -DskipTests -Dtar

               Create a local staging version of the website (in /tmp/hadoop-site)
               $ mvn clean site; mvn site:stage -DstagingDirectory=/tmp/hadoop-site

    

Reference :  GitHub

Monday, July 29, 2013

Hadoop Compression

      Hadoop is a distributed framework for processing large amounts of data, possibly terabytes or petabytes. To store this data Hadoop uses HDFS, the equivalent of Google's GFS, and allows it to be processed through various means such as MapReduce, HBase, Hive and Pig.
     But this processing incurs a lot of network I/O, so developers need to minimise data transfer. Hadoop provides several compression codecs for this purpose, namely Gzip, BZip2, LZO, LZ4, Snappy and DEFLATE. Except for BZip2, Hadoop has native implementations of these, which account for a performance gain. So if you are running a 64-bit machine it is recommended to build the native libraries from source for your platform, since as of now Hadoop does not ship with 64-bit native libraries. If you do not have the native libraries, Hadoop still provides Java implementations of BZip2, DEFLATE and Gzip.
    But compression has a trade-off: it takes CPU time to compress and decompress at the source and the destination. So we have to choose an algorithm by weighing disk space against CPU time, and we also need to consider how compression affects processing, i.e. whether the codec supports splitting.

How do we choose Compression?

  1. We can use high-performance compression like Snappy, LZ4 or LZO with a container format such as SequenceFile, RCFile or Avro data files, which provide a logical separation of the stored data and hence support both compression and splitting. HBase recommends Snappy since it is the fastest of these and HBase targets near-real-time access patterns.
  2. Use a codec that supports splitting (BZip2), or one that can be indexed, like LZO, so that each split can be processed separately (a driver sketch follows below).
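As a rough illustration of the second option, here is a minimal driver sketch (the class name and paths are made up for this example) that compresses plain-text job output with the splittable BZip2 codec:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "bzip2-output");
        job.setJarByClass(CompressedOutputJob.class);
        // Mapper/Reducer classes omitted; the identity defaults pass records through unchanged.

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Compress the text output with BZip2 so downstream jobs can still split it.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Swapping BZip2Codec for SnappyCodec and using a SequenceFile output format gives the faster, container-based setup from the first option.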


 
Note:
  •  Fast storage like SSDs is more expensive than traditional HDDs, and HDDs become more cost effective as capacity grows. So it is usually better to choose an algorithm with fast compression and decompression than one that merely saves the most space.
  •   SequenceFileFormat: we can use the "mapreduce.output.fileoutputformat.compress.type" property to control compression; by default it is per record. It also supports block compression, which is preferred over record compression. The block size can be defined in bytes via "io.seqfile.compress.blocksize" (see the sketch after this list).
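A minimal sketch of that SequenceFile configuration, assuming a MapReduce job object is already set up (the class name here is hypothetical) and that the native Snappy library is available:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SeqFileCompressionConfig {
    // Configure block-compressed SequenceFile output with Snappy on the given job.
    public static void configure(Job job) {
        Configuration conf = job.getConfiguration();
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setCompressOutput(job, true);
        SequenceFileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        // Same effect as setting mapreduce.output.fileoutputformat.compress.type=BLOCK
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
        // Number of uncompressed bytes gathered into one compressed block (default 1000000).
        conf.setInt("io.seqfile.compress.blocksize", 1000000);
    }
}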





Monday, July 15, 2013

Hadoop Testing Strategies

    The importance of an effective testing strategy can be seen from the fact that developers normally spend more time testing and debugging than actually writing code. The same is true for Hadoop, but Hadoop, or any distributed system such as R or MPI, or even a multi-threaded application, comes with its own problems, because we cannot capture the state of a distributed environment at run time. This forces us to develop even more powerful testing. Here we will talk only about Hadoop.

    ClusterMapReduceTestCase:  The Hadoop community has put a lot of effort into this very problem by developing MiniMRCluster and MiniDFSCluster for testing purposes. They allow developers to test all aspects of Hadoop applications (MapReduce, YARN applications, HBase, etc.) just like in a real environment. The one big disadvantage is that it takes about ten seconds to set up the test cluster, which may discourage many developers from testing all functionality.
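For example, a minimal JUnit test against an in-process HDFS (it needs the hadoop-hdfs test artifact or hadoop-minicluster on the classpath; the path used here is arbitrary):

import static org.junit.Assert.assertTrue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.junit.Test;

public class MiniDfsClusterTest {
    @Test
    public void canWriteToInProcessHdfs() throws Exception {
        Configuration conf = new Configuration();
        // Spin up a one-datanode HDFS inside the test JVM (this is the slow part).
        MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
        try {
            FileSystem fs = cluster.getFileSystem();
            Path file = new Path("/test/hello.txt");
            fs.create(file).close();
            assertTrue(fs.exists(file));
        } finally {
            cluster.shutdown();
        }
    }
}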

    MRUnit:  To address this, Cloudera developed the MRUnit framework, which is now an Apache TLP (Top Level Project) at version 1.0.0. MRUnit bridges the gap between MapReduce and JUnit by providing interfaces to test MapReduce jobs. We can use the functionality provided by MRUnit in JUnit tests to exercise the Mapper, Reducer, Driver and Combiner. With PipelineMapReduceDriver we can test a workflow made up of a series of MapReduce jobs. Currently it does not allow testing the partitioner. It is also not enough to test the whole infrastructure of a MapReduce job, since a job may contain custom Input/Output formats, custom record readers/writers, data serialization, etc.
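For example, a minimal MRUnit test of a hypothetical WordCountMapper, assuming the MRUnit 1.0.0 mapreduce API:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // WordCountMapper is the (hypothetical) mapper under test.
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void emitsOneCountPerWord() throws Exception {
        mapDriver.withInput(new LongWritable(0), new Text("hadoop hive"))
                 .withOutput(new Text("hadoop"), new IntWritable(1))
                 .withOutput(new Text("hive"), new IntWritable(1))
                 .runTest();
    }
}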

    LocalJobRunner:   Hadoop comes with LocalJobRunner, which runs in a single JVM and allows the complete Hadoop stack to be tested.
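A minimal sketch of forcing a job onto the LocalJobRunner (the input/output paths are placeholders; the identity map/reduce defaults are enough to exercise the stack):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalRunnerExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local"); // use LocalJobRunner, single JVM
        conf.set("fs.defaultFS", "file:///");          // local filesystem instead of HDFS

        Job job = Job.getInstance(conf, "local-runner-test");
        FileInputFormat.addInputPath(job, new Path("target/test-input"));
        FileOutputFormat.setOutputPath(job, new Path("target/test-output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}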

    Mockito:   All the frameworks above test the different components by creating a test environment, which makes testing very slow. Most such tests are functional or end-to-end tests, so for efficiency we can use a mocking framework like Mockito to mock out the functional components of MapReduce jobs.
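A rough sketch of this idea, again with a hypothetical WordCountMapper (assumed to override map() with public visibility); mocking the non-static Mapper.Context class works with Mockito but needs an unchecked-cast suppression:

import static org.mockito.Mockito.*;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.junit.Test;

public class WordCountMapperMockTest {
    @Test
    @SuppressWarnings("unchecked")
    public void mapperWritesOneCountPerWord() throws Exception {
        WordCountMapper mapper = new WordCountMapper(); // hypothetical mapper under test
        Mapper<LongWritable, Text, Text, IntWritable>.Context context =
                mock(Mapper.Context.class);

        mapper.map(new LongWritable(0), new Text("hadoop hive"), context);

        // No cluster, no test environment: we only verify the interaction with the context.
        verify(context).write(new Text("hadoop"), new IntWritable(1));
        verify(context).write(new Text("hive"), new IntWritable(1));
    }
}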

    Note: Hadoop itself now also uses the Mockito framework for testing: Hadoop JIRA.

    JUnit tests: Other non-MapReduce functions, such as input parsing, can be tested separately with plain JUnit tests.






   

Wednesday, July 10, 2013

Maven: switching between multiple environments

A practical use case: if we are working with multiple software vendors, we might need to
switch our Maven settings to a customer's configuration and back again.
This can be done by:
1. The Maven CLI with -s, --settings <arg>
    mvn -s customerXXXMvnSetting.xml
2. With git:
    Initialize a git repo in the Maven settings directory and create a branch for each customer, for example
    git checkout -b customerXXXMvnSetting

Link    

Saturday, June 22, 2013

Loading third-party libraries in Hadoop

As we have seen, programmers often build utility libraries to be reused across all relevant projects, to reduce
development time. But in the Hadoop ecosystem, where computation is carried out on a cluster of N hosts, we have the following options for making those libraries available to our application:
  1.  We can copy the utility libraries to a specific location on every host, but copying them to all hosts that run MapReduce tasks makes maintenance too cumbersome. Or
  2. We can embed the utility libraries in the application itself, which increases the size of the application and also makes library versioning harder to maintain. Or
  3. We can copy the utility libraries to the DistributedCache and use them from there. This approach overcomes the issues of the two methods above.
Here is how we can accomplish this with the third method:
STEP  1:
hdfs dfs -copyFromLocal YourLib.jar /lib/YourLib.jar
STEP 2: Now add the cached library to the application's classpath:
                  DistributedCache.addFileToClassPath(new Path("/lib/YourLib.jar"), job.getConfiguration());
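Putting both steps together, a minimal driver sketch (the job and class names are placeholders; on newer Hadoop releases job.addFileToClassPath(Path) can be used instead of the deprecated DistributedCache helper):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class ThirdPartyLibDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "uses-third-party-lib");
        job.setJarByClass(ThirdPartyLibDriver.class);

        // The jar was copied to HDFS in step 1; this adds it to every task's classpath.
        DistributedCache.addFileToClassPath(new Path("/lib/YourLib.jar"), job.getConfiguration());

        // ... set mapper, reducer and input/output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}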


That's it.

Sample application to demonstrate this: MatrixTranspose


   

Thursday, June 20, 2013

FindBugs

We often see a complete team of developers working on a single project, where it is difficult to maintain code quality or to always follow best practices. FindBugs is the right tool for this goal.

FindBugs is useful for detecting bad practices like:

  • Method might ignore exception.
  • Method might drop exception.
  • Comparison of String parameter using == or !=.
  • Comparison of String objects using == or !=.
  • Class defines compareTo(...) and uses Object.equals().
  • Finalizer does not call superclass finalizer.
  • Format string should use %n rather than \n.
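
A small made-up class showing a few of the patterns above that FindBugs would flag:

public class FindBugsExamples {

    // Flagged: comparison of a String parameter using == instead of equals()
    public boolean isAdmin(String role) {
        return role == "admin"; // should be "admin".equals(role)
    }

    // Flagged: method might ignore exception
    public void closeQuietly(java.io.Closeable c) {
        try {
            c.close();
        } catch (java.io.IOException e) {
            // empty catch block: the exception is silently ignored
        }
    }

    // Flagged: format string should use %n rather than \n
    public void printReport(int count) {
        System.out.printf("processed %d records\n", count); // prefer %n
    }
}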