Hadoop with pydoop on CentOS 6.4

Using pydoop instead of the Hadoop streaming interface or MRJob gave me a substantial performance boost.
I ran a simple wordcount job with a bit of regex matching on ~2GB of text data. Using the streaming interface, the job took about 11 minutes to finish on our Hadoop cluster. The same job took about 5 minutes using pydoop, so it ran roughly twice as fast.
To install pydoop on CentOS 6.4 for CDH3:

export CLASSPATH=/usr/lib/hadoop-0.20/hadoop-core.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar
export JAVA_HOME=/usr/java/latest
export HADOOP_CONF_DIR=/etc/HADOOPDIR/CONFDIR
export HADOOP_HOME=/usr/lib/hadoop-0.20
yum -y install gcc
yum -y install gcc-c++
yum -y install boost-devel
yum -y install python-devel
yum -y install python-pip
yum -y install openssl-devel
/usr/bin/pip-python install argparse
/usr/bin/pip-python install importlib
/usr/bin/pip-python install jlib
/usr/bin/pip-python install pydoop

To install pydoop on CentOS 6.4 for CDH4:

export CLASSPATH=/usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar:/usr/lib/hadoop/hadoop-common.jar:/usr/lib/hadoop/lib/commons-*.jar
export JAVA_HOME=/usr/java/latest
export HADOOP_CONF_DIR=/etc/HADOOPDIR/CONFDIR
export HADOOP_HOME=/usr/lib/hadoop/
yum -y install gcc
yum -y install gcc-c++
yum -y install boost-devel
yum -y install python-devel
yum -y install python-pip
yum -y install openssl-devel
/usr/bin/pip-python install argparse
/usr/bin/pip-python install importlib
/usr/bin/pip-python install jlib
/usr/bin/pip-python install pydoop
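
Both install recipes depend on CLASSPATH pointing at jars that actually exist on the machine, and the exact paths can differ between CDH releases. As a minimal sketch (the helper name check_classpath is hypothetical and not part of pydoop or CDH), a check like the following could be run after the export lines and before the pip install to list entries that are not present on disk:

```shell
# Hypothetical helper: print every CLASSPATH entry that does not exist.
# The body runs in a subshell, so the IFS and "set -f" changes do not
# leak into the calling shell.
check_classpath() (
    set -f          # do not expand wildcards; test each entry literally
    IFS=:           # split the argument on colons only
    missing=0
    for entry in $1; do
        [ -n "$entry" ] || continue     # skip empty entries such as "a::b"
        if [ ! -e "$entry" ]; then
            echo "missing: $entry"
            missing=1
        fi
    done
    exit "$missing" # status 0 if every entry exists, 1 otherwise
)

# Usage (assumes CLASSPATH was exported as shown above):
#   check_classpath "$CLASSPATH" || echo "fix CLASSPATH before installing pydoop"
```

Note that this sketch tests entries literally, so an entry containing a wildcard (such as commons-*.jar) is reported as missing even when matching jars exist. Whether the pydoop build accepts such an entry is a separate question that the check does not answer.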

If you want to use Snappy as the compression codec, you may also be missing a handy command-line tool for it. For your convenience, here are snzip RPMs (built from these sources):
snzip-0.9.0-0.el6.x86_64.rpm
snzip-0.9.0-0.el6.src.rpm
