2016-01-04

hadoop-lzo: letting MapReduce read LZO-compressed files

(1) Install LZO

yum -y install lzo-devel

wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
tar -zxvf lzo-2.06.tar.gz
cd lzo-2.06
./configure --enable-shared

make && make install

(2) Install LZOP

wget http://www.lzop.org/download/lzop-1.03.tar.gz
tar -zxvf lzop-1.03.tar.gz
cd lzop-1.03
./configure --enable-shared
make && make install

Test lzop:

lzop 192.168.7.241.2015-11-27-H15.log 
 
# ls 192.168.7.241.2015-11-27-H15.log* -lh
-rw------- 1 root root 265M Nov 27 15:59 192.168.7.241.2015-11-27-H15.log
-rw------- 1 root root  21M Nov 27 15:59 192.168.7.241.2015-11-27-H15.log.lzo

Since these are all log files, the compression ratio is quite high.
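Using the sizes from the listing above (265 MB before, 21 MB after), the ratio works out as follows; a quick sketch with awk:

```shell
# Sizes taken from the ls listing above: 265 MB original, 21 MB after lzop.
orig_mb=265
comp_mb=21
ratio=$(awk -v o="$orig_mb" -v c="$comp_mb" 'BEGIN { printf "%.1f", o / c }')
saved=$(awk -v o="$orig_mb" -v c="$comp_mb" 'BEGIN { printf "%.0f", (1 - c / o) * 100 }')
echo "compression ratio: ${ratio}x, space saved: ${saved}%"
```

That is roughly a 12.6x ratio, about 92% of the space saved.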

Install Hadoop-LZO

Of course there is one prerequisite: Maven must be set up, along with SVN or Git.

wget http://mirror.symnds.com/software/Apache/maven/binaries/apache-maven-3.2.2-bin.tar.gz
tar xvf apache-maven-3.2.2-bin.tar.gz
mv apache-maven-3.2.2 /usr/local/maven
 
vim /etc/profile
MAVEN_HOME=/usr/local/maven
PATH=$PATH:$MAVEN_HOME/bin
 
export MAVEN_HOME
export PATH
 
source /etc/profile
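A quick sanity check that the profile fragment above does what we expect; this sketch just replays the two lines and inspects $PATH:

```shell
# Replay the /etc/profile fragment and confirm maven's bin dir ends up on PATH.
MAVEN_HOME=/usr/local/maven
PATH=$PATH:$MAVEN_HOME/bin
export MAVEN_HOME PATH

case ":$PATH:" in
  *":/usr/local/maven/bin:"*) on_path=yes ;;
  *) on_path=no ;;
esac
echo "maven on PATH: $on_path"
```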

Verify that the installation succeeded:

mvn -v
 
git clone https://github.com/twitter/hadoop-lzo/

Modify one section of the pom.xml file.

From:

<properties>  
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>  
  <hadoop.current.version>2.1.0-beta</hadoop.current.version>  
  <hadoop.old.version>1.0.4</hadoop.old.version>  
</properties>

To:

<properties>  
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>  
  <hadoop.current.version>2.7.0</hadoop.current.version>  
  <hadoop.old.version>1.0.4</hadoop.old.version>  
</properties>

My Hadoop version is 2.7.
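If you prefer not to edit the file by hand, the version bump can be scripted with sed. Here is a sketch run against a minimal stand-in for the real pom.xml (the real file lives in the hadoop-lzo clone):

```shell
# Demonstrate the edit on a minimal pom fragment (stand-in for the real pom.xml).
cat > /tmp/pom-demo.xml <<'EOF'
<properties>
  <hadoop.current.version>2.1.0-beta</hadoop.current.version>
</properties>
EOF

# Greedy .* keeps the closing tag intact; swap in the version you actually run.
sed -i 's|<hadoop.current.version>.*<|<hadoop.current.version>2.7.0<|' /tmp/pom-demo.xml
grep hadoop.current.version /tmp/pom-demo.xml
```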

Then run, in order:

mvn clean package -Dmaven.test.skip=true  
# tar -cBf - -C target/native/Linux-amd64-64/lib . | tar -xBvf - -C /usr/local/hadoop/lib/native/
# cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar /usr/local/hadoop/share/hadoop/common/

Next, sync /usr/local/hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar and the /usr/local/hadoop/lib/native/ directory to every other Hadoop node. Note: make sure the native libraries under /usr/local/hadoop/lib/native/ are readable and executable by the user that runs Hadoop.
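The fan-out to the other nodes can be scripted. The sketch below only prints the copy commands as a dry run; the node names are placeholders, not from the original setup. Drop the echo to actually run them:

```shell
# Placeholder hostnames -- replace with your actual datanodes.
NODES="datanode1 datanode2 datanode3"
JAR=/usr/local/hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar
NATIVE=/usr/local/hadoop/lib/native/

for node in $NODES; do
  # echo makes this a dry run; remove it to perform the copies.
  echo scp "$JAR" "$node:$JAR"
  echo rsync -a "$NATIVE" "$node:$NATIVE"
done
```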

Configure Hadoop
Append the following to $HADOOP_HOME/etc/hadoop/hadoop-env.sh:

export LD_LIBRARY_PATH=/usr/local/lib

(My LZO was installed into the default prefix, so its libraries live under /usr/local/lib.)
Append the following to $HADOOP_HOME/etc/hadoop/core-site.xml:

<property>  
        <name>io.compression.codecs</name>  
        <value>org.apache.hadoop.io.compress.GzipCodec,  
                   org.apache.hadoop.io.compress.DefaultCodec,  
                   com.hadoop.compression.lzo.LzoCodec,  
                   com.hadoop.compression.lzo.LzopCodec,  
                   org.apache.hadoop.io.compress.BZip2Codec  
        </value>  
</property>  
<property>  
         <name>io.compression.codec.lzo.class</name>  
         <value>com.hadoop.compression.lzo.LzoCodec</value>  
</property>

Append the following to $HADOOP_HOME/etc/hadoop/mapred-site.xml:

<property>    
    <name>mapred.compress.map.output</name>    
    <value>true</value>    
</property>    
<property>    
    <name>mapred.map.output.compression.codec</name>    
    <value>com.hadoop.compression.lzo.LzoCodec</value>    
</property>    
<property>    
    <name>mapred.child.env</name>  
    <value>LD_LIBRARY_PATH=/usr/local/lib</value>    
</property>
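The compression names above are the Hadoop 1.x (mapred-era) properties; Hadoop 2.x still honors them through its deprecation layer, but if you want to use the current names instead, the equivalents are:

```xml
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

Similarly, mapred.child.env maps to mapreduce.map.env and mapreduce.reduce.env on Hadoop 2.x.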

That completes the configuration. Now let's test MapReduce reading an LZO file.

Create a table lzo_test in Hive:

CREATE TABLE lzo_test (col String)
STORED AS
  INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
  OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";

Then load the LZO file into the table:

LOAD DATA LOCAL INPATH '/home/hadoop/192.168.7.241.2015-11-27-H15.log.lzo' INTO TABLE lzo_test;

Index the LZO file

hadoop jar /usr/local/hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/lzo_test/

This generates 192.168.7.241.2015-11-27-H15.log.lzo.index under /user/hive/warehouse/lzo_test on HDFS.

Both "select * from lzo_test" and "select count(1) from lzo_test" now return correct results.

Author: saunix
Linux operations engineer at a large Internet company, serving as the designated firefighter.
