Hadoop


Introduction

Components

  • hdfs: distributed file storage system, consisting of two daemons, the namenode and the datanode (a quick check that all of the daemons below are running is sketched after this list).

    Configuration file: $HADOOP_HOME/etc/hadoop/hdfs-site.xml

    Start/stop the daemons: $HADOOP_HOME/sbin/hadoop-daemon.sh start/stop namenode | datanode

  • yarn: resource management and scheduling.

    Configuration file: $HADOOP_HOME/etc/hadoop/yarn-site.xml

    Start/stop the daemons: $HADOOP_HOME/sbin/yarn-daemon.sh start/stop resourcemanager | nodemanager

  • MapReduce job history server: keeps records of completed MapReduce jobs, such as how many map tasks and reduce tasks a job used, and its submission, start, and completion times.

    Configuration file: $HADOOP_HOME/etc/hadoop/mapred-site.xml

    Start the daemon: $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver

    mapred-site.xml:

    <configuration>
      <property>
        <name>yarn.app.mapreduce.am.staging-dir</name>
        <value>/tmp/hadoop-yarn/staging</value>
      </property>
      <property>
        <name>mapreduce.jobhistory.done-dir</name>
        <value>${yarn.app.mapreduce.am.staging-dir}/history/done</value>
      </property>
      <property>
        <name>mapreduce.jobhistory.intermediate-done-dir</name>
        <value>${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate</value>
      </property>
    </configuration>
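After starting these daemons, a quick sanity check is to list the running JVM processes and probe each daemon's web UI. A minimal sketch, assuming the Hadoop 2.x default web ports (50070 for the namenode, 8088 for the resourcemanager, 19888 for the job history server) and that it runs on the host where the daemons were started:

# List Hadoop JVM processes; NameNode, DataNode, ResourceManager,
# NodeManager and JobHistoryServer should all appear.
jps

# Probe the web UIs (Hadoop 2.x default ports; adjust if overridden).
curl -s -o /dev/null -w "namenode ui: %{http_code}\n" http://localhost:50070/
curl -s -o /dev/null -w "resourcemanager ui: %{http_code}\n" http://localhost:8088/
curl -s -o /dev/null -w "jobhistory ui: %{http_code}\n" http://localhost:19888/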
    

Installation

$HADOOP_HOME/etc/hadoop/core-site.xml
$HADOOP_HOME/etc/hadoop/hdfs-site.xml
$HADOOP_HOME/etc/hadoop/mapred-site.xml
$HADOOP_HOME/etc/hadoop/yarn-site.xml

# Format the namenode
$HADOOP_HOME/bin/hadoop namenode -format
# Start the namenode
$HADOOP_HOME/sbin/hadoop-daemon.sh start namenode
# Start the datanode
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
# Start the resourcemanager
$HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager
# Start the nodemanager
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager
# Start the job history server
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
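Before formatting and starting the daemons above, the four configuration files need at least a handful of properties. A minimal single-node (pseudo-distributed) sketch; the localhost:9000 filesystem URI and replication factor of 1 are assumptions for a one-machine setup, so adjust hostnames and values for a real cluster:

# Minimal pseudo-distributed configuration (single node, assumed values).
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

cat > $HADOOP_HOME/etc/hadoop/mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

cat > $HADOOP_HOME/etc/hadoop/yarn-site.xml <<'EOF'
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF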
  • wordcount

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount /input /output
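For the job to run, /input has to exist in HDFS and /output must not exist yet. A minimal sketch for preparing input and inspecting the result; words.txt is just a hypothetical local file:

# Create the input directory in HDFS and upload a sample file.
$HADOOP_HOME/bin/hdfs dfs -mkdir -p /input
$HADOOP_HOME/bin/hdfs dfs -put ./words.txt /input

# After the wordcount job finishes, list and read the output files.
$HADOOP_HOME/bin/hdfs dfs -ls /output
$HADOOP_HOME/bin/hdfs dfs -cat /output/part-r-00000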
    

Hadoop HA Configuration

The Hadoop namenode is a single point of failure. To address this, configure an HA setup for it; the relevant hdfs-site.xml and core-site.xml settings follow, with a quick configuration check sketched after them.

  • hdfs-site.xml

    <!-- Logical name of the HA nameservice -->
    <property>
      <name>dfs.nameservices</name>
      <value>ha-cluster</value>
    </property>
    <!-- Namenodes that make up the HA pair -->
    <property>
      <name>dfs.ha.namenodes.ha-cluster</name>
      <value>nn1,nn2</value>
    </property>
    <!-- Namenode RPC addresses -->
    <property>
      <name>dfs.namenode.rpc-address.ha-cluster.nn1</name>
      <value>machine1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.ha-cluster.nn2</name>
      <value>machine2.example.com:8020</value>
    </property>
    <!-- Namenode HTTP addresses -->
    <property>
      <name>dfs.namenode.http-address.ha-cluster.nn1</name>
      <value>machine1.example.com:50070</value>
    </property>
    <property>
      <name>dfs.namenode.http-address.ha-cluster.nn2</name>
      <value>machine2.example.com:50070</value>
    </property>
    <!-- URI of the JournalNode group the namenodes write edits to and read edits from -->
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://jnode1.example.com:8485;jnode2.example.com:8485;jnode3.example.com:8485/ha-cluster</value>
    </property>
    <!-- Proxy provider HDFS clients use to find the active namenode; the suffix must match dfs.nameservices -->
    <property>
      <name>dfs.client.failover.proxy.provider.ha-cluster</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    <!-- Fencing method used during failover, and the SSH key it logs in with -->
    <property>
      <name>dfs.ha.fencing.methods</name>
      <value>sshfence</value>
    </property>
    <property>
      <name>dfs.ha.fencing.ssh.private-key-files</name>
      <value>/home/exampleuser/.ssh/id_rsa</value>
    </property>
    
  • core-site.xml

<!-- Default filesystem now points at the logical nameservice rather than a single namenode -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://ha-cluster</value>
</property>
<!-- ZooKeeper quorum used by the ZKFC for automatic failover -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
<!-- Local directory where each JournalNode stores edits (commonly placed in hdfs-site.xml) -->
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/path/to/journal/node/local/data</value>
</property>
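Automatic failover additionally requires dfs.ha.automatic-failover.enabled to be set to true in hdfs-site.xml; without it, the ZooKeeper quorum above and the zkfc formatting step below are not used. Once the files are in place, hdfs getconf is a quick way to confirm the values Hadoop actually resolves, as sketched here:

# Print the values Hadoop resolves from core-site.xml and hdfs-site.xml;
# each should match the HA settings above.
$HADOOP_HOME/bin/hdfs getconf -confKey fs.defaultFS
$HADOOP_HOME/bin/hdfs getconf -confKey dfs.nameservices
$HADOOP_HOME/bin/hdfs getconf -confKey dfs.ha.namenodes.ha-cluster
$HADOOP_HOME/bin/hdfs getconf -confKey dfs.namenode.shared.edits.dir
$HADOOP_HOME/bin/hdfs getconf -confKey dfs.ha.automatic-failover.enabled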

Startup Commands

$ sbin/hadoop-daemon.sh start journalnode   # run on each journalnode host to start the journalnode

$ bin/hdfs namenode -format            # on a fresh (unformatted) HDFS cluster, run on one namenode host to format it
$ bin/hdfs namenode -bootstrapStandby  # run on the other namenode host to copy the formatted namenode's metadata over

$ bin/hdfs namenode -initializeSharedEdits   # converts an existing non-HA cluster to HA (seeds the journalnodes with the existing edits)

$ bin/hdfs zkfc -formatZK     # initialize the HA state in ZooKeeper

$ sbin/start-all.sh  # start all services
$ sbin/stop-all.sh   # stop all services

$ bin/hdfs haadmin -getServiceState nn1    # show a namenode's HA state (active/standby)
$ bin/hdfs haadmin -transitionToActive --forceactive --forcemanual nn1  # force a manual transition to active

$ bin/hdfs haadmin -failover --forcefence serviceId serviceId2   # fail over from serviceId to serviceId2, fencing the old active
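After the cluster is up, a quick health check is to confirm that exactly one namenode is active and the other standby, and that HDFS has left safe mode. A minimal sketch, assuming the nn1/nn2 ids from the hdfs-site.xml above:

# One namenode should report "active", the other "standby".
$HADOOP_HOME/bin/hdfs haadmin -getServiceState nn1
$HADOOP_HOME/bin/hdfs haadmin -getServiceState nn2

# Safe mode should be OFF once enough datanodes have reported in.
$HADOOP_HOME/bin/hdfs dfsadmin -safemode get

# Capacity summary and the list of live datanodes.
$HADOOP_HOME/bin/hdfs dfsadmin -report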

Hadoop Prometheus exporter

Add the following to $HADOOP_HOME/etc/hadoop/hadoop-env.sh:

# Ports the JMX exporter listens on for the namenode and datanode
export HADOOP_JMX_NN_PORT=10400
export HADOOP_JMX_DN_PORT=10401
# Path to the jmx_prometheus_javaagent jar and its rules file
export HADOOP_JMX_EXPORTER_JAR="$HADOOP_HOME/lib/jmx_prometheus_javaagent-0.10.jar"
export HADOOP_JMX_EXPORTER_CONF="$HADOOP_HOME/etc/hadoop/hadoop-exporter.yaml"
# Attach the agent to the namenode and datanode JVMs
export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS -javaagent:$HADOOP_JMX_EXPORTER_JAR=$HADOOP_JMX_NN_PORT:$HADOOP_JMX_EXPORTER_CONF"
export HADOOP_DATANODE_OPTS="$HADOOP_DATANODE_OPTS -javaagent:$HADOOP_JMX_EXPORTER_JAR=$HADOOP_JMX_DN_PORT:$HADOOP_JMX_EXPORTER_CONF"
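The agent also needs the rules file referenced by HADOOP_JMX_EXPORTER_CONF to exist. A minimal sketch that passes every JMX metric through unchanged and then checks the endpoints after restarting the daemons; the file name and ports follow the assumptions set in hadoop-env.sh above:

# Minimal jmx_exporter rules file: export every MBean attribute as-is.
cat > $HADOOP_HOME/etc/hadoop/hadoop-exporter.yaml <<'EOF'
rules:
  - pattern: ".*"
EOF

# Restart the daemons so the -javaagent options take effect, then probe the endpoints.
$HADOOP_HOME/sbin/hadoop-daemon.sh stop namenode && $HADOOP_HOME/sbin/hadoop-daemon.sh start namenode
$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode && $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
curl -s http://localhost:10400/metrics | head
curl -s http://localhost:10401/metrics | head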

Yarn

yarn memory settings
