# Hadoop + HBase Fully Distributed Deployment
1. Environment Preparation
This deployment uses two machines, 10.130.0.80 and 10.130.0.91 (hostnames are assigned in section 2.3).
Note: since this is a test, only two machines are used. In a real deployment HBase involves ZooKeeper, so a fully distributed deployment is best done on an odd number of machines.
2. Fully Distributed Hadoop Deployment
2.1 Create the hadoop user
Run the following on both machines:
sudo useradd -m hadoop -s /bin/bash   # create the hadoop user with /bin/bash as its login shell
sudo passwd hadoop                    # set a password for it
sudo adduser hadoop sudo              # give the hadoop user sudo rights to simplify deployment
2.2 Install OpenJDK
Install OpenJDK 8 on both machines, then add the following to ~/.bashrc:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-loongarch64
export JRE_HOME=/usr/lib/jvm/java-1.8.0-openjdk-loongarch64/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
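After editing ~/.bashrc, reload it so the new variables take effect (a standard step, implied by the transcript below), then verify the JDK:
source ~/.bashrc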
hadoop@node02:/usr/local/hadoop$ java -version
openjdk version "1.8.0_352"
OpenJDK Runtime Environment (Loongson 8.1.12-loongarch64-Loongnix) (build 1.8.0_352-b08)
OpenJDK 64-Bit Server VM (build 25.352-b08, mixed mode)
2.3 Set the machine hostnames
On 10.130.0.80: run "sudo vim /etc/hostname" and set the hostname to node01 (this machine is the master node).
On 10.130.0.91: run "sudo vim /etc/hostname" and set the hostname to node02.
Write the following into /etc/hosts on both machines:
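A minimal sketch of the expected entries, using the two IP addresses and hostnames from section 2.3:
10.130.0.80 node01
10.130.0.91 node02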
Then run the following on both machines to check that they can ping each other:
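For example (hostnames as set above; -c limits the number of packets):
ping node01 -c 3    # run on node02
ping node02 -c 3    # run on node01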
2.4 Passwordless SSH login
To deploy Hadoop, the master node (node01) must be able to log in to every node without a password.
First, generate node01's key pair. Because the hostname was changed, any previously generated key must be deleted and regenerated. The commands are:
cd ~/.ssh          # if this directory does not exist, run ssh localhost once first
rm ./id_rsa*       # delete any previously generated keys (if present)
ssh-keygen -t rsa  # press Enter at every prompt
2.4.1 Passwordless login from node01 to itself
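A minimal sketch of the usual commands, assuming the key pair generated above lives in ~/.ssh:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys    # authorize the key for the local account
ssh node01                                         # should now log in without asking for a password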
2.4.2 Passwordless login from node01 to node02
Transfer node01's public key to node02:
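A sketch of both ends of this step, assuming the public key is ~/.ssh/id_rsa.pub and the hadoop user exists on node02:
scp ~/.ssh/id_rsa.pub hadoop@node02:/home/hadoop/      # run on node01
mkdir -p ~/.ssh                                        # run on node02
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys             # run on node02
rm ~/id_rsa.pub                                        # run on node02 (optional cleanup)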
Then run the node02-side commands from the sketch above on node02.
2.5 Install hadoop-3.3.4
Run the following on the master node (node01):
sudo tar -zxf hadoop-3.3.4.tar.gz -C /usr/local
cd /usr/local/
sudo mv ./hadoop-3.3.4 ./hadoop    # rename the directory
sudo chown -R hadoop ./hadoop      # change ownership to the hadoop user
Run the following command to check that Hadoop is usable; on success it prints the Hadoop version information:
hadoop@node01:/usr/local/hadoop$ hadoop version
Hadoop 3.3.4
Source code repository https://github.com/apache/hadoop.git -r a585a73c3e02ac62350c136643a5e7f6095a3dbb
Compiled by root on 2023-02-07T03:38Z
Compiled with protoc 3.7.1
From source with checksum fb9dd8918a7b8a5b430d61af858f6ec
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.3.4.jar
2.6 Configure the Cluster / Distributed Environment
For a distributed deployment, the configuration files under /usr/local/hadoop/etc/hadoop need to be modified. Only the settings required for a normal start-up are configured here, in 6 files: workers, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, and hadoop-env.sh. The full set of configuration options is documented in the Configuration section at the bottom of https://hadoop.apache.org/docs/stable/index.html.
2.6.1 Operations on the master node (node01)
workers
This file lists the hostnames of all hosts that act as data nodes. The default is localhost (the local machine acts as a data node), meaning the local machine serves as both the name node and a data node. In this deployment the master (node01) acts as both name node and data node, and node02 acts as a data node, so the file contains the following:
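Assuming both hosts serve as data nodes as described above, workers contains:
node01
node02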
hadoop-env.sh
Set the JAVA_HOME parameter in this file:
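Presumably the same JDK path as added to ~/.bashrc in section 2.2:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-loongarch64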
core-site.xml
This is Hadoop's core, global configuration file; its settings can be referenced from the other configuration files.
<configuration>
<property>
<!-- The default file system (NameNode) address is node01, port 9000 -->
<name>fs.defaultFS</name>
<value>hdfs://node01:9000</value>
</property>
<property>
<!-- Hadoop's temporary directory -->
<name>hadoop.tmp.dir</name>
<value>file:/usr/local/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
</configuration>
hdfs-site.xml
该文件用于设置HDFS的NameNode和DataNode
<configuration>
<!-- Address and port of the secondary NameNode HTTP server -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>node01:50090</value>
</property>
<!-- Number of HDFS replicas -->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<!-- Storage location of the DFS name table (NameNode) -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/tmp/dfs/name</value>
</property>
<!-- Storage location of the DFS data blocks (DataNode) -->
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/tmp/dfs/data</value>
</property>
</configuration>
mapred-site.xml
This is the core MapReduce configuration file; it specifies the runtime framework used by MapReduce.
<configuration>
<!-- The MapReduce runtime framework -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- host:port of the JobHistory Server IPC service -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>node01:10020</value>
</property>
<!-- host:port of the JobHistory Server web UI -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>node01:19888</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
</configuration>
yarn-site.xml
This is the core configuration file of the YARN framework; it specifies the manager (ResourceManager) of the YARN cluster.
<configuration>
<!-- Hostname of the YARN cluster manager (ResourceManager) -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node01</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
2.6.2 Operations on the slave node (node02)
After the 6 configuration files above are done, the /usr/local/hadoop directory on the master node must be copied to each of the other nodes. Run the following for node02:
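A minimal sketch of one way to do the copy, assuming passwordless ssh from node01 and enough space under /home/hadoop (the tarball name is illustrative):
cd /usr/local && tar -zcf ~/hadoop.master.tar.gz ./hadoop     # on node01: pack the configured installation
scp ~/hadoop.master.tar.gz hadoop@node02:/home/hadoop/        # on node01: send it to node02
sudo tar -zxf ~/hadoop.master.tar.gz -C /usr/local            # on node02: unpack it
sudo chown -R hadoop /usr/local/hadoop                        # on node02: fix ownership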
2.7 Start the Cluster
2.7.1 Operations on the master node (node01)
(1) On the master node, format the name node. This only needs to be done once; later Hadoop start-ups do not need it (to re-format, first delete the tmp and logs directories). Run the command:
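The standard formatting command, assuming /usr/local/hadoop/bin is on the hadoop user's PATH:
hdfs namenode -format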
When output like the following appears, the format operation succeeded:
2023-02-09 11:45:44,384 INFO namenode.FSImage: Allocated new BlockPoolId: BP-2007561254-10.130.0.80-1675914344373
2023-02-09 11:45:44,400 INFO common.Storage: Storage directory /usr/local/hadoop/tmp/dfs/name has been successfully formatted.
2023-02-09 11:45:44,436 INFO namenode.FSImageFormatProtobuf: Saving image file /usr/local/hadoop/tmp/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2023-02-09 11:45:44,592 INFO namenode.FSImageFormatProtobuf: Image file /usr/local/hadoop/tmp/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 401 bytes saved in 0 seconds .
2023-02-09 11:45:44,611 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2023-02-09 11:45:44,647 INFO namenode.FSNamesystem: Stopping services started for active state
2023-02-09 11:45:44,648 INFO namenode.FSNamesystem: Stopping services started for standby state
2023-02-09 11:45:44,652 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
2023-02-09 11:45:44,652 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at node01/10.130.0.80
************************************************************/
(2) Start Hadoop
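A sketch of the start-up commands that would account for the process list below, assuming the Hadoop sbin/bin directories are on PATH:
start-dfs.sh                            # NameNode, SecondaryNameNode and DataNodes
start-yarn.sh                           # ResourceManager and NodeManagers
mapred --daemon start historyserver     # JobHistoryServer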
(3) Check the running processes. If everything started correctly, the following processes can be seen on the master node:
hadoop@node01:/usr/local/hadoop$ jps
18449 ResourceManager
19723 JobHistoryServer
19803 Jps
18061 DataNode
18541 NodeManager
18238 SecondaryNameNode
17967 NameNode
(4) Check the live data nodes
hadoop@node01:/usr/local/hadoop$ hdfs dfsadmin -report
......
Live datanodes (2):
Name: 10.130.0.80:9866 (node01)
Hostname: node01
Decommission Status : Normal
Configured Capacity: 44310081536 (41.27 GB)
DFS Used: 8192 (8 KB)
Non DFS Used: 12885385216 (12.00 GB)
DFS Remaining: 31424688128 (29.27 GB)
DFS Used%: 0.00%
DFS Remaining%: 70.92%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Thu Feb 09 20:22:37 CST 2023
Last Block Report: Thu Feb 09 18:49:10 CST 2023
Num of Blocks: 0
Name: 10.130.0.91:9866 (node02)
Hostname: node02
Decommission Status : Normal
Configured Capacity: 44310081536 (41.27 GB)
DFS Used: 8192 (8 KB)
Non DFS Used: 10843045888 (10.10 GB)
DFS Remaining: 33467027456 (31.17 GB)
DFS Used%: 0.00%
DFS Remaining%: 75.53%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
2.7.2 Operations on the slave node (node02)
After the services are started on the master node, two processes, DataNode and NodeManager, can be seen running on the slave node.
2.7.3 Check Hadoop's status through the web UI
Open http://10.130.0.80:9870 in a browser to view the HDFS cluster status.
Open http://10.130.0.80:8088 to view the YARN cluster status.
2.8 Cluster Test
2.8.1 Command-line operations (master node node01)
(1) Create a user directory in HDFS
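Presumably the hadoop user's home directory in HDFS, e.g.:
hdfs dfs -mkdir -p /user/hadoop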
The created user directory can now be seen under Browse the file system.
The contents of the HDFS root directory can be listed with hdfs dfs -ls /:
hadoop@node01:/usr/local/hadoop$ hdfs dfs -ls /
Found 2 items
drwxrwx--- - hadoop supergroup 0 2023-02-20 09:31 /tmp
drwxr-xr-x - hadoop supergroup 0 2023-02-20 09:35 /user
(2) Create an input directory in HDFS and put the files to be processed into it (see the sketch below)
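A sketch of what this step typically looks like for the grep example below, using Hadoop's own XML configuration files as input (relative paths resolve under /user/hadoop):
hdfs dfs -mkdir input
hdfs dfs -put /usr/local/hadoop/etc/hadoop/*.xml input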
(3) Run a MapReduce job
cd /usr/local/hadoop/share/hadoop/mapreduce
hadoop@node01:/usr/local/hadoop/share/hadoop/mapreduce$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar grep input output 'dfs[a-z.]+'
2023-02-09 20:42:05,830 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at node01/10.130.0.80:8032
2023-02-09 20:42:06,461 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1675933900726_0001
2023-02-09 20:42:07,603 INFO input.FileInputFormat: Total input files to process : 0
2023-02-09 20:42:07,671 INFO mapreduce.JobSubmitter: number of splits:0
2023-02-09 20:42:07,845 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1675933900726_0001
2023-02-09 20:42:07,845 INFO mapreduce.JobSubmitter: Executing with tokens: []
Under [Utilities] → [Browse the file system] → [user] → [hadoop], an output directory has been generated:
Under [output] there are two files: _SUCCESS indicates that the job succeeded, and part-r-00000 holds the grep results. Click part-r-00000 to download it locally; its contents are as follows:
1 dfsadmin
1 dfs.replication
1 dfs.namenode.secondary.http
1 dfs.namenode.name.dir
1 dfs.datanode.data.dir
2.8.2 Upload files from the web UI
Click the upload button shown in the figure below to upload a file:
Note: the machine running the browser needs the following entries added to its /etc/hosts:
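Presumably the same entries as in section 2.3:
10.130.0.80 node01
10.130.0.91 node02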
Remark: disable any proxy server, otherwise access will fail.
2.8.3 View files from the web UI
Click Download in the figure below to download a file; Head/Tail the file previews its contents:
Remarks:
1) HDFS stores file blocks under each node's DataNode directory. In this deployment the blocks are stored under:
"/usr/local/hadoop/tmp/dfs/data/current/BP-1730813420-10.130.0.137-1676856486892/current/finalized/subdir0/subdir0".
2) Because this deployment uses 2 replicas, the block file blk_1073741826 exists on both node01 and node02.
3. Fully Distributed HBase Deployment
3.1 Install hbase-2.4.16
Run the following on the master node (node01):
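A sketch of the install steps and environment variables, assuming the release tarball is named hbase-2.4.16-bin.tar.gz:
sudo tar -zxf hbase-2.4.16-bin.tar.gz -C /usr/local
cd /usr/local
sudo mv ./hbase-2.4.16 ./hbase       # rename the directory
sudo chown -R hadoop ./hbase         # change ownership to the hadoop user
# lines to add to ~/.bashrc
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin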
Add the environment entries shown in the sketch above to ~/.bashrc.
3.2 Modify the Configuration Files
3.2.1 Operations on the master node (node01)
3.2.1.1 hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-loongarch64
export HBASE_MANAGES_ZK=true   # use the ZooKeeper bundled with HBase; set this to false to use a separately deployed ZooKeeper cluster
3.2.1.2 hbase-site.xml
<configuration>
<!-- Directory in HDFS where HBase stores its data; the port must match the one set in Hadoop's core-site.xml -->
<property>
<name>hbase.rootdir</name>
<value>hdfs://node01:9000/hbase</value>
</property>
<!-- Hostname and port of the HBase master -->
<property>
<name>hbase.master</name>
<value>node01:60000</value>
</property>
<!-- HBase deployment mode -->
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<!-- HBase temporary directory -->
<property>
<name>hbase.tmp.dir</name>
<value>/usr/local/hbase/tmp</value>
</property>
<property>
<name>hbase.unsafe.stream.capability.enforce</name>
<value>false</value>
</property>
<!-- Hostnames of the ZooKeeper quorum -->
<property>
<name>hbase.zookeeper.quorum</name>
<value>node01,node02</value>
</property>
</configuration>
3.2.1.3 regionservers
This file lists the hostnames of all hosts that act as region servers (data nodes). In this deployment the master node also acts as a data node, so the configuration is:
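Assuming both hosts act as region servers, conf/regionservers contains:
node01
node02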
3.2.2 Operations on the slave node (node02)
Copy the configured HBase installation from the master node to the slave node:
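A minimal sketch, mirroring the approach used for Hadoop in section 2.6.2 (tarball name is illustrative):
cd /usr/local && tar -zcf ~/hbase.master.tar.gz ./hbase     # on node01
scp ~/hbase.master.tar.gz hadoop@node02:/home/hadoop/       # on node01
sudo tar -zxf ~/hbase.master.tar.gz -C /usr/local           # on node02
sudo chown -R hadoop /usr/local/hbase                       # on node02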
Set the following in node02's ~/.bashrc:
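Presumably the same entries as on node01 (see the sketch in section 3.1):
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin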
3.3 Start the Cluster
3.3.1 Operations on the master node (node01)
hadoop@node01:/usr/local/hbase$ bin/start-hbase.sh
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hbase/lib/client-facing-thirdparty/slf4j-reload4j-1.7.33.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hbase/lib/client-facing-thirdparty/slf4j-reload4j-1.7.33.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
node02: running zookeeper, logging to /usr/local/hbase/bin/../logs/hbase-hadoop-zookeeper-node02.out
node01: running zookeeper, logging to /usr/local/hbase/bin/../logs/hbase-hadoop-zookeeper-node01.out
running master, logging to /usr/local/hbase/logs/hbase-hadoop-master-node01.out
node01: running regionserver, logging to /usr/local/hbase/bin/../logs/hbase-hadoop-regionserver-node01.out
node02: running regionserver, logging to /usr/local/hbase/bin/../logs/hbase-hadoop-regionserver-node02.out
hadoop@node01:/usr/local/hbase$ jps
10034 HMaster
26565 NodeManager
25975 NameNode
26072 DataNode
9928 HQuorumPeer
26473 ResourceManager
27001 JobHistoryServer
26252 SecondaryNameNode
10590 Jps
10239 HRegionServer
At the same time, the hbase directory can be seen under the HDFS root directory:
hadoop@node01:/usr/local/hbase$ hdfs dfs -ls /
Found 5 items
drwxr-xr-x - hadoop supergroup 0 2023-02-22 14:28 /hbase
drwxrwx--- - hadoop supergroup 0 2023-02-20 09:31 /tmp
drwxr-xr-x - hadoop supergroup 0 2023-02-20 09:35 /user
-rw-r--r-- 2 dr.who supergroup 3441 2023-02-20 09:40 /解决冲突
3.3.2 Operations on the slave node (node02)
hadoop@node02:/usr/local/hbase$ jps
15457 NodeManager
20930 HQuorumPeer
21063 HRegionServer
15785 DataNode
21593 Jps
3.3.3 Check HBase's status through the web UI
Open http://10.130.0.80:16010 or http://node01:16010 in a browser to view it.
3.4 Cluster Test
Enter the HBase shell:
hadoop@node1:/usr/local/hbase$ hbase shell
Create a table named test123:
hbase:055:0> create 'test123', 'cf'
Created table test123
Took 1.1293 seconds
=> Hbase::Table - test123
Insert some data:
hbase:056:0> put 'test123', 'row1', 'cf:a', 'value100'
Took 0.0209 seconds
hbase:057:0> put 'test123', 'row2', 'cf:b', 'value200'
Took 0.0125 seconds
hadoop@node02:/usr/local/hadoop/tmp/dfs/data/current/BP-1730813420-10.130.0.137-1676856486892$ grep -rn value200
Binary file current/rbw/blk_1073741851 matches
3.5 TeraSort Test
TeraSort is a sort program bundled with Hadoop that is commonly used to exercise a cluster. A TeraSort test consists of the following three steps: generating random input data with TeraGen, sorting it with TeraSort, and validating the result with TeraValidate.
3.5.1 TeraGen: generate random data
Run the command:
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar teragen -Dmapred.map.tasks=128 10000000 /terasort/terasort-input
2023-11-13 11:37:27,450 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2023-11-13 11:37:27,675 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2023-11-13 11:37:27,675 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2023-11-13 11:37:27,935 INFO terasort.TeraGen: Generating 10000000 using 1
2023-11-13 11:37:27,955 INFO mapreduce.JobSubmitter: number of splits:1
2023-11-13 11:37:27,995 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
......
2023-11-13 11:37:50,359 INFO mapreduce.Job: map 100% reduce 0%
2023-11-13 11:37:50,360 INFO mapreduce.Job: Job job_local242609962_0001 completed successfully
2023-11-13 11:37:50,370 INFO mapreduce.Job: Counters: 22
File System Counters
FILE: Number of bytes read=281117
FILE: Number of bytes written=918485
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=0
HDFS: Number of bytes written=1000000000
HDFS: Number of read operations=5
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
HDFS: Number of bytes read erasure-coded=0
Map-Reduce Framework
Map input records=10000000
Map output records=10000000
Input split bytes=82
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=113
Total committed heap usage (bytes)=1166540800
org.apache.hadoop.examples.terasort.TeraGen$Counters
CHECKSUM=21472776955442690
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=1000000000
hadoop@zhaixiaojuan-loongnix-01:~$ hadoop fs -ls /terasort
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2023-11-13 11:37 /terasort/terasort-input
hadoop@zhaixiaojuan-loongnix-01:~$ hadoop fs -ls /terasort/terasort-input
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2023-11-13 11:37 /terasort/terasort-input/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 1000000000 2023-11-13 11:37 /terasort/terasort-input/part-m-00000
This can also be seen in the web UI:
3.5.2 TeraSort: sort the generated random data
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar terasort /terasort/terasort-input /terasort/terasort-output
2023-11-14 09:31:40,955 INFO terasort.TeraSort: starting
2023-11-14 09:31:42,493 INFO input.FileInputFormat: Total input files to process : 1
Spent 301ms computing base-splits.
Spent 3ms computing TeraScheduler splits.
Computing input splits took 305ms
Sampling 8 splits of 8
Making 1 from 100000 sampled records
Computing parititions took 582ms
Spent 892ms computing partitions.
......
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1000000000
File Output Format Counters
Bytes Written=1000000000
2023-11-14 09:33:18,090 INFO terasort.TeraSort: done
hadoop@zhaixiaojuan-loongnix-01:~$ hadoop fs -ls /terasort/terasort-output
Found 3 items
-rw-r--r-- 1 hadoop supergroup 0 2023-11-14 09:33 /terasort/terasort-output/_SUCCESS
-rw-r--r-- 10 hadoop supergroup 0 2023-11-14 09:31 /terasort/terasort-output/_partition.lst
-rw-r--r-- 1 hadoop supergroup 1000000000 2023-11-14 09:33 /terasort/terasort-output/part-r-00000
3.5.3 Validate the data
TeraValidate checks whether the TeraSort output is sorted; if it finds a problem, the out-of-order keys are written to the output directory.
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar teravalidate /terasort/terasort-output /terasort/terasort-validate |&tee sort1G-3.log
2023-11-14 09:43:16,470 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2023-11-14 09:43:16,700 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2023-11-14 09:43:16,700 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2023-11-14 09:43:17,114 INFO input.FileInputFormat: Total input files to process : 1
Spent 164ms computing base-splits.
Spent 5ms computing TeraScheduler splits.
......
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1000000000
File Output Format Counters
Bytes Written=24
hadoop@zhaixiaojuan-loongnix-01:~$ hadoop fs -ls /terasort/
Found 3 items
drwxr-xr-x - hadoop supergroup 0 2023-11-13 11:37 /terasort/terasort-input
drwxr-xr-x - hadoop supergroup 0 2023-11-14 09:33 /terasort/terasort-output
drwxr-xr-x - hadoop supergroup 0 2023-11-14 09:43 /terasort/terasort-validate
hadoop@zhaixiaojuan-loongnix-01:~$ hadoop fs -ls /terasort/terasort-validate
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2023-11-14 09:43 /terasort/terasort-validate/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 24 2023-11-14 09:43 /terasort/terasort-validate/part-r-00000
This can also be seen in the web UI:
Note: if /terasort was created by an earlier run, delete it first with "hadoop fs -rm -r /terasort" before re-running the test.
4. Shut Down the Cluster
/usr/local/hbase/bin/stop-hbase.sh      # stop HBase first
/usr/local/hadoop/sbin/stop-dfs.sh      # stop HDFS
/usr/local/hadoop/sbin/stop-yarn.sh     # stop YARN
5. Troubleshooting Notes
Hadoop file upload error: when uploading a file, the error "Couldn't upload the file xxx" shown in the figure below appeared.
Press F12 to inspect the browser's error messages, shown below. The messages indicate that the problem is an access-permission issue.
Solution: on the master node node01 run "hdfs dfs -chmod 777 /"; the upload then succeeds.
6. 参考
https://book.itheima.net/course/1269935677353533441/1269937996044476418/1269939156776165379
https://dblab.xmu.edu.cn/blog/2775/
https://dblab.xmu.edu.cn/blog/2441/
https://blog.csdn.net/qq_42886289/article/details/90682592
6. Appendix
The HDFS architecture consists mainly of the NameNode and DataNodes.
NameNode: the management node of the whole file system. It maintains the file system's directory tree, the metadata of every file and directory, and the block list of each file, and it serves client requests. The NameNode's on-disk files include:
1) fsimage: the metadata image. It stores the NameNode's in-memory metadata at a point in time, including which data nodes hold each file's blocks.
2) edits: the operation (edit) log.
3) fstime: records the time of the most recent checkpoint.
SecondaryNameNode:
1) A partial HA aid; it does not provide hot standby and only needs to be configured.
2) Workflow: it downloads the metadata files (fsimage and edits) from the NameNode, merges them into a new fsimage, saves it locally, pushes it back to the NameNode, and resets the NameNode's edits log.
3) By default it runs on the NameNode host.
DataNode: provides storage for the actual file data.
1) Block: the basic storage unit. A file of length size is split from offset 0 into fixed-size, sequentially numbered pieces; each piece is called a block.
2) Unlike ordinary file systems, in HDFS a file smaller than one block does not occupy a whole block of storage; it only uses the space it actually needs.
3) Replication: multiple replicas, 3 by default.