(1)hadoop distcp

distcp can only copy files between clusters; it cannot extract and copy individual columns of a Hive table.
The official documentation for the distcp command:
https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html
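
For a straight directory-level copy of a table's files between the two clusters, a distcp call looks roughly like the following. This is only a sketch: the source and target paths reuse the warehouse locations from the DataX example further below, and flags such as -update are optional.

hadoop distcp -update hdfs://192.168.73.128:8020/user/hive/warehouse/test hdfs://hadoop01:8020/user/hive/warehouse/test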

(2)datax

Use DataX to sync the Hive table test on the cluster hdfs://192.168.73.128:8020 to the Hive table hdfs2hdfs on the cluster hdfs://hadoop01:8020.
Note: before syncing, the hdfs2hdfs table must already exist in the target cluster's Hive.
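
A table definition matching the writer configuration below (columns word/cnt, '|' as the field delimiter, plain text files) could be created on the target cluster roughly like this; the statement is a sketch and assumes the default database and warehouse location:

hive -e "CREATE TABLE hdfs2hdfs (word STRING, cnt INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;"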

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader", 
                    "parameter": {
                        "column": ["*"], 
                        "defaultFS": "hdfs://192.168.73.128:8020", 
                        "encoding": "UTF-8", 
                        "fieldDelimiter": "|", 
                        "fileType": "text", 
                        "path": "/user/hive/warehouse/test"
                    }
                }, 
                "writer": {
                    "name": "hdfswriter", 
                    "parameter": {
                        "column": [
                {
                    "name":"word",
                    "type":"STRING"
                },
                {
                    "name":"cnt",
                    "type":"INT"
                }    
            ], 
                        "defaultFS": "hdfs://hadoop01:8020", 
                        "encoding": "UTF-8", 
                        "fieldDelimiter": "|", 
                        "fileType": "text", 
                        "path": "/user/hive/warehouse/hdfs2hdfs",
            "fileName": "hdfs2hdfs",
            "writeMode": "append",
            "compress": "GZIP"
                    }
                }
            }
        ], 
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
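
Assuming the configuration above is saved as hdfs2hdfs.json, the job is launched with DataX's standard entry script (DATAX_HOME below is a placeholder for the DataX install directory):

python ${DATAX_HOME}/bin/datax.py hdfs2hdfs.json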

(3)sqoop

With Sqoop2 you can configure an HDFS connector to sync data from one HDFS cluster to another.
When creating the job you can specify the table and columns to sync,
as shown in the job-creation transcript below.
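
In the Sqoop2 shell, the flow is roughly as follows. This is a sketch: exact command names and flags vary between Sqoop2 versions, and the link ids 1 and 6 are simply the ids reported in the transcript.

sqoop2-shell
sqoop:000> show link
sqoop:000> create job -f 1 -t 6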

Creating job for links with from id 1 and to id 6  
Please fill following values to create new job object  
Name: mysql_openfire  -- set the job name  
FromJob configuration  
Schema name:(Required)sqoop  -- schema name: required  
Table name:(Required)sqoop  -- table name: required  
Table SQL statement:(Optional)  -- optional  
Table column names:(Optional)  -- optional  
Partition column name:(Optional) id  -- optional  
Null value allowed for the partition column:(Optional)  -- optional  
Boundary query:(Optional)  -- optional  
ToJob configuration  
Output format:  
0 : TEXT_FILE  
1 : SEQUENCE_FILE  
Choose: 0  -- choose the output file format  
Compression format:  
0 : NONE  
1 : DEFAULT  
2 : DEFLATE  
3 : GZIP  
4 : BZIP2  
5 : LZO  
6 : LZ4  
7 : SNAPPY  
8 : CUSTOM  
Choose: 0  -- choose the compression type  
Custom compression format:(Optional)  -- optional  
Output directory:hdfs:/ns1/sqoop  -- HDFS output directory (destination)  
Driver Config  
Extractors: 2  -- number of extractors  
Loaders: 2  -- number of loaders  
New job was successfully created with validation status OK and persistent id 1 
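
Once the job is created it can be started and monitored from the same shell. A sketch (flag syntax differs between Sqoop2 releases; -j takes the persistent job id reported above):

sqoop:000> start job -j 1
sqoop:000> status job -j 1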

For the detailed steps, see:
https://www.iteye.com/blog/muruiheng-2269162


Another issue to consider is Kerberos authentication.
(1)datax: Kerberos credentials can be configured; see the hdfsreader documentation in the DataX repository:
https://github.com/alibaba/DataX/blob/master/hdfsreader/doc/hdfsreader.md
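
Per that document, hdfsreader (and hdfswriter) accept Kerberos settings alongside defaultFS inside the parameter block; a minimal sketch, with a placeholder keytab path and principal:

"haveKerberos": true,
"kerberosKeytabFilePath": "/etc/security/keytabs/datax.keytab",
"kerberosPrincipal": "datax@EXAMPLE.COM"
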
(2)sqoop