This post is reproduced from https://www.cnblogs.com/barneywill/p/10109508.html. I stumbled across it by chance; the author's comparison tests are very thorough. When we designed our new data warehouse, we never ran comparisons in this much detail.
Common formats
textfile: requires a field delimiter to be defined, takes a lot of space, has the lowest read/write efficiency, and delimiter collisions are very easy to hit; it is basically only used when data needs to be ingested, e.g. loading CSV files.
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' LINES TERMINATED BY '\n' STORED AS TEXTFILE
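For reference, a minimal end-to-end sketch (the table name, columns, and HDFS path below are made up for illustration): a textfile staging table with a comma delimiter, plus a LOAD DATA statement for a CSV file that already sits on HDFS.

```sql
-- Hypothetical staging table for CSV ingestion (name, columns, and path are examples).
CREATE TABLE stg_csv_demo (
  id     BIGINT,
  name   STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- Move a CSV file that is already on HDFS into the table's directory.
LOAD DATA INPATH '/tmp/demo/data.csv' INTO TABLE stg_csv_demo;
```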
json: natively supported since Hive 3.0; on earlier versions a third-party SerDe is required (http://www.congiu.net/hive-json-serde/), loaded with:

add jar hdfs://nn/jarpath/json-udf-1.3.8-jar-with-dependencies.jar;
add jar hdfs://nn/jarpath/json-serde-1.3.8-jar-with-dependencies.jar;

Takes the most space and has low read/write efficiency; basically only used when data needs to be ingested, e.g. loading JSON files.
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE
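Putting the pieces together, a minimal sketch of a JSON-backed external table using the third-party SerDe above (table name, columns, and LOCATION are made up for illustration; the jar path is the one from the add jar example):

```sql
-- Register the SerDe jar first (path taken from the example above).
ADD JAR hdfs://nn/jarpath/json-serde-1.3.8-jar-with-dependencies.jar;

-- Hypothetical external table over a directory of JSON files, one object per line.
CREATE EXTERNAL TABLE stg_json_demo (
  id   BIGINT,
  name STRING,
  tags array<string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/tmp/demo/json/';
```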
xml: requires the hivexmlserde SerDe jar: http://central.maven.org/maven2/com/ibm/spss/hive/serde2/xml/hivexmlserde/1.0.0.0/hivexmlserde-1.0.0.0.jar
```sql
CREATE TABLE xml_bank(
  customer_id STRING,
  income BIGINT,
  demographics map<string,string>,
  financial map<string,string>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.customer_id"="/record/@customer_id",
  "column.xpath.income"="/record/income/text()",
  "column.xpath.demographics"="/record/demographics/*",
  "column.xpath.financial"="/record/financial/*"
)
TBLPROPERTIES (
  "xmlinput.start"="<record customer",
  "xmlinput.end"="</record>"
);
```
lzo: like textfile but with LZO compression added, so it takes less space.
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
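A minimal sketch of full DDL around that fragment (the table name and columns are made up; the hadoop-lzo libraries must already be installed on the cluster, and the SET statements for writing LZO output are an assumption, not something covered in the original post):

```sql
-- Hypothetical LZO-compressed text table.
CREATE TABLE demo_lzo (
  id   BIGINT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

-- Assumed settings so that INSERTs actually write LZO-compressed files.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
```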
orc: columnar storage with the smallest footprint; a very good fit for a data warehouse.
STORED AS ORC
Compression
STORED AS ORC TBLPROPERTIES ("orc.compression"="ZLIB")
STORED AS ORC TBLPROPERTIES ("orc.compression"="SNAPPY")
Note: before setting the ORC compression format, be sure to set the following first:
set hive.exec.orc.compression.strategy=COMPRESSION;
otherwise the compression setting does not take effect.
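A minimal sketch tying the two statements together (table names are made up; the staging table is the hypothetical one from the textfile example, and the TBLPROPERTIES key is quoted exactly as the post writes it):

```sql
-- Per the note above, set the compression strategy first.
SET hive.exec.orc.compression.strategy=COMPRESSION;

-- Hypothetical ORC table, requesting ZLIB compression via the table property
-- shown above (ZLIB is also ORC's default codec).
CREATE TABLE demo_orc (
  id   BIGINT,
  name STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compression"="ZLIB");

-- Copy data in from the text staging table.
INSERT OVERWRITE TABLE demo_orc
SELECT id, name FROM stg_csv_demo;
```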
parquet: columnar storage with a medium footprint; if the data will later be processed with Spark, Parquet is the best format.
STORED AS PARQUET
parquet+snappy
STORED AS PARQUET TBLPROPERTIES ("parquet.compression"="SNAPPY")
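A common way to convert an existing table to Parquet is CREATE TABLE ... AS SELECT; here is a minimal sketch (table names are made up, reusing the hypothetical staging table from the textfile example):

```sql
-- Hypothetical conversion of the text staging table into Snappy-compressed Parquet.
CREATE TABLE demo_parquet
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY")
AS SELECT * FROM stg_csv_demo;
```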
Benchmark

Test table: test_table
Rows: 1 billion
Query type: aggregation
Query: select col_1, count(1) from test_table group by col_1;
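As a rough sketch of how such a comparison can be set up (the source table name test_table_text and the use of CTAS are assumptions; only test_table, col_1, and the query come from the original post), the same data is copied once per format and the same aggregation is run against each copy:

```sql
-- Hypothetical per-format copies of the same 1-billion-row source table.
CREATE TABLE test_table_orc_snappy
STORED AS ORC
TBLPROPERTIES ("orc.compression"="SNAPPY")
AS SELECT * FROM test_table_text;

CREATE TABLE test_table_parquet_snappy
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY")
AS SELECT * FROM test_table_text;

-- The benchmark query, run against each copy in turn.
SELECT col_1, count(1) FROM test_table_orc_snappy GROUP BY col_1;
```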
Results:

| format | textfile (hdfs) | lzo (hdfs) | parquet (hdfs) | parquet snappy (hdfs) | orc (hdfs) | orc snappy (hdfs) | kudu |
| --- | --- | --- | --- | --- | --- | --- | --- |
| capacity | 464.0 G | 169.4 G | 177.2 G | 111.3 G | 71.5 G | 65.7 G | 184 G |
| capacity (relative) | 100% | 36% | 37% | 23% | 15% | 14% | 39% |
| Hive 2.3.4 | 816 s | 711 s | 250 s | 158 s | 130 s | 127 s | |
| Hive 2.3.4 tuning | | | 251 s | 163 s | 109 s | 96 s | |
| Hive 2.3.4 on Spark 2.4.0 | | | 54 s | 47 s | 149 s | 138 s | |
| Spark 2.1.1 | 371 s | 293 s | 17 s | 16 s | 51 s | | |
| Spark 2.4.0 | 496 s | 297 s | 16 s | 16 s | 21 s | 21 s | |
| Drill 1.15.0 | | | 59 s | 57 s | 75 s | 45 s | |
| Impala 2.12 | | | | 15 s | | | 16 s |
| Presto 0.215 | | | 25 s | 21 s | 13 s | 12 s | |
- In terms of data size and query speed, the best performer is presto + orc + snappy;
- under Hive the best format is orc snappy: the data is smallest and queries are fastest;
- switching the Hive engine to Spark improves Parquet queries somewhat, but with the same resources it is far slower than running Spark SQL directly;
- Spark 2.3 and later greatly improved ORC support compared with earlier versions; it is now very close to Parquet;
- impala + parquet + hdfs performs about the same as impala + kudu; Kudu's advantage is support for real-time updates;
- Drill does not look necessary;
- the Parquet shipped in Spark 2.4.0 is version 2.4; Parquet supports column indexes starting from 2.5, so future Spark versions should query Parquet even faster;
- Impala supports ORC as an experimental feature starting with 3.1.
Detailed data (YARN cluster: 200 G memory, 50 cores)
### 1 hive-2.3.4

```
set mapreduce.map.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3072m;

hive-textfile:
Time taken: 816.202 seconds, Fetched: 32 row(s)
Stage-Stage-1: Map: 1831 Reduce: 1009 Cumulative CPU: 27614.77 sec HDFS Read: 498267775168 HDFS Write: 88861 SUCCESS
Total MapReduce CPU Time Spent: 0 days 7 hours 40 minutes 14 seconds 770 msec

hive-lzo:
Time taken: 711.266 seconds, Fetched: 32 row(s)
Stage-Stage-1: Map: 183 Reduce: 711 Cumulative CPU: 13949.24 sec HDFS Read: 181881436157 HDFS Write: 62935 SUCCESS
Total MapReduce CPU Time Spent: 0 days 3 hours 52 minutes 29 seconds 240 msec

hive-orc:
Time taken: 130.194 seconds, Fetched: 32 row(s)
Stage-Stage-1: Map: 275 Reduce: 300 Cumulative CPU: 4368.67 sec HDFS Read: 626004573 HDFS Write: 27178 SUCCESS
Total MapReduce CPU Time Spent: 0 days 1 hours 12 minutes 48 seconds 670 msec

hive-orc snappy:
Time taken: 127.803 seconds, Fetched: 32 row(s)
Stage-Stage-1: Map: 191 Reduce: 276 Cumulative CPU: 4374.74 sec HDFS Read: 580889407 HDFS Write: 25090 SUCCESS
Total MapReduce CPU Time Spent: 0 days 1 hours 12 minutes 54 seconds 740 msec

hive-orc-tuning:
Time taken: 109.539 seconds, Fetched: 32 row(s)
Stage-Stage-1: Map: 275 Reduce: 300 Cumulative CPU: 3051.67 sec HDFS Read: 627064673 HDFS Write: 40321 SUCCESS
Total MapReduce CPU Time Spent: 50 minutes 51 seconds 670 msec

hive-orc snappy-tuning:
Time taken: 94.135 seconds, Fetched: 32 row(s)
Stage-Stage-1: Map: 191 Reduce: 276 Cumulative CPU: 2393.92 sec HDFS Read: 581727151 HDFS Write: 37201 SUCCESS
Total MapReduce CPU Time Spent: 39 minutes 53 seconds 920 msec

hive-parquet:
Time taken: 250.786 seconds, Fetched: 32 row(s)
Stage-Stage-1: Map: 642 Reduce: 744 Cumulative CPU: 10919.85 sec HDFS Read: 873784253 HDFS Write: 65806 SUCCESS
Total MapReduce CPU Time Spent: 0 days 3 hours 1 minutes 59 seconds 850 msec

hive-parquet snappy:
Time taken: 158.009 seconds, Fetched: 32 row(s)
Stage-Stage-1: Map: 367 Reduce: 467 Cumulative CPU: 6246.0 sec HDFS Read: 721915438 HDFS Write: 41707 SUCCESS
Total MapReduce CPU Time Spent: 0 days 1 hours 44 minutes 6 seconds 0 msec
```
### 2 hive-2.3.4 on spark-2.4.0
```
set spark.driver.memory=4g;
set spark.executor.memory=4g;
set spark.executor.instances=10;

hive on spark-parquet:
Time taken: 54.446 seconds, Fetched: 32 row(s)
hive on spark-parquet snappy:
Time taken: 47.364 seconds, Fetched: 32 row(s)
hive on spark-orc:
Time taken: 149.901 seconds, Fetched: 32 row(s)
hive on spark-orc snappy:
Time taken: 138.844 seconds, Fetched: 32 row(s)
```
### 3 impala-2.12
```
MEM_LIMIT=20g * 3

impala-parquet snappy:
Fetched 32 row(s) in 15.10s
+--------------+--------+-----------+-----------+--------+------------+-----------+---------------+---------------------------------------------------+
| Operator     | #Hosts | Avg Time  | Max Time  | #Rows  | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                                            |
+--------------+--------+-----------+-----------+--------+------------+-----------+---------------+---------------------------------------------------+
| 04:EXCHANGE  | 1      | 211.45us  | 211.45us  | 32     | 50         | 208.00 KB | 0 B           | UNPARTITIONED                                     |
| 03:AGGREGATE | 3      | 2.58ms    | 2.91ms    | 32     | 50         | 34.03 MB  | 128.00 MB     | FINALIZE                                          |
| 02:EXCHANGE  | 3      | 29.23us   | 30.92us   | 96     | 1.04B      | 32.00 KB  | 0 B           | HASH(cpp_addr_province)                           |
| 01:AGGREGATE | 3      | 13.29s    | 13.97s    | 96     | 1.04B      | 34.05 MB  | 128.00 MB     | STREAMING                                         |
| 00:SCAN HDFS | 3      | 723.09ms  | 760.01ms  | 1.04B  | 1.04B      | 36.55 MB  | 88.00 MB      | temp.app_ba_userprofile_prop_nonpolar_view_ext_ps |
+--------------+--------+-----------+-----------+--------+------------+-----------+---------------+---------------------------------------------------+

impala-kudu:
Fetched 32 row(s) in 15.61s
```
### 4 drill-1.15
```
10g+10g+1g+1g * 3

drill-parquet:
32 rows selected (59.501 seconds)
drill-parquet snappy:
32 rows selected (57.653 seconds)
drill-orc:
32 rows selected (75.749 seconds)
drill-orc snappy:
32 rows selected (45.323 seconds)
```
### 5 spark-sql --master yarn --num-executors 10 --executor-memory 4g --driver-memory 4g
5.1 spark-2.1.1

```
spark sql-textfile:
Time taken: 371.77 seconds, Fetched 32 row(s)
spark sql-lzo:
Time taken: 293.391 seconds, Fetched 32 row(s)
spark sql-parquet:
Time taken: 17.338 seconds, Fetched 32 row(s)
spark sql-parquet snappy:
Time taken: 16.609 seconds, Fetched 32 row(s)
spark sql-orc:
Time taken: 51.959 seconds, Fetched 32 row(s)
```
5.2 spark-2.4.0

```
spark sql-textfile:
Time taken: 496.395 seconds, Fetched 32 row(s)
spark sql-lzo:
Time taken: 297.142 seconds, Fetched 32 row(s)
spark sql-parquet:
Time taken: 16.728 seconds, Fetched 32 row(s)
spark sql-parquet snappy:
Time taken: 16.879 seconds, Fetched 32 row(s)
spark sql-orc:
Time taken: 21.432 seconds, Fetched 32 row(s)
spark sql-orc snappy:
Time taken: 21.935 seconds, Fetched 32 row(s)
```
### 6 presto

```
presto-parquet:
Splits: 3,182 total, 3,182 done (100.00%)
0:25 [1.04B rows, 612MB] [42.2M rows/s, 24.9MB/s]
presto-parquet snappy:
Splits: 2,088 total, 2,088 done (100.00%)
0:21 [1.04B rows, 584MB] [49.3M rows/s, 27.8MB/s]
presto-orc:
Splits: 1,532 total, 1,532 done (100.00%)
0:13 [1.04B rows, 850MB] [81.7M rows/s, 66.8MB/s]
presto-orc snappy:
Splits: 1,353 total, 1,353 done (100.00%)
0:12 [1.04B rows, 1.13GB] [87.5M rows/s, 97.4MB/s]
```