phData has customers across the spectrum of use cases. One of our customers stores vast volumes of XML. One of our engineers was recently asked: Hive sometimes corrupts my data and other times it does not. What is going on?
The answer is quite interesting, so I thought I would share. Specifically, the query they were running was of the form:
select * from table where partitioned_column = X
The data came back fine, but if they ran:
select * from table where partitioned_column = X AND other_col = Y
Both Hive CLI and HUE (via HiveServer2) returned corrupt results. The reason for the difference between the two queries is that the first query doesn’t run a MR job and the bug occurs as part of the MR job.
When Hive runs a MR job, query results are written to HDFS which is in turn read by the Hive CLI or HS2 and returned to the user. The default format for that output is text, so fields with embedded newlines cause corruption for the end user.
(Below we use impala to insert the data since provides a simple insert syntax for testing.)
impala> describe newlinetest2
| name | type | comment |
| c1 | string | |
| c2 | int | |
| c3 | int | |
Then insert a value with an embedded newline:
impala> insert into newlinetest2 values (“arnb”, 1, 2);
Now over in Hive, if we do a query which won’t run an MR job, everything is fine:
hive> select * from newlinetest2;
b 1 2
Time taken: 1.574 seconds, Fetched: 1 row(s)
but any query which runs an MR job results in corruption:
hive> select * from newlinetest2 where c3 = 2;
a NULL NULL
b 1 2
Time taken: 18.057 seconds, Fetched: 2 row(s)
If you set hive.query.result.fileformat to SequenceFile, the query works as expected since the query output is written in a binary format as opposed to a text format.
Side-note, the query below will not execute a MapReduce job if the property hive.fetch.task.conversionis set to a value of more as in that case the filter will run locally, outside of an MR job.