Hive Corruption Due to Newlines and Carriage Returns

Computer Code

Share This Post

Share on facebook
Share on linkedin
Share on twitter
Share on email

phData has customers across the spectrum of use cases. One of our customers stores vast volumes of XML. One of our engineers was recently asked: Hive sometimes corrupts my data and other times it does not. What is going on?

The answer is quite interesting, so I thought I would share. Specifically, the query they were running was of the form:

select * from table where partitioned_column = X

The data came back fine, but if they ran:

select * from table where partitioned_column = X AND other_col = Y

Both Hive CLI and HUE (via HiveServer2) returned corrupt results. The reason for the difference between the two queries is that the first query doesn’t run a MR job and the bug occurs as part of the MR job.

When Hive runs a MR job, query results are written to HDFS which is in turn read by the Hive CLI or HS2 and returned to the user. The default format for that output is text, so fields with embedded newlines cause corruption for the end user.

For example:

(Below we use impala to insert the data since provides a simple insert syntax for testing.)

impala> describe newlinetest2


| name | type   | comment |


| c1   | string |         |

| c2   | int   | |

| c3   | int   | |


Then insert a value with an embedded newline:

impala> insert into newlinetest2 values (“arnb”, 1, 2);

Now over in Hive, if we do a query which won’t run an MR job, everything is fine:

hive> select * from newlinetest2;


b 1 2

Time taken: 1.574 seconds, Fetched: 1 row(s)

but any query which runs an MR job results in corruption:

hive> select * from newlinetest2 where c3 = 2;


b 1 2

Time taken: 18.057 seconds, Fetched: 2 row(s)

If you set hive.query.result.fileformat to SequenceFile, the query works as expected since the query output is written in a binary format as opposed to a text format.

Side-note, the query below will not execute a MapReduce job if the property hive.fetch.task.conversionis set to a value of more as in that case the filter will run locally, outside of an MR job.

More To Explore

The Impact of COVID-19 on Machine Learning Models

As the historic COVID-19 pandemic continues to unfold, governments, businesses, and individuals all around the world are making unprecedented changes. Most countries have implemented policies

Want to learn more about phData?

Image of desk