Hive Corruption Due to Newlines and Carriage Returns

phData has customers across the spectrum of use cases. One of our customers stores vast volumes of XML. One of our engineers was recently asked: Hive sometimes corrupts my data and other times it does not. What is going on?

The answer is quite interesting, so I thought I would share.

Uncovering Hive Corruption

The query they were running was of the form:

select * from table where partitioned_column = X

The data came back fine, but if they ran:

select * from table where partitioned_column = X AND other_col = Y

Both Hive CLI and HUE (via HiveServer2) returned corrupt results. The reason for the difference between the two queries is that the first query doesn’t run a MR job and the bug occurs as part of the MR job.

When Hive runs a MR job, query results are written to HDFS which is in turn read by the Hive CLI or HS2 and returned to the user. The default format for that output is text, so fields with embedded newlines cause corruption for the end user.

For example:

(Below we use impala to insert the data since provides a simple insert syntax for testing.)

impala> describe newlinetest2

+——+——–+———+

| name | type | comment |

+——+——–+———+

| c1 | string | |

| c2 | int | |

| c3 | int | |

+——+——–+———+

Then insert a value with an embedded newline:

impala> insert into newlinetest2 values (“arnb”, 1, 2);

Now over in Hive, if we do a query which won’t run an MR job, everything is fine:

hive> select * from newlinetest2;

b 1 2

Time taken: 1.574 seconds, Fetched: 1 row(s)

but any query which runs an MR job results in corruption:

hive> select * from newlinetest2 where c3 = 2;

a NULL NULL

b 1 2

Time taken: 18.057 seconds, Fetched: 2 row(s)

If you set hive.query.result.fileformat to SequenceFile, the query works as expected since the query output is written in a binary format as opposed to a text format.

Side-note, the query below will not execute a MapReduce job if the property hive.fetch.task.conversionis set to a value of more as in that case the filter will run locally, outside of an MR job.

To learn more, visit our Cloudera services page to see how phData can help your organization.

Hive Corruption Due to Newlines and Carriage Returns

Uncovering Hive Corruption

More to explore

Snowflake Query Tagging Best Practices

Data Ingestion from PostgreSQL to Snowflake using Openflow

The GenAI Strategy Playbook

Join our team

Partners

Resources

Software

Accelerate and automate your data projects with the phData Toolkit

Industries

Solutions

Company

Technology Partners

Other Technology Partners

Check out our latest insights

Snowflake Query Tagging Best Practices

Data Ingestion from PostgreSQL to Snowflake using Openflow

Data Engineering

Consulting, Migrations, Data Pipelines, DataOps

Change Management, Enablement & Learning

COE, Coaching, PMO

Data Science and Machine Learning Services

MLOps Enablement, Prototyping, Model Development and Deployment

Strategy Services

Data, Analytics, and AI Strategy, Architecture and Assessments

Reporting, Analytics, and Visualization Services

Self-Service, Integrated Analytics, Dashboards, Automation

Elastic Operations

Data Platforms, Data Pipelines, and Machine Learning