Incremental Merge with Apache Spark

Big Data

It is common to ingest a large amount of data into the Hadoop Distributed File System (HDFS) for analysis, and more often than not we need to update that data periodically with new changes. For a long time, the most common way to achieve this was to use Apache Hive to incrementally merge new or […]

Databricks Names phData 2020 Rising Star Award Winner

Databricks Rising Star Award 2020

Databricks announced this week during their Partner Summit that phData has been named the 2020 Rising Star Award winner. With multiple joint customer wins over the past year, we’re honored to be recognized with such a high distinction and excited about our future in the Databricks ecosystem. You can learn more about the award in […]

Spark Job History Server OutOfMemoryError

Spark Job History

One of phData’s customers hit an issue where the Spark Job History Server was running out of memory every few hours. The heap size was set to 4 GB, and the customer was not a heavy user of Spark, submitting no more than a couple of jobs a day. We noticed that they had many long-running spark-shell […]