Blog

Enabling Big Data Analytics with Arcadia Data

As distributed data platforms like Hadoop and the cloud grow in adoption, business intelligence (BI) and visual analytics increasingly need a more distributed approach. Traditional BI tools no longer scale to meet growing business needs. At phData we continue to run into traditional BI tools failing to adapt to the increasing data […]

Hadoop meets Blockchain: Trust your (Big) Data

At a simple level, blockchains solve a trust problem. Increasingly, companies are relying on third parties to help drive brand recognition and gain consumer trust, which includes trusting third-party data. For these companies to succeed, it is vital that the data they receive is trustworthy and accurate. Each organization involved needs to trust that […]

Log Aggregation, Search, and Alerting on CDH with Pulse

In mid-2017, we were working with one of the world’s largest healthcare companies to put a new data application into production. The customer had grown through acquisition, and to maintain compliance with the FDA, they needed to aggregate data in real time from dozens of different divisions of the company. The consumers of this […]

Getting Started with Kudu

Five years ago, enabling Data Science and Advanced Analytics on the Hadoop platform was hard. Organizations required strong Software Engineering capabilities to successfully implement complex Lambda architectures or even simply implement continuous ingest. Updating or deleting data was simply a nightmare. Complying with the General Data Protection Regulation (GDPR) would have been an extreme challenge at that time. […]

Data Science Enablement: The First of its Kind

Here at phData we are proud to announce our newest service offering in the Big Data space: Data Science Enablement. phData has driven success in Big Data initiatives for the enterprise for years, and we’re excited to now complete the circle of services by enabling organizations to maximize return on the investments they’ve made in Data […]

Cloudera Altus – First Look

I was lucky enough to attend StrataEU 2017, and one of the sessions was “Deploying and managing Hive, Spark, and Impala in the public cloud,” led by Philip Langdale, Eugene Fratkin, and Jennifer Wu. I assumed this was a session on Cloudera Director, which we have lots of experience with, but I decided to pop my […]

My first year at phData

Last month marked my one-year anniversary working at phData, and since a lot of my friends and future applicants have been asking me about my experience so far, I decided to write up a little blog post about it. My experience at phData can be summed up succinctly as different and constantly evolving. I […]

Archiving Navigator Audit Data with StreamSets and Kafka

Andy Stadtler helped with this post. Many of phData’s customers are heavy users of Cloudera Navigator. Cloudera Navigator provides metadata to users, who can also audit all actions performed on data in the cluster. One customer generates an average of 4 GB of audit data per day, which is stored by default in the MySQL […]

Visualizing NetFlow Data with Apache Kudu, Apache Impala (incubating), StreamSets Data Collector, and D3.js

NetFlow is a data format that reflects the IP statistics of all network interfaces interacting with a network router or switch. NetFlow records can be generated and collected in near real time for the purposes of cybersecurity, network quality of service, and capacity planning. For network and cybersecurity analysts interested in these data, being able to […]
