Joining Hadoop and Relational Data using Cirro

Organizations that embark on big-data projects can now leverage enterprise-level implementations of Hadoop from vendors like Cloudera, Hortonworks, and MapR. These vendors offer robust MapReduce processing of large distributed data sets. Moreover, Amazon Elastic MapReduce eliminates the need for large upfront hardware investment, reducing the risk of such projects.

While this is all well and good, some challenges emerge:

  1. How to join together data from MapReduce and traditional relational data sources for analysis.
  2. How to do this efficiently enough to allow for ad-hoc query.

To answer both questions, consider the combination of Explore Analytics and Cirro.

Cirro provides customers with enterprise-class join capabilities for big data and traditional data sources. With Cirro it is easy to access, join, share, interact with, and iteratively analyze any data using SQL. You can think of Cirro as a data hub: it receives SQL queries from BI tools, decomposes them to run against multiple data sources, and assembles the results in real time. A great advantage of Cirro is its ability to push the processing of filter conditions and aggregation down to each data source, thereby reducing the amount of data that needs to be moved and assembled.

Explore Analytics is a sophisticated tool for data analysis and visualization. It’s a SaaS solution that lets users access their data from anywhere, using any device, including desktops and tablets. Non-technical users can build sophisticated queries that join, filter, and aggregate data. They can slice and dice data and create interactive visualizations.

Runaway queries that pull large amounts of data have always been a concern when implementing ad-hoc query capabilities. Such runaway queries are not possible with Explore Analytics. The key to this feature is the ability to push filtering, joins, and aggregation to the data source and to always run limited queries. This is where the integration of Explore Analytics with Cirro shines. Using Cirro, Explore Analytics can join data from heterogeneous data sources, including Hadoop and relational databases.

Explore Analytics hands Cirro a detailed query. Cirro decomposes it into queries against the individual data sources, then assembles the partial results to deliver exactly the results the user asked for.
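To make the idea of pushing work down to each source more concrete, here is a minimal Python sketch. It is not Cirro's implementation or API. Two in-memory SQLite databases stand in for a Hadoop source and a relational source, and the table names, columns, and the clicks_by_customer function are all hypothetical. The point is only that each source filters and aggregates its own data, so the hub joins two small result sets instead of moving raw rows.

```python
import sqlite3

# Assumption: two in-memory SQLite databases stand in for a Hadoop source
# and a relational source. All tables and rows are made up for illustration.
hadoop = sqlite3.connect(":memory:")
hadoop.execute("CREATE TABLE clicks (customer_id INTEGER, year INTEGER)")
hadoop.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [(1, 2012), (1, 2012), (2, 2012), (2, 2011), (3, 2012)],
)

relational = sqlite3.connect(":memory:")
relational.execute("CREATE TABLE customers (id INTEGER, name TEXT, region TEXT)")
relational.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Acme", "West"), (2, "Globex", "East"), (3, "Initech", "West")],
)


def clicks_by_customer(year, region):
    """Hypothetical hub query: click counts per customer name."""
    # Pushed down to the big-data source: filter by year and aggregate there,
    # so only one row per customer comes back.
    counts = dict(
        hadoop.execute(
            "SELECT customer_id, COUNT(*) FROM clicks "
            "WHERE year = ? GROUP BY customer_id",
            (year,),
        )
    )
    # Pushed down to the relational source: filter by region there.
    customers = relational.execute(
        "SELECT id, name FROM customers WHERE region = ?", (region,)
    ).fetchall()
    # Assembled in the hub: join the two small result sets on customer id.
    return sorted((name, counts[cid]) for cid, name in customers if cid in counts)


result = clicks_by_customer(2012, "West")
print(result)
```

A real federation engine would also have to decide which conditions each source can evaluate and how to order the join; this sketch hard-codes both decisions.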

Conclusion

Together, Explore Analytics and Cirro enable non-technical users to access big data and combine it with traditional data sources. Users can focus on their real goals: obtaining actionable information, understanding business drivers, and predicting future trends. Isn’t that what Business Intelligence is all about?

Choropleth

A map chart is a great way to visualize spatial relationships in data by plotting the data on a geographical map. People are very good at reading maps, which allows a map chart to communicate a great deal of information effectively.

In this blog post I’d like to focus on a particular type of map chart called a choropleth. In this type of map chart we color areas on a map to indicate a value or a category of data for each area.

In the following example, we show US unemployment data at the state level. The chart below is interactive and I encourage you to play with the settings. We explain these settings later in this article.

Data

The data shown in a choropleth can be categories or numerical values. The unemployment rates in our example are numerical values, of course. As an example of categories, we could show the result of an election with a color for every political party.

When showing categories, the colors are assigned from a palette of distinctive colors similar to the colors used in a pie chart. When showing numerical values, we use color schemes that reflect the numerical value using color shades.

Let’s consider two types of color schemes. The first, sequential, is designed for data whose values progress from low to high. The second, diverging, puts equal emphasis on mid-range critical values and on the extremes at both ends of the data range. An example of diverging data is acidity, where a pH of 7 is neutral, and lower and higher pH values diverge toward acidity and alkalinity, respectively.

In our example we used a sequential scheme that reflects progression from low unemployment in states such as North Dakota to high unemployment in states such as Nevada.

Levels

The range of numerical values is divided into a number of levels that are then mapped to colors. You can set the number of levels to between 3 and 9 to control the granularity of this mapping. In our example, we used just 5 levels.

Scale

There are two options for mapping numerical values to colors:

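  • In a linear scale, the range of data values is divided equally into the specified number of levels, so every level spans the same width of values.
  • In a quantile scale, the range of data values is divided based on data frequencies, so every level contains roughly the same number of data points.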

In our example, we use a quantile scale. Since we selected 5 levels, the states are divided into quintiles. The first quintile has the 20% of states with the lowest unemployment; the next quintile has the next 20%, and so on.
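The difference between the two scales is easier to see with numbers. The following Python sketch is only an illustration, not the code behind the chart: the function names are hypothetical, the rates are made-up values rather than real unemployment figures, and the quantile version uses a simple rank-based rule that is one of several reasonable ways to compute quantiles.

```python
def linear_level(value, values, levels):
    """Level index when the value range is cut into equal-width bands."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0
    width = (hi - lo) / levels
    # The maximum value would land one past the last band, so clamp it.
    return min(int((value - lo) / width), levels - 1)


def quantile_level(value, values, levels):
    """Level index when each band holds about the same number of values."""
    # Rank-based rule (an assumption): count how many values are smaller.
    below = sum(1 for v in values if v < value)
    return min(below * levels // len(values), levels - 1)


# Made-up rates, skewed toward the low end with a few high outliers.
rates = [3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 7.0, 9.0, 13.0]

linear = [linear_level(r, rates, 5) for r in rates]
quantile = [quantile_level(r, rates, 5) for r in rates]
print(linear)    # [0, 0, 0, 0, 1, 1, 1, 2, 3, 4]
print(quantile)  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```

With skewed data, the linear scale crowds most values into the lowest levels, while the quantile scale spreads them evenly across all five.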

Color Schemes

Depending on whether the data is sequential or diverging, we have a choice of several appropriate color schemes. The scheme you see in our example is based on designs by Cynthia Brewer and Mark Harrower at The Pennsylvania State University.
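Once every value has a level, picking its color is a simple lookup. The sketch below assumes 5 levels and uses two illustrative palettes in the Brewer style: a light-to-dark sequential one and a diverging one with a neutral middle. The hex values and names are placeholders for illustration and are not necessarily the colors used in the chart above.

```python
# Illustrative palettes (assumed hex values, ordered from level 0 to level 4).
SEQUENTIAL_5 = ["#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c"]
DIVERGING_5 = ["#ca0020", "#f4a582", "#f7f7f7", "#92c5de", "#0571b0"]


def color_for_level(level, palette):
    """Return the palette color for a level index, rejecting bad indices."""
    if not 0 <= level < len(palette):
        raise ValueError(f"level {level} is outside 0..{len(palette) - 1}")
    return palette[level]


# Sequential: the highest level gets the darkest shade.
print(color_for_level(4, SEQUENTIAL_5))
# Diverging: the middle level gets the neutral color.
print(color_for_level(2, DIVERGING_5))
```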