Last month Engine released a data visualization that showed the growth of tech jobs all across America. It was displayed by Google at both major presidential nominating conventions (the RNC and DNC) with the goal of showing politicians that technology is a major positive factor for job growth, and not just in Silicon Valley and the East Coast. It turned out beautifully and was received very well. Check it out here:

Tech Jobs Across America »

It was built in about 2 weeks by an engineer (myself), a designer, a copy-writer, and a policy analyst. This post is going to be a very technical how-we-did-it of the techjobs dataviz. The full source is now available on Github and all the FusionTables we use are public, so feel free to poke around if you’re the type that learns best by reading code.

There were four main parts we had to tackle:

  1. The data, which had to be somehow sorted and processed to give us a workable set of numbers. (I used a local MySQL install for this)
  2. The visualization, which needed to display those numbers in some kind of viewer-friendly format. (We ended up mapping the data with Fusion Tables)
  3. The interface, which had to be designed and engineered to display 20 different sets of data without overwhelming the user. (jQuery ftw)
  4. The final numbers, which had to be displayed next to the map and change on some interaction by the user.

This is the story of how we did it.

If you’re not interested in reading the full epic tale, here are some lessons that I learned the hard way:

  • It’s really hard to “see” a picture made up of 3,000 numbers. Use technology to make pretty pictures instead!
  • Fusion Tables are NOT AVAILABLE to Google Apps accounts – you have to use a regular Google account.
  • FusionTableLayers don’t support mouseover event listeners. They just don’t.
  • Other than that, Fusion Tables are pretty awesome.
  • The Google Maps JS API is super customizable, visually. (see the wizard to play with styles)

1. Wrangling the Data

The data came to us as massive CSV files: the largest one was over a million lines. It was obviously way more data than we could possibly use for this one visualization, but that’s the mark of any good research project. To make sense of this giant pile of information we needed to be able to ask it questions and pull out only the data we needed. Enter MySQL.

I created a MySQL database on my local machine to house all of our data and tried to import the CSV files using PhpMyAdmin. Unsurprisingly, it failed almost immediately and gave me a ‘file size exceeded’ error. I ended up having to create the table by hand through PhpMyAdmin and then populate it through the command line.

I created the tables using PhpMyAdmin’s GUI table creator, but if you’re curious about what one of these raw data tables looked like, here’s a sample create table statement:

CREATE TABLE IF NOT EXISTS `county_qcew` (
`id` int(11) NOT NULL AUTO_INCREMENT,`
`year` int(4) DEFAULT NULL,
`qtr` int(1) DEFAULT NULL,
`fips` int(11) DEFAULT NULL,
`state_abb` varchar(2) DEFAULT NULL,
`county_name` varchar(300) DEFAULT NULL,
`level` varchar(100) DEFAULT NULL,
`lq` float DEFAULT NULL,
`share` float DEFAULT NULL,
`emp` float DEFAULT NULL,
`emp_qtr_chg` float DEFAULT NULL,
`emp_yr_chg` float DEFAULT NULL,
`emp_5yr_chg` float DEFAULT NULL,
`estab` float DEFAULT NULL,
`estab_qtr_chg` float DEFAULT NULL,
`estab_yr_chg` float DEFAULT NULL,
`estab_5yr_chg` float DEFAULT NULL,
`avg_wage` float DEFAULT NULL,
`avg_wage_qtr_chg` float DEFAULT NULL,
`avg_wage_yr_chg` float DEFAULT NULL,
`avg_wage_5yr_chg` float DEFAULT NULL,
PRIMARY KEY (`id`)
)

I used the ‘load data local infile’ command from the MySQL command line interface to populate the table from the CSV file. I had to go through and make copies of all the CSV files so I could remove the column names in the first row, because as far as I could tell there’s no way to make this command ignore column names. Here’s an example that would populate the above table (run after logging into the MySQL command line and selecting your database):

load data local infile '/path/to/county_qcew_nocolnames.csv'
into table county_qcew
fields terminated by ',' enclosed by '"'
lines terminated by '\n'
(year,qtr,fips,state_abb,county_name,level,lq,share,emp,emp_qtr_chg,emp_yr_chg,emp_5yr_chg,estab,estab_qtr_chg,estab_yr_chg,estab_5yr_chg,avg_wage,avg_wage_qtr_chg,avg_wage_yr_chg,avg_wage_5yr_chg);

If you know anything about SQL you’ll notice I used a basic auto-incrementing integer ‘id’ field for the primary key, which was otherwise worthless but kept me from having to pull some crazy four-part compound primary key from the existing data (year + quarter + industry + county, for example). I never used that ‘id’ field again, but it let everything import smoothly and didn’t carry any risk of skipping rows.

Once everything was in the database we spent a lot of time exploring different ways of sorting the data so we could tell a compelling story. We knew we wanted to talk about how technology companies are vitally important to the economy of the entire country, not just Silicon Valley and New York, but which numbers would work best to do that? Was it the number of companies? The difference in job growth rates between tech and overall? Should it be shown by state, county, metro area, or congressional district? We also wanted to emphasize how big a role startups had in tech’s overall impact on the economy, and we had a separate table for that data – but only up to 2009, whereas the rest of our research went up to 2011. Did we want to combine the data sets? Show them side by side? On different pages entirely?  Maybe overlapping on the same map?

We spent over a week trying to answer those questions. Although nothing was really ‘built’ during that time, I was running tons of SQL queries, trying to extract an overall picture of several different numbers so we could decide what to use.

Unfortunately, it’s kinda hard to ‘see’ anything in a giant list of over 3,000 numbers. (there’s a lot of counties in the US, folks!) Before we could decide what data we were going to display, I had to figure out a way to display it.

2. Gettin’ Jiggy With Fusion Tables

Hattery was founded by three ex-Googlers, so of course as soon as the idea for geo-mapping came up they all said “use Fusion Tables!” Since Google was also a sponsor of this project, Fusion Tables it was. If you’ve never heard of them, Google Fusion Tables are kind of like single-table databases hosted inside Google Docs that can support huge amounts of data and be used to easily graph or chart said data. (if you’re a non-technical kind of person, consider them a fancy spreadsheet)

I’d encountered them before, but only in the casual way in which one learns what others are passionate about. Before this project I’d never touched FTs, but I was about to spend the next few weeks of my life becoming intimately familiar with all their beautiful quirks and infuriating flaws.

First, let me give you a tip that took me a good hour to figure out: Fusion Tables are not available in Google Apps accounts. There is a long and impassioned feature request for it, but for now you have to log in with a regular ol’ google (@gmail) account to create and edit Fusion Tables.

After I finished banging my head on my desk at that discovery, creating FTs was super easy. Just go to your Google Docs (or Drive or whatever the hell it’s called now) then click Create > More > Table (beta), after which you are given lots of lovely options to fill in your new table. One of these ways is by uploading a CSV file, so I went in to my database and created a SQL query that got me just the information I needed for one year.

select fips, emp_yr_chg, avg_wage
from county_qcew
where year = 2011 and qtr = 4 and level = “High Tech”

This query returns data on employment change and average wage for the high-tech industry across all counties in the US for the fourth quarter of 2011. Note that I also included ‘fips,’ which is the numerical code for counties, and is incredibly important as it allows your data to be matched to counties on the map. After running this query in PhpMyAdmin, you can export the result into a CSV file with column name headers that’s perfect for uploading into an FT. Here are the first 4 rows of the 3,130 line CSV this query results in:

“fips”,”emp_yr_chg”,”avg_wage”
“1001″,”0.0843373″,”89606.7″
“1003″,”-0.0042076″,”57319.3″
“1005″,”-0.121212″,”45170.2″

You can create a new FT by uploading this file in the creation dialogue, and voila! You’ve now got one year’s worth of county data in a table online. Which, ok, isn’t incredibly useful because it’s still just a giant bunch of numbers. The real magic comes in this next bit.

(Sidenote: Technically, you could upload the original huge CSV into FT as one table and filter it internally, but it’s incredibly inefficient and makes the system die a little, especially while generating maps.)

Now we need to match those fips codes with the geographical shape of the county. Google, in its infinite wisdom, provides public Fusion Tables with KML district boundaries for US states, counties, and congressional districts for just such a purpose.

Google Fusion Tables: USA State, County and Congressional District Boundaries

Discovering the above webpage was one of those experiences where it felt like the clouds opened up and sunshine and angel choirs suddenly poured down upon my head. That, combined with this tutorial on making custom intensity maps, gave me everything I needed to make a full picture of the yearly employment change in the high-tech Industry for 2011, broken down by county. (in the example I’ve been giving here, I linked the ‘fips’ code to the ‘GEO_ID2′ field in the US County Boundaries table)

 

BAM, done. Like magic.

Obviously, that’s not a very clear picture of anything, but that’s why we did it! Now we could start evaluating different data sets for how well they visualize on a map. Once I got used to the process I could go from someone saying “Clare, can we see the 2006 Tech Job Share broken down by state?” to having the map on the screen in under 2 minutes.

The process is: SQL Query > Export to CSV File > Import to Fusion Table > Merge with Boundary Lines > View and Colorize Map

At this point some of you with more Fusion Tables experience might be saying: “Now, Clare, I know Fusion Tables has built in geo-coding; why didn’t you just upload the data with a zipcode or county name or something and have it do this automatically?”

There are two big reasons why I didn’t:

  1. Fusion Tables has a geo-coding limit of 2,500 addresses per day. Since each map has over 3,000 counties and I wanted to generate multiple maps every day, doing it automatically wasn’t feasible.
  2. The point of doing geo-coding is to get lat/long for a specific point, usually so you can place one of those little red markers on it. Fusion Tables can also do ‘heat maps’ or intensity maps with country boundaries with lat/long points, but none of those things are actually what we want. The only way to get a color-coded intensity map of counties or states is by merging your data with one of the KML tables mentioned above.

Now that we could actually see the meaning in the numbers, we were finally able to make some decisions about what the final map would actually display. We chose ‘High-Tech Job Share’ for the main display (that’s the percent of employed people who work at a high-tech company) and a couple additional data points for a side display, including percent job growth and startups founded that year. Our designers got to work concepting the interface, our analyst started building the final maps and figuring out what ‘buckets’ would be best to represent the distribution of values on the map, and I got to sit down and figure out how to put these maps into a web page.

3. Make It Interactive

So far we’d created a bunch of really great-looking visualizations for our data but they were all inside of Google Docs – not really something we could put on show at a national political convention. We needed a way to display these maps inside a custom web page that also provided an easy way to switch between years and state/county.

The solution was the FusionTablesLayer, a part of the Google Maps JavaScript API v3. Google says this layer is still “experimental,” but I like to live on the edge. The FTLayer does exactly what you would guess: it loads a FusionTables Map as a layer onto a Google Maps object in your webpage. Perfect!

So, I built a couple JavaScript objects to store my FusionTable IDs (just pull it out of the URL when you’re looking at your Fusion Table), one for county and one for state, and made the year the key for easy lookup. (I cut out the middle lines of each object here just for conciseness)

var countyLayers = {};
var countyMaps = {
   "2002": "1i9RCtf6irqJRw8mzDxKWq2lUeFjT8JqMObfu5uA",
   "2003": "1-uxBHA6RUb8gu69UY9bWTgFybP1vLmvszgbLDH0",
   "2004": "1imU3aS3Rfd3JSfRzmwfDlG92rw61B6WVQWIcD0k",
   ...
   "2011": "1zIwNKwoFPH_IRIH3tSSUI03XPT-VsKN_-1RsDWo"
};

var stateLayers = {};
var stateMaps = {
   "2002": "1Hho4EcTvg91-QXIVxnmVt53U6Gn3iJJRR6FaXaY",
   ...
   "2011": "1aulYPwQ5hCrXYYeNYLAukUzS6TGYvlkyQaSP9Do"
};

Then in initialize() where you’re setting up your Google Map object (if you need an introduction to working with Google maps in Javscript, they have an excellent tutorial) loop through those IDs, create your FusionTable Layers, do not set the map yet, and store them away, again with the year as the key.

function initialize() {
   var usa = new google.maps.LatLng(38, -97);
   map = new google.maps.Map(document.getElementById('map-canvas'), {
      center: usa,
      zoom: 5,
      mapTypeId: google.maps.MapTypeId.ROADMAP
   });

   for (county in countyMaps){
      countyLayers[county] = createLayer(countyMaps[county]);
   }

   for (state in stateMaps){
      stateLayers[state] = createLayer(stateMaps[state]);
   }
}

function createLayer(tableID) {
   var layer = new google.maps.FusionTablesLayer({
   query: {
      select: 'geometry',
      from: tableID
   },
   suppressInfoWindows: true
   });
return layer;
}

Now the question becomes: how to keep track of which view is active, which view should be active next, and how to update them? I did this with three global variables.

var activeArea = countyLayers;
var activeYear = “2011″;
var activeLayer;

activeArea will always be set to either the countyLayers or stateLayers objects, activeYear will be set to a string between 2002 and 2011, and activeLayer will be a reference to whichever layer is currently in display. Our “update” function is thus:

function update(){
   activeLayer.setMap(null);
   activeLayer = activeArea[activeYear];
   activeLayer.setMap(map);
}

The code that hooks into the UI elements is super simple. For the county/state buttons:

$("#stateBtn").click(function(){
   activeArea = stateLayers;
   update();
});
$("#countyBtn").click(function(){
   activeArea = countyLayers;
   update();
});

And for the years, which I implemented as a jQuery UI Slider:

$('#slider').slider({
   value:2011,
   min: 2002,
   max: 2011,
   step: 1,
   change: function(event, ui) {
      activeYear = $('#slider').slider("value");
      update();
   }
});

If, for some reason, we wanted a button that would go immediately to the 2003 State view it would look like:

$("#specificBtn").click(function(){
   activeArea = stateLayers;
   activeYear = "2003";
   update();
});

The last trick here was getting those thick state boundary lines to display on top of every map. Those are actually in a separate FTLayer loaded from the public State boundary table and heavily styled inside Javascript (whereas the rest of the maps are pre-styled inside FT).

var stateLines = "17aT9Ud-YnGiXdXEJUyycH2ocUqreOeKGbzCkUw";
var stateLinesMap = new google.maps.FusionTablesLayer({
   query: {
      select: 'geometry',
      from: stateLines
   },
   suppressInfoWindows: true,
   clickable: false,
   styles: [{
      polygonOptions: {
         fillColor: "#ffffff",
         fillOpacity: "0",
         strokeColor: "#666666",
         strokeWeight: "1"
      }
   }]
});

One of the features lacking in FusionTableLayers is the ability to control z-index stacking like you can with normal DOM elements in CSS. Instead, layers are stacked in the order in which they are loaded, which means to have the map lines on top at all times you must reactivate that layer every time you update the data layer.

update() now looks like:

function update(){
   activeLayer.setMap(null);
   activeLayer = activeArea[activeYear];
   activeLayer.setMap(map);
   stateLinesMap.setMap(map);
}

Now we have an embedded Google Map with FusionTablesLayers which can be changed with the county/state buttons or the year slider. Whoo!

Something you might notice here is how nicely the colored layers are shown off against the monochrome background map. This is accomplished by applying a ton of styling to the map when it’s created, basically dropping the saturation and removing all the roads, points of interest, and labels. The map creation code inside of initialize() actually looks like this:

var usa = new google.maps.LatLng(38, -97);
   map = new google.maps.Map(document.getElementById('map-canvas'), {
      center: usa,
      zoom: 5,
      scrollwheel: false,
      streetViewControl: false,
      mapTypeControl: false,
      mapTypeId: google.maps.MapTypeId.ROADMAP
   });
   var styles = [
      {
         stylers: [
            { saturation: -100 }
         ]
      },{
         featureType: "road",
         stylers: [
            { visibility: "off" }
         ]
      },{
         featureType: "poi",
         stylers: [
            { visibility: "off" }
         ]
      },{
         elementType: "labels",
         stylers: [
            { visibility: "off" }
         ]
      }
   ];

   map.setOptions({styles: styles});

Now there’s only one big part left to do – the additional data that gets displayed in the bar on top of the map.

4. Show Me The Numbers

Originally these dashboard numbers were intended to update whenever the viewer moved their mouse over one of the counties/states on the data map. Unfortunately, the mouseover listener is disabled in FusionTablesLayers (again, there’s an active Feature Request for it) and all of the recommended workarounds involve loading all the KML shapes from the FusionTable as native Google Maps object. That’s not exactly feasible when your FT has over 3,000 distinct polygons. :\

So, after a full day of trying (and failing) to figure out a way to trick the system into doing mouseovers, onClick it was!

The data represented by any given county/state is, conveniently, included in the object passed to the click listener for FusionTableLayers. You set a listener with google.maps.event.addListener on the FT layers after setting them up, and then you’re good to go.

for (county in countyMaps){
   countyLayers[county] = createLayer(countyMaps[county]);

   google.maps.event.addListener(countyLayers[county], 'click', function(e) {
      $("#areaDisplay").text(e['row']['County Name']['value']);
      updateData(e['row']);
   });
}

for (state in stateMaps){
   stateLayers[state] = createLayer(stateMaps[state]);
   google.maps.event.addListener(stateLayers[state], 'click', function(e) {
      $("#areaDisplay").text(e['row']['state']['value']);
      updateData(e['row']);
   });
}

Most of this step involved me printing the listener objects to the console and digging through them to find the right string of array keys to extract the data I wanted. (for example, getting the state abbreviation from the clicked-on element is e['row']['state']['value']) Then it’s just a matter of formatting your numbers correctly (getting “2.314″ to be “2.3%” and adding commas to anything > 1,000) and using jQuery to stick them in the appropriate boxes.

The update and format functions depend heavily on your particular data and your HTML setup, but here’s mine if you want an example:

function updateData(data){
   var techjobs = data['ht_share']['value'];
   if (techjobs == -1){
      techjobs = "0.0";
   }
   $("#tech-jobs .data").text(formatPercent(techjobs));
   $("#tech-growth .data").text(formatPercent(data['tech_growth']['value']));
   $("#total-growth .data").text(formatPercent(data['total_growth']['value']));
   if(data['state']){
      $("#area .data").text(state_abbr[data['state']['value']]);
   }else{
      $("#area .data").text(data['County Name']['value']+" County, "+data['State Abbr.']['value']);
   }
   var starts = data['starts']['value'];
   if ((starts == "NULL") || (starts == "")){
      starts = "<span class='nslasha'>N/A**</span>";
   }else if (starts >= 1000){
      starts = starts.replace(/\B(?=(\d{3})+(?!\d))/g, ",")
   }
   $('#starts .data').html(starts);
}

function formatPercent(num){
   if (num == ""){
      return "0.0%";
   }
   var newNum = parseFloat(num);
   if(Math.abs(newNum) >= 100){
      newNum = Math.round(newNum);
   }else{
      newNum = newNum.toPrecision(2);
   }
   return newNum + '%';
}

 And, that’s it!

Nothing to it, right? :p

There’s also a whole bunch of work that went into the ‘quiz’ component of the intro for the map, but that was way more straight-forward than all of this. If you’re really interested, you can check it out in the source code.