Monday, October 29, 2012

Pull Yahoo Finance Key-Statistics Instantaneously Using XML and XPath in R


The two-part blog post I published a day ago required key stats from Yahoo Finance for all the companies in the control group I created for my research.  I wanted all the key stats pulled, arranged in a data frame and presented side-by-side so I could form my opinions.

The quantmod package has a "getQuote" function which should return the desired metrics.  The number and names of the metrics can be controlled via "yahooQF" (see the script below).  Unfortunately, this method seems to be broken, as the resulting data frame had a large number of null values for some metrics.  Here is the script nonetheless, if one wishes to experiment:

#######################################################################
# Script to download key metrics for a set of stock tickers using the quantmod package
#######################################################################
require(quantmod)
require("plyr")
what_metrics <- yahooQF(c("Price/Sales", 
                          "P/E Ratio",
                          "Price/EPS Estimate Next Year",
                          "PEG Ratio",
                          "Dividend Yield", 
                          "Market Capitalization"))

tickers <- c("AAPL", "FB", "GOOG", "HPQ", "IBM", "MSFT", "ORCL", "SAP")
# Not all the metrics are returned by Yahoo.
metrics <- getQuote(paste(tickers, sep="", collapse=";"), what=what_metrics)

#Add tickers as the first column and remove the first column which had date stamps
metrics <- data.frame(Symbol=tickers, metrics[,2:length(metrics)]) 

#Change colnames
colnames(metrics) <- c("Symbol", "Revenue Multiple", "Earnings Multiple", 
                       "Earnings Multiple (Forward)", "Price-to-Earnings-Growth", "Div Yield", "Market Cap")

#Persist this to the csv file
write.csv(metrics, "FinancialMetrics.csv", row.names=FALSE)

#######################################################################
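
If you run the script above and want to see which metrics actually came back, here is a quick check against the resulting "metrics" data frame (a minimal sketch, my addition rather than part of the original script):

# count missing values per column; metrics that are entirely NA were not returned by Yahoo
colSums(is.na(metrics))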

After some digging around and staring at the raw HTML of Yahoo's Key Statistics page for some time, I decided to use the XML package and XPath operators to get all the nodes which host the key stats (names and values).  This turned out to be a lot simpler.  Let's walk through this in three easy steps:

1) The CSS class name for the HTML nodes which host the name of a metric, such as Market Cap or Enterprise Value, is "yfnc_tablehead1".  This made it quite easy to grab all the elements with this class name from the HTML tree:
 nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")

2) Now all I needed to do was extract the content of each such node with the xmlValue function to get the name of the metric (Enterprise Value, for example):
 measures <- sapply(nodes, xmlValue)

3) Next, to get the value of any metric, I used the getSibling function to get the adjacent node (i.e. the sibling) and the xmlValue function to read its value.  Here is how it was done:
values <- sapply(nodes, function(x)  xmlValue(getSibling(x)))

That's it.  I then used some other common functions to clean up the column names and constructed a data frame to arrange the key stats in a columnar fashion.  Here is the final script, and the result is shown in the graphics below.  Please feel free to use this and share it with other R enthusiasts:

#######################################################################
##Alternate method to download all key stats using XML and XPath - PREFERRED WAY
#######################################################################

setwd("C:/Users/i827456/Pictures/Blog/Oct-25")
require(XML)
require(plyr)
getKeyStats_xpath <- function(symbol) {
  yahoo.URL <- "http://finance.yahoo.com/q/ks?s="
  html_text <- htmlParse(paste(yahoo.URL, symbol, sep = ""), encoding="UTF-8")

  #search for <td> nodes anywhere that have class 'yfnc_tablehead1'
  nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")
  
  if(length(nodes) > 0 ) {
   measures <- sapply(nodes, xmlValue)
   
   #Clean up the column name
   measures <- gsub(" *[0-9]*:", "", gsub(" \\(.*?\\)[0-9]*:","", measures))   
   
   #Remove dups by appending an index to repeated measure names
   dups <- which(duplicated(measures))
   if(length(dups) > 0)
     for(i in seq_along(dups))
       measures[dups[i]] <- paste(measures[dups[i]], i, sep=" ")
   
   #use getSibling to fetch the adjacent node holding the metric's value
   values <- sapply(nodes, function(x)  xmlValue(getSibling(x)))
   
   df <- data.frame(t(values))
   colnames(df) <- measures
   return(df)
  } else {
    warning(paste("No key stats found for symbol:", symbol))
    return(NULL)
  }
}

tickers <- c("AAPL")
stats <- ldply(tickers, getKeyStats_xpath)
rownames(stats) <- tickers
write.csv(t(stats), "FinancialStats_updated.csv",row.names=TRUE)  

#######################################################################
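
To pull stats for a whole watch list instead of a single ticker, a sketch along these lines should work (it assumes the getKeyStats_xpath function above; the tryCatch guard and the output file name are my additions):

# loop over a list of tickers; tryCatch keeps one bad ticker from killing the whole run
tickers <- c("AAPL", "FB", "GOOG", "HPQ", "IBM", "MSFT", "ORCL", "SAP")
stats <- ldply(tickers, function(t) {
  df <- tryCatch(getKeyStats_xpath(t), error=function(e) NULL)
  if(!is.null(df)) df$Symbol <- t  # tag each row with its ticker
  df
})
write.csv(stats, "FinancialStats_watchlist.csv", row.names=FALSE)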




Happy Analyzing!
All Things R &
All Things Analytics (http://goo.gl/CBYQI)


Wednesday, May 23, 2012

If You Are an R Developer, Then You Must Try SAP HANA for Free.



This is a guest blog from Alvaro Tejada Galindo, my colleague and fellow R and SAP HANA enthusiast.  I am thankful to Alvaro for coming and posting on "AllThingsR".

Are you an R developer? Have you ever heard of SAP HANA? Would you like to test SAP HANA for free?

SAP HANA is an in-memory database technology that allows developers to analyze big data in real time.

Processes that took hours now take seconds, thanks to SAP HANA's ability to keep everything in RAM.

As announced at the SAP Sapphire Now event in Orlando, Florida, SAP HANA is now free for developers. You just need to download and install both the SAP HANA Client and the SAP HANA Studio, and create an SAP HANA server on Amazon Web Services as described in the following document:
Get your own SAP HANA DB server on Amazon Web Services - http://scn.sap.com/docs/DOC-28294

Why should this interest you? Easy: SAP HANA is an agent of change, pushing speed to its limits, and it can also be integrated with R, as described in earlier posts on this blog.

Want to know more about SAP HANA? Read everything you need here: http://developers.sap.com

You're convinced but don't want to pay for the Amazon Web Services? No problem. Just leave a comment including your name, company and email. We will reach out and send you an Amazon gift card so you can get started. Of course, your feedback would be greatly appreciated. We only have a limited number of gift cards, so be quick or be out.

Author Alvaro Tejada Galindo, better known as "Blag", is a Development Expert working on the Technology Innovation and Developer Experience team at SAP Labs.  He can be contacted at a.tejada.galindo@sap.com.

Alvaro's background in his own words: I was an ABAP consultant for 11 years. I worked on implementations in Peru and Canada. I'm also a die-hard developer using R, Python, Ruby, PHP, Flex and many more languages. Now I work for SAP Labs, and my main role is to evangelize SAP technologies by writing blogs and articles, helping people on the forums, and attending SAP events, among many other "developer engagement" activities.
I maintain a blog called “Blag’s bag of rants” at blagrants.blogspot.com

Wednesday, May 2, 2012

Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 - Part III


Mash-up Airlines Performance Data with Historical Weather Data to Pinpoint Weather Related Delays

For this exercise, I combined the following four separate blogs that I did on Big Data, R and SAP HANA.  Historical airlines and weather data were used for the underlying analysis.  The aggregated output of this analysis was written out as JSON and visualized in HTML5, D3 and Google Maps.  The previous blogs in this series are:
  1. Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 - Part II
  2. Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps
  3. Getting Historical Weather Data in R and SAP HANA 
  4. Tracking SFO Airport's Performance Using R, HANA and D3
In this blog, I wanted to mash up disparate data sources in R and HANA, combining airlines data with weather data to understand the reasons behind airport/airline delays.  Why weather? Because weather is one of the most commonly cited reasons in the airline industry for flight delays.  Fortunately, the airlines data breaks each delay up by weather, security, late aircraft, etc., so weather-related delays can be isolated and the actual weather data can then be mashed up to validate the airlines' claims.  However, I will not be doing that here; I will just be displaying the mashed-up data.

I have intentionally focused on the three bay-area airports and have used the last 4 years of historical data to visualize each airport's performance in an HTML5 calendar built from scratch using D3.js.  One could use all 20 years of data, and all the airports, to extend this example.  I had downloaded historical weather data for the same 2005-2008 period for the SFO and SJC airports, as shown in my previous blog (for some strange reason, there is no weather data for OAK, huh?).  Here is how the final result looks in HTML5:



Click here to interact with the live example.  Hover over any cell in the live example and a tool tip with comprehensive analytics will show the breakdown of the performance delay for the selected cell, including weather data and the correct icons* - the result of a mash-up.  Choose a different airport from the drop-down to change the performance calendar.
* Weather icons are the property of Weather Underground.

As anticipated, the SFO airport had more red on the calendar than SJC and OAK.  SJC is definitely the best-performing airport in the bay area.  Contrary to my expectation, weather didn't cause as much havoc at SFO as one would expect. Strange?

Creating a mash-up in R for these two data sets was super easy, and a CSV output was produced to work with HTML5/D3.  Here is the R code; if it is not already clear from all my previous blogs: I just love the data.table package.


###########################################################################################  
# Percent of delayed flights from the three bay-area airports, a breakup of the delays by cause, mashed up with weather data
###########################################################################################  

baa.hp.daily.flights <- baa.hp[,list(TotalFlights=length(DepDelay), CancelledFlights=sum(Cancelled, na.rm=TRUE)),
                               by=list(Year, Month, DayofMonth, Origin)]
setkey(baa.hp.daily.flights,Year, Month, DayofMonth, Origin)

baa.hp.daily.flights.delayed <- baa.hp[DepDelay>15,
                                     list(DelayedFlights=length(DepDelay), 
                                      WeatherDelayed=length(WeatherDelay[WeatherDelay>0]),
                                      AvgDelayMins=round(sum(DepDelay, na.rm=TRUE)/length(DepDelay), digits=2),
                                      CarrierCaused=round(sum(CarrierDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2),
                                      WeatherCaused=round(sum(WeatherDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2),
                                      NASCaused=round(sum(NASDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2),
                                      SecurityCaused=round(sum(SecurityDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2),
                                      LateAircraftCaused=round(sum(LateAircraftDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2)), by=list(Year, Month, DayofMonth, Origin)]
setkey(baa.hp.daily.flights.delayed, Year, Month, DayofMonth, Origin)

# Merge two data-tables
baa.hp.daily.flights.summary <- baa.hp.daily.flights.delayed[baa.hp.daily.flights,list(Airport=Origin,
                           TotalFlights, CancelledFlights, DelayedFlights, WeatherDelayed, 
                           PercentDelayedFlights=round(DelayedFlights/(TotalFlights-CancelledFlights), digits=2),
                           AvgDelayMins, CarrierCaused, WeatherCaused, NASCaused, SecurityCaused, LateAircraftCaused)]
setkey(baa.hp.daily.flights.summary, Year, Month, DayofMonth, Airport)

# Merge with weather data
baa.hp.daily.flights.summary.weather <-baa.weather[baa.hp.daily.flights.summary]
baa.hp.daily.flights.summary.weather$Date <- as.Date(paste(baa.hp.daily.flights.summary.weather$Year, 
                                                           baa.hp.daily.flights.summary.weather$Month, 
                                                           baa.hp.daily.flights.summary.weather$DayofMonth, 
                                                           sep="-"),"%Y-%m-%d")
# remove a few columns
baa.hp.daily.flights.summary.weather <- baa.hp.daily.flights.summary.weather[, 
            which(!(colnames(baa.hp.daily.flights.summary.weather) %in% c("Year", "Month", "DayofMonth", "Origin"))), with=FALSE]

#Write the output in both JSON and CSV file formats
objs <- baa.hp.daily.flights.summary.weather[, getRowWiseJson(.SD), by=list(Airport)]
# objs now holds (AirportCode, JSONString) pairs; stitch them together into one JSON array
row.json <- apply(objs, 1, function(x) paste('{"AirportCode":"', x[1], '","Data":', x[2], '}', sep=""))
json.st <- paste('[', paste(row.json, collapse=', '), ']')
writeLines(json.st, "baa-2005-2008.summary.json")                 
write.csv(baa.hp.daily.flights.summary.weather, "baa-2005-2008.summary.csv", row.names=FALSE)
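
If you did want to sanity-check the airlines' weather claims rather than just display them, here is a one-line sketch over the merged table (my addition; it assumes the Rain column comes through as a 0/1 flag from the weather feed, as in my weather post):

# correlation between days it rained and the share of delay attributed to weather
with(baa.hp.daily.flights.summary.weather,
     cor(as.numeric(Rain), WeatherCaused, use="complete.obs"))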


Happy Coding!

Wednesday, April 25, 2012

Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 - Part II


In my last blog, Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps, I analyzed a historical airlines performance data set using R and SAP HANA and put the aggregated analysis on Google Maps.  Undoubtedly, a map is a pretty exciting canvas for viewing and analyzing big data sets.  One can draw shapes (circles, polygons) on the map under a marker pin, providing pin-point information, and display aggregated information in the info-window when a marker is clicked.  So I enjoyed doing all of that, but I was craving some old-fashioned bubble charts and other chart types to provide comparative information on big data sets.  Ultimately, all big data sets get aggregated into smaller analytical sets for viewing, sharing and reporting.  An old-fashioned chart is the best way to tell a visual story!

On a bubble chart, one can display four dimensions of data for comparative analysis.  In this blog's analysis, I used the same data set, which had 200M data points, and went deeper, looking at finer slices of information.  I leveraged D3, R and SAP HANA for this blog post.  Here I am publishing some of this work:

In this first graphic, the performance of the top airlines is compared for 2008.  As expected, Southwest, the largest airline (using total number of flights as a proxy), performed well for its size (1.2M flights, 64 destinations, but an average delay of ~10 mins).  Some of the other airlines, like American and Continental, were the worst performers, along with Skywest.  Note that I didn't remove outliers from this analysis.  Click here to interact with this example (view source to get the D3 code).


In the second analysis, I replaced the airlines dimension with the airports dimension but kept all the other dimensions the same.  To my disbelief, Newark is the worst-performing airport when it comes to departure delays; Chicago O'Hare, SFO and JFK follow.  Atlanta is the largest airport, yet it has the best performance.  What are they doing differently at ATL?  Click here to interact with this example (view source to get the D3 code).


It was a hell of a lot of fun playing with D3, R and HANA; good intellectual stimulation if nothing else!  Happy Analyzing, and remember: the possibilities are endless!

As always, my R modules are fairly simple and straightforward:
###########################################################################################  
#ETL - Read the airport information, extract the major-airport subset and upload this
#transformed dataset into HANA
###########################################################################################
major.airports <- data.table(read.csv("MajorAirports.csv",  header=TRUE, sep=",", stringsAsFactors=FALSE))
setkey(major.airports, iata)

all.airports <- data.table(read.csv("AllAirports.csv",  header=TRUE, sep=",", stringsAsFactors=FALSE)) 
setkey(all.airports, iata)

airports.2008.hp <- data.table(read.csv("2008.csv",  header=TRUE, sep=",", stringsAsFactors=FALSE)) 
setkey(airports.2008.hp, Origin, UniqueCarrier)

#Merge two datasets
airports.2008.hp <- major.airports[airports.2008.hp,]


###########################################################################################  
# Get airport statistics for all airports
###########################################################################################
airports.2008.hp.summary <- airports.2008.hp[major.airports,     
    list(AvgDepDelay=round(mean(DepDelay, na.rm=TRUE), digits=2),
    TotalMiles=prettyNum(sum(Distance, na.rm=TRUE), big.mark=","),
    TotalFlights=length(Month),
    TotalDestinations=length(unique(Dest)),
    URL=paste("http://www.fly", Origin, ".com",sep="")), 
                    by=list(Origin)][order(-TotalFlights)]
setkey(airports.2008.hp.summary, Origin)
#merge two data tables
airports.2008.hp.summary <- major.airports[airports.2008.hp.summary, 
                                                     list(Airport=airport, 
                                                          AvgDepDelay, TotalMiles, TotalFlights, TotalDestinations, 
                                                          Address=paste(airport, city, state, sep=", "), 
                                                          Lat=lat, Lng=long, URL)][order(-TotalFlights)]

airports.2008.hp.summary.json <- getRowWiseJson(airports.2008.hp.summary)
writeLines(airports.2008.hp.summary.json, "airports.2008.hp.summary.json")                 
write.csv(airports.2008.hp.summary, "airports.2008.hp.summary.csv", row.names=FALSE)
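
For readers who would rather stay in R than jump to D3, roughly the same bubble chart can be sketched with ggplot2 (ggplot2 is my addition here, not part of the original workflow; the column names are those of the airports.2008.hp.summary table built above):

require(ggplot2)
# x = airport size (total flights), y = performance (avg delay), bubble size = destinations served
ggplot(airports.2008.hp.summary,
       aes(x=TotalFlights, y=AvgDepDelay, size=TotalDestinations)) +
  geom_point(alpha=0.5, colour="steelblue") +
  geom_text(aes(label=Airport), size=3, vjust=-1) +
  labs(x="Total Flights (2008)", y="Average Departure Delay (mins)")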

Thursday, April 19, 2012

Getting Historical Weather Data in R and SAP HANA

For many of my latest data blogs, I have needed historical weather data to perform data mash-ups and pinpoint causes.  For example, for my continued exploration of the airlines/airports historical data using SAP HANA and R, I wanted to find out whether weather was behind the extreme delays experienced out of a particular airport on a particular day/hour.  So I needed to mash up the weather data with the airlines data for this analysis.

I looked around but could not find a better way to get the weather data, so I turned to R.  To get historical weather data, I use Weather Underground's REST APIs, and I put together a simple program in R that returns the weather data in a data.frame.  This R module gets called from SAP HANA, and it inserts a new table into HANA with the right weather information.  Once I have the data in HANA, I perform the mash-ups in HANA, and off I go on my intellectual pursuit.

Weather Underground returns the data in both XML and JSON formats.  The program logic is very simple [once you have spent hours cracking it, the end product looks simple anyway :-)], and there are appropriate comments in the code below for self-learning.

I want to mention that you are not limited to just getting the historical view of weather data.  You can get the weather forecast for the next 10 days, perform your analysis and predict the future!
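
As a sketch of that idea (assuming the forecast feature is named 'forecast10day' and follows the same URL pattern as the history calls; verify against Weather Underground's current API documentation before relying on it):

getWeatherForecast <- function(airport.code="SFO") {
  # same base URL as the history calls; swap in the forecast feature
  base.url <- 'http://api.wunderground.com/api/{your key here}/'
  final.url <- paste(base.url, 'forecast10day/q/', airport.code, '.json', sep='')
  conn <- url(final.url)
  raw.data <- readLines(conn, n=-1L, ok=TRUE)
  close(conn)
  fromJSON(paste(raw.data, collapse=""))
}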

Make sure to register with Weather Underground (API documentation link), comply with their rules, and get your own key to access their APIs.
############################################################################
require(RJSONIO)  # provides fromJSON
require(plyr)     # provides ldply, used below

getHistoricalWeather <- function(airport.code="SFO", date=format(Sys.Date(), "%Y%m%d"))
{
  base.url <- 'http://api.wunderground.com/api/{your key here}/'
  # compose the final URL: history_YYYYMMDD/q/<airport code>.json
  final.url <- paste(base.url, 'history_', date, '/q/', airport.code, '.json', sep='')

  # read the raw response lines from the web service
  conn <- url(final.url)
  raw.data <- readLines(conn, n=-1L, ok=TRUE)
  close(conn)
  # convert the response text to a JSON object
  weather.data <- fromJSON(paste(raw.data, collapse=""))
  return(weather.data)
}



# get data for 10 days - restriction by Weather Underground for free usage
date.range <- seq.Date(from=as.Date('2006-1-01'), to=as.Date('2006-1-10'), by='1 day')


# Initialize a data frame
hdwd <- data.frame()



# loop over dates, and fetch weather data
for(i in seq_along(date.range)) {
  weather.data <- getHistoricalWeather('SFO', format(date.range[i], "%Y%m%d"))
  hdwd <- rbind(hdwd, ldply(weather.data$history$dailysummary,
      function(x) c('SFO', format(date.range[i]), x$fog, x$rain, x$snow,
                    x$meantempi, x$meanvism, x$maxtempi, x$mintempi)))
}
colnames(hdwd) <- c("Airport", "Date", 'Fog', 'Rain', 'Snow','AvgTemp', 'AvgVisibility','MaxTemp','MinTemp')


# save to CSV
write.csv(hdwd, file=gzfile('SFO-Jan2006.csv.gz'), row.names=FALSE)

############################################################################
Results:

Airport Date       Fog Rain Snow AvgTemp AvgVisibility MaxTemp MinTemp
SFO     2006-01-01 0   1    0    55      14            62      47
SFO     2006-01-02 0   1    0    53      11            55      50
SFO     2006-01-03 0   1    0    51      14            56      46
SFO     2006-01-04 0   0    0    56      16            62      50
SFO     2006-01-05 0   0    0    54      14            60      48
SFO     2006-01-06 0   1    0    52      14            59      45
SFO     2006-01-07 0   1    0    56      14            61      50
SFO     2006-01-08 0   0    0    51      16            57      45
SFO     2006-01-09 0   0    0    49      16            56      41
SFO     2006-01-10 0   0    0    54      10            61      46


Happy Analyzing!

Wednesday, April 11, 2012

Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps

Technologies: SAP HANA, R, HTML5, D3, Google Maps, JQuery and JSON
For this fun exercise, I analyzed more than 200 million data points using SAP HANA and R, and then brought the aggregated results into HTML5 using D3, JSON and the Google Maps APIs.  The 2008 airlines data is from the Data Expo, and I have been using this entire data set (123 million rows and 29 columns) for quite some time; see my other blogs.

The results look beautiful:

Each airport icon is clickable and, when clicked, displays an info-window describing the key stats for the selected airport:
I then used D3 to display the aggregated result set in a modal window (light box):
D3 made it look ridiculously simple to generate a table from a JSON file.
Unfortunately, I can't provide a live example due to the restrictions imposed by the Google Maps APIs; I am approaching my free API limits.

Fun fact:  The Atlanta airport was the largest airport in 2008 on many dimensions: total flights departed, total miles flown, total destinations.  It also experienced a lower average departure delay in 2008 than Chicago O'Hare.  I had always thought Chicago O'Hare was the largest US airport.

As always, I needed just six lines of R code, including the two lines that write the data out to JSON and CSV files:
################################################################################
airports.2008.hp.summary <- airports.2008.hp[major.airports,     
    list(AvgDepDelay=round(mean(DepDelay, na.rm=TRUE), digits=2),
    TotalMiles=prettyNum(sum(Distance, na.rm=TRUE), big.mark=","),
    TotalFlights=length(Month),
    TotalDestinations=length(unique(Dest)),
    URL=paste("http://www.fly", Origin, ".com",sep="")), 
                    by=list(Origin)][order(-TotalFlights)]
setkey(airports.2008.hp.summary, Origin)
#merge the two data tables
airports.2008.hp.summary <- major.airports[airports.2008.hp.summary, 
                                                     list(Airport=airport, 
                                                          AvgDepDelay, TotalMiles, TotalFlights, TotalDestinations, 
                                                          Address=paste(airport, city, state, sep=", "), 
                                                          Lat=lat, Lng=long, URL)][order(-TotalFlights)]

airports.2008.hp.summary.json <- getRowWiseJson(airports.2008.hp.summary)
writeLines(airports.2008.hp.summary.json, "airports.2008.hp.summary.json")                 
write.csv(airports.2008.hp.summary, "airports.2008.hp.summary.csv", row.names=FALSE)
##############################################################################

Happy Coding and remember the possibilities are endless!

Thursday, March 22, 2012

Tracking SFO Airport's Performance Using R, HANA and D3

This was my first introduction to D3 and I am simply blown away.  Mike Bostock (@mbostock), you are a genius; thanks for creating D3!  Put HANA, R, D3, HTML5 and an iPad together, and you've got yourself a KILLER combo!

I have been burning the midnight oil piecing together my big data story using HANA, R, JSON and HTML5.  If you recall, I did a technical session on R and SAP HANA at DKOM, SAP's Development Kickoff event, last week, where I showcased the supreme powers of R and HANA when analyzing 124 million records in real time: R and SAP HANA: A Highly Potent Combo for Real Time Analytics on Big Data

Since last week, I have been looking for other creative ways to analyze and then visualize this airlines data, and I was very fortunate to come across D3.  After spending a couple of hours with D3, I decided to build a calendar view for the airlines data I have.  The calendar view is the first example Mike shows on his D3 page.  Amazingly awesome!

I created this calendar view capturing the percentage of flights departing SFO that were delayed each day between 2005 and 2008.  For this analysis, I used HANA to pull the data for SFO (out of 250-plus airports) over this 4-year period in seconds, and then did all the aggregation in R, including creating a JSON and a .CSV file, again in seconds.  I then moved to HTML5 and D3 to generate this beautiful calendar view showing SFO's performance.  The graphic is presented below:


As expected, December and January are two notorious months for flight delays.  Have fun with the live example hosted in the Amazon cloud.


Once again, my R code is very simple:

## Departure delay for the SFO airport
require(data.table)  # ba.hp below is a data.table
require(RJSONIO)     # provides toJSON, used below
ba.hp.sfo <- ba.hp[Origin=="SFO",]

ba.hp.sfo.daily.flights <- ba.hp.sfo[,list(DailyFlights=length(DepDelay)), by=list(Year, Month, DayofMonth)][order(Year,Month,DayofMonth)]
ba.hp.sfo.daily.flights.delayed <- ba.hp.sfo[DepDelay>15,list(DelayedDailyFlights=length(DepDelay)), by=list(Year, Month, DayofMonth)][order(Year,Month,DayofMonth)]
setkey(ba.hp.sfo.daily.flights.delayed, Year, Month, DayofMonth)
response <- ba.hp.sfo.daily.flights.delayed[ba.hp.sfo.daily.flights]
response <- response[,list(Date=as.Date(paste(Year, Month, DayofMonth, sep="-"),"%Y-%m-%d"),
                           #DailyFlights,DelayedDailyFlights,
                           PercentDelayedFlights=round((DelayedDailyFlights/DailyFlights), digits=2))]
objs <- apply(response, 1, toJSON)
res <- paste('{"dailyFlightStats": [', paste(objs, collapse=', '), ']}')
writeLines(res, "dailyFlightStatsForSFO.json")                
write.csv(response, "dailyFlightStatsForSFO.csv", row.names=FALSE)
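
To check the December/January pattern numerically rather than by eye, here is a quick data.table sketch over the 'response' table built above (my addition, not part of the original post):

# average share of delayed departures by calendar month
response[, list(AvgPctDelayed=mean(PercentDelayedFlights)), by=list(Month=months(Date))]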

For D3 and HTML code, please take a look at this example from D3 website.

Happy Analyzing and Keep That Midnight Oil Burning!

Tuesday, March 20, 2012

Geocode and reverse geocode your data using R, JSON and Google Maps' Geocoding API



To geocode and reverse geocode my data, I use Google's Geocoding service, which returns the geocoded data as JSON.  I recommend registering with the Google Maps API and getting a key if you have a large amount of data and will be doing repeated geocoding.

Geocode:

getGeoCode <- function(gcStr)  {
  library("RJSONIO") #Load Library
  gcStr <- gsub(' ','%20',gcStr) #Encode URL Parameters
  #Open Connection
  connectStr <- paste('http://maps.google.com/maps/api/geocode/json?sensor=false&address=',gcStr, sep="")
  con <- url(connectStr)
  data.json <- fromJSON(paste(readLines(con), collapse=""))
  close(con)
  #Flatten the received JSON
  data.json <- unlist(data.json)
  if(data.json["status"]=="OK")   {
    lat <- data.json["results.geometry.location.lat"]
    lng <- data.json["results.geometry.location.lng"]
    gcodes <- c(lat, lng)
    names(gcodes) <- c("Lat", "Lng")
    return (gcodes)
  }
  #Return NAs when the geocoder reports a non-OK status
  return (c(Lat=NA, Lng=NA))
}
geoCodes <- getGeoCode("Palo Alto,California")
> geoCodes
           Lat            Lng 
  "37.4418834" "-122.1430195" 

Reverse Geocode:
reverseGeoCode <- function(latlng) {
  latlngStr <- gsub(' ','%20', paste(latlng, collapse=",")) #Collapse and Encode URL Parameters
  library("RJSONIO") #Load Library
  #Open Connection
  connectStr <- paste('http://maps.google.com/maps/api/geocode/json?sensor=false&latlng=',latlngStr, sep="")
  con <- url(connectStr)
  data.json <- fromJSON(paste(readLines(con), collapse=""))
  close(con)
  #Flatten the received JSON
  data.json <- unlist(data.json)
  address <- NA #Default when the geocoder reports a non-OK status
  if(data.json["status"]=="OK")
    address <- data.json["results.formatted_address"]
  return (address)
}
address <- reverseGeoCode(c(37.4418834, -122.1430195))
> address
                    results.formatted_address 
"668 Coleridge Ave, Palo Alto, CA 94301, USA" 

Happy Coding!