Monday, October 29, 2012

Pull Yahoo Finance Key-Statistics Instantaneously Using XML and XPath in R


This two-part blog post I published a day ago required key-stats from Yahoo Finance for all the companies in the control group I created for my research.  I wanted all the key-stats pulled, arranged in a data-frame and then present them side-by-side to form my opinions.

Quantmod package has "getQuote" method which should return the desired metrics.  The number and names of the metrics can be controlled via "yahooQF" (see the script below.)  Unfortunately, this method seems to be broken as the resulting data-frame had large amount of null values for some metrics.  Here is the script nonetheless if one wishes to experiment:

#######################################################################
# Script to download key metrics for a set of stock tickers using the quantmod package
#######################################################################
require(quantmod)
require("plyr")
what_metrics <- yahooQF(c("Price/Sales", 
                          "P/E Ratio",
                          "Price/EPS Estimate Next Year",
                          "PEG Ratio",
                          "Dividend Yield", 
                          "Market Capitalization"))

tickers <- c("AAPL", "FB", "GOOG", "HPQ", "IBM", "MSFT", "ORCL", "SAP")
# Not all the metrics are returned by Yahoo.
metrics <- getQuote(paste(tickers, sep="", collapse=";"), what=what_metrics)

#Add tickers as the first column and remove the first column which had date stamps
metrics <- data.frame(Symbol=tickers, metrics[,2:length(metrics)]) 

#Change colnames
colnames(metrics) <- c("Symbol", "Revenue Multiple", "Earnings Multiple", 
                       "Earnings Multiple (Forward)", "Price-to-Earnings-Growth", "Div Yield", "Market Cap")

#Persist this to the csv file
write.csv(metrics, "FinancialMetrics.csv", row.names=FALSE)

#######################################################################

After some digging around and staring at the raw HTML for Yahoo's KeyStats page for sometime, I decided to use the XML package and the XPath operators to get all the nodes which host key stats (name and values).  This turned out to be lot simpler.  Let's walk through this using three easy steps:

1) The CSS class name for the HTML nodes which host the name of the metric such as Market Cap or Enterprise value is "yfnc_tablehead1".  This made it quite easy to grab all the elements from the HTML tree with this class name:
 nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")

2) Now all I needed to do was get the value of this node using xmlValue function to get the name of the metric (Enterprise Value as an example):
 measures <- sapply(nodes, xmlValue)

3) Next, to get the value of any metric, I used the getSibling function to get the adjacent node (i.e. sibling) and used xmlValue function to get the value.  Here is how it was done:
values <- sapply(nodes, function(x)  xmlValue(getSibling(x)))

This is it, I then used some other common functions to clean up column names and constructed a data-frame to arrange the key-stats in a columnar fashion.  Here is the final script and the result is shown in the graphics below.  Please feel free to use this and share it with other R enthusiasts:

#######################################################################
##Alternate method to download all key stats using XML and x_path - PREFERRED WAY
#######################################################################

setwd("C:/Users/i827456/Pictures/Blog/Oct-25")
require(XML)
require(plyr)
getKeyStats_xpath <- function(symbol) {
  yahoo.URL <- "http://finance.yahoo.com/q/ks?s="
  html_text <- htmlParse(paste(yahoo.URL, symbol, sep = ""), encoding="UTF-8")

  #search for <td> nodes anywhere that have class 'yfnc_tablehead1'
  nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")
  
  if(length(nodes) > 0 ) {
   measures <- sapply(nodes, xmlValue)
   
   #Clean up the column name
   measures <- gsub(" *[0-9]*:", "", gsub(" \\(.*?\\)[0-9]*:","", measures))   
   
   #Remove dups
   dups <- which(duplicated(measures))
   #print(dups) 
   for(i in 1:length(dups)) 
     measures[dups[i]] = paste(measures[dups[i]], i, sep=" ")
   
   #use siblings function to get value
   values <- sapply(nodes, function(x)  xmlValue(getSibling(x)))
   
   df <- data.frame(t(values))
   colnames(df) <- measures
   return(df)
  } else {
    break
  }
}

tickers <- c("AAPL")
stats <- ldply(tickers, getKeyStats_xpath)
rownames(stats) <- tickers
write.csv(t(stats), "FinancialStats_updated.csv",row.names=TRUE)  

#######################################################################




Happy Analyzing!
All Things R &
All Things Analytics (http://goo.gl/CBYQI)