That code thing: screen scraping using YQL

Just attended the Cambridge DevDay, which I really enjoyed.

Christian Heilmann talked about the Yahoo Query Language (YQL), a very powerful tool for querying not only Yahoo's own datasets but arbitrary third-party ones too, with a bit of URL fetching thrown in. The YQL stuff that Christian demo'd was pretty slick, but the bit that really caught my eye was the following little snippet:


select * from html where url="http://finance.yahoo.com/q?s=yhoo" and
xpath='//div[@id="yfi_headlines"]/div[2]/ul/li/a'

Try it for yourself over at the YQL console.
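
Outside the console, the same statement has to be URL-encoded and passed as the `q` parameter of the public endpoint. Here is a minimal Java sketch of building that request URL. The endpoint and the `format` parameter are the ones used in the JSP further down this post; the class and method names are made up for illustration, and the endpoint may well not be live any more, so the sketch only builds the URL and does not fetch it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class YqlUrlSketch {
    // Endpoint as it appears in this post's <c:import> tag.
    static final String ENDPOINT = "http://query.yahooapis.com/v1/public/yql";

    // Builds a GET URL with the YQL statement in "q" and the response type in "format".
    static String buildUrl(String yql, String format) {
        try {
            return ENDPOINT
                    + "?q=" + URLEncoder.encode(yql, "UTF-8")
                    + "&format=" + URLEncoder.encode(format, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java runtime.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String yql = "select * from html where url=\"http://finance.yahoo.com/q?s=yhoo\" and "
                + "xpath='//div[@id=\"yfi_headlines\"]/div[2]/ul/li/a'";
        System.out.println(buildUrl(yql, "xml"));
    }
}
```

`URLEncoder` uses form encoding, so spaces come out as `+` and the quotes, colons and slashes in the statement are percent-encoded, which keeps the nested URL and XPath from breaking the query string.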

So I thought I'd have a bit of a bash at screen scraping my SO profile page to see if I can show my answer list on my site.


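<%-- JSTL core and XML tag libraries, needed for the c: and x: tags below --%>
<%@ taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<%@ taglib prefix="x" uri="http://java.sun.com/jsp/jstl/xml" %>
<c:import url="http://query.yahooapis.com/v1/public/yql" var="feed">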
 <c:param name="q">select * from html where url="http://stackoverflow.com/users/31480/gid" and
xpath='//div[@class="answer-summary"]'</c:param>
 <c:param name="format" value="xml"/>
</c:import>
<x:parse var="xml">${feed}</x:parse>
<ul class="so-answers">
<x:forEach select="$xml/query/results/div[@class='answer-summary']" var="answer"
                       end="10" >
 <li class="answer">
   <x:set var="votes" select="$answer/div[contains(@class,'answer-votes')]"/>
   <div class="<x:out select="$votes/@class"/>"
        title="<x:out select="$votes/@title"/>">
    <x:out select="$votes"/>
   </div>

   <x:set var="a" select="$answer//a[contains(@class,'answer-hyperlink')]"/>
   <div class="answer-link">
     <a href="http://stackoverflow.com/<x:out select="$a/@href"/>">
       <x:out select="$a"/>
     </a>
   </div>
 </li>
</x:forEach>
</ul>
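
For anyone not using JSP, roughly the same extraction can be sketched in plain Java with the standard `javax.xml.xpath` API. The XPath expressions mirror the ones in the tags above. The sample XML is a hand-written stand-in for a YQL response, not real data, and the class and method names are made up for illustration. It also assumes the scraped `href` values start with a slash.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AnswerListSketch {
    // Hypothetical stand-in for the XML that YQL returns; not real data.
    static final String SAMPLE =
          "<query><results>"
        + "<div class=\"answer-summary\">"
        + "<div class=\"answer-votes default\" title=\"3 votes\">3</div>"
        + "<div><a class=\"answer-hyperlink\" href=\"/questions/1/example\">Example question</a></div>"
        + "</div>"
        + "</results></query>";

    // Returns one "votes | link text | absolute URL" line per answer summary.
    static List<String> summarize(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList answers = (NodeList) xpath.evaluate(
                "/query/results/div[@class='answer-summary']",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);

        List<String> lines = new ArrayList<>();
        for (int i = 0; i < answers.getLength(); i++) {
            Node answer = answers.item(i);
            // Relative expressions are evaluated against the current answer node.
            String votes = xpath.evaluate("div[contains(@class,'answer-votes')]", answer).trim();
            String text = xpath.evaluate(".//a[contains(@class,'answer-hyperlink')]", answer).trim();
            String href = xpath.evaluate(".//a[contains(@class,'answer-hyperlink')]/@href", answer);
            // Assumes href is site-relative and begins with "/".
            lines.add(votes + " | " + text + " | http://stackoverflow.com" + href);
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        for (String line : summarize(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

With the sample above this prints `3 | Example question | http://stackoverflow.com/questions/1/example`.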

Check it out running

In theory I could point the import above directly at Stack Overflow, but the rather picky Xerces parser (used under the covers by the XML parsing tags) complains bitterly about DTDs and all that jazz. The YQL fetch has the nice side effect of tidying up any HTML ugliness and spitting out easily parsable XML.

Saturday, 31 October 2009
