Christian Heilmann talked about the Yahoo Query Language (YQL), a very powerful tool for querying not only Yahoo's own datasets but arbitrary third-party ones as well, plus a bit of URL fetching. The YQL stuff that Christian demoed was pretty slick, but the bit that really caught my eye was the following little snippet:
select * from html where url="http://finance.yahoo.com/q?s=yhoo" and
xpath='//div[@id="yfi_headlines"]/div[2]/ul/li/a'
Try it for yourself over at the YQL console.
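For anyone who would rather call YQL from code than from the console, here is a minimal Java sketch of how such a statement might be turned into a request URL. It assumes the public endpoint and the q and format parameters used in the JSP further down in this post. The YqlUrl class and its method names are hypothetical helpers of mine, not part of any Yahoo library, and the URLEncoder overload used here needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of any Yahoo library.
final class YqlUrl {
    // Public YQL endpoint, as used by the <c:import> in this post.
    static final String ENDPOINT = "http://query.yahooapis.com/v1/public/yql";

    // Builds a "select * from html" statement for a page URL and an XPath expression.
    // No escaping is done here, so the XPath must not contain single quotes.
    static String htmlQuery(String pageUrl, String xpath) {
        return "select * from html where url=\"" + pageUrl + "\" and xpath='" + xpath + "'";
    }

    // URL-encodes the statement and appends it as the "q" parameter.
    static String requestUrl(String yql, String format) {
        return ENDPOINT
                + "?q=" + URLEncoder.encode(yql, StandardCharsets.UTF_8)
                + "&format=" + URLEncoder.encode(format, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String yql = htmlQuery(
                "http://finance.yahoo.com/q?s=yhoo",
                "//div[@id=\"yfi_headlines\"]/div[2]/ul/li/a");
        // Prints the URL that a plain HTTP GET would be issued against.
        System.out.println(requestUrl(yql, "xml"));
    }
}
```

The point of the encoding step is that the statement is full of quotes, slashes and spaces, none of which can go into a query string as-is.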
So I thought I'd have a bit of a bash at screen scraping my Stack Overflow profile page to see if I can get my answer list onto my site.
<%-- JSTL core and XML tag libraries used below --%>
<%@ taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<%@ taglib prefix="x" uri="http://java.sun.com/jsp/jstl/xml" %>
<%-- Ask YQL to fetch the profile page and return the matching divs as XML --%>
<c:import url="http://query.yahooapis.com/v1/public/yql" var="feed">
  <c:param name="q">select * from html where url="http://stackoverflow.com/users/31480/gid" and
xpath='//div[@class="answer-summary"]'</c:param>
  <c:param name="format" value="xml"/>
</c:import>
<x:parse var="xml">${feed}</x:parse>
<ul class="so-answers">
  <x:forEach select="$xml/query/results/div[@class='answer-summary']" var="answer"
             end="10">
    <li class="answer">
      <%-- Vote count, copying the class and title attributes from the scraped div --%>
      <x:set var="votes" select="$answer/div[contains(@class,'answer-votes')]"/>
      <div class="<x:out select="$votes/@class"/>"
           title="<x:out select="$votes/@title"/>">
        <x:out select="$votes"/>
      </div>
      <%-- Link back to the answer on Stack Overflow --%>
      <x:set var="a" select="$answer//a[contains(@class,'answer-hyperlink')]"/>
      <div class="answer-link">
        <a href="http://stackoverflow.com/<x:out select="$a/@href"/>">
          <x:out select="$a"/>
        </a>
      </div>
    </li>
  </x:forEach>
</ul>
Check it out running
In theory I could actually point the above <c:import> statement directly at Stack Overflow, but the rather picky Xerces parser (used under the covers of the <x:parse> tag) complains bitterly about DTDs and all that jazz. The YQL fetch has the nice side effect of tidying up any HTML ugliness and spitting out easily parsable XML.
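To give a feel for how easy the tidied output is to work with outside of JSTL, here is a rough Java sketch that pulls the answer titles out of a response using the standard javax.xml.xpath API. The sample XML in main is made up for illustration and only assumes the query/results/div shape that the x:forEach above relies on. AnswerLinks and titles are hypothetical names of mine.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
final class AnswerLinks {
    // Same selection as the JSP: answer-summary divs, then the answer-hyperlink anchor inside each.
    private static final String LINKS =
            "/query/results/div[@class='answer-summary']//a[contains(@class,'answer-hyperlink')]";

    // Returns the trimmed link text of every matching anchor, in document order.
    static List<String> titles(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList links = (NodeList) xpath.evaluate(
                    LINKS, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                out.add(links.item(i).getTextContent().trim());
            }
            return out;
        } catch (XPathExpressionException e) {
            // Also raised when the input is not well-formed XML.
            throw new IllegalStateException("Could not read the response", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample in the assumed response shape; not real Stack Overflow data.
        String sample = "<query><results>"
                + "<div class=\"answer-summary\"><div class=\"answer-votes\">3</div>"
                + "<a class=\"answer-hyperlink\" href=\"/questions/1\">First answer</a></div>"
                + "<div class=\"answer-summary\"><div class=\"answer-votes\">1</div>"
                + "<a class=\"answer-hyperlink\" href=\"/questions/2\">Second answer</a></div>"
                + "</results></query>";
        for (String title : titles(sample)) {
            System.out.println(title);
        }
    }
}
```

Feeding the raw Stack Overflow HTML to the same method would most likely fail in the parser, which is exactly the problem the YQL round trip sidesteps.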