Saturday, 31 October 2009

Screen scraping using YQL

Just attended the Cambridge DevDay, which I really enjoyed.

Christian Heilmann talked about the Yahoo Query Language (YQL), a very powerful tool for querying not only Yahoo's datasets but arbitrary third-party ones too, with a bit of URL fetching thrown in. The YQL stuff that Christian demo'd was pretty slick, but the bit that really caught my eye was the following little snippet:

select * from html where url="http://finance.yahoo.com/q?s=yhoo" and
xpath='//div[@id="yfi_headlines"]/div[2]/ul/li/a'


Try it for yourself over at the YQL console.

So I thought I'd have a bit of a bash at screen scraping my SO profile page to see if I can show my answer list on my site.


<%@ taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<%@ taglib prefix="x" uri="http://java.sun.com/jsp/jstl/xml" %>
<c:import url="http://query.yahooapis.com/v1/public/yql" var="feed">
  <c:param name="q">select * from html where url="http://stackoverflow.com/users/31480/gid" and
    xpath='//div[@class="answer-summary"]'</c:param>
  <c:param name="format" value="xml"/>
</c:import>
<x:parse var="xml">${feed}</x:parse>
<ul class="so-answers">
  <x:forEach select="$xml/query/results/div[@class='answer-summary']" var="answer"
             end="10" >
    <li class="answer">
      <x:set var="votes" select="$answer/div[contains(@class,'answer-votes')]"/>
      <div class="<x:out select="$votes/@class"/>"
           title="<x:out select="$votes/@title"/>">
        <x:out select="$votes"/>
      </div>

      <x:set var="a" select="$answer//a[contains(@class,'answer-hyperlink')]"/>
      <div class="answer-link">
        <a href="http://stackoverflow.com/<x:out select="$a/@href"/>">
          <x:out select="$a"/>
        </a>
      </div>
    </li>
  </x:forEach>
</ul>

Check it out running

In theory I could point the above statement directly at Stack Overflow, but the rather picky Xerces parser (used under the covers by the JSTL XML tags) complains bitterly about DTDs and all that jazz. The YQL fetch has the nice side effect of tidying up any HTML ugliness and spitting out easily parsable XML.
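For anyone who would rather do the same thing outside a JSP, here is a rough Python 3 sketch of the two steps: building the YQL request URL and picking the answers out of the XML that comes back. The endpoint and the `q` and `format` parameters are the ones used in the JSP above; the helper names (`build_yql_url`, `extract_answers` and friends) and the `SAMPLE` response are made up for illustration, and the sample only mimics the shape the XPath expressions above imply, so treat it as an assumption rather than real service output. Nothing here touches the network.

```python
from urllib.parse import urlencode, urljoin
import xml.etree.ElementTree as ET

# Endpoint and parameter names (q, format) are taken from the JSP above.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_yql_url(page_url, xpath, fmt="xml"):
    """Return the GET URL for a 'select * from html' YQL query."""
    query = "select * from html where url=\"%s\" and xpath='%s'" % (page_url, xpath)
    return YQL_ENDPOINT + "?" + urlencode({"q": query, "format": fmt})


def text_of(element):
    """Concatenate all the text found inside an element."""
    return "".join(element.itertext()).strip()


def first_with_class(parent, tag, css_class):
    """First descendant <tag> whose class attribute contains css_class."""
    for element in parent.iter(tag):
        if css_class in element.get("class", ""):
            return element
    return None


def extract_answers(yql_xml, site="http://stackoverflow.com", limit=10):
    """Turn a YQL-style XML document into a list of answer dicts."""
    root = ET.fromstring(yql_xml)  # the root is the <query> element
    answers = []
    for summary in root.findall("results/div[@class='answer-summary']")[:limit]:
        votes = first_with_class(summary, "div", "answer-votes")
        link = first_with_class(summary, "a", "answer-hyperlink")
        if votes is None or link is None:
            continue  # skip anything that doesn't look like an answer
        answers.append({
            "votes": text_of(votes),
            "title": text_of(link),
            "url": urljoin(site, link.get("href", "")),
        })
    return answers


# Hypothetical response, hand-written to match the XPath used above.
SAMPLE = """<query><results>
  <div class="answer-summary">
    <div class="answer-votes default" title="3 votes">3</div>
    <div class="answer-link">
      <a class="answer-hyperlink" href="/questions/1/example">Example question</a>
    </div>
  </div>
</results></query>"""

print(build_yql_url("http://stackoverflow.com/users/31480/gid",
                    '//div[@class="answer-summary"]'))
print(extract_answers(SAMPLE))
```

The class matching is done in plain Python because ElementTree's XPath subset has no `contains()`, which is the one place this departs from the JSTL version.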

Tuesday, 2 June 2009

0.10 of logicalpractice-collections released

No really major changes to note. The main reason for the release is a packaging change that makes the library available via its own Maven repo; see the Maven setup instructions.

Saturday, 21 March 2009

Python is just a lovely thing

I've used quite a few dynamic scripting languages over the last couple of years including groovy, ruby and python, but I keep coming back to python. I think this time it's due to Peter Butler's (a guy I worked with a while ago) complete love of the language and I think I'm starting to see why.


Over the last week I've been bashing away at improving the rather outdated www.logicalpractice.com, and it occurred to me that it wouldn't be a bad idea to generate a sitemap XML file for Google and the other search bots.


The following code is my solution. I'm sure it's not the best Python in the world, but I do just kinda like the look.




from __future__ import with_statement  # only needed on Python 2.5
import xmlbuilder
import sys
import os
from datetime import datetime
from xml.dom.minidom import parse as parseDom
from xml.dom.minidom import Node

def url_element(xml, loc, lastmod, changefreq="weekly", priority=0.5):
    # Emit one <url> entry; site-relative paths get the site prefix.
    with xml.url:
        if loc.startswith("http:"):
            xml.loc(loc)
        else:
            xml.loc("http://www.logicalpractice.com%s" % loc)

        xml.lastmod(lastmod.strftime("%Y-%m-%d"))
        xml.changefreq(changefreq)
        xml.priority(priority)

def lastmod(file_name):
    # Last modification time of a file under the site root.
    global basedir
    last_mod = os.path.getmtime(os.path.join(basedir, file_name))
    return datetime.fromtimestamp(last_mod)

# The site root is two directories up from this script.
basedir = os.path.join(os.path.dirname(sys.argv[0]), "..", "..")
xml = xmlbuilder.builder(version="1.0", encoding="utf-8")

with xml.urlset(xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"):
    url_element(xml, "/", lastmod("index.jsp"), priority=1.0)
    url_element(xml, "/news.jsp", lastmod("news.jsp"), priority=0.8)
    url_element(xml, "/projects.jsp", lastmod("projects.jsp"), priority=0.5)
    url_element(xml, "/profile.jsp", lastmod("profile.jsp"), priority=0.5)

    # generate elements from the news.rss
    rss = parseDom(os.path.join(basedir, "news.rss"))
    for node in rss.getElementsByTagName("item"):
        link = node.getElementsByTagName("link")[0].firstChild.data
        strdate = node.getElementsByTagName("pubDate")[0].firstChild.data
        date = datetime.strptime(strdate, "%a, %d %b %Y %H:%M:%S +0000")
        url_element(xml, link, date, priority=0.5)

print(xml)  # parenthesised form works on Python 2 and 3

The xmlbuilder used is from Jonas Galvez via GitHub; it seems a very simple and elegant solution for building XML documents.
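If you don't have xmlbuilder to hand, roughly the same sitemap can be put together with the standard library. The sketch below swaps in `xml.etree.ElementTree` under Python 3, so it is not the xmlbuilder API and not the script above; `add_url` and `build_sitemap` are names invented here, and the page list is passed in with explicit dates rather than read from file modification times.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

SITE = "http://www.logicalpractice.com"
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def add_url(urlset, loc, lastmod, changefreq="weekly", priority=0.5):
    """Append one <url> entry; site-relative paths get the site prefix."""
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc if loc.startswith("http:") else SITE + loc
    ET.SubElement(url, "lastmod").text = lastmod.strftime("%Y-%m-%d")
    ET.SubElement(url, "changefreq").text = changefreq
    ET.SubElement(url, "priority").text = str(priority)
    return url


def build_sitemap(pages):
    """pages is an iterable of (loc, lastmod datetime, priority) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, priority in pages:
        add_url(urlset, loc, lastmod, priority=priority)
    return ET.tostring(urlset, encoding="unicode")


# Example dates are made up for the demonstration.
example_pages = [
    ("/", datetime(2009, 3, 21), 1.0),
    ("/news.jsp", datetime(2009, 3, 20), 0.8),
]
print(build_sitemap(example_pages))
```

It is wordier than the `with` blocks above, which is a fair part of why I like the xmlbuilder version.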



How do I know that Python must be a good thing? Well, anything that gets me up at 5 in the morning to code a bit more before work has to be a good thing.

Wednesday, 11 February 2009

Java assert - what I learnt today

I learnt something new today. I thought the following would be just fine:

File f = new File("foo.txt");

if( f.exists() )
    assert f.delete();

That all looked good to me, right up to the point that I discovered that an assert expression isn't even evaluated if asserts are not enabled (which is the JVM default; they are switched on with -ea), so the file never gets deleted. I had to change it to:

if( f.exists() ){
    boolean deleted = f.delete();
    assert deleted;
}