That code thing: 2009

Saturday, 31 October 2009

Screen scraping using YQL

Just attended the Cambridge DevDay, which I really enjoyed.

Christian Heilmann talked about the Yahoo Query Language (YQL), a very powerful tool for querying not only Yahoo datasets but arbitrary third-party ones too, with a bit of URL fetching thrown in. The YQL stuff that Christian demo'd was pretty slick, but the bit that really caught my eye was the following little snippet:


select * from html where url="http://finance.yahoo.com/q?s=yhoo" and
xpath='//div[@id="yfi_headlines"]/div[2]/ul/li/a'

Try it for yourself over at the YQL console.
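For anyone who would rather poke at the query from a script than from the console, here is a rough Python sketch of how the request URL could be put together. The endpoint and the `q` and `format` parameters are the ones used in the JSP further down; the `build_yql_url` helper is just a name made up for this example, and nothing here actually hits the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Public YQL endpoint, as used by the JSP example in this post.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_yql_url(page_url, xpath, fmt="xml"):
    """Hypothetical helper: build the GET URL for a `select * from html` query."""
    query = "select * from html where url=\"%s\" and xpath='%s'" % (page_url, xpath)
    # urlencode takes care of escaping the quotes, slashes and spaces in the query.
    return YQL_ENDPOINT + "?" + urlencode({"q": query, "format": fmt})


url = build_yql_url(
    "http://finance.yahoo.com/q?s=yhoo",
    '//div[@id="yfi_headlines"]/div[2]/ul/li/a',
)
print(url)

# Decoding the query string again shows the statement survives the round trip.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```

Letting `urlencode` do the escaping matters here because the statement contains both kinds of quote plus a URL with its own `?`.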

So I thought I'd have a bit of a bash at screen scraping my SO profile page to see if I can show my answer list on my site.


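<%-- JSTL core (c:) and XML (x:) tag libraries used below --%>
<%@ taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<%@ taglib prefix="x" uri="http://java.sun.com/jsp/jstl/xml" %>
<c:import url="http://query.yahooapis.com/v1/public/yql" var="feed">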
 <c:param name="q">select * from html where url="http://stackoverflow.com/users/31480/gid" and
xpath='//div[@class="answer-summary"]'</c:param>
 <c:param name="format" value="xml"/>
</c:import>
<x:parse var="xml">${feed}</x:parse>
<ul class="so-answers">
<x:forEach select="$xml/query/results/div[@class='answer-summary']" var="answer"
                       end="10" >
 <li class="answer">
   <x:set var="votes" select="$answer/div[contains(@class,'answer-votes')]"/>
   <div class="<x:out select="$votes/@class"/>"
        title="<x:out select="$votes/@title"/>">
    <x:out select="$votes"/>
   </div>

   <x:set var="a" select="$answer//a[contains(@class,'answer-hyperlink')]"/>
   <div class="answer-link">
     <a href="http://stackoverflow.com/<x:out select="$a/@href"/>">
       <x:out select="$a"/>
     </a>
   </div>
 </li>
</x:forEach>
</ul>

Check it out running

In theory I could point the import above directly at Stack Overflow, but the rather picky Xerces parser (used under the covers) complains bitterly about DTDs and all that jazz. The YQL fetch has the nice side effect of tidying up any HTML ugliness and spitting out easily parsable XML.
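To show why the tidied XML is so easy to work with, here is a minimal Python sketch of the same extraction using only the standard library. The `SAMPLE_RESPONSE` below is hand-written to mimic the `query/results/div` shape the JSP walks; it is not real Stack Overflow data, and the helper names are invented for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hand-written stand-in for a YQL response; the answers are made up.
SAMPLE_RESPONSE = """
<query>
  <results>
    <div class="answer-summary">
      <div class="answer-votes default" title="total votes">3</div>
      <div class="answer-link">
        <a class="answer-hyperlink" href="/questions/1/first-example">First example</a>
      </div>
    </div>
    <div class="answer-summary">
      <div class="answer-votes answered-accepted" title="accepted">12</div>
      <div class="answer-link">
        <a class="answer-hyperlink" href="/questions/2/second-example">Second example</a>
      </div>
    </div>
  </results>
</query>
"""


def first_with_class(parent, tag, css_class):
    """Return the first <tag> under parent whose class attribute contains css_class."""
    for element in parent.iter(tag):
        if css_class in element.get("class", ""):
            return element
    return None


def extract_answers(xml_text, limit=10):
    """Pull votes, title and absolute URL out of each answer-summary div."""
    root = ET.fromstring(xml_text.strip())
    answers = []
    for summary in root.findall("results/div[@class='answer-summary']")[:limit]:
        votes = first_with_class(summary, "div", "answer-votes")
        link = first_with_class(summary, "a", "answer-hyperlink")
        if votes is None or link is None:
            continue  # skip anything that does not look like an answer
        answers.append({
            "votes": (votes.text or "").strip(),
            "title": (link.text or "").strip(),
            "url": urljoin("http://stackoverflow.com/", link.get("href", "")),
        })
    return answers


answers = extract_answers(SAMPLE_RESPONSE)
for answer in answers:
    print(answer["votes"], answer["title"], answer["url"])
```

ElementTree's path support has no `contains()`, which is why the class matching is done in plain Python rather than in the path expression.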

Tuesday, 2 June 2009

0.10 of logicalpractice-collections released

No really major changes to note. The main reason for the release is a packaging change that makes the library available via its own Maven repo; see the Maven setup instructions.

Saturday, 21 March 2009

Python is just a lovely thing

I've used quite a few dynamic scripting languages over the last couple of years, including Groovy, Ruby and Python, but I keep coming back to Python. I think this time it's due to Peter Butler's (a guy I worked with a while ago) complete love of the language, and I think I'm starting to see why.

Over the last week I've been bashing away at improving the rather outdated www.logicalpractice.com, and it occurred to me that it wouldn't be a bad idea to generate a sitemap XML for Google and the other search bots.

The following code is my solution. I'm sure it's not the best Python in the world, but I do just kinda like the look.


from __future__ import with_statement  # only needed on Python 2.5
import xmlbuilder
import sys
import os
from datetime import datetime
from xml.dom.minidom import parse as parseDom

def url_element(xml, loc, lastmod, changefreq="weekly", priority=0.5):
    # emit one <url> entry; site-relative paths are made absolute
    with xml.url:
        if loc.startswith("http:"):
            xml.loc(loc)
        else:
            xml.loc("http://www.logicalpractice.com%s" % loc)

        xml.lastmod(lastmod.strftime("%Y-%m-%d"))
        xml.changefreq(changefreq)
        xml.priority(priority)

def lastmod(file_name):
    # modification time of a file under the site root
    last_mod = os.path.getmtime(os.path.join(basedir, file_name))
    return datetime.fromtimestamp(last_mod)

# the site root sits two directories above this script
basedir = os.path.join(os.path.dirname(sys.argv[0]), "..", "..")
xml = xmlbuilder.builder(version="1.0", encoding="utf-8")

with xml.urlset(xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"):
    url_element(xml, "/", lastmod("index.jsp"), priority=1.0)
    url_element(xml, "/news.jsp", lastmod("news.jsp"), priority=0.8)
    url_element(xml, "/projects.jsp", lastmod("projects.jsp"), priority=0.5)
    url_element(xml, "/profile.jsp", lastmod("profile.jsp"), priority=0.5)

    # generate elements from the news.rss
    rss = parseDom(os.path.join(basedir, "news.rss"))
    for node in rss.getElementsByTagName("item"):
        link = node.getElementsByTagName("link")[0].firstChild.data
        strdate = node.getElementsByTagName("pubDate")[0].firstChild.data
        date = datetime.strptime(strdate, "%a, %d %b %Y %H:%M:%S +0000")
        url_element(xml, link, date, priority=0.5)

print(xml)

The xmlbuilder used is from Jonas Galvez, via GitHub; it seems a very simple and elegant solution for building XML documents.
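If xmlbuilder isn't to hand, roughly the same document can be put together with the standard library. What follows is only a sketch of that alternative, swapping in `xml.etree.ElementTree` for xmlbuilder; `add_url` and `build_sitemap` are names invented here, and the fixed dates stand in for the real file modification times.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
SITE_ROOT = "http://www.logicalpractice.com"


def add_url(urlset, loc, lastmod, changefreq="weekly", priority=0.5):
    """Append one <url> entry, making site-relative paths absolute."""
    url = ET.SubElement(urlset, "url")
    if not loc.startswith("http:"):
        loc = SITE_ROOT + loc
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod.strftime("%Y-%m-%d")
    ET.SubElement(url, "changefreq").text = changefreq
    ET.SubElement(url, "priority").text = str(priority)
    return url


def build_sitemap(pages):
    """Build the sitemap from (loc, lastmod, priority) tuples and return it as text."""
    # xmlns is set as a plain attribute so the child tags stay unprefixed.
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, priority in pages:
        add_url(urlset, loc, lastmod, priority=priority)
    return ET.tostring(urlset, encoding="unicode")


# Placeholder dates; the real script reads them from the files on disk.
sitemap = build_sitemap([
    ("/", datetime(2009, 3, 21), 1.0),
    ("/news.jsp", datetime(2009, 3, 20), 0.8),
])
print(sitemap)
```

It is wordier than the `with xml.url:` style above, which is a fair part of why the xmlbuilder version is nicer to look at.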

How do I know that Python must be a good thing? Well, anything that I get up at 5 in the morning to code a bit more of before work has to be a good thing.

Wednesday, 11 February 2009

Java assert - what I learnt today

I learnt something new today. I thought the following would be just fine:

File f = new File("foo.txt");
if (f.exists())
    assert f.delete();

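That all looked good to me, right up to the point that I discovered that an assert expression isn't even evaluated if assertions are not enabled (and the JVM leaves them off unless it is started with -ea), so with assertions off the delete() never runs. I had to change it to: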

if (f.exists()) {
    boolean deleted = f.delete();
    assert deleted;
}
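The same trap is not unique to Java: Python also drops assert statements entirely when run with optimisation on. As a rough sketch (not part of the original fix), the snippet below compiles the same one-line assert twice, once with assertions kept and once with them stripped, and records whether the side effect happened. `delete_file` is a made-up stand-in for `File.delete()`.

```python
calls = []


def delete_file():
    """Made-up stand-in for File.delete(): record the call and report success."""
    calls.append("deleted")
    return True


# The work is hidden inside the assert, just like `assert f.delete();`.
SOURCE = "assert delete_file()"


def run(optimize):
    """Run SOURCE with assertions kept (optimize=0) or stripped (optimize=1)."""
    del calls[:]
    code = compile(SOURCE, "<example>", "exec", optimize=optimize)
    exec(code, {"delete_file": delete_file})
    return list(calls)


with_asserts = run(optimize=0)
without_asserts = run(optimize=1)
print(with_asserts)     # the call happened
print(without_asserts)  # the call was silently dropped
```

The cure is the same in both languages: do the work in an ordinary statement and assert only on the stored result.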
