Monthly Archives: January 2010

Query Google Scholar using Python

In desperate need of organizing my collection of scientific papers, I had a look at various tools that could help me organize them. Probably one of the best out there is Mendeley, which seems to be a very good tool for keeping a massive collection of PDFs under control. Unfortunately, a very basic function, namely looking up a newly imported paper in Google Scholar to get attributes like authors, year, etc. right, is tied to a Mendeley account. I guess that’s their way of pushing users to participate in their community features, since without the Google Scholar lookup Mendeley is pretty useless unless you want to fill in all the attributes manually.

So I decided to write my own tool to do the lookup. Unfortunately, Google does not really want to give away that precious data: they don’t provide an API and even block certain User-Agents from accessing the page. Then there is also the problem of scraping the results page to get the right data.

The first problem can be trivially solved by setting a common User-Agent string; the second can be elegantly circumvented by using the BibTeX files provided in the search results. The BibTeX entries are, however, only shown if you have enabled them in the settings, which are stored in a cookie. After a few tries, I figured out that the CF attribute (citation format?) controls which bibliography format is offered on the results page, and that CF=4 corresponds to BibTeX. Generating a fake cookie is easy, but you have to know what must be included. In this case, it looks like a 16-digit hex string as the ID plus the CF attribute is sufficient. The ID is probably supposed to be your own ID, but a randomly generated one also works like a charm.

The resulting cookie looks like this: GSP=ID=762a112b5c765732:CF=4
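As a minimal sketch of that cookie format, written for Python 3 rather than the Python 2 used in the excerpt further down: the helper name make_gsp_cookie and the use of the secrets module are my own choices for illustration, not part of the original tool.

```python
import re
import secrets


def make_gsp_cookie(citation_format=4):
    """Build a GSP cookie value with a random 16-character hex ID.

    make_gsp_cookie is a hypothetical helper; CF=4 is the value the
    post found to correspond to BibTeX.
    """
    # token_hex(8) returns 8 random bytes rendered as 16 hex characters.
    fake_id = secrets.token_hex(8)
    return "GSP=ID=%s:CF=%d" % (fake_id, citation_format)


cookie = make_gsp_cookie()
# Sanity check: same shape as the example cookie shown above.
assert re.fullmatch(r"GSP=ID=[0-9a-f]{16}:CF=4", cookie)
print(cookie)
```

Any source of 16 hex characters should do here, since the post only observed that a random ID is accepted.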

All you have to do now is to query Google Scholar using the User-Agent string and the cookie:


...
import hashlib
import random
import urllib2

# fake Google ID (looks like a 16-character hex string)
google_id = hashlib.md5(str(random.random())).hexdigest()[:16]

GOOGLE_SCHOLAR_URL = "http://scholar.google.com"
HEADERS = {'User-Agent' : 'Mozilla/5.0',
        'Cookie' : 'GSP=ID=%s:CF=4' % google_id }


def query(searchstr):
    """Return a list of bibtex items."""
    searchstr = '/scholar?q='+urllib2.quote(searchstr)
    url = GOOGLE_SCHOLAR_URL + searchstr
    request = urllib2.Request(url, headers=HEADERS)
    response = urllib2.urlopen(request)
    html = response.read()
    # grab the bibtex links
    ...

Google Scholar will then offer you links to the BibTeX files of the results. Getting those links is easy since they all start with "/scholar.bib": just search for them and download the targets.
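The link extraction could be sketched roughly as follows in Python 3. This is not necessarily how the actual tool does it: the function name extract_bibtex_links, the regular expression, and the sample HTML snippet are all assumptions made for illustration, and the real results page may be structured differently.

```python
import html as htmllib
import re

# Assumption: the links appear as href attributes starting with /scholar.bib.
BIB_LINK_RE = re.compile(r'href="(/scholar\.bib[^"]*)"')


def extract_bibtex_links(page, base_url="http://scholar.google.com"):
    """Return absolute URLs for every /scholar.bib link found in page."""
    # HTML attributes escape '&' as '&amp;', so unescape before use.
    return [base_url + htmllib.unescape(path)
            for path in BIB_LINK_RE.findall(page)]


# Made-up snippet standing in for a real results page.
sample = ('<a href="/scholar.bib?q=info:abc&amp;output=citation">'
          'Import into BibTeX</a>')
print(extract_bibtex_links(sample))
```

Each returned URL would then be fetched with the same headers as the search request to obtain the BibTeX entry itself.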

The complete code is available on GitHub. It can be used as a Python library or as a standalone application. Just call it like this: gscolar "some author or title" and it will print the first ten results in BibTeX to stdout.

git bisect, ccache, cowbuilder

Git bisect, ccache and cowbuilder: a combination made in heaven! Tracking down a commit which introduced an ugly bug with those tools was a breeze.

Git bisect is very useful for quickly finding the commit that introduced a bug, and ccache massively reduces compile time. Compiling icedove (thunderbird) on my laptop using cowbuilder takes roughly 30 minutes. Using cowbuilder with ccache, it only takes 10 minutes, most of which is spent setting up the build environment.
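To get a feeling for what that means over a whole bisect session, here is a back-of-the-envelope sketch. It assumes bisecting behaves like a plain binary search (roughly log2 of the number of commits), that every build takes the full 30 or 10 minutes quoted above, and a made-up range of 1000 suspect commits. In practice the first build with an empty cache would not be sped up.

```python
import math


def bisect_steps(num_commits):
    """Approximate number of builds a binary search over num_commits needs."""
    return max(1, math.ceil(math.log2(num_commits)))


def total_minutes(num_commits, minutes_per_build):
    """Total build time if every bisect step needs one full build."""
    return bisect_steps(num_commits) * minutes_per_build


commits = 1000  # hypothetical size of the suspect range
print(bisect_steps(commits))       # number of builds needed
print(total_minutes(commits, 30))  # minutes without ccache
print(total_minutes(commits, 10))  # minutes with ccache
```

Under those assumptions the session needs about ten builds, so the saving adds up to a few hours.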