Posts Tagged ‘python’

Query Google Scholar using Python

Wednesday, January 27th, 2010

In desperate need of organizing my collection of scientific papers, I had a look at various tools that could help me organize them. Probably one of the best out there is Mendeley. Mendeley seems to be a very good tool for keeping a massive collection of PDFs under control. Unfortunately, a very basic function, namely looking up a newly imported paper in Google Scholar to get attributes like authors, year, etc. right, is tied to a Mendeley account. I guess that’s their way of forcing users to participate in their community features, since without the Google Scholar lookup Mendeley is pretty useless unless you want to fill in all the attributes manually.

So I decided to write my own tool to make the lookup. Unfortunately Google does not really want to give away that precious data: they don’t provide an API and even block certain User-Agents from accessing the page. Then, there is also the problem of scraping the results page to get the right data.

The first problem can be trivially solved by setting a common User-Agent string; the second one can be elegantly circumvented by using the bibtex files provided in the search results. However, the bibtex entries are only shown if you have enabled them in the settings, which are stored in a cookie. After a few tries, I figured out that the CF attribute (citation format?) controls which bibliography format is offered on the results page, and that CF=4 corresponds to bibtex. Generating a fake cookie is easy, but you have to know what must be included. In this case it looks like a 16-digit hex string as ID plus the CF attribute is sufficient. The ID is probably supposed to be your own ID, but a randomly generated one also works like a charm.

The resulting cookie looks like this: GSP=ID=762a112b5c765732:CF=4

All you have to do now is query Google Scholar using the User-Agent string and the cookie:


...
# fake google id (looks like it is a 16 elements hex)
google_id = hashlib.md5(str(random.random())).hexdigest()[:16]

GOOGLE_SCHOLAR_URL = "http://scholar.google.com"
HEADERS = {'User-Agent': 'Mozilla/5.0',
           'Cookie': 'GSP=ID=%s:CF=4' % google_id}

def query(searchstr):
    """Return a list of bibtex items."""
    searchstr = '/scholar?q=' + urllib2.quote(searchstr)
    url = GOOGLE_SCHOLAR_URL + searchstr
    request = urllib2.Request(url, headers=HEADERS)
    response = urllib2.urlopen(request)
    html = response.read()
    # grab the bibtex links
    …

And Google Scholar will offer you links to the bibtex files of the results. Getting those links is easy since they all start with "/scholar.bib". Just search for those and download the targets.
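The two steps described in this post can be sketched roughly as follows. This is a Python 3 sketch (urllib2 no longer exists there), not the code from the github repository. The function names build_request and extract_bibtex_links, the use of secrets.token_hex for the random ID, the href="..." pattern and the sample HTML are all assumptions for illustration. Only the cookie format and the "/scholar.bib" prefix come from the text above.

```python
import html
import re
import secrets
import urllib.parse
import urllib.request

GOOGLE_SCHOLAR_URL = "http://scholar.google.com"
# Only the "/scholar.bib" prefix comes from the post; the surrounding
# href="..." markup is an assumption about the results page.
BIBTEX_HREF = re.compile(r'href="(/scholar\.bib[^"]*)"')


def build_request(searchstr):
    """Build (but do not send) a request carrying the fake GSP cookie."""
    google_id = secrets.token_hex(8)  # 16 random hex digits
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Cookie": "GSP=ID=%s:CF=4" % google_id,
    }
    url = GOOGLE_SCHOLAR_URL + "/scholar?q=" + urllib.parse.quote(searchstr)
    return urllib.request.Request(url, headers=headers)


def extract_bibtex_links(page):
    """Return absolute URLs for all links that start with /scholar.bib."""
    links = []
    for match in BIBTEX_HREF.finditer(page):
        # href values in HTML are entity-encoded (&amp;), so decode them
        links.append(GOOGLE_SCHOLAR_URL + html.unescape(match.group(1)))
    return links


# Made-up HTML fragment standing in for a results page
sample = (
    '<a href="/scholar.bib?q=info:abc:scholar.google.com/&amp;output=citation">'
    'Import into BibTeX</a>'
    '<a href="/scholar?q=related:abc">Related articles</a>'
)
print(build_request("some title").full_url)
print(extract_bibtex_links(sample))
```

Passing the request to urllib.request.urlopen and feeding the decoded response into extract_bibtex_links would complete the round trip. That part is left out here so the sketch runs without network access.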

The complete code is available on github. It can be used as a Python library or as a standalone application. You just call it like this: gscolar "some author or title" and it will print the first ten results in bibtex to stdout.

The sorry state of Python in Debian

Wednesday, December 2nd, 2009

Looking at the sorry state of Python in Debian makes me wonder if we shouldn’t enforce team maintenance of packages above a certain popularity/importance/whatever threshold. People have worked hard over the last months to fix any bugs that would prevent Python 2.6 from landing in unstable, and yet nothing happens. Time passes and we will eventually end up with Squeeze having a horribly outdated Python version.

Python2.6 Blockers

Saturday, November 7th, 2009

Today’s work: 5 lazy NMUs (thanks again Kumar). That leaves us with only five open python2.6 blockers to fix and a whopping 62 closed ones.

reportbug-ng has localization support again

Sunday, October 25th, 2009

After I ported reportbug-ng from PyQt3 to PyQt4 over a year ago, reportbug-ng lost its localization, since the gettext-based translations were incompatible with Qt4’s translation system.

This weekend I finally had the time to have a closer look at this problem. To make a long story short: I have ported the gettext-based system to Qt4’s system. All the old .po files were converted to .ts files, but almost all strings are marked as “obsolete” so that they don’t appear in the translated program. But since they are still available in the .ts file, it is easy to get the translations up to date. So far only English and German are complete, but eventually other translations will be added.

By the way, PyQt4 makes it really hard to get non-Qt strings translated.

Python 2.6 Transition

Saturday, October 17th, 2009

Today I NMUed over a dozen Python packages with bugs that blocked the Python 2.6 transition.

I really want to thank Kumar Appaiah for his work. He provided patches for all the bugs I NMUed today and lots more. I did not do much more than apply, test and upload his patches, but Kumar probably invested days of labor to create and test them. Thanks to his effort, the number of 2.6 blockers shrank considerably, so that we now have about 15 open blockers and about 50 closed ones!

python-debianbts 1.0 uploaded to unstable

Saturday, October 10th, 2009

Today I was working all day on python-debianbts 1.0 and uploaded it to unstable a few minutes ago. This version breaks backwards compatibility with previous versions. I removed lots of unneeded old cruft like the HTMLStripper class needed ages ago when I was still using HTML instead of debbugs’ SOAP interface.

A new method get_usertag(email, *tags) was introduced. It returns a dict containing usertag-buglist mappings. If tags are given the dict is limited to matching tags, otherwise all available tags of the given user are returned:

In [1]: import debianbts as bts

In [2]: bts.get_usertag("debian-python@lists.debian.org")
Out[2]:
{'dist-packages': [547838, 547832, ..., 547858],
 'dpmt-todo': [332913],
 'policy': [373301, 373302, ..., 377089],
 'python-oldnum': [478467, 478442, ..., 478441],
 'python2.1': [351108, 351110, ..., 351131],
 'python2.2': [351108, 351109, ..., 351161],
 'python2.6': [547838, 547832, ..., 547858]}

In [3]: bts.get_usertag("debian-python@lists.debian.org", "python2.1", "python2.2")
Out[3]:
{'python2.1': [351108, 351110, ..., 351131],
 'python2.2': [351108, 351109, ..., 351161]}

get_bug_log(nr) now returns a list of dicts with the keys: header (string), body (string), msg_num (int) and attachments (list). Before 1.0 it returned a list of Buglog objects.
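To illustrate the new return shape, here is a small Python 3 sketch that walks over such a list of dicts. The sample entries are made up and only mimic the four keys named above. The helper name summarize is hypothetical and not part of python-debianbts.

```python
# Made-up sample in the shape described above; real entries would come
# from bts.get_bug_log(nr).
sample_log = [
    {"header": "Subject: example report\n",
     "body": "First message.\nMore text.",
     "msg_num": 5, "attachments": []},
    {"header": "Subject: Re: example report\n",
     "body": "A reply.",
     "msg_num": 10, "attachments": []},
]


def summarize(log):
    """Return (msg_num, first body line) pairs, one per log entry."""
    summary = []
    for entry in log:
        lines = entry["body"].splitlines()
        first_line = lines[0] if lines else ""
        summary.append((entry["msg_num"], first_line))
    return summary


for msg_num, first_line in summarize(sample_log):
    print(msg_num, first_line)
```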

The Bugreport class now supports all the information provided by the SOAP interface. I tried to stay as close as possible to the data SOAP provides, so I renamed existing attributes (like Bugreport.nr, which is not supported by SOAP and is now Bugreport.bug_num) and also added the quirky ones like id and bug_nr, found and found_versions, keywords and tags, fixed and fixed_date, which always seem to provide the same data.

Instead of the Bugreport.value() method, which provided a number representing the openness (in terms of open, closed and archived) and urgency (like grave, important, …) to make bugreports sortable by their status, the Bugreport class now has a __cmp__ method which makes bugreports comparable. The more open and urgent a bug is, the greater it is. Openness always beats urgency (e.g. an open wishlist bug is greater than a closed grave one).
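The ordering rule can be sketched as follows. This is not the code from python-debianbts. The class Bug, the two rank tables and the assumption that archived ranks below closed are made up for illustration. The sketch also uses Python 3's rich comparisons (via functools.total_ordering) instead of __cmp__, which Python 3 no longer supports.

```python
from functools import total_ordering

# Higher rank means "greater". Both tables are assumptions for this sketch.
OPENNESS = {"archived": 0, "closed": 1, "open": 2}
SEVERITY = {"wishlist": 0, "minor": 1, "normal": 2, "important": 3,
            "serious": 4, "grave": 5, "critical": 6}


@total_ordering
class Bug:
    """Hypothetical stand-in for a bug report with a status and a severity."""

    def __init__(self, bug_num, status, severity):
        self.bug_num = bug_num
        self.status = status
        self.severity = severity

    def _key(self):
        # Tuples compare element by element, so openness is compared first
        # and severity only breaks ties: openness always beats urgency.
        return (OPENNESS[self.status], SEVERITY[self.severity])

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()


bugs = [Bug(1, "closed", "grave"), Bug(2, "open", "wishlist"),
        Bug(3, "open", "important")]
# Most open and urgent first
print([b.bug_num for b in sorted(bugs, reverse=True)])
```

Comparing a tuple of ranks keeps the "openness first, urgency second" rule in one place, so sorting and all comparison operators follow it automatically.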

While pre 1.0 versions of python-debianbts more or less served the needs of reportbug-ng, it now tries to stay as close as possible to the data provided by SOAP. As a result many parts of reportbug-ng had to be fixed for the new version. I hope this makes python-debianbts more attractive for other projects dealing with Debian’s bug tracker. As always: python-debianbts is on github and forks, patches or other kinds of collaboration are very welcome.

For the curious, here is a little quickstart. It shows how to get all important bugs of reportbug-ng and prints the bug number and summary:

# Get all important bugs of reportbug-ng (returns a list of integers)
bugnrlist = bts.get_bugs("package", "reportbug-ng", "severity", "important")

bugnrlist
[548871, 439203, 542759]

# Get the actual bugreports (returns a list of Bugreport-objects)
bugs = bts.get_status(bugnrlist)

for bug in bugs:
    print bug.bug_num, bug.subject

542759 [reportbug-ng] Erroneously reports nothing or repeats previous package’s report
439203 Doesn’t give any explanations of the severities and what they mean
548871 reportbug-ng: does not check for newer versions before reporting a bug

Please help to complete python-debianbts

Sunday, September 20th, 2009

I’m currently working on an updated version of python-debianbts, a Python interface to Debian’s bug tracker. The goal is to equip the Bugreport class with all available attributes delivered by the SOAP interface. The problem is that for some attributes it is not quite clear what data they provide and in which data type they are wrapped. For example, the mergedwith attribute should be a list of bug numbers, but it seems to be a single integer when merged with one bug and an empty string when the bug is not merged at all. Some attributes have ambiguous names and it’s hard to guess what they mean; for example, there is an id and a bug_nr and both seem to contain the same information.
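One way to cope with the mergedwith inconsistency on the client side is to normalize whatever arrives into a list. The following Python 3 sketch does this. The function name normalize_mergedwith is hypothetical. How several merged bugs are actually delivered is not known from the observations above, so the whitespace-separated string and list cases are assumptions.

```python
def normalize_mergedwith(value):
    """Return the merged bug numbers as a list of ints.

    Handles the two observed shapes (empty string when not merged, single
    integer when merged with one bug) and, as an assumption, a
    whitespace-separated string or a sequence for several bugs.
    """
    if value is None or value == "":
        return []
    if isinstance(value, int):
        return [value]
    if isinstance(value, str):
        return [int(part) for part in value.split()]
    return [int(item) for item in value]


print(normalize_mergedwith(""))
print(normalize_mergedwith(123456))
print(normalize_mergedwith("123456 123457"))
```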

There is a git branch for this task and a wiki page collecting all available information. If you have some experience with SOAP, our BTS and some time to kill, your help would be appreciated.

SO_REUSEADDR

Wednesday, July 22nd, 2009

Dear Lazyweb,

I have a simple test application: a TCP/IP server listens for incoming connections, reads the data and closes the connection again, while a client opens connections to the server, sends a packet and closes the connection, as fast as it can.

The server looks like this:

    import select
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.setblocking(False)
    sock.bind(("", 12347))
    sock.listen(1)

    slist = [sock]
    # use select to poll the sockets
    while 1:
        l = select.select(slist, [], [])
        for i in l[0]:
            conn, addr = i.accept()
            data = ""
            # read until the client closes its side of the connection
            while 1:
                tmp = conn.recv(1024)
                if not tmp:
                    break
                data += tmp
            conn.shutdown(socket.SHUT_RDWR)
            conn.close()

The Client:

    import socket

    # Open a connection, send data and close the connection as fast as possible
    while 1:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.connect(("", 12347))
        sock.send("foo")
        sock.shutdown(socket.SHUT_RDWR)
        sock.close()

The problem with this application: after roughly 25,000 iterations the client quits with a friendly:

error: (99, 'Cannot assign requested address')

Netstat shows the problem: roughly 25,000 of these:

...
tcp        0      0 localhost:56946   localhost:12347         TIME_WAIT
tcp        0      0 localhost:47163   localhost:12347         TIME_WAIT
tcp        0      0 localhost:42758   localhost:12347         TIME_WAIT
...

I’m not a TCP/IP expert, but I thought SO_REUSEADDR was supposed to address this problem by allowing the reuse of those as-good-as-closed connections in TIME_WAIT state. Or not? So why does it fail in my test application?