Managing pdf libraries

I have quite a number of pdf documents on my hard drive: lecture notes distributed by professors, my own scribe notes, problem sets, interesting papers saved for later reading and so on. With a large collection of anything, management becomes a significant cost. I first tried a hierarchical folder structure that reflected each file’s function (e.g. My notes>year>semester>class). As non-class files accumulated, keeping up that structure became a problem. There is also plain laziness: when I download a file, it goes into my downloads directory and I don’t necessarily want to rename and move it at that instant. In addition, many files belong in more than one category (by journal, topic, author, period etc.) and a simple hierarchy can’t express that. Maintaining multiple copies of files was not an option.

After thinking about what exactly I wanted to get out of a pdf manager, I realised that the equivalent problem for music files had already been (elegantly and repeatedly) solved, through embedded metadata and playlist managers. The more I thought about it, the more excited I got about an application providing mp3-style management in the pdf realm. Having multiple playlists and being able to manage files irrespective of their actual location on the hard disk would do wonders to ease my problem. I was convinced that there must be an implementation of this out there since, after all, the logic had already been worked out.

There is already standard pdf metadata, which stores author, creation date, title, creator and keywords. In terms of library managers, I came across one that works with PubMed, a Mac-only one, and Adobe’s collection manager, which only works with the professional version of Acrobat. None of these would work for me. Someone suggested using iTunes to manage pdfs, but that would not provide a way to edit the file metadata. After some hours of online searching, I gave up on finding a ready product (within my budget) that would fit my needs. It seemed I would have to write my own application for this.

On the plus side, this project would give me a good introduction to working with particular file types, exploring the filesystem, building a user interface and managing file metadata. In the initial design, I noted the basic functionality I wanted, which would form the core program, and the add-ons that could be provided as plugins. For the basics, I needed three components.

  1. the metadata manager – this reads and writes metadata in the pdf files. It will work either on one file at a time or in a weak batch mode (where all the specified files get the same value in a given metadata field, e.g. Author). This replicates the edit functionality of mp3 metadata.
  2. the library manager – this is the main UI and it displays the files currently in the library. It will provide a way for users to add a folder to the library, add individual files, create reading lists, add items to reading lists etc. It basically replicates the management tasks of a music library manager.
  3. the file viewer – this replicates the ‘play’ functionality of music managers. I will not actually write this since there are many robust options available, but the library manager has to make sure that double-clicking a file opens it in a pdf viewer program. I could bundle a light viewer (Foxit) with distributions of the pdf manager.
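
As a rough sketch of how the first two components could fit together, here is a minimal in-memory model in Python. Everything in it (the names `PdfEntry`, `Library`, `batch_set` and so on) is hypothetical and only illustrates the playlist idea: files are tracked by path wherever they live, a file can sit in several reading lists without being copied, and the weak batch mode sets one field to the same value on several files. It keeps metadata in memory only; actually reading and writing the metadata inside the pdf files would need a pdf library and is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class PdfEntry:
    """One file in the library. Metadata lives in memory in this sketch."""
    path: Path
    metadata: dict[str, str] = field(default_factory=dict)


class Library:
    """Tracks files by path and groups them into named reading lists."""

    def __init__(self) -> None:
        self.entries: dict[Path, PdfEntry] = {}
        self.reading_lists: dict[str, list[Path]] = {}

    def add_file(self, path: str | Path) -> PdfEntry:
        # Adding the same path twice returns the existing entry.
        p = Path(path)
        return self.entries.setdefault(p, PdfEntry(p))

    def add_folder(self, folder: str | Path) -> list[PdfEntry]:
        # Recursively pick up every *.pdf file under the folder.
        return [self.add_file(p) for p in sorted(Path(folder).rglob("*.pdf"))]

    def add_to_reading_list(self, name: str, path: str | Path) -> None:
        # A reading list stores references to files, never copies.
        entry = self.add_file(path)
        items = self.reading_lists.setdefault(name, [])
        if entry.path not in items:
            items.append(entry.path)

    def batch_set(self, paths: list[str | Path], field_name: str, value: str) -> None:
        # Weak batch mode: every listed file gets the same value in one field.
        for p in paths:
            self.add_file(p).metadata[field_name] = value
```

With this, the same file can be added to a ‘to read’ list and a ‘thesis’ list, and `batch_set([...], "Author", "...")` fills in the author for a whole set of lecture notes in one go.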

Additional features that could be added once the basics are done are:

  1. integration with BibTeX, so that any reading list (or even the whole library) can be compiled into a BibTeX file.
  2. connecting to online pdf repositories such as arXiv, PubMed, JSTOR etc.
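
To give an idea of what the BibTeX integration might look like, here is a small Python sketch that turns metadata into BibTeX entries. The function names, the choice of `@misc` as the entry type and the mapping from pdf metadata fields (Author, Title, Keywords) to BibTeX fields are all my assumptions for illustration; a real exporter would also need to pick proper entry types and escape special characters in the values.

```python
def bibtex_entry(key: str, metadata: dict[str, str]) -> str:
    """Format one BibTeX entry from a dict of pdf metadata fields."""
    # Assumed mapping from pdf metadata names to BibTeX field names.
    fields = {
        "author": metadata.get("Author"),
        "title": metadata.get("Title"),
        "keywords": metadata.get("Keywords"),
    }
    lines = [f"@misc{{{key},"]
    for name, value in fields.items():
        if value:  # skip fields that are missing or empty
            lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


def reading_list_to_bibtex(items: list[tuple[str, dict[str, str]]]) -> str:
    """Join the entries for a whole reading list, given (key, metadata) pairs."""
    return "\n\n".join(bibtex_entry(key, md) for key, md in items)
```

Calling `reading_list_to_bibtex` on the (key, metadata) pairs of a reading list gives a string that can be written straight to a `.bib` file.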

Project updates will be posted as blog entries under the categories ‘miradi’ and ‘pdf’.