Public service announcement: How to use Wget to grab the 2011 LPSC abstracts
Posted by Emily Lakdawalla
03-03-2011 17:08 CST
The Lunar and Planetary Science Conference is happening next week, and today I went to the meeting website as I usually do to peruse the listed talks for the sessions and abstracts in advance. The sessions and abstracts are all in PDF format, so it's tiresome to access them online; I much prefer to download them all to my computer and browse them locally. So it was a nasty surprise to discover that this year, unlike previous years, you can't just direct your FTP client of choice to ftp.lpi.usra.edu/pub/outgoing and download everything. If there's an officially sanctioned way to download all the abstracts for the 2011 meeting, I haven't been able to figure it out.
I got them all anyway, though, thanks to a handy open-source file retrieval tool called Wget. I use Wget all the time to grab space image data, but it's possible to use it for other nefarious purposes like downloading 2,000 conference abstracts. I crowed about this on Twitter this morning and received several requests to explain the method, so here's how it works.
First, install Wget. Go to the Wget website and find the installer that's appropriate for your operating system.
There are various user interfaces out there that you can install on top of Wget to make it "easier" to use, but I find it pretty easy to run it from the command line, as it's well documented online. To grab the LPSC abstracts, I first used Excel to create a text list of all the hypertext links to the session and abstract PDFs (a step that you can skip by just downloading my file), which I placed in the same folder where I installed wget. Then I ran the following command:
> wget -i lpsc.txt
Then I watched it grab the 2,000 or so files. Easy as pie!
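If you'd rather not build the link list in Excel, a short script can pull the PDF links out of a saved copy of the program page. What follows is only a minimal sketch of that idea, not the method I used: the function name pdf_links and the class PdfLinkCollector are made up for illustration, and it assumes the page links to its PDFs with ordinary anchor tags whose addresses end in ".pdf".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collects the targets of <a href="..."> tags that point at PDF files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative links
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only links whose address ends in .pdf (case-insensitive).
        if href and href.lower().endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def pdf_links(html, base_url):
    """Return absolute URLs for every PDF link found in the HTML text."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Write the returned list to a text file, one URL per line, and hand that file to "wget -i" exactly as above.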
I use that Excel trick a lot when I want to grab a large number of files with sequential filenames. For instance, every time Cassini spits a large number of images to their raw images website, I like to be able to browse through those images locally. To grab them, I begin with the path to the most recent image. Right now, that path is:
This path is a lot of text that is mostly the same for every Cassini image except for the last six digits before the ".jpg" filename extension. So I use Excel to generate a list of the 200 most recent narrow-angle camera images this way:
- Put the filename of the most recent image in cell A1 of the table.
- In cell A2, write the following formula:
What that gobbledygook does is grab the left 66 characters of the path, take the next 6 characters and turn them into a number (that's the "VALUE" bit), add the row number of the current cell minus 1, append this new number to the 66-character prefix, and finally tack on the ".jpg" filename extension. Copy and paste this into as many cells as you want to grab as many images as you want, save the results as a text file within the wget folder, and run the same command: "wget -i [yourfilename].txt".
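For anyone who would rather script it, here is a rough Python sketch of the same recipe. It is an illustration under assumptions, not a translation of my exact formula: the function name sequential_urls and the example.com address are hypothetical stand-ins, and instead of hard-coding 66 characters it splits the path by counting back from the ".jpg" extension. It also pads the number back out to six digits, and it takes a step argument so you can count downward with step=-1.

```python
def sequential_urls(first_url, count, digits=6, ext=".jpg", step=1):
    """Build `count` URLs by stepping the trailing number in `first_url`.

    The last `digits` characters before `ext` are treated as the counter;
    everything before them is kept unchanged as the prefix.
    """
    stem = first_url[: -len(ext)]      # drop the filename extension
    prefix = stem[:-digits]            # the part that never changes
    start = int(stem[-digits:])        # the counter, as a number
    return [
        f"{prefix}{start + i * step:0{digits}d}{ext}"  # zero-padded counter
        for i in range(count)
    ]


# Hypothetical starting path; substitute the real path to your first image.
urls = sequential_urls("https://example.com/raw/N00123456.jpg", 3)
```

Write the list out one URL per line (for example with "\n".join(urls)) to a text file in the wget folder, then run "wget -i" on it as before.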
Excel isn't required; there are myriad other ways to create these sequential filename lists using little more programming skills than one needs to write a program that outputs "Hello, world!" -- Excel is just the program I'm comfortable with.
Have fun grabbing files!