
Creating a Page Scraper (Web Crawler?) and File Downloader

HalfEatenPie

The Irrational One
Retired Staff
Hello there!

So, part of what I do as a "researcher" is read new research papers. But a situation has come up where I need to download and format (in a presentable fashion) large quantities of research papers.

I was wondering if anyone knew how I should get started on this (especially anyone who has worked on website/information-archiving projects).

Basically I'd be going on websites like this:

http://www.sciencedirect.com/science/journal/00221694 (format could be different later)

And part of what I need to do right now is download the PDF file of the report and (from the information presented on that page) create a nice, easy-to-read way of presenting the Title, Authors, and Abstract.

Is there already a source I can work off of to get this started? I'm assuming Python is probably the easiest language to start with for this. Does anyone have a general idea of how I should go about it?

I guess for the "bigger discussion": does anyone have experience with programming web crawlers?

Edit: I guess I should have been a bit more specific. Currently I'm looking at using a variation of this: http://scrapy.org/.
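For the sake of discussion, here's roughly the kind of Scrapy spider I'm picturing. This is only a minimal sketch: the CSS selectors and field names are placeholders I made up, not the actual markup of the ScienceDirect listing page, so they'd need adjusting to whatever site this ends up running against.

import scrapy


class PaperSpider(scrapy.Spider):
    """Minimal sketch: collect title, authors, abstract, and PDF link
    from a journal listing page. All selectors are placeholders."""

    name = "papers"
    # Placeholder start URL; the real listing page would go here.
    start_urls = ["http://www.sciencedirect.com/science/journal/00221694"]

    def parse(self, response):
        # "li.article" etc. are made-up selectors; inspect the real page
        # and replace them with whatever actually wraps each entry.
        for article in response.css("li.article"):
            pdf_href = article.css("a.pdf-link::attr(href)").get()
            yield {
                "title": article.css("a.title::text").get(),
                "authors": article.css("span.author::text").getall(),
                "abstract": article.css("div.abstract::text").get(),
                "pdf_url": response.urljoin(pdf_href) if pdf_href else None,
            }

Running it with "scrapy runspider paper_spider.py -o papers.json" would dump the metadata to JSON, which could then be pushed through a template to build the presentable Title/Authors/Abstract summary.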
 

fisle

Active Member
I don't know if it helps, but... here's my basic crawler that I use to search for T61 laptops on a forum :D

I just fetch the website, then let BeautifulSoup handle finding what I want. :)
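In case a concrete snippet helps, the fetch-then-parse pattern I mean looks something like this. The URL and the CSS selector below are placeholders, not the real forum or markup:

import requests
from bs4 import BeautifulSoup

# Placeholder search URL; point this at the actual forum search page.
url = "http://example.com/forum/search?q=T61"

# Fetch the page...
resp = requests.get(url, timeout=30)
resp.raise_for_status()

# ...then let BeautifulSoup find the bits I care about.
soup = BeautifulSoup(resp.text, "html.parser")
for link in soup.select("a.thread-title"):  # placeholder selector
    print(link.get_text(strip=True), link.get("href"))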
 

DomainBop

Dormant VPSB Pathogen
Basically I'd be going on websites like this:

http://www.sciencedirect.com/science/journal/00221694 (format could be different later)
I'd advise checking the TOS of every site you want to use before deploying your scraper. Elsevier (ScienceDirect) prohibits the use of spiders/crawlers/downloaders without their written permission, and the TOS of most other corporate-owned scientific journals contain similar prohibitions against scraping and downloading. The relevant portion of their TOS:

Unless expressly authorized by us, you may not use any robots, spiders, crawlers or other automated downloading programs, algorithms or devices, or any similar or equivalent manual process, to: (i) continuously and automatically search, scrape, extract, deep link or index any Content;
 

HalfEatenPie

The Irrational One
Retired Staff
Thanks guys!

I'd advise checking the TOS of every site you want to use before deploying your scraper. Elsevier (ScienceDirect) prohibits the use of spiders/crawlers/downloaders without their written permission, and the TOS of most other corporate-owned scientific journals contain similar prohibitions against scraping and downloading. The relevant portion of their TOS:

Unless expressly authorized by us, you may not use any robots, spiders, crawlers or other automated downloading programs, algorithms or devices, or any similar or equivalent manual process, to: (i) continuously and automatically search, scrape, extract, deep link or index any Content;
Gah, yeah, there is that. I'll have to review that, I guess. It's just annoying when you have to go through the same repetitive task to obtain what you need. It's much easier to just have your own library and take what you need :/

Thanks though. I'll make sure what I do falls within the given ToS.
 