
Creating a Page Scraper (Web Crawler?) and File Downloader

HalfEatenPie

The Irrational One
Retired Staff
Hello there!

So, part of what I do as a "researcher" is read new research papers. But a situation has come up where I need to download and format (in a presentable fashion) large quantities of research papers.

I was wondering if anyone knew how I should get started on this (especially anyone who has worked on website/information-archiving projects).

Basically I'd be going on websites like this:

http://www.sciencedirect.com/science/journal/00221694 (format could be different later)

And part of what I need to do right now is download the PDF file of the report and (from the information presented on that page) create a nice, easy-to-read way of presenting the Title, Authors, and Abstract.

Is there already a source I can work off of to get this started? I'm assuming Python is probably the easiest language to start with for this. Does anyone have a general idea of how I should go about it?

I guess for the "bigger discussion": does anyone have experience with programming web crawlers?

Edit: I guess I should have been a bit more specific. Currently I'm looking at using a variation of this: http://scrapy.org/.
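For the sake of discussion, here's roughly the kind of Scrapy spider I'm picturing. This is only a minimal sketch: the CSS selectors and field names are placeholders I made up, not the actual markup of the ScienceDirect listing page, so they'd need adjusting to whatever site this ends up running against.

import scrapy


class PaperSpider(scrapy.Spider):
    """Minimal sketch: collect title, authors, abstract, and PDF link
    from a journal listing page. All selectors are placeholders."""

    name = "papers"
    # Placeholder start URL; the real listing page would go here.
    start_urls = ["http://www.sciencedirect.com/science/journal/00221694"]

    def parse(self, response):
        # "li.article" etc. are made-up selectors; inspect the real page
        # and replace them with whatever actually wraps each entry.
        for article in response.css("li.article"):
            pdf_href = article.css("a.pdf-link::attr(href)").get()
            yield {
                "title": article.css("a.title::text").get(),
                "authors": article.css("span.author::text").getall(),
                "abstract": article.css("div.abstract::text").get(),
                "pdf_url": response.urljoin(pdf_href) if pdf_href else None,
            }

Running it with "scrapy runspider paper_spider.py -o papers.json" would dump the metadata to JSON, which could then be pushed through a template to build the presentable Title/Authors/Abstract summary.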
 

fisle

Active Member
I don't know if it helps, but... here's my basic crawler that I use to search for T61 laptops on a forum :D

I just fetch the website, then let BeautifulSoup handle finding what I want. :)
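In case a concrete snippet helps, the fetch-then-parse pattern I mean looks something like this. The URL and the CSS selector below are placeholders, not the real forum or markup:

import requests
from bs4 import BeautifulSoup

# Placeholder search URL; point this at the actual forum search page.
url = "http://example.com/forum/search?q=T61"

# Fetch the page...
resp = requests.get(url, timeout=30)
resp.raise_for_status()

# ...then let BeautifulSoup find the bits I care about.
soup = BeautifulSoup(resp.text, "html.parser")
for link in soup.select("a.thread-title"):  # placeholder selector
    print(link.get_text(strip=True), link.get("href"))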
 

DomainBop

Dormant VPSB Pathogen
Basically I'd be going on websites like this:

http://www.sciencedirect.com/science/journal/00221694 (format could be different later)
I'd advise checking the TOS of every site you want to use before deploying your scraper. Elsevier (ScienceDirect) prohibits the use of spiders/crawlers/downloaders without their written permission, and the TOS of most other corporate-owned scientific journals contain similar prohibitions against scraping and downloading. The relevant portion of their TOS:

Unless expressly authorized by us, you may not use any robots, spiders, crawlers or other automated downloading programs, algorithms or devices, or any similar or equivalent manual process, to: (i) continuously and automatically search, scrape, extract, deep link or index any Content;
 

HalfEatenPie

The Irrational One
Retired Staff
Thanks guys!

I'd advise checking the TOS of every site you want to use before deploying your scraper. Elsevier (ScienceDirect) prohibits the use of spiders/crawlers/downloaders without their written permission, and the TOS of most other corporate-owned scientific journals contain similar prohibitions against scraping and downloading. The relevant portion of their TOS:

Unless expressly authorized by us, you may not use any robots, spiders, crawlers or other automated downloading programs, algorithms or devices, or any similar or equivalent manual process, to: (i) continuously and automatically search, scrape, extract, deep link or index any Content;
Gah, yeah, there is that. I'll have to review that, I guess. It's just annoying when you have to go through the same repetitive task to obtain what you need. It's much easier to just have your own library and take what you need :/

Thanks though. I'll make sure what I do falls within the given ToS.
 