Yet another method to grab download-disabled slideshows from SlideShare

30 Jun 2013

Yes, I know. Horrible, horrible subject. The thought of stealing jpgs which are publicly viewable... Oh, well.

Standard disclaimer applies: Teaching someone how to steal a book does not make the teacher guilty of theft. If you get in trouble for following these directions, shame on you, not on me.

So, as a proof-of-concept, I was curious as to what SlideShare does to inhibit downloading of presentations. Apparently, all they do is not provide the (original?) PowerPoint document for download ¹, ². However, if one examines the source of the page, it is fairly easy to determine the filename of each slide image, and then automate a fetch to grab each one.

Requirements

Web browser or something to retrieve the source of one of the slideshow's pages (well, since you're reading this, I suppose we have this one covered)
cURL (look for a version compatible with your OS; the cURL project's download page is the place to start)

That's it.

Steps

Open the page containing any slide in the set you want to download.
View the source of the page (in Mozilla-based browsers, this is usually accomplished with Ctrl-U).
Search for "og:image" in the source, and copy the URL given in the content attribute which follows.
Note the slide count in the lower left of the presentation.
Open a terminal (command prompt or window session).
Navigate to where you would like to save the downloaded images.

Run the following cURL command:

curl -O "http://image.slidesharecdn.com/<name-of-presentation-including-numeric-string>-phpapp02/95/slide-[1-n]-<resolution>.jpg"

An illustration

Searching for og:image in the source, we find:

<!-- fb open graph meta tags -->

  <meta name="fb_app_id" property="fb:app_id" class="fb_og_meta" content="7890123456" />
  <meta name="og_type" property="og:type" class="fb_og_meta" content="slideshare:presentation" />
  <meta name="og_url" property="og:url" class="fb_og_meta" content="http://www.slideshare.net/somedirectory/some-presentation" />
  <meta name="og_image" property="og:image" class="fb_og_meta" content="http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-1-1024.jpg" />

The URL specified by og:image is:

http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-1-1024.jpg
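If you would rather not copy the URL by hand, the search can be sketched in the shell. This is only a rough sketch: it assumes the page source is already available as text (here a trimmed copy of the tags above, held in a variable I have named page_source), that the og:image tag sits on one line, and that content is the last quoted attribute on that line, as it is in this example.

```shell
# Trimmed copy of the meta tags shown above; in practice this would be the
# saved source of the slideshow page.
page_source=$(cat <<'EOF'
<meta name="og_url" property="og:url" class="fb_og_meta" content="http://www.slideshare.net/somedirectory/some-presentation" />
<meta name="og_image" property="og:image" class="fb_og_meta" content="http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-1-1024.jpg" />
EOF
)

# Keep only the og:image line, then strip everything except the value
# inside content="...".
first_slide_url=$(printf '%s\n' "$page_source" \
    | grep 'property="og:image"' \
    | sed 's/.*content="\([^"]*\)".*/\1/')

echo "$first_slide_url"
```

If the markup of the page differs from the sample (attributes in another order, tag split across lines), the sed expression would need adjusting.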

Assume that the slide count is 55 (i.e., on the first slide, the lower left indicates "1/55"). Once in the directory where I want to save the images, I simply tell cURL:

curl -O "http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-[1-55]-1024.jpg"

and cURL will retrieve each jpg in the deck.
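The ranged URL can also be built from the first slide's URL and the slide count, rather than edited by hand. A minimal sketch, assuming the copied URL points at slide 1 and ends in slide-1-<resolution>.jpg as in the illustration (the variable names are mine):

```shell
first_slide_url='http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-1-1024.jpg'
slide_count=55

# Rewrite the final path segment "slide-1-<rest>" as "slide-[1-N]-<rest>".
# The match is anchored to the end of the URL so that a presentation name
# which happens to contain "slide-1-" is left alone.
pattern=$(printf '%s\n' "$first_slide_url" \
    | sed "s|/slide-1-\([^/]*\)\$|/slide-[1-${slide_count}]-\1|")

echo "$pattern"

# To download, pass the pattern to cURL in quotes so the shell does not try
# to interpret the brackets itself:
#   curl -O "$pattern"
```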

How it works

The -O option tells cURL to save each download under its original (remote) filename. Without this, cURL will dutifully write the retrieved data to standard output, which is of little use for a binary image.
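To see which name -O will pick, note that it uses only the file part of the URL and discards the path. The same result can be sketched with plain shell parameter expansion (the variable names are only for illustration):

```shell
url='http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-1-1024.jpg'

# Remove the longest prefix ending in "/", leaving the remote file name,
# which is the name -O saves under.
local_name=${url##*/}

echo "$local_name"
```

If you would rather choose the names yourself, cURL's -o option accepts #1 as a placeholder for the current value of the first range, e.g. -o "deck-#1.jpg" in place of -O.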

The [1-55] is a numeric range: it tells cURL to request the URL once for each number from 1 to 55, substituting that number for the bracketed expression (between the dashes in this example), e.g.:

curl -O http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-1-1024.jpg
curl -O http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-2-1024.jpg
curl -O http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-3-1024.jpg
[...]
curl -O http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-55-1024.jpg

Frustration with wget

My natural inclination was to use wget for this. However, wget does not support globbing for http (its wildcards work only for ftp), and while I could have generated the URLs some other way and fed them to it one after the other, this is a horribly clumsy way of accomplishing the task.
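For the curious, the clumsier route looks roughly like this: generate the list of URLs yourself and hand it to wget. The slide_urls function below is a hypothetical helper of my own, not something wget provides.

```shell
# Hypothetical helper: print one slide URL per line for slides 1..count.
slide_urls() {
    base=$1      # URL up to and including "slide-"
    count=$2     # number of slides in the deck
    suffix=$3    # everything after the slide number, e.g. "-1024.jpg"
    i=1
    while [ "$i" -le "$count" ]; do
        printf '%s%s%s\n' "$base" "$i" "$suffix"
        i=$((i + 1))
    done
}

# Print the URLs for the 55-slide deck from the illustration.
slide_urls 'http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-' 55 '-1024.jpg'
```

Piping that output into wget -i - makes wget read the URLs from standard input and fetch each in turn. It works, but compared with a single bracketed range in cURL it is a lot of ceremony.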

Apply the concepts

The point of all of this is not to go and rip off every download-disabled presentation on SlideShare, but rather to present a working example of how to use cURL to retrieve sequential filenames via http (or ftp). If you find another good use for this one-liner, please post a comment to let me know.

Point of fact #1: I don't use PowerPoint, and I absolutely go ballistic when someone emails me one of those disgustingly-huge files which I must then convert to something readable (i.e., pray that it will open in Impress and then allow me to save it to an Impress file - or better, a pdf).
Point of fact #2: I do not (yet) have an account on SlideShare, which is apparently required to download any presentations from their site.


Tagged as: curl, fetch, files, image, internet, jpg, remote, script, shell, site, slide


Lewis' Blog Tales from the trenches of information technology