Recently, thanks to a good friend, I stumbled over a very expressive French-speaking singer called Jacques Brel, or rather, over some of his music videos on YouTube. Normally, when I find a video I want to keep, I use the (Java)script from 1024k.de. However, this time I decided I wanted a script to which I could feed a set of links and which would download the .flv files all by itself. Here is the process of building it:
1. Figure out how to get the link to the .flv from the video page source
This is easy to accomplish using URL Snooper.
- First, (install and) start URL Snooper and press "Sniff Network"; the "Protocol Filter" should be set to "Show All".
- Open the video page in your favorite browser (mine is Opera, by the way). Let's use this video as an example: Jacques Brel - Amsterdam.
- In the keyword filter, copy & paste this: get_video?video_id. There should be two links showing up in the list at the bottom, a longer one and a shorter one, starting with "http://youtube.com/get_video?video_id=". What follows after that is what we need from the page source. We already know the video_id; it's pk7YxDzjTxA (the original URL being http://youtube.com/watch?v=pk7YxDzjTxA). From the page source we must somehow get the "t" parameter; this is the key to the whole thing.
A unique one is generated every time the page is requested, and a cookie linked to it is also stored. It usually starts with "OE" followed by some "random" letters and numbers (e.g. t=OEgsXoDSdfK8pTloMKr2p6gfC7hfAOsf).
- Now that we know what we need, we must find the "t" parameter in the page source code. Searching for its OE[...] value, we find a line that starts with
var fullscreenUrl = '/watch_fullscreen?[...]
and at some point contains the "&t=OE[...]" parameter. Jackpot!
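As a rough illustration of the two ingredients identified above, the following sketch pulls the video_id out of the watch URL and assembles the get_video link. It is written for Python 3 (urllib.parse) rather than the Python 2 used in the rest of this post, the helper names are made up for this example, and the "t" value is only the sample shown earlier, which would not work for a real request since a fresh one is generated for every page load.

```python
from urllib.parse import urlparse, parse_qs


def video_id_from_watch_url(watch_url):
    """Return the value of the 'v' query parameter of a watch URL."""
    query = parse_qs(urlparse(watch_url).query)
    return query["v"][0]


def build_get_video_url(video_id, t):
    """Assemble the link format observed in URL Snooper."""
    return "http://youtube.com/get_video?video_id=%s&t=%s" % (video_id, t)


watch_url = "http://youtube.com/watch?v=pk7YxDzjTxA"
video_id = video_id_from_watch_url(watch_url)

# Sample value only; a real "t" has to come from a freshly requested page.
example_t = "OEgsXoDSdfK8pTloMKr2p6gfC7hfAOsf"

flv_url = build_get_video_url(video_id, example_t)
print(flv_url)
```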
2. Download the video's page from Python & parse it
Downloading a page from Python is pretty straightforward; the twist in our case comes from the requirement to use cookies.
A (very) brief introduction to cookies in Python can be found here. Basically, you use a "cookie jar" which stores the cookies sent by the remote server; the jar is passed to the urllib2 opener. To keep things simple, we'll use the following download function:
def download(url, userAgent = '', cookieJar = None):
    args = []
    # An empty cookie jar evaluates as false, so compare against None explicitly
    if cookieJar is not None:
        args.append(urllib2.HTTPCookieProcessor(cookieJar))
    uo = urllib2.build_opener(*args)
    uo.addheaders = [('User-agent', userAgent or 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)')]
    lnk = uo.open(url)
    data = lnk.read()
    return data
Its arguments are the URL to be downloaded, the user agent (we spoof IE 6.0 if no user agent string is given; some sites weed out automatic requests based on the user agent, and we don't want that...) and the cookie jar, which we create like this:
cj = cookielib.LWPCookieJar('myCookieFile.cookie')
Obviously, the previous lines of code require urllib2 and cookielib to be imported.
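For readers on Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar, a roughly equivalent sketch might look like the one below. The split into a make_opener helper and the DEFAULT_USER_AGENT constant are choices made for this example, not part of the original script. The jar is tested with "is not None" because an empty cookie jar counts as false in a plain "if".

```python
import http.cookiejar
import urllib.request

DEFAULT_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"


def make_opener(user_agent="", cookie_jar=None):
    """Build an opener that sends a browser-like User-agent header and,
    if a jar is given, stores and resends cookies through it."""
    handlers = []
    if cookie_jar is not None:
        handlers.append(urllib.request.HTTPCookieProcessor(cookie_jar))
    opener = urllib.request.build_opener(*handlers)
    opener.addheaders = [("User-agent", user_agent or DEFAULT_USER_AGENT)]
    return opener


def download(url, user_agent="", cookie_jar=None):
    """Fetch a URL and return the response body as bytes."""
    opener = make_opener(user_agent, cookie_jar)
    with opener.open(url) as response:
        return response.read()


# Creating the jar does not touch the file; nothing is written unless save() is called.
cj = http.cookiejar.LWPCookieJar("myCookieFile.cookie")
opener = make_opener(cookie_jar=cj)
```

Note that download returns bytes in Python 3, so the page would need decoding before a text regular expression is applied to it.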
The next step is to actually download the video's page and parse it using regular expressions to get the "t" parameter. Assuming the URL of the video gets passed as an argument to our script, this is how it would be done:
data = download(sys.argv[1], cookieJar = cj)
m = re.search('video_id=(.+?)&.+&t=(.+?)&hl=', data)
if not m:
    print 'Video ID/t not found!'
    sys.exit()
id,t = m.groups()
The above lines of code download the video page (actually, the first parameter passed to the script, which *should* be the video URL) and use a regular expression to find the video_id & t parameters. Of course we could get a list of URLs from a text file and pass the name of that text file to the script, but that's left as an exercise to the reader (don't you just love it when that happens?...).
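To see what the regular expression captures, here is a small sketch that runs it against a made-up line shaped like the fullscreenUrl one. The sample line is an assumption: the real page has other parameters in the elided places, and the "l=123" parameter is invented purely so that something sits between video_id and t.

```python
import re

# Hypothetical page line; only the video_id and t values come from the examples above.
sample_line = (
    "var fullscreenUrl = '/watch_fullscreen?video_id=pk7YxDzjTxA"
    "&l=123&t=OEgsXoDSdfK8pTloMKr2p6gfC7hfAOsf&hl=en';"
)

match = re.search(r'video_id=(.+?)&.+&t=(.+?)&hl=', sample_line)
if match:
    video_id, t = match.groups()
    print(video_id)
    print(t)
else:
    video_id = t = None
    print('Video ID/t not found!')
```

Note that the pattern expects at least one other parameter between video_id and t, and an "hl" parameter right after t; if the page layout differs, the search simply returns no match.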
3. Actually downloading the .flv video
Next step: getting the video file and saving it. Not much left to do; simply use the parameters we've got and the link format we know from URL Snooper:
video = download('http://www.youtube.com/get_video?video_id=%s&t=%s' % (id, t), cookieJar = cj)
open('%s.flv' % id, 'wb').write(video)
Et voila!
Summary
For the lazier people, here's a Python script that takes the URL of a video as an argument and downloads it:
import urllib2, cookielib, re, os, sys

def download(url, userAgent = '', cookieJar = None):
    args = []
    # An empty cookie jar evaluates as false, so compare against None explicitly
    if cookieJar is not None:
        args.append(urllib2.HTTPCookieProcessor(cookieJar))
    uo = urllib2.build_opener(*args)
    uo.addheaders = [('User-agent', userAgent or 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)')]
    lnk = uo.open(url)
    data = lnk.read()
    return data

cj = cookielib.LWPCookieJar('my.cookie')
data = download(sys.argv[1], cookieJar = cj)
m = re.search('video_id=(.+?)&.+&t=(.+?)&hl=', data)
if not m:
    print 'Video ID/t not found!'
    sys.exit()
id,t = m.groups()
video = download('http://www.youtube.com/get_video?video_id=%s&t=%s' % (id, t), cookieJar = cj)
open('%s.flv' % id, 'wb').write(video)
if os.path.isfile('my.cookie'): os.remove('my.cookie')
Usage example (assuming you saved the script as getvid.py):
getvid.py http://youtube.com/watch?v=pk7YxDzjTxA
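For feeding the script a whole set of links, one possible approach is to keep them in a text file, one URL per line. The sketch below covers only the parsing part; the function name, the "#" comment convention and the second URL are assumptions made up for this example.

```python
def parse_url_list(lines):
    """Return the URLs found in an iterable of text lines.

    Blank lines and lines starting with '#' are skipped.
    """
    urls = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


sample_lines = [
    "# Jacques Brel\n",
    "http://youtube.com/watch?v=pk7YxDzjTxA\n",
    "\n",
    "http://youtube.com/watch?v=HYPOTHETICAL\n",  # made-up video ID
]
urls = parse_url_list(sample_lines)
for url in urls:
    print(url)
```

With a real file, passing the open file object to parse_url_list (for instance a hypothetical links.txt) would give the list to loop over, running the download steps once per entry.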
Please note that this code is not meant to help people mirror the contents of YouTube.com... It's an attempt (feeble, perhaps) to present a hands-on approach to solving everyday tasks in Python, which hopefully some will find enlightening or at least a tiny bit helpful.
P.S.
And the reader thinks to himself in disappointment: "But you promised Dailymotion.com too in the title!". It's less challenging than YouTube, so I won't go through the process in detailed steps. We use the same algorithm as above (usable on almost any video site):
- Using URL Snooper, you can find the link format in the same way as described above; it turns out to have the following pattern:
http://www.dailymotion.com/get/[some-number]/320x240/flv/[some-alphanums].flv?key=[hex-digits]
- Looking through the page source, we find the pattern (url-encoded) in a line looking something like this:
[random-alphanums].addVariable("video", "%2Fget%2F16%2F320x240%2Fflv%2F[random-alphanums].flv
We extract the interesting portion with the following regular expression:
r'(%2Fget%2F[^"]+?\.flv%3Fkey%3D[^"%]+)["%]'
Then we apply urllib.unquote (note that here it's urllib, not urllib2!) to the result and append it to the http://www.dailymotion.com host to get the full URL.
- Having gotten the URL to the .flv file, we download it. In dailymotion.com's case, cookies aren't required.
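The Dailymotion steps can be sketched as follows. The sample page line is invented to match the pattern described above (the "so" variable name, the file name and the key are placeholders), and the code uses Python 3, where urllib.unquote lives in urllib.parse.

```python
import re
from urllib.parse import unquote

HOST = "http://www.dailymotion.com"

# Hypothetical page line; the file name and key are placeholders.
sample_line = (
    'so.addVariable("video", '
    '"%2Fget%2F16%2F320x240%2Fflv%2Fabc123.flv%3Fkey%3Ddeadbeef");'
)

pattern = r'(%2Fget%2F[^"]+?\.flv%3Fkey%3D[^"%]+)["%]'
match = re.search(pattern, sample_line)

flv_url = None
if match:
    # Decode the percent-escapes, then prepend the host name.
    flv_url = HOST + unquote(match.group(1))
    print(flv_url)
```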