This blog is subject to the DISCLAIMER below.

Tuesday, May 18, 2010

Python Script to download TED Talks translations :))

Well, I admit it, I am a TED lover. I love the passion and the "new" factor TED brings to me :)

I was watching this presentation by Simon Sinek, and someone I know had downloaded the video but wasn't able to view the translation locally. So I thought I might be able to download the translations for him and convert them to SRT format, so they can be displayed in any video player. Here is how I did it :)

If you have visited the site before, you know it has a Flash control where you can watch the video and choose a translation if needed.

First, to investigate how the Flash control gets the translation, you have to open a network sniffer, which lets you view all the packets going to and from your network card. A good packet sniffer is Wireshark; if you don't know it, check these links to learn how to use it :)
Packet Sniffing using Wireshark Tutorial (Video)
Fifteen Minute Wireshark Tutorial - Wheeler Software


Second, you need to add a filter for "HTTP" requests only, then open any talk (like the one above) to see how the Flash control communicates with the server :)

When you choose a translation, you will find a request to a URL like this: www.ted.com/talks/subtitles/id/848/lang/eng , where 848 is the Talk ID and eng is the chosen language, "English".
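As a small sketch of that URL pattern, here is a helper that builds the subtitle URL from a Talk ID and a language code. The function name build_subtitle_url is made up for this illustration, and the pattern is only what I saw in the sniffer, not a documented API, so it may change.

```python
# Hypothetical helper: rebuilds the subtitle URL in the shape seen in the sniffer.
def build_subtitle_url(talk_id, language):
    return 'http://www.ted.com/talks/subtitles/id/%s/lang/%s' % (talk_id, language)

# Talk ID 848 with English subtitles, as in the request above
print(build_subtitle_url(848, 'eng'))
```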

So there are two questions now, given the URL of a TED Talk:

In what format are the translation subtitles returned?
How do I get the Talk ID?


To answer the first question, just open the link above, and you will find the translation is returned in JSON (JavaScript Object Notation) format. So this is good news: we just need a good library to handle JSON, and then convert the result to SRT format so it can be used in almost any video player, like VLC media player.
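To make the conversion concrete, here is a minimal sketch that turns a tiny JSON sample into SRT text. The sample captions are made up; the field names (captions, startTime, duration, content) are the ones the script in this post reads, and the times are assumed to be in milliseconds. The helper names to_srt_time and json_to_srt are only for this illustration.

```python
import json

# Made-up sample in the shape the script in this post expects (times in milliseconds)
sample = ('{"captions": ['
          '{"startTime": 0, "duration": 2500, "content": "Hello"}, '
          '{"startTime": 2500, "duration": 1500, "content": "World"}]}')

def to_srt_time(ms):
    # SRT timestamps look like HH:MM:SS,mmm
    hours, rest = divmod(ms, 3600000)
    minutes, rest = divmod(rest, 60000)
    seconds, millis = divmod(rest, 1000)
    return '%02d:%02d:%02d,%03d' % (hours, minutes, seconds, millis)

def json_to_srt(json_string):
    blocks = []
    # SRT entries are numbered starting from 1
    for index, caption in enumerate(json.loads(json_string)['captions'], 1):
        start = caption['startTime']
        end = start + caption['duration']
        blocks.append('%d\n%s --> %s\n%s\n' % (
            index, to_srt_time(start), to_srt_time(end), caption['content']))
    # A blank line separates consecutive entries
    return '\n'.join(blocks)

print(json_to_srt(sample))
```

Each SRT entry is an index line, a "start --> end" line, the caption text, and then a blank line.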

The answer to the second question is to do a simple "View Source" and search for this number. You will find it in several places in the page's source code, so simple parsing of the page HTML should do the job.
Here is the full script to do the job, given the URL on ted.com and the language code, which is eng for English and ara for Arabic. I will try to provide the rest of the language codes later.

This is a simple command to test the script:

python TEDSubtitles.py "http://www.ted.com/talks/simon_sinek_how_great_leaders_inspire_action.html" "eng"

And here is the full source code; I will try to upload it somewhere soon.

Hope it's worth spreading :))

Updates:
You can get the script here .
You need at least Python 2.6 for the json module to be available.

Update on 22/09/2010:
I was informed in the comments by Mohammad that he created a program, http://sourceforge.net/projects/tedgrabber/ , to download TED Talks and subtitles. He also created another one, http://sourceforge.net/projects/timecovergrabbe/ , which allows you to grab Time Magazine covers. This is another example that illustrates the concept, and of course it can be applied everywhere. Good luck to Mohammad.
Although it's hosted on SourceForge, the source code is not available. It would be better if it were shared.

Update on 24/04/2011:
A Google App Engine application was created for the same functionality at http://tedsubtitles.appspot.com

You can view the source below, but for better colorized viewing check this link here
import os
import sys
import json
import urllib2

# Format time from TED subtitles format (milliseconds) to SRT time format (HH:MM:SS,mmm)
def formatTime ( time ) :
    milliseconds = time % 1000
    seconds = (time // 1000) % 60
    minutes = (time // 60000) % 60
    hours = time // 3600000
    # SRT expects zero-padded fields and a comma before the milliseconds
    formatedTime = '%02d:%02d:%02d,%03d' % ( hours , minutes , seconds , milliseconds )
    return formatedTime

# Convert TED Subtitles to SRT Subtitles
def convertTEDSubtitlesToSRTSubtitles ( jsonString , introDuration ) :
    jsonObject = json.loads( jsonString )

    srtContent = ''
    captionIndex = 1

    for caption in jsonObject['captions'] :
        # Caption times are shifted by the intro duration
        startTime = formatTime ( introDuration + caption['startTime'] )
        endTime = formatTime ( introDuration + caption['startTime'] + caption['duration'] )

        # One SRT entry: index, time range, text, blank line
        srtContent += ( str ( captionIndex ) + os.linesep )
        srtContent += ( startTime + ' --> ' + endTime + os.linesep )
        srtContent += ( caption['content'] + os.linesep )
        srtContent += os.linesep

        captionIndex = captionIndex + 1
    return srtContent

# Download the subtitles JSON for a talk ID and language code
def getTEDSubtitlesByTalkID ( talkId , language ) :
    tedSubtitleUrl = 'http://www.ted.com/talks/subtitles/id/' + str(talkId) + '/lang/' + language
    req = urllib2.Request(tedSubtitleUrl)
    response = urllib2.urlopen(req)
    result = response.read()
    return ( result )

# Command-line arguments: the talk page URL and the language code
tedTalkUrl = sys.argv[1]
language = sys.argv[2]

# Download the talk page HTML
req = urllib2.Request(tedTalkUrl)
response = urllib2.urlopen(req)
result = response.read()

## Get Talk ID value
splits = result.split ( ';ti=' )
talkId = splits[1].split ( '&' )[0]
print talkId

## Get Talk Intro Duration value
splits = result.split ( ';introDuration=' )
talkIntroDuration = splits[1].split ( '&' )[0]
talkIntroDuration = int ( talkIntroDuration )
print talkIntroDuration

jsonString = getTEDSubtitlesByTalkID ( talkId , language )

srtContent = convertTEDSubtitlesToSRTSubtitles ( jsonString , talkIntroDuration )

# Generate SRT file name from the last URL segment, without its extension
splits = tedTalkUrl.split ( '/' )
srtFilename = splits[-1].split ('.')[0]

# Binary mode, so the os.linesep line endings are written unchanged on Windows
srtFile = open ( './' + srtFilename + '.srt' , 'wb' )
srtFile.write ( srtContent.encode ( "utf-8" ) )
srtFile.close ()


20 comments:

nassar said...

Hi, thanks for the script. Can I run it on a Windows machine?

Shereef Sakr said...

Yes, you can run it anywhere Python is installed.

nassar said...

Thank you very much. I will try it now.

nassar said...

Salam alikom,
I tried to run the module but an error occurred. This is the message that appeared:
There's an error in your program:
expected an indented block.

and the milliseconds variable was highlighted. Can you help me?

I also faced another problem with the sniffer: how can I filter the results to show only TED.com traffic?

Shereef Sakr said...

@nassar
Python blocks of code depend on indentation, and it seems the indentation is not displayed properly here. So please download the script from the link in the post.

I am not sure how to filter Wireshark HTTP traffic for a specific domain, but I usually close everything else before sniffing. I will update you if I find a way.

Shereef Sakr said...

You can filter by host using this method

http.host contains "ted.com"

so the full expression can be like this

http && http.host contains "ted.com"

Hope this helps :)

nassar said...

Thank you very much for the help, and I am sorry for all these questions.
Now I retrieved the URI, but I ran into a problem when I tried to run your Python program using the example you put in the post. I got this message:
SyntaxError: invalid syntax

and the name of the Python file is highlighted.

Sorry, I have never used Python before.

Shereef Sakr said...

@nassar I am not sure what the problem is now. I tested it on Windows and Linux and it works on both.

Can you post the command you used and the resulting output here?

nassar said...

Sorry for the delay, my website took all my time.

As for the problem, this is a capture of the window.
I opened the shell, entered the command, and here is the result:
http://yfrog.com/5mcapturewdp

Anonymous said...

very good.
and my program : tedgrabber beta2
http://sourceforge.net/projects/tedgrabber/

thanks ;)

mohammadk said...

very good.
and my program : TEDGrabber
(windows os : delphi 2010)

http://sourceforge.net/projects/tedgrabber/

thanks. ;)

Shereef Sakr said...

Great :) and you made it open source too :) Thanks for informing me here :)
I have referred to you in the post.

mohammadk said...

wow ;) thanks.

Petar Marić said...

I'm the author of metaTED, an open source tool that makes it easy to download all of the TED talks. It does so by creating metalinks of TED talks varying in both the quality levels and possible talk groupings by directory. All you need to do is get a download client that supports the Metalink standard and to choose one of the daily updated metalinks.

I've recently received an interesting feature request, mentioning TEDGrabber. The reason I'm contacting you is to warn you that TEDGrabber doesn't seem to be an open source application - namely I can't seem to find the source code for it anywhere.

Anonymous said...

I download with ted.com by means of service takeyoutube.com

Allen said...

The script helped me a lot.
Thank you very much.

Anonymous said...

Hi.. I need your help.. How do you convert the start time from JSON to SRT? How do you get the intro time info for a TED talk? Thx

Shereef Sakr said...

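@Anonymous The time is in milliseconds; that's what the formatTime function is for. It converts the time from milliseconds to the HH:MM:SS format that SRT uses. The intro duration is read from the talk page's HTML (the introDuration value) and added to each caption's start time.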
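Here is a small worked example of that arithmetic, as a sketch only. The two numbers are made up, and I am assuming the intro duration is in milliseconds, which is how the script treats it; srt_timestamp is just a name for this illustration.

```python
def srt_timestamp(total_ms):
    # Break a millisecond count into hours, minutes, seconds and milliseconds
    hours = total_ms // 3600000
    minutes = (total_ms // 60000) % 60
    seconds = (total_ms // 1000) % 60
    millis = total_ms % 1000
    return '%02d:%02d:%02d,%03d' % (hours, minutes, seconds, millis)

intro_duration = 15330   # made-up value, assumed to be in milliseconds
start_time = 3725042     # made-up caption start time from the JSON

# 15330 + 3725042 = 3740372 ms, which is 1 h, 2 min, 20 s and 372 ms
print(srt_timestamp(intro_duration + start_time))
```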

diddy47 said...

hello,

My comment is related to the content of your post. I would like to ask for help on how to convert XML or JSON subtitles like these into SRT subtitles:

<text>What are you doing?</text>
<start>8509</start>
<end>10444</end>
<text>Mom, from today on,</text>
<start>10444</start>
<end>12226</end>

to something in SRT format, or a similar XML format that the google2srt program can recognise, like

<text start="1.83" dur="3.1">[GwangJin-Gu, Seoul]</text>
<text start="6.75" dur="1.61">Somehow, it felt unfamiliar</text>