A quick-and-dirty script that converts YouTube transcripts (XML format) to .srt subtitle files.

To download a YouTube transcript, use this URL: http://video.google.com/timedtext?lang=en&v=VIDEO_ID (replace "VIDEO_ID" with the ID that appears in the video URL).

You can also use this converter from another script, for example one that downloads the transcript first: import the converter and call its main function.
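A downloading script along those lines might start from the sketch below. The helpers `transcript_url` and `download_transcript` are hypothetical names, not part of the recipe, and the endpoint is simply the one quoted above; the sketch does not check whether it still answers for a given video.

```python
import urllib.parse
import urllib.request

# Endpoint quoted in the recipe description (assumed to be reachable).
BASE_URL = "http://video.google.com/timedtext"

def transcript_url(video_id, lang="en"):
    """Build the transcript URL for a video ID (hypothetical helper)."""
    query = urllib.parse.urlencode({"lang": lang, "v": video_id})
    return "{}?{}".format(BASE_URL, query)

def download_transcript(video_id, xml_path, lang="en"):
    """Save the transcript XML to xml_path and return that path (hypothetical helper)."""
    with urllib.request.urlopen(transcript_url(video_id, lang)) as response:
        data = response.read()
    # Write the raw bytes so the file keeps the encoding declared in the XML.
    with open(xml_path, "wb") as outfile:
        outfile.write(data)
    return xml_path
```

The path returned by `download_transcript` can then be handed to the converter's main function. Keeping the `.xml` extension on that path matters, because the converter derives the .srt file name from it.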

Python, 76 lines
#!/usr/bin/python
# -*- encoding:utf-8 -*-

"""Translate Google's Transcript into srt file.

Takes google's transcript filename as argument (xml extension required).

NB: to get google's transcript, use this URL:
http://video.google.com/timedtext?lang=en&v=VIDEO_ID
"""

# srt example
"""1
00:00:20,672 --> 00:00:24,972
Entre l’Australia et la South America,
dans l’Océan South Pacific…"""

# Google's transcript example (first tags)
"""<?xml version="1.0" encoding="utf-8" ?>
<transcript>
<text start="11.927" dur="2.483">
This is a matter of National Security.</text>"""

import re, sys

# Pattern to identify a subtitle and grab start, duration and text.
pat = re.compile(r'<?text start="(\d+\.\d+)" dur="(\d+\.\d+)">(.*)</text>?')

def parseLine(text):
	"""Parse a subtitle."""
	m = pat.match(text)
	if m:
		return (m.group(1), m.group(2), m.group(3))
	else:
		return None

def formatSrtTime(secTime):
	"""Convert a time in seconds (google's transcript) to srt time format."""
	sec, micro = str(secTime).split('.')
	m, s = divmod(int(sec), 60)
	h, m = divmod(m, 60)
	return "{:02}:{:02}:{:02},{}".format(h, m, s, micro.ljust(3, '0')[:3])  # always 3 digits

def convertHtml(text):
	"""A few HTML encodings replacements.
	&amp;#39; to '
	&amp;quot; to "
	"""
	return text.replace('&amp;#39;', "'").replace('&amp;quot;', '"')

def printSrtLine(i, elms):
	"""Return a subtitle formatted as an srt entry."""
	return "{}\n{} --> {}\n{}\n\n".format(i, formatSrtTime(elms[0]), formatSrtTime(float(elms[0])+float(elms[1])), convertHtml(elms[2]))

# The file name is read from sys.argv in the __main__ block below, so that
# importing this module has no side effects.
def main(fileName):
	"""Parse google's transcript and write the converted data in srt format."""
	with open(fileName, 'r') as infile:
		buf = []
		for line in infile:
			buf.append(line.rstrip('\n'))
	# Split the buffer to get one string per tag.
	buf = "".join(buf).split('><')
	i = 0
	srtfileName = fileName.replace('.xml', '.srt')
	with open(srtfileName, 'w') as outfile:
		for text in buf:
			parsed = parseLine(text)
			if parsed:
				i += 1
				outfile.write(printSrtLine(i, parsed))
	print('DONE ({})'.format(srtfileName))

if __name__ == "__main__":
	main(sys.argv[1])

This converter only handles a few HTML-encoded characters. In Python 3.4 and later, the standard library's html.unescape function converts all named and numeric character references.
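As a rough sketch of a more general replacement for convertHtml, the function below leans on the standard library's `html.unescape`. The name `unescape_fully` and the `max_passes` limit are assumptions made for illustration. It decodes repeatedly because the transcript text shown in the recipe is escaped twice (`&amp;#39;` rather than `&#39;`), so a single pass would leave `&#39;` behind.

```python
import html

def unescape_fully(text, max_passes=3):
    """Apply html.unescape until the text stops changing (hypothetical helper)."""
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            # Nothing left to decode.
            break
        text = decoded
    return text
```

Repeated decoding is a trade-off: it would also collapse text that was meant to contain a literal `&amp;`, so a single call to `html.unescape` may be the better choice when the input is known to be escaped only once.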

2 comments

James Peters 9 years, 7 months ago

Thank you. Your code worked as designed and saved me a significant amount of time.

chupo_cro 8 years ago

The program fails to match 'start' and/or 'dur' values that are integers, so those subtitles are missing from the output file. For example, in the YouTube video with ID v2jpnyKPH64, the second subtitle has the timestamp 17 (not 17.00, 17. or 17.0, but 17), and it is missing from the output. To solve the problem, the groups in the regular expression should be changed from:

(\d+\.\d+)

to:

(\d+\.?\d*)

so the corrected regular expression on line #27 (the definition of pat) should be:

pat = re.compile(r'<?text start="(\d+\.?\d*)" dur="(\d+\.?\d*)">(.*)</text>?')

As a consequence, the split method on line #39 (in formatSrtTime) will no longer find a '.' in integer timestamps (or durations), and the program will throw an exception. The solution is to insert:

secTime = float(secTime)

just before line #39, which adds '.0' to the integer values caught by the regexps.
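The two changes above can be tried in isolation with a sketch like the following. It uses the relaxed pattern as given, but swaps the string split for integer millisecond arithmetic, which is a different technique from the `float()` insertion described above; the names `PAT` and `format_srt_time` are chosen for this illustration only.

```python
import re

# Relaxed pattern: start and dur may be integers or decimals.
PAT = re.compile(r'<?text start="(\d+\.?\d*)" dur="(\d+\.?\d*)">(.*)</text>?')

def format_srt_time(sec_time):
    """Convert seconds (str, int or float) to the srt form HH:MM:SS,mmm."""
    # Work in whole milliseconds so the fraction is always three digits.
    total_ms = int(round(float(sec_time) * 1000))
    s, ms = divmod(total_ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return "{:02}:{:02}:{:02},{:03}".format(h, m, s, ms)

# One tag as it looks after the recipe's split on '><'.
match = PAT.match('text start="17" dur="2.5">Hello</text')
start, dur, text = match.groups()
end = format_srt_time(float(start) + float(dur))
```

Rounding to milliseconds also keeps end times computed by adding two floats, such as 11.927 + 2.483, from printing a long or too-short fractional part.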