How To Scrape Subtitles?

There is very little data that combines Irish-language text, audio and English translation. One of the best sources is this soap opera:

https://www.tg4.ie/en/player/play/?pid=6352950048112&title=Ros%20na%20R%C3%BAn&series=Ros%20na%20R%C3%BAn&pcode=669535&genre=Drama
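As a starting point for automation, the query string of that player URL can be split with the Python standard library. This is only a sketch: `player_params` is a made-up helper name, and whether `pid` is the ID the video backend uses is an assumption, not something the page confirms.

```python
from urllib.parse import urlparse, parse_qs

PLAYER_URL = (
    "https://www.tg4.ie/en/player/play/"
    "?pid=6352950048112&title=Ros%20na%20R%C3%BAn"
    "&series=Ros%20na%20R%C3%BAn&pcode=669535&genre=Drama"
)

def player_params(url):
    """Return the query parameters of a player URL as a flat dict.

    Hypothetical helper: parse_qs returns a list per key, so we keep
    only the first value. Percent-encoding is decoded for us.
    """
    query = parse_qs(urlparse(url).query)
    return {key: values[0] for key, values in query.items()}

params = player_params(PLAYER_URL)
print(params["pid"], params["series"], params["pcode"])
```

Collecting the `pid` of each episode page this way would give a list of IDs to work through, assuming every episode uses the same URL pattern.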

It is fairly easy to find the subtitle URL manually while on that webpage.

[Image: getting the vtt file]

But the VTT URL uses UUIDs that seem pretty random:

https://redirector.playback.eu-west-1.prod.deploys.brightcove.com/v1/1555966122001/7b5d6364-47e2-4016-ae63-93301a7f4e38/ff7182e5-8f90-4af9-8d35-41a3bae7fa1e/441366d1-6c40-4106-9c0f-ecfdc21476b0.vtt

https://redirector.playback.eu-west-1.prod.deploys.brightcove.com/v1/1555966122001/83680fe1-8055-4494-96ff-bc2786f937cc/652c30ad-ff11-45d4-9e0c-46db42f5a34c/0ab149e4-25b0-4c73-8c9a-8130d647de91.vtt
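One possible way around guessing the UUIDs: both URLs are served from Brightcove, and the first path segment after `/v1/` (1555966122001) is the same in both, so it looks like an account ID. Brightcove's Playback API returns video metadata, including a `text_tracks` list with a `src` for each subtitle track, when given an account ID, a video ID and a policy key. The sketch below assumes the `pid` in the player URL is the Brightcove video ID and that a policy key can be found in the player's own network requests; neither is confirmed here, and `POLICY_KEY` is a placeholder. Nothing is fetched unless `fetch_subtitle_urls` is called.

```python
import json
import urllib.request
from urllib.parse import urlparse

VTT_URL = (
    "https://redirector.playback.eu-west-1.prod.deploys.brightcove.com/v1/"
    "1555966122001/7b5d6364-47e2-4016-ae63-93301a7f4e38/"
    "ff7182e5-8f90-4af9-8d35-41a3bae7fa1e/"
    "441366d1-6c40-4106-9c0f-ecfdc21476b0.vtt"
)
POLICY_KEY = "PLACEHOLDER"  # assumption: must be copied from the player's requests

def account_id_from_vtt_url(url):
    """Path looks like /v1/<account>/<uuid>/<uuid>/<uuid>.vtt."""
    parts = urlparse(url).path.strip("/").split("/")
    return parts[1]

def build_playback_request(account_id, video_id, policy_key):
    """Build (but do not send) a Playback API request for one video."""
    url = (
        "https://edge.api.brightcove.com/playback/v1/"
        f"accounts/{account_id}/videos/{video_id}"
    )
    headers = {"Accept": f"application/json;pk={policy_key}"}
    return urllib.request.Request(url, headers=headers)

def subtitle_urls(video_metadata):
    """Pick (language, url) pairs out of the metadata's text_tracks list."""
    tracks = video_metadata.get("text_tracks", [])
    return [(t.get("srclang"), t["src"]) for t in tracks if t.get("src")]

def fetch_subtitle_urls(account_id, video_id, policy_key=POLICY_KEY):
    """Send the request and return subtitle URLs (needs network access)."""
    request = build_playback_request(account_id, video_id, policy_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return subtitle_urls(json.load(response))
```

If the assumptions hold, calling `fetch_subtitle_urls(account_id_from_vtt_url(VTT_URL), "6352950048112")` for each episode's `pid` would list the VTT URLs without ever needing to predict the UUIDs.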

There are subtitle archive sites, but this soap opera is not on them. So how would you extract a few hundred sets of VTT files? (I want to build NLP datasets, n-grams etc., not make money or anything.)
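Once the VTT files are downloaded, turning them into n-gram counts needs only the standard library. This is a minimal sketch: the sample cues are made up for illustration, `cue_texts` and `ngrams` are hypothetical helper names, and the parser only handles the basic cue layout (optional identifier line, a timing line containing `-->`, then text).

```python
import re
from collections import Counter

# Made-up sample in WebVTT layout, not taken from the real subtitles.
SAMPLE_VTT = """WEBVTT

00:00:01.000 --> 00:00:03.000
Dia duit, a chara.

00:00:03.500 --> 00:00:05.000
<v Speaker>Conas atá tú?
"""

def cue_texts(vtt):
    """Return the text of each cue, with timing lines and tags removed."""
    texts = []
    blocks = re.split(r"\n\s*\n", vtt.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.splitlines()
        timing = [i for i, line in enumerate(lines) if "-->" in line]
        if not timing:
            continue  # header, NOTE or STYLE block
        text = " ".join(lines[timing[0] + 1:])
        text = re.sub(r"<[^>]+>", "", text).strip()  # drop <v ...>, <i>, etc.
        if text:
            texts.append(text)
    return texts

def ngrams(tokens, n):
    """Return all consecutive n-token tuples."""
    return list(zip(*[tokens[i:] for i in range(n)]))

# \w is Unicode-aware in Python 3, so accented letters stay inside tokens.
tokens = re.findall(r"\w+", " ".join(cue_texts(SAMPLE_VTT)).lower())
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(3))
```

Note that the bigrams here run across cue boundaries; counting within each cue separately may suit dialogue better.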

I can imagine answers like:

With this site you can hire someone, and if you show them the steps they can extract them for you cheaply.

With this mouse emulator you can do it by XYZ.

There is a way around the UUIDs being random by XYZ.

But I do not know how any of these would actually work.

submitted by /u/cavedave