Parse Video and Presentation from NVIDIA's GTC Website

While making the recordings and slides publicly available online, NVIDIA changed the URLs I had noted down earlier. Instead of going through every session on my list again by hand, this Notebook looks them up automatically.

requests fetches the pages and BeautifulSoup parses them; the rest is massaging. (It's quite slow because it does not run in parallel. I hear there are more capable scrapers out there…)

In [1]:
import requests
from bs4 import BeautifulSoup

Everything on the GTC On Demand website goes through searches (:()

In [69]:

Walk down the tree to get the links; the nested if checks are there in case one of the expected elements or links is missing

In [158]:
def getLinks(soup):
    # Session page link: first <a> inside #pageWide > .quick-link-area
    link = None
    pageWide = soup.find(id='pageWide')
    quickLinks = pageWide.find(class_="quick-link-area") if pageWide is not None else None
    anchor = quickLinks.find("a") if quickLinks is not None else None
    if anchor is not None:
        link = anchor.get('href')
    # Recording (.mp4) and slides (.pdf): first <a> of each .wmv-text inside #all_data
    mp4 = None
    pdf = None
    allData = soup.find(id="all_data")
    if allData is not None:
        for entry in allData.find_all(class_="wmv-text"):
            anchor = entry.find("a")
            href = anchor.get('href') if anchor is not None else None
            if href is None:
                continue
            if ".mp4" in href.lower():
                mp4 = href
            if ".pdf" in href.lower():
                pdf = href
    return {
        "link": link,
        "mp4": mp4,
        "pdf": pdf
    }
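
As an aside, the same walk down the tree can probably be written more compactly with BeautifulSoup's CSS selectors. The following is only a sketch: get_links_css and SAMPLE_HTML are names made up for this illustration, the HTML is a hand-written stand-in (the real GTC markup may look different), and unlike getLinks it looks at every anchor inside a wmv-text element, not just the first one.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a session page; an assumption, not real GTC markup.
SAMPLE_HTML = """
<div id="pageWide">
  <div class="quick-link-area"><a href="https://example.com/session">Session</a></div>
</div>
<div id="all_data">
  <div class="wmv-text"><a href="https://example.com/talk.MP4">Video</a></div>
  <div class="wmv-text"><a href="https://example.com/talk.pdf">Slides</a></div>
  <div class="wmv-text">no anchor here</div>
</div>
"""

def get_links_css(soup):
    """Hypothetical variant of getLinks that uses CSS selectors."""
    result = {"link": None, "mp4": None, "pdf": None}
    # select_one returns None if nothing matches, so one check is enough.
    anchor = soup.select_one("#pageWide .quick-link-area a[href]")
    if anchor is not None:
        result["link"] = anchor["href"]
    # select returns an empty list if nothing matches; a[href] skips anchors without href.
    for anchor in soup.select("#all_data .wmv-text a[href]"):
        href = anchor["href"]
        if ".mp4" in href.lower():
            result["mp4"] = href
        if ".pdf" in href.lower():
            result["pdf"] = href
    return result

sample_links = get_links_css(BeautifulSoup(SAMPLE_HTML, "html.parser"))
empty_links = get_links_css(BeautifulSoup("<p>nothing</p>", "html.parser"))
print(sample_links)
print(empty_links)
```

The selector strings carry the "is this element there at all?" checks, so a page missing any of the expected elements simply yields None for that entry.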

Combine requests with the custom BeautifulSoup parsing

In [132]:
def idToLinks(id):
    r = requests.get(BASEURL + str(id))
    return getLinks(BeautifulSoup(r.content, "lxml"))
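
idToLinks has no timeout and does not check the HTTP status, so a hanging or failing request goes unnoticed. A slightly more defensive fetch could look like the sketch below. fetch_html and its parameters are hypothetical names for this illustration; timeout and raise_for_status are regular requests features.

```python
import requests

def fetch_html(url, session=None, timeout=10):
    """Hypothetical helper: fetch a page, fail loudly on HTTP errors."""
    if session is None:
        # Reusing one Session across calls would also reuse the connection.
        session = requests.Session()
    response = session.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.content
```

With this, idToLinks could call getLinks(BeautifulSoup(fetch_html(BASEURL + str(id)), "lxml")) and leave the error handling to the caller.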

Here are the sessions I have on my list

In [74]:
ids = ["S7824", "S7622", "S7495", "S7122", "S7445", "S7444", "S7362", "S7628", "S7285", "S7764", "S7128", "S7700", "S7628", "S7150", "S7405", "S7438", "S7133", "S7142", "S7356", "S7546", "S7155", "S7344", "S7192", "S7496", "S7626", "S7636", "S7341", "S7640", "S7672", "S7635", "S7478", "S7193", "S7735", "S7382", "S7535", "S7457", "S7515", "S7800", "S7860", "S7666", "S7804", "SE7142", "S7564", "S7332", "S7785", "S7609", "S7590", "S7296", "S7329", "S7482", "S7642"]
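
Note that S7628 is on the list twice, so it gets fetched twice. If that is not wanted, duplicates can be dropped while keeping the order, for example with dict.fromkeys. This is a small sketch on a shortened stand-in list (ids_example is a made-up name):

```python
# Shortened stand-in for the ids list above.
ids_example = ["S7628", "S7285", "S7628", "S7150"]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes repeats without reshuffling the list.
unique_ids = list(dict.fromkeys(ids_example))
print(unique_ids)
```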

This goes through the list of IDs and serially extracts the links for them. It's slow. But it's cached…

In [160]:
allLinks = [idToLinks(id) for id in ids]
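
Since the time is spent waiting for the website, the serial loop above could be swapped for a thread pool. The following is a minimal sketch, not something I ran against the GTC site: fetch_all and fake_fetch are hypothetical names, and fake_fetch only stands in for idToLinks so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(session_ids, fetch_one, max_workers=8):
    """Hypothetical helper: call fetch_one for every ID using a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the input IDs,
        # even if the requests finish in a different order.
        return list(pool.map(fetch_one, session_ids))

def fake_fetch(session_id):
    # Stand-in for idToLinks; returns a dict of the same shape.
    return {"link": "https://example.com/" + session_id, "mp4": None, "pdf": None}

demo_links = fetch_all(["S0001", "S0002", "S0003"], fake_fetch)
print(demo_links)
```

In the Notebook this would read allLinks = fetch_all(ids, idToLinks). A small max_workers keeps the load on the server modest.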

DEBUG: Have a look at the entries that have None for the link, the recording, or the slides

In [161]:
for id, links in zip(ids, allLinks):
    if (links["link"] is None) or ((links["mp4"] is None) or (links["pdf"] is None)):
        print("ID {} has Nones: {}  {}  {}".format(id, links["link"], links["mp4"], links["pdf"]))
ID S7362 has Nones:  None
ID S7700 has Nones:  None
ID S7535 has Nones:  None
ID S7800 has Nones:  None
ID S7804 has Nones:  None
ID SE7142 has Nones:  None  None
ID S7590 has Nones:  None  None
ID S7482 has Nones:  None  None
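
The positional output above makes it hard to tell which entry is the missing one. A variant that names the missing fields could look like this sketch; missing_fields and the sample data are made up for illustration.

```python
def missing_fields(links):
    """Hypothetical helper: names of the entries that are None."""
    return [key for key in ("link", "mp4", "pdf") if links.get(key) is None]

# Made-up sample data in the shape returned by getLinks.
sample = {
    "S0001": {"link": "https://example.com/1", "mp4": "https://example.com/1.mp4", "pdf": None},
    "S0002": {"link": "https://example.com/2", "mp4": None, "pdf": None},
    "S0003": {"link": "https://example.com/3", "mp4": "https://example.com/3.mp4", "pdf": "https://example.com/3.pdf"},
}

report = {sid: missing_fields(links) for sid, links in sample.items() if missing_fields(links)}
for sid, missing in report.items():
    print("ID {} is missing: {}".format(sid, ", ".join(missing)))
```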

Put the information into a string (for each ID)

In [162]:
def genString(link):
    compiledString = "("
    compiledString += "[link]({})".format(link["link"])
    if link["mp4"] is not None:
        compiledString += ", [recording]({})".format(link["mp4"])
    if link["pdf"] is not None:
        compiledString += ", [slides]({})".format(link["pdf"])
    compiledString += ")"
    return compiledString
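
One thing to be aware of: genString always emits the [link](…) part, so a session without a link would end up as [link](None). A variant that skips every missing entry could be sketched like this; gen_string is a hypothetical name and the URLs are placeholders.

```python
def gen_string(links):
    """Hypothetical variant of genString that skips every entry that is None."""
    labels = (("link", "link"), ("mp4", "recording"), ("pdf", "slides"))
    parts = [
        "[{}]({})".format(label, links[key])
        for key, label in labels
        if links.get(key) is not None
    ]
    return "(" + ", ".join(parts) + ")"

with_slides = gen_string({"link": "https://example.com/s", "mp4": None, "pdf": "https://example.com/s.pdf"})
nothing = gen_string({"link": None, "mp4": None, "pdf": None})
print(with_slides)
print(nothing)
```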

Finally, loop through all the ids

In [168]:
for id, links in zip(ids, allLinks):
    print("ID {} ----".format(id), genString(links))
ID S7824 ---- ([link](, [recording](, [slides](
ID S7622 ---- ([link](, [recording](, [slides](
ID S7495 ---- ([link](, [recording](, [slides](
ID S7122 ---- ([link](, [recording](, [slides](
ID S7445 ---- ([link](, [recording](, [slides](
ID S7444 ---- ([link](, [recording](, [slides](
ID S7362 ---- ([link](, [recording](
ID S7628 ---- ([link](, [recording](, [slides](
ID S7285 ---- ([link](, [recording](, [slides](
ID S7764 ---- ([link](, [recording](, [slides](
ID S7128 ---- ([link](, [recording](, [slides](
ID S7700 ---- ([link](, [recording](
ID S7628 ---- ([link](, [recording](, [slides](
ID S7150 ---- ([link](, [recording](, [slides](
ID S7405 ---- ([link](, [recording](, [slides](
ID S7438 ---- ([link](, [recording](, [slides](
ID S7133 ---- ([link](, [recording](, [slides](
ID S7142 ---- ([link](, [recording](, [slides](
ID S7356 ---- ([link](, [recording](, [slides](
ID S7546 ---- ([link](, [recording](, [slides](
ID S7155 ---- ([link](, [recording](, [slides](
ID S7344 ---- ([link](, [recording](, [slides](
ID S7192 ---- ([link](, [recording](, [slides](
ID S7496 ---- ([link](, [recording](, [slides](
ID S7626 ---- ([link](, [recording](, [slides]( simple-guideline for-code-optimizations-on-modern-architectures-with-openacc-and-cuda.pdf))
ID S7636 ---- ([link](, [recording](, [slides](
ID S7341 ---- ([link](, [recording](, [slides](
ID S7640 ---- ([link](, [recording](, [slides](
ID S7672 ---- ([link](, [recording](, [slides](
ID S7635 ---- ([link](, [recording](, [slides](
ID S7478 ---- ([link](, [recording](, [slides](
ID S7193 ---- ([link](, [recording](, [slides](
ID S7735 ---- ([link](, [recording](, [slides](
ID S7382 ---- ([link](, [recording](, [slides](
ID S7535 ---- ([link](, [recording](
ID S7457 ---- ([link](, [recording](, [slides]( learning demystified_v24.pdf))
ID S7515 ---- ([link](, [recording]( eliminating-the-regular-expression-with-neural-networks.mp4), [slides](
ID S7800 ---- ([link](, [slides](
ID S7860 ---- ([link](, [recording](, [slides](
ID S7666 ---- ([link](, [recording](, [slides](
ID S7804 ---- ([link](, [recording](
ID SE7142 ---- ([link](
ID S7564 ---- ([link](, [recording](, [slides](
ID S7332 ---- ([link](, [recording](, [slides](
ID S7785 ---- ([link](, [recording](, [slides](
ID S7609 ---- ([link](, [recording](, [slides](
ID S7590 ---- ([link](
ID S7296 ---- ([link](, [recording](, [slides](
ID S7329 ---- ([link](, [recording](, [slides](
ID S7482 ---- ([link](
ID S7642 ---- ([link](, [recording](, [slides](

We're done! 😄