# Parse Video and Presentation from NVIDIA's GTC Website

In the process of making the recordings and slides publicly available online, NVIDIA changed the URLs I had noted down earlier. Instead of going through every session on my list again by hand, this Notebook does it automatically.

`requests` is used to get the website, `BeautifulSoup` is used to parse it. The rest is massaging. (It's quite slow, because it does not run in parallel. I heard that there are more apt scrapers out there…)

In [1]:
import requests
from bs4 import BeautifulSoup

Everything on the *GTC On Demand* website goes through searches (:()

In [69]:
BASEURL = "http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php?submit=&searchByKeyword="
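Since every lookup is a search, the query part of the URL can also be assembled with the standard library instead of string concatenation. This is only a sketch: `SEARCH_PAGE` and `buildSearchUrl` are names made up here, and the parameter names are simply read off `BASEURL`.

```python
from urllib.parse import urlencode

# Search page without the query string (taken from BASEURL)
SEARCH_PAGE = "http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php"

def buildSearchUrl(keyword):
    """Return the search URL for one keyword (e.g. a session ID)."""
    # urlencode escapes the keyword, so spaces or special characters are safe
    query = urlencode({"submit": "", "searchByKeyword": keyword})
    return SEARCH_PAGE + "?" + query

exampleUrl = buildSearchUrl("S7824")
print(exampleUrl)
```

For plain session IDs this yields the same URL as `BASEURL + str(id)`; the escaping only matters for keywords with spaces or special characters.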

Go down the tree to get the links; all the `if` cases are there in case one of the expected elements is missing

In [158]:
def getLinks(soup):
    # Quick link: #pageWide -> .quick-link-area -> first <a>
    link = None
    pageWide = soup.find(id='pageWide')
    if pageWide is not None:
        quickLinkArea = pageWide.find(class_="quick-link-area")
        if quickLinkArea is not None:
            anchor = quickLinkArea.find("a")
            if anchor is not None:
                link = anchor.get('href')
    # Recording and slides: each .wmv-text entry below #all_data may hold one of them
    mp4 = None
    pdf = None
    allData = soup.find(id="all_data")
    if allData is not None:
        for entry in allData.find_all(class_="wmv-text"):
            anchor = entry.find("a")
            # .get() returns None instead of raising if the <a> has no href
            if anchor is None or anchor.get("href") is None:
                continue
            href = anchor["href"]
            if ".mp4" in href.lower():
                mp4 = href
            if ".pdf" in href.lower():
                pdf = href
    return {"link": link, "mp4": mp4, "pdf": pdf}
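The chain of `None` checks in `getLinks` could also be folded into a small helper that walks down step by step and gives up at the first missing element. A minimal sketch, with `dig` being a hypothetical name and a nested dict standing in for the parsed page so that it runs without `BeautifulSoup`:

```python
def dig(node, *steps):
    """Apply each step to the current node; return None as soon as one is missing."""
    for step in steps:
        if node is None:
            return None
        node = step(node)
    return node

# Stand-in for a parsed page (assumption: only for illustration)
page = {"pageWide": {"quick-link-area": {"a": "quicklink-url"}}}

found = dig(page,
            lambda n: n.get("pageWide"),
            lambda n: n.get("quick-link-area"),
            lambda n: n.get("a"))
missing = dig(page,
              lambda n: n.get("nope"),
              lambda n: n.get("a"))
print(found, missing)
```

With a real soup, the steps would be calls like `lambda n: n.find(id='pageWide')` and `lambda n: n.find(class_="quick-link-area")`.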

Combine `requests` with the custom `BeautifulSoup` parsing

In [132]:
def idToLinks(id):
    r = requests.get(BASEURL + str(id))
    return getLinks(BeautifulSoup(r.content, "lxml"))

Next, the sessions I have on my list

In [74]:
ids = ["S7824", "S7622", "S7495", "S7122", "S7445", "S7444", "S7362", "S7628", "S7285", "S7764", "S7128", "S7700", "S7628", "S7150", "S7405", "S7438", "S7133", "S7142", "S7356", "S7546", "S7155", "S7344", "S7192", "S7496", "S7626", "S7636", "S7341", "S7640", "S7672", "S7635", "S7478", "S7193", "S7735", "S7382", "S7535", "S7457", "S7515", "S7800", "S7860", "S7666", "S7804", "S7564", "S7332", "S7785", "S7609", "S7590", "S7296", "S7329", "S7482", "S7642"]

This goes through the list of IDs and serially extracts the links for them. It's slow. But it's cached…

In [160]:
allLinks = [idToLinks(id) for id in ids]
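Because each lookup mostly waits on the network, a thread pool would probably speed this up. The following is a sketch, not what this Notebook does: `fetchAll` and `fakeFetch` are hypothetical names, `maxWorkers=8` is an arbitrary guess, and `fakeFetch` replaces `idToLinks` so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

def fetchAll(ids, fetch, maxWorkers=8):
    """Call fetch(id) for every ID on a thread pool; results keep the order of ids."""
    with ThreadPoolExecutor(max_workers=maxWorkers) as pool:
        # pool.map yields results in input order, regardless of completion order
        return list(pool.map(fetch, ids))

def fakeFetch(sessionId):
    # Stand-in for idToLinks (assumption: avoids network access in this sketch)
    return {"link": "quicklink-for-" + sessionId, "mp4": None, "pdf": None}

demoLinks = fetchAll(["S7824", "S7622", "S7495"], fakeFetch)
print(demoLinks[0]["link"])
```

In the Notebook this would read `allLinks = fetchAll(ids, idToLinks)`. A small worker count seems polite towards the server.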

**DEBUG** Have a look at the entries which have a `None` at the link, video, or presentation positions

In [161]:
for id, links in zip(ids, allLinks):
    if (links["link"] is None) or ((links["mp4"] is None) or (links["pdf"] is None)):
        print("ID {} has Nones: {}  {}  {}".format(id, links["link"], links["mp4"], links["pdf"]))

ID S7362 has Nones: http://on-demand-gtc.gputechconf.com/gtc-quicklink/77mBeE  http://on-demand.gputechconf.com/gtc/2017/video/s7362-yifan-sun-frank-zhao-benchmarking-the-new-unified-memory-of-cuda-8.mp4  None
ID S7700 has Nones: http://on-demand-gtc.gputechconf.com/gtc-quicklink/iflrAb  http://on-demand.gputechconf.com/gtc/2017/video/s7700-mason-introduction-gpu-memory-model-presented-by-acceleware.mp4  None
ID S7535 has Nones: http://on-demand-gtc.gputechconf.com/gtc-quicklink/5g6tCZ  http://on-demand.gputechconf.com/gtc/2017/video/s7535-ronald-caplan-potential-field-solutions-of-the-solar-corona-converting-a-pcg-solver-from-mpi-to-mpi+openacc.mp4  None
ID S7800 has Nones: http://on-demand-gtc.gputechconf.com/gtc-quicklink/2w9shU  None  http://on-demand.gputechconf.com/gtc/2017/presentation/s7800-justin-lawyer-machine-learning-service.pdf
ID S7804 has Nones: http://on-demand-gtc.gputechconf.com/gtc-quicklink/1EWCvX  http://on-demand.gputechconf.com/gtc/2017/video/s7804-dobson-tensorf

Put the information into a string (for each ID)

In [162]:
def genString(link):
    compiledString = "("
    compiledString += "[link]({})".format(link["link"])
    if link["mp4"] is not None:
        compiledString += ", [recording]({})".format(link["mp4"])
    if link["pdf"] is not None:
        compiledString += ", [slides]({})".format(link["pdf"])
    compiledString += ")"
    return compiledString
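`genString` always emits the `[link](...)` part, so an entry without a quick link would show up as `[link](None)`. A possible variant that skips every missing piece is sketched below; `genStringJoined` is a made-up name and the example URLs are placeholders.

```python
def genStringJoined(links):
    """Build the same Markdown string as genString, leaving out entries that are None."""
    labels = [("link", "link"), ("recording", "mp4"), ("slides", "pdf")]
    parts = ["[{}]({})".format(label, links[key])
             for label, key in labels
             if links.get(key) is not None]
    return "(" + ", ".join(parts) + ")"

# Placeholder URLs, only to show the shape of the output
withoutVideo = genStringJoined({"link": "L", "mp4": None, "pdf": "P"})
print(withoutVideo)
```

For entries where all three links exist, the output matches that of `genString`.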

Finally, loop through all the ids

In [168]:
for id, links in zip(ids, allLinks):
    print("ID {} ----".format(id), genString(links))

ID S7824 ---- ([link](http://on-demand-gtc.gputechconf.com/gtc-quicklink/eioXev4), [recording](http://on-demand.gputechconf.com/gtc/2017/video/s7824-sanjiv-satoor-developer-tools-update-in-cuda-9.mp4), [slides](http://on-demand.gputechconf.com/gtc/2017/presentation/s7824-rafeal-campana-developer-tools-update-in-cuda9.pdf))
ID S7622 ---- ([link](http://on-demand-gtc.gputechconf.com/gtc-quicklink/bekSPc), [recording](http://on-demand.gputechconf.com/gtc/2017/video/s7622-perelygin-robust-scalable-cuda-parallel-programming-model.mp4), [slides](http://on-demand.gputechconf.com/gtc/2017/presentation/s7622-Kyrylo-perelygin-robust-and-scalable-cuda.pdf))
ID S7495 ---- ([link](http://on-demand-gtc.gputechconf.com/gtc-quicklink/imKh6), [recording](http://on-demand.gputechconf.com/gtc/2017/video/s7495-jain-optimizing-application-performance-cuda-profiling.mp4), [slides](http://on-demand.gputechconf.com/gtc/2017/presentation/s7495-jain-optimizing-application-performance-cuda-profiling-tools.pdf))


We're done!
😄