Extracting Daily YouTube Trending Video Statistics

using YouTube Data API and a scheduled notebook in Deepnote

Syahrul Hamdani
Geek Culture


“YouTube is much more than TV”

We may have heard someone say this somewhere. At least in Indonesia, this statement is heard quite often. I personally thought so too… before I checked the trending list.

Most of the videos that make the trending list are, I presume, basically TV content. So why don't I track the list myself?

About The Article

Inspired by similar work from others, I'd like to share how I built a Python script that automatically extracts trending videos using the YouTube Data API, and how I scheduled its runs from a notebook hosted in Deepnote.

What You Need

First, since we want to retrieve YouTube video statistics, we need to follow a few “getting started” steps to acquire the API key needed to interact with the YouTube Data API. Basically, we create a project in the Google Developer Console, obtain a credential with access to the YouTube Data API, and get an API key. Check the docs here for further detail.

Next, make sure you have Python installed along with two main libraries: requests and pandas. I didn’t do any fancy analysis after the data was collected, so only those two libraries are needed. You’ll also want kaggle, since I store all the collected data in a Kaggle Dataset. With pipenv, something like pipenv install requests pandas kaggle covers all three.

Understanding YouTube’s Resources

YouTube has a lot of resources and resource types. Quoting the documentation, “a resource is an individual data entity with a unique identifier”. In this case, we only care about the video resource, which represents a single YouTube video.

Each resource has methods we can use to do what we want. The video resource itself has methods such as getRating, reportAbuse, list, etc. In our case, since we want to extract a list of trending videos, we go with the list method.

The list method returns a list of videos that match our request, so we need to tell the API what kind of videos we are requesting. The resulting resource representation, wrapped in JSON format, is divided into sections called parts, each with its own properties.

For example, the snippet part consists of properties such as publishedAt, channelId, title, and description, while the statistics part consists of properties like viewCount, likeCount, and dislikeCount.
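To make this concrete, here is a minimal sketch of such a list request using requests. The endpoint and parameter names come from the YouTube Data API docs; the regionCode and maxResults values are just illustrative.

import requests

API_KEY = "your-api-key"  # obtained from the Google Developer Console
URL = "https://www.googleapis.com/youtube/v3/videos"

# Ask for the trending ("mostPopular") chart in Indonesia, returning
# the snippet and statistics parts for each video.
params = {
    "part": "snippet,statistics",
    "chart": "mostPopular",
    "regionCode": "ID",
    "maxResults": 50,
    "key": API_KEY,
}
response = requests.get(URL, params=params)
response.raise_for_status()
data = response.json()

for item in data["items"]:
    print(item["snippet"]["title"], item["statistics"].get("viewCount"))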

The Script

The script I created follows a modular design, which means I didn’t write all the code in one large main.py file. Instead, I divided it into several parts:

  • a repository code defined in the YouTube class
  • some utility functions in the common directory
  • other chores like a logging setup, exception handlers, etc.

All those modules are used in the main script, main.py. The full directory structure looks like this.

I used pipenv and pyenv to manage the Python version and virtual environment, hence the Pipfile files.

.
├── Pipfile
├── Pipfile.lock
├── README.md
├── common
│   ├── __init__.py
│   ├── utils.py
│   └── video.py
├── data
├── last_retrieved_trending.py
├── main.py
├── storage
│   └── trending.csv
└── ytid
    ├── __init__.py
    ├── config.py
    ├── exceptions.py
    └── logger.py

YouTube API Class

I defined a YouTube class for interacting with the YouTube API. We can consider this our Python wrapper for YouTube.

ytid/__init__.py

As we can see there, the get_trendings method paginates through the response items, following the nextPageToken cursor until no cursor is found. It’s our main method.
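The full class isn’t reproduced here, but a minimal sketch of that pagination loop, assuming the same constructor arguments used below (everything else is illustrative), could look like this:

import requests

class YouTube:
    """Thin wrapper around the YouTube Data API videos endpoint."""

    def __init__(self, url, api_key):
        self.url = url
        self.api_key = api_key

    def get_trendings(self, region_code="ID", max_results=50):
        """Collect all trending videos, following the nextPageToken cursor."""
        params = {
            "part": "snippet,contentDetails,statistics",
            "chart": "mostPopular",
            "regionCode": region_code,
            "maxResults": max_results,
            "key": self.api_key,
        }
        videos = []
        while True:
            response = requests.get(self.url, params=params)
            response.raise_for_status()
            payload = response.json()
            videos.extend(payload["items"])
            cursor = payload.get("nextPageToken")
            if cursor is None:  # no cursor left: we reached the last page
                break
            params["pageToken"] = cursor
        return videos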

In main.py, we can then import the class and write:

youtube = YouTube(url=config.URL, api_key=config.API_KEY)
videos = youtube.get_trendings()

Video

For each video collected from the response items, we define a Video class, which represents a YouTube video resource.

common/video.py

Since we ask YouTube to return the snippet, contentDetails, and statistics parts, we first provide the necessary properties for each part. You can check the other part resources in the YouTube Data API docs.
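The actual definitions live in common/video.py; as a rough sketch (the field names are my own picks from the documented part properties, and the repository may differ), it could look like this:

from dataclasses import dataclass

@dataclass
class Video:
    """A single YouTube video resource, flattened from the API response."""
    video_id: str
    published_at: str
    channel_id: str
    title: str
    description: str
    duration: str  # ISO 8601 duration, from the contentDetails part
    view_count: int
    like_count: int

    @classmethod
    def from_item(cls, item):
        """Build a Video from one element of the response's items list."""
        snippet = item["snippet"]
        stats = item["statistics"]
        return cls(
            video_id=item["id"],
            published_at=snippet["publishedAt"],
            channel_id=snippet["channelId"],
            title=snippet["title"],
            description=snippet["description"],
            duration=item["contentDetails"]["duration"],
            view_count=int(stats.get("viewCount", 0)),
            like_count=int(stats.get("likeCount", 0)),
        )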

Chore Modules

There are also some chore modules. Secrets like the API key are stored as environment variables, and since pipenv loads .env files automatically, we can read them with the os.getenv function in config.py. We also define some custom exceptions and a logging format.
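A minimal sketch of that config module, assuming the environment variable is named API_KEY (the actual names in the repository may differ):

# ytid/config.py — reads secrets from environment variables.
# pipenv loads a .env file automatically, so API_KEY can live there.
import os

URL = "https://www.googleapis.com/youtube/v3/videos"
API_KEY = os.getenv("API_KEY")

if API_KEY is None:
    raise RuntimeError("API_KEY environment variable is not set")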

Scheduling Notebook

Image from https://analyticsdrift.com/data-science-real-time-collaboration-tool-deepnote-is-now-open-for-all/

After the code is ready, it’s time to run it and get the trending data. Since I want to collect daily trending data, it needs to run daily. My first attempt was to run it manually every day… until I forgot at some point, and some trending data wasn’t collected properly.

In May 2021, I received a product update email announcing that Deepnote now supports scheduling a notebook! At that moment, I knew this was where I could run the YouTube trending code automatically, on a schedule.

Quoting from deepnote.com, “Deepnote is a new kind of data science notebook. Jupyter-compatible with real-time collaboration and running in the cloud.”

Below is a snapshot of the published article where the code is scheduled to run daily at 1 PM UTC+7. If you are curious, check the article here.

Image by author

If you want to use the scheduling features on Deepnote, read through the documentation here.

Push to Kaggle Dataset

After retrieving the trending data, the next step is to upload it to a Kaggle Dataset. Again, since I want the upload/update step embedded in the scheduled notebook, I use the Kaggle API and run it as a bash command in a notebook cell, as shown in the snapshot below.

Image by author
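The exact cell isn’t shown here, but with the kaggle CLI the update step looks roughly like this; the storage folder and commit message are illustrative, and the folder needs a dataset-metadata.json describing the dataset:

!kaggle datasets version -p storage -m "Daily trending update"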

To close this…

Thank you for reading until this final section. You’ve now seen my approach to retrieving YouTube trending data, both the code and the “deployment”. I encourage you to take a look at the code repository below for full details.

Thank you and stay safe!
