Extracting Daily YouTube Trending Video Statistics
using YouTube Data API and a scheduled notebook in Deepnote
“YouTube is much more than TV”
We may have heard someone say this somewhere. At least in Indonesia, this statement is heard quite often. I personally thought so too… before I checked the trending list.
Most of the videos that make the trending list are basically from TV, I presume. So, why don’t I track the list myself?
About The Article
Inspired by similar work from others, I’d like to share how I built a Python script to automatically extract trending videos using the YouTube Data API and schedule the run from a notebook hosted in Deepnote.
What You Need
First, since we want to retrieve YouTube video statistics, we need to follow a few “getting started” steps to acquire the API key needed to interact with the YouTube Data API. Basically, we create a project in the Google Developer Console, obtain a credential with access to the YouTube Data API, and get an API key. Check the docs here for further detail.
Next, make sure to have Python and two main libraries installed: requests and pandas. I didn’t do any fancy analysis after the data was collected, so only those two libraries are needed. I also installed kaggle, since I stored all the collected data in a Kaggle Dataset.
Understanding YouTube’s Resources
YouTube has a lot of resources and resource types. Quoting from the documentation, “a resource is an individual data entity with a unique identifier”. In this case, we only care about the video resource, which represents a single YouTube video.
Each resource has methods we can use to do what we want. The video resource itself has methods such as getRating, reportAbuse, list, etc. In our case, since we want to extract a list of trending videos, we go with the list method.
The list method returns a list of videos that match our request, so we need to tell the API what kind of videos we are requesting. The resulting resource representation is divided into sections, called parts, each with its own properties, wrapped in JSON format.
For example, the snippet part consists of properties such as publishedAt, channelId, title, description, etc., while the statistics part consists of properties like viewCount, likeCount, dislikeCount, etc.
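To make the part structure concrete, here is a trimmed sketch of a single item from the JSON response. All values are invented for illustration; the property names come from the snippet and statistics parts described above.

```python
# A trimmed, made-up example of one item from a videos.list response,
# showing how properties are grouped under each requested part.
video_item = {
    "id": "abc123",
    "snippet": {
        "publishedAt": "2021-06-01T10:00:00Z",
        "channelId": "UC_example",
        "title": "An example trending video",
        "description": "Just an illustration.",
    },
    "statistics": {
        "viewCount": "123456",
        "likeCount": "7890",
        "dislikeCount": "12",
    },
}

# Each part is a nested object, so we reach properties per part:
title = video_item["snippet"]["title"]
views = int(video_item["statistics"]["viewCount"])  # counts arrive as strings
```

Note that the API returns the count properties as strings, so they need to be cast to integers before doing any arithmetic.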
The Script
The script I created follows a modular design, which means I don’t write all the code in one large main.py file. Instead, I divided it into several parts:
- repository code defined in the YouTube class
- some utility functions in the common directory
- other chores like a logging setup, exception handlers, etc.
All those modules are used in the main script, main.py. The full directory structure looks like this.
I used pipenv and pyenv to manage the Python version and the virtual environment, hence the Pipfile files.
.
├── Pipfile
├── Pipfile.lock
├── README.md
├── common
│ ├── __init__.py
│ ├── utils.py
│ └── video.py
├── data
├── last_retrieved_trending.py
├── main.py
├── storage
│ └── trending.csv
└── ytid
├── __init__.py
├── config.py
├── exceptions.py
└── logger.py
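For reference, a minimal Pipfile for this setup might look roughly like the following. The package versions and the Python version are illustrative, not my exact pins:

```toml
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
requests = "*"
pandas = "*"
kaggle = "*"

[requires]
python_version = "3.9"
```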
YouTube API Class
I defined a YouTube class for interacting with the YouTube API. We can consider this our Python API for YouTube.
As we can see there, the get_trendings method iterates through the response items until no cursor (page token) is found. It’s our main method.
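The embedded gist isn’t reproduced here, so below is a minimal sketch of what such a class could look like. The class and method names (YouTube, get_trendings) and the url/api_key arguments follow the article; the region code, page size, and parameter handling are my own assumptions based on the videos.list endpoint:

```python
import requests


class YouTube:
    """Thin wrapper around the YouTube Data API videos.list endpoint."""

    def __init__(self, url, api_key, region="ID"):
        self.url = url          # e.g. https://www.googleapis.com/youtube/v3/videos
        self.api_key = api_key
        self.region = region    # trending chart is per-region

    def get_trendings(self):
        """Collect every page of the trending chart, following the cursor."""
        videos, page_token = [], None
        while True:
            params = {
                "part": "snippet,contentDetails,statistics",
                "chart": "mostPopular",   # the trending chart
                "regionCode": self.region,
                "maxResults": 50,         # maximum allowed per page
                "key": self.api_key,
            }
            if page_token:
                # cursor returned by the previous page
                params["pageToken"] = page_token
            data = requests.get(self.url, params=params).json()
            videos.extend(data.get("items", []))
            page_token = data.get("nextPageToken")
            if not page_token:            # no cursor left: we have every page
                break
        return videos
```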
In main.py, we can then import the class and write:
youtube = YouTube(url=config.URL, api_key=config.API_KEY)
videos = youtube.get_trendings()
Video
For each video collected from the response items, we define a Video class, which represents a YouTube video resource.
Since we ask YouTube to return the snippet, contentDetails, and statistics parts, we first provide the necessary properties for each part. You can check the other parts in the YouTube Data API docs.
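As a sketch of the idea, a Video class could pick out a handful of properties from each requested part. The exact field set here is my assumption, not the original code:

```python
from dataclasses import dataclass


@dataclass
class Video:
    """Flat representation of one YouTube video resource.

    Fields are drawn from the snippet, contentDetails, and statistics
    parts; the selection here is illustrative.
    """
    video_id: str
    title: str
    channel_id: str
    published_at: str
    duration: str        # from contentDetails, ISO 8601 (e.g. "PT4M13S")
    view_count: int
    like_count: int

    @classmethod
    def from_item(cls, item):
        """Build a Video from one item of the videos.list response."""
        snippet = item["snippet"]
        details = item["contentDetails"]
        stats = item["statistics"]
        return cls(
            video_id=item["id"],
            title=snippet["title"],
            channel_id=snippet["channelId"],
            published_at=snippet["publishedAt"],
            duration=details["duration"],
            view_count=int(stats.get("viewCount", 0)),
            like_count=int(stats.get("likeCount", 0)),
        )
```

A list of such objects converts directly to a pandas DataFrame for saving to CSV.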
Chore Modules
There are also some chore modules. Secrets like the API key are stored as environment variables, and by using pipenv we can read them with the os.getenv function, which lives in config.py. We also define some custom exceptions and a logging format.
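A config.py along these lines could read the secrets. The environment variable name YOUTUBE_API_KEY is my assumption; with pipenv, variables declared in a .env file are loaded automatically when the virtual environment starts:

```python
# config.py -- central place for settings and secrets.
import os

# videos.list endpoint of the YouTube Data API
URL = "https://www.googleapis.com/youtube/v3/videos"

# Read the secret from the environment; never hardcode it in the repo.
# The variable name below is illustrative.
API_KEY = os.getenv("YOUTUBE_API_KEY")
```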
Scheduling Notebook
After the code is ready, it’s time to run it and get the trending data. Since I want to collect daily trending data, it needs to run daily. My first attempt was to run it manually every day, until I forgot to run it at some point, and so some trending data wasn’t collected properly.
In May 2021, I received a product update through email announcing that Deepnote now supports scheduling a notebook! At that moment, I knew this was where I could run the YouTube trending code automatically and on a schedule.
Quoting from deepnote.com, “Deepnote is a new kind of data science notebook. Jupyter-compatible with real-time collaboration and running in the cloud.”
Below is a snapshot of the published article where the code is scheduled to run daily at 1 PM UTC+7. If you are curious, check the article here.
If you want to use the scheduling features on Deepnote, read through the documentation here.
Push to Kaggle Dataset
After retrieving the trending data, the next step is to upload it to the Kaggle Dataset. Again, since I want to embed the upload/update step in the scheduled notebook, I use the kaggle API and run it as a bash command in a notebook cell.
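The upload step can look roughly like the following. The folder name matches the storage directory from the tree above; the dataset itself must already exist on Kaggle with a dataset-metadata.json in that folder (created once with kaggle datasets init), and the commit message is just an example:

```shell
# Run inside a notebook cell (prefix with "!" in Jupyter/Deepnote).
# Creates a new version of an existing Kaggle dataset from the storage folder.
kaggle datasets version -p storage -m "Daily trending update"
```

The kaggle CLI reads its credentials from the KAGGLE_USERNAME and KAGGLE_KEY environment variables, which fits nicely with the environment-variable approach used for the YouTube API key.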
To close this…
Thank you for reading to the final section. So far, you’ve seen my approach to retrieving YouTube trending data, both the code and the “deployment”. I encourage you to take a look at the code repository below for full details.
Thank you and stay safe!