We collected the YouTube data using a set of custom scripts written in Perl. We bootstrapped the data collection process with a Perl script that retrieved the list of newly uploaded videos on YouTube. A separate script then periodically accessed the statistics page corresponding to each video, collected the relevant video characteristics, and stored them in a MySQL database for later analysis. The script parsed the HTML content of the statistics pages by looking for key markers in the HTML tags associated with the various video-related data, using Perl's HTML::Parser library to perform the extraction.
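For concreteness, the sketch below shows the general shape of such an extraction script: it fetches a video page, watches for a marker in the HTML via HTML::Parser's event handlers, and records the extracted value in MySQL through DBI. The URL pattern, the watch-view-count marker, and the video_stats table are illustrative placeholders rather than the exact markers our scripts used, since YouTube's markup has changed repeatedly.

```perl
#!/usr/bin/perl
# Sketch: fetch a video page, pull out the view count by watching for a
# marker in the HTML, and store the observation in MySQL.
# The URL pattern, the "watch-view-count" marker, and the table/column
# names are placeholders, not YouTube's actual (and changing) markup.
use strict;
use warnings;
use LWP::UserAgent;
use HTML::Parser;
use DBI;

my $video_id = shift @ARGV or die "usage: $0 <video_id>\n";
my $ua   = LWP::UserAgent->new(timeout => 30);
my $resp = $ua->get("http://www.youtube.com/watch?v=$video_id");
die "fetch failed: " . $resp->status_line unless $resp->is_success;

my ($in_views, $view_count) = (0, undef);

# Event-driven extraction: note when we enter the tag carrying the
# view-count marker, then grab the text inside it.
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr) = @_;
        $in_views = 1 if ($attr->{class} || '') =~ /watch-view-count/;
    }, 'tagname, attr' ],
    text_h => [ sub {
        my ($text) = @_;
        if ($in_views && $text =~ /([\d,]+)/) {
            ($view_count = $1) =~ s/,//g;
            $in_views = 0;
        }
    }, 'dtext' ],
);
$parser->parse($resp->decoded_content);
$parser->eof;

die "view count not found\n" unless defined $view_count;

# Store the observation with a timestamp for later time-series analysis.
my $dbh = DBI->connect('DBI:mysql:database=youtube', 'user', 'password',
                       { RaiseError => 1 });
$dbh->do('INSERT INTO video_stats (video_id, views, observed_at) VALUES (?, ?, NOW())',
         undef, $video_id, $view_count);
$dbh->disconnect;
```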

Concurrently, we used a separate set of Perl scripts deployed on a cluster of workstations to collect data on the social network of the authors who seeded the videos. The video page provided a link to the author's page, which in turn contained data on the author and the author's social network; for instance, the author's page lists the identities of his or her directly connected friends. We used a cluster of workstations in order to collect a snapshot of the social network structure within four days. The entire process was managed by a centralized controller that was responsible for handing out the network-crawling tasks to the individual computers, monitoring their progress, and occasionally reissuing tasks if they were not completed within a specified time interval. The social network data was also stored in a MySQL database and then analyzed using custom programs written in C. The analysis yielded the various social network metrics that we use in the paper, e.g., degree, number of second-degree friends, clustering, and betweenness centrality.
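The controller's bookkeeping can be sketched as follows, assuming the task list is kept in a MySQL table (here called crawl_tasks): unclaimed author IDs are handed out to worker machines round robin, the issue time is recorded, and any task still unfinished after a timeout is returned to the pool and reissued. The table layout, host names, timeout value, and the dispatch_to_worker stub are assumptions for illustration, not the actual controller code.

```perl
#!/usr/bin/perl
# Sketch of the controller's task-management loop. Assumed schema:
#   crawl_tasks(author_id, status, worker, issued_at)
# with status in ('new', 'issued', 'done').
use strict;
use warnings;
use DBI;

my $TIMEOUT_SECS = 30 * 60;                         # assumed reissue timeout
my @workers      = ('node01', 'node02', 'node03');  # placeholder host names

my $dbh = DBI->connect('DBI:mysql:database=youtube', 'user', 'password',
                       { RaiseError => 1 });

# Stub: in the real system this would push the crawl job to a workstation
# (e.g., over ssh or a job queue) and return immediately.
sub dispatch_to_worker {
    my ($worker, $author_id) = @_;
    print "issuing author $author_id to $worker\n";
}

while (1) {
    # Reclaim tasks whose workers appear to have stalled or died.
    $dbh->do(q{UPDATE crawl_tasks SET status = 'new'
               WHERE status = 'issued'
                 AND TIMESTAMPDIFF(SECOND, issued_at, NOW()) > ?},
             undef, $TIMEOUT_SECS);

    # Hand out a batch of unclaimed tasks, one per worker, round robin.
    my $batch = scalar @workers;
    my $rows  = $dbh->selectall_arrayref(
        qq{SELECT author_id FROM crawl_tasks WHERE status = 'new' LIMIT $batch});
    last unless @$rows;   # stop when nothing is left to hand out (sketch simplification)

    my $i = 0;
    for my $row (@$rows) {
        my ($author_id) = @$row;
        my $worker = $workers[ $i++ % @workers ];
        dispatch_to_worker($worker, $author_id);
        $dbh->do(q{UPDATE crawl_tasks
                   SET status = 'issued', worker = ?, issued_at = NOW()
                   WHERE author_id = ?},
                 undef, $worker, $author_id);
    }
    sleep 60;   # poll for completed or stalled tasks once a minute
}
$dbh->disconnect;
```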

We make all of the crawler scripts available for other researchers who might be interested in collecting YouTube data. We note, however, that YouTube periodically changes its webpage layout and data format, so our scripts will likely have to be modified to account for recent changes.

Please download the code at this link.
