Concurrently, we used a separate set of Perl scripts, deployed on a cluster of workstations, to collect data on the social network of the authors who seeded the videos. Each video page provided a link to the author's page, which in turn contained data on the author and the author's social network; for instance, the author's page lists the identities of his or her directly connected friends. We used a cluster of workstations so that we could capture a snapshot of the social network structure within four days. The entire process was managed by a centralized controller that handed out the network crawling tasks to the individual computers, monitored their progress, and reissued tasks that were not completed within a specified time interval. The social network data was also stored in a MySQL database and then analyzed using custom programs written in C. The analysis yielded the social network metrics that we use in the paper, e.g., degree, number of second-degree friends, clustering, and betweenness centrality.
We make all of the crawler scripts available to other researchers interested in collecting YouTube data. We note, however, that YouTube changes its webpage layout and data formats over time, so our scripts will likely need to be modified to account for recent changes. The code can be downloaded at this link.
