The Untold Story of the Clones: Content-agnostic Factors that Impact YouTube Video Popularity
Y. Borghol, S. Ardon, N. Carlsson, D. Eager, and A. Mahanti,
Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2012),
Beijing, China, August 2012, to appear.
Video dissemination through sites such as YouTube can have widespread impacts on opinions, thoughts, and cultures. Not all videos will reach the same popularity and have the same impact. Popularity differences arise not only because of differences in video content, but also because of other “content-agnostic” factors. The latter factors are of considerable interest but it has been difficult to accurately study them. For example, videos uploaded by users with large social networks may tend to be more popular because they tend to have more interesting content, not because social network size has a substantial direct impact on popularity. In this paper, we develop and apply a methodology that is able to accurately assess, both qualitatively and quantitatively, the impacts of various content-agnostic factors on video popularity. When controlling for video content, we observe a strong linear “rich-get-richer” behavior, with the total number of previous views as the most important factor except for very young videos. The second most important factor is found to be video age. We analyze a number of phenomena that may contribute to rich-get-richer, including the first-mover advantage, and search bias towards popular videos. For young videos we find that factors other than the total number of previous views, such as uploader characteristics and number of keywords, become relatively more important. Our findings also confirm that inaccurate conclusions can be reached when not controlling for content.
The datasets used in our paper are made available here for use by the wider research community.
The datasets consist of publicly available meta-data associated with videos from the YouTube Web site.
Please refer to Section 2 of our paper for a description of the data collection methodology and a
summary of the datasets. If you use our datasets in your research,
please drop Niklas Carlsson a line at "niklas dot carlsson AT-SIGN liu dot se",
and include a reference to our paper in your work.
- Download clone snapshot data file.
- cloneset_id: a unique id for the cloneset
- cloneset_video_count: number of videos in this cloneset
- cloneset_scraping_count: number of times this cloneset had its data collected
- video_id: a unique id for the video
- capture_time: time at which this video data was captured
- video_published_date: time at which the video was first published on YouTube
- video_updated_date: time at which the video was last updated on YouTube
- video_keyword_count: number of keywords describing this video
- video_category_count: number of categories this video is in
- video_duration: duration of the video
- video_viewcount: number of views for this video
- video_favourite_count: number of times this video was 'favourited'
- video_rating_count: number of ratings for this video
- video_rating_min: min rating
- video_rating_max: max rating
- video_rating_average: average rating
- video_comments_count: number of comments on this video
- like_count: number of 'likes' for this video
- dislike_count: number of 'dislikes' for this video
- uploader_id: unique id for the uploader
- uploader_followers_count: number of followers for this uploader
- uploader_video_count: number of videos uploaded by this uploader
- uploader_contacts_count: number of 'friends' (YouTube friends) for this uploader
- uploader_total_viewcount: number of times the uploader's page was viewed
- uploader_age: age of the uploader
- max_quality: best quality available for this video; higher is better (with the following key: "1920x1080"=>5, "1280x720"=>4, "854x480"=>3, "640x360"=>2, "320x240"=>1)
- video_age: video age in days
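As a rough illustration of how the fields above might be used, the following Python sketch maps one line of the clone snapshot file to a record and decodes max_quality using the key given above. The file layout is not specified on this page, so the tab delimiter and the assumption that columns appear in the order listed are guesses to verify against the downloaded file. The names CLONE_SNAPSHOT_FIELDS, parse_snapshot_line, and max_quality_resolution are hypothetical helpers introduced only for this example.

```python
# Field names in the order they are listed above.
# ASSUMPTION: the data file stores its columns in this same order.
CLONE_SNAPSHOT_FIELDS = [
    "cloneset_id", "cloneset_video_count", "cloneset_scraping_count",
    "video_id", "capture_time", "video_published_date", "video_updated_date",
    "video_keyword_count", "video_category_count", "video_duration",
    "video_viewcount", "video_favourite_count", "video_rating_count",
    "video_rating_min", "video_rating_max", "video_rating_average",
    "video_comments_count", "like_count", "dislike_count",
    "uploader_id", "uploader_followers_count", "uploader_video_count",
    "uploader_contacts_count", "uploader_total_viewcount", "uploader_age",
    "max_quality", "video_age",
]

# Inverse of the max_quality key given in the field description.
MAX_QUALITY_RESOLUTION = {
    5: "1920x1080",
    4: "1280x720",
    3: "854x480",
    2: "640x360",
    1: "320x240",
}


def parse_snapshot_line(line, delimiter="\t"):
    """Split one line into a dict keyed by field name (values kept as strings).

    ASSUMPTION: the delimiter is a tab; change it to match the actual file.
    """
    values = line.rstrip("\n").split(delimiter)
    if len(values) != len(CLONE_SNAPSHOT_FIELDS):
        raise ValueError(
            f"expected {len(CLONE_SNAPSHOT_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(CLONE_SNAPSHOT_FIELDS, values))


def max_quality_resolution(record):
    """Return the resolution string for a record's max_quality code, or None."""
    return MAX_QUALITY_RESOLUTION.get(int(record["max_quality"]))
```

Values are left as strings because the date and time formats are not documented here; convert them only after inspecting the actual data.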
- Download insight data file.
- cloneset_id: a unique id for the cloneset
- video_id: a unique id for the video
- evolution_graph: the historical view counts (or number of downloads since the file was uploaded) at 100 points in time, between now and the time the file was uploaded.
- referrers: the top 10 "most significant" sources of discovery, or where the video was linked from.
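One possible use of evolution_graph is to recover how many views a video gained between consecutive sample points. The sketch below assumes the field has already been parsed into a list of numbers and that each value is a running total of views since upload; neither the on-disk encoding nor the spacing of the 100 points is specified here. The function name views_per_interval is a hypothetical helper.

```python
def views_per_interval(evolution_graph):
    """Return the views gained between each pair of consecutive sample points.

    ASSUMPTION: evolution_graph is a list of cumulative view counts,
    ordered from the upload time to the most recent sample.
    """
    return [
        later - earlier
        for earlier, later in zip(evolution_graph, evolution_graph[1:])
    ]
```

For example, views_per_interval([0, 10, 25, 25, 40]) yields [10, 15, 0, 15], so a series of 100 points produces 99 increments.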