* Vlogbrothers view statistics
* Jutze takes a quick look
* Last updated 2015-04-22 by jutze@jutze.com
* tumblr post with the original data: http://vespertinehour.tumblr.com/post/116950397182/fishingboatproceeds-kitchensink108-im
* (Thanks for all the work and for sharing!)

version 13.1
clear
set more off

* Import the three sheets and code who made each video (0 = Hank, 1 = John, 2 = both)
import excel http://nerdfighteria.info/files/vlogbrothers_data.xls, sheet("John-only") firstrow
gen John = 1
save killme1, replace
clear
import excel http://nerdfighteria.info/files/vlogbrothers_data.xls, sheet("All videos") firstrow
gen John = 2
save killme2, replace
clear
import excel http://nerdfighteria.info/files/vlogbrothers_data.xls, sheet("Hank-only") firstrow
gen John = 0
append using killme1
append using killme2

* A video that appears on more than one sheet is kept only once, with the lowest brother code
sort VideoID John
drop if VideoID == VideoID[_n-1]

label variable John "Brother(s)"
label define brother 0 "Hank" 1 "John" 2 "both"
label values John brother

* Derived variables
gen lnViews = ln(Views)
gen lperv = Likes/Views
label variable lperv "Likes per Views"
gen cperv = Comments/Views
label variable cperv "Comments per Views"
gen lperd = Likes/Dislikes
label variable lperd "Likes per Dislikes"
gen Year = year(Date)
label variable Year "Year"
label variable Seconds "Length (in Seconds)"
label variable Date ""

* Event markers for the plots. The row numbers refer to the data sorted by Date;
* the numeric Stata dates are derived with mdy() at the end of this file.
sort Date
gen str Events = ""
gen Eventdate = .
replace Eventdate = 17365 in 136
replace Events = "Accio Deathly Hallows" in 136
replace Eventdate = 17821 in 332
replace Events = "Paper Towns Publication Date" in 332
replace Eventdate = 18358 in 574
replace Events = "WGWG Publication Date" in 574
replace Eventdate = 19002 in 834
replace Events = "TFiOS Publication Date" in 834
replace Eventdate = 19668 in 1023
replace Events = "Google+/YouTube Comments Merger" in 1023
replace Eventdate = 19880 in 1083
replace Events = "TFiOS Movie Release" in 1083

save killme3, replace
erase killme1.dta
erase killme2.dta
clear
* ****
*
use killme3
*
* Scroll down for nice plots!
* This is a summary of the YouTube statistics of videos by the Vlogbrothers - Hank and John Green.
* The raw data was provided by vespertinehour.tumblr.com - http://vespertinehour.tumblr.com/post/116950397182/fishingboatproceeds-kitchensink108-im
* I focus on two independent variables:
* 1) Who made it? Hank-only, John-only, or both? I'll call this variable "Brother".
* 2) The second variable of interest is the Date. In other words: When was the video put online?
*
* The data set already contains a few interesting variables:
* - The view count (Views)
* - The number of Likes
* - The number of Dislikes
* - The number of Comments
* The three latter numbers co-vary (all rank correlations > .7) with the total number of views, so looking at them all separately would be repetitive and boring.
* rank correlations of views, likes, dislikes, and comments (plus the length in seconds)
spearman Views Likes Dislikes Comments Seconds
* So instead I will look at
* 1) The view count - most of the time I'll plot the natural logarithm of the view count because of a few outliers (more on those later)
* 2) The Likes per Views ratio (overall appreciation)
* 3) The Likes per Dislikes ratio (unambiguous appreciation)
* 4) The Comments per Views ratio along with the overall number of comments
* 5) The length of the videos is not that interesting because most clock in just under 4 minutes. (NB: Longer videos were not included in the original data set.)
* There is just not enough variation. So I'll just have one quick plot at the end.
*
* Speaking of plots, most of the analysis will be graphical. This is pretty much a census, so there's no need for statistical testing.
* Also, it's all quite exploratory.
*
* Here we go:
* Who tends to have more views?
* Here is the median view count for each brother:
table John, contents(freq median Views mean Views sd Views)
* Hank: 256k, John: 286k, both: 347k
* The median for Hank is 256k.
* This means that 50% of Hank's videos have more than 256k views, and the other half of his videos have fewer than 256k views.
* So John's videos tend to get more views than Hank's, but still fewer than reunion videos.
* You can also look at the means (M) and standard deviations (SD) - but there are some influential outliers that impede the interpretation of the numbers.
* Hank: M = 378k (SD = 537k)
* John: M = 467k (SD = 1112k)
* both: M = 367k (SD = 183k)
*
* This plot shows how the view count changes across time. The solid line is a median band.
* It indicates how many views a video needs at a given point in time to have more views than half of the other videos from around that time.
* The y-axis is on a log scale: the ticks sit at ln(100,000) = 11.51, ln(1,000,000) = 13.82, and ln(10,000,000) = 16.12.
twoway (scatter lnViews Date, mcol(gs12)) ///
    (mband lnViews Date, lcol(dkgreen) lpattern(solid)) ///
    , ///
    scheme(s1mono) ///
    xtitle("") ytitle("Views") ylabel(11.51 "100,000" 13.82 "1,000,000" 16.12 "10,000,000") legend(off)
graph export viewstats-fig1.png, replace
* Each gray point represents one particular Vlogbrothers video.
* When I add the linear trend (actually, it's a log-linear trend!), it becomes clear that newer videos tend to get more views.
twoway (scatter lnViews Date, mcol(gs12)) ///
    (mband lnViews Date, lcol(dkgreen) lpattern(solid)) ///
    (lfit lnViews Date, lcol(gs8) lpattern(dash)) ///
    , ///
    scheme(s1mono) ///
    xtitle("") ytitle("Views") ylabel(11.51 "100,000" 13.82 "1,000,000" 16.12 "10,000,000") legend(off)
graph export viewstats-fig2.png, replace
* And this is the same plot with some additional Nerdfighter-related dates
twoway (scatter lnViews Date, mcol(gs12) xline(17365 17821 18358 19002 19668 19880, lcol(gs4) lpattern(dash))) ///
    (mband lnViews Date, lcol(dkgreen) lpattern(solid)) ///
    (scatter lnViews Eventdate if Eventdate == 17365, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lnViews Eventdate if Eventdate == 17821, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lnViews Eventdate if Eventdate == 18358, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lnViews Eventdate if Eventdate == 19002, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lnViews Eventdate if Eventdate == 19668, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lnViews Eventdate if Eventdate == 19880, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    , ///
    scheme(s1mono) ///
    ytitle("Views") ylabel(11.51 "100,000" 13.82 "1,000,000" 16.12 "10,000,000") legend(off)
graph export viewstats-fig3.png, replace
* Did the movie version of The Fault in Our Stars lead to fewer views?
* I don't think so - this is mostly speculation, anyway.
* There could be many reasons why Nerdfighters might be watching fewer videos (CrashCourse, SciShow, Tumblr, jobs, kids).
* Personally, I think that the more recent videos just haven't accumulated as many views from new Nerdfighters who go through old videos (and from random strangers).
*
* Here is another version of this plot, this time with separate lines for John and Hank:
* The dashed line is Hank's.
twoway (scatter lnViews Date, mcol(gs12) xline(17365 17821 18358 19002 19668 19880, lcol(gs4) lpattern(dash))) ///
    (mband lnViews Date if John == 1, lpattern(solid) lcol(dkgreen)) ///
    (mband lnViews Date if John == 0, lpattern(dash) lcol(green)) ///
    (scatter lnViews Eventdate if Eventdate == 17365, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lnViews Eventdate if Eventdate == 17821, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lnViews Eventdate if Eventdate == 18358, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lnViews Eventdate if Eventdate == 19002, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lnViews Eventdate if Eventdate == 19668, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lnViews Eventdate if Eventdate == 19880, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    , ///
    scheme(s1mono) ///
    ytitle("Views") ylabel(11.51 "100,000" 13.82 "1,000,000" 16.12 "10,000,000") legend(order(3 "Hank" 2 "John") ring(0) position(5))
graph export viewstats-fig4.png, replace
* My interpretation would be that the view counts of Hank and John didn't really develop differently.
* *********
*
* So far, so good. Now what about actual appreciation?
* Here are the median values for Likes per Views:
table John, contents(freq median lperv mean lperv sd lperv)
* When I look at the median, Hank's videos are liked by 2.3% of viewers
* John's videos are liked by 2.2% of viewers
* Reunion videos are liked by 3.3% - Nerdfighters seem to like reunion videos!
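*
* Side sketch (not part of the original analysis): the per-brother summary of Likes per Views
* can also be printed with tabstat, which shows one row per group and a total row.
* The display format %6.4f is an assumption chosen for readability, nothing more.
tabstat lperv, by(John) statistics(n median mean sd) format(%6.4f)
* Multiplying the medians by 100 gives the percentages quoted above.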
*
* Here's the longitudinal perspective - again no clear differences between Hank's videos and John's videos
twoway (scatter lperv Date, mcol(gs12) xline(17365 17821 18358 19002 19668 19880, lcol(gs4) lpattern(dash))) ///
    (mband lperv Date if John == 1, lpattern(solid) lcol(dkgreen)) ///
    (mband lperv Date if John == 0, lpattern(dash) lcol(green)) ///
    (scatter lperv Eventdate if Eventdate == 17365, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lperv Eventdate if Eventdate == 17821, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lperv Eventdate if Eventdate == 18358, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lperv Eventdate if Eventdate == 19002, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lperv Eventdate if Eventdate == 19668, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lperv Eventdate if Eventdate == 19880, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    , ///
    scheme(s1mono) ///
    ytitle("Likes/Views") legend(order(3 "Hank" 2 "John") ring(0) position(11))
graph export viewstats-fig5.png, replace
* *********
*
* Being liked is one thing. But how about the Likes per Dislikes ratio?
* Here are the median values:
table John, contents(freq median lperd mean lperd sd lperd)
* Hank's videos tend to get 78 Likes per Dislike.
* John's videos tend to get 126 Likes for each Dislike.
* And reunion videos tend to get 177 Likes per Dislike.
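*
* Side note (a sketch, not part of the original analysis): Likes/Dislikes is undefined for a video with zero Dislikes.
* Stata sets the result of a division by zero to missing, so such a video would silently drop out of the summary above.
* Whether the data contain any such videos is not stated, so these two lines merely count them:
count if Dislikes == 0
count if missing(lperd)
* If both counts are zero, every video enters the Likes per Dislikes statistics.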
*
* Here's the longitudinal perspective:
twoway (scatter lperd Date, mcol(gs12) xline(17365 17821 18358 19002 19668 19880, lcol(gs4) lpattern(dash))) ///
    (mband lperd Date if John == 1, lpattern(solid) lcol(dkgreen)) ///
    (mband lperd Date if John == 0, lpattern(dash) lcol(green)) ///
    (scatter lperd Eventdate if Eventdate == 17365, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lperd Eventdate if Eventdate == 17821, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lperd Eventdate if Eventdate == 18358, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lperd Eventdate if Eventdate == 19002, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lperd Eventdate if Eventdate == 19668, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter lperd Eventdate if Eventdate == 19880, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    , ///
    scheme(s1mono) ///
    ytitle("Likes/Dislikes") legend(order(3 "Hank" 2 "John") ring(0) position(11))
graph export viewstats-fig6.png, replace
* The ratio of Likes to Dislikes has grown even larger during the past few years.
* This development occurred especially for John's videos.
* *********
*
* Enough with the appreciation - how about Comments? An eternal source of love, hate, fun, and chaos they are.
* Here are the median values for Comments/Views:
table John, contents(freq median cperv mean cperv sd cperv)
* The overall tendency is that 0.5%-0.6% of viewers write a comment.
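*
* Side sketch (not part of the original analysis): the event list at the top of this file puts the
* Google+/YouTube comments merger at Stata date 19668. A rough way to put numbers on any change in
* commenting around that date is to compare the median Comments/Views ratio before and after it.
* Only the cut-off date comes from this file; treating it as a simple before/after split is my assumption.
tabstat cperv if Date < 19668, statistics(n median)
tabstat cperv if Date >= 19668 & !missing(Date), statistics(n median)
* A difference here would not show that the merger caused it; video age and audience growth change at the same time.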
*
* Let's look at the longitudinal perspective:
twoway (scatter cperv Date, mcol(gs12) xline(17365 17821 18358 19002 19668 19880, lcol(gs4) lpattern(dash))) ///
    (mband cperv Date if John == 1, lpattern(solid) lcol(dkgreen)) ///
    (mband cperv Date if John == 0, lpattern(dash) lcol(green)) ///
    (scatter cperv Eventdate if Eventdate == 17365, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter cperv Eventdate if Eventdate == 17821, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter cperv Eventdate if Eventdate == 18358, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter cperv Eventdate if Eventdate == 19002, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter cperv Eventdate if Eventdate == 19668, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    (scatter cperv Eventdate if Eventdate == 19880, mlabel(Events) mlabcol(gs4) mlabangle(90) msymbol(i)) ///
    , ///
    scheme(s1mono) ///
    ytitle("Comments/Views") legend(order(3 "Hank" 2 "John") ring(0) position(1))
graph export viewstats-fig7.png, replace
* The number of Comments per View has declined over the past two years, possibly due to the integration of Google+ and YouTube or the new sorting algorithm for comments.
* *********
*
* Finally, here's a quick overview of specific types of outliers
*
* Videos that elicit a lot of comments are mostly about the Project for Awesome...
twoway (scatter cperv Date, msymbol(i) mlabel(Title)), xtitle("") ytitle("Comments/Views") scheme(s1mono)
graph export viewstats-fig8.png, replace
* The videos with the highest view count all deal with animals...
twoway (scatter Views Date, msymbol(i) mlabel(Title)), xtitle("") ytitle("Views") scheme(s1mono)
graph export viewstats-fig9.png, replace
* The last couple of plots bring us back to the length of the videos. Here are the titles of the shorter videos.
twoway (scatter Seconds Date, msymbol(i) mlabel(Title) mlabangle(90)), xtitle("") ytitle("Length (in Seconds)") scheme(s1mono)
graph export viewstats-fig10.png, replace
* Not much to say here.
* And it seems as if Hank keeps making slightly longer videos than John
twoway (scatter Seconds Date, mcol(gs12)) ///
    (mband Seconds Date if John == 1, lpattern(solid) lcol(dkgreen)) ///
    (mband Seconds Date if John == 0, lpattern(dash) lcol(green)) ///
    , ///
    scheme(s1mono) ///
    ytitle("Length (in Seconds)") legend(order(3 "Hank" 2 "John") ring(0) position(7))
graph export viewstats-fig11.png, replace
* That's all!
* ******
*
* REMAINS:
* How many videos per brother?
*tab John
*table Year John, contents(freq median Views)
*table Year John, contents(freq median lperv sd lperv median lperd sd lperd)
*table John, contents(freq median lperv sd lperv median lperd sd lperd)
*replace cperv = .05 if cperv > .05
*graph box Views, over(John)
*graph box Views if Views < 5000000, over(John)
*graph box lnViews, over(John)
*
*twoway (scatter cperv Date) (qfit cperv Date)
*twoway (scatter cperv Date) (fpfit cperv Date)
*twoway (scatter cperv Date) (lpoly cperv Date)
*twoway (scatter cperv Date) (lpolyci cperv Date)
*twoway (scatter Comments Date) (lpolyci Comments Date)
*
* Likes per dislikes by brother
*twoway (scatter lperd Date if John == 0) (lpoly lperd Date if John == 0) (scatter lperd Date if John == 1) (lpoly lperd Date if John == 1)
*twoway (lpolyci lperd Date if John == 2, lcol(green) lpattern(dash)) (lpolyci lperd Date if John == 0) (lpolyci lperd Date if John == 1)
*twoway (scatter lnViews Date) (lpolyci lnViews Date), by(John, row(1))
*
*twoway (scatter lperv Date, msymbol(i) mlabel(Title))
*twoway (scatter lperd Date, msymbol(i) mlabel(Title))
*
* Nicely formatted graphs over time
*display mdy(7,18,2007)
*Accio Deathly Hallows: 17365
*display mdy(10,16,2008)
*Paper Towns release: 17821
*display mdy(4,6,2010)
* WGWG release: 18358
*display mdy(1,10,2012)
*TFiOS release: 19002
*display mdy(11,6,2013)
*YouTube/Google+ for comments: 19668
*display mdy(6,6,2014)
*TFiOS movie premiere: 19880

clear
exit
* DFTBA