Would this idea be feasible / not too hard on the server?
Have each individual list be a clickable link that will take you to a page listing that same attribute for the top 50, 100, or all movies (whatever is feasible).
<Swordless> Go hug a tree, you vegetarian (I bet you really are one)
AFAIK that would actually be quite easy to implement (doesn't even require new code to be written). Only Bisqwit could estimate if it could be too heavy on the server (I guess not because the data is cached, but who knows...)
It'd be neat to see that implemented, I think. I'm always curious about stuff like, for example, where the bulk of the run's rerecord/frame ratio lies, and I would imagine others are curious about other things as well.
Anyway, thanks, and keep up the nice work on that page!
Ok, Bisqwit gave the green light, so I added links to extended lists in some of the categories. I'm not sure the ones I left without an extended list really need them, but if you really, really want them, I can add them too... :)
Well, the average movie length is 25:29, which is 25.4833 minutes. The number of published movies is 371. Their total length is thus the product of those two numbers, ie. 9454.3 minutes = 157.57 hours = 6.57 days.
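As a quick check of that arithmetic, here is a minimal sketch. The variable names are mine; the only inputs are the 25:29 average and the 371 movies quoted above.

```python
# Figures quoted in the post above.
avg_minutes = 25 + 29 / 60   # 25:29 expressed as decimal minutes
movie_count = 371

total_minutes = avg_minutes * movie_count
total_hours = total_minutes / 60
total_days = total_hours / 24

print(f"average: {avg_minutes:.4f} min")
print(f"total: {total_minutes:.1f} min = {total_hours:.2f} h = {total_days:.2f} days")
```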
Joined: 3/9/2004
Posts: 4588
Location: In his lab studying psychology to find new ways to torture TASers and forumers
I really don't like the whole bumping thing. However I would like to mention that I really would like to see some of the stats I listed in the first reply to this thread. They seem to go hand in hand with some of the other stats there, and in some cases help clarify the statistics of the other ones.
Warning: Opinions expressed by Nach or others in this post do not necessarily reflect the views, opinions, or position of Nach himself on the matter(s) being discussed therein.
How hard is it to show a histogram or a graph that gives the distribution of
Movie Length,
Re-records,
rerecord/length,
Age,
a few others?
I think there is enough data to make pretty graphs.
This signature is much better than its previous version.
Joined: 8/1/2004
Posts: 2687
Location: Seattle, WA
It would take about 15 minutes to bust out all of that info in Excel, but I'm sure magical Bisqwit could easily have the site spit out that info in a matter of 3 minutes or so.
I'm not exactly sure what "most downloaded / days since published" would actually tell if "days since published" is the current number of days. As time passes and movies get older, the order of that list would just converge to that of the "most downloaded" list (because the relative differences between the divisors get smaller and smaller over time). Only immensely popular videos which get downloaded a lot in the first weeks/months would disrupt that list for a while, but once again, as time passes, they would finally fall back into place (ie. the same place as in the "most downloaded" list). In other words, the list could be rather transient and would most probably just replicate the "most downloaded" list (I haven't tested this in any way, though, so I may well be quite wrong).
One interesting list could perhaps be "most downloaded in the first two weeks" (or first week/day/month or whatever), but I'm not sure if the bittorrent tracker logs that kind of statistics. Perhaps another interesting list would be "most downloaded movies during the past 7 days" (or whatever timeframe) which would in practice list the most popular newest publications, but again, I don't know if the tracker logs that.
If the tracker does not log times, then the only possibility to implement that latter list would be for bisqwit to make some cronjob which regularly saves somewhere the data needed for it. However, the other list would be impossible because that info has been lost for the currently published movies.
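To make the cronjob idea a bit more concrete, here is a rough sketch. It assumes a job that saves the cumulative download count of every movie once a day; the snapshot layout, the function name and all the numbers are made up for illustration and have nothing to do with how the site actually stores its data.

```python
def downloads_in_window(snapshots, today, window_days=7):
    """Downloads per movie during the last `window_days` days.

    `snapshots` maps a day number to {movie_id: cumulative downloads},
    as a daily cron job might have saved them (hypothetical layout).
    """
    now = snapshots[today]
    # Movies published inside the window have no old snapshot: count from 0.
    then = snapshots.get(today - window_days, {})
    return {movie: total - then.get(movie, 0) for movie, total in now.items()}


# Made-up numbers, only to show the shape of the data.
snapshots = {
    100: {"A": 5000, "B": 1200},
    107: {"A": 5100, "B": 1900, "C": 650},
}
recent = downloads_in_window(snapshots, today=107)
ranking = sorted(recent, key=recent.get, reverse=True)
print(ranking)
```

Since the list is built from the difference between two snapshots, it only works from the day the job starts running, which is the same limitation mentioned above for already published movies.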
Edit: Thinking about it a bit more, perhaps "downloads/days_published" could work as a "most popular recent movies" list after all. As the days start increasing from 1 forward, the divisor rapidly drops the item down the list, and publications of the day get a high boost (because they basically don't get a "penalty" on the publication day, which older ones get). If using days is way too coarse (ie. movies would drop too drastically at exactly 24 hours after publication instead of sinking gradually and more slowly), a smaller unit could be used, such as hours, minutes or even seconds since publication. However, I fear that if the time unit is too small (eg. seconds) it would perhaps randomize the list too much (the list could get almost completely shuffled every 10 minutes, which is how often the page cache expires). Another thing is that if a direct fraction doesn't work very well, perhaps a factor would have to be used to make it better behaved (such as something like "downloads/(10*seconds)" or whatever). It may be worth trying, but finding good parameters might be hard.
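A minimal sketch of that "downloads/time since publication" score, to experiment with the time unit. The function name, the movies and their numbers are all made up, and clamping the elapsed time to one unit is my own assumption so that a movie published minutes ago is not divided by something close to zero.

```python
def recency_score(downloads, seconds_published, unit_seconds=86400):
    """Downloads divided by time since publication, in units of `unit_seconds`.

    Elapsed time is clamped to one unit so that a brand-new movie gets no
    "penalty" during its first unit (an assumption, not a tested choice).
    """
    elapsed_units = max(seconds_published / unit_seconds, 1.0)
    return downloads / elapsed_units


# Made-up movies: (title, downloads, seconds since publication).
movies = [
    ("old favourite", 40000, 400 * 86400),
    ("last month", 6000, 30 * 86400),
    ("published today", 900, 6 * 3600),
]
ranked = sorted(movies, key=lambda m: recency_score(m[1], m[2]), reverse=True)
for title, downloads, seconds in ranked:
    print(f"{title}: {recency_score(downloads, seconds):.1f} per day")
```

Note that multiplying the divisor by a constant only rescales every score by the same amount, so by itself it would not change the order of the list; changing the unit only matters through the clamping.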
Making histograms and graphs shouldn't be too difficult. I don't know if PHP has built-in functions for creating this kind of image, but I wouldn't be surprised if it did. Even if it didn't, it would be easy to use gnuplot for this.
However, that's not the main problem here. The problem is: What would the units of the x and y axes be in that kind of graph?
Assuming that, for example, the y axis is the movie length (eg. in minutes) or the amount of rerecords, what would the x axis be?
Y-axis is number of movies
X-axis is the length or rerecords etc, so you see the distribution of how many movies are what size. These are likely bar graphs.
I guess you can also do X-Y scatter plots of rerecords to movie lengths... or rerecord rates to movie length... I would suspect that longer movies have smaller rerecord rates
You might have to choose the x-axis to be logarithmic since some of the submissions are hours long.
Containing what?
I think your avatar fits this quite well... ;)
If it's a bar chart then each bar would represent a value range of rerecords (or whatever)? For example the first bar would represent movies having 0-100 rerecords, the second 101-200 rerecords and so on?
The problem I see in this is what that step should actually be. Given that the number of published movies isn't really that large (less than 400) and the total range of rerecords is very large (smallest 215, largest 224441) the chart would probably get full of holes and bars with just one or two movies in them, almost regardless of the value range of each bar.
With movie lengths perhaps a more meaningful bar chart could be achieved given that most movies fall in the range of about 10-40 minutes. It's still a question what would the range of each bar be. The problem is that the longest movies are really long (4 hours 21 minutes) and if every movie is put into the chart, most movies will get compressed into its left side while the majority of the rest of the chart will, again, have 0-1 sized bars.
To illustrate this more concretely, let's assume that each bar represents a 1-minute range (ie. first bar is movies between 0 and 1 minute long, second bar is for movies between 1 and 2 minutes long, etc): To represent all the movies currently published the chart would need 261 bars. The majority of movies will be placed around bar 25 (which is the average movie length). This is one tenth of the whole chart width from its left side. I assume that over half of the chart (from the halfway point widthwise to the right) will be mostly empty, with some 1-sized bars here and there.
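The effect is easy to reproduce with a small sketch. The movie lengths below are invented (only the longest one, 261 minutes, comes from the 4 hours 21 minutes mentioned above), and the function name is just for illustration.

```python
from collections import Counter


def fixed_width_histogram(values, width):
    """Count values per bin; bin k covers [k * width, (k + 1) * width)."""
    return Counter(int(v // width) for v in values)


# Invented movie lengths in minutes; only 261 is taken from the thread.
lengths_min = [3.5, 11.2, 18.0, 22.4, 24.9, 25.1, 27.8, 31.0, 38.6, 95.3, 261.0]

counts = fixed_width_histogram(lengths_min, 1.0)
n_bars = max(counts) + 1          # bars needed to reach the longest movie
empty_bars = n_bars - len(counts)
print(f"{n_bars} bars, {empty_bars} of them empty")
```

With real data the left side would of course be denser, but the long empty stretch towards the right would look much the same.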
Perhaps some kind of logarithmic scale could be used. Suggestions?
Thinking about it, the movie lengths and rerecord counts likely both follow lognormal distributions.
A logarithmic scale would be a good idea. A good way to keep the bins from affecting the shape of the distribution might be to normalize each bar by dividing the count by the bin width... that will give a normalized density.
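Here is a rough sketch of what logarithmic bins with that normalization could look like. The function name and the choice of two bins per decade are arbitrary assumptions; 215 and 224441 are the rerecord extremes quoted earlier in the thread, and the other values are made up.

```python
import math


def log_bin_density(values, bins_per_decade=2):
    """Histogram on log-spaced bins; returns (left, right, count, density) rows.

    All values must be positive, since the bins are based on log10.
    """
    lo = math.floor(math.log10(min(values)) * bins_per_decade)
    hi = math.floor(math.log10(max(values)) * bins_per_decade) + 1
    edges = [10 ** (k / bins_per_decade) for k in range(lo, hi + 1)]
    counts = [0] * (len(edges) - 1)
    for v in values:
        counts[math.floor(math.log10(v) * bins_per_decade) - lo] += 1
    # Dividing by the bin width keeps wide bins from looking artificially tall.
    return [
        (left, right, n, n / (right - left))
        for left, right, n in zip(edges, edges[1:], counts)
    ]


# 215 and 224441 are the extremes quoted in the thread; the rest are made up.
rerecords = [215, 900, 1500, 4000, 12000, 224441]
rows = log_bin_density(rerecords)
for left, right, n, density in rows:
    print(f"{left:9.0f} to {right:9.0f}: {n} movies, density {density:.2e}")
```

Each bin here is about 3.16 times wider than the previous one, so a handful of bars covers the whole range from hundreds to hundreds of thousands of rerecords.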
As for bins and other graphs... if you can provide an excel spreadsheet or text file with movie times/rerecords/publish dates/etc I would be happy to do some investigation into what looks nice/meaningful.
Publish dates / month seems like an easy one right now to investigate that will show a good graph of this site's history and allow you to figure out how to make graphics.
I think you should ask Bisqwit about getting that data. After all, it's his database and I wouldn't dare distributing anything without his permission. :)
Ok, Bisqwit and I had quite an extensive discussion about a list of most downloaded movies compensated for time. Bisqwit wanted a list of the most popular movies in such a way that the change in popularity (ie. number of downloads per time unit) is taken out of the equation as well as possible. In other words, a formula which, in a downloads/time graph, makes download rates as close to a horizontal flat line as possible. The rationale behind this is that when the download rates are compensated like this, they give a better figure of the popularity of the movies regardless of how long they have been published.
After much study of the data we noticed that, on average, the download growth is directly proportional to the number of days the movie has been published raised to the power of 0.4 (this is just slightly slower than a square root). When we divided the download counts over time by the number of days to the power of 0.4, we got almost horizontal flat lines for most movies (except in the first weeks after publication, when a movie of course behaves rather differently before it stabilizes to the days^0.4 behaviour).
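In code the compensation boils down to something like the following sketch. The function name is made up and this is not the site's actual code; the synthetic movie is constructed to follow the days^0.4 growth exactly, just to show that its compensated figure then stays flat.

```python
def compensated_popularity(downloads, days_published, exponent=0.4):
    """Downloads divided by days**exponent (0.4 is the value quoted above)."""
    return downloads / days_published ** exponent


# Synthetic movie whose downloads grow exactly as 500 * days**0.4.
for days in (30, 300, 900):
    downloads = 500 * days ** 0.4
    print(days, round(downloads), round(compensated_popularity(downloads, days), 6))
```

A real movie would wobble around its line, especially in the first weeks, but two movies of very different ages can now be compared by that single number.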
I added this list to the MovieStatistics page. Not surprisingly it resembles the Most Downloaded list, but there are some changes.
I don't know if this is what Nach wanted, but it's what Bisqwit wanted. :)
Someone reported that the "longest non-obsoleted movies" list does not list some movies that it should. This error has been fixed now. The fix probably corrected a few other errors too.
The nature of the change is indicated on this page. http://tasvideos.org/FullRecentChanges.html (only for the next three days)
You might want to remove duplicate entries (like the ones caused by two AVIs published for one run). Otherwise, it's good, thanks.
[EDIT]
Also, "most voted movies" sounds a bit wrong to me. I'm not sure what to suggest instead, not with my knowledge of English.
I know he didn't, but it shouldn't be too hard to make the code generate only one string per ###M entry in this list.
[EDIT]
So I take it there's nothing to be done about that?
It's not a site "bug" per se. It's a side-effect of how the database is constructed.
The problem is that each movie file and each avi file has its own entry in the database. If a movie has two movie files (such as is the case with http://tasvideos.org/83M.html) then it has two entries in the database. Likewise, if a movie has two avi files (such as is the case with http://tasvideos.org/817M.html) it also has two avi entries in the database.
Some statistics require these to be separate (eg. the file sizes of the avis differ and thus need to be handled separately) while in other statistics it would be better for them to be merged into one (eg. the number of rerecords doesn't change from one avi to the other).
The routine which handles the statistics is generic and by itself cannot tell whether two entries should be merged or not. I suppose it would be possible to parametrize this and make the routine merge movie/avi entries when requested, but I haven't looked into that yet.
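Just to illustrate what such a parameter could look like, here is a sketch. The field names, the function and the numbers are all hypothetical and have nothing to do with the real database layout or the site's code.

```python
def stat_rows(entries, value_key, merge_per_movie=False):
    """Return (movie_id, value) pairs, optionally collapsed to one per movie."""
    if not merge_per_movie:
        return [(e["movie_id"], e[value_key]) for e in entries]
    merged = {}
    for e in entries:
        # Keep the first entry seen for each movie and ignore later duplicates.
        merged.setdefault(e["movie_id"], e[value_key])
    return list(merged.items())


# Hypothetical avi entries: movie 817 has two avi files with different sizes.
avi_entries = [
    {"movie_id": 817, "rerecords": 12000, "size_mb": 40},
    {"movie_id": 817, "rerecords": 12000, "size_mb": 75},
    {"movie_id": 500, "rerecords": 3000, "size_mb": 20},
]

by_size = stat_rows(avi_entries, "size_mb")                      # keep both avis
by_rerecords = stat_rows(avi_entries, "rerecords", merge_per_movie=True)
print(by_size)
print(by_rerecords)
```

Each list on the statistics page would then simply state whether it wants merged entries or not.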
Joined: 4/20/2005
Posts: 2161
Location: Norrköping, Sweden
Would it be possible to include things like "Lowest Technical rating", "Lowest entertainment rating", and things like that too? Or is it only me who's interested in those things? :P