Futurama: A comparison of the current production run with the old run

Mongo

Bending Unit

« on: 09-01-2010 22:58 »
« Last Edit on: 09-01-2010 23:03 »

There have been a lot of electrons spilled over the issue of how the current production run (6ACV01 and up) matches up with the original production run (1ACV01 to 4ACV18). This is impossible to definitively answer, since everybody has their own subjective opinion about each separate episode. However, I think that there are sufficient episodes in the new run to allow for at least a statistical comparison.

What I have done is enter every episode of the original 72 into a spreadsheet, with a rough personal rating for each one:

4 = outstanding episode
3 = very good episode
2 = typical episode
1 = poor episode
0 = terrible episode

The numbers themselves only matter in that they indicate the rating; they could just as easily have been 6, 7, 8, 9 and 10 without affecting the final result (since I am only using the numbers to compare between the original and new production runs). Obviously, my own personal ratings are unlikely to match those of anyone else, but on average they should be fairly close. In particular, I rated each of these episodes after having seen all 83 episodes so far of the new and old production runs (minus the movie episodes), so the new run is "calibrated" against the old run -- I was directly comparing the new episodes against the old ones.

Once I had all the episodes entered into the spreadsheet, I took a running average of the episode ratings, in blocks of 5 episodes. This is what I found.

For the 5-episode averages, the best 2 blocks of 5 consecutive (by production code) episodes of the original run had a score of 16 = 3.2/episode

3ACV01 Amazon Women in the Mood
3ACV02 Parasites Lost
3ACV03 A Tale of Two Santas
3ACV04 The Luck of the Fryish
3ACV05 The Birdbot of Ice-Catraz

4ACV08 Crimes of the Hot
4ACV09 Teenage Mutant Leela's Hurdles
4ACV10 The Why of Fry
4ACV11 Where No Fan Has Gone Before
4ACV12 The Sting

There were two blocks of 5 consecutive episodes that scored at 15 = 3.0/episode

1ACV12 When Aliens Attack
1ACV13 Fry and the Slurm Factory
2ACV01 I Second That Emotion
2ACV02 Brannigan Begin Again
2ACV03 A Head in the Polls

2ACV13 Bender Gets Made
2ACV14 Mother's Day
2ACV15 The Problem With Popplers
2ACV16 Anthology of Interest I
2ACV17 War is the H-Word

By comparison, the best 5-episode block of the new run also scored as a 15 = 3.0/episode

6ACV06 Lethal Inspection
6ACV07 The Late Philip J. Fry
6ACV08 That Darn Katz!
6ACV09 A Clockwork Origin
6ACV10 The Prisoner of Benda

The 2 worst 5-episode blocks from the original run scored at 6 = 1.2/episode

3ACV08 That's Lobstertainment!
3ACV09 The Cyber House Rules
3ACV10 Where the Buggalo Roam
3ACV11 Insane in the Mainframe
3ACV12 The Route of All Evil

4ACV02 Leela's Homeworld
4ACV03 Love and Rocket
4ACV04 Less Than Hero
4ACV05 A Taste of Freedom
4ACV06 Bender Should Not Be Allowed on Television

The worst 5-episode block of the new run also scored at 6 = 1.2/episode

6ACV01 Rebirth
6ACV02 Inna-Gadda-Da-Leela
6ACV03 Attack of the Killer App
6ACV04 Proposition Infinity
6ACV05 The Duh-Vinci Code

It is unfortunate that the worst 5 consecutive episodes of the new run also happened to be the first 5 episodes we saw. It gave an impression that the new episodes were considerably less good than the average episode from the original run (which was true, they were -- but the following episodes were a lot better).

However, the most important conclusion I came to is that the broadcast episodes of the new production run are fully equivalent to the episodes of the original production run, with about the same average quality of, and amount of variation between, episodes.
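For anyone who wants to repeat the 5-episode running average without a spreadsheet, here is a minimal Python sketch. It assumes the ratings sit in a plain list in production-code order; the sample values are the 6ACV01 to 6ACV12 ratings posted further down the thread, and the names `window_sums` and `new_run` are made up for illustration, not taken from the original spreadsheet.

```python
# Sketch of a 5-episode sliding-window ("running average") scan.
# Names are illustrative; this is not the original spreadsheet formula.

def window_sums(ratings, size=5):
    """Return the sum of every run of `size` consecutive ratings."""
    return [sum(ratings[i:i + size]) for i in range(len(ratings) - size + 1)]

# Ratings for 6ACV01 .. 6ACV12 as posted later in this thread.
new_run = [2, 1, 0, 1, 2, 3, 4, 1, 3, 4, 2, 3]

sums = window_sums(new_run)
best = max(range(len(sums)), key=sums.__getitem__)   # index of the first best block
worst = min(range(len(sums)), key=sums.__getitem__)  # index of the first worst block

print(f"best block starts at 6ACV{best + 1:02d}: {sums[best]} = {sums[best] / 5:.1f}/episode")
print(f"worst block starts at 6ACV{worst + 1:02d}: {sums[worst]} = {sums[worst] / 5:.1f}/episode")
```

With these twelve ratings the best block is 6ACV06 to 6ACV10 (15 points) and the worst is 6ACV01 to 6ACV05 (6 points), matching the blocks listed above.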

Gorky

DOOP Secretary

« Reply #1 on: 09-02-2010 02:31 »

What you mean by all of this, I assume, is that ratings are subjective and opinions are bound to vary.

Also: I refuse to respect a person who doesn't enjoy "Insane in the Mainframe" or "Less Than Hero."

Mongo

Bending Unit

« Reply #2 on: 09-02-2010 03:12 »

Yes, as I had said in the first paragraph. One point I was trying to make is that whatever your personal rating system -- whether you put humour, shipping, science fiction, or whatever first -- if it is self-consistent, then the new production run is comparable to the original run in terms of quality. If you rate it significantly lower than the original run, then there is some sort of bias in your system.

PumaGirl

Starship Captain

« Reply #3 on: 09-02-2010 08:53 »

You are comparing 72 episodes with 11. I would say you have serious issues with your sample size.

But quite impressive how much effort you put into this!

moonbus69

Bending Unit

« Reply #4 on: 09-02-2010 23:59 »

Impressive work, Mongo... ;-)

I'll have to hold my judgement until I've seen all of the current Comedy Central season (of 26 eps.)

Must say that I'm loving the ones aired so far.

KyleG
Poppler

« Reply #5 on: 09-03-2010 04:16 »
« Last Edit on: 09-03-2010 04:30 »

Mongo, might you post your list/rankings/ratings raw data? Your statistical analysis is interesting, but a more rigorous analysis could be done using the Mann-Whitney U test.

This test is used to analyze two populations to see if they can be treated as the same population. Basically, here, one population is pre-cancellation eps. The other is revival eps. By providing each a rating, they can be ranked. Then an analysis can be run on the ranking to determine if the two populations are actually one population. In effect, whether old vs. new matters for quality of show.

I could easily do it with your raw data to work with (I guess episode number + rating, for each episode). I'm too lazy to rate them all myself.
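To make the ranking step concrete, here is a rough sketch of how U can be computed from ranks. The two short samples are invented purely for illustration (they are not Mongo's ratings), and `midranks` and `mann_whitney_u` are hypothetical helper names. Tied ratings share the average of the ranks they occupy, which matters here because a 0-4 scale produces a lot of ties.

```python
# Sketch of the ranking step behind the Mann-Whitney U test.
# The sample data and helper names are assumptions for illustration only.

def midranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (U_a, U_b) computed from the rank sum of sample a."""
    ranks = midranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    return u_a, len(a) * len(b) - u_a

old = [3, 1, 4, 2]  # hypothetical ratings
new = [2, 0, 3]     # hypothetical ratings
print(mann_whitney_u(old, new))  # (8.0, 4.0)
```

The two U values always add up to n1*n2 (here 4*3 = 12); the smaller one is usually the value compared against a critical-value table.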

Mongo

Bending Unit

« Reply #6 on: 09-03-2010 04:50 »
« Last Edit on: 09-03-2010 05:46 »

1ACV01 -- 3
1ACV02 -- 1
1ACV03 -- 4
1ACV04 -- 3
1ACV05 -- 2
1ACV06 -- 2
1ACV07 -- 2
1ACV08 -- 2
1ACV09 -- 3
1ACV10 -- 3
1ACV11 -- 1
1ACV12 -- 4
1ACV13 -- 3
2ACV01 -- 2
2ACV02 -- 3
2ACV03 -- 3
2ACV04 -- 3
2ACV05 -- 3
2ACV06 -- 2
2ACV07 -- 2
2ACV08 -- 2
2ACV09 -- 1
2ACV10 -- 2
2ACV11 -- 3
2ACV12 -- 2
2ACV13 -- 2
2ACV14 -- 1
2ACV15 -- 4
2ACV16 -- 3
2ACV17 -- 4
2ACV18 -- 1
2ACV19 -- 0
3ACV01 -- 4
3ACV02 -- 4
3ACV03 -- 2
3ACV04 -- 4
3ACV05 -- 2
3ACV06 -- 1
3ACV07 -- 3
3ACV08 -- 0
3ACV09 -- 1
3ACV10 -- 1
3ACV11 -- 3
3ACV12 -- 1
3ACV13 -- 1
3ACV14 -- 3
3ACV15 -- 1
3ACV16 -- 1
3ACV17 -- 1
3ACV18 -- 3
3ACV19 -- 4
3ACV20 -- 4
3ACV21 -- 0
3ACV22 -- 2
4ACV01 -- 2
4ACV02 -- 2
4ACV03 -- 2
4ACV04 -- 0
4ACV05 -- 1
4ACV06 -- 1
4ACV07 -- 3
4ACV08 -- 2
4ACV09 -- 3
4ACV10 -- 4
4ACV11 -- 3
4ACV12 -- 4
4ACV13 -- 1
4ACV14 -- 1
4ACV15 -- 4
4ACV16 -- 1
4ACV17 -- 1
4ACV18 -- 3

6ACV01 -- 2
6ACV02 -- 1
6ACV03 -- 0
6ACV04 -- 1
6ACV05 -- 2
6ACV06 -- 3
6ACV07 -- 4
6ACV08 -- 1
6ACV09 -- 3
6ACV10 -- 4
6ACV11 -- 2
6ACV12 -- 3

KyleG
Poppler

« Reply #7 on: 09-03-2010 06:00 »

@Mongo Thanks. Here's my work, which concludes that there isn't evidence to suggest the pre/post cancellation distinction is meaningful for judging episode quality.

Null hypothesis: Pre- and post-cancellation episodes of Futurama are not statistically different in 0-4 rating quality based on Mongo's ratings.
Alternative hypothesis: Pre- and post-cancellation episodes of Futurama are statistically different in rating.

Now, because this data is ordinal (and so calls for a non-parametric test), we will use the Mann-Whitney U test.

n1=72 (size of pre-cancellation population)
n2=11 (size of post-cancellation population)
U=420.5
alpha=.05 (two-tailed)

The score/rating distributions in the two groups do not differ significantly.

Now, technically what the Mann-Whitney U test is telling us is that there is not enough evidence to suggest that the pre- and post-cancellation episodes are of different quality. The test is not telling us they are of the same quality.

But for our purposes, I think it's safe to say they're of pretty much the same quality. Based on Mongo's ratings, of course.
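The U value above can be reproduced from the ratings posted in Reply #6. The sketch below is not necessarily how the number was originally calculated: it counts, for every old/new pair of episodes, which one is rated higher (ties count half), and then applies the usual tie-corrected normal approximation without a continuity correction. It assumes the entry between 3ACV13 and 3ACV15 is 3ACV14, and it uses only 6ACV01 to 6ACV11 to match n2=11; `mann_whitney` is a hypothetical helper name.

```python
from collections import Counter
from math import sqrt

# Mongo's posted ratings, one digit per episode in production order.
# Assumption: the entry between 3ACV13 and 3ACV15 is read as 3ACV14.
old = [int(c) for c in (
    "3143222233143"            # 1ACV01-13
    "2333322212322143410"      # 2ACV01-19
    "4424213011311311134402"   # 3ACV01-22
    "222011323434114113"       # 4ACV01-18
)]
new = [int(c) for c in "21012341342"]  # 6ACV01-11, matching n2 = 11

def mann_whitney(a, b):
    """Return (U_a, U_b, z) via pair counting and a tie-corrected normal approximation."""
    ca, cb = Counter(a), Counter(b)
    # U_a: pairs where a's rating beats b's, with each tie worth half a point.
    u_a = sum(na * nb * (1.0 if x > y else 0.5 if x == y else 0.0)
              for x, na in ca.items() for y, nb in cb.items())
    n1, n2 = len(a), len(b)
    n = n1 + n2
    # Tie correction: sum of t^3 - t over each group of tied ratings.
    ties = sum(t**3 - t for t in (ca + cb).values())
    sigma = sqrt(n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1))))
    z = (u_a - n1 * n2 / 2) / sigma
    return u_a, n1 * n2 - u_a, z

u_old, u_new, z = mann_whitney(old, new)
print(u_old, u_new, round(z, 2))
```

This gives U = 420.5 for the pre-cancellation group (371.5 for the revival group) and a z of roughly 0.34, well inside the +/-1.96 band for alpha=.05 two-tailed.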

Mongo

Bending Unit

« Reply #8 on: 09-03-2010 06:15 »

Thank you for that very interesting study. Of course, it is based on my own personal ratings, but I suspect that other people's ratings of the episodes of the new run, while they will vary when looking at individual episodes (depending on how important to them the various elements making up each episode are), will be generally similar to those of the original run (allowing for sample size effects).

KyleG
Poppler

« Reply #9 on: 09-03-2010 23:44 »

I'll clarify a bit more what the results are saying. Basically, we're "confirming" there is not enough evidence to show the pre/post cancellation episodes are "different." This is not the same thing as having enough evidence to show they are the same.

It's like trying to work out how old you are. There is a huge difference between having enough evidence to know you are not 90 (meaning you could be <=89 years old, including 26) and having enough evidence to know you are 26 (meaning you are not 90, but are also absolutely 26).

Still, it's the best I know how to do here.

Veritas

Crustacean

« Reply #10 on: 09-04-2010 00:36 »

I think it's entirely possible for the new episodes to be stylistically different rather than quantitatively different - they're both good, but in different ways.

speedracer
Bending Unit

« Reply #11 on: 09-04-2010 05:00 »
« Last Edit on: 09-04-2010 05:02 »

I played around with the Mann-Whitney test a little bit here.

Punch in n_A = 72 and n_B = 12, input scores for each episode in the table, then crank it. If z > 2, then you can be pretty sure that there's a difference in the quality of the two sets of episodes (assuming that I understand this correctly).

The sample size for the second set (12 episodes) is so small that it's really unlikely that anyone would ever be able to firmly say that there's a difference, though -- just playing around with some sample values, it looks like you'd have to think that 4 or 5 of the 12 new episodes are legit contenders for Worst Episode Evar in order to confidently say that the new run is worse.

coldangel

DOOP Secretary

« Reply #12 on: 09-07-2010 14:48 »

I can help you out with this.
I'm essentially a demigod, to the extent that I'm so far above the rest of humanity as to be basically a new species (homo superior), so my opinion on any matter can be taken as indisputable fact.
The new episodes are easily as good as anything that's gone before.
There. Discussion finished.

Nibblonian Leader

Urban Legend

« Reply #13 on: 09-07-2010 14:55 »

So the legend goes...

transgender nerd under canada

DOOP Ubersecretary

« Reply #14 on: 09-07-2010 16:42 »

Quote from: coldangel_1 on 09-07-2010 14:48

I can help you out with this.
I'm essentially ... homo ...so my opinion on any matter can be taken as indisputable fact.
The new episodes are easily as good as anything that's gone before.
There. Discussion finished.

This is how I read Coldy's post.

PumaGirl

Starship Captain

« Reply #15 on: 09-07-2010 18:06 »

Quote from: KyleG on 09-03-2010 23:44

I'll clarify a bit more what the results are saying. Basically, we're "confirming" there is not enough evidence to show the pre/post cancellation episodes are "different." This is not the same thing as having enough evidence to show they are the same.

Kind of what I said earlier (at least that's what I meant). I didn't feel there was any need to do statistical testing; with this sample size, pretty much any test would reveal statistical insignificance (or, to put it differently, you can't reject the null). But nice job going through the trouble.

coldangel

DOOP Secretary

« Reply #16 on: 09-08-2010 07:51 »

Oh, real mature tnuk.

I bet you giggle when you hear the word titmouse.

seattlejohn01

Space Pope

« Reply #17 on: 09-08-2010 10:37 »

C'mon, Coldy, you KNOW you giggle as well...

coldangel

DOOP Secretary

« Reply #18 on: 09-08-2010 13:01 »

Totally not the point.
