## Main Question or Discussion Point

Hi,

For my thesis, I've chosen to run a user preference study comparing three sources of mouth animation for a talking 3D model: two machine-learning models and the ground truth.

Users will see two videos at a time, each from one of the models, and indicate which one they think looks more natural, or that they are neutral. The videos are synchronized. Each user makes one comparison for each sentence (out of a set of s sentences) and for each pair of models to be compared. So, with s sentences and three pairs (A vs. B, B vs. C, and A vs. C), each user makes s*3 comparisons.
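To make the design concrete, here is a minimal sketch of how the trials for one user could be enumerated. It assumes every unordered pair of the three models is shown once per sentence; the labels in `MODELS` and the helper `build_trials` are placeholders of mine, not part of any existing tool.

```python
from itertools import combinations

MODELS = ["A", "B", "C"]  # placeholder labels; one of them is the ground truth


def build_trials(num_sentences):
    """Return every (sentence_index, (model_1, model_2)) trial for one user."""
    pairs = list(combinations(MODELS, 2))  # unordered pairs: AB, AC, BC
    return [(s, pair) for s in range(num_sentences) for pair in pairs]


trials = build_trials(num_sentences=10)
print(len(trials))  # number of pairs times number of sentences
```

In a real study the order of the trials, and which video appears on which side, would presumably be randomized per user; that is left out here.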

Now, I could arrange the data as simply A vs. B, B vs. C, and A vs. C. Then, let's say A was preferred a majority of the time over B (the total number of comparisons made of A vs. B is s*u where u is the number of users). What test should I use to see if this is statistically significant?

- Could I even just compare them all at the same time by treating the "wins" as "points" to get three sums or means representing "scores", and then use a repeated measures ANOVA followed by e.g. Tukey's HSD? A lot of methods (Tukey's HSD, Student's t-test...) assume that the score for A is independent of the score for B, but since this is a preference study, they aren't really independent, right? E.g. if there's a total of 20 non-neutral comparisons and A wins 15, then B necessarily wins 5.

- Maybe it's as simple as doing a pairwise comparison, e.g. A vs. B, defining H0 as "the two models are voted for equally often" and taking B's number of wins against A as the dependent variable X? Under H0, X would be binomial with its mean at half of the comparisons, (s*u)/2, i.e. X ~ Bin(s*u, 0.5). I could then check whether the observed result x is statistically significant with a two-sided test at significance level a, i.e. whether P(X >= x | H0) < a/2 or P(X <= x | H0) < a/2. One problem is that the binomial assumption may not hold, since I'm not sure a user's judgments for different sentences can be treated as independent. Maybe I'd have to aggregate the data so that X is the number of users favoring model A instead? If I did that, though, the number of sentences would no longer make a difference to the statistical significance of the result. Or maybe I can assume that a subject's votes on different sentences are independent, as are different subjects' votes on any given sentence, so that I could do what I originally proposed and check the two-sided probability of the outcome against Bin(s*u, 0.5)?

- There's also the option of logistic regression that I've dug myself into...
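To make the second option concrete, here is a rough sketch of the two variants I have in mind for a single pair (A vs. B): pooling all votes, and aggregating to one preference per user. It is only an illustration under assumptions of mine: neutral votes (and users with a tie) are dropped, so n is the number of non-neutral votes rather than s*u, and the two-sided p-value is taken as twice the smaller tail, which is exact for p = 0.5 because the distribution is symmetric. The function names and `toy_votes` are made up for the example; the data is not real.

```python
from math import comb


def sign_test_p(wins_a, wins_b):
    """Exact two-sided p-value under H0: X ~ Bin(n, 0.5), n = wins_a + wins_b."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1))  # number of outcomes with X <= k
    return min(1.0, 2 * tail / 2**n)  # double the smaller tail, cap at 1


def pooled_counts(votes_by_user):
    """Count all A and B votes across users; neutral votes ('N') are ignored."""
    a = sum(votes.count("A") for votes in votes_by_user.values())
    b = sum(votes.count("B") for votes in votes_by_user.values())
    return a, b


def per_user_counts(votes_by_user):
    """Count users who mostly preferred A or B; users with a tie are dropped."""
    favour_a = favour_b = 0
    for votes in votes_by_user.values():
        a, b = votes.count("A"), votes.count("B")
        if a > b:
            favour_a += 1
        elif b > a:
            favour_b += 1
    return favour_a, favour_b


# Hypothetical toy data: one vote per sentence for the A-vs-B pair.
toy_votes = {
    "u1": ["A", "A", "A", "B", "N"],
    "u2": ["A", "A", "B", "A", "A"],
    "u3": ["B", "A", "A", "A", "N"],
    "u4": ["B", "B", "A", "B", "B"],
}

print("pooled:  ", sign_test_p(*pooled_counts(toy_votes)))
print("per user:", sign_test_p(*per_user_counts(toy_votes)))
```

The pooled version treats every vote as an independent trial, which is exactly the assumption I'm unsure about; the per-user version avoids that but has only u observations, however many sentences each user rated.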

I want to be sure I'm doing this correctly, so I'd greatly appreciate any insight!
