How to correctly analyze the results of a preference study?

In summary, Vaering is comparing three different machine-learning models to see which produces the most natural mouth animation. He is collecting user preference data and asking how best to analyze it.
  • #1
Vaering
Hi,

For my thesis, I've chosen to use a user preference study to compare the results of three different machine-learning models (well, one of them is the ground truth) that output the mouth animation of a talking 3D model.

Users will see two videos at a time, each from one of the models, and indicate which they think looks the most natural, or whether they feel neutral. The videos are synchronized. For each user, one comparison will be made for each sentence (out of a set of s sentences) and for each pair of models to be compared. So, with s sentences and three pairs of models, each user makes 3*s comparisons.

Now, I could arrange the data simply as A vs. B, B vs. C, and A vs. C. Then, let's say A was preferred a majority of the time over B (the total number of A vs. B comparisons is s*u, where u is the number of users). What test should I use to see whether this is statistically significant?

- Could I even just compare them all at the same time by treating the "wins" as "points" to get three sums or means representing "scores", and then use a repeated measures ANOVA followed by e.g. Tukey's HSD? A lot of methods (Tukey's HSD, Student's t-test...) assume that the score for A is independent of the score for B, but since this is a preference study, they aren't really independent, right? E.g. out of 20 non-neutral comparisons, if A wins 15 then B wins 5.

- Maybe it's as simple as doing a pairwise comparison, e.g. A vs. B, defining H0 as "the models will be voted for equally", and taking B's number of wins against A as the dependent variable X? Assuming a binomial distribution whose mean lands on half of the wins, i.e. (s*u)/2 (so if H0 is true then X ~ Bin(s*u, 0.5)), I could then check whether the actual result x is statistically significant, i.e. whether P(X >= x | H0) < α/2 or P(X <= x | H0) < α/2. A problem with this, however, is that the independence assumption isn't all that solid, since I'm not sure a user's judgments on different sentences can actually be assumed independent? Maybe I'd have to aggregate the data so that X is the number of users favoring model A instead? Though if I did that, the number of sentences wouldn't make any difference to the statistical significance of the result. Or maybe I can assume that a subject's votes on different sentences are independent, as well as different subjects' votes on any sentence, so that I could just do what I originally proposed and check the two-sided probability of the outcome against Bin(s*u, 0.5)? (A minimal sketch of that binomial test follows these options.)

- There's also the option of logistic regression that I've dug myself into...
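
Just to make the second option concrete for myself, here's roughly what that binomial (sign) test would look like in R under the questionable independence assumption; the counts are made up for illustration:

```r
# Exact two-sided binomial (sign) test for A vs. B, *assuming* all s*u
# comparisons are independent (exactly the assumption I'm unsure about).
# The counts below are made up for illustration.
wins_A        <- 63    # hypothetical: A preferred in 63 non-neutral comparisons
n_comparisons <- 100   # hypothetical: 100 non-neutral A vs. B comparisons in total
binom.test(wins_A, n_comparisons, p = 0.5, alternative = "two.sided")
```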

I want to be sure I'm doing this correctly, so I'd greatly appreciate any insight!
 
  • #2
Have you already acquired the data and are just searching for ways to analyze it, or are you still designing the study, both the data collection and the analysis?

Also, are the sentences grouped into different categories like fast and slow or emotional and informational?
 
  • #3
I'm still designing it, and I figured I'd plan it out completely before conducting the survey.

Nope; they're all in the same neutral mood and at roughly the same speed, and they were all recorded by me.
 
  • #4
Good approach! Way too many people collect data without thinking about the analysis in advance.

I would recommend a 7-point Likert scale where 1 is "strongly prefer A", 2 is "prefer A", 3 is "slightly prefer A", 4 is "no preference", and so on for B. There is a lot of literature on analyzing Likert scales. Typically you would have several questions about the same topic and would sum them within that topic; here you would have just three (A vs B, B vs C, and C vs A). Then you have a single Likert rating for each, and you can treat it as a normally distributed continuous variable. Standard ANOVA or GLM methods work.
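
As a rough sketch of the mechanics (made-up data, and just one way of setting it up, with the user as the repeated-measures blocking factor):

```r
# A sketch with made-up data: each user gives one 1-7 rating per pairwise
# comparison, where 4 means "no preference".
set.seed(1)
d <- data.frame(
  user       = factor(rep(1:30, each = 3)),
  comparison = factor(rep(c("AvsB", "BvsC", "CvsA"), times = 30)),
  rating     = sample(1:7, 90, replace = TRUE)
)

# Treat the rating as a continuous response; a repeated-measures ANOVA with
# the user as the error stratum is one standard option.
summary(aov(rating ~ comparison + Error(user), data = d))
```

This particular ANOVA asks whether the mean rating differs between the three comparisons; testing each comparison's mean against the neutral value 4 is another framing, which comes up again later in the thread.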
 
  • #5
Your test is more powerful if you have them rank-order A, B, C and apply order statistics. If you intend to do ANOVA, be sure to apply experimental design techniques so that you don't end up with a weak test. This is most important if each experiment is expensive and you try to get tricky by reducing the number of tests with certain combinations.
 
  • #6
Dale: Ah yes, you could say I was basically thinking of doing a 3-point Likert item (prefer A, no preference, prefer B) at first, but more points could maybe make the results more significant? And I guess I can then consider each observation independent even though it's the same users voting on different sentences, so I'd have u*s observations (u and s being the number of users and sentences respectively) for e.g. A vs B, since if the models were equal a user would be equally likely to pick either model on different sentences.

FactChecker: So basically showing all three models at the same time for every sentence and letting the subjects rank them? I guess I could see whether the number of times e.g. A was favored over both other models is enough to reject H0 (that the models are equally favored). Or apply a scoring like 1 for the first position, 2 for the second and 3 for the third, and then do a hypothesis test on the average score, like the second answer here: https://stats.stackexchange.com/que...al-test-for-significance-in-ranking-positions
Could you elaborate a bit on why this test would be more powerful?
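
For my own reference, one off-the-shelf rank-based test that could be run on such rankings is the Friedman test; this is just my sketch with fake rankings, not necessarily the order-statistics approach you have in mind:

```r
# Fake data: each row is one ranking of the three models (1 = best, 3 = worst).
# In my study this would be one ranking per user per sentence, so the
# dependence between a user's rankings would still need handling.
set.seed(2)
ranks <- t(replicate(30, sample(1:3)))
colnames(ranks) <- c("A", "B", "C")

# Friedman test: are the three models ranked differently on average?
friedman.test(ranks)
```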
 
  • #7
Vaering said:
And I guess I can consider each observation independent then even though it's the same users voting on different sentences
I definitely would not assume that. You will get hammered in peer review if you do that. The observations from a single user are expected to not be independent until the data shows otherwise. In the likely case that they are not independent, you will simply do a mixed effects model with the user as a random effect.
 
  • #8
It is clear that the set of all pairwise comparisons from one person gives logically dependent results, so there would be dependencies in the larger sample. In any case, getting people to give you results is usually the hardest part. The way to get the most out of any one person would be to get some sort of ranking (or relative scores) of the 3 alternatives. You can always convert that to dependent pairwise comparisons if you go that way, but I can't help thinking that retaining the most information for the statistical test is the most efficient approach. That would be the rankings (or relative scores).

It seems that relative scoring of the 3 options on a scale (say 1..10) would get the most information from each person.
 
  • #9
Dale said:
I definitely would not assume that. You will get hammered in peer review if you do that. The observations from a single user are expected to not be independent until the data shows otherwise. In the likely case that they are not independent, you will simply do a mixed effects model with the user as a random effect.
Ah, I think the second answer here implicitly makes that erroneous assumption: https://stats.stackexchange.com/a/281020/218515. People also seem to do it in several other studies, like here: http://www.zhizheng.org/papers/is2015_oliver_control.pdf. It's tempting, since it's much easier to get a lot of observations (u*s) that way, and the number of sentences then actually matters for the statistic, compared with each observation being just a sum or an average per user. I guess mixed models would allow for these kinds of dependencies and thus keep the larger number of observations, with the number of sentences still affecting the outcome, though I've never dabbled in them.

I think I could model the response with lme4 as: score ~ model + (1+model|sentence) + (1+model|subject) + ε, i.e. we allow the score to vary depending on the model, but take into account that different subjects can have different random intercepts and slopes across their observations, and so can sentences. This way we can use all u*s observations even though they're not really independent. Then, if this model is statistically significantly better at predicting the outcome than score ~ 1 + (1|sentence) + (1|subject) + ε, using a hypothesis test like in https://web.stanford.edu/class/psych252/section_2013/Mixed_models_tutorial.html, I can reject the null hypothesis that the models are equal?
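
In lme4 terms, I imagine the likelihood-ratio test would look roughly like this (sketch only; `d` is hypothetical data with columns score, model, sentence and subject, and the random-effects structure is kept identical in both models so the test isolates the fixed effect of model):

```r
library(lme4)

# Full model: fixed effect of model, random intercepts and slopes for
# sentence and subject (fit with ML, not REML, so the LRT is valid).
m_full <- lmer(score ~ model + (1 + model | sentence) + (1 + model | subject),
               data = d, REML = FALSE)

# Null model: same random-effects structure, but no fixed effect of model.
m_null <- lmer(score ~ 1 + (1 + model | sentence) + (1 + model | subject),
               data = d, REML = FALSE)

# Likelihood-ratio (chi-squared) test of the fixed effect of model.
anova(m_null, m_full)
```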

These simpler hypothesis tests always assume all observations are independent, so in that case I'd have to take the sum or the average of each user's observations and treat that as one observation, giving me u observations rather than u*s, right? Those aggregated observations can then be assumed independent, since there'd be one sum (i.e. one Likert scale) per user, ranging from s to 7*s if the per-sentence Likert items are 1 to 7.

- Then, since this is a preference study, each Likert item (each comparison) is going to be both the score for A and the score for B at the same time, so in the A vs B test I could just test whether the score for A deviates enough from a neutral score? All the Likert tests I could find require an independent variable, but in each A vs. B test I wouldn't really have one; I just want to see if the score is significantly different from neutral. I guess I could do a one-sample t-test to see if the deviation from a neutral score is significant? (A tiny sketch of that is below, after the next option.)

- Or I could summarize each model's total score, so A's score would come from the A vs B as well as the A vs C comparisons, giving each model a score built from u*2 observations (sums). Then compare these three scores using a repeated measures ANOVA (each model being a "measure") followed by Tukey's HSD? Only, the three scores wouldn't be independent now, right? Since it's a preference study, for each term in A's score there's a term in B's or C's score that is its complement, i.e. for each sentence a user votes on, if A gets 2 then B gets 7-2. Is that a problem?
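
For the one-sample t-test idea above, I imagine it would just be something like this (made-up per-user scores; `neutral` is whatever value corresponds to "no preference" on the aggregated scale):

```r
# One aggregated A-vs-B score per user (made-up numbers), compared against the
# value that represents "no preference" on that aggregated scale.
per_user_score <- c(4.8, 5.1, 3.9, 4.4, 5.3, 4.1, 4.9, 5.0, 4.2, 4.6)
neutral <- 4
t.test(per_user_score, mu = neutral)   # H0: mean score equals the neutral value
```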

FactChecker said:
It is clear that the set of all pairwise comparisons from one person gives logically dependent results, so there would be dependencies in the larger sample. In any case, getting people to give you results is usually the hardest part. The way to get the most out of any one person would be to get some sort of ranking (or relative scores) of the 3 alternatives. You can always convert that to dependent pairwise comparisons if you go that way, but I can't help thinking that retaining the most information for the statistical test is the most efficient approach. That would be the rankings (or relative scores).

It seems that relative scoring of the 3 options on a scale (say 1..10) would get the most information from each person.
All right, and then I would compare the mean scores for the three models? Once again, each observation would be the sum of each user's votes across the sentences, to keep observations independent; so if user i ranked model A 1st, 2nd and 1st in three sentences, user i's score for model A would be 4, and the total score for A would be the average of all users' scores for it? What kind of test would be good for comparing the means then? A repeated measures ANOVA (each model being a "measure") with Tukey's HSD, I guess?
 
  • #10
Vaering said:
Then, if this model is statistically significantly better at predicting the outcome than score ~ 1 + (1|sentence) + (1|subject) + ε, using a hypothesis test like in https://web.stanford.edu/class/psych252/section_2013/Mixed_models_tutorial.html, I can reject the null hypothesis that the models are equal?
Yes, exactly!
 
  • #11
All right, thanks guys! I think I'll go with the regression model then, and show all three videos at once each time, with a 1-10 scale for each. That way I think the scoring might be more consistent. Maybe it would also be an idea to do some kind of normalization to make the relative scores more consistent and reduce the per-user and per-sentence bias (one user might tend to use more extreme scores than another even though they mean the same thing, or might do so for only one sentence). But I guess that's really handled by the random effects in the regression in this case. I could also maybe include a user:sentence interaction effect.

For presenting the data, however, some kind of normalization/standardization would be nice. Instead of that, since users might be confused about the meaning of the scores and where to start (e.g. whether the worst model should be at 0 or 4), I could enforce putting the worst model at 0, and then instruct users that a distance of 1 means "slightly preferred", 2 "preferred" and 3 "strongly preferred". I can also encourage this with feedback, like a dynamic text that says "According to you, A is strongly preferred to B, which is equal to C", depending on the scores the user entered. That way, no single sentence judgment from a user will affect the presented scores more than any other judgment, making the presentation fairer. I could maybe also disallow distances bigger than 3 between two consecutive models. The absolute scores themselves don't really mean anything; I'm only interested in the relative scoring (which is probably also less noisy). That should probably make the mixed effects regression easier as well. I'll also ignore the first three responses to avoid training effects.

So to sum it up:

- Three videos side by side, one from each model, in random order, playing the same sentence simultaneously. Start/pause and volume controls are available. Above each video is a label ("A", "B" or "C"), and I'll also dynamically display the score the user has entered for it, to make things as easy as possible.
- Beneath, there are three horizontal radio button scales, 1-10, with corresponding label "A", "B" or "C" from top to bottom. First, users select which model was worst using a stand-alone dropdown at the top, and that model is automatically assigned a score of 0 in its radio button scale and the radio button scale for it is greyed out. Users then set the scores for the others, and are allowed to have distances up to three points between neighboring (score-wise) models.
- At the bottom there's a feedback sentence like "According to you, A is strongly preferred to B which is slightly preferred to C" to make things clear.

In the end, the scores can be modeled with a mixed effects model and I'll see which of the relative differences between models are statistically significant. Any thoughts on that?
 
  • #12
Giving users feedback about their scoring is not common. I don’t think it is a bad idea, but I would make sure that I had a previous study to cite.

One thing that is often helpful is to generate a set of fake expected data. Then do your analysis with that data and make sure that it works the way you intend.
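
For instance (a minimal sketch with made-up effect sizes and a simplified random structure, just to show the idea):

```r
# Simulate data with the structure and effects you *expect*, then run the
# planned analysis on it and check that it behaves the way you intend.
set.seed(3)
n_users <- 20; n_sentences <- 10
d_fake <- expand.grid(subject  = factor(1:n_users),
                      sentence = factor(1:n_sentences),
                      model    = factor(c("A", "B", "C")))
user_bias    <- rnorm(n_users, 0, 0.5)         # per-user shift in how they score
model_effect <- c(A = 0, B = -0.5, C = -1.0)   # assumed true preferences (A = reference)
d_fake$score <- 5 + model_effect[as.character(d_fake$model)] +
                user_bias[d_fake$subject] +
                rnorm(nrow(d_fake), 0, 1)      # residual noise
# ...then fit the mixed models from post #9 on d_fake and check that the
# analysis recovers the built-in preferences before collecting real data.
```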

Since you have a “reference standard” you could also set that to 0 and then all comparisons would be relative to the reference.
 
  • #13
Dale said:
Giving users feedback about their scoring is not common. I don’t think it is a bad idea, but I would make sure that I had a previous study to cite.

One thing that is often helpful is to generate a set of fake expected data. Then do your analysis with that data and make sure that it works the way you intend.

Since you have a “reference standard” you could also set that to 0 and then all comparisons would be relative to the reference.
I don't mean feedback on how good their choices were or anything, just a restatement in words of what they chose, to clarify what the scores mean and make the task easier, with less risk of misunderstanding.

Yes, I'll try that :)

By "reference standard" do you mean the ground truth model or?
 
  • #14
Vaering said:
By "reference standard" do you mean the ground truth model
Yes
 

1. How do I determine the significance of the results in a preference study?

To determine the significance of the results in a preference study, you can use statistical tests such as chi-square or t-test. These tests can help you determine if the differences in preferences between groups are statistically significant or simply due to chance.

2. What is the best way to present the results of a preference study?

The best way to present the results of a preference study is by using clear and concise tables or graphs. This will make it easier for readers to understand and interpret the results. Additionally, including a brief explanation of the results and their implications can also be helpful.

3. How do I handle missing data in a preference study?

Missing data in a preference study can be handled by using techniques such as imputation, where missing values are replaced with estimated values based on other data. Another approach is to exclude cases with missing data from the analysis, but this can potentially bias the results.

4. Should I use qualitative or quantitative analysis for a preference study?

The type of analysis used for a preference study will depend on the research question and the type of data collected. Qualitative analysis can provide more in-depth insights into participants' preferences, while quantitative analysis can provide more generalizable results. It may also be beneficial to use a combination of both approaches.

5. How can I ensure the accuracy of my results in a preference study?

To ensure the accuracy of results in a preference study, it is important to have a well-designed study with a representative sample size. Additionally, using reliable and valid measurement tools and properly analyzing the data can also help ensure accurate results. It is also important to consider any potential biases that may have influenced the results.
