This is a LOT better controlled than many educational studies I've seen. I like the multi-institution approach; it seems sounder than limiting the study to one student population at one university, as too many educational research studies do.
I didn't read all of it. The specific topic they were teaching isn't of much interest to me, and doesn't make a lot of sense to me for that reason; I was more interested in the methodology.
I have a general question about this type of study: what would you consider the "N" to be? Is it really 2000, the number of students tested, or would it be 8 (4 for each teaching method), the number of classes taught? Though, unless my quick skim missed it, they didn't actually specify how many classes or lecturers were involved at each institution. There may have been more than one of each at some.
I ask because one concern I always have in comparing teaching methods is that an engaged instructor who is willing to try new methods might simply be giving a better lecture regardless of the method used. Likewise, from year to year, I see noticeable differences in attitude across whole classes that affect the performance of the class as a whole. So, for statistical purposes, I think one must account for the number of classes in the design. I kind of look at this as a nested factorial design: students, students within classes, classes within institutions. In this case, there are still sufficient degrees of freedom to do a proper statistical analysis. My concern with too many other studies is that they look at only one institution or one class, so the results aren't sufficiently generalizable outside that institution to make for a valid research study.
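To make the "is N 2000 or 8?" question a bit more concrete, here is a rough sketch using the standard design-effect approximation for clustered samples, 1 + (m - 1) * ICC, where m is the average class size and ICC is the intraclass correlation (how alike students in the same class are). The 2000 students and 8 classes are just my figures from above, and 8 is a guess since the paper didn't seem to say. The ICC values are made up for illustration, the function name is my own, and the formula assumes equal-sized classes, so treat this as a back-of-the-envelope illustration and not anything the authors did.

```python
def effective_sample_size(n_students, n_classes, icc):
    """Approximate effective N for students nested in classes.

    Uses the design-effect approximation 1 + (m - 1) * icc,
    assuming all classes have the same size m (an assumption,
    not something reported in the study).
    """
    if n_students <= 0 or n_classes <= 0:
        raise ValueError("n_students and n_classes must be positive")
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must be between 0 and 1")
    m = n_students / n_classes            # average class size
    design_effect = 1.0 + (m - 1.0) * icc
    return n_students / design_effect


if __name__ == "__main__":
    # Hypothetical ICC values, chosen only to show the range of outcomes.
    for icc in (0.0, 0.05, 0.2, 1.0):
        n_eff = effective_sample_size(2000, 8, icc)
        print(f"ICC = {icc:.2f} -> effective N is about {n_eff:.0f}")
```

At the two extremes this gives back both answers: if students in a class are no more alike than students in different classes (ICC = 0), the effective N is the full 2000, and if everyone in a class performs identically (ICC = 1), it collapses to 8. The real answer sits somewhere in between, which is why the number of classes matters so much.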