My sense has been that the PER community still implements subpar standards of research reporting that minimize our ability to carry out meaningful meta-analysis. I’m not an expert, but I’m assuming that scores with standard deviations / standard errors would be necessary for a meta-analysis, right? So I’m curious: I’m going to take a quick look at some recent papers that report FCI scores as a major part of their study and see what kind of information the authors provide. Here’s how I’ll break it down.
N = number of students
Pre = FCI pre-score either as a raw score out of 30 or a percentage (with or without standard deviation / standard error of mean)
Post = FCI post-score either as a raw score out of 30 or a percentage (with or without standard deviation / standard error of mean)
g = normalized gain with or without error bars / confidence intervals
<G> = average normalized gain with or without error bars / confidence intervals
Gain = Post minus Pre (with or without standard deviation / standard error of mean)
APost = ANCOVA-adjusted post score (with or without standard error of mean)
d = Cohen’s d, a measure of effect size (with or without confidence intervals)
I’m leaving out statistical transparency measures such as t-statistics, p-values, or other outputs from ANOVA, and I’m sure there are others, such as accompanying data about gender, under-represented minorities, ACT scores, declared major, etc.
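For reference, here’s a minimal sketch of how the quantities above are typically computed. This is my own illustration, not code from any of the papers; the FCI maximum of 30 and the pooled-standard-deviation form of Cohen’s d are assumptions on my part.

```python
from statistics import mean, stdev
from math import sqrt

def normalized_gain(pre, post, max_score=30):
    """Per-student normalized gain: g = (post - pre) / (max - pre)."""
    return (post - pre) / (max_score - pre)

def average_normalized_gain(pres, posts, max_score=30):
    """Class-average normalized gain <G>, computed from class means."""
    return (mean(posts) - mean(pres)) / (max_score - mean(pres))

def cohens_d(pres, posts):
    """Cohen's d for pre/post scores, using the pooled standard deviation."""
    n1, n2 = len(pres), len(posts)
    pooled_sd = sqrt(((n1 - 1) * stdev(pres) ** 2 +
                      (n2 - 1) * stdev(posts) ** 2) / (n1 + n2 - 2))
    return (mean(posts) - mean(pres)) / pooled_sd

# Made-up raw scores (out of 30) for five students:
pre = [10, 12, 15, 9, 14]
post = [20, 22, 25, 18, 24]
```

The point of spelling this out is that d and <G> are easy to recompute from raw pre/post data, but not the other way around, which is exactly why the raw scores with their spreads matter for later re-analysis.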
Anyway, here we go:
1. Thacker, Dulli, Pattillo, and West (2014), ”Lessons from large-scale assessment: Results from conceptual inventories”
Raw Data: N
Accompanying Data: None
Calculated Data: g with standard error of the mean (mostly must be read from graphs)
2. Lasry, Charles and Whittaker, “When teacher-centered instructors are assigned to student-centered classrooms”
Raw Data: N, Pre with standard deviation
Accompanying Data: None
Calculated Data: g with standard error of mean (must be read from graphs), APost with standard error
3. Cahill et al: Multiyear, multi-instructor evaluation of a large-class interactive-engagement curriculum
Raw Data: N
Accompanying Data: Gender, major, ACT
Calculated Data: g with standard error of mean (must be read from graphs)
4. Ding: “Verification of causal influences of reasoning skills and epistemology on physics conceptual learning”
Raw Data: N, Pre (with standard deviation), Post (with standard deviation)
Accompanying Data: Others related to study, CLASS, for example
Calculated Data: g with standard error of mean
5. Crouch and Mazur, ”Peer Instruction: Ten years of experience and results”
Raw Data: N, Pre (without standard deviation), Post (without standard deviation)
Calculated Data: g (without standard deviation), d (without confidence intervals)
6. Goertzen et al, “Moving toward change: Institutionalizing reform through implementation of the Learning Assistant model and Open Source Tutorials”
Raw Data: N, Pre (with SD), Post (with SD),
Accompanying Data: Gender, race, etc.
Calculated Data: Gain (with SD), d (with CI)
7. Brewe et al, “Toward equity through participation in Modeling Instruction in introductory university physics”
Raw Data: N, Pre (with SE), Post (with SE)
Accompanying Data: Gender, majority/minority
Calculated Data: Gain (with SE), d (with CI)
So, what do I see?
Of my quick grab of 7 recent papers, only 3 meet the criteria for reporting the minimum raw data that I would think are necessary to perform meta-analyses. Not coincidentally, two of these three papers are from the same research group. Also, probably not coincidentally, all three papers include data in both graphs and tables, with error bars or confidence intervals. They also consistently reported measures related to any statistical analyses performed.
Four of the papers did not fully report raw data. One of the four came close to giving all the raw information needed, reporting ANCOVA-adjusted post scores rather than raw post scores; even there, the pre-score data is buried, and the APost and g scores can mostly only be gleaned from graphs. Two of the papers did not give raw pre or post data. They reported normalized gain with error bars shown, but these could only be read from a graph. These two papers did some statistical analyses but didn’t report them fully. The last of the four reported pre and post scores but didn’t include standard errors or deviations. They carried out some statistical analysis as well, but did not report it meaningfully or include confidence intervals.
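To make concrete why the standard errors matter, here is a sketch (with made-up numbers, not values from these papers) of the standard inverse-variance fixed-effect pooling used in meta-analysis. Without a per-study standard error, the weights simply can’t be computed:

```python
from math import sqrt

def fixed_effect_pool(effects, ses):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error.

    Each study is weighted by 1/SE^2, so studies without a reported
    standard error cannot be included at all.
    """
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return pooled, pooled_se

# Hypothetical Cohen's d values and standard errors from three studies:
d_values = [0.45, 0.60, 0.30]
se_values = [0.10, 0.15, 0.12]
pooled_d, pooled_se = fixed_effect_pool(d_values, se_values)
```

A paper that reports only a point estimate read off a small graph forces a would-be meta-analyst to guess at `se_values`, which undermines the whole exercise.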
I don’t intend this post to point the finger at anyone, but rather to point out how inconsistent we are. Responsibility is community-wide: authors, reviewers, and editors. My sense from looking at these papers, even the ones that didn’t fully report data, is that this is much better than what was historically done in our field. Statistical tests were largely performed, but not necessarily reported fully. Standard errors were often reported, but often had to be read from small graphs.
There’s probably a lot someone could dig into here, but it’s probably not going to be me.