Dr. David Bateson's studies represent some of the only controlled studies on the effect of block scheduling on academic performance and show genuine evidence of harm. While proponents of block scheduling were able to simply ignore Bateson's work for a long time, the spread of information via the Internet has caused them to finally acknowledge that the Canadian studies exist and to develop a response. The response, as evidenced by Mr. Vawter's memo, is intriguing. While it is clear that proponents of the block might naturally disagree with studies showing harm, I am surprised at the logic and analysis used to make the attack.
Mr. Vawter's memo, addressed "To Persons Inquiring About the Bateson Study Referenced in Jeff Lindsay's WEB site," has been circulated to many school districts and merits a public response. Mr. Vawter's memo is quoted below as I have received it from several sources through e-mail. I will interrupt the memo at several points to respond to major arguments. (Mr. Vawter's text will appear indented, if you have a good browser, and will be in quotes. My comments will begin with "Response.")
Let me first note that Mr. Vawter is not always clear about which Bateson study he refers to. Some of his arguments refer to elements of Bateson's 1990 publication ("Science Achievement in Semester and All-year Courses," Journal of Research in Science Teaching, 27(3): 233-240 (1990)), based on a 1986 assessment, while other comments of Mr. Vawter appear to criticize the 1995 study (M. Marshall, A. Taylor, D. Bateson, and S. Brigden, "The British Columbia Assessment of Mathematics and Science: Preliminary Report (DRAFT)," 1995). I will indicate which study he is attacking or I will consider both studies when the argument appears generic. His memo and my comments follow:
"To Persons Inquiring About the Bateson Study Referenced in Jeff Lindsay's WEB site:Response: Truth is not determined by popularity contests!
"This is a copy of a letter I sent to another person concerned about the Bateson Study. I hope this is helpful to you. Please let me know if there is anything I can do to help.
"First, I was a teacher in a high school with a 4x4 block schedule before returning to graduate school to pursue my doctorate. The change to the block was so positive that it helped me decide to postpone my studies for an extra year. I am an advocate of block scheduling because I have experienced both sides of this issue. Second, I am studying under Dr. X [name withheld by webmaster] and my degree will be in Instruction. Third, I have just finished my review of research for my dissertation on both academic and non-academic achievements in the 4x4 block schedule.
"Let me be as brief as I can about the evidence against block scheduling.
"1. There are now hundreds of articles and research about block scheduling; 97% of them are positive. Would these dissenters feel the same way if the opposite were true? Would they try to persuade a school board to switch from block scheduling if 97% of the articles were against block scheduling?"
When an educator writes an article about how he or she implemented block scheduling or any other popular practice or fad, you can bet it's going to be positive. No one will write an article entitled, "How I Pushed My School to Adopt a Failed Program." And once a practice has become a fad, few are going to contradict popular opinion. People and schools tend to hop on the bandwagon, especially if it will boost someone's career or bring publicity or money. Yes, most block scheduling articles in popular educational trade journals are positive, and the fraction may even be close to the 97% estimate of Mr. Vawter, but do 97% of the peer-reviewed, longitudinal studies on block scheduling show evidence of academic gains? Absolutely not.
What is the research supporting block scheduling that Mr. Vawter refers to? Many articles held up as "research" by proponents of the block are either isolated case studies, trade articles (including publicity pieces) that have not been peer-reviewed, studies of something other than block scheduling in secondary schools, or reviews of other articles and not fundamental, scientific investigations of the question that really matters: how does block scheduling in secondary schools affect true academic performance?
Mr. Vawter's appeal to the current popularity of block scheduling needs to be considered in light of other modern fads. Fads have positive press, almost by definition, but that doesn't mean that the fad is good or even harmless. Articles and ads in the popular press used to claim that cigarettes were good or at least not harmful for the body - until many years of scientific investigation exposed the lie. In corporate America, we've had numerous empty management fads, some of which have wreaked havoc. In education, we have experienced several fads accompanied by initially or even continually positive press. In the 1800s, a tremendously acclaimed fad was the practice of phrenology (reading of head bumps) to determine intelligence. In this century, we've had our share of less silly but still questionable fads, all receiving favorable coverage in the popular press. (Some see block scheduling as just a recycled version of the "Mod Scheduling" fad of the 70s.)
Block scheduling is certainly popular among administrators, and we should not be surprised that they will write about it in glowing terms. That doesn't mean that kids on the block are learning more. Again, truth is not determined by popularity contests.
By the way, it would be interesting to know what fraction of the block scheduling articles fail to mention any disadvantages of the block or any harm from it. The number will be much less than the 97% figure Mr. Vawter pulled out of the air for "positive" articles, but it may be that over 97% of the articles by certain notable proponents of block scheduling only list advantages and never give any hint of problems - something more typical of pure propaganda than of "research." It seems that nearly all the information that most school districts are given consists of such one-sided propaganda.
"2. Without getting too specific, there are major problems with the statistics used in the Bateson Study.Response: Truth is not determined by politicians, by governments, or by bureaucrats. Government bureaucrats in Canada have not discounted the Canadian studies as much as they have refused to consider them, as Dr. Dennis Raphael notes on my main B.S. page. It is common that politicians and bureaucrats are more interested in popularity and fads than they are in what's right. One of my major concerns about block scheduling, as expressed on my main B.S. page, is that it seems to be a fad driven by politics and popularity rather than a serious concern for academic performance in light of scientific evidence. It is fascinating, then, to see the student of the nation's leading block scheduling proponent attack Dr. Bateson's scientific studies using what one could term POPULARITY and POLITICS (governmental decisions) as the first issues raised to defend the block - i.e., the claim that 97% of the articles written are in favor of the block, and the claim that the government of Canada has adopted the block in spite of Bateson's data. Neither argument has anything to do with the only issue that matters: does block scheduling improve the education of children?
"a. First, the Canadian government is aware of those studies, and they do not understand the United State's preoccupation with them. They discounted them in the late 1980's and have continued to support semestered schools."
By the way, Mr. Vawter's statement about "Canadian government" reflects a common misunderstanding about Canadian education, one which I shared until gently corrected. The central government does not handle educational issues, but leaves that to individual provinces. Ontario has widely adopted semestering (block scheduling) in spite of the data, and many schools in B.C. have it as well. I don't have data for other provinces, but welcome further information. In any case, does Mr. Vawter have any evidence that politicians or educators across Canada have seriously considered Dr. Bateson's study? If so, the debate has been awfully quiet.
Now Mr. Vawter challenges specific details of Bateson's study:
"b. The test was administered to all students and most of them failed it! The full year students had an average of just over 50%. The test is either invalid or no one is learning passable science."Response: This is a misleading argument. Surely Mr. Vawter realizes that many standardized tests like the ACT or SAT have low averages. If the average were near 100% or near 0%, a test would have little power to separate students by ability. While in-class exams for individual science courses tend to have averages around 70%, comprehensive tests that cover multiple classes are not invalid merely because the average is near 50%. The examination used by Bateson in his 1995 study was a comprehensive one covering grades 7 to 10, for which an average less than 70% is entirely appropriate. Scores near 50% for the 1990 assessment are also not necessarily a problem.
"c. The students were not chosen randomly and this is one of the tenets of research. There are too many threats to internal validity!"Response: This argument is probably directed to Dr. Bateson's 1990 publication, which says that the test was administered to ALL British Columbian 10th-grade students who were present on the day of testing, a total of over 30,000 students. When nearly all of a population is tested, randomization in the sampling becomes irrelevant. If a small subset of those 30,000 students were sampled and used to represent the entire population of 30,000 or so, then careful randomization would be needed, but even then the results would necessarily have more uncertainty and less power (see below) than was achieved by testing everyone. This particular argument of Mr. Vawter sounds impressive, but is grossly misleading or else statistically naive. However, it will probably continue to be used as a "weapon" in the arsenal of block scheduling proponents.
As for the 1995 study, details of the design and its execution have not yet been released by the B.C. Ministry of Education prior to publication of the full study. However, Dr. Bateson reports that the methodology of the study was almost identical to his earlier 1991 study (1991 British Columbian Assessment of Science) which received the annual publication award of AERA for the best program evaluation report of the year, worldwide. Such a study is likely to be on fairly solid ground in terms of elementary issues such as sampling, and Dr. Bateson is known for thorough experimental designs.
What is the foundation for Mr. Vawter's claim that improper selection procedures were used? Certainly every study has weaknesses, but Mr. Vawter should explain how the internal threats affect the validity of the study. Regardless of problems, the studies raise a red flag about the impact of block scheduling on academic performance. The burden of proof, however, must be on the proponents of block scheduling to show academic gains or at least absence of harm with similar scientific studies involving thousands of students.
"d. In statistics there is a technique called 'power.' It can be done initially to determine how large a study needs to be to show significance. In the Bateson study the number of subjects is so large that a very small difference would be 'statistically significant.' many beginning stat books warn of the difference between 'statistically' significant and 'educationally' significant. His study was so large that while it is statistically significant the difference between the scores has no real significance."Response: I'm really surprised at this argument. In statistics, "power" is the ability to discern differences, and high power is ALWAYS desirable in a study. The more data we have, the more we we can know about our subject. Low power is not an advantage, just like a camera with a bad lens is no advantage. We want the focus to be as sharp as possible if the image is to be accurate. To say that either of Bateson's studies "was so large that ... the difference between the scores has no real significance" must be an unintended slip of the pen (so I hope) or a serious misunderstanding of statistics. A large sample size only strengthens the work and increases the ability to make a correct evaluation of effects. Mr. Vawter is correct, however, in noting that a statistically significant result might not have practical significance. If block scheduling only caused a 0.01% decrease in academic performance, that might be a small price to pay for the advantages of the block. However, in Bateson's 1995 study, the effect of scheduling was actually the strongest single effect that arose from the many variables considered and could single-handedly account for a decrease of 5-10% in some measures of academic performance. Bateson's earlier 1990 publication showed a similar effect, with the difference between first-semester block and full-year students being on the order of 6 to 8% in the various cognitive domains that were studied. 
To say that a 5-10% drop in test scores is "not educationally significant" leaves me cold. In my educational experience, a difference of 5% has routinely been the difference between an A and a B for my graduate students in science. On the SAT or ACT tests, losing 5-10% could easily cost a student a lucrative scholarship. If block scheduling truly caused only a 1% drop in average academic performance, I would be concerned. I feel we need to adopt changes that IMPROVE academic performance.
Let me offer a question to Mr. Vawter, to his advisor, and to all proponents of block scheduling: how large of a genuine drop in average academic performance is required before you will call it "educationally significant"? If not 5-10%, then 15%? 25%? If we were talking about I.Q. tests, would you trivialize a factor that caused a "mere" 10% drop in raw scores?
"3. His conclusions are wrong and these data could be used to support block scheduling. When parents hit you with this study, show them these conclusions:Response: In calculating a percentage to describe the relative impact of block scheduling on performance, the denominator should be the number of correct answers, not the total number of questions. So how big is that percentage? With mean scores around 50% for the objective portion of the 1986 assessment, as reported in the 1990 publication, the observed differences between first semester block and full-year students typically represent over one question difference out of a mean of about twenty correct, giving over a 5% difference in performance. Tables 43 and 44 of the released portion of Bateson's 1995 study show results for four different subsections of the assessment, each subsection having twenty questions. These subsections show mean scores around 11 to 14 correct answers, with science score differences between full-year and quarter systems ranging from 0.26 to 1.1 questions, with a mean difference of about 0.7 questions out of 12 (roughly a 6% difference between the two) and nearly a 10% difference in two of the subsections. Differences between full-year (10-month) and semester science courses were smaller than the differences between quarter and full-year courses, typically closer to 2%. Greater differences were observed in the mathematics results shown in Table 45. Differences in the four individual subsections of mathematics questions ranged from 5.3 to 6.5% between full-year and semester courses and from 8.0 to 15.6% (mean of 12% decrease) between full-year and quarter courses. If block scheduling can lead to a 5-12% drop in mathematics achievement, we ought to be concerned. Even if the magnitude of the problem were one question out of 40 correct for a 2.5% deficit, it should still raise concerns for all who want improved academic performance.
" a. The difference between full year and second semester averages is less than half a question in most subtests, and less than that in other subtests. Is less than half of a test item in a 40-item multi-choice test a real difference? Is it an educational difference?
" b. The difference between full year and first semester averages is less than a whole question in most subtests, and less than that in other subtests. Is less than one question a real difference?"
Mr. Vawter argues that a half-question difference or a single question difference is insignificant. The absolute magnitude of an effect may be small but can still be important. Suppose we tested kids to see if they could calculate the square root of 36 - a test with one question and one answer. Suppose we tested thousands of students and found that 70% of full-year students knew the answer while only 63% of block students knew the answer, and that this difference was reproducible and statistically significant. The average score of the full-year students on this one-question test would be 0.70 correct answers compared to 0.63 for block students, a difference of only 0.07 questions. One could try to downplay the data by asserting that a mere 0.07 questions (roughly one-fourteenth of a question) was trivial, but in relative terms it would indicate a 10% drop in mean performance for the particular skill being tested. That kind of difference, if real, could be educationally significant, regardless of its small absolute size. That kind of difference might never show up amid the many sources of human variation in a test of only a few dozen students, for the assessment would lack sufficient power. But with good methodology and a large enough sample size for good power, real effects can come into focus more clearly.
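The arithmetic behind these relative differences is simple enough to sketch. The helper below divides the gap between the two means by the full-year mean, following the rule that the denominator is the number of correct answers rather than the total number of questions. The function name is an assumption of mine; the inputs are the figures quoted in the text, with 11.3 standing in for a mean that is 0.7 questions below 12.

```python
def relative_drop(full_year_mean, block_mean):
    """Drop in mean score as a fraction of the full-year mean."""
    if full_year_mean <= 0:
        raise ValueError("full_year_mean must be positive")
    return (full_year_mean - block_mean) / full_year_mean


# One-question example: 70% versus 63% of students answering correctly.
print(f"{relative_drop(0.70, 0.63):.1%}")
# About 0.7 questions out of a mean of roughly 12 correct.
print(f"{relative_drop(12.0, 11.3):.1%}")
# One question out of a mean of 40 correct.
print(f"{relative_drop(40.0, 39.0):.1%}")
```

The same gap in questions looks very different depending on the baseline, which is why the absolute number of questions alone says little about educational significance.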
Let me give some perspective to a "mere" 5% drop in performance. Consider a typical spread of grades given in a course or on an exam. Normalizing all scores by the highest, the top student has a 100% score and receives an A. Typically those who get above 90% of that high score also get As. B grades may be assigned to those in the range of 80 to 90%, C grades to those in the range of 70 to 80%, and those with scores below 70% or at least below 60% may fail. A range of 40% or less may separate sterling success from outright failure. A 5% drop in performance may be more than one-eighth the distance to failure, and a 10% drop takes a student even further toward the realm of mediocrity. Of course, with all the hundreds of factors that cause variability in human performance, that 5% difference may not be easy to detect if the sample size is small, or it may show up but lack statistical significance due to the scatter in the results. (The negative effect on math performance was just a hair short of statistical significance in Susan Lockwood's study, as discussed on my main block scheduling page, but she declared that block scheduling was therefore a viable option and should be adopted because the academic harm was not statistically significant!) It takes a powerful study, one with much data, to resolve such differences accurately. Once they are found, it is inadequate to simply dismiss them as only being a few percent. Every percentage point of lost performance hurts when 30% can be the difference between success and failure. More study may be needed, yes, but we don't need apathy about reduced learning.
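The "one-eighth the distance to failure" figure follows from dividing the drop by the gap between the top score and the failing line. A minimal sketch, with a hypothetical helper name and the grade cutoffs described above taken as assumptions:

```python
def share_of_margin(drop_points, top_score=100.0, failing_score=60.0):
    """Fraction of the gap between the top score and the failing line consumed by a drop."""
    margin = top_score - failing_score
    if margin <= 0:
        raise ValueError("top_score must exceed failing_score")
    return drop_points / margin


print(share_of_margin(5))                      # failing line assumed at 60%
print(share_of_margin(5, failing_score=70.0))  # failing line assumed at 70%
```

With failure at 60%, a 5-point drop uses up one-eighth of the margin; with failure at 70%, it uses up one-sixth.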
"c. THIS IS THE BIG ONE: The test was given in May. The school year ends the third week of June. The full year students had received almost 94% of the entire course, and the second semester students had received only 85% of the course. That could mean semestered kids did as well, or better, with less of the course!"Response: This argument is not valid for either of Bateson's studies. The 1990 publication reports that the assessment covered science material from Grades 8, 9, and 10, not just Grade 10. The few weeks of lost class time in June would have no impact on most of the material tested. Since full details of the 1995 study have not yet been released, Mr. Vawter understandably may not be aware of the scope of the assessment used in Bateson's study. Dr. Bateson has clarified the scope of the testing in personal communication, quoted with permission:
The achievement "tests" did not just assess Grade 10. They assessed the total junior secondary curriculum in math and science from Grade 7 to Grade 10. A majority of the items tested skills and knowledge taught in Grades 7 to 9, and only about 1/3 of the items were from the Grade 10 curriculum.Dr. Bateson acknowledges the potential effect of May testing in the released section of his 1995 report. However, the "opportunity to learn" problem associated with May testing of students in Grade 10 will not affect the results for the test items covering Grades 7 to 9. Further, full-year students who took the tests in May before completing Grade 10 still outperformed those students on the block who had no "opportunity to learn" problem because they completed Grade 10 math or science in the first part of the year. Testing in May simply does not account for the observed drop in scores for students on the block.
Any effect of May testing with June graduation fails to explain the significant and consistent drop in performance of first-semester block students, who had no lost days of learning due to May testing, relative to the full-year students who did miss some learning time in June.
"d. What about the bigger difference between first semester and the full year students? (That whole one question!) research tells us that the normal retention rate of concepts over a three month summer break is about 85%. That is accepted as OK. Now, the average score for the first semester students is two questions less than the full year students, which is a retention rate of almost 93%, after a longer time span than a summer. Such a finding might conclude that retention is enhanced in block schedules."Response: In the 1990 study, each of six cognitive domains studied showed a statistically and educationally significant (in my opinion) difference between block and full-year learning, with block scheduling (semestering) consistently outperformed by full-year scheduling. As for the 1995 study, each of the four subsections in both the mathematics assessment and the science assessment shows full-year students outperforming block scheduling students. Can such data really be manipulated to show an advantage for block scheduling? (Try explaining to a concerned parent that a 6% drop in test scores actually represents enhanced education.) Mr. Vawter's argument assumes that the test comprised questions on Grade 10 material only, but in fact about 2/3 of the test items covered material from previous years (Grades 7 to 9 for the 1995 test, and Grades 8 and 9 for the 1986 assessment, as reported in 1990), for which short-term retention should not be an issue. Since first-semester students completed the entire course, unlike full-year students who missed a portion at the end, some of the retention problems for Grade 10 material taken in the first semester were offset by the "opportunity-to-learn" problem affecting the full-year students. Thus, claims of decreased retention loss under the block are probably not warranted without further evidence.
In the real world, many important exams are given near the end of the school year. Whether the cause is related to retention, opportunity to learn, attention-span problems in longer classes, or less total time spent in meaningful learning, a potential decrease in test performance attributable to block scheduling must be a cause for concern.
Incidentally, the alleged 85% retention rate over a summer break appears to be based on a study that Dr. Canady and others cite frequently to support block scheduling, arguing that retention really isn't much of a problem. The reference is G.B. Semb, J.A. Ellis, and J. Araujo, "Long-Term Memory for Knowledge Learned in School," Journal of Educational Psychology, Vol. 85, No. 2, 1993, pp. 305-316. This study involved a child psychology course at the University of Kansas. After completing the course, there were two tests to evaluate retention, one at 4 months and one at 11 months after the course. Based on scores on the multiple choice test, the authors found that students retained about 85% of what they had learned after 4 months and still retained 80% after 11 months. As I discuss in more detail on my main block scheduling page, it may be unwise to draw any conclusions about math and science retention in secondary schools based on a study of a child-psychology course for college students. Further research is needed, and certainly further details are needed from the Bateson study to understand what role retention may play in block scheduling. It would be interesting if decreased academic performance under the block results were partially offset by a less-than-expected degree of retention loss, but that still doesn't turn an apparent academic deficit into a benefit.
"e. Those students who failed the course the first time probably were allowed to take the course again; if so, then they were tested twice. To be fair he should have counted the scores of the F students in the full yeast twice (That won't work exactly, but the idea is clear.) "Response: I'm not sure that this argument is correct. I don't have information about how students with failing grades were handled, and don't see any reason why any reasonable approach would invalidate the consistent findings of decreased academic performance under the block.
"There are other problems, but that is enough."Response: As a measure of the effectiveness of block scheduling, Bateson's 1995 study has some limitations since it was designed to consider many factors other than block scheduling per se. Indeed, it was something of a surprise to Dr. Bateson that block scheduling would jump out of the data as the single most significant factor studied. Between Bateson's 1995 study and the 1990 publication, there are several significant reasons to question the common allegations of academic gain under block scheduling. Such claims are often based on grade inflation rather than performance on objective tests. As a skeptical parent, I am still waiting for serious scientific research to be published which could allay my concerns about decreased learning in math, science, and other fields under block scheduling. Proponents of block scheduling need to do more than attack Bateson and point to popularity polls to justify making dramatic changes in schools that could collide headfirst with the realities of human learning (finite attention span, retention problems, and decreased total learning time in most block classes).
After the above treatment of Bateson's work, Mr. Vawter's memo goes on to discuss uncertainty in effects on SAT and other standardized scores and to affirm the positive things about block scheduling he sees in the literature, including its popularity and acceptance. He also gives the reference for the Univ. of Virginia Block Scheduling page: "http://curry.edschool.virginia.edu/~dhv3v/block/BSintro.html".
Mr. Vawter's rebuttal was probably hurriedly written and I expect it to be refurbished soon by himself, his advisor, or others. Please let me know when the next release is out - and look carefully at the analysis. Until then, I feel justified in remaining skeptical about some of the claims of block scheduling. Indeed, the highly questionable arguments used to attack Bateson's work by the intellectuals at the forefront of the block scheduling movement greatly strengthen my doubts about block scheduling.
Residents for Quality Education: More Information on Block Scheduling - a great site for parents and educators.