Testing is a matter of concern to all teachers, whether we are in the classroom or engaged in syllabus and materials design, administration or research. We know quite well that good tests can improve our teaching and stimulate students’ learning. Although we may not want to become measurement experts, we periodically have to evaluate students’ performance and prepare reports on their progress. That is why J. B. Heaton (1988) said: “Both testing and teaching are so closely interrelated that it is virtually impossible to work in either field without being constantly concerned with the other” (p.5).
No one now doubts the role and importance of tests in the teaching process. Tests are of various types and serve different purposes in teaching and learning. However, it cannot be taken for granted that every test in use actually helps teachers and students evaluate and improve their teaching and learning. The question of how to design a good test is therefore a constant concern for teachers, since a test is used as a mirror reflecting the real ability and acquisition of their students, and since “a good test can be used as a valuable teaching device” (J. B. Heaton, 1988, p.7).
We, the teachers of Vietnam University of Commerce, are no exception. We have tried our best to produce better achievement tests by revising them every school year, but the tests have seldom been satisfactory and we do not seem to have reached the core of the problem. From the author’s point of view, there are three reasons for this.

Evaluating an achievement test for credit four to non-Majors at Vietnam University of Commerce and some suggestions for improvement

CHAPTER 4: FINDINGS AND DISCUSSIONS
This chapter presents the findings from both the quantitative and the qualitative analyses. Firstly, it gives a statistical description of the whole test. Secondly, the reliability of the test is evaluated. Next, assessments of the face validity of the test are presented. Finally, implications for improving the test are discussed.
4.1. Test score analysis and interpretation
(See appendix 3 for students’ test scores)
The frequency distribution
Figure 4.1: Histogram of score distribution
It can be seen from the histogram that the scores are distributed quite unevenly: more students got marks from 4 to 6.5 than any other marks.
The central tendency
The mean = the average score:
$\bar{X} = \frac{\sum X}{N} = 5.22$
($\bar{X}$ = the arithmetic mean, $\sum X$ = the sum of all scores, N = the number of persons in the sample)
The mode = the score gained by the largest number of students
Looking at the histogram above, we can see that, in this case, the mode is 5 (as 13 students got mark 5).
The median = the midpoint of the interval between the two centermost scores. According to Henning (1987, p.39), this can be determined by summing the two centermost scores and dividing the result by two.
In this case, the median is calculated as below:
Median =
The dispersion
The range = the difference between the top and the bottom scores
The range = 9 – 2 = 7
Standard deviation = the average amount that each student’s score deviates from the mean (see appendix 3 for calculation details)
$s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}} = 1.436$ (Henning, 1987, p.40)
(X = any observed score in the sample; $\bar{X}$ = the mean of all scores; N = the number of scores in the sample)
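The definitions above can be sketched in a few lines of Python. This is only an illustration: the marks in `sample_scores` are hypothetical and are not the scores in appendix 3, the function name `describe` is invented for the example, and the standard deviation is assumed to use N − 1 in the denominator.

```python
from collections import Counter
from math import sqrt

def describe(scores):
    """Return the mean, mode, median, range and standard deviation of a score list."""
    n = len(scores)
    mean = sum(scores) / n
    # Mode: the score gained by the largest number of students.
    mode = Counter(scores).most_common(1)[0][0]
    # Median: midpoint of the two centermost scores (the middle score if n is odd).
    ordered = sorted(scores)
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    # Range: difference between the top and the bottom scores.
    score_range = max(scores) - min(scores)
    # Standard deviation, assuming N - 1 in the denominator.
    sd = sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
    return {"mean": mean, "mode": mode, "median": median,
            "range": score_range, "sd": sd}

# Hypothetical marks for ten students (NOT the data in appendix 3).
sample_scores = [2, 4, 4.5, 5, 5, 5, 5.5, 6, 6.5, 9]
stats = describe(sample_scores)
print(stats)
```

With an even number of scores the median is the average of the two centermost values, which matches the procedure quoted from Henning.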
4.2. Evaluation of reliability
4.2.1. Test item analysis and interpretation
The test scores are analyzed employing the MicroCAT Testing System – ITEMAN (see Appendix 4 for details of the analysis). The table below summarizes and categorizes the properties of all test items.
| Item property | Index | Interpretation | Proportion (%) |
|---|---|---|---|
| Difficulty (F.V) | 0.00 – 0.33 | difficult | 20.0% |
| | 0.33 – 0.67 | acceptable | 67.5% |
| | 0.68 – 1.00 | easy | 12.5% |
| Discrimination | – 0.3 | very poor | 15.0% |
| | 0.3 – 0.66 | low | 50.0% |
| | 0.67 – 1.00 | acceptable | 35.0% |
| Point Biserials | – 0.24 | very poor | 2.5% |
| | 0.25 – 1.00 | acceptable | 97.5% |

Table 4.1: Item properties of the test
The table presents the three main properties of a test item: difficulty (or facility value), discrimination, and point biserial correlation.
Firstly, concerning the facility value or difficulty level, 20% of the test items can be considered difficult, as only 0 to 33% of students got them right. This proportion is quite high for an achievement test, and it helps explain why most students got marks from 4 to 6 while only three of them got marks of eight or nine and none got mark ten. Most of the remaining items (67.5%) are acceptable, with an F.V of 0.33 to 0.67. Few items (12.5%) appear to be easy.
Next, only 35% of the items are acceptable in terms of discriminability, whereas 50% are low and 15% have very poor discriminability. Thus, the discrimination of the test as a whole is not very good.
Finally, the item-test correlation indices give more reason for optimism: 97.5% of the test items are acceptable and only 2.5% have poor correlation indices.
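The three item properties discussed above can be sketched as follows. This is not the ITEMAN procedure, whose internal calculations are not described here; it is a minimal illustration under stated assumptions. The discrimination index is taken to be the upper-lower group index using the top and bottom thirds of students, the function names are invented for the example, and the responses are hypothetical. The classification bands are those of Table 4.1.

```python
from math import sqrt

def facility_value(item):
    """Proportion of students who answered the item correctly (1 = right, 0 = wrong)."""
    return sum(item) / len(item)

def discrimination_index(item, totals, fraction=1 / 3):
    """Upper-lower index: (correct in top group - correct in bottom group) / group size."""
    order = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
    size = max(1, round(len(totals) * fraction))
    upper, lower = order[:size], order[-size:]
    return (sum(item[i] for i in upper) - sum(item[i] for i in lower)) / size

def point_biserial(item, totals):
    """Correlation between a right/wrong item and the total test score."""
    n = len(item)
    p = sum(item) / n
    mean_all = sum(totals) / n
    sd_all = sqrt(sum((t - mean_all) ** 2 for t in totals) / n)
    if p in (0, 1) or sd_all == 0:
        return 0.0  # no variation, so the correlation is undefined; report 0
    mean_correct = sum(t for t, x in zip(totals, item) if x) / sum(item)
    return (mean_correct - mean_all) / sd_all * sqrt(p / (1 - p))

def classify_difficulty(fv):
    """Difficulty bands taken from Table 4.1."""
    return "difficult" if fv < 0.33 else "acceptable" if fv <= 0.67 else "easy"

def classify_discrimination(d):
    """Discrimination bands taken from Table 4.1."""
    return "very poor" if d < 0.3 else "low" if d < 0.67 else "acceptable"

# Hypothetical data: total scores of six students and their answers to one item.
totals = [9, 8, 7, 5, 4, 2]
item = [1, 1, 1, 0, 1, 0]

fv = facility_value(item)
d = discrimination_index(item, totals)
rpb = point_biserial(item, totals)
print(classify_difficulty(fv), classify_discrimination(d), round(rpb, 2))
```

An item can therefore be acceptable in difficulty and in point biserial correlation while still falling in the "low" discrimination band, which is the pattern reported for half of the items in this test.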
4.2.2. Intercorrelation scales between the five subtests
| | Subtest 1 | Subtest 2 | Subtest 3 | Subtest 4 | Subtest 5 |
|---|---|---|---|---|---|
| Subtest 1 | 1.000 | 0.236 | 0.375 | 0.547 | 0.018 |
| Subtest 2 | 0.236 | 1.000 | 0.276 | 0.138 | -0.003 |
| Subtest 3 | 0.375 | 0.276 | 1.000 | 0.567 | -0.023 |
| Subtest 4 | 0.547 | 0.138 | 0.567 | 1.000 | 0.015 |
| Subtest 5 | 0.018 | -0.003 | -0.023 | 0.015 | 1.000 |

Table 4.2: Intercorrelation scales of the subtests
It is clear from the table that the intercorrelations are generally quite low. The highest is between part three and part four of the test, at 0.567. Next comes the relation between part one and part four, at 0.547. Part five, in general, does not correlate well with the other parts. There is almost no correlation between part five and part one, or between part five and part four, since the correlation indices are very close to 0.0 (0.018 and 0.015 respectively). Moreover, this part has two negative values (-0.023 with part three and -0.003 with part two), which means that it relates in the reverse direction to these two parts, although both values are so close to zero that the relationship is negligible. The other correlations are also not very high, ranging from 0.138 to 0.375. In general, the intercorrelations between the subtests are not very satisfactory, and the subtests might not measure the same trait. This certainly affects the reliability of the whole test.
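The reading of Table 4.2 can be checked mechanically. The sketch below copies the coefficients from the table, lists each pair of subtests once, and picks out the strongest pair and the values involving part five. The variable names are illustrative only, and the average coefficient it computes is an additional summary that is not reported in the study.

```python
from itertools import combinations

# Correlation coefficients copied from Table 4.2 (rows and columns: subtests 1 to 5).
R = [
    [1.000, 0.236, 0.375, 0.547, 0.018],
    [0.236, 1.000, 0.276, 0.138, -0.003],
    [0.375, 0.276, 1.000, 0.567, -0.023],
    [0.547, 0.138, 0.567, 1.000, 0.015],
    [0.018, -0.003, -0.023, 0.015, 1.000],
]

# Each unordered pair of subtests appears once above the diagonal.
pairs = {(i + 1, j + 1): R[i][j] for i, j in combinations(range(5), 2)}

strongest = max(pairs, key=pairs.get)                     # pair with the highest coefficient
part5 = [r for pair, r in pairs.items() if 5 in pair]     # all coefficients involving part five
mean_r = sum(pairs.values()) / len(pairs)                 # average off-diagonal coefficient

print("strongest pair:", strongest, pairs[strongest])
print("part five coefficients:", part5)
print("mean intercorrelation:", round(mean_r, 3))
```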
4.2.3. Estimating internal consistency reliability
In an attempt to estimate the reliability of the test, the author employed both Kuder-Richardson Formula 21 and Cronbach’s alpha formula.
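For the first estimation, KR 21 was chosen. The formula is as follows:
$r_{tt} = \frac{n}{n - 1}\left(1 - \frac{\bar{X}(n - \bar{X})}{n\,s_t^2}\right)$
where $r_{tt}$ = the KR 21 reliability estimate; n = the number of items in the test;
$\bar{X}$ = the mean of scores on the test; $s_t^2$ = the variance of test scores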
Applying this formula, we have:
According to Lado (1961), good vocabulary, structure and reading tests are usually in the .90 to .99 range while auditory comprehension tests and oral production tests may be in the 0.70 to .89 range. He also adds that a reliability coefficient of .85 might be considered high for an oral production test but low for a reading test (cited in Hughes, 1989, p.32).
For a vocabulary, structure and reading test such as the one in this study, a reliability coefficient of 0.52 is clearly not good. This prompted the author to re-estimate the reliability coefficient using another formula, Cronbach’s alpha:
$\alpha = \frac{k}{k - 1}\left(1 - \frac{\sum s_i^2}{s_t^2}\right)$
where k = the number of items on the test; $s_t^2$ = the variance of the test scores;
$\sum s_i^2$ = the sum of the variances of the different parts of the test
Applying the above formula, we have:
The reliability index is quite different from the KR 21 result. The reason is that, according to Henning (1987, p.84), the KR 21 method is less accurate than Cronbach’s alpha, as it slightly underestimates the actual reliability. Whichever formula is more accurate, the reliability of the test is, in general, rather low and far from the acceptable range of .90 to .99.
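Both estimates can be sketched in Python to show why they may disagree. The code below is an illustration only: the response matrix is hypothetical, the function names are invented for the example, population variances are assumed throughout, and each item is treated as one "part" of the test in the alpha formula.

```python
from statistics import mean, pvariance

def kr21(totals, n_items):
    """KR 21: needs only the number of items, the mean and the variance of total scores."""
    m, var = mean(totals), pvariance(totals)
    return n_items / (n_items - 1) * (1 - m * (n_items - m) / (n_items * var))

def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores[s][i] is the score of student s on part i."""
    k = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    # Sum of the variances of the separate parts of the test.
    part_variances = sum(pvariance([row[i] for row in item_scores]) for i in range(k))
    return k / (k - 1) * (1 - part_variances / pvariance(totals))

# Hypothetical right/wrong responses of six students to five items (not the study's data).
responses = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
totals = [sum(row) for row in responses]

r_kr21 = kr21(totals, n_items=5)
r_alpha = cronbach_alpha(responses)
print(round(r_kr21, 3), round(r_alpha, 3))
```

KR 21 treats all items as equally difficult, so when item difficulties differ, as they do in this hypothetical matrix, it comes out lower than alpha. This is consistent with the underestimation that Henning describes.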
4.2.4. The standard error of measurement
The formula for calculating the standard error of measurement is as follows:
$SEM = Sd_x\sqrt{1 - r_{tt}}$ (Hughes, 1989, p.159)
$Sd_x$ = standard deviation of the test; $r_{tt}$ = reliability of the test
The standard error of measurement turns out to be 1.20. We can be 95% certain that an individual’s true score will be within two standard errors of measurement of the score she or he actually obtains on the test. In this case, two standard errors of measurement amount to 2.40 (2 x 1.20). This means that for a candidate who scores 4.5 on the test, we can be 95% certain that the true score lies somewhere between 2.1 and 6.9. This is obviously valuable information if significant decisions are to be taken on the basis of such scores.
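The interval for the candidate who scores 4.5 can be reproduced with a short calculation. The function name `true_score_band` is invented for this illustration; the figures are the ones given in the paragraph above.

```python
def true_score_band(observed, sem_value, n_sems=2):
    """Interval of n_sems standard errors of measurement on either side of the observed score."""
    margin = n_sems * sem_value
    return observed - margin, observed + margin

# Observed score 4.5 and SEM 1.20, as reported in this section.
low, high = true_score_band(4.5, 1.20)
print(round(low, 1), round(high, 1))  # 2.1 6.9
```

On a ten-point scale this band covers almost half of the possible marks, which illustrates how much uncertainty the low reliability introduces.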
4.2.5. Evaluation of the subtests
The test is made up of several parts, and it is useful to know how each part contributes to the whole. The mean, variance, standard deviation, median, an estimate of reliability (alpha) and the standard error of measurement were therefore also calculated for each subtest. The figures are presented in the following table.
| Features | Subtest 1 | Subtest 2 | Subtest 3 | Subtest 4 | Subtest 5 |
|---|---|---|---|---|---|
| Mean | 5.067 | 2.322 | 5.933 | 3.516 | 2.967 |
| Variance | 4.707 | 4.330 | 4.929 | 5.265 | 4.166 |
| Standard deviation | 2.169 | 2.081 | 2.220 | 2.294 | 2.041 |
| Median | 5.000 | 3.000 | 6.000 | 3.000 | 3.000 |
| Alpha | 0.618 | 0.908 | 0.603 | 0.714 | 0.902 |
| SEM | 1.340 | 0.632 | 1.399 | 1.228 | 0.640 |

Table 4.3: Statistical description of subtests
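The rows of Table 4.3 are related to one another: the standard deviation is the square root of the variance, and the SEM follows from the standard deviation and alpha through the formula in section 4.2.4. The sketch below recomputes both from the tabulated figures. It is only a consistency check, with illustrative variable names, and small differences in the third decimal place are expected because the table values are rounded.

```python
from math import sqrt

# (variance, standard deviation, alpha, SEM) for subtests 1 to 5, copied from Table 4.3.
table_4_3 = [
    (4.707, 2.169, 0.618, 1.340),
    (4.330, 2.081, 0.908, 0.632),
    (4.929, 2.220, 0.603, 1.399),
    (5.265, 2.294, 0.714, 1.228),
    (4.166, 2.041, 0.902, 0.640),
]

checks = []
for number, (variance, sd, alpha, sem) in enumerate(table_4_3, start=1):
    sd_from_variance = sqrt(variance)       # should reproduce the SD row
    sem_from_alpha = sd * sqrt(1 - alpha)   # should reproduce the SEM row
    checks.append((abs(sd_from_variance - sd), abs(sem_from_alpha - sem)))
    print(number, round(sd_from_variance, 3), round(sem_from_alpha, 3))

# Largest discrepancy between the recomputed and the tabulated values.
largest_gap = max(max(pair) for pair in checks)
print(round(largest_gap, 4))
```

The check also makes the pattern in the table easier to see: subtests 2 and 5 have the highest alpha and therefore the smallest SEM, even though their standard deviations are similar to those of the other subtests.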
4.3. Evaluation of validity
There are many kinds of validity to consider when validating a test; however, due to constraints of time and length, only one essential type, face validity, is investigated within the scope of this study.
Face validity
As mentioned in part 2.3.2, face validity is very important for teachers. Therefore, semi-structured interviews were conducted to gather teachers’ reflections on the test. The first part of the interview, which consists of five questions, seeks to clarify how familiar the format of each part of the test is to students. Answers to these questions are summarized in Table 4.4 below:
| No | Questions | Yes | No |
|---|---|---|---|
| 1 | Are your students familiar with doing multiple choice tasks? | 100% | 0% |
| 2 | Are your students familiar with doing True/False task? | 100% | 0% |
| 3 | Is the task of sentence rewriting familiar to your students? | 100% | 0% |
| 4 | Are your students familiar with doing word form task? | 100% | 0% |
| 5 | Are your students familiar with doing sentence rearranging task? | 0% | 100% |

Table 4.4: Familiarity of the task types to students
It is obvious from the table that all the interviewed teachers share the same view of how familiar the task types are to their students. 100% of the teachers said that their students were familiar with all the task types in the test except for the final one, sentence rearrangement.
However, the explanations for these answers differ. For the first question, Mr. Hop and Ms Linh simply said that their students had done this type of exercise in some units of the teaching material. Ms K. Lien, Ms X. Phuong and Ms Mai added that “students did not have much time to practise this type of exercise in the class”. Ms Ph. Lien, however, thought that this task type was very popular and that students had therefore practised it a lot, not only in this course.
In response to the second question, all of the teachers agreed that this task type was very familiar to their students.
Although the answers to question three were all “Yes”, there were different opinions about the task of rewriting sentences. Ms Mai stated: “There was not much time to instruct students to practise this kind of exercise. We just introduced the structures”. Sharing the same view, Ms Ph. Lien talked more about the results: “Due to limited time and little practice, students did not master all the structures so they often had low marks for this part.” Meanwhile, Ms Linh and Ms X. Phuong had a different viewpoint. “Students had chances to do sentence rewriting but doing this task in multiple choice form is totally new to them”, they said.
When asked about word form exercises, all teachers answered “yes, very familiar to students”, and Ms Mai emphasized that this type of exercise was strongly recommended as it was very suitable for their students and the course book.
Opinions differed on the sentence rearranging task asked about in question five. Ms K. Lien said that students had not dealt with this kind of exercise in that semester but had done it in the previous one, so the teachers still decided to put it in the final achievement test. “Students only did exercises relating to letters or in letter form but the letters were used for different purposes such as mistake correcting, not rearranging to make a correct letter”, said Ms Linh.
The second part consists of six questions, from 6 to 11, which ask about the familiarity of the topics and contents of the tasks. Responses to these questions are presented in Table 4.5.
| No | Questions | Yes | No |
|---|---|---|---|
| 6 | Are the topic of section 1 and section 2 familiar to your students | 100% | 0% |
| 7 | Have your students learnt all grammatical structures in section 3? If not, what structures haven’t they learnt? | 100% | 0% |
| 8 | Are all the target words in part 4 familiar to your students? If not, what words are new to them? | 100% | 0% |
| 9 | Have your students learnt all parts of speech used in part 4? If not, what are they? | 83.3% | 16.7% (politic, care) |
| 10 | Are your students familiar with business letter form? | 33.3% | 66.7% |
| 11 | Is the topic of the letter in section 5 familiar to your students | 50% | 50% |

Table 4.5: Familiarity of topics and task’s contents to students
As can be seen from the table, 100% of the teachers said that the topics of part one and part two were familiar to their students, as both related to the business field and, in particular, to topics that the students had learnt in the teaching material: E-commerce and Company profile.
The answers to question seven were all “Yes”; however, the scores for this part were quite low. The teachers explained this in different ways. “As this task type is intended to put in the test, we added structures to be tested in the specifications for revision. Thus, students only had chance to practise all the structures in the revision period”, said Ms Linh. Ms Mai had the same opinion, saying that all students were given the list of all the structures to revise and that these were basic structures which students must know. In Ms Ph. Lien’s words:
Though all structures had been revised, usually the students could not remember all of them and they did not have much time on this part. Part of the reason lies in the fact that they were not very hard working.
Word-form exercises (questions eight and nine), as recommended by Ms Mai, were essential for this test, but their results were not good. The reason might be that not all teachers followed the same procedure in teaching this aspect. Ms Ph. Lien affirmed that “all the new words and their forms are presented by teachers in the classroom. However, in general students can only remember the form they meet in the course book”. Other teachers admitted that “all words in the material are already introduced. However, the expansion or introduction of their forms varies among teachers.”
The most criticized task was task five. Ms K. Lien said “this task is not very familiar. Students had dealt with it in the previous semester. However, they did not meet it in this semester”. Ms Mai confided to the author of this study “The teachers herself had difficulty in dealing with this exercise.”
Part three is devoted to estimating the difficulty of the subtests. The results obtained are shown in Table 4.6 below.
| No | Subtests | Difficult | Appropriate | Easy |
|---|---|---|---|---|
| 12 | Section 1 | 33.3% | 66.7% | 0% |
| 13 | Section 2 | 16.7% | 66.6% | 16.7% |
| 14 | Section 3 | 16.7% | 66.6% | 16.7% |
| 15 | Section 4 | 33.3% | 50% | 16.7% |
| 16 | Section 5 | 33.3% | 66.7% | 0% |

Table 4.6: Difficulty level of the subtests
Looking at the table, we can see that 66.7% of the teachers thought that part one was appropriate whereas 33.3% considered it difficult. To explain the difficulty, Ms Linh said “there are many synonyms which might cause misunderstanding and the clues given are not clear”. Ms Mai gave another reason: “There are many ESP words and this task requires students to remember structures and lexical meanings of words”.
In evaluating part two of the test, 66.6% answered “appropriate”, 16.7% chose “difficult” and the same percentage chose “easy”. There were no explanations except for Ms Ph. Lien’s. She simply stated that the reading passage in part two had many new words.
The figures for part three are the same as those for part two. This part was considered appropriate because “these are basic structures”, said Ms Mai, and because “students can do all the items if they are hard-working”, Ms Ph. Lien affirmed.
As for part four, 50% of the teachers said that it was neither easy nor difficult. They all explained “this task relies much on the self-study process of students. If they study hard, this one is not difficult for them”. However, Ms Ph. Lien thought that it was quite difficult since, in fact, few students could give the correct word forms.
The final part, part five, does not seem to be very easy in the teachers’ eyes. 66.7% of the teachers thought it was quite appropriate while 33.3% said it was difficult. To explain, Ms Linh said that “this section is quite difficult. There are no conjunctions, which requires students to understand the meaning of every sentence”. Ms Mai said:
I don’t like this task type as we should not test the skill of letter writing in form of rearranging sentences. It is not difficult in terms of vocabulary but the logic is not easy. It is difficult to decide which sentence comes first, which comes later.
Ms Ph. Lien also commented: “it is not difficult, not easy and not a suitable task”. Mr Hop added “students did not have much time to practise this task type. To perform this task, they must have deductive skills and be able to find out the key words.”
The final part of the interviews consists of some general questions. Question 17 asks whether the teachers think that the test measures what their students have learnt in the course. Most of the teachers said “Yes, but not enough”. Ms X. Phuong stated that “due to time limitation and availability of facilities, we can just test major parts, not all that students learned such as listening skill, speaking skill, etc.” Sharing the same view, Ms Mai added “this test focused on vocabulary and grammar. However, listening and speaking skills are not tested whereas we test writing skill which students do not learn.” To explain why there is no listening comprehension in the test, Ms K. Lien said “One reason is objective conditions. Another is that students did practise listening skills during the course, but very little. Therefore, if they have to take a listening test, it will be too difficult for them.”
Finally, question 18 asks whether the time allowance is sufficient for students. Most teachers said that the time was enough for students to complete all the tasks in the test. However, Ms X. Phuong and Ms Linh held two opposing views. While Ms Linh said that “time is not really sufficient. Students need 10 more minutes to do all the parts of the test”, Ms X. Phuong argued “there is too much time. We should cut off five to 10 minutes to eliminate chatting and exchanging information in the redundant time.”
Suggestions by teachers to improve the test’s face validity
After evaluating how well the test’s content suits the syllabus, all teachers were asked to give some suggestions for improving the quality of the test as an instrument of measurement. Their many valuable contributions are presented below.
Firstly, Ms K. Lien made a suggestion about part one of the test. She proposed replacing this part with a reading comprehension exercise with about five follow-up questions, as there was already one part checking students’ vocabulary (part four). Ms Mai had the same view: “we had better not focus too much on vocabulary”, she said, “instead it is necessary to test other skills like listening skill”. To make this part easier for students, Ms Linh thought that students should be given more clues to guess the answers.
Next, Ms Mai suggested some changes to the second part of the test. In her opinion, five true/false questions for the reading comprehension passage were not enough. “We should add extra questions and diversify task types”, she said.
For the third part of the test, Ms K. Lien recommended: “To test students’ ability to write correct structures, asking students to rewrite the whole sentence instead of choosing the correct answers will test students more exactly”.
Part five was the most problematic, since all teachers suggested replacing it with another task type. Mr Hop, Ms Ph. Lien, Ms K. Lien and Ms Linh shared the idea of using a small listening task instead. They all thought that this would make the test more consistent with the syllabus design. “We can make the test easier by replacing part five with a listening task in which we can ask students to listen to the tape and then rearrange sentences in the correct order”, said Mr Hop. However, being aware of the physical conditions and general features of VUC, Ms Ph. Lien proposed an alternative. “Since there are not enough cassettes and the number of students in each class is large, the teachers can dictate or read aloud in front of the whole class but it must be ensured that teachers dictate loudly, clearly and at the same speed”, she said. Mean