VIETNAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF FOREIGN LANGUAGES
DEPARTMENT OF POSTGRADUATE STUDIES
------------------------------
GIÁP THỊ AN
EVALUATING A FINAL ENGLISH READING TEST FOR THE STUDENTS AT HANOI, TECHNICAL AND PROFESSIONAL SKILLS TRAINING SCHOOL – HANOI CONSTRUCTION CORPORATION
(ĐÁNH GIÁ BÀI KIỂM TRA HẾT MÔN TIẾNG ANH CHO HỌC SINH TRƯỜNG TRUNG HỌC KỸ THUẬT VÀ NGHIỆP VỤ HÀ NỘI – TỔNG CÔNG TY XÂY DỰNG HÀ NỘI)
M.A. minor thesis
Field: Methodology
Code: 60.14.10
HANOI – 2008
VIETNAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF FOREIGN LANGUAGES
DEPARTMENT OF POSTGRADUATE STUDIES
--------------------------------
GIÁP THỊ AN
EVALUATING A FINAL ENGLISH READING TEST FOR THE STUDENTS AT HANOI, TECHNICAL AND PROFESSIONAL SKILLS TRAINING SCHOOL – HANOI CONSTRUCTION CORPORATION
(ĐÁNH GIÁ BÀI KIỂM TRA HẾT MÔN TIẾNG ANH CHO HỌC SINH TRƯỜNG TRUNG HỌC KỸ THUẬT VÀ NGHIỆP VỤ HÀ NỘI – TỔNG CÔNG TY XÂY DỰNG HÀ NỘI)
M.A. minor thesis
Field: Methodology
Code: 60.14.10
Supervisor: Phùng Hà Thanh, M.A.
HANOI – 2008
I certify that this study is my own work. The thesis, wholly or partially, has not been submitted for any higher degree.
ACKNOWLEDGEMENTS
I would like to express my deepest thanks to my supervisor Ms. Phùng Hà Thanh, M.A. for the invaluable support, guidance, and timely encouragement she gave me while I was doing this research. I am truly grateful to her for her advice and suggestions right from the beginning when this study was only in its formative stage.
I would like to send my sincere thanks to the teachers at the English Department, HATECHS, who took part in the discussion and gave insightful comments and suggestions on this paper.
My special thanks also go to the students in groups KT1, KT2, KT3 and KT4-K06 for their participation as the subjects of the study. Without them, this project could not have been so successful.
I owe a great debt of gratitude to my parents, my sisters, my husband and especially my son, who have constantly inspired and encouraged me to complete this research.
ABSTRACT
Test evaluation is a complex matter that has received much attention from researchers ever since the importance of language tests in assessing students' achievement was recognized. When evaluating a test, the evaluator should concentrate on the criteria of a good test, such as the mean, the difficulty level, discrimination, reliability and validity.
In this study, the researcher chose to evaluate the final reading test for students at HATECHS, with the aim of estimating its reliability and checking its validity. This is a new test that follows the PET format and was used in the school year 2006 – 2007 as a procedure for assessing the achievement of students at HATECHS. From the interpretation of the score data, the researcher found that the final reading test is reliable in terms of internal consistency. Its face and construct validity were also checked, and the test is concluded to be valid on the basis of the calculated validity coefficients. However, the study has limitations, which point to the researcher's directions for future studies.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS ii
ABSTRACT iii
TABLE OF CONTENTS iv
LIST OF ABBREVIATIONS vi
LIST OF TABLES vii
GLOSSARY OF TERMS viii
PART ONE: INTRODUCTION 1
1. Rationale 2
2. Objectives of the study 3
3. Scope of the study 3
4. Methodology of the study 3
5. The organization of the study 4
PART TWO: DEVELOPMENT 5
CHAPTER 1: LITERATURE REVIEW 6
1.1. Language testing 6
1.1.1. Approaches to language testing 6
1.1.1.1. The essay translation approach 6
1.1.1.2. The structuralist approach 6
1.1.1.3. The integrative approach 7
1.1.1.4. The communicative approach 8
1.1.2. Classifications of Language Tests 8
1.2. Testing reading 10
1.3. Criteria in evaluating a test 13
1.3.1. The mean 13
1.3.2. The difficulty level 13
1.3.3. Discrimination 14
1.3.4. Reliability 15
1.3.5. Validity 17
CHAPTER 2: METHODOLOGY AND RESULTS 19
2.1. Research questions 19
2.2. The participants 19
2.3. Instrumentation and Data collection 20
2.3.1 Course objectives; Syllabus and Materials for teaching 20
2.3.1.1. Course objectives 20
2.3.1.2. Syllabus 20
2.3.1.3. Assessment instruments 21
2.3.2. Data collection 23
2.3.3. Data analysis and Results 23
2.3.3.1. Test score analysis 24
2.3.3.2. The reliability of the test 25
2.3.3.3. The test validity 28
2.3.3.4. Summary of the results of the study 32
PART THREE: CONCLUSION 33
1. Conclusion 34
2. Limitations 34
3. Future directions 35
REFERENCES 36
APPENDIX 1: THE FINAL READING TEST 38
APPENDIX 2: THE PET 44
APPENDIX 3: QUESTIONS FOR DISCUSSION 50
LIST OF ABBREVIATIONS
HATECHS: Hanoi Technical and Professional Skills Training School
M: Mean
N: Number of students
P: Part
PET: Preliminary English Test
r: Validity coefficient
Rxx: Reliability Coefficient
SD: Standard Deviation
Ss: Students
LIST OF TABLES
Table 1: Types of language tests 9
Table 2: Types of tests 10
Table 3: Types of reliability 16
Table 4: The syllabus for teaching English – Semester 2 21
Table 5: Components of the PET reading test 22
Table 6: Components of the final reading test 22
Table 7: The Raw Scores of the final reading test and the PET 24
Table 8: The reliability coefficients 27
Table 9: The validity coefficients 31
GLOSSARY OF TERMS
Discrimination is the spread of scores produced by a test, or the extent to which a test separates students from one another on a range of scores from high to low. The term is also used to describe the extent to which an individual multiple-choice item separates the students who do well on the test as a whole from those who do badly.
Difficulty is the extent to which a test or test item is within the ability range of a particular candidate or group of candidates.
Mean is a descriptive statistic, measuring central tendency. The mean is calculated by dividing the sum of a set of scores by the number of scores.
Median is a descriptive statistic, measuring central tendency: the middle score or value in a set.
Marker (also scorer) is the judge or observer who operates a rating scale in the measurement of oral and written proficiency. The reliability of markers depends in part on the quality of their training, the purpose of which is to ensure a high degree of comparability, both inter- and intra-rater.
Mode is a descriptive statistic, measuring central tendency: the most frequently occurring score or score interval in a distribution.
Raw scores are test data in their original format, not yet transformed statistically in any way (e.g. by conversion into percentages, or by adjusting for the difficulty level of the task or any other contextual factors).
Reading comprehension test is a measure of understanding of text.
Reliability is consistency: the extent to which the scores resulting from a test are similar wherever and whenever it is taken, and whoever marks it.
Score is the numerical index indicating a candidate’s overall performance on some measure. The measure may be based on ratings, judgements, grades, number of test items correct.
Standard deviation is a property of the normal curve. Mathematically, it is the square root of the variance of a test.
Test analysis is the analysis of data from test trials during the test development process, carried out to evaluate individual items as well as the reliability and validity of the test as a whole. Test analysis is also carried out following test administration in order to allow the reporting of results, and it may also be conducted for research purposes.
Test item is the part of an objective test that sets the problem to be answered by the student: usually either in multiple-choice form, as a statement followed by several choices of which one is the right answer and the rest are not, or as a true/false statement which the student must judge to be either right or wrong.
Test taker is a term used to refer to any person undertaking a test or examination. Other terms commonly used in language testing are candidate, examinee, testee.
Test-retest is the simplest method of computing test reliability; it involves administering the same test to the same group of subjects on two occasions. The time between administrations is normally limited to no more than two weeks in order to minimize the effect of learning upon true scores.
Validity is the extent to which a test measures what it is intended to measure. Test validity consists of content, face and construct validity.
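Several of the statistics defined in this glossary (mean, median, mode, standard deviation, and item difficulty) can be illustrated with a short Python sketch. The scores and item responses below are invented purely for illustration; they are not data from this study.

```python
import statistics

# Invented total scores for ten test takers (illustration only)
scores = [12, 15, 15, 18, 20, 21, 21, 21, 24, 27]

mean = statistics.mean(scores)      # sum of the scores divided by their number
median = statistics.median(scores)  # middle value of the ordered set
mode = statistics.mode(scores)      # most frequently occurring score
sd = statistics.pstdev(scores)      # square root of the variance

# Item difficulty (facility value): the proportion of candidates answering
# a given item correctly (1 = correct, 0 = incorrect for each candidate)
item = [1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
difficulty = sum(item) / len(item)

print(mean, median, mode, round(sd, 2), difficulty)
```

Note that the mode (21) can differ from both the mean and the median in a skewed distribution, which is why all three measures of central tendency are reported in test analysis.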
PART ONE: INTRODUCTION
1. Rationale
Testing is necessary in the process of language teaching and learning; therefore, it has gained much attention from teachers and learners. Through testing, teachers can evaluate learners' achievements in a certain learning period, assess their own teaching methods, and provide input into the process of language teaching (Bachman, 1990, p. 3). Thanks to testing, learners can also self-assess their English ability to examine whether their level of English meets the demands of employment or of studying abroad. The important role of tests makes test evaluation necessary: by evaluating tests, test designers can arrive at the best test papers for assessing their students.
Despite the importance of testing, in many schools tests are designed without following any rigorous principles or procedures; thus, their validity and reliability are open to doubt. At HATECHS, the final English course tests had been designed by teachers of the English Department at the end of each course, and some tests were used repeatedly with no adjustment. In the school year 2006 – 2007, there was a change in test design: final tests were designed according to the PET (Preliminary English Test) procedure. The PET comes from the Cambridge testing system for English for Speakers of Other Languages. Based on the PET, a new final reading test was developed and used as an instrument to assess students' achievement in reading skills. The test was delivered to students at the end of the school year 2006 – 2007, but it was never evaluated. To decide whether the test is reliable and valid, a serious study is needed.
The context at HATECHS has inspired the author, a teacher of English, to take this opportunity to undertake the study entitled “Evaluating a Final English Reading Test for the Students at Hanoi Technical and Professional Skills Training School”, with the aim of checking the validity and reliability of the test. The author was also eager to find suggestions that would help test designers produce better and more effective tests for their students.
2. Objectives of the study
The study is aimed at evaluating the final reading test for the students at Hanoi Technical and Professional Skills Training School, who are non-majors of English. The results of the test will be analyzed, evaluated and interpreted with the following aims:
- to calculate the internal consistency reliability of the test
- to check the face and construct validity of the test
3. Scope of the study
Test evaluation is a broad concept, and there are many criteria for evaluating a test. Normally, four major criteria are considered: item difficulty, discrimination, reliability and validity. However, since item difficulty and discrimination are difficult to evaluate and interpret, within this study the researcher focuses on the reliability and the validity of the test as a whole.
At HATECHS, there is a reading achievement test at the end of Semester 1, and a final reading test at the end of the first year, after 120 periods of studying English. The researcher chose the final test in order to evaluate its internal consistency reliability and its face and construct validity.
4. Methodology of the study
In this study, the author evaluated the test by adopting both qualitative and quantitative methods. The research is quantitative in the sense that data were collected through an analysis of the scores of 30 randomly selected papers of students at the Faculty of Finance and Accounting. The Kuder-Richardson 21 formula was used to calculate the internal consistency reliability, and the Pearson correlation coefficient formula was adopted to calculate the validity coefficient. The research is qualitative in its use of a semi-structured interview with open questions, which were delivered to teachers at HATECHS at the annual meeting on the teaching syllabus and methodology. The conclusions of that discussion were used as the qualitative data of the research.
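As a sketch of the two calculations named above: Kuder-Richardson 21 estimates internal consistency from the number of items k, the mean M and the variance s² of the total scores, and Pearson's product-moment formula yields the correlation used as the validity coefficient. The score lists below are invented for illustration only; they are not the 30 papers analysed in this study.

```python
import statistics

def kr21(num_items, scores):
    """Kuder-Richardson 21: r = (k/(k-1)) * (1 - M(k-M)/(k*s^2)).
    Assumes dichotomously scored items of roughly equal difficulty."""
    k = num_items
    m = statistics.mean(scores)
    var = statistics.pvariance(scores)  # variance of the total scores
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var))

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient between two score sets."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Invented totals for the same candidates on a 35-item final test and on the PET
final = [20, 24, 18, 30, 26, 22, 28, 19, 25, 27]
pet   = [22, 25, 17, 31, 24, 21, 29, 20, 26, 28]

print(round(kr21(35, final), 3))        # internal consistency estimate
print(round(pearson_r(final, pet), 3))  # validity coefficient against the PET
```

A high Pearson r between the final test and the PET would support concurrent validity, while the KR-21 coefficient indicates how consistently the test's items measure the same underlying ability.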
5. The organization of the study
The study is divided into three parts:
Part One: Introduction – presents basic information such as the rationale, the scope, the objectives, the methods and the organization of the study.
Part Two: Development – consists of two chapters:
Chapter 1: Literature Review – reviews the literature related to language testing and test evaluation.
Chapter 2: Methodology and Results – is concerned with the methods of the study, the selection of participants, the materials, the methods of data collection and analysis, and the results of the data analysis.
Part Three: Conclusion – summarizes the study and its limitations, and gives recommendations for further studies.
These parts are followed by the References and Appendices.
PART TWO: DEVELOPMENT
CHAPTER 1
LITERATURE REVIEW
This chapter attempts to establish the theoretical background for the study. Approaches to language testing and to testing reading, as well as the literature on test evaluation, will be reviewed.
1.1. Language testing
1.1.1. Approaches to language testing
1.1.1.1. The essay translation approach
According to Heaton (1998), this approach is commonly referred to as the pre-scientific stage of language testing. In this approach, no special skill or expertise in testing is required. Tests usually consist of essay writing, translation and grammatical analysis. The tests, for Heaton, also have a heavy literary and cultural bias. He also remarks that public examinations (i.e. secondary school leaving examinations) resulting from the essay translation approach sometimes have an aural/oral component at the upper intermediate and advanced levels, though this has sometimes been regarded in the past as something additional and in no way an integral part of the syllabus or examination (p. 15).
1.1.1.2. The structuralist approach
“This approach is characterized by the view that language learning is chiefly concerned with the systematic acquisition of a set of habits. It draws on the work of structural linguistics, in particular the importance of contrastive analysis and the need to identify and measure the learner’s mastery of the separate elements of the target language: phonology, vocabulary and grammar. Such mastery is tested using words and sentences completely divorced from any context on the grounds that a language forms can be covered in the test in a comparatively short time. The skills of listening, speaking, reading and writing are also separated from one another as much as possible because it is considered essential to test one thing at a time” (Heaton, 1998, p. 15).
According to him, this approach is still valid for certain types of test and for certain purposes, such as the desire to concentrate on the testees’ ability to write by attempting to separate a composition test from reading. The psychometric approach to measurement, with its emphasis on reliability and objectivity, forms an integral part of structuralist testing. Psychometrists were able to show early on that such traditional examinations as essay writing are highly subjective and unreliable. As a result, the need for statistical measures of reliability and validity is considered of the utmost importance in testing: hence the popularity of the multiple-choice item – a type of item which lends itself admirably to statistical analysis.
1.1.1.3. The integrative approach
Heaton (1998, p. 16) considered this approach the testing of language in context, concerned primarily with meaning and the total communicative effect of discourse. As a result, integrative tests do not seek to separate language skills into neat divisions in order to improve test reliability; instead, they are often designed to assess the learner’s ability to use two or more skills simultaneously. Thus, integrative tests are concerned with a global view of proficiency – an underlying language competence or ‘grammar of expectancy’, which, it is argued, every learner possesses regardless of the purpose for which the language is being learnt.
Integrative testing, according to Heaton (1998), is best characterized by the use of cloze testing and dictation. Besides these, oral interviews, translation and essay writing are also included in many integrative tests – a point frequently overlooked by those who take too narrow a view of integrative testing.
Heaton (1998) points out that the cloze procedure, as a measure of reading difficulty and reading comprehension, will be treated briefly in the relevant section of the chapter on testing reading comprehension. Dictation, another major type of integrative test, was previously regarded solely as a means of measuring students’ listening comprehension; thus, the complex elements involved in tests of dictation were largely overlooked until fairly recently. The integrated skills involved in dictation tests include auditory discrimination, auditory memory span, spelling, the recognition of sound segments, familiarity with the grammatical and lexical patterning of the language, and overall textual comprehension.
1.1.1.4. The communicative approach
According to Heaton (1998, p. 19), “the communicative approach to language testing is sometimes linked to the integrative approaches. However, although both approaches emphasize the importance of the meaning of utterances rather than their form and structure, there are nevertheless fundamental differences between the two approaches”. The communicative approach is said to be very humanistic, in the sense that each student’s performance is evaluated according to his or her degree of success in performing the language tasks rather than solely in relation to the performance of other students (Heaton, 1998, p. 21).
However, the communicative approach to language testing reveals two drawbacks. First, teachers will find it difficult to assess students’ ability without comparing students’ achievement results on language tests. Second, the communicative approach is claimed to be somewhat unreliable because of the variety of real-life situations (Hoang, 2005, p. 8). Nevertheless, Heaton (1988) proposes a solution to this matter: in his view, to avoid the lack of reliability, very carefully drawn-up and well-established criteria must be designed, although he does not set out any criteria in detail.
In a nutshell, each approach to language testing has its strong points as well as its weak points. Therefore, a good test should incorporate features of all four approaches (Heaton, 1988, p. 15).
1.1.2. Classifications of Language Tests
Language tests may be of various types but different scholars hold different views on the types of language tests.
Henning (1987), for instance, establishes seven kinds of language tests, which can be demonstrated as follows:
No
Types
Characteristics
1
Objective vs subjective tests
Objective tests have a clear marking scale and do not require much judgement from markers.
Subjective tests are scored based on the raters’ judgements or opinions. They are claimed to be unreliable and rater-dependent.
2
Direct vs indirect tests
Direct tests are in the form of spoken tests (in real-life situations).
Indirect tests are in the form of written tests.
3
Discrete vs integrative tests
Discrete tests are used to test knowledge in restricted areas.