The Use of an Automated Essay Answer Scorer in Measuring Teacher Knowledge

Raden Ahmad Hadian Adhy Permana*, Ari Widodo, Wawan Setiawan, Siti Sriyati


Measuring teachers' ability is part of evaluation in the national education system. Developing instruments in the form of essay questions has become an alternative because essays are considered more authentic than multiple-choice questions. However, using essay instruments raises concerns about scoring consistency and resource efficiency. The purpose of this study was to examine the inter-rater reliability between an automated essay scoring system and human raters in measuring teachers' knowledge with an essay instrument. The study involved 200 randomly selected junior high school science teachers. A quantitative design was applied to investigate the intra-class correlation coefficient (ICC) and Pearson's correlation (r) as indicators of the performance of the automated essay scoring system (UKARA). The main data were participants' answers to restricted-response essay questions distributed online. The inter-rater reliability coefficients between UKARA and the human raters were in the high category (above 0.7) for all items, meaning that the scores given by UKARA correlated strongly with those given by the human raters. These results indicate that UKARA has adequate capability as an automated essay scoring system for measuring science teachers' knowledge.
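The two reliability indicators named in the abstract can be computed from paired machine and human scores. The sketch below uses hypothetical score lists (not the study's data) and assumes a two-way consistency ICC, form ICC(3,1), which is one common choice for fixed raters; the abstract does not specify which ICC form was used.

```python
# Sketch: inter-rater reliability between an AES system and a human rater.
# Score lists are illustrative placeholders, not data from the study.
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def icc_3_1(ratings):
    """Two-way mixed, consistency ICC(3,1).
    `ratings` is a list of per-subject tuples, one column per rater."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

human = [3, 4, 2, 5, 4, 3, 5, 2, 4, 3]   # hypothetical human-rater scores
ukara = [3, 4, 3, 5, 4, 2, 5, 2, 4, 4]   # hypothetical UKARA scores

r = pearson_r(human, ukara)
icc = icc_3_1(list(zip(human, ukara)))
print(f"Pearson r = {r:.3f}, ICC(3,1) = {icc:.3f}")
```

With these illustrative scores both coefficients come out around 0.86, which would fall in the "high" category (above 0.7) used in the study.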



Keywords: Automated essay scoring, inter-rater reliability, competency measurement








Copyright (c) 2021 Jurnal IPA & Pembelajaran IPA


ISSN 2614-0500  (print) | 2620-553X (online)
Organized by Universitas Syiah Kuala 
Published by Program Studi Magister Pendidikan IPA Program Pascasarjana Universitas Syiah Kuala


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.