This study explored the theoretical foundations of automatic item generation (AIG) and derived practical implications by examining the quality of English reading items produced automatically by generative language models, with the goal of supporting item writers who need to develop high-quality items efficiently. To this end, an open-source model (Llama) and a closed-source model (ChatGPT) were selected, and test items were generated under four sets of conditions so that the outcomes could be compared. These conditions specified eight variables (performance standard, item type, genre, topic, keyword, number of options, language, and text length) to keep the generation settings as consistent as possible across the two models. Under these conditions, two item types, identifying specific details and identifying changes in mood or emotion, were pilot-generated for 7th- and 10th-grade students. Ten content experts then evaluated the quality of the generated items. The results indicated that items produced by generative language models still require improvement and are not yet ready for immediate use in school settings; however, there was no meaningful difference in quality between the items generated by the open-source and closed-source models. These results suggest that institutions sensitive to data security may consider building automatic item generation systems on open-source models.