Measuring Vision-Language STEM Skills of Neural Models (2024)

Model	Variant	Science	Technology	Engineering	Math	Average
Random Guesses		38.6	25.0	44.9	39.1	36.9
Language Models
GloVe (Pennington et al., 2014)		38.0	25.2	48.1	39.0	37.6
UnifiedQA_Small (Khashabi et al., 2020)		39.6	27.2	58.0	39.6	41.1
UnifiedQA_Base (Khashabi et al., 2020)		42.6	28.8	55.4	40.0	41.7
GPT-3 (Brown et al., 2020)		47.1	22.1	73.5	44.0	46.7
GPT-3.5-Turbo		50.1	26.3	74.6	45.0	49.0
Vision-Language Models
Virtex (Desai & Johnson, 2021)		37.5	24.0	48.1	38.9	37.1
12-in-1 (Lu et al., 2020)		39.4	27.5	44.2	41.9	38.3
ViLBERT (Lu et al., 2019)		39.0	32.1	44.2	42.7	39.5
UNITER (Chen et al., 2020b)		50.8	34.6	55.1	43.2	45.9
CLIP (Radford et al., 2021)	RN50	47.8	64.4	55.8	43.6	52.9
	RN101	50.3	65.3	46.7	43.7	51.5
	RN50x4	48.8	69.2	49.4	44.1	52.9
	RN50x16	49.8	66.1	51.4	44.3	52.9
	RN50x64	50.9	70.0	55.5	43.2	54.9
	ViT-B/32	48.3	63.7	59.5	42.8	53.6
	ViT-B/16	48.6	65.9	47.2	43.6	51.3
	ViT-L/14	49.8	68.6	54.3	43.1	54.0
	ViT-L/14@336px	50.3	68.7	55.1	43.6	54.4
	+Finetuning	87.0	71.9	67.7	78.4	76.3

3 Experiments

In this section, we report the performance of a wide set of neural models, as well as humans, on STEM. The results show that state-of-the-art foundation models like CLIP and GPT-3.5-Turbo still underperform average elementary students. The details of the experimental setup, additional results, and analysis are described in the appendix.

3.1 Main Results

Zero-Shot

The results are shown in Table 2.4. We first test language models to see whether models that only understand text are proficient at the multimodal skills in STEM. GloVe has near random-chance accuracy. This means that STEM cannot be solved by simply matching the semantic similarity between question text and answer text. UnifiedQA does slightly better than GloVe, with an average improvement of 4.1 percentage points. GPT-3.5-Turbo performs the best among these language models, reaching 49.0% accuracy on average. Both foundation models (GPT-3.5-Turbo and GPT-3) perform well in engineering. This is largely because engineering practices are mostly described in text (see Figure 1(a)(iii)), and recent advancements in large language models have dramatically improved text understanding capabilities. However, large language models still struggle in the other subjects. This implies that understanding both vision and language information is essential to STEM skills.

Next, we examine vision-language models. We find that the performance of Virtex, 12-in-1, and ViLBERT is close to that of random guesses. These models capture very limited knowledge of STEM subjects. On the other hand, UNITER and CLIP show significant improvements over random-chance accuracy. Specifically, CLIP-RN50x64 achieves the best result on STEM, an improvement of 18.0 percentage points over random guesses. Notably, CLIP-RN50x64 outperforms GPT-3.5-Turbo by 5.9 percentage points. This shows that CLIP has a basic understanding of multimodal STEM skills, and its vision understanding ability certainly contributes to this performance. Among all subjects, we see only marginal improvements in math. This applies to all foundation models and implies that math is the most challenging subject for current neural models. Novel algorithmic advancements that enable strong reasoning ability are necessary to solve math problems.
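The margins quoted above follow from the Average column of the results table. The snippet below is a minimal sketch that recomputes them from the per-subject accuracies; the dictionary and function names are illustrative and are not part of the benchmark's code.

```python
# Per-subject zero-shot accuracies (Science, Technology, Engineering, Math),
# copied from the results table above.
ZERO_SHOT = {
    "Random Guesses": [38.6, 25.0, 44.9, 39.1],
    "GPT-3.5-Turbo": [50.1, 26.3, 74.6, 45.0],
    "CLIP-RN50x64": [50.9, 70.0, 55.5, 43.2],
}


def average(scores):
    """Unweighted mean over the four subjects, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


def margin(model, baseline):
    """Difference in average accuracy, in percentage points."""
    return round(average(ZERO_SHOT[model]) - average(ZERO_SHOT[baseline]), 1)


print(margin("CLIP-RN50x64", "Random Guesses"))
print(margin("CLIP-RN50x64", "GPT-3.5-Turbo"))
```

The average here is the unweighted mean over subjects, which matches the Average column of the table for these rows.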

Finetuning

The results are shown in Table 2.4. Encouragingly, finetuning CLIP ViT-L/14@336px significantly boosts the performance on science and math, by more than 30 percentage points on average over its zero-shot setting. The performance improvement on the other subjects averages 7.9 percentage points, which is much smaller. While having a large amount of training data helps to some extent, the finetuning performance is still far behind that of an average elementary student (the human-level performance is presented in Sec. 3.3). This indicates that more fundamental advancements are required to solve the questions in the STEM dataset. For simplicity, we use CLIP to refer to CLIP ViT-L/14@336px in the rest of this section.
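The per-subject gains from finetuning can be recomputed from the zero-shot and finetuned CLIP ViT-L/14@336px rows of the results table. The following is a small sketch of that arithmetic; the variable names are illustrative only.

```python
SUBJECTS = ["Science", "Technology", "Engineering", "Math"]
# CLIP ViT-L/14@336px rows from the results table.
ZERO_SHOT = [50.3, 68.7, 55.1, 43.6]
FINETUNED = [87.0, 71.9, 67.7, 78.4]

# Per-subject gain from finetuning, in percentage points.
gains = {s: round(f - z, 1) for s, z, f in zip(SUBJECTS, ZERO_SHOT, FINETUNED)}


def mean_gain(subjects):
    """Average gain over the given subjects, rounded to one decimal."""
    return round(sum(gains[s] for s in subjects) / len(subjects), 1)


print(gains)
print(mean_gain(["Science", "Math"]))
print(mean_gain(["Technology", "Engineering"]))
```

The technology and engineering average reproduces the 7.9 points quoted above, while the science and math gains are each well above 30 points.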

3.2 Results Analysis

Skills

As STEM covers a large number of skills, analyzing models' performance at the skill level helps us understand the models better. We show the performance of foundation models (GPT-3, GPT-3.5-Turbo, and CLIP) on an uncurated set of skills from each subject in Figure 5. We find that these foundation models perform well zero-shot on skills that focus on identifying common objects (e.g., classifying fruits). However, both zero-shot and finetuned foundation models fail on challenging skills that require abstract knowledge and complex reasoning (e.g., describing transformations).

Grades

Intuitively, questions for higher grades are more difficult than those for lower grades. We examine grade-level model performance to investigate whether the same trend holds for neural models. We show the exam scores of models for each grade in Figure 6. Surprisingly, there is no obvious performance drop as the grade level increases. This implies that the learning curve for neural models may be different from that of humans. One possible reason is that neural models are trained on data including questions from all grade levels simultaneously, while humans learn gradually from lower to higher grade-level questions. Also, the average exam score over the elementary grades (grades 1-6) is 40.8, which is 54.7% lower than the human reference (i.e., 90).
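Note that the 54.7% figure is a relative shortfall with respect to the human reference score, not a difference in score points. A one-function sketch of the computation (the function name is ours):

```python
def relative_gap(model_score, human_reference):
    """Shortfall of the model relative to the human reference, in percent."""
    return round(100 * (human_reference - model_score) / human_reference, 1)


print(relative_gap(40.8, 90))
```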

Calibration

A trustworthy model should be calibrated: its confidence should approximately match the actual probability of the prediction being correct (Guo et al., 2017a). However, modern neural networks are often not well calibrated (Nguyen et al., 2015; Guo et al., 2017b). We show the relationship between the confidence of CLIP and the corresponding accuracy in Figure 8, using the softmax probability as the confidence. We observe that the zero-shot CLIP model is not well calibrated. In fact, it is overconfident in its predictions, and its confidence is only loosely related to its actual accuracy. After finetuning, CLIP is better calibrated. The results suggest that further improving calibration on STEM is another promising direction.
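To make the notion of calibration concrete, the sketch below takes the largest softmax probability as the confidence, groups predictions into confidence bins, and compares the mean confidence with the accuracy in each bin. The expected calibration error used as a summary is the metric discussed by Guo et al. (2017a); it is an assumption here and not necessarily the statistic behind the figure. The bin count and the toy predictions are made up for illustration.

```python
import math


def softmax_confidence(scores):
    """Largest softmax probability over the answer-option scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    return max(exps) / sum(exps)


def reliability_bins(confidences, correct, num_bins=10):
    """Return (mean confidence, accuracy, count) for each non-empty bin."""
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, hit))
    summary = []
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, h in members if h) / len(members)
            summary.append((mean_conf, accuracy, len(members)))
    return summary


def expected_calibration_error(confidences, correct, num_bins=10):
    """Count-weighted average of |accuracy - confidence| over the bins."""
    total = len(confidences)
    return sum(
        count / total * abs(accuracy - mean_conf)
        for mean_conf, accuracy, count in reliability_bins(confidences, correct, num_bins)
    )


# Toy, made-up predictions: very confident, but only half are right.
confidences = [0.95, 0.95, 0.95, 0.95]
correct = [True, False, True, False]
print(expected_calibration_error(confidences, correct))
```

An overconfident model, as in this toy case, has mean confidence well above accuracy in its high-confidence bins; a perfectly calibrated model would have an error of zero.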

Scaling Laws

Figure 8 shows the average accuracy of zero-shot CLIP with different model sizes. As expected, the performance improves as models grow larger, but it also saturates. This implies that, beyond increasing model scale, new advancements in model design or training schemes are required to improve the performance on STEM.

3.3 Comparison with Human

In this section, we explore whether the best-performing foundation models, namely CLIP, GPT-3, and GPT-3.5-Turbo, are nearing human-level performance.

Figure 9(a) shows the exam scores (Sec. 2.4) of models and humans on each subject. A score of 90 means a student is proficient in the subject. The zero-shot performance of all tested neural models is well below that bar. In technology, finetuned CLIP achieves human-level performance. This is mainly because most technology skills concern specific empirical knowledge, which neural models can learn through finetuning. Overall, there is still a large performance gap between general neural models and average elementary students, even in understanding the fundamental skills in STEM. In addition, the offline real-world test-takers (Sec. 2.4) produce results similar to those of the above online setup on a subset of the questions in STEM. The results are shown in Figure 9(b).

3.4 Case Study

We show examples of GPT-3.5-Turbo predictions in Figure 10, including both correct and incorrect predictions. For the correct ones, the corresponding skills are mainly about the basics, such as names of objects (e.g., shapes or animals). The incorrect predictions are mainly due to the complex nature of the skills involved. These skills are often about abstract concepts such as symmetry and the direction of force. They also rely more on logical reasoning, such as finding patterns or inferring the function of an animal adaptation.

4 Related Work

There are various types of vision-language tasks, such as reference resolution (Kazemzadeh et al., 2014), image captioning or tagging (Thomee et al., 2016; Sharma et al., 2018), image-text retrieval (Lin et al., 2014; Plummer et al., 2015), visual question answering (Antol et al., 2015; Goyal et al., 2017; Zhang et al., 2016; Zhu et al., 2016), and visual reasoning (Suhr et al., 2017; Johnson et al., 2017). STEM differs from the previous datasets in that it covers diverse fundamentals of STEM and requires both multimodal understanding and domain knowledge in STEM. This makes STEM a natural testbed to evaluate the real-world problem-solving abilities of machine learning models.

Existing STEM-related benchmarks do not cover all STEM skills for multimodal understanding. There are benchmarks targeting math (Saxton et al., 2019; Hendrycks et al., 2021b; Zheng et al., 2022; Lu et al., 2021a; b; Xiong et al., 2023b). PIQA (Bisk et al., 2020) is a benchmark for physical commonsense understanding. ScienceQA (Lu et al., 2022) is a multimodal dataset for general science. MMLU (Hendrycks et al., 2021a) contains 57 tasks including STEM but is restricted to the text modality. STEM is the first to include all STEM subjects for vision-language understanding.

Pretrained foundation models help achieve state-of-the-art performance in both NLP and computer vision tasks. Pretrained language models (Radford et al., 2018; 2019; Devlin et al., 2019), especially the recent large language models (Chen et al., 2020a; Wang et al., 2020; 2022a; Ouyang et al., 2022; Crispino et al., 2023; OpenAI, 2023; Chowdhery et al., 2022), have significantly advanced the performance in general natural language understanding tasks. Based on these models, various techniques (Shen et al., 2022a; b; Imani et al., 2023; Jiang et al., 2023; Wang et al., 2023; Xiong et al., 2023a; Pan et al., 2024b; a) have been developed to address specific challenges in a domain such as math. We focus on testing the basic STEM ability of state-of-the-art models in a zero-shot setting and identifying room for improvement by referring to our finetuning results. CLIP (Radford et al., 2021) is one of the state-of-the-art pretrained vision-language models (Lu et al., 2019; Krishna et al., 2017; Chen et al., 2020b; Desai & Johnson, 2021; Lu et al., 2020). Other similar models include GLIP (Li et al., 2022b), GLIDE (Nichol et al., 2022), OFA (Wang et al., 2022b), and BLIP (Li et al., 2022a; 2023). We use CLIP in our tests, while the majority of existing benchmarks have not explored it yet.

5 Conclusion

We introduce STEM, a new challenge to examine the STEM skills of neural models. STEM is the largest multimodal benchmark for this challenge. It consists of a large number of multi-choice questions and skills spanning all STEM subjects. STEM focuses on the fundamentals of STEM based on the K-12 curriculum. We also include state-of-the-art foundation models such as GPT-3.5-Turbo and CLIP for evaluation. The benchmark results suggest that the performance of current neural models is still far behind that of elementary students. STEM poses unique challenges for the research community to develop fundamental algorithmic advancements. We hope our benchmark will foster future research in multimodal understanding.

Ethics Statement

We hereby acknowledge that all of the co-authors of this work are aware of the provided ICLR Code of Ethics and honor the code of conduct. We collected data from several sources, and we cited the data creators. The copyright belongs to the original data owners. The STEM dataset is under the CC BY-NC-SA 4.0 license (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International) and is used for non-commercial research purposes. The collected data does not contain any personally identifiable information or offensive content. Our dataset is mainly built upon instances from real-world exam data. Therefore, it is less likely to contain sensitive data. We evaluate foundation models, for which the risks and potential harms are discussed in Brown et al. (2020) and Radford et al. (2021).

Acknowledgements

This paper is partially supported by the National Key Research and Development Program of China with Grant No. 2023YFC3341203 as well as the National Natural Science Foundation of China with Grant No. 62276002.

References

Anderson et al. (2018) Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pp. 6077–6086, 2018.
Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: visual question answering. In ICCV, pp. 2425–2433, 2015.
Bashkov et al. (2021) Bozhidar M Bashkov, Kate Mattison, and Lara Hochstein. Ixl design principles. 2021.
Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: reasoning about physical commonsense in natural language. In AAAI, pp. 7432–7439, 2020.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Chen et al. (2020a) Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E. Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, 2020a.
Chen et al. (2020b) Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In ECCV, pp. 104–120, 2020b.
Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways. CoRR, abs/2204.02311, 2022.
Crispino et al. (2023) Nicholas Crispino, Kyle Montgomery, Fankun Zeng, Dawn Song, and Chenguang Wang. Agent instructs large language models to be general zero-shot reasoners. arXiv preprint arXiv:2310.03710, 2023.
Desai & Johnson (2021) Karan Desai and Justin Johnson. Virtex: Learning visual representations from textual annotations. In CVPR, pp. 11162–11173, 2021.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, 2019.
Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
Goyal et al. (2017) Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, pp. 6325–6334, 2017.
Guo et al. (2017a) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In ICML, pp. 1321–1330, 2017a.
Guo et al. (2017b) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pp. 1321–1330. PMLR, 2017b.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.
Hendrycks et al. (2021a) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR, 2021a.
Hendrycks et al. (2021b) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021b.
Imani et al. (2023) Shima Imani, Liang Du, and Harsh Shrivastava. Mathprompter: Mathematical reasoning using large language models. CoRR, abs/2303.05398, 2023.
IXL (a) IXL. Understanding the ixl smartscore. https://blog.ixl.com/wp-content/uploads/2014/11/SmartScore-guide.pdf, a.
IXL (b) IXL. How does the smartscore work? https://www.ixl.com/help-center/article/1272663/how_does_the_smartscore_work, b.
Jiang et al. (2023) Albert Qiaochu Jiang, Sean Welleck, Jin Peng Zhou, Timothée Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, and Yuhuai Wu. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. In ICLR, 2023.
Johnson et al. (2017) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, pp. 1988–1997, 2017.
Kazemzadeh et al. (2014) Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L. Berg. Referitgame: Referring to objects in photographs of natural scenes. In ACL, pp. 787–798, 2014.
Khashabi et al. (2020) Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. Unifiedqa: Crossing format boundaries with a single QA system. In Findings of EMNLP, pp. 1896–1907, 2020.
Krishna et al. (2017) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, pp. 32–73, 2017.
Learning (2019) IXL Learning. The impact of ixl math and ixl ela on student achievement in grades pre-k to 12 (pp. 1–27), 2019.
Li et al. (2022a) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pp. 12888–12900. PMLR, 2022a.
Li et al. (2023) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
Li et al. (2022b) Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training. In CVPR, pp. 10955–10965, 2022b.
Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pp. 740–755, 2014.
Liu et al. (2023) Chengwu Liu, Jianhao Shen, Huajian Xin, Zhengying Liu, Ye Yuan, Haiming Wang, Wei Ju, Chuanyang Zheng, Yichun Yin, Lin Li, Ming Zhang, and Qun Liu. Fimo: A challenge formal dataset for automated theorem proving, 2023.
Lu et al. (2019) Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, pp. 13–23, 2019.
Lu et al. (2020) Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. In CVPR, 2020.
Lu et al. (2021a) Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In ACL-IJCNLP, pp. 6774–6786, 2021a.
Lu et al. (2021b) Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. In NeurIPS, 2021b.
Lu et al. (2022) Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022.
Nguyen et al. (2015) Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 427–436, 2015.
Nichol et al. (2022) Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. In ICML, pp. 16784–16804, 2022.
OpenAI (2023) OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
Pan et al. (2024a) Yu Pan, Ye Yuan, Yichun Yin, Jiaxin Shi, Zenglin Xu, Ming Zhang, Lifeng Shang, Xin Jiang, and Qun Liu. Preparing lessons for progressive training on language models. arXiv preprint arXiv:2401.09192, 2024a.
Pan et al. (2024b) Yu Pan, Ye Yuan, Yichun Yin, Zenglin Xu, Lifeng Shang, Xin Jiang, and Qun Liu. Reusing pretrained models by multi-linear operators for efficient training. Advances in Neural Information Processing Systems, 36, 2024b.
Peng et al. (2023) Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world, 2023.
Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In EMNLP, pp. 1532–1543, 2014.
Plummer et al. (2015) Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, pp. 2641–2649, 2015.
Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019.
Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pp. 8748–8763, 2021.
Saxton et al. (2019) David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. In ICLR, 2019.
Sharma et al. (2018) Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pp. 2556–2565, 2018.
Shen et al. (2022a) Da Shen, Xinyun Chen, Chenguang Wang, Koushik Sen, and Dawn Song. Benchmarking language models for code syntax understanding. In EMNLP, 2022a.
Shen et al. (2022b) Jianhao Shen, Chenguang Wang, Ye Yuan, Jiawei Han, Heng Ji, Koushik Sen, Ming Zhang, and Dawn Song. Palt: Parameter-lite transfer of language models for knowledge graph completion. In EMNLP, 2022b.
Suhr et al. (2017) Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi. A corpus of natural language for visual reasoning. In ACL, pp. 217–223, 2017.
Sun et al. (2023) Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale, 2023.
Thomee et al. (2016) Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: the new data in multimedia research. Commun. ACM, pp. 64–73, 2016.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
Wang et al. (2020) Chenguang Wang, Xiao Liu, and Dawn Song. Language models are open knowledge graphs. arXiv preprint arXiv:2010.11967, 2020.
Wang et al. (2022a) Chenguang Wang, Xiao Liu, Zui Chen, Haoyun Hong, Jie Tang, and Dawn Song. Deepstruct: Pretraining of language models for structure prediction. In ACL, 2022a.
Wang et al. (2023) Haiming Wang, Ye Yuan, Zhengying Liu, Jianhao Shen, Yichun Yin, Jing Xiong, Enze Xie, Han Shi, Yujun Li, Lin Li, Jian Yin, Zhenguo Li, and Xiaodan Liang. DT-solver: Automated theorem proving with dynamic-tree sampling guided by proof-level value function. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12632–12646, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.706. URL https://aclanthology.org/2023.acl-long.706.
Wang et al. (2022b) Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pp. 23318–23340. PMLR, 2022b.
Xiong et al. (2023a) Jing Xiong, Zixuan Li, Chuanyang Zheng, Zhijiang Guo, Yichun Yin, Enze Xie, Zhicheng Yang, Qingxing Cao, Haiming Wang, Xiongwei Han, Jing Tang, Chengming Li, and Xiaodan Liang. Dq-lore: Dual queries with low rank approximation re-ranking for in-context learning, 2023a.
Xiong et al. (2023b) Jing Xiong, Jianhao Shen, Ye Yuan, Haiming Wang, Yichun Yin, Zhengying Liu, Lin Li, Zhijiang Guo, Qingxing Cao, Yinya Huang, Chuanyang Zheng, Xiaodan Liang, Ming Zhang, and Qun Liu. TRIGO: Benchmarking formal mathematical proof reduction for generative language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 11594–11632, Singapore, December 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.711. URL https://aclanthology.org/2023.emnlp-main.711.
Zhang et al. (2016) Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Yin and yang: Balancing and answering binary visual questions. In CVPR, pp. 5014–5022, 2016.
Zheng et al. (2022) Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. minif2f: a cross-system benchmark for formal olympiad-level mathematics. In ICLR, 2022.
Zhu et al. (2016) Yuke Zhu, Oliver Groth, Michael S. Bernstein, and Li Fei-Fei. Visual7w: Grounded question answering in images. In CVPR, pp. 4995–5004, 2016.

Appendix A More Details on STEM

In this section, we provide more details on STEM, including dataset analysis, models, evaluation settings, and dataset collection.

A.1 Analysis

Questions and Answers

STEM contains multi-choice questions (Appendix D provides a question example for each skill). Each question consists of a textual description with an optional image context. Answer options are given as text or as images. We further analyze the questions from the following aspects. (i) The number of answers. STEM has $2.8$ answer options per question on average. The distribution is presented in Figure 12. In practice, the more answer options a question has, the more difficult it is. (ii) Question type. We categorize questions based on the first three words of the question text, as shown in Figure 12. STEM mostly includes factoid questions that start with words such as “which” and “what”. We also show the word cloud of STEM in Figure 13. The most common words include “shape” and “number”, which indicates that the questions require joint reasoning over the text and images. (iii) Question distribution. Figure 15 depicts the distribution of question lengths. All subjects generally follow a long-tail distribution; the math distribution is the steepest and the science distribution is flatter. Heuristically, longer questions are more difficult to solve. Figure 15 also shows the number of questions in each grade. While pre-K has more questions, the questions in the other grades are approximately evenly distributed.
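The statistics above are straightforward to compute once the questions are loaded. The sketch below assumes a hypothetical record layout with `text` and `options` fields (the released dataset may use different names) and a few made-up questions. It also shows why the number of answer options matters: under uniform random guessing with one correct option per question, the expected accuracy on a question with k options is 1/k.

```python
from collections import Counter

# Hypothetical records; the field names and questions are illustrative only.
questions = [
    {"text": "Which shape is a triangle?", "options": ["A", "B"]},
    {"text": "Which shape is a square?", "options": ["A", "B", "C"]},
    {"text": "What number comes next?", "options": ["3", "4", "5", "6"]},
]


def mean_option_count(items):
    """Average number of answer options per question."""
    return sum(len(q["options"]) for q in items) / len(items)


def question_types(items, prefix_len=3):
    """Count questions by their first `prefix_len` lower-cased words."""
    return Counter(" ".join(q["text"].lower().split()[:prefix_len]) for q in items)


def random_guess_accuracy(items):
    """Expected accuracy (in percent) of uniform random guessing."""
    return 100 * sum(1 / len(q["options"]) for q in items) / len(items)


print(mean_option_count(questions))
print(question_types(questions).most_common(2))
print(round(random_guess_accuracy(questions), 1))
```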

Skill Comparison

We compare the skills of STEM with other related datasets in Table 3. STEM contains the largest skill set among existing datasets, with a great number of new skills that are not yet covered by existing datasets, e.g., skills in technology and engineering.

Subject	IconQA	ScienceQA	STEM
Science	0	167	82
Technology	0	0	9
Engineering	0	0	6
Math	13	0	351
Total	13	167	448

IconQA	Subject	STEM
Counting		Count to 10, Count shapes in rows, Count sides and corners …
Geometry		Classify triangles, Identify symmetry, Identify shapes …
Time		Match times, Identify A.M./P.M., Read a calendar …
…		…
Not covered	Science	Compare concentrations of solutions …
	Technology	Identify peripherals …
	Engineering	Identify laboratory tools …
	Math	Linear and exponential functions …

A.2 Models

In this section, we introduce the foundation models we benchmark in detail.

Vision-Language Models

CLIP (Radford et al., 2021). CLIP is pretrained on a large dataset of 400 million text-image pairs collected from the Internet. It uses a Transformer as the text encoder, and has several variants of the image encoder, including ResNet (RN) backbones and Vision Transformers (ViT) (Dosovitskiy et al., 2020). CLIP aligns the text and image representations by training with an in-batch contrastive loss, and is able to transfer zero-shot to downstream vision-language tasks. To align with CLIP pretraining, we formulate question answering as matching text and images. We use the cosine similarity between the text and image embeddings as the matching function, the same as the original zero-shot image-text retrieval setting in CLIP (Radford et al., 2021).
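For readers unfamiliar with the in-batch contrastive objective, the following is a minimal pure-Python sketch of a symmetric contrastive loss of the kind described by Radford et al. (2021). It is not CLIP's training code: the embeddings are toy vectors, and the fixed temperature of 0.07 is an illustrative choice (CLIP learns this value during training).

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric in-batch loss; image i and text i form the positive pair."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    n = len(images)
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is small when each image is most similar to its own text within the batch and large when the pairing is scrambled, which is what pushes matching text and image embeddings together.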

ViLBERT and 12-in-1 (Lu etal., 2019; 2020). ViLBERT adopts two parallel streams to process image regions and text segments separately, with co-attentional transformer layers connecting them. There is also a multi-task version called 12-in-1 (Lu etal., 2020) that trains 12 different tasks with individual task-specific heads sharing 1 “trunk” ViLBERT model. Its multi-modal alignment prediction serves as the matching score.

UNITER (Chen etal., 2020b).UNITER consists of an Image Embedder with Faster R-CNN (Anderson etal., 2018), a Text Embedder with Transformer (Vaswani etal., 2017), as well as a multi-layer Transformer to get cross-modality representation. During inference on STEM, the matching score function is the same as CLIP, i.e., the cosine similarity between the text and image embeddings (Chen etal., 2020b).

Virtex (Desai & Johnson, 2021). Virtex first extracts visual features with a ResNet-50 (He et al., 2016) backbone. The visual features are then fed into a textual head, which consists of two unidirectional Transformers, to predict captions. We extract the image feature with the image encoder, then feed the text into the textual head and use the sum of the bidirectional generation logits as the matching score.

Language Models

GPT-3 (Brown et al., 2020) and GPT-3.5-Turbo (Ouyang et al., 2022). These foundation language models are generative models pretrained on a large corpus of text. We use the OpenAI API models “text-davinci-002” and “gpt-3.5-turbo”, corresponding to the best-performing GPT-3 and GPT-3.5-Turbo models, respectively. We formalize the evaluation task as a question-answering task. The input to GPT-3 and GPT-3.5-Turbo is the concatenation of the question text, the context text, and the answer options. The output is the predicted answer, chosen from the answer options. For images in questions, we follow Lu et al. (2022) and convert them to visual context text using a captioning model consisting of ViT (Dosovitskiy et al., 2020) and GPT-2 (Radford et al., 2019).
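
The input construction described above can be illustrated with a minimal sketch. The exact prompt template is not given in the text, so the field labels ("Question:", "Context:", "Options:", "Answer:"), the lettered option format, and the function name `build_prompt` are assumptions made only for illustration.

```python
def build_prompt(question, context, options):
    """Concatenate question, context, and lettered answer options into one string.

    The template used here is hypothetical; only the idea of concatenating
    the three parts comes from the text.
    """
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    option_text = " ".join(f"({letters[i]}) {opt}" for i, opt in enumerate(options))
    parts = [f"Question: {question}"]
    if context:
        # The context may include a caption generated from the question image.
        parts.append(f"Context: {context}")
    parts.append(f"Options: {option_text}")
    parts.append("Answer:")
    return "\n".join(parts)


prompt = build_prompt(
    "Which property matches this object?",
    "A photo of a piece of sandpaper.",  # hypothetical caption
    ["Rough", "Smooth"],
)
```

The resulting string would be sent to the language model, and the returned letter or option text mapped back to one of the answer options.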

UnifiedQA (Khashabi et al., 2020). UnifiedQA is a pretrained question-answering model. We use both its base and small versions. Its evaluation setup is the same as that of GPT-3 and GPT-3.5-Turbo.

GloVe (Pennington et al., 2014). GloVe is a pretrained word embedding model. We compute the similarity between the average embedding of the concatenated question and context and the average embedding of each answer option, using average pooling over the 300-dimensional word embeddings. The answer option with the largest similarity score is the output answer. Images are converted to text using the same method as for GPT-3 and GPT-3.5-Turbo.
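
A minimal sketch of this baseline is given below. It assumes cosine similarity as the similarity function (the text only says "similarity") and uses a tiny hypothetical 3-dimensional vocabulary in place of the 300-dimensional GloVe vectors; the names `TOY_VECTORS`, `average_embedding`, and `glove_answer` are illustrative.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for 300-dimensional GloVe embeddings.
TOY_VECTORS = {
    "which": np.array([0.1, 0.1, 0.1]),
    "shape": np.array([0.9, 0.1, 0.0]),
    "is": np.array([0.1, 0.1, 0.1]),
    "round": np.array([0.8, 0.2, 0.0]),
    "circle": np.array([0.9, 0.2, 0.1]),
    "seven": np.array([0.0, 0.1, 0.9]),
}


def average_embedding(text, vectors):
    """Average-pool the vectors of known words; unknown words are skipped."""
    found = [vectors[w] for w in text.lower().split() if w in vectors]
    if not found:
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim)
    return np.mean(found, axis=0)


def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def glove_answer(question_and_context, options, vectors):
    """Return the index of the option most similar to the question and context."""
    query = average_embedding(question_and_context, vectors)
    scores = [cosine(query, average_embedding(o, vectors)) for o in options]
    return int(np.argmax(scores))


choice = glove_answer("which shape is round", ["seven", "circle"], TOY_VECTORS)
```

With these toy vectors, the option "circle" (index 1) is selected because its embedding points in nearly the same direction as the pooled question embedding.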

A.3 Evaluation Settings

We benchmark state-of-the-art foundation models on STEM under different settings, including zero-shot, few-shot, finetuning, and multi-task.

(i) Zero-Shot. We use CLIP (Radford et al., 2021), ViLBERT (Lu et al., 2019), 12-in-1 (Lu et al., 2020), UNITER (Chen et al., 2020b), and Virtex (Desai & Johnson, 2021) for the zero-shot evaluation of foundation multimodal models. CLIP is a state-of-the-art multimodal model. For zero-shot CLIP, we follow its original setup in Radford et al. (2021). The input to the text encoder is the concatenation of the question text and an answer option. The input to the image encoder is the image context. The output is the cosine similarity score between each text embedding and the image embedding. The answer option with the largest similarity score is selected as the answer. For questions with image answer options, the image answer options are also fed to the image encoder.
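
The selection rule for text answer options can be sketched as follows. This is a minimal illustration rather than the evaluation code: the embeddings are small hand-written vectors standing in for the outputs of the CLIP text and image encoders, and the function names are assumptions.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def pick_answer(text_embeddings, image_embedding):
    """Return the index of the option whose text embedding best matches the image.

    text_embeddings: one row per "question + answer option" string.
    image_embedding: embedding of the image context.
    """
    scores = [cosine_similarity(t, image_embedding) for t in text_embeddings]
    return int(np.argmax(scores))


# Toy stand-ins for encoder outputs (hypothetical values, not real CLIP features).
image_embedding = np.array([1.0, 0.0, 0.0])
text_embeddings = np.array([
    [0.0, 1.0, 0.0],   # question + option A
    [0.9, 0.1, 0.0],   # question + option B
    [-1.0, 0.0, 0.0],  # question + option C
])
predicted = pick_answer(text_embeddings, image_embedding)  # index of option B
```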

(ii) Few-Shot. We also use CLIP to benchmark the multimodal few-shot results. For the $k$-shot setup, we randomly select $k$ questions for each skill from the training set as a meta training set. For each STEM subject, we train the model on the meta training set and select the best model on the validation set. At test time, the evaluation is the same as the zero-shot setup.
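
The construction of the meta training set can be sketched as below. The record layout (a dict with a `"skill"` key), the fixed seed, and the handling of skills with fewer than $k$ questions are assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_k_shot(train_set, k, seed=0):
    """Randomly pick up to k questions per skill from a list of question dicts."""
    rng = random.Random(seed)
    by_skill = defaultdict(list)
    for question in train_set:
        by_skill[question["skill"]].append(question)
    meta_train = []
    for skill in sorted(by_skill):
        questions = by_skill[skill]
        # If a skill has fewer than k questions, keep all of them (assumption).
        meta_train.extend(rng.sample(questions, min(k, len(questions))))
    return meta_train


# Toy training set: 30 questions spread evenly over 3 hypothetical skills.
train_set = [{"id": i, "skill": f"skill-{i % 3}"} for i in range(30)]
meta_train = sample_k_shot(train_set, k=4)
```

Here `meta_train` contains 4 questions for each of the 3 skills, i.e., 12 questions in total.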

(iii) Finetuning. We also finetune CLIP on the entire training set for each subject. The remaining setup is the same as the few-shot setting.

(iv) Multi-Task. Under this setting, we train CLIP on the mixture of training sets of four subjects to produce a single model for all subjects.

A.4 Dataset Collection

We collect science, engineering and math problems from IXL (https://www.ixl.com/), and technology problems from ProProfs Quizzes (https://www.proprofs.com/quiz-school) and Triviaplaza (https://www.triviaplaza.com/). We first collect multi-choice problems that have at least one image in either the question context or the answers. We collect at most 2,000 problems for each skill and remove duplicated problems. Many math problems embed formulas that are not represented in the text. We use the Mathpix (https://mathpix.com/) OCR API to convert these formulas into LaTeX format.
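
The filtering steps above can be sketched as follows. The record fields (`skill`, `question`, `options`, `question_images`, `answer_images`), the exact-match duplicate criterion, and the order in which deduplication and the per-skill cap are applied are all assumptions; the text does not specify them.

```python
from collections import defaultdict

MAX_PER_SKILL = 2000  # per-skill cap stated in the text


def filter_problems(problems, max_per_skill=MAX_PER_SKILL):
    """Keep problems with at least one image, drop duplicates, cap each skill."""
    seen = set()
    kept_per_skill = defaultdict(int)
    kept = []
    for p in problems:
        # Require an image in the question context or in the answers.
        if not (p["question_images"] or p["answer_images"]):
            continue
        # Hypothetical duplicate criterion: identical skill, text, options, and images.
        key = (
            p["skill"],
            p["question"],
            tuple(p["options"]),
            tuple(p["question_images"]),
            tuple(p["answer_images"]),
        )
        if key in seen or kept_per_skill[p["skill"]] >= max_per_skill:
            continue
        seen.add(key)
        kept_per_skill[p["skill"]] += 1
        kept.append(p)
    return kept


sphere = {"skill": "spheres", "question": "Which is a sphere?", "options": ["A", "B"],
          "question_images": [], "answer_images": ["a.png", "b.png"]}
text_only = {"skill": "spheres", "question": "What is 2 + 2?", "options": ["3", "4"],
             "question_images": [], "answer_images": []}
cone = {"skill": "cones", "question": "Is this a cone?", "options": ["yes", "no"],
        "question_images": ["c.png"], "answer_images": []}
kept = filter_problems([sphere, dict(sphere), text_only, cone])
```

In this toy run, the duplicate and the text-only problem are dropped, leaving one problem for each of the two skills.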

Appendix B More Details on Experiments

B.1 Experimental Setup

For the zero-shot setting, we evaluate all models on the test set. For the few-shot, finetuning, and multi-task settings, we train CLIP-ViT-L/14@336px on the corresponding training set, tune hyperparameters on the validation set, and finally evaluate on the test set. We use AdamW for optimization and tune hyperparameters as follows: the batch size is chosen from {16, 32, 64, 128} and, after tuning, set to 16 for few-shot learning and to 128 for finetuning and multi-task learning. The learning rate is chosen from the range [5e-6, 5e-5] and set to 1e-5 for all training. We set the warm-up ratio to 0.1 and the weight decay to 0.2. We set the maximum number of training samples to 100k for finetuning and 200k for multi-task training, and train for at most 10 epochs in few-shot training, all with early stopping on the validation set. We use NVIDIA GeForce RTX 3090 GPUs for training.
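
For reference, the selected hyperparameters can be collected in a small configuration sketch. The class name `TrainConfig`, its field names, and the dictionary `CONFIGS` are hypothetical; only the values come from the text.

```python
from dataclasses import dataclass
from typing import Optional

LR_RANGE = (5e-6, 5e-5)           # search range for the learning rate
BATCH_SIZES = (16, 32, 64, 128)   # candidate batch sizes


@dataclass(frozen=True)
class TrainConfig:
    """Selected hyperparameters for one training setting (optimizer: AdamW)."""
    setting: str
    batch_size: int
    learning_rate: float = 1e-5
    warmup_ratio: float = 0.1
    weight_decay: float = 0.2
    max_train_samples: Optional[int] = None  # cap on training samples, if any
    max_epochs: Optional[int] = None         # cap on epochs, if any
    early_stopping: bool = True              # early stopping on the validation set


CONFIGS = {
    "few-shot": TrainConfig("few-shot", batch_size=16, max_epochs=10),
    "finetuning": TrainConfig("finetuning", batch_size=128, max_train_samples=100_000),
    "multi-task": TrainConfig("multi-task", batch_size=128, max_train_samples=200_000),
}
```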

B.2 Detailed Experimental Analysis

Method		Science	Technology	Engineering	Math	Average
CLIP	Zero-Shot	50.3	68.7	55.1	43.6	54.4
	Few-Shot	75.2	70.9	61.9	63.2	67.8
	Finetuning	87.0	71.9	67.7	78.4	76.3
	Multi-Task	86.3	60.4	73.4	77.7	74.5

Few-Shot

In the few-shot setting, we vary the number of samples for each skill to see how the learning performance changes. Specifically, we sample 16 samples per skill and train CLIP on the sampled data. The results are shown in Table 4. We observe that CLIP improves substantially in all subjects after few-shot learning. This suggests that CLIP has already stored STEM-related knowledge and that a few samples are enough to elicit it. We also show how performance varies with the number of samples per skill (Figure 17). The overall performance improves with more samples, but 1-shot and 2-shot results in technology are worse than zero-shot. Since there are only 9 skills in technology, 1-shot and 2-shot learning in technology might lead to overfitting.

Multi-Task

We show the results in Table 4. Compared with the individually finetuned models, multi-task learning improves performance in engineering but performs worse in the other subjects. The large drop in technology is mainly because it has much less data than the other subjects. The improvement in engineering implies that data from one subject may benefit another when the knowledge is transferable. For example, science shares many common topics with engineering, such as chemical experiments.

Number of Answers

We also analyze how model performance changes with the number of answers. The results are shown in Figure 17. We find that for GPT-3, GPT-3.5-Turbo, CLIP zero-shot, and few-shot, the accuracy drops as the number of answers increases, but the accuracy of CLIP finetuning and multi-task does not drop. This implies that models after full training are actually solving the problem rather than guessing, so the number of choices does not affect the performance much.

Question Lengths

Figure 19 shows how the question length affects model accuracy. For GPT-3, GPT-3.5-Turbo and CLIP zero-shot, the accuracy decreases slightly as the question becomes longer. For tuned models, the same trend holds for questions shorter than 70 tokens, but the accuracy starts to increase for longer questions. We think this may be caused by some bias in longer questions that the tuned models learn and exploit to achieve higher accuracy. Since only a small proportion of questions are longer than 70 tokens, such bias does not affect the whole dataset much.

Question Type

We mark the types of problems as the first word in the question or request of each problem. In Figure 19 we show the accuracy of the top 10 frequent types. Questions starting with “What” and “How” have relatively low accuracy, as these questions are more difficult to answer.
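
The grouping by question type can be sketched as follows. The input format (pairs of question text and a correctness flag), the normalization of the first word, and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def accuracy_by_question_type(records, top_n=10):
    """Accuracy for the top_n most frequent question types.

    records: iterable of (question_text, is_correct) pairs.
    The question type is the first word of the question.
    """
    totals = Counter()
    correct = defaultdict(int)
    for question, is_correct in records:
        words = question.split()
        if not words:
            continue
        # Strip surrounding quotes and normalize the case of the first word.
        qtype = words[0].strip("\"'“”").capitalize()
        totals[qtype] += 1
        correct[qtype] += int(is_correct)
    return {qtype: correct[qtype] / n for qtype, n in totals.most_common(top_n)}


# Toy records (hypothetical questions and outcomes).
records = [
    ("What is shown?", False),
    ("What shape is this?", True),
    ("Which is larger?", True),
    ("How many sides?", False),
]
type_accuracy = accuracy_by_question_type(records)
```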

Grades

We show the model accuracy on each grade in Figure 22. There is no obvious performance drop as the grade level increases, which is similar to the trend of exam scores. This implies that the learning curve of neural models may differ from that of humans.

Correlation Between Exam Scores and Accuracy

We evaluate the correlation of exam scores with model accuracy and human accuracy (Figure 20). In general, they are positively correlated. Although the exam score differs from accuracy, it captures accuracy as an important factor overall.

Subject	Reason	Ratio (%)
Math	Commonsense	36
	Numerical calculation	24
	Counting	16
	Read table/graph	12
	Transformation	12
Science	Comparison	40
	Commonsense	32
	Direction	20
	Read table/graph	8

B.3 Error Analysis

To better understand the errors made by CLIP zero-shot, we sample 25 error cases of CLIP zero-shot on math and science and manually check the reasons for these errors. Table 5 shows the analysis results. For math, 36% of the errors are caused by a lack of mathematical commonsense, such as area formulas and symmetry. Other errors include calculation failures (24%), counting objects (16%), reading tables or graphs (12%, e.g., graphs of functions), and transformation (12%, e.g., rotation of a 3D object). For science, comparison causes the most errors, with a ratio of 40%. Most of these questions only require a straightforward comparison, such as comparing the distances between two pairs of magnets. However, CLIP fails on such basic problems, which indicates that it is not yet good at comparing objects and properties. Lacking science commonsense also leads to a considerable share of errors (32%), followed by identifying directions (20%, e.g., the directions of push and pull, towards and away) and reading tables or graphs (8%).
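
The ratios in Table 5 can be converted back to error counts with a short sketch. It assumes that 25 error cases were sampled for each subject, which is consistent with every ratio being a multiple of 4%; the text does not state this explicitly.

```python
SAMPLE_SIZE = 25  # assumed: 25 sampled error cases per subject

ERROR_RATIOS = {
    "Math": {"Commonsense": 36, "Numerical calculation": 24, "Counting": 16,
             "Read table/graph": 12, "Transformation": 12},
    "Science": {"Comparison": 40, "Commonsense": 32, "Direction": 20,
                "Read table/graph": 8},
}


def ratios_to_counts(ratios, sample_size=SAMPLE_SIZE):
    """Convert percentage ratios into error counts for a given sample size."""
    return {reason: round(pct * sample_size / 100) for reason, pct in ratios.items()}


counts = {subject: ratios_to_counts(r) for subject, r in ERROR_RATIOS.items()}
```

Under this assumption, the counts for each subject sum to 25; for example, 36% of the math errors corresponds to 9 cases and 40% of the science errors to 10 cases.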

Moreover, we show the top-5 skills with the most errors of fine-tuned models on the math and science subsets in Table 6 and Table 7, respectively.

B.4 Comparison with Human

Exam Score

We test exam scores on all skills in engineering and technology, and randomly choose 40 skills from math, and 30 skills from science due to technical and time constraints. We compare neural models with humans using the exam score, and the results are shown in Table 8. The detailed scores and skills are listed in Table 10.

Method		Exam Score				Accuracy
Method		Science	Engineering	Math	Technology	Science	Technology	Engineering	Math
Human		90.0	90.0	90.0	68.6	90.7	62.9	86.4	92.1
Random		26.7	16.1	51.1	25.0	38.3	25.0	40.0	36.8
GPT-3		45.7	50.2	51.4	22.1	48.4	21.3	65.2	42.4
GPT-3.5-Turbo		48.9	58.7	53.5	26.3	48.5	27.4	62.5	40.6
CLIP	Zero-Shot	33.9	19.0	52.9	68.7	53.8	60.7	65.5	44.3
	Few-Shot	39.1	43.9	67.6	70.9	77.3	59.7	55.5	67.8
	Finetuning	57.8	37.4	75.7	71.9	91.9	62.6	60.3	83.5
	Multi-Task	61.9	50.3	72.0	60.4	90.9	50.6	70.2	82.5

Accuracy

We randomly sample 20 problems for each subject, ask 7 Ph.D. students to answer them, and calculate the average accuracy for each subject. To evaluate neural models on these questions, we use the accuracy on the corresponding skill as the model's score on each sampled problem and average these scores to obtain the final score. We do not evaluate models on the sampled problems directly, since the small number of samples would lead to a large variance, which skill accuracy avoids. The comparison results are shown in Table 8 and Figure 22. All sampled problems are listed in Tables 12 to 17.
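
The model scoring rule described above can be sketched as follows. The skill names and accuracy values are hypothetical, and `proxy_score` is an illustrative name.

```python
def proxy_score(sampled_skills, skill_accuracy):
    """Average the skill-level accuracies of the skills the sampled problems belong to.

    sampled_skills: one skill name per sampled problem (repeats allowed).
    skill_accuracy: model accuracy (%) on the full test set of each skill.
    """
    return sum(skill_accuracy[s] for s in sampled_skills) / len(sampled_skills)


# Hypothetical skill-level accuracies and a sample of four problems.
skill_accuracy = {"skill-a": 80.0, "skill-b": 60.0, "skill-c": 40.0}
sampled_skills = ["skill-a", "skill-a", "skill-b", "skill-c"]
score = proxy_score(sampled_skills, skill_accuracy)  # (80 + 80 + 60 + 40) / 4 = 65.0
```

A skill that appears several times among the sampled problems contributes once per problem, so the score is weighted by the sample composition.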

B.5 Zero-Shot Prompt Sensitivity

We study the effect of prompts on CLIP zero-shot. We design 5 types of prompts and demonstrate them with an example problem. The example question is “Which property matches this object?” and the answer is “Rough”. Examples of the different prompt types and the corresponding accuracies are shown in Table 9. We observe that “Q+A” results in the best performance on average, but the difference is only marginal, meaning that CLIP zero-shot is not very sensitive to the prompt format.

Prompt Format	Example	Science	Technology	Engineering	Math	Average
Q+A	Which property matches this object? Rough.	50.3	68.7	55.1	43.6	54.4
A+Q	Rough. Which property matches this object?	50.0	66.0	49.6	43.2	52.2
Q “Choose the best answer:” A	Which property matches this object? Choose the best answer: Rough.	50.1	70.7	49.7	44.2	53.7
“Answer the question:” Q + A	Answer the question: Which property matches this object? Rough.	49.4	67.6	51.0	43.6	52.9
A “best answers the question” Q	Rough best answers the question: Which property matches this object?	49.7	69.5	50.8	43.8	53.4
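
The five prompt formats in Table 9 can be reproduced with a short sketch. The function name and dictionary keys are illustrative; the resulting strings match the examples in the table.

```python
def format_prompts(question, answer):
    """Build the five prompt variants of Table 9 for one question and answer option."""
    return {
        "Q+A": f"{question} {answer}.",
        "A+Q": f"{answer}. {question}",
        "Q 'Choose the best answer:' A": f"{question} Choose the best answer: {answer}.",
        "'Answer the question:' Q + A": f"Answer the question: {question} {answer}.",
        "A 'best answers the question' Q": f"{answer} best answers the question: {question}",
    }


prompts = format_prompts("Which property matches this object?", "Rough")
```

Each variant would be fed to the text encoder once per answer option, with the rest of the zero-shot procedure unchanged.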

Subject	Grade/Skill	Random	Zero-shot	Finetune
Science	grade-2/classify-matter-as-solid-liquid-or-gas	28	40	100
	grade-2/identify-animals-with-and-without-backbones	0	70	70
	grade-2/identify-mammals-birds-fish-reptiles-and-amphibians	0	0	18
	grade-2/identify-materials-in-objects	21	40	100
	grade-2/identify-properties-of-an-object	35	65	65
	grade-3/compare-strengths-of-magnetic-forces	0	18	63
	grade-3/describe-ecosystems	65	50	100
	grade-3/find-evidence-of-changes-to-earths-surface	17	38	100
	grade-3/identify-ecosystems	35	100	100
	grade-3/identify-minerals-using-properties	35	11	35
	grade-4/compare-properties-of-objects	10	17	20
	grade-4/describe-ecosystems	74	100	100
	grade-4/identify-minerals-using-properties	35	16	35
	grade-4/use-evidence-to-classify-mammals-birds-fish-reptiles-and-amphibians	26	35	35
	grade-5/animal-adaptations-beaks-mouths-and-necks	17	27	35
	grade-5/classify-elementary-substances-and-compounds-using-models	75	75	75
	grade-5/compare-ancient-and-modern-organisms-use-observations-to-support-a-hypothesis	32	32	50
	grade-5/identify-directions-of-forces	0	26	35
	grade-5/identify-the-photosynthetic-organism	0	0	100
	grade-5/predict-temperature-changes	0	22	0
	grade-5/use-evidence-to-classify-animals	35	35	35
	grade-5/use-evidence-to-classify-mammals-birds-fish-reptiles-and-amphibians	18	35	35
	grade-5/weather-and-climate-around-the-world	60	36	60
	grade-6/compare-concentrations-of-solutions	15	11	100
	grade-6/describe-the-effects-of-gene-mutations-on-organisms	52	13	69
	grade-6/diffusion-across-membranes	50	25	50
	grade-7/describe-the-effects-of-gene-mutations-on-organisms	42	13	69
	grade-8/classify-symbiotic-relationships	25	36	45
	grade-8/diffusion-across-membranes	0	18	35
	grade-8/moss-and-fern-life-cycles	0	12	0
Engineering	grade-6/evaluate-tests-of-engineering-design-solutions	0	0	100
	grade-6/identify-control-and-experimental-groups	0	0	0
	grade-6/identify-independent-and-dependent-variables	0	0	100
	grade-6/identify-the-experimental-question	30	30	30
	grade-7/evaluate-tests-of-engineering-design-solutions	0	0	0
	grade-7/identify-control-and-experimental-groups	0	0	40
	grade-7/identify-independent-and-dependent-variables	0	0	30
	grade-7/identify-the-experimental-question	40	0	40
	grade-8/identify-control-and-experimental-groups	0	0	0
	grade-8/identify-the-experimental-question	60	0	40
	grade-5/identify-laboratory-tools	21	42	31
	grade-6/identify-laboratory-tools	21	21	21
	grade-6/laboratory-safety-equipment	24	65	52
	grade-7/identify-laboratory-tools	10	28	21
	grade-7/laboratory-safety-equipment	9	58	52
	grade-8/identify-laboratory-tools	49	21	21
	grade-8/laboratory-safety-equipment	9	58	58
Math	algebra-2/factor-quadratics-using-algebra-tiles	40	51	55
	algebra-2/outliers-in-scatter-plots	55	47	97
	calculus/determine-continuity-using-graphs	36	63	80
	calculus/find-limits-at-vertical-asymptotes-using-graphs	60	65	85
	grade-1/subtraction-sentences-up-to-10-which-model-matches	50	30	99
	grade-2/identify-halves-thirds-and-fourths	65	75	97
	grade-2/identify-lines-of-symmetry	70	64	99
	grade-2/interpret-bar-graphs-ii	14	23	12
	grade-2/ordinal-numbers-up-to-10th	32	61	28
	grade-3/compare-fractions-in-recipes	55	50	68
	grade-3/identify-parallelograms	51	64	70
	grade-3/is-it-a-polygon	71	60	98
	grade-3/parallel-sides-in-quadrilaterals	29	66	45
	grade-4/nets-of-three-dimensional-figures	68	40	99
	grade-5/nets-of-three-dimensional-figures	53	40	99
	grade-6/changes-in-mean-median-mode-and-range	38	14	15
	grade-6/classify-triangles	47	38	45
	grade-6/identify-polyhedra	75	75	75
	grade-6/mean-median-mode-and-range-find-the-missing-number	55	41	99
	grade-6/model-and-solve-equations-using-algebra-tiles	36	36	57
	grade-6/rational-numbers-find-the-sign	31	78	99
	grade-6/rotational-symmetry	62	56	78
	grade-6/similar-and-congruent-figures	34	33	46
	grade-6/which-figure-is-being-described	36	27	86
	grade-7/rational-numbers-find-the-sign	47	58	99
	grade-8/rotational-symmetry-amount-of-rotation	47	32	63
	kindergarten/count-on-ten-frames-up-to-10	15	2	49
	kindergarten/fewer-and-more-up-to-20	80	62	97
	kindergarten/subtraction-sentences-up-to-5-which-model-matches	41	30	96
	pre-k/addition-sentences-up-to-10-which-model-matches	60	55	96
	pre-k/count-on-ten-frames-up-to-3	84	50	51
	pre-k/fewer-and-more-compare-by-matching	63	52	90
	pre-k/one-less-with-pictures-up-to-10	61	37	66
	pre-k/one-more-with-pictures-up-to-5	48	36	75
	pre-k/shapes-of-everyday-objects	67	96	96
	pre-k/spheres	67	96	96
	pre-k/triangles	57	75	75
	pre-k/what-comes-next	75	56	70
	pre-k/ordinal-numbers-up-to-tenth	27	84	82
	kindergarten/are-there-enough	40	99	96

B.6 Detailed Performance on Skills

We show the accuracy of neural models on all 448 skills in Figures 24 to 28. Zero-shot performance is generally better than random guessing on most skills and reaches nearly 100% on some skills (e.g., “circles” and “cones”). After finetuning, accuracy improves on most skills and approaches 100% on many of them.

B.7 VQA Results

Model	Accuracy
Zero-Shot CLIP	24.7%
Finetuning with Science	27.3%
Finetuning with Technology	26.5%
Finetuning with Engineering	24.8%
Finetuning with Math	24.9%

We evaluate the zero-shot CLIP model and the models finetuned on each subject on the VQA (Antol et al., 2015) dataset. Results are shown in Table 11. The average increase of the finetuned models over the zero-shot setting is 1.2 percentage points.
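
The reported average increase can be checked directly from the numbers in Table 11. The variable names are illustrative.

```python
zero_shot = 24.7
finetuned = {"Science": 27.3, "Technology": 26.5, "Engineering": 24.8, "Math": 24.9}

# Per-subject gain over zero-shot CLIP, in accuracy points.
gains = {subject: acc - zero_shot for subject, acc in finetuned.items()}
average_gain = sum(gains.values()) / len(gains)  # about 1.175, i.e., 1.2 after rounding
```

The gain comes mostly from the models finetuned on science and technology; the models finetuned on engineering and math stay within 0.2 points of zero-shot.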

Appendix C Additional Related Work

In addition to the vision-language foundation models included in the main text, we expand the discussion to some recent models, including BLIP-2 (Li et al., 2023), EVA-CLIP (Sun et al., 2023), and KOSMOS-2 (Peng et al., 2023). BLIP-2 provides a versatile and efficient pre-training strategy that enhances vision-language pre-training by utilizing frozen pre-trained image encoders and frozen large language models, while EVA-CLIP proposes a series of methods to increase the training efficiency of CLIP. KOSMOS-2 enables new capabilities for perceiving object descriptions. This work focuses on creating a dataset to evaluate multimodal STEM understanding, and we chose foundation models like CLIP for a pilot study on our dataset. There are also benchmarks targeting formal math reasoning (Zheng et al., 2022; Liu et al., 2023; Xiong et al., 2023b); however, they are all restricted to the text modality and cannot evaluate fundamental skills.

Appendix D Summary of Skills

We list all skills in STEM in Tables 18 to 20 and show some examples in Tables 21 to 27.

Subject	Grade	Skills
Science	grade-2	classify-fruits-and-vegetables-as-plant-parts, classify-matter-as-solid-liquid-or-gas, classify-matter-as-solid-or-liquid, classify-rocks-and-minerals-by-color-and-shape, compare-properties-of-materials, compare-properties-of-objects, compare-temperatures-on-thermometers, find-evidence-of-changes-to-earths-surface, identify-animals-with-and-without-backbones, identify-earth-s-land-features, identify-living-and-nonliving-things, identify-magnets-that-attract-or-repel, identify-mammals-birds-fish-reptiles-and-amphibians, identify-materials-in-objects, identify-plants-and-animals, identify-properties-of-an-object, identify-pushes-and-pulls, identify-solids-and-liquids, identify-solids-liquids-and-gases, identifying-mixtures, natural-resources, predict-heat-flow, read-a-thermometer
	grade-3	animal-adaptations-beaks-mouths-and-necks, animal-adaptations-feet-and-limbs, animal-adaptations-skins-and-body-coverings, classify-fruits-and-vegetables-as-plant-parts, classify-matter-as-solid-liquid-or-gas, classify-rocks-and-minerals-by-color-shape-and-texture, classify-rocks-as-igneous-sedimentary-or-metamorphic, compare-ancient-and-modern-organisms-use-observations-to-support-a-hypothesis, compare-properties-of-materials, compare-properties-of-objects, compare-strengths-of-magnetic-forces, compare-temperatures-on-thermometers, find-evidence-of-changes-to-earths-surface, how-do-balanced-and-unbalanced-forces-affect-motion, identify-earth-s-land-features, identify-ecosystems, identify-living-and-nonliving-things, identify-magnets-that-attract-or-repel, identify-mammals-birds-fish-reptiles-and-amphibians, identify-materials-in-objects, identify-minerals-using-properties, identify-plants-and-animals, identify-properties-of-an-object, identify-pushes-and-pulls, identify-rocks-using-properties, identify-roles-in-food-chains, identify-solids-liquids-and-gases, identify-vertebrates-and-invertebrates, interpret-food-webs, natural-resources, predict-heat-flow, predict-temperature-changes, read-a-thermometer, use-climate-data-to-make-predictions, use-data-to-describe-u-s-climates, use-data-to-describe-world-climates, weather-and-climate-around-the-world
	grade-4	animal-adaptations-beaks-mouths-and-necks, animal-adaptations-feet-and-limbs, animal-adaptations-skins-and-body-coverings, classify-fruits-and-vegetables-as-plant-parts, classify-rocks-as-igneous-sedimentary-or-metamorphic, compare-amplitudes-and-wavelengths-of-waves, compare-ancient-and-modern-organisms-use-observations-to-support-a-hypothesis, compare-properties-of-materials, compare-properties-of-objects, compare-strengths-of-magnetic-forces, compare-temperatures-on-thermometers, describe-classify-and-compare-kingdoms, evaluate-natural-energy-sources, how-do-balanced-and-unbalanced-forces-affect-motion, identify-and-classify-fossils, identify-and-sort-solids-liquids-and-gases, identify-common-and-scientific-names, identify-directions-of-forces, identify-earths-land-features-using-photographs, identify-earths-land-features-using-satellite-images, identify-ecosystems, identify-living-and-nonliving-things, identify-magnets-that-attract-or-repel, identify-mammals-birds-fish-reptiles-and-amphibians, identify-minerals-using-properties, identify-phases-of-the-moon, identify-rocks-using-properties, identify-roles-in-food-chains, identify-vertebrates-and-invertebrates, interpret-food-webs, origins-of-scientific-names, predict-heat-flow, predict-temperature-changes, read-a-thermometer, use-climate-data-to-make-predictions, use-data-to-describe-climates, use-evidence-to-classify-animals, use-evidence-to-classify-mammals-birds-fish-reptiles-and-amphibians, use-scientific-names-to-classify-organisms, weather-and-climate-around-the-world
	grade-5	animal-adaptations-beaks-mouths-and-necks, animal-adaptations-feet-and-limbs, animal-adaptations-skins-and-body-coverings, classify-elementary-substances-and-compounds-using-models, classify-fruits-and-vegetables-as-plant-parts, classify-rocks-as-igneous-sedimentary-or-metamorphic, compare-amplitudes-and-wavelengths-of-waves, compare-ancient-and-modern-organisms-use-observations-to-support-a-hypothesis, compare-magnitudes-of-magnetic-forces, compare-properties-of-objects, describe-classify-and-compare-kingdoms, evaluate-natural-energy-sources, flowering-plant-and-conifer-life-cycles, how-do-balanced-and-unbalanced-forces-affect-motion, identify-and-classify-fossils, identify-common-and-scientific-names, identify-directions-of-forces, identify-earths-land-features-using-photographs, identify-earths-land-features-using-satellite-images, identify-ecosystems, identify-magnets-that-attract-or-repel, identify-mammals-birds-fish-reptiles-and-amphibians, identify-phases-of-the-moon, identify-rocks-and-minerals, identify-roles-in-food-chains, identify-the-photosynthetic-organism, identify-vertebrates-and-invertebrates, match-chemical-formulas-to-ball-and-stick-models, moss-and-fern-life-cycles, origins-of-scientific-names, predict-heat-flow, predict-temperature-changes, use-data-to-describe-climates, use-evidence-to-classify-animals, use-evidence-to-classify-mammals-birds-fish-reptiles-and-amphibians, use-scientific-names-to-classify-organisms, weather-and-climate-around-the-world
	grade-6	analyze-data-to-compare-properties-of-planets, classify-elementary-substances-and-compounds-using-models, classify-rocks-as-igneous-sedimentary-or-metamorphic, classify-symbiotic-relationships, compare-ages-of-fossils-in-a-rock-sequence, compare-amplitudes-wavelengths-and-frequencies-of-waves, compare-concentrations-of-solutions, compare-magnitudes-of-magnetic-forces, compare-thermal-energy-transfers, describe-populations-communities-and-ecosystems, describe-tectonic-plate-boundaries-around-the-world, describe-the-effects-of-gene-mutations-on-organisms, diffusion-across-membranes, flowering-plant-and-conifer-life-cycles, identify-and-compare-air-masses, identify-common-and-scientific-names, identify-earths-land-features-using-photographs, identify-earths-land-features-using-satellite-images, identify-ecosystems, identify-elementary-substances-and-compounds-using-models, identify-how-particle-motion-affects-temperature-and-pressure, identify-phases-of-the-moon, identify-rocks-and-minerals, identify-the-photosynthetic-organism, match-chemical-formulas-to-ball-and-stick-models, moss-and-fern-life-cycles, origins-of-scientific-names, predict-heat-flow-and-temperature-changes, use-data-to-describe-climates, use-scientific-names-to-classify-organisms, weather-and-climate-around-the-world
	grade-7	analyze-data-to-compare-properties-of-planets, angiosperm-and-conifer-life-cycles, classify-elementary-substances-and-compounds-using-models, classify-rocks-as-igneous-sedimentary-or-metamorphic, classify-symbiotic-relationships, compare-ages-of-fossils-in-a-rock-sequence, compare-amplitudes-wavelengths-and-frequencies-of-waves, compare-concentrations-of-solutions, compare-magnitudes-of-magnetic-forces, compare-thermal-energy-transfers, describe-populations-communities-and-ecosystems, describe-tectonic-plate-boundaries-around-the-world, describe-the-effects-of-gene-mutations-on-organisms, diffusion-across-membranes, identify-and-compare-air-masses, identify-chemical-formulas-for-ball-and-stick-models, identify-common-and-scientific-names, identify-ecosystems, identify-how-particle-motion-affects-temperature-and-pressure, identify-phases-of-the-moon, identify-rocks-and-minerals, identify-the-photosynthetic-organism, moss-and-fern-life-cycles, origins-of-scientific-names, predict-heat-flow-and-temperature-changes, use-data-to-describe-climates, use-scientific-names-to-classify-organisms
	grade-8	analyze-data-to-compare-properties-of-planets, angiosperm-and-conifer-life-cycles, classify-elementary-substances-and-compounds-using-models, classify-symbiotic-relationships, compare-ages-of-fossils-in-a-rock-sequence, compare-amplitudes-wavelengths-and-frequencies-of-waves, compare-concentrations-of-solutions, compare-magnitudes-of-magnetic-forces, compare-thermal-energy-transfers, describe-populations-communities-and-ecosystems, describe-tectonic-plate-boundaries-around-the-world, describe-the-effects-of-gene-mutations-on-organisms, diffusion-across-membranes, identify-and-compare-air-masses, identify-chemical-formulas-for-ball-and-stick-models, identify-common-and-scientific-names, identify-ecosystems, identify-how-particle-motion-affects-temperature-and-pressure, identify-phases-of-the-moon, identify-rocks-and-minerals, identify-the-photosynthetic-organism, moss-and-fern-life-cycles, origins-of-scientific-names, predict-heat-flow-and-temperature-changes, use-data-to-describe-climates, use-punnett-squares-to-calculate-probabilities-of-offspring-types, use-punnett-squares-to-calculate-ratios-of-offspring-types, use-scientific-names-to-classify-organisms
Technology	-	cables, font, icons, logo, parts, peripherals, photo, web, others
Engineering	grade-5	identify-laboratory-tools
	grade-6	evaluate-tests-of-engineering-design-solutions, identify-control-and-experimental-groups, identify-independent-and-dependent-variables, identify-laboratory-tools, identify-the-experimental-question, laboratory-safety-equipment
	grade-7	evaluate-tests-of-engineering-design-solutions, identify-control-and-experimental-groups, identify-independent-and-dependent-variables, identify-laboratory-tools, identify-the-experimental-question, laboratory-safety-equipment
	grade-8	identify-control-and-experimental-groups, identify-laboratory-tools, identify-the-experimental-question, laboratory-safety-equipment

Subject	Grade	Skills
Math	algebra-1	compare-linear-functions-graphs-and-equations, compare-linear-functions-tables-graphs-and-equations, describe-linear-and-exponential-growth-and-decay, domain-and-range-of-absolute-value-functions-graphs, domain-and-range-of-exponential-functions-graphs, domain-and-range-of-square-root-functions-graphs, factor-quadratics-using-algebra-tiles, identify-direct-variation-and-inverse-variation, identify-functions, identify-functions-vertical-line-test, identify-linear-and-exponential-functions-from-graphs, identify-linear-and-exponential-functions-from-tables, identify-linear-functions-from-graphs-and-equations, identify-linear-functions-from-tables, identify-linear-quadratic-and-exponential-functions-from-graphs, identify-linear-quadratic-and-exponential-functions-from-tables, identify-proportional-relationships, interpret-a-scatter-plot, interpret-the-slope-and-y-intercept-of-a-linear-function, linear-functions-over-unit-intervals, match-exponential-functions-and-graphs-ii, model-and-solve-linear-equations-using-algebra-tiles, multiply-two-binomials-using-algebra-tiles, perimeter-and-area-changes-in-scale, perimeter-area-and-volume-changes-in-scale, special-right-triangles, surface-area-and-volume-changes-in-scale, write-compound-inequalities-from-graphs
	algebra-2	classify-variation, describe-linear-and-exponential-growth-and-decay, domain-and-range-of-absolute-value-functions-graphs, domain-and-range-of-exponential-and-logarithmic-functions, domain-and-range-of-radical-functions, factor-quadratics-using-algebra-tiles, find-inverse-functions-and-relations, find-solutions-using-a-table, graphs-of-angles, identify-the-direction-a-parabola-opens, linear-functions-over-unit-intervals, match-exponential-functions-and-graphs, outliers-in-scatter-plots, solve-a-triangle
	calculus	describe-linear-and-exponential-growth-and-decay, determine-continuity-on-an-interval-using-graphs, determine-continuity-using-graphs, determine-one-sided-continuity-using-graphs, domain-and-range, domain-and-range-of-exponential-and-logarithmic-functions, find-inverse-functions-and-relations, find-limits-at-vertical-asymptotes-using-graphs, identify-functions, identify-graphs-of-continuous-functions

Subject	Grade	Skills
Math	grade-1	addition-sentences-up-to-10-what-does-the-model-show, addition-sentences-up-to-10-which-model-matches, addition-sentences-using-number-lines-sums-up-to-20, am-or-pm, certain-probable-unlikely-and-impossible, compare-clocks, compare-money-amounts, compare-objects-length-and-height, compare-sides-and-corners, compare-size-weight-and-capacity, compare-vertices-edges-and-faces, comparing-review, count-sides-and-corners, count-to-fill-a-ten-frame, cubes-and-rectangular-prisms, equal-sides, estimate-to-the-nearest-ten, even-or-odd, find-the-next-shape-in-a-growing-pattern, find-the-next-shape-in-a-pattern, flip-turn-and-slide, holds-more-or-less, identify-faces-of-three-dimensional-shapes, identify-fourths, identify-halves, identify-halves-and-fourths, identify-halves-thirds-and-fourths, identify-shapes-traced-from-solids, identify-thirds, interpret-bar-graphs-ii, light-and-heavy, match-analog-and-digital-clocks, match-analog-clocks-and-times, match-digital-clocks-and-times, more-less-and-equally-likely, name-the-three-dimensional-shape, name-the-two-dimensional-shape, names-and-values-of-all-coins, names-and-values-of-common-coins, open-and-closed-shapes, ordinal-numbers, purchases-do-you-have-enough-money, read-a-calendar, read-a-calendar-ii, rhombuses, select-three-dimensional-shapes, select-two-dimensional-shapes, shapes-of-everyday-objects, simple-fractions-what-fraction-does-the-shape-show, square-corners, subtraction-sentences-up-to-10-which-model-matches, subtraction-sentences-using-number-lines-up-to-10, subtraction-sentences-using-number-lines-up-to-20, symmetry, time-and-clocks-word-problems, times-of-everyday-events, two-dimensional-and-three-dimensional-shapes, which-bar-graph-is-correct, which-picture-graph-is-correct, which-table-is-correct, which-tally-chart-is-correct, wide-and-narrow
	grade-2	am-or-pm, certain-probable-unlikely-and-impossible, choose-the-appropriate-measuring-tool, compare-clocks, compare-sides-and-vertices, compare-vertices-edges-and-faces, correct-amount-of-change, cubes, equal-sides, equivalent-amounts-of-money-up-to-1-dollar, estimate-to-the-nearest-ten, even-or-odd, find-the-next-shape-in-a-growing-pattern, find-the-next-shape-in-a-repeating-pattern, flip-turn-and-slide, fractions-of-a-group, fractions-of-a-whole-modeling-word-problems, greatest-and-least-word-problems-up-to-100, greatest-and-least-word-problems-up-to-1000, how-much-more-to-make-a-dollar, identify-faces-of-three-dimensional-shapes, identify-fourths, identify-halves, identify-halves-thirds-and-fourths, identify-lines-of-symmetry, identify-multiplication-sentences-for-equal-groups, identify-repeated-addition-in-arrays-sums-to-10, identify-repeated-addition-in-arrays-sums-to-25, identify-shapes-traced-from-solids, identify-the-fraction, identify-thirds, interpret-bar-graphs-ii, interpret-tally-charts, match-addition-sentences-and-models-sums-to-10, match-analog-and-digital-clocks, match-analog-clocks-and-times, match-digital-clocks-and-times, more-less-and-equally-likely, name-the-three-dimensional-shape, name-the-two-dimensional-shape, names-and-values-of-all-coins, names-and-values-of-common-coins, ordinal-numbers-up-to-10th, place-value-models-up-to-hundreds, place-value-tens-and-ones, place-value-up-to-hundreds, place-value-up-to-thousands, purchases-do-you-have-enough-money-up-to-1-dollar, purchases-do-you-have-enough-money-up-to-5-dollars, read-a-calendar, read-a-calendar-ii, read-a-thermometer, select-figures-with-a-given-area, select-three-dimensional-shapes, shapes-of-everyday-objects, skip-counting-stories, symmetry, which-bar-graph-is-correct, which-picture-shows-more-up-to-5-dollars, which-shape-illustrates-the-fraction, which-table-is-correct, which-tally-chart-is-correct, write-subtraction-sentences-to-describe-pictures-up-to-18, write-subtraction-sentences-to-describe-pictures-up-to-two-digits
	grade-3	acute-obtuse-and-right-triangles, am-or-pm, angles-greater-than-less-than-or-equal-to-a-right-angle, certain-probable-unlikely-and-impossible, choose-the-appropriate-measuring-tool, compare-area-and-perimeter-of-two-figures, compare-fractions-in-recipes, compare-fractions-using-models, compare-fractions-using-number-lines, coordinate-planes-as-maps, correct-amount-of-change, division-input-output-tables-find-the-rule, find-the-next-shape-in-a-pattern, fractions-of-a-group-denominators-2-3-4-6-8, fractions-of-a-group-unit-fractions, identify-equivalent-fractions-on-number-lines, identify-faces-of-three-dimensional-shapes, identify-multiplication-expressions-for-arrays, identify-multiplication-expressions-for-equal-groups, identify-parallelograms, identify-rhombuses, identify-three-dimensional-shapes, identify-trapezoids, identify-two-dimensional-shapes, identify-unit-fractions-on-number-lines, interpret-line-graphs, is-it-a-polygon, lines-line-segments-and-rays, match-analog-and-digital-clocks, match-clocks-and-times, match-fractions-to-models-halves-thirds-and-fourths, match-mixed-numbers-to-models, multiplication-input-output-tables-find-the-rule, open-and-closed-shapes, parallel-perpendicular-and-intersecting-lines, parallel-sides-in-quadrilaterals, purchases-do-you-have-enough-money-up-to-10-dollars, read-a-calendar, read-a-thermometer, reading-schedules, reflection-rotation-and-translation, scalene-isosceles-and-equilateral-triangles, select-figures-with-a-given-area, select-fractions-equivalent-to-whole-numbers-using-models, shapes-of-everyday-objects, symmetry, which-picture-shows-more
	grade-4	acute-obtuse-and-right-triangles, acute-right-obtuse-and-straight-angles, angles-as-fractions-of-a-circle, angles-of-90-180-270-and-360-degrees, classify-triangles, compare-area-and-perimeter-of-two-figures, compare-decimals-using-models, compare-fractions-in-recipes, compare-fractions-using-models, compare-fractions-with-like-numerators-or-denominators-using-models, decompose-fractions-into-unit-fractions-using-models, elapsed-time, estimate-angle-measurements, find-the-next-shape-in-a-pattern, fractions-of-a-whole-word-problems, identify-equivalent-fractions-using-number-lines, identify-faces-of-three-dimensional-figures, identify-lines-of-symmetry, identify-parallel-perpendicular-and-intersecting-lines, identify-parallelograms, identify-rhombuses, identify-three-dimensional-figures, identify-trapezoids, interpret-bar-graphs, interpret-stem-and-leaf-plots, is-it-a-polygon, measure-angles-with-a-protractor, multiplication-input-output-tables-find-the-rule, multiply-fractions-by-whole-numbers-using-models, multiply-unit-fractions-by-whole-numbers-using-models, nets-of-three-dimensional-figures, parallel-perpendicular-and-intersecting-lines, parallel-sides-in-quadrilaterals, points-lines-line-segments-rays-and-angles, properties-of-three-dimensional-figures, rotational-symmetry, scalene-isosceles-and-equilateral-triangles, sides-and-angles-of-quadrilaterals, transportation-schedules, what-decimal-number-is-illustrated
	grade-5	acute-obtuse-and-right-triangles, adjust-a-budget, angles-of-90-180-270-and-360-degrees, classify-triangles, compare-decimals-using-grids, compare-fractions-and-mixed-numbers, compare-patterns, fractions-of-a-whole-word-problems, identify-parallelograms, identify-rhombuses, identify-three-dimensional-figures, identify-trapezoids, interpret-bar-graphs, is-it-a-polygon, line-symmetry, mean-find-the-missing-number, median-find-the-missing-number, multiplication-input-output-tables-find-the-rule, multiply-unit-fractions-by-whole-numbers-using-models, multiplying-fractions-by-whole-numbers-choose-the-model, nets-of-three-dimensional-figures, parallel-perpendicular-and-intersecting-lines, parallel-sides-in-quadrilaterals, parts-of-a-circle, points-lines-line-segments-rays-and-angles, range-find-the-missing-number, reflection-rotation-and-translation, regular-and-irregular-polygons, rotational-symmetry, rotational-symmetry-amount-of-rotation, scalene-isosceles-and-equilateral-triangles, three-dimensional-figures-viewed-from-different-perspectives, types-of-angles, understanding-probability
	grade-6	absolute-value-and-integers-word-problems, changes-in-mean-median-mode-and-range, classify-rational-numbers-using-a-diagram, classify-triangles, compare-and-order-rational-numbers-using-number-lines, compare-area-and-perimeter-of-two-figures, compare-checking-accounts, front-side-and-top-view, identify-complementary-supplementary-vertical-adjacent-and-congruent-angles, identify-equivalent-expressions-using-strip-models, identify-polyhedra, identify-trapezoids, interpret-bar-graphs, interpret-double-bar-graphs, interpret-graphs-of-proportional-relationships, interpret-histograms, line-symmetry, mean-median-mode-and-range-find-the-missing-number, model-and-solve-equations-using-algebra-tiles, nets-of-three-dimensional-figures, occupations-education-and-income, quadrants, rational-numbers-find-the-sign, reflection-rotation-and-translation, rotational-symmetry, rotational-symmetry-amount-of-rotation, similar-and-congruent-figures, understanding-area-of-a-triangle, understanding-area-of-trapezoids, understanding-percents-strip-models, which-figure-is-being-described, which-is-the-better-coupon, which-model-represents-the-ratio
	grade-7	apply-addition-and-subtraction-rules, apply-multiplication-and-division-rules, bases-of-three-dimensional-figures, changes-in-mean-median-mode-and-range, classify-quadrilaterals, classify-rational-numbers-using-a-diagram, compare-and-order-integers, cross-sections-of-three-dimensional-figures, describe-a-sequence-of-transformations, front-side-and-top-view, identify-alternate-interior-and-alternate-exterior-angles, identify-complementary-supplementary-vertical-and-adjacent-angles, identify-equivalent-linear-expressions-using-algebra-tiles, identify-linear-and-nonlinear-functions, identify-reflections-rotations-and-translations, identify-trapezoids, identify-trends-with-scatter-plots, interpret-circle-graphs, interpret-graphs-of-proportional-relationships, line-symmetry, make-predictions-with-scatter-plots, mean-median-mode-and-range-find-the-missing-number, model-and-solve-equations-using-algebra-tiles, nets-of-three-dimensional-figures, parallel-perpendicular-and-intersecting-lines, parts-of-a-circle, perimeter-and-area-changes-in-scale, rational-numbers-find-the-sign, rotational-symmetry, rotational-symmetry-amount-of-rotation, similar-and-congruent-figures, simplify-expressions-by-combining-like-terms-with-algebra-tiles, transversals-of-parallel-lines-name-angle-pairs, which-is-the-better-coupon
	grade-8	angle-angle-criterion-for-similar-triangles, apply-addition-and-subtraction-rules, apply-addition-subtraction-multiplication-and-division-rules, apply-multiplication-and-division-rules, base-plans, changes-in-mean-median-mode-and-range, classify-quadrilaterals, compare-and-order-integers, compare-linear-functions-graphs-and-equations, compare-linear-functions-tables-graphs-and-equations, congruent-triangles-sss-sas-and-asa, describe-a-sequence-of-transformations, front-side-and-top-view, identify-alternate-interior-and-alternate-exterior-angles, identify-complementary-supplementary-vertical-adjacent-and-congruent-angles, identify-congruent-figures, identify-functions-graphs, identify-linear-and-nonlinear-functions-graphs-and-equations, identify-linear-and-nonlinear-functions-tables, identify-lines-of-best-fit, identify-reflections-rotations-and-translations, identify-similar-triangles, identify-trapezoids, identify-trends-with-scatter-plots, interpret-graphs-of-proportional-relationships, interpret-the-slope-and-y-intercept-of-a-linear-function, irrational-numbers-on-number-lines, line-symmetry, make-predictions-with-scatter-plots, mean-median-mode-and-range-find-the-missing-number, model-and-solve-equations-using-algebra-tiles, multiply-polynomials-using-algebra-tiles, nets-of-three-dimensional-figures, parts-of-a-circle, parts-of-three-dimensional-figures, perimeter-and-area-changes-in-scale, quadrants-and-axes, rotational-symmetry, rotational-symmetry-amount-of-rotation, similar-and-congruent-figures, transversals-of-parallel-lines-name-angle-pairs
	kindergarten	addition-sentences-up-to-10-what-does-the-model-show, addition-sentences-up-to-10-which-model-matches, addition-sentences-up-to-5-what-does-the-model-show, addition-sentences-up-to-5-which-model-matches, am-or-pm, are-there-enough, circles, classify-shapes-by-color, coin-names-penny-through-quarter, compare-sides-and-corners, compare-size-weight-and-capacity, compare-two-groups-of-coins-pennies-through-dimes, cones, count-corners, count-cubes-up-to-10, count-cubes-up-to-5, count-dots-0-to-5, count-dots-up-to-10, count-money-pennies-and-nickels, count-money-pennies-through-dimes, count-on-ten-frames-up-to-10, count-pictures-up-to-10, count-pictures-up-to-3, count-pictures-up-to-5, count-scattered-shapes-up-to-10, count-scattered-shapes-up-to-5, count-shapes-in-rings-up-to-10, count-shapes-in-rows-up-to-10, count-shapes-in-rows-up-to-5, count-shapes-up-to-3, count-sides, count-sides-and-corners, count-to-100, count-to-fill-a-ten-frame, cubes, curved-parts, cylinders, different, equal-sides, fewer-and-more-compare-by-counting, fewer-and-more-compare-by-matching, fewer-and-more-compare-in-a-mixed-group, fewer-and-more-up-to-20, fewer-more-and-same, flat-and-solid-shapes, hexagons, holds-more-or-less, identify-halves-thirds-fourths, identify-pictures-with-symmetry, identify-shapes-traced-from-solids, inside-and-outside, introduction-to-symmetry, light-and-heavy, match-analog-and-digital-clocks, match-analog-clocks-and-times, match-digital-clocks-and-times, more-or-less-likely, name-the-three-dimensional-shape, name-the-two-dimensional-shape, one-less-with-pictures-up-to-10, one-less-with-pictures-up-to-5, one-more-and-one-less-with-pictures-up-to-10, one-more-with-pictures-up-to-10, one-more-with-pictures-up-to-5, ordinal-numbers-up-to-fifth, ordinal-numbers-up-to-tenth, rectangles, represent-numbers-up-to-10, represent-numbers-up-to-20, represent-numbers-with-pictures-up-to-3, represent-numbers-with-pictures-up-to-5, represent-numbers-with-shapes-up-to-3, represent-numbers-with-shapes-up-to-5, select-three-dimensional-shapes, select-two-dimensional-shapes, shapes-of-everyday-objects, spheres, square-corners, squares, subtraction-sentences-up-to-10-what-does-the-model-show, subtraction-sentences-up-to-10-which-model-matches, subtraction-sentences-up-to-5-what-does-the-model-show, subtraction-sentences-up-to-5-which-model-matches, take-apart-10-words, take-apart-numbers-up-to-10-words, take-apart-numbers-up-to-5-words, tall-and-short, times-of-everyday-events, triangles, wide-and-narrow
	pre-k	addition-sentences-up-to-10-what-does-the-model-show, addition-sentences-up-to-10-which-model-matches, addition-sentences-up-to-5-what-does-the-model-show, addition-sentences-up-to-5-which-model-matches, are-there-enough, circles, circles-squares-and-triangles, circles-squares-triangles-and-rectangles, classify-shapes-by-color, compare-size-weight-and-capacity, cones, count-corners, count-cubes-up-to-10, count-cubes-up-to-5, count-dots-up-to-10, count-dots-up-to-3, count-dots-up-to-5, count-on-ten-frames-up-to-10, count-on-ten-frames-up-to-3, count-on-ten-frames-up-to-5, count-pennies, count-pictures-up-to-10, count-pictures-up-to-3, count-pictures-up-to-5, count-scattered-shapes-up-to-10, count-scattered-shapes-up-to-5, count-shapes-in-rings-up-to-10, count-shapes-in-rows-up-to-10, count-shapes-in-rows-up-to-5, count-shapes-up-to-3, count-sides, count-sides-and-corners, cubes, cylinders, different, dimes-and-quarters, fewer, fewer-and-more-compare-by-counting, fewer-and-more-compare-by-matching, fewer-and-more-compare-in-a-mixed-group, fewer-more-and-same, flat-and-solid-shapes, holds-more-or-less, identify-shapes-traced-from-solids, inside-and-outside, light-and-heavy, more, name-the-shape, name-the-solid-shape, one-less-with-pictures-up-to-10, one-less-with-pictures-up-to-5, one-more-with-pictures-up-to-10, one-more-with-pictures-up-to-5, ordinal-numbers-up-to-fifth, ordinal-numbers-up-to-tenth, pennies-and-nickels, pennies-nickels-dimes-and-quarters, rectangles, represent-numbers-up-to-10, represent-numbers-up-to-20, represent-numbers-with-pictures-up-to-3, represent-numbers-with-pictures-up-to-5, represent-numbers-with-shapes-up-to-3, represent-numbers-with-shapes-up-to-5, select-solid-shapes, shapes-of-everyday-objects, spheres, squares, subtraction-sentences-up-to-10-which-model-matches, subtraction-sentences-up-to-5-which-model-matches, tall-and-short, tally-marks-up-to-10, triangles, what-comes-next, wide-and-narrow
	precalculus	determine-continuity-on-an-interval-using-graphs, determine-continuity-using-graphs, determine-one-sided-continuity-using-graphs, find-limits-at-vertical-asymptotes-using-graphs, identify-graphs-of-continuous-functions, outliers-in-scatter-plots, solve-a-triangle

Skill	Error Rate	Example
greatest-and-least-word-problems-up-to-100	76.8%	Description: The school district compared how many swings each elementary school has. Which school has the fewest swings? Picture: Choices: [Shoreline Elementary, Hillside Elementary, Valley Elementary, Lincoln Elementary] Answer index: 2 Prediction: 0
greatest-and-least-word-problems-up-to-1000	76.0%	Description: Paul kept a log of how many minutes he spent practicing ice skating over the past 4 days. On which day did Paul practice the least? Picture: Choices: [Tuesday, Wednesday, Thursday, Friday] Answer index: 3 Prediction: 2
reading-schedules	75.0%	Description: Look at the following schedule: Which meeting ends at 12:00 P.M.? Picture: Choices: [the city council meeting, the construction permit meeting, the parking meter meeting, the police meeting] Answer index: 0 Prediction: 2
angles-of-90-180-270-and-360-degrees	73.8%	Description: What fraction of a turn is this angle? Picture: Choices: [3/4, 1 full turn, 1/2, 1/4] Answer index: 2 Prediction: 3
points-lines-line-segments-rays-and-angles	73.8%	Description: What is this? Picture: Choices: [a line segment, a ray, a line, a point] Answer index: 1 Prediction: 0

Skill	Error Rate	Example
use-punnett-squares-to-calculate-ratios-of-offspring-types	69.10%	Description: This passage describes the antenna type trait in fruit flies: Most fruit flies have a pair of antennae on their head. But, some flies appear to have an extra pair of legs on their head instead! These flies have a mutation, or change, in a gene that affects body development. This mutation makes the cells in the fly’s head form mutated antennae that are like legs. In a group of fruit flies, some individuals have mutated antennae and others have normal antennae. In this group, the gene for the antenna type trait has two alleles. The allele for normal antennae (a) is recessive to the allele for mutated antennae (A). This Punnett square shows a cross between two fruit flies. What is the expected ratio of offspring with normal antennae to offspring with mutated antennae? Choose the most likely ratio. Picture: Choices: [0:4, 3:1, 2:2, 1:3, 4:0] Answer index: 0 Prediction: 3
use-punnett-squares-to-calculate-probabilities-of-offspring-types	60.10%	Description: In a group of tomato plants, some individuals have smooth fruit and others have fuzzy fruit. In this group, the gene for the fruit texture trait has two alleles. The allele for smooth fruit (F) is dominant over the allele for fuzzy fruit (f). This Punnett square shows a cross between two tomato plants. What is the probability that a tomato plant produced by this cross will be homozygous recessive for the fruit texture gene? Picture: Choices: [0/4, 1/4, 2/4, 3/4, 4/4] Answer index: 0 Prediction: 3
predict-temperature-changes	55.00%	Description: Two identical blocks are heated to different temperatures. The blocks are placed so that they touch, and heat begins to flow between the blocks. The pair of blocks is insulated, so no energy escapes. Later, the temperature of each block is measured again. Which pair of temperatures is possible? Picture: Choices: Answer index: 1 Prediction: 0
identify-magnets-that-attract-or-repel	21.10%	Description: Two magnets are placed as shown. Hint: Magnets that attract pull together. Magnets that repel push apart. Picture: Choices: [attract, repel] Answer index: 1 Prediction: 0
predict-heat-flow	16.20%	Description: Two solid blocks are at different temperatures. The blocks are touching. Which picture shows how heat will move? Picture: None Choices: Answer index: 0 Prediction: 1
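
The Error Rate column in the tables above can be read as the share of a skill's test questions where the predicted choice index differs from the answer index. The following is a minimal sketch of that per-skill computation, not the authors' evaluation code: the record layout (skill, answer index, prediction) and the function name error_rate_by_skill are assumptions made for illustration, and the sample records are toy values rather than data from the benchmark.

```python
from collections import defaultdict


def error_rate_by_skill(records):
    """Return {skill: error rate in percent}.

    `records` is an iterable of (skill, answer_index, prediction) tuples.
    This layout is hypothetical; it mirrors the "Answer index" and
    "Prediction" fields shown in the example column.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for skill, answer_index, prediction in records:
        totals[skill] += 1
        # A question counts as an error when the predicted index is wrong.
        if prediction != answer_index:
            errors[skill] += 1
    return {skill: 100.0 * errors[skill] / totals[skill] for skill in totals}


# Toy records for illustration only (not taken from the dataset).
records = [
    ("reading-schedules", 0, 2),
    ("reading-schedules", 1, 1),
    ("predict-heat-flow", 0, 1),
    ("predict-heat-flow", 0, 0),
    ("predict-heat-flow", 2, 2),
    ("predict-heat-flow", 1, 1),
]

rates = error_rate_by_skill(records)
# List skills from highest to lowest error rate, as in the tables.
for skill, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{skill}\t{rate:.1f}%")
```

Sorting by descending error rate reproduces the ordering used in the tables, so the hardest skills appear first.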