Measuring Vision-Language STEM Skills of Neural Models (2024)

Model                                     Science  Technology  Engineering  Math  Average
Random Guesses                            38.6     25.0        44.9         39.1  36.9

Language Models
GloVe (Pennington et al., 2014)           38.0     25.2        48.1         39.0  37.6
UnifiedQA Small (Khashabi et al., 2020)   39.6     27.2        58.0         39.6  41.1
UnifiedQA Base (Khashabi et al., 2020)    42.6     28.8        55.4         40.0  41.7
GPT-3 (Brown et al., 2020)                47.1     22.1        73.5         44.0  46.7
GPT-3.5-Turbo                             50.1     26.3        74.6         45.0  49.0

Vision-Language Models
Virtex (Desai & Johnson, 2021)            37.5     24.0        48.1         38.9  37.1
12-in-1 (Lu et al., 2020)                 39.4     27.5        44.2         41.9  38.3
ViLBERT (Lu et al., 2019)                 39.0     32.1        44.2         42.7  39.5
UNITER (Chen et al., 2020b)               50.8     34.6        55.1         43.2  45.9
CLIP (Radford et al., 2021)
  RN50                                    47.8     64.4        55.8         43.6  52.9
  RN101                                   50.3     65.3        46.7         43.7  51.5
  RN50x4                                  48.8     69.2        49.4         44.1  52.9
  RN50x16                                 49.8     66.1        51.4         44.3  52.9
  RN50x64                                 50.9     70.0        55.5         43.2  54.9
  ViT-B/32                                48.3     63.7        59.5         42.8  53.6
  ViT-B/16                                48.6     65.9        47.2         43.6  51.3
  ViT-L/14                                49.8     68.6        54.3         43.1  54.0
  ViT-L/14@336px                          50.3     68.7        55.1         43.6  54.4
  + Finetuning                            87.0     71.9        67.7         78.4  76.3

3 Experiments

In this section, we show the performance of a wide set of neural models as well as humans on STEM. The results show that state-of-the-art foundation models like CLIP and GPT-3.5-Turbo still underperform general elementary students. The details of the experimental setup, additional results and analysis are described in the appendix.

3.1 Main Results

Zero-Shot

The results are shown in Table 2.4. We first test language models to see whether models that only understand text are proficient at the multimodal skills in STEM. GloVe has near random-chance accuracy. This means that STEM cannot be solved by simply matching the semantic similarity between the text of questions and answers. UnifiedQA does slightly better than GloVe, with an average improvement of up to 4.1 percentage points (UnifiedQA Base). GPT-3.5-Turbo performs the best among these language models, reaching 49.0% accuracy on average. Both foundation models (GPT-3.5-Turbo and GPT-3) perform well in engineering, mainly because engineering practices are mostly described in text (see Figure 1(a)(iii)). Recent advancements in large language models have dramatically improved text understanding capabilities. However, large language models still struggle in the other subjects. This implies that understanding both vision and language information is essential to STEM skills.

Next, we examine vision-language models. The performance of Virtex, 12-in-1, and ViLBERT is close to that of random guesses. These models capture very limited knowledge of STEM subjects. On the other hand, UNITER and CLIP show significant improvements over random-chance accuracy. Specifically, CLIP-RN50x64 achieves the best result on STEM, an improvement of 18.0 percentage points over random guesses. Notably, CLIP-RN50x64 outperforms GPT-3.5-Turbo by 5.9 percentage points. This shows that CLIP has a basic understanding of multimodal STEM skills, and its vision understanding ability certainly contributes to this performance. Among all subjects, we see only marginal improvements in math. This applies to all foundation models and implies that math is the most challenging subject for current neural models. Novel algorithmic advancements that enable strong reasoning ability are necessary to solve math problems.

Finetuning

The results are shown in Table 2.4. Encouragingly, finetuning CLIP ViT-L/14@336px significantly boosts performance on science and math, by more than 30 percentage points on average over its zero-shot setting (36.7 on science and 34.8 on math). The improvement on the other subjects averages 7.9 percentage points, which is much smaller. While having a large amount of training data helps to some extent, the finetuning performance is still far behind that of an average elementary student (the human-level performance is presented in Sec. 3.3). This indicates that more fundamental advancements are required to solve the questions in the STEM dataset. For simplicity, we use CLIP to denote CLIP ViT-L/14@336px in the rest of this section.
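The finetuning gains quoted above can be recomputed from the zero-shot and finetuned rows of the results table. The following is a minimal sketch; the variable and function names are illustrative, and the accuracy values are copied from the table.

```python
# Accuracy (%) of CLIP ViT-L/14@336px, copied from the results table.
# The dictionary and function names are illustrative only.
zero_shot = {"science": 50.3, "technology": 68.7, "engineering": 55.1, "math": 43.6}
finetuned = {"science": 87.0, "technology": 71.9, "engineering": 67.7, "math": 78.4}

# Per-subject gain in percentage points (rounded to one decimal, like the table).
gain = {s: round(finetuned[s] - zero_shot[s], 1) for s in zero_shot}


def mean_gain(subjects):
    """Average gain (percentage points) over the given subjects."""
    return sum(gain[s] for s in subjects) / len(subjects)


sci_math = mean_gain(["science", "math"])
tech_eng = mean_gain(["technology", "engineering"])
print(gain)
print(round(sci_math, 2), round(tech_eng, 2))
```

The gap between the two averages is what the paragraph refers to: finetuning helps science and math far more than technology and engineering.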

3.2 Results Analysis

Skills

As STEM provides a large number of skills, analyzing model performance at the skill level helps us understand the models better. We show the performance of foundation models (GPT-3, GPT-3.5-Turbo, and CLIP) on an uncurated set of skills of each subject in Figure 5. We find that these foundation models perform well zero-shot on skills that focus on identifying common objects (e.g., classifying fruits). However, both zero-shot and finetuned foundation models fail on challenging skills that require abstract knowledge and complex reasoning (e.g., describing transformations).

Grades

Intuitively, questions for higher grades are more difficult than those for lower grades. We examine grade-level model performance to investigate whether the same trend holds for neural models. We show the exam scores of models for each grade in Figure 6. Surprisingly, there is no obvious performance drop as the grade level increases. This implies that the learning curve of neural models may differ from that of humans. One possible reason is that neural models are trained on data that includes questions of all grade levels simultaneously, while humans learn gradually from lower to higher grade-level questions. Also, the average exam score over elementary grades (grades 1-6) is 40.8, which is 54.7% lower than the human reference (i.e., 90).
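The 54.7% figure is a relative gap with respect to the human reference score, not a difference in percentage points. A short check of the arithmetic, using only the two numbers stated in the text (the variable names are illustrative):

```python
human_reference = 90.0  # exam score treated as proficient in the text
model_average = 40.8    # average exam score over grades 1-6, from the text

# Relative shortfall of the model score with respect to the human reference.
relative_gap = (human_reference - model_average) / human_reference
print(f"{relative_gap:.1%}")
```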

Calibration

A trustworthy model should be calibrated. This means that its confidence should approximately match the actual probability of the prediction being correct (Guo et al., 2017a). However, modern neural networks are often not well calibrated (Nguyen et al., 2015; Guo et al., 2017b). We show the relationship between the confidence of CLIP and the corresponding accuracy in Figure 8. We use the softmax probability as the confidence. We observe that the zero-shot CLIP model is not well calibrated. In fact, it is overconfident about its predictions, and its confidence is only loosely related to its actual accuracy. After finetuning, CLIP is better calibrated. The results suggest that further improving calibration on STEM is another promising direction.
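One common way to summarize the gap between confidence and accuracy in a single number is the expected calibration error (ECE) discussed by Guo et al. (2017a). The text here reports a confidence-versus-accuracy plot and does not state that ECE was computed, so the sketch below is an illustration under that assumption. The function name, the choice of ten equal-width bins, and the toy data are all hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - confidence| over equal-width confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - avg_conf)
    return ece


# Toy data, made up for illustration: high confidence but only half correct.
toy_conf = [0.9, 0.9, 0.9, 0.9]
toy_correct = [True, False, False, True]
print(expected_calibration_error(toy_conf, toy_correct))
```

A well-calibrated model yields a value near zero; an overconfident model, as in the toy data, yields a value close to the difference between its average confidence and its accuracy.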

Scaling Laws

Figure 8 shows the average accuracy of zero-shot CLIP with different model sizes. As expected, the performance improves as models grow larger, but it also saturates. This implies that, beyond increasing model scale, new advancements in model design or training schemes are required to improve performance on STEM.

3.3 Comparison with Human

In this section, we explore whether the best-performing foundation models, namely CLIP, GPT-3, and GPT-3.5-Turbo, are nearing human-level performance.

Figure 9(a) shows the exam scores (Sec. 2.4) of models and humans on each subject. A score of 90 means a student is proficient in the subject. The zero-shot performance of every tested neural model is well below that bar. In technology, finetuned CLIP achieves human-level performance. This is mainly because most technology skills are about specific empirical knowledge, which neural models can learn through finetuning. Overall, there is still a large performance gap between general neural models and average elementary students, even in understanding the fundamental skills in STEM. In addition, the offline real-world test-takers (Sec. 2.4) produce results similar to those of the above online setup on a subset of questions in STEM. The results are shown in Figure 9(b).

3.4 Case Study

We show examples of GPT-3.5-Turbo predictions in Figure 10, including both correct and incorrect predictions. For the correct ones, the corresponding skills are mainly about the basics, such as names of objects (e.g., shapes or animals). The incorrect predictions are mainly due to the complex nature of the skills. These skills are often about abstract concepts such as symmetry and the direction of force. They also rely more on logical reasoning, such as finding patterns or inferring the function of an animal adaptation.

4 Related Work

There are various types of vision-language tasks, such as reference resolution (Kazemzadeh et al., 2014), image captioning or tagging (Thomee et al., 2016; Sharma et al., 2018), image-text retrieval (Lin et al., 2014; Plummer et al., 2015), visual question answering (Antol et al., 2015; Goyal et al., 2017; Zhang et al., 2016; Zhu et al., 2016), and visual reasoning (Suhr et al., 2017; Johnson et al., 2017). STEM differs from previous datasets in that it covers diverse fundamentals of STEM and requires both multimodal understanding and domain knowledge in STEM. This makes STEM a natural testbed for evaluating the real-world problem-solving abilities of machine learning models.

Existing STEM-related benchmarks do not cover all STEM skills for multimodal understanding. There are benchmarks targeting math (Saxton et al., 2019; Hendrycks et al., 2021b; Zheng et al., 2022; Lu et al., 2021a; b; Xiong et al., 2023b). PIQA (Bisk et al., 2020) is a benchmark for physical commonsense understanding. ScienceQA (Lu et al., 2022) is a multimodal dataset for general science. MMLU (Hendrycks et al., 2021a) contains 57 tasks including STEM but is restricted to the text modality. STEM is the first to include all STEM subjects for vision-language understanding.

Pretrained foundation models help achieve state-of-the-art performance in both NLP and computer vision tasks. Pretrained language models (Radford et al., 2018; 2019; Devlin et al., 2019), especially the recent large language models (Chen et al., 2020a; Wang et al., 2020; 2022a; Ouyang et al., 2022; Crispino et al., 2023; OpenAI, 2023; Chowdhery et al., 2022), have significantly advanced the performance in general natural language understanding tasks. Based on these models, various techniques (Shen et al., 2022a; b; Imani et al., 2023; Jiang et al., 2023; Wang et al., 2023; Xiong et al., 2023a; Pan et al., 2024b; a) have been developed to address specific challenges in a domain such as math. We focus on testing the basic STEM ability of state-of-the-art models in a zero-shot setting and identifying room for improvement by referring to our finetuning results. CLIP (Radford et al., 2021) is one of the state-of-the-art pretrained vision-language models (Lu et al., 2019; Krishna et al., 2017; Chen et al., 2020b; Desai & Johnson, 2021; Lu et al., 2020). Other similar models include GLIP (Li et al., 2022b), GLIDE (Nichol et al., 2022), OFA (Wang et al., 2022b), and BLIP (Li et al., 2022a; 2023). We use CLIP in our test, while the majority of existing benchmarks have not explored it yet.

5 Conclusion

We introduce STEM, a new challenge to examine the STEM skills of neural models. STEM is the largest multimodal benchmark for this challenge. It consists of a large number of multi-choice questions and skills spanning all STEM subjects. STEM focuses on fundamentals of STEM based on the K-12 curriculum. We also include state-of-the-art foundation models such as GPT-3.5-Turbo and CLIP for evaluations. The benchmark results suggest that the performance of current neural models is still far behind that of elementary students. STEM poses unique challenges for the research community to develop fundamental algorithmic advancements. We hope our benchmark will foster future research in multimodal understanding.

Ethics Statement

We hereby acknowledge that all of the co-authors of this work are aware of the provided ICLR Code of Ethics and honor the code of conduct. We collected data from several sources, and we cited the data creators. The copyright belongs to the original data owners. The STEM dataset is under the CC BY-NC-SA 4.0 license (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International) and is used for non-commercial research purposes. The collected data does not contain any personally identifiable information or offensive content. Our dataset is mainly built upon instances from real-world exam data. Therefore, it is less likely to contain sensitive data. We evaluate foundation models, for which the risks and potential harms have been discussed (Brown et al., 2020; Radford et al., 2021).

Acknowledgements

This paper is partially supported by the National Key Research and Development Program of China with Grant No. 2023YFC3341203 as well as the National Natural Science Foundation of China with Grant No. 62276002.

References

  • Anderson et al. (2018) Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pp. 6077–6086, 2018.
  • Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: visual question answering. In ICCV, pp. 2425–2433, 2015.
  • Bashkov et al. (2021) Bozhidar M Bashkov, Kate Mattison, and Lara Hochstein. Ixl design principles. 2021.
  • Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: reasoning about physical commonsense in natural language. In AAAI, pp. 7432–7439, 2020.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chen et al. (2020a) Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E. Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, 2020a.
  • Chen et al. (2020b) Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In ECCV, pp. 104–120, 2020b.
  • Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways. CoRR, abs/2204.02311, 2022.
  • Crispino et al. (2023) Nicholas Crispino, Kyle Montgomery, Fankun Zeng, Dawn Song, and Chenguang Wang. Agent instructs large language models to be general zero-shot reasoners. arXiv preprint arXiv:2310.03710, 2023.
  • Desai & Johnson (2021) Karan Desai and Justin Johnson. Virtex: Learning visual representations from textual annotations. In CVPR, pp. 11162–11173, 2021.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, 2019.
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
  • Goyal et al. (2017) Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, pp. 6325–6334, 2017.
  • Guo et al. (2017a) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In ICML, pp. 1321–1330, 2017a.
  • Guo et al. (2017b) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pp. 1321–1330. PMLR, 2017b.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.
  • Hendrycks et al. (2021a) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR, 2021a.
  • Hendrycks et al. (2021b) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021b.
  • Imani et al. (2023) Shima Imani, Liang Du, and Harsh Shrivastava. Mathprompter: Mathematical reasoning using large language models. CoRR, abs/2303.05398, 2023.
  • IXL (a) IXL. Understanding the ixl smartscore. https://blog.ixl.com/wp-content/uploads/2014/11/SmartScore-guide.pdf, a.
  • IXL (b) IXL. How does the smartscore work? https://www.ixl.com/help-center/article/1272663/how_does_the_smartscore_work, b.
  • Jiang et al. (2023) Albert Qiaochu Jiang, Sean Welleck, Jin Peng Zhou, Timothée Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, and Yuhuai Wu. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. In ICLR, 2023.
  • Johnson et al. (2017) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, pp. 1988–1997, 2017.
  • Kazemzadeh et al. (2014) Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L. Berg. Referitgame: Referring to objects in photographs of natural scenes. In ACL, pp. 787–798, 2014.
  • Khashabi et al. (2020) Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. Unifiedqa: Crossing format boundaries with a single QA system. In Findings of EMNLP, pp. 1896–1907, 2020.
  • Krishna et al. (2017) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, pp. 32–73, 2017.
  • Learning (2019) IXL Learning. The impact of ixl math and ixl ela on student achievement in grades pre-k to 12 (pp. 1–27), 2019.
  • Li et al. (2022a) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pp. 12888–12900. PMLR, 2022a.
  • Li et al. (2023) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  • Li et al. (2022b) Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training. In CVPR, pp. 10955–10965, 2022b.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pp. 740–755, 2014.
  • Liu et al. (2023) Chengwu Liu, Jianhao Shen, Huajian Xin, Zhengying Liu, Ye Yuan, Haiming Wang, Wei Ju, Chuanyang Zheng, Yichun Yin, Lin Li, Ming Zhang, and Qun Liu. Fimo: A challenge formal dataset for automated theorem proving, 2023.
  • Lu et al. (2019) Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, pp. 13–23, 2019.
  • Lu et al. (2020) Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. In CVPR, 2020.
  • Lu et al. (2021a) Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In ACL-IJCNLP, pp. 6774–6786, 2021a.
  • Lu et al. (2021b) Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. In NeurIPS, 2021b.
  • Lu et al. (2022) Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022.
  • Nguyen et al. (2015) Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 427–436, 2015.
  • Nichol et al. (2022) Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. In ICML, pp. 16784–16804, 2022.
  • OpenAI (2023) OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Pan et al. (2024a) Yu Pan, Ye Yuan, Yichun Yin, Jiaxin Shi, Zenglin Xu, Ming Zhang, Lifeng Shang, Xin Jiang, and Qun Liu. Preparing lessons for progressive training on language models. arXiv preprint arXiv:2401.09192, 2024a.
  • Pan et al. (2024b) Yu Pan, Ye Yuan, Yichun Yin, Zenglin Xu, Lifeng Shang, Xin Jiang, and Qun Liu. Reusing pretrained models by multi-linear operators for efficient training. Advances in Neural Information Processing Systems, 36, 2024b.
  • Peng et al. (2023) Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world, 2023.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In EMNLP, pp. 1532–1543, 2014.
  • Plummer et al. (2015) Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, pp. 2641–2649, 2015.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pp. 8748–8763, 2021.
  • Saxton et al. (2019) David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. In ICLR, 2019.
  • Sharma et al. (2018) Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pp. 2556–2565, 2018.
  • Shen et al. (2022a) Da Shen, Xinyun Chen, Chenguang Wang, Koushik Sen, and Dawn Song. Benchmarking language models for code syntax understanding. In EMNLP, 2022a.
  • Shen et al. (2022b) Jianhao Shen, Chenguang Wang, Ye Yuan, Jiawei Han, Heng Ji, Koushik Sen, Ming Zhang, and Dawn Song. Palt: Parameter-lite transfer of language models for knowledge graph completion. In EMNLP, 2022b.
  • Suhr et al. (2017) Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi. A corpus of natural language for visual reasoning. In ACL, pp. 217–223, 2017.
  • Sun et al. (2023) Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale, 2023.
  • Thomee et al. (2016) Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: the new data in multimedia research. Commun. ACM, pp. 64–73, 2016.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
  • Wang et al. (2020) Chenguang Wang, Xiao Liu, and Dawn Song. Language models are open knowledge graphs. arXiv preprint arXiv:2010.11967, 2020.
  • Wang et al. (2022a) Chenguang Wang, Xiao Liu, Zui Chen, Haoyun Hong, Jie Tang, and Dawn Song. Deepstruct: Pretraining of language models for structure prediction. In ACL, 2022a.
  • Wang et al. (2023) Haiming Wang, Ye Yuan, Zhengying Liu, Jianhao Shen, Yichun Yin, Jing Xiong, Enze Xie, Han Shi, Yujun Li, Lin Li, Jian Yin, Zhenguo Li, and Xiaodan Liang. DT-solver: Automated theorem proving with dynamic-tree sampling guided by proof-level value function. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12632–12646, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.706. URL https://aclanthology.org/2023.acl-long.706.
  • Wang et al. (2022b) Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pp. 23318–23340. PMLR, 2022b.
  • Xiong et al. (2023a) Jing Xiong, Zixuan Li, Chuanyang Zheng, Zhijiang Guo, Yichun Yin, Enze Xie, Zhicheng Yang, Qingxing Cao, Haiming Wang, Xiongwei Han, Jing Tang, Chengming Li, and Xiaodan Liang. Dq-lore: Dual queries with low rank approximation re-ranking for in-context learning, 2023a.
  • Xiong et al. (2023b) Jing Xiong, Jianhao Shen, Ye Yuan, Haiming Wang, Yichun Yin, Zhengying Liu, Lin Li, Zhijiang Guo, Qingxing Cao, Yinya Huang, Chuanyang Zheng, Xiaodan Liang, Ming Zhang, and Qun Liu. TRIGO: Benchmarking formal mathematical proof reduction for generative language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 11594–11632, Singapore, December 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.711. URL https://aclanthology.org/2023.emnlp-main.711.
  • Zhang et al. (2016) Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Yin and yang: Balancing and answering binary visual questions. In CVPR, pp. 5014–5022, 2016.
  • Zheng et al. (2022) Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. minif2f: a cross-system benchmark for formal olympiad-level mathematics. In ICLR, 2022.
  • Zhu et al. (2016) Yuke Zhu, Oliver Groth, Michael S. Bernstein, and Li Fei-Fei. Visual7w: Grounded question answering in images. In CVPR, pp. 4995–5004, 2016.

Appendix A More Details on STEM

In this section, we provide more details on STEM, including dataset analysis, models, evaluation settings, and dataset collection.

A.1 Analysis

Questions and Answers

STEM contains multi-choice questions (Appendix D provides a question example for each skill). Each question contains a textual description with an optional image context. Answer options are given as text or as images. We further analyze the questions from the following aspects.

(i) The number of answers. STEM has 2.8 answer options per question on average. The distribution is presented in Figure 12. In practice, the more answer options a question has, the more difficult it is.

(ii) Question type. We categorize questions based on the first three words of the question text, as shown in Figure 12. STEM mostly includes factoid questions that start with words such as “which” and “what”. We also show the word cloud of STEM in Figure 13, where the most common words include “shape” and “number”. This indicates that the questions require joint reasoning over text and images.

(iii) Question distribution. Figure 15 depicts the distribution of question lengths. All subjects generally follow a long-tail distribution; the math distribution is the steepest and the science distribution is flatter. Heuristically, longer questions are more difficult to solve. Figure 15 shows the number of questions in each grade. While pre-K has more questions, the number of questions in the other grades is approximately evenly distributed.

Skill Comparison

We compare the skills of STEM with other related datasets in Table 3. STEM contains the largest skill set among existing datasets, with a great number of new skills that are not yet covered by existing datasets, e.g., skills in technology and engineering.

Subject      IconQA  ScienceQA  STEM
Science      0       167        82
Technology   0       0          9
Engineering  0       0          6
Math         13      0          351
Total        13      167        448

IconQA       STEM
Counting     Count to 10, Count shapes in rows, Count sides and corners …
Geometry     Classify triangles, Identify symmetry, Identify shapes …
Time         Match times, Identify A.M./P.M., Read a calendar …
Not covered  Science: Compare concentrations of solutions …
             Technology: Identify peripherals …
             Engineering: Identify laboratory tools …
             Math: Linear and exponential functions …

A.2 Models

In this section, we introduce the foundation models we benchmark in detail.

Vision-Language Models

CLIP (Radford et al., 2021). CLIP is pretrained on a large dataset of 400 million text-image pairs collected from the Internet. It uses a Transformer as the text encoder and has several variants of the image encoder, including ResNet (RN) backbones and Vision Transformers (ViT) (Dosovitskiy et al., 2020). CLIP aligns the text and image representations by training with an in-batch contrastive loss, and is able to transfer zero-shot to downstream vision-language tasks. To align with CLIP pretraining, we formulate question answering as matching text and images. We use the cosine similarity between the text and image embeddings as the matching function, the same as the original zero-shot image-text retrieval setting in CLIP (Radford et al., 2021).
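The matching formulation can be sketched as follows, using the zero-shot setup described in Section A.3: the question text is concatenated with each answer option, each candidate text is scored against the image by cosine similarity, and the highest-scoring option is chosen. This is a minimal sketch, not the evaluation code. The `encode_text` and `encode_image` callables stand in for real CLIP encoders, the toy encoders and vectors are made up for illustration, joining question and option with a single space is an assumption, and questions whose answer options are images are not handled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_answer(question, options, image, encode_text, encode_image):
    """Return (best option index, scores) by matching question+option text to the image."""
    image_emb = encode_image(image)
    scores = []
    for option in options:
        # Assumption: question and option are joined with a single space.
        text_emb = encode_text(f"{question} {option}")
        scores.append(cosine(text_emb, image_emb))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy stand-ins (NOT real CLIP encoders); vectors are hand-picked for illustration.
def toy_encode_text(text):
    return [1.0, 0.0] if "triangle" in text else [0.0, 1.0]


def toy_encode_image(image):
    return image  # in this toy setup the "image" is already an embedding


best, scores = zero_shot_answer(
    "Which shape is shown?", ["circle", "triangle"], [0.9, 0.1],
    toy_encode_text, toy_encode_image,
)
print(best, scores)
```

Passing the encoders in as arguments keeps the scoring logic independent of any particular model, so the same function could score a different image-text encoder pair.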

ViLBERT and 12-in-1 (Lu etal., 2019; 2020). ViLBERT adopts two parallel streams to process image regions and text segments separately, with co-attentional transformer layers connecting them. There is also a multi-task version called 12-in-1 (Lu etal., 2020) that trains 12 different tasks with individual task-specific heads sharing 1 “trunk” ViLBERT model. Its multi-modal alignment prediction serves as the matching score.

UNITER (Chen etal., 2020b).UNITER consists of an Image Embedder with Faster R-CNN (Anderson etal., 2018), a Text Embedder with Transformer (Vaswani etal., 2017), as well as a multi-layer Transformer to get cross-modality representation. During inference on STEM, the matching score function is the same as CLIP, i.e., the cosine similarity between the text and image embeddings (Chen etal., 2020b).

Virtex (Desai & Johnson, 2021). Virtex first extracts visual features with a ResNet-50 (He et al., 2016) backbone. The visual features are then fed into a textual head, which consists of two unidirectional Transformers, to predict captions. We extract the image features with the image encoder, then feed the text into the textual head and use the sum of the bidirectional generation logits as the matching score.

Language Models

GPT-3 (Brown et al., 2020) and GPT-3.5-Turbo (Ouyang et al., 2022). These foundation language models are generative models pretrained on a large text corpus. We use the OpenAI API models “text-davinci-002” and “gpt-3.5-turbo”, corresponding to the best-performing GPT-3 and GPT-3.5-Turbo respectively. We formalize the evaluation as a question-answering task. The input to GPT-3 and GPT-3.5-Turbo is the concatenation of the question text, the context text, and the answer options. The output is a final answer chosen from the answer options. For images in questions, we follow Lu et al. (2022) and convert them to visual context text using a captioning model consisting of ViT (Dosovitskiy et al., 2020) and GPT-2 (Radford et al., 2019).
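As a rough illustration of this input format, the sketch below concatenates a question, its context text, and the answer options into one prompt string. The exact template (field labels, option lettering) is an assumption made for illustration and is not specified in the text; `build_prompt` is a hypothetical helper, and the context string stands in for a caption produced by the captioning model.

```python
def build_prompt(question, context, options):
    """Concatenate question text, context text, and lettered answer options."""
    lines = [f"Question: {question}", f"Context: {context}", "Options:"]
    for i, option in enumerate(options):
        letter = chr(ord("A") + i)  # A, B, C, ...
        lines.append(f"({letter}) {option}")
    lines.append("Answer:")
    return "\n".join(lines)

prompt = build_prompt(
    question="Which school has the fewest swings?",
    context="A table listing the number of swings at each school.",  # placeholder caption
    options=[
        "Shoreline Elementary",
        "Hillside Elementary",
        "Valley Elementary",
        "Lincoln Elementary",
    ],
)
```

The model's generated continuation after "Answer:" would then be mapped back to one of the options.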

UnifiedQA (Khashabi et al., 2020). UnifiedQA is a pretrained question-answering model. We use both its base and small versions. Its evaluation setup is the same as that of GPT-3 and GPT-3.5-Turbo.

GloVe (Pennington et al., 2014). GloVe is a pretrained word embedding model. We compute the similarity between the average embedding of the concatenated question and context and the average embedding of each answer option. The answer option with the largest similarity score is the predicted answer. Average pooling is applied over the 300-dimensional word embeddings. The images are converted to text using the same method as for GPT-3 and GPT-3.5-Turbo.
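The GloVe baseline can be sketched as below. The toy 3-dimensional vectors stand in for the 300-dimensional GloVe embeddings, cosine similarity is assumed as the similarity function (the text does not name one), and whitespace tokenization and the helper names are illustrative choices.

```python
import math

def average_embedding(tokens, vectors):
    """Average-pool the vectors of all tokens found in the vocabulary."""
    dim = len(next(iter(vectors.values())))
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return [0.0] * dim
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def glove_answer(question_and_context, options, vectors):
    """Pick the option whose average embedding is most similar to the question's."""
    query = average_embedding(question_and_context.lower().split(), vectors)
    scores = [
        cosine(query, average_embedding(option.lower().split(), vectors))
        for option in options
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings (assumed values, for illustration only).
toy_vectors = {
    "rough": [1.0, 0.0, 0.0],
    "bark": [0.9, 0.1, 0.0],
    "tree": [0.8, 0.0, 0.2],
    "smooth": [0.0, 1.0, 0.0],
}
choice = glove_answer("tree bark", ["smooth", "rough"], toy_vectors)
```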

A.3 Evaluation Settings

We benchmark state-of-the-art foundation models on STEM under different settings, including zero-shot, few-shot, finetuning, and multi-task.

(i) Zero-Shot. We use CLIP (Radford et al., 2021), ViLBERT (Lu et al., 2019), 12-in-1 (Lu et al., 2020), UNITER (Chen et al., 2020b), and Virtex (Desai & Johnson, 2021) for the zero-shot evaluation of foundation multimodal models. CLIP is a state-of-the-art multimodal model. For zero-shot CLIP, we follow the original setup in Radford et al. (2021). The input to the text encoder is the concatenation of the question text and an answer option. The input to the image encoder is the image context. The output is the cosine similarity score between each text embedding and the image embedding. The answer option with the largest similarity score is selected as the answer. For questions with image answer options, the image answer options are also fed to the image encoder.

(ii) Few-Shot. We also use CLIP to benchmark the multimodal few-shot results. For the $k$-shot setup, we randomly select $k$ questions for each skill from the training set as a meta training set. For each STEM subject, we train the model on the meta training set and select the best model on the validation set. At test time, the evaluation is the same as in the zero-shot setup.
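Building the meta training set can be sketched as follows. The record layout (a dict with a `"skill"` key), the fixed seed, and the function name `sample_k_shot` are assumptions made for illustration; the handling of skills with fewer than $k$ questions is also an assumption.

```python
import random
from collections import defaultdict

def sample_k_shot(train_set, k, seed=0):
    """Randomly select up to k questions per skill from the training set."""
    rng = random.Random(seed)
    by_skill = defaultdict(list)
    for question in train_set:
        by_skill[question["skill"]].append(question)

    meta_train = []
    for skill in sorted(by_skill):
        pool = by_skill[skill]
        # If a skill has fewer than k questions, keep all of them (assumption).
        meta_train.extend(rng.sample(pool, min(k, len(pool))))
    return meta_train

# Toy training set: 30 questions spread evenly over 3 hypothetical skills.
train_set = [{"id": i, "skill": f"skill-{i % 3}"} for i in range(30)]
meta_train = sample_k_shot(train_set, k=2)
```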

(iii) Finetuning. We also finetune CLIP on the entire training set for each subject. The remaining setup is the same as the few-shot setting.

(iv) Multi-Task. Under this setting, we train CLIP on the mixture of training sets of four subjects to produce a single model for all subjects.

A.4 Dataset Collection

We collect science, engineering, and math problems from IXL (https://www.ixl.com/), and technology problems from ProProfs Quizzes (https://www.proprofs.com/quiz-school) and Triviaplaza (https://www.triviaplaza.com/). We first collect multiple-choice problems that have at least one image in either the question context or the answers. We collect at most 2,000 problems for each skill and remove duplicated problems. Many formulas embedded in math problems are not represented in the text. We use the Mathpix (https://mathpix.com/) OCR API to convert these math formulas into LaTeX format.
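The deduplication and per-skill cap could look like the sketch below. What counts as a duplicate (here: same skill, question text, and choices) and the order of the two steps are assumptions, since the text does not specify them; `dedupe_and_cap` is a hypothetical helper.

```python
from collections import defaultdict

def dedupe_and_cap(problems, max_per_skill=2000):
    """Drop duplicated problems, then keep at most max_per_skill per skill."""
    seen = set()
    kept = defaultdict(list)
    for problem in problems:
        # Assumed duplicate key: skill, question text, and answer choices.
        key = (problem["skill"], problem["question"], tuple(problem["choices"]))
        if key in seen:
            continue
        seen.add(key)
        if len(kept[problem["skill"]]) < max_per_skill:
            kept[problem["skill"]].append(problem)
    return dict(kept)

# Toy input with one exact duplicate.
problems = [
    {"skill": "s1", "question": "q1", "choices": ["a", "b"]},
    {"skill": "s1", "question": "q1", "choices": ["a", "b"]},
    {"skill": "s1", "question": "q2", "choices": ["a", "b"]},
    {"skill": "s2", "question": "q1", "choices": ["a", "b"]},
]
kept = dedupe_and_cap(problems)
```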

Appendix B More Details on Experiments

B.1 Experimental Setup

For the zero-shot setting, we evaluate all models on the test set. For the few-shot, finetuning, and multi-task settings, we train CLIP-ViT-L/14@336px on the corresponding training set, tune hyperparameters on the validation set, and finally evaluate on the test set. We use AdamW for optimization and tune hyperparameters as follows. The batch size is chosen from {16, 32, 64, 128}; after tuning, it is set to 16 for few-shot learning and 128 for finetuning and multi-task learning. The learning rate is chosen from the range [5e-6, 5e-5] and set to 1e-5 for all training. We set the warm-up ratio to 0.1 and the weight decay to 0.2. We cap training at 100k samples for finetuning, 200k samples for multi-task training, and 10 epochs for few-shot training, all with early stopping on the validation set. We use NVIDIA GeForce RTX 3090 GPUs for training.
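For reference, the reported hyperparameters can be collected in one place as below. The values are taken from the paragraph above; the `TrainConfig` class, its field names, and the `CONFIGS` dictionary are hypothetical and only illustrate one way to organize them.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TrainConfig:
    """Training hyperparameters for one setting (field names are illustrative)."""
    batch_size: int
    learning_rate: float = 1e-5      # same for all training runs
    warmup_ratio: float = 0.1
    weight_decay: float = 0.2
    optimizer: str = "AdamW"
    max_train_samples: Optional[int] = None  # cap on training samples, if any
    max_epochs: Optional[int] = None         # cap on epochs, if any

CONFIGS = {
    "few-shot": TrainConfig(batch_size=16, max_epochs=10),
    "finetuning": TrainConfig(batch_size=128, max_train_samples=100_000),
    "multi-task": TrainConfig(batch_size=128, max_train_samples=200_000),
}
```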

B.2 Detailed Experimental Analysis

Method | Science | Technology | Engineering | Math | Average
CLIP Zero-Shot | 50.3 | 68.7 | 55.1 | 43.6 | 54.4
CLIP Few-Shot | 75.2 | 70.9 | 61.9 | 63.2 | 67.8
CLIP Finetuning | 87.0 | 71.9 | 67.7 | 78.4 | 76.3
CLIP Multi-Task | 86.3 | 60.4 | 73.4 | 77.7 | 74.5

Few-Shot

In the few-shot setting, we vary the number of samples per skill to see how learning performance changes. Specifically, we sample 16 samples per skill and train CLIP on the sampled data. The results are shown in Table 4. We observe that CLIP improves in all subjects after few-shot learning, with the average accuracy rising from 54.4% to 67.8%. This implies that CLIP has already stored STEM-related knowledge and that a few samples are able to trigger such knowledge. We also show how performance varies when the number of samples per skill changes (Figure 17). The overall performance improves with more samples, but 1-shot and 2-shot in technology are worse than zero-shot. Since there are only 9 skills in technology, 1-shot and 2-shot learning in technology might lead to overfitting.

Multi-Task

We show the results in Table 4. Compared with the individually finetuned models, multi-task learning improves in engineering (73.4% vs. 67.7%) but performs worse in the other subjects. The large drop in technology (60.4% vs. 71.9%) is mainly because it has much less data than the other subjects. The improvement in engineering implies that data from one subject may benefit another when the knowledge is transferable. For example, science shares many common topics with engineering, such as chemical experiments.


Number of Answers

We also analyze how model performance changes with the number of answers. The results are shown in Figure 17. We find that for GPT-3, GPT-3.5-Turbo, CLIP zero-shot, and few-shot, the accuracy drops as the number of answers increases, but the accuracy of CLIP finetuning and multi-task does not drop. This implies that models after full training are actually solving the problem rather than guessing, so the number of choices does not affect the performance much.

Question Lengths

Figure 19 shows how the question length affects model accuracy. For GPT-3, GPT-3.5-Turbo, and CLIP zero-shot, the accuracy decreases slightly as the question becomes longer. For tuned models, the same trend holds for questions shorter than 70 tokens, but the accuracy starts to increase for longer questions. We think this may be caused by some bias in longer questions that the tuned models learn and exploit to achieve higher accuracy. Since only a small proportion of questions are longer than 70 tokens, such bias does not affect the whole dataset much.

Question Type

We define the type of a problem as the first word of its question or request. In Figure 19 we show the accuracy on the 10 most frequent types. Questions starting with “What” and “How” have relatively low accuracy, as these questions are more difficult to answer.
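Grouping accuracy by question type can be sketched as below. The question strings are taken from examples in this appendix, but the correctness flags are invented toy values for illustration only; the helper names and the punctuation stripping are also assumptions.

```python
from collections import defaultdict

def question_type(question):
    """Return the first word of the question, stripped of punctuation."""
    words = question.strip().split()
    return words[0].strip('.,:;?!"').capitalize() if words else ""

def accuracy_by_type(records):
    """Compute accuracy per question type from (question, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for question, is_correct in records:
        qtype = question_type(question)
        totals[qtype] += 1
        correct[qtype] += int(is_correct)
    return {qtype: correct[qtype] / totals[qtype] for qtype in totals}

# Toy records: the correctness flags are made up, not measured results.
records = [
    ("What is this?", False),
    ("What fraction of a turn is this angle?", False),
    ("Which school has the fewest swings?", True),
    ("Which meeting ends at 12:00 P.M.?", False),
    ("Select the gloves.", True),
]
by_type = accuracy_by_type(records)
```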


Grades

We show the model accuracy on each grade in Figure 22. There is no obvious performance drop as the grade level increases, which is similar to the trend of exam scores. This implies that the learning curve for neural models may differ from that of humans.

Correlation Between Exam Scores and Accuracy

We evaluate the correlation of exam scores with model accuracy and human accuracy (Figure 20). In general, they are positively correlated. Even though the exam score is different from accuracy, it captures accuracy as an important factor overall.

Subject | Reason | Ratio (%)
Math | Commonsense | 36
Math | Numerical calculation | 24
Math | Counting | 16
Math | Read table/graph | 12
Math | Transformation | 12
Science | Comparison | 40
Science | Commonsense | 32
Science | Direction | 20
Science | Read table/graph | 8

B.3 Error Analysis

To better understand the errors made by CLIP zero-shot, we sample 25 of its error cases on each of math and science and manually check the reasons for these errors. Table 5 shows the analysis results. For math, 36% of the errors are caused by a lack of mathematical commonsense, such as area formulas and symmetry. Other errors include calculation failures (24%), counting objects (16%), reading tables or graphs (12%, e.g., graphs of functions), and transformation (12%, e.g., rotation of a 3D object). For science, comparison causes the most errors, with a ratio of 40%. Most of these questions only require a straightforward comparison, such as comparing the distance between two pairs of magnets. However, CLIP fails on such basic problems, which indicates that it is not yet good at comparing objects and properties. Lacking science commonsense also leads to a sizable share of errors (32%), followed by identifying directions (20%, e.g., the directions of push and pull, towards and away) and reading tables or graphs (8%).

Moreover, we show the top-5 skills with the most errors of fine-tuned models on the math and science subsets in Table 6 and Table 7 respectively.

Skill | Error Rate | Example
greatest-and-least-word-problems-up-to-100 | 76.8%

Description: The school district compared how many swings each elementary school has. Which school has the fewest swings?

Picture: Measuring Vision-Language STEM Skills of Neural Models (22)

Choices: [Shoreline Elementary, Hillside Elementary, Valley Elementary, Lincoln Elementary, ]

Answer index: 2

Prediction: 0

greatest-and-least-word-problems-up-to-1000 | 76.0%

Description: Paul kept a log of how many minutes he spent practicing ice skating over the past 4 days. On which day did Paul practice the least?

Picture: Measuring Vision-Language STEM Skills of Neural Models (23)

Choices: [Tuesday, Wednesday, Thursday, Friday, ]

Answer index: 3

Prediction: 2

reading-schedules | 75.0%

Description: Look at the following schedule: Which meeting ends at 12:00 P.M.?

Picture: Measuring Vision-Language STEM Skills of Neural Models (24)

Choices: [the city council meeting, the construction permit meeting, the parking meter meeting, the police meeting, ]

Answer index: 0

Prediction: 2

angles-of-90-180-270-and-360-degrees | 73.8%

Description: What fraction of a turn is this angle?

Picture: Measuring Vision-Language STEM Skills of Neural Models (25)

Choices: [3/4, 1 full turn, 1/2, 1/4, ]

Answer index: 2

Prediction: 3

points-lines-line-segments-rays-and-angles | 73.8%

Description: What is this?

Picture: Measuring Vision-Language STEM Skills of Neural Models (26)

Choices: [a line segment, a ray, a line, a point, ]

Answer index: 1

Prediction: 0

Skill | Error Rate | Example
use-punnett-squares-to-calculate-ratios-of-offspring-types | 69.10%

Description: This passage describes the antenna type trait in fruit flies: Most fruit flies have a pair of antennae on their head. But, some flies appear to have an extra pair of legs on their head instead! These flies have a mutation, or change, in a gene that affects body development. This mutation makes the cells in the fly’s head form mutated antennae that are like legs. In a group of fruit flies, some individuals have mutated antennae and others have normal antennae. In this group, the gene for the antenna type trait has two alleles. The allele for normal antennae (a) is recessive to the allele for mutated antennae (A). This Punnett square shows a cross between two fruit flies. What is the expected ratio of offspring with normal antennae to offspring with mutated antennae? Choose the most likely ratio.

Picture: Measuring Vision-Language STEM Skills of Neural Models (27)

Choices: [0:4, 3:1, 2:2, 1:3, 4:0, ]

Answer index: 0

Prediction: 3

use-punnett-squares-to-calculate-probabilities-of-offspring-types | 60.10%

Description: In a group of tomato plants, some individuals have smooth fruit and others have fuzzy fruit. In this group, the gene for the fruit texture trait has two alleles. The allele for smooth fruit (F) is dominant over the allele for fuzzy fruit (f). This Punnett square shows a cross between two tomato plants. What is the probability that a tomato plant produced by this cross will be homozygous recessive for the fruit texture gene?

Picture: Measuring Vision-Language STEM Skills of Neural Models (28)

Choices: [0/4, 1/4, 2/4, 3/4, 4/4, ]

Answer index: 0

Prediction: 3

predict-temperature-changes | 55.00%

Description: Two identical blocks are heated to different temperatures. The blocks are placed so that they touch, and heat begins to flow between the blocks. The pair of blocks is insulated, so no energy escapes. Later, the temperature of each block is measured again. Which pair of temperatures is possible?

Picture: Measuring Vision-Language STEM Skills of Neural Models (29)

Choices: Measuring Vision-Language STEM Skills of Neural Models (30)Measuring Vision-Language STEM Skills of Neural Models (31)

Answer index: 1

Prediction: 0

identify-magnets-that-attract-or-repel | 21.10%

Description: Two magnets are placed as shown. Hint: Magnets that attract pull together. Magnets that repel push apart.

Picture: Measuring Vision-Language STEM Skills of Neural Models (32)

Choices: [attract, repel, ]

Answer index: 1

Prediction: 0

predict-heat-flow | 16.20%

Description: Two solid blocks are at different temperatures. The blocks are touching. Which picture shows how heat will move?

Picture: None

Choices: Measuring Vision-Language STEM Skills of Neural Models (33)Measuring Vision-Language STEM Skills of Neural Models (34)

Answer index: 0

Prediction: 1

B.4 Comparison with Human

Exam Score

We test exam scores on all skills in engineering and technology. Due to technical and time constraints, we randomly choose 40 skills from math and 30 skills from science. We compare neural models with humans using the exam score, and the results are shown in Table 8. The detailed scores and skills are listed in Table 10.

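Method | Exam Score: Science | Exam Score: Engineering | Exam Score: Math | Exam Score: Technology | Accuracy: Science | Accuracy: Technology | Accuracy: Engineering | Accuracy: Math
Human | 90.0 | 90.0 | 90.0 | 68.6 | 90.7 | 62.9 | 86.4 | 92.1
Random | 26.7 | 16.1 | 51.1 | 25.0 | 38.3 | 25.0 | 40.0 | 36.8
GPT-3 | 45.7 | 50.2 | 51.4 | 22.1 | 48.4 | 21.3 | 65.2 | 42.4
GPT-3.5-Turbo | 48.9 | 58.7 | 53.5 | 26.3 | 48.5 | 27.4 | 62.5 | 40.6
CLIP Zero-Shot | 33.9 | 19.0 | 52.9 | 68.7 | 53.8 | 60.7 | 65.5 | 44.3
CLIP Few-Shot | 39.1 | 43.9 | 67.6 | 70.9 | 77.3 | 59.7 | 55.5 | 67.8
CLIP Finetuning | 57.8 | 37.4 | 75.7 | 71.9 | 91.9 | 62.6 | 60.3 | 83.5
CLIP Multi-Task | 61.9 | 50.3 | 72.0 | 60.4 | 90.9 | 50.6 | 70.2 | 82.5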

Accuracy

We randomly sample 20 problems for each subject, ask 7 Ph.D. students to answer them, and calculate the average accuracy for each subject. To evaluate neural models on these questions, we use the model’s accuracy on the corresponding skill as its score on each sampled problem, and average these scores to obtain the final score. We do not evaluate models on the sampled problems directly, since the small number of samples would lead to a large variance, which skill-level accuracy avoids. The comparison results are shown in Table 8 and Figure 22. All sampled problems are listed in Tables 12 to 17.
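The skill-accuracy proxy described above amounts to a simple average, sketched below. The skill names and accuracy values are hypothetical placeholders, and `proxy_score` is an illustrative helper name.

```python
def proxy_score(sampled_skills, skill_accuracy):
    """Average the model's skill-level accuracy over the sampled problems.

    sampled_skills: the skill of each sampled problem (one entry per problem).
    skill_accuracy: the model's accuracy (%) on each skill, measured on the
                    full test set rather than on the sampled problems.
    """
    scores = [skill_accuracy[skill] for skill in sampled_skills]
    return sum(scores) / len(scores)

# Hypothetical values for illustration only.
skill_accuracy = {"skill-a": 80.0, "skill-b": 50.0}
sampled_skills = ["skill-a", "skill-a", "skill-b", "skill-b"]
score = proxy_score(sampled_skills, skill_accuracy)
```

Because each sampled problem contributes its skill's accuracy rather than a single 0/1 outcome, the resulting score does not depend on whether the model happens to get the few sampled problems right.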

B.5 Zero-Shot Prompt Sensitivity

We study the effect of prompts on CLIP zero-shot. We design 5 types of prompts and demonstrate them with an example problem. The example question is “Which property matches this object?” and the answer is “Rough”. Examples of the different prompt types and the corresponding accuracies are shown in Table 9. We observe that “Q+A” results in the best performance on average, but the difference is only marginal, meaning that CLIP zero-shot is not very sensitive to the format of prompts.

Prompt Format | Example | Science | Technology | Engineering | Math | Average
Q+A | Which property matches this object? Rough. | 50.3 | 68.7 | 55.1 | 43.6 | 54.4
A+Q | Rough. Which property matches this object? | 50.0 | 66.0 | 49.6 | 43.2 | 52.2
Q “Choose the best answer:” A | Which property matches this object? Choose the best answer: Rough. | 50.1 | 70.7 | 49.7 | 44.2 | 53.7
“Answer the question:” Q + A | Answer the question: Which property matches this object? Rough. | 49.4 | 67.6 | 51.0 | 43.6 | 52.9
A “best answers the question” Q | Rough best answers the question: Which property matches this object? | 49.7 | 69.5 | 50.8 | 43.8 | 53.4
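The five prompt formats can be written as string templates, as in the sketch below. The template strings are reconstructed from the examples in the table; the dictionary keys and the `format_prompt` helper are hypothetical names.

```python
# Templates reconstructed from the table's examples; {q} is the question and
# {a} is one answer option. The short key names are illustrative.
PROMPT_FORMATS = {
    "Q+A": "{q} {a}.",
    "A+Q": "{a}. {q}",
    "Q choose A": "{q} Choose the best answer: {a}.",
    "answer Q+A": "Answer the question: {q} {a}.",
    "A best answers Q": "{a} best answers the question: {q}",
}

def format_prompt(name, question, answer):
    """Fill the named template with a question and one answer option."""
    return PROMPT_FORMATS[name].format(q=question, a=answer)

question = "Which property matches this object?"
prompts = {name: format_prompt(name, question, "Rough") for name in PROMPT_FORMATS}
```

Each formatted string would be passed to the text encoder once per answer option, and the option with the highest image-text similarity would be selected.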

SubjectGrade/SkillRandomZero-shotFinetune
Science grade-2/classify-matter-as-solid-liquid-or-gas2840100
grade-2/identify-animals-with-and-without-backbones07070
grade-2/identify-mammals-birds-fish-reptiles-and-amphibians0018
grade-2/identify-materials-in-objects2140100
grade-2/identify-properties-of-an-object356565
grade-3/compare-strengths-of-magnetic-forces01863
grade-3/describe-ecosystems6550100
grade-3/find-evidence-of-changes-to-earths-surface1738100
grade-3/identify-ecosystems35100100
grade-3/identify-minerals-using-properties351135
grade-4/compare-properties-of-objects101720
grade-4/describe-ecosystems74100100
grade-4/identify-minerals-using-properties351635
grade-4/use-evidence-to-classify-mammals-birds-fish-reptiles-and-amphibians263535
grade-5/animal-adaptations-beaks-mouths-and-necks172735
grade-5/classify-elementary-substances-and-compounds-using-models757575
grade-5/compare-ancient-and-modern-organisms-use-observations-to-support-a-hypothesis323250
grade-5/identify-directions-of-forces02635
grade-5/identify-the-photosynthetic-organism00100
grade-5/predict-temperature-changes0220
grade-5/use-evidence-to-classify-animals353535
grade-5/use-evidence-to-classify-mammals-birds-fish-reptiles-and-amphibians183535
grade-5/weather-and-climate-around-the-world603660
grade-6/compare-concentrations-of-solutions1511100
grade-6/describe-the-effects-of-gene-mutations-on-organisms521369
grade-6/diffusion-across-membranes502550
grade-7/describe-the-effects-of-gene-mutations-on-organisms421369
grade-8/classify-symbiotic-relationships253645
grade-8/diffusion-across-membranes01835
grade-8/moss-and-fern-life-cycles0120
Engineer grade-6/evaluate-tests-of-engineering-design-solutions00100
grade-6/identify-control-and-experimental-groups000
grade-6/identify-independent-and-dependent-variables00100
grade-6/identify-the-experimental-question303030
grade-7/evaluate-tests-of-engineering-design-solutions000
grade-7/identify-control-and-experimental-groups0040
grade-7/identify-independent-and-dependent-variables0030
grade-7/identify-the-experimental-question40040
grade-8/identify-control-and-experimental-groups000
grade-8/identify-the-experimental-question60040
grade-5/identify-laboratory-tools214231
grade-6/identify-laboratory-tools212121
grade-6/laboratory-safety-equipment246552
grade-7/identify-laboratory-tools102821
grade-7/laboratory-safety-equipment95852
grade-8/identify-laboratory-tools492121
grade-8/laboratory-safety-equipment95858
Math algebra-2/factor-quadratics-using-algebra-tiles405155
algebra-2/outliers-in-scatter-plots554797
calculus/determine-continuity-using-graphs366380
calculus/find-limits-at-vertical-asymptotes-using-graphs606585
grade-1/subtraction-sentences-up-to-10-which-model-matches503099
grade-2/identify-halves-thirds-and-fourths657597
grade-2/identify-lines-of-symmetry706499
grade-2/interpret-bar-graphs-ii142312
grade-2/ordinal-numbers-up-to-10th326128
grade-3/compare-fractions-in-recipes555068
grade-3/identify-parallelograms516470
grade-3/is-it-a-polygon716098
grade-3/parallel-sides-in-quadrilaterals296645
grade-4/nets-of-three-dimensional-figures684099
grade-5/nets-of-three-dimensional-figures534099
grade-6/changes-in-mean-median-mode-and-range381415
grade-6/classify-triangles473845
grade-6/identify-polyhedra757575
grade-6/mean-median-mode-and-range-find-the-missing-number554199
grade-6/model-and-solve-equations-using-algebra-tiles363657
grade-6/rational-numbers-find-the-sign317899
grade-6/rotational-symmetry625678
grade-6/similar-and-congruent-figures343346
grade-6/which-figure-is-being-described362786
grade-7/rational-numbers-find-the-sign475899
grade-8/rotational-symmetry-amount-of-rotation473263
kindergarten/count-on-ten-frames-up-to-1015249
kindergarten/fewer-and-more-up-to-20806297
kindergarten/subtraction-sentences-up-to-5-which-model-matches413096
pre-k/addition-sentences-up-to-10-which-model-matches605596
pre-k/count-on-ten-frames-up-to-3845051
pre-k/fewer-and-more-compare-by-matching635290
pre-k/one-less-with-pictures-up-to-10613766
pre-k/one-more-with-pictures-up-to-5483675
pre-k/shapes-of-everyday-objects679696
pre-k/spheres679696
pre-k/triangles577575
pre-k/what-comes-next755670
pre-k/ordinal-numbers-up-to-tenth278482
kindergarten/are-there-enough409996

B.6 Detailed Performance on Skills

We show the accuracy of neural models on all 448 skills in Figures 24 to 28. The zero-shot performance is better than random guessing on most skills and is near 100% on some skills (e.g., “circles” and “cones”). After finetuning, accuracy improves on most skills and becomes near 100% on many skills.

B.7 VQA Results

Model | Accuracy
Zero-Shot CLIP | 24.7%
Finetuning with Science | 27.3%
Finetuning with Technology | 26.5%
Finetuning with Engineering | 24.8%
Finetuning with Math | 24.9%

We evaluate the zero-shot CLIP model and the models finetuned on each subject on the VQA (Antol et al., 2015) dataset. Results are shown in Table 11. The average improvement of the finetuned models over the zero-shot setting is 1.2 percentage points.
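The reported average improvement can be checked directly from the table values, as in this short calculation (variable names are illustrative):

```python
zero_shot = 24.7
finetuned = {
    "Science": 27.3,
    "Technology": 26.5,
    "Engineering": 24.8,
    "Math": 24.9,
}

# Gain of each finetuned model over zero-shot CLIP, in percentage points.
gains = {subject: acc - zero_shot for subject, acc in finetuned.items()}
average_gain = sum(gains.values()) / len(gains)
```

The per-subject gains are about 2.6, 1.8, 0.1, and 0.2 points, which average to roughly 1.2 points.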

Appendix C Additional Related Work

In addition to the vision-language foundation models included in the main text, we expand the discussion to some recent models, including BLIP-2 (Li et al., 2023), EVA-CLIP (Sun et al., 2023), and KOSMOS-2 (Peng et al., 2023). BLIP-2 provides a versatile and efficient pre-training strategy that enhances vision-language pre-training by utilizing frozen pre-trained image encoders and frozen large language models. EVA-CLIP proposes a series of methods to increase the training efficiency of the CLIP model. KOSMOS-2 enables new capabilities for perceiving object descriptions. This work focuses on creating a dataset to evaluate multimodal STEM understanding, and we chose foundation models like CLIP for a pilot study on our dataset. There are also benchmarks targeting formal math reasoning (Zheng et al., 2022; Liu et al., 2023; Xiong et al., 2023b). However, they are all restricted to the text modality and cannot evaluate fundamental skills.


Appendix D Summary of Skills

We list all skills in STEM in Tables 18 to 20 and show some examples in Tables 21 to 27.

Subject: Technology

Description: This is a(n old) logo of which famous app or program?

Picture: Measuring Vision-Language STEM Skills of Neural Models (41)

Choices: [Microsoft Office Outlook, Microsoft Office OneDrive, OfficeSuite Pro, Opera, ]

Answer index: 3

Subject: Technology

Description: What kind of computer component do you see here?

Picture: Measuring Vision-Language STEM Skills of Neural Models (42)

Choices: [TV Tuner Card, PC Card, Motherboard, Modem Card, ]

Answer index: 2

Subject: Technology

Description: This is (part of) a (former) logo of which computer related brand?

Picture: Measuring Vision-Language STEM Skills of Neural Models (43)

Choices: [ASRock, Amiga Inc., Arctic, ATI Technologies, ]

Answer index: 3

Subject: Technology

Description: This is (part of) a (former) logo of which computer related brand?

Picture: Measuring Vision-Language STEM Skills of Neural Models (44)

Choices: [Fujitsu, Samsung, Iiyama, Brother, ]

Answer index: 2

Subject: Technology

Description: This is (part of) a (former) logo of which computer related brand?

Picture: Measuring Vision-Language STEM Skills of Neural Models (45)

Choices: [Xiaomi, Cisco, Intel, Wii, ]

Answer index: 3

Subject: Technology

Description: What kind of computer component do you see here?

Picture: Measuring Vision-Language STEM Skills of Neural Models (46)

Choices: [Display Adapter/Video Card, PC Card, Power Supply Unit, Hard Disk Drive, ]

Answer index: 3

Subject: Technology

Description: What meaning or function is usually associated with this web interface symbol?

Picture: Measuring Vision-Language STEM Skills of Neural Models (47)

Choices: [Paste, Search, Tip/Idea, Calendar/Event, ]

Answer index: 2

Subject: Technology

Description: This is a(n old) logo of which famous app or program?

Picture: Measuring Vision-Language STEM Skills of Neural Models (48)

Choices: [YouTube Music, Beats Music, MX Player, YouTube, ]

Answer index: 2

Subject: Technology

Description: What kind of computer related plug or port do you see here?

Picture: Measuring Vision-Language STEM Skills of Neural Models (49)

Choices: [USB type-C plug, DVI plug (type D), HDMI plug, 3.5mm Audio Cable plug, ]

Answer index: 0

Subject: Technology

Description: Identify this font type

Picture: Measuring Vision-Language STEM Skills of Neural Models (50)

Choices: [Lucida MT, News Gothic MT, Fixedsys, Courier New, ]

Answer index: 2

Subject: Technology

Description: Identify this font type

Picture: Measuring Vision-Language STEM Skills of Neural Models (51)

Choices: [Commercial Script BT, Brush Script MT, Vivaldi D, ShelleyVolante BT, ]

Answer index: 3

Subject: Technology

Description: Identify this font type

Picture: Measuring Vision-Language STEM Skills of Neural Models (52)

Choices: [Garamond, Times New Roman, Courier New, Georgia, ]

Answer index: 3

Subject: Technology

Description: This is (part of) a (former) logo of which computer related brand?

Picture: Measuring Vision-Language STEM Skills of Neural Models (53)

Choices: [BenQ, Lexmark, Creative Technology, Lenovo, ]

Answer index: 2

Subject: Technology

Description: Identify this font type

Picture: Measuring Vision-Language STEM Skills of Neural Models (54)

Choices: [Webdings, Courier, Impact, System, ]

Answer index: 3

Subject: Technology

Description: Identify this font type

Picture: Measuring Vision-Language STEM Skills of Neural Models (55)

Choices: [Serifa BT, Stylus ITC, Calisto MT, Tempus Sans ITC, ]

Answer index: 0

Subject: Technology

Description: What meaning or function is usually associated with this web interface symbol?

Picture: Measuring Vision-Language STEM Skills of Neural Models (56)

Choices: [Pin/Make something sticky, Storage for deleted files, Options/Settings, Print (preview), ]

Answer index: 1

Subject: Technology

Description: What meaning or function is usually associated with this web interface symbol?

Picture: Measuring Vision-Language STEM Skills of Neural Models (57)

Choices: [Zoom in, Help, Like something, Link select, ]

Answer index: 3

Subject: Technology

Description: What meaning or function is usually associated with this web interface symbol?

Picture: Measuring Vision-Language STEM Skills of Neural Models (58)

Choices: [Apply, Options/Settings, Reload/Refresh, Download, ]

Answer index: 2

Subject: Technology

Description: What type of video game console do you see here?

Picture: Measuring Vision-Language STEM Skills of Neural Models (59)

Choices: [Mattel Intellivision, Sega Master System, Magnavox Odyssey 2, Atari 5200, ]

Answer index: 3

Subject: Technology

Description: What meaning or function is usually associated with this web interface symbol?

Picture: Measuring Vision-Language STEM Skills of Neural Models (60)

Choices: [Find, Delete, Attachment, Calendar/Event, ]

Answer index: 0

Subject: Engineer

Description: Select the gloves.

Picture: None

Choices: Measuring Vision-Language STEM Skills of Neural Models (61)Measuring Vision-Language STEM Skills of Neural Models (62)Measuring Vision-Language STEM Skills of Neural Models (63)Measuring Vision-Language STEM Skills of Neural Models (64)

Answer index: 0

Subject: Engineer

Description: In this experiment, which were part of an experimental group? The passage below describes an experiment. Lucy and Erik were taking a snowboarding class. During the class, their instructor said they would go faster if they applied wax to the undersides of their snowboards. After the class, Lucy applied a thin layer of wax to the underside of a snowboard and rode the board straight down a hill. Then, she removed the wax and rode the snowboard straight down the hill again. Erik timed how long each ride took. Lucy repeated these rides on four other snowboards, alternating whether she first rode with or without wax.

Picture: Measuring Vision-Language STEM Skills of Neural Models (65)

Choices: [the snowboards with wax removed, the snowboards with wax added, ]

Answer index: 1

Subject: Engineer

Description: Select the test tube.

Picture: None

Choices: Measuring Vision-Language STEM Skills of Neural Models (66)Measuring Vision-Language STEM Skills of Neural Models (67)Measuring Vision-Language STEM Skills of Neural Models (68)Measuring Vision-Language STEM Skills of Neural Models (69)

Answer index: 2

Subject: Engineer

Description: Select the funnel.

Picture: None

Choices: Measuring Vision-Language STEM Skills of Neural Models (70)Measuring Vision-Language STEM Skills of Neural Models (71)Measuring Vision-Language STEM Skills of Neural Models (72)Measuring Vision-Language STEM Skills of Neural Models (73)

Answer index: 2

Subject: Engineer

Description: Select the round-bottom flask.

Picture: None

Choices: Measuring Vision-Language STEM Skills of Neural Models (74)Measuring Vision-Language STEM Skills of Neural Models (75)Measuring Vision-Language STEM Skills of Neural Models (76)Measuring Vision-Language STEM Skills of Neural Models (77)

Answer index: 1

Subject: Engineer

Description: In this experiment, which were part of an experimental group? The passage below describes an experiment. Kimberly grew roses for a flower shop. One day, she noticed tumor-like growths on her rose stems. She could tell that the plants had crown gall disease, which is caused by a type of bacteria. She knew that allicin, a chemical in garlic, can kill bacteria. Kimberly wondered if spraying her plants with garlic juice would prevent more tumors from forming on her plants. Once a day, Kimberly sprayed garlic juice on ten infected plants and left another 10 infected plants unsprayed. After one month, she compared the number of new tumors on plants in the two groups.

Picture: Measuring Vision-Language STEM Skills of Neural Models (78)

Choices: [the roses sprayed with garlic juice, the roses that were not sprayed, ]

Answer index: 0

Subject: Engineer

Description: Which of the following could Kendra’s test show? Wind turbines use wind power to produce electricity. Kendra was a materials engineer who designed wind turbines. She wanted to design a new turbine that would produce 10% more electricity than older wind turbines. She thought that a turbine made from lightweight material would turn more easily and produce more electricity. So, Kendra created a computer model of a turbine made from lightweight material. Then she used the model to calculate how much more electricity the new turbine could produce compared to the older turbines. The passage below describes how the engineering-design process was used to test a solution to a problem. Read the passage. Then answer the question below.

Picture: Measuring Vision-Language STEM Skills of Neural Models (79)

Choices: [how much the new turbine would weigh, whether the new turbine could produce 10% more electricity, if the new turbine could turn easily, ]

Answer index: 1

Subject: Engineer

Description: In this experiment, which were part of an experimental group? The passage below describes an experiment. Isaac and his friend Belle flew nylon kites on the beach. They wondered if putting a tail on a kite would affect how well the kite flew. Isaac flew a kite that did not have a tail for five minutes. Then, he attached a four-foot-long tail and flew the kite for five more minutes. Isaac repeated this with three similar kites, alternating whether he started the kite with or without a tail. During each flight, Belle counted the number of times the kite crashed to the ground.

Picture: Measuring Vision-Language STEM Skills of Neural Models (80)

Choices: [the kites without tails, the kites with tails, ]

Answer index: 1

Subject: Engineer

Description: Identify the question that Bryant and Lamar’s experiment can best answer. The passage below describes an experiment. Read the passage and then follow the instructions below. Bryant placed a ping pong ball in a catapult, pulled the catapult’s arm back to a 45° angle, and launched the ball. Then, Bryant launched another ping pong ball, this time pulling the catapult’s arm back to a 30° angle. With each launch, his friend Lamar measured the distance between the catapult and the place where the ball hit the ground. Bryant and Lamar repeated the launches with ping pong balls in four more identical catapults. They compared the distances the balls traveled when launched from a 45° angle to the distances the balls traveled when launched from a 30° angle.

Picture: Measuring Vision-Language STEM Skills of Neural Models (81)

Choices: [Do ping pong balls stop rolling along the ground sooner after being launched from a 30° angle or a 45° angle?, Do ping pong balls travel farther when launched from a 30° angle compared to a 45° angle?, ]

Answer index: 1

Subject: Engineer

Description: Select the Erlenmeyer flask.

Picture: None

Choices: [image (82), image (83), image (84), image (85)]

Answer index: 3

Subject: Engineer

Description: Which of the following could Ivan’s test show? Ivan was a landscape architect who was hired to design a new city park. The city council wanted the park to have space for outdoor concerts and to have at least 20% of the park shaded by trees. Ivan thought the concert area should be at least 150 meters from the road so traffic noise didn’t interrupt the music. He developed three possible designs for the park with the concert area in a different location in each design. Then, he tested each design by measuring the distance between the road and the concert area. The passage below describes how the engineering-design process was used to test a solution to a problem. Read the passage. Then answer the question below.

Picture: Measuring Vision-Language STEM Skills of Neural Models (86)

Choices: [if at least 20% of the park would be shaded by trees in each design, which design would have the greatest distance between the concert area and the road, which design would have the least traffic noise in the concert area, ]

Answer index: 1

Subject: Engineer

Description: Select the beaker.

Picture: None

Choices: [image (87), image (88), image (89), image (90)]

Answer index: 3

Subject: Engineer

Description: Identify the question that Zeke’s experiment can best answer. The passage below describes an experiment. Read the passage and then follow the instructions below. Zeke divided 40 unripe bananas evenly among eight paper bags and sealed the bags. He poked 20 small holes in four of the bags and left the other four without holes. He kept the bags at room temperature for three days. Then, Zeke opened the bags and counted the number of brown spots on each banana. He compared the average number of brown spots on bananas from bags with holes to the average number of brown spots on bananas from bags without holes.

Picture: Measuring Vision-Language STEM Skills of Neural Models (91)

Choices: [Do bananas develop more brown spots if they are kept in bags with holes compared to bags without holes?, Do bananas develop more brown spots when they are kept at room temperature compared to in a cold refrigerator?, ]

Answer index: 0

Subject: Engineer

Description: Hint: An independent variable is a variable whose effect you are investigating. A dependent variable is a variable that you measure. Which of the following was an independent variable in this experiment? The passage below describes an experiment. Read the passage and think about the variables that are described. Tyler designed an electric circuit to test how well different types of metal conduct electricity. The circuit included a battery, a light bulb, wires, and clips that could be attached to a sheet of metal. If the metal conducted electricity poorly, the light bulb would appear dim. If the metal conducted electricity well, the light bulb would appear bright. Tyler collected nine equally sized sheets of metal: three sheets of copper, three sheets of iron, and three sheets of aluminum. He used the clips to attach each metal sheet, one sheet at a time, to the circuit. For each sheet, Tyler used a light meter to measure how much light the bulb produced.

Picture: Measuring Vision-Language STEM Skills of Neural Models (92)

Choices: [the amount of light produced by the light bulb, the type of metal sheet used in the circuit, ]

Answer index: 1

Subject: Engineer

Description: Identify the question that Devon’s experiment can best answer. The passage below describes an experiment. Read the passage and then follow the instructions below. Devon poured four ounces of water into each of six glasses. Devon dissolved one tablespoon of salt in each of three glasses, and did not add salt to the other three. Then, Devon placed an egg in one glass and observed if the egg floated. She removed the egg and dried it. She repeated the process with the other five glasses, recording each time if the egg floated. Devon repeated this test with two more eggs and counted the number of times the eggs floated in fresh water compared to salty water.

Picture: Measuring Vision-Language STEM Skills of Neural Models (93)

Choices: [Does the amount of water in a glass affect whether eggs sink or float in the water?, Are eggs more likely to float in fresh water or salty water?, ]

Answer index: 1

Subject: Engineer

Description: Which of the following could Luke’s test show? Luke had a cookie recipe that made soft, thick cookies. But he preferred crunchy cookies. Luke read that using different types of sugar affects how firm the cookies are. His recipe used both white and brown sugar, so he decided to see if the cookies would be crunchy if he didn’t use any brown sugar. Luke baked a batch of cookies using his recipe, but he left out the brown sugar and doubled the amount of white sugar. He baked the cookies for the same amount of time as in his original recipe. After the cookies finished baking and cooling, he tried one to find out how firm it was. The passage below describes how the engineering-design process was used to test a solution to a problem. Read the passage. Then answer the question below.

Picture: Measuring Vision-Language STEM Skills of Neural Models (94)

Choices: [if cookies made with only white sugar were soft, if baking cookies for longer made them more crunchy, if cookies made with double the amount of brown sugar were crunchy, ]

Answer index: 0

Subject: Engineer

Description: Identify the question that Myra’s experiment can best answer. The passage below describes an experiment. Read the passage and then follow the instructions below. Myra glued lids onto 16 cardboard shoe boxes of equal size. She painted eight of the boxes black and eight of the boxes white. Myra made a small hole in the side of each box and then stuck a thermometer partially into each hole so she could measure the temperatures inside the boxes. She placed the boxes in direct sunlight in her backyard. Two hours later, she measured the temperature inside each box. Myra compared the average temperature inside the black boxes to the average temperature inside the white boxes.

Picture: Measuring Vision-Language STEM Skills of Neural Models (95)

Choices: [Do the temperatures inside boxes depend on the sizes of the boxes?, Do the insides of white boxes get hotter than the insides of black boxes when the boxes are left in the sun?, ]

Answer index: 1

Subject: Engineer

Description: Which of the following could Zoe and Evelyn’s test show? Zoe and Evelyn were making batches of concrete for a construction project. To make the concrete, they mixed together dry cement powder, gravel, and water. Then, they checked if each batch was firm enough using a test called a slump test. They poured some of the fresh concrete into an upside-down metal cone. They left the concrete in the metal cone for 30 seconds. Then, they lifted the cone to see if the concrete stayed in a cone shape or if it collapsed. If the concrete in a batch collapsed, they would know the batch should not be used. The passage below describes how the engineering-design process was used to test a solution to a problem. Read the passage. Then answer the question below.

Picture: Measuring Vision-Language STEM Skills of Neural Models (96)

Choices: [if the concrete from each batch took the same amount of time to dry, if a new batch of concrete was firm enough to use, ]

Answer index: 1

Subject: Engineer

Description: Identify the question that Belle’s experiment can best answer. The passage below describes an experiment. Read the passage and then follow the instructions below. Belle planted 25 tomato seeds one-half inch below the soil surface in each of six pots. Belle added an equal amount of fertilizer to three of the six pots. She placed the pots in a plant growth chamber where all the seeds experienced the same temperature, amount of light, and humidity level. After two weeks, Belle counted the number of seedlings that grew in each pot. She compared the number of seedlings in the pots with fertilizer to the number of seedlings in the pots without fertilizer.

Picture: Measuring Vision-Language STEM Skills of Neural Models (97)

Choices: [Do more tomato seedlings grow when they are planted in soil with fertilizer compared to soil without fertilizer?, Does the humidity level where tomato seeds are planted affect the number of tomato seedlings that grow?, ]

Answer index: 0

Subject: Engineer

Description: In this experiment, which were part of a control group? The passage below describes an experiment. After a severe winter storm, Sandeep’s driveway was covered with ice. He read that salt makes ice melt at a lower temperature. Before covering his entire driveway with salt, he wanted to know if adding salt could actually help melt ice in the freezing outdoor temperatures. Sandeep weighed twenty ice cubes. He sprinkled salt on half of the ice cubes and left the other half unsalted. He placed all the ice cubes outside. One hour later, Sandeep quickly dried each ice cube and reweighed it to see how much it had melted.

Picture: Measuring Vision-Language STEM Skills of Neural Models (98)

Choices: [the salted ice cubes, the unsalted ice cubes, ]

Answer index: 1

Subject: Science

Description: Select the gray mineral.

Picture: None

Choices: [image (99), image (100)]

Answer index: 1

Subject: Science

Description: Select the one substance that is not a mineral.

Picture: None

Choices: [image (101), image (102), image (103)]

Answer index: 0

Subject: Science

Description: This organism is a spot-fin porcupinefish. Its scientific name is Diodon hystrix. Select the organism in the same genus as the spot-fin porcupinefish.

Picture: Measuring Vision-Language STEM Skills of Neural Models (104)

Choices: [image (105), image (106), image (107)]

Answer index: 1

Subject: Science

Description: Select the liquid.

Picture: None

Choices: [image (108), image (109), image (110), image (111)]

Answer index: 3

Subject: Science

Description: Fish are a group of animals with similar traits. The following traits can be used to identify fish: They have fins, not limbs. They make eggs with no shells. Observe the animals and read the descriptions. Select the one animal that has all of the fish traits listed above. Brown pelicans live along the west coast of North America. They dive underwater to catch fish in their beaks. Brown pelicans keep their eggs warm by standing on the shells with their large, webbed feet. Salmon lay eggs with no shells at the bottom of freshwater streams. Salmon use their powerful fins to swim. They can even jump up small waterfalls!

Picture: None

Choices: [image (112), image (113)]

Answer index: 0

Subject: Science

Description: Two identical blocks are heated to different temperatures. The blocks are placed so that they touch each other. Heat can flow from one block to another but cannot escape from the blocks. Later, the temperature of each block is measured again. Which pair of temperatures is possible?

Picture: Measuring Vision-Language STEM Skills of Neural Models (114)

Choices: [image (115), image (116)]

Answer index: 0

Subject: Science

Description: Two solid blocks are heated to the temperatures shown. The blocks are placed so they touch. Which diagram shows the direction heat will flow?

Picture: None

Choices: [image (117), image (118), image (119)]

Answer index: 0

Subject: Science

Description: Select the plant.

Picture: None

Choices: [image (120), image (121)]

Answer index: 0

Subject: Science

Description: Use the data to answer the question below. Is the following statement about our solar system true or false? Of the four smallest planets, two are made mainly of gas.

Picture: Measuring Vision-Language STEM Skills of Neural Models (122)

Choices: [false, true, ]

Answer index: 0

Subject: Science

Description: Think about the magnetic force between the magnets in each pair. Which of the following statements is true? The images below show two pairs of magnets. The magnets in different pairs do not affect each other. All the magnets shown are made of the same material, but some of them are different sizes.

Picture: Measuring Vision-Language STEM Skills of Neural Models (123)

Choices: [The magnitude of the magnetic force is smaller in Pair 2., The magnitude of the magnetic force is the same in both pairs., The magnitude of the magnetic force is smaller in Pair 1., ]

Answer index: 0

Subject: Science

Description: Think about the magnetic force between the magnets in each pair. Which of the following statements is true? The images below show two pairs of magnets. The magnets in different pairs do not affect each other. All the magnets shown are made of the same material.

Picture: Measuring Vision-Language STEM Skills of Neural Models (124)

Choices: [The magnetic force is stronger in Pair 2., The magnetic force is stronger in Pair 1., The strength of the magnetic force is the same in both pairs., ]

Answer index: 0

Subject: Science

Description: Select the gas.

Picture: None

Choices: [image (125), image (126), image (127), image (128)]

Answer index: 2

Subject: Science

Description: Select the plant.

Picture: None

Choices: [image (129), image (130), image (131), image (132)]

Answer index: 0

Subject: Science

Description: Which property matches this object? Select the better answer.

Picture: Measuring Vision-Language STEM Skills of Neural Models (133)

Choices: [soft, smooth, ]

Answer index: 1

Subject: Science

Description: Select the animal that does not have a backbone. Hint: Insects, spiders, and worms do not have backbones.

Picture: None

Choices: [image (134), image (135)]

Answer index: 0

Subject: Science

Description: The diagram below is a model of two solutions. Each green ball represents one particle of solute. Which solution has a higher concentration of green particles?

Picture: Measuring Vision-Language STEM Skills of Neural Models (136)

Choices: [neither; their concentrations are the same, Solution A, Solution B, ]

Answer index: 2

Subject: Science

Description: Two solid blocks are at different temperatures. The blocks are touching. Which picture shows how heat will move?

Picture: None

Choices: [image (137), image (138)]

Answer index: 1

Subject: Science

Description: Select the chemical formula for this molecule.

Picture: Measuring Vision-Language STEM Skills of Neural Models (139)

Choices: [H2C, HCl, HC, HCl2, ]

Answer index: 1

Subject: Science

Description: Which statement best describes the climate of Bangor? Hint: Summers in the Northern Hemisphere occur in June, July, and August. Winters in the Northern Hemisphere occur in December, January, and February. Bangor, Maine, is a city in the United States. It has a warm summer continental climate.

Picture: Measuring Vision-Language STEM Skills of Neural Models (140)

Choices: [Summers have higher temperatures and slightly more precipitation than winters., On average, On average, ]

Answer index: 1

Subject: Science

Description: Select the temperature shown by this thermometer.

Picture: Measuring Vision-Language STEM Skills of Neural Models (141)

Choices: [13°F, 61°F, 56°F, ]

Answer index: 2

Subject: Math

Description: This table shows Jason’s January budget. What could Jason do to balance his budget?

Picture: Measuring Vision-Language STEM Skills of Neural Models (142)

Choices: [increase income from shoveling snow to $40, spend $15 less at Pizza Palace, spend only $40 at the arcade, spend $20 more on video games, ]

Answer index: 2

Subject: Math

Description: In solving this triangle, which law must you use first?

Picture: Measuring Vision-Language STEM Skills of Neural Models (143)

Choices: [Law of Cosines, Law of Sines, ]

Answer index: 1

Subject: Math

Description: Is this angle acute, right, obtuse, or straight?

Picture: Measuring Vision-Language STEM Skills of Neural Models (144)

Choices: [straight, obtuse, acute, right, ]

Answer index: 1

Subject: Math

Description: Look at this cube: If the side lengths are tripled, then which of the following statements about its volume will be true?

Picture: Measuring Vision-Language STEM Skills of Neural Models (145)

Choices: [The ratio of the new volume to the old volume will be 81:1., The ratio of the new volume to the old volume will be 1:8., The ratio of the new volume to the old volume will be 3:1., The ratio of the new volume to the old volume will be 27:1., ]

Answer index: 3

Subject: Math

Description: Which shape is a cone?

Picture: None

Choices: [image (146), image (147), image (148)]

Answer index: 2

Subject: Math

Description: Is the function f(x) continuous on the open interval (3,7)?

Picture: Measuring Vision-Language STEM Skills of Neural Models (149)

Choices: [no, yes, ]

Answer index: 1

Subject: Math

Description: Use the diagram to help you answer the question below. Which of the following is a rational number but not an integer?

Picture: Measuring Vision-Language STEM Skills of Neural Models (150)

Choices: [–123, 83, 194, 6.53, ]

Answer index: 3

Subject: Math

Description: Is this polygon a trapezoid?

Picture: Measuring Vision-Language STEM Skills of Neural Models (151)

Choices: [no, yes, ]

Answer index: 1

Subject: Math

Description: Look at the colored part of each shape. Which shape shows one-third?

Picture: None

Choices: [image (152), image (153), image (154), image (155)]

Answer index: 0

Subject: Math

Description: Which shape has 5 equal sides?

Picture: None

Choices: [image (156), image (157), image (158)]

Answer index: 1

Subject: Math

Description: Identify the cross section of this object. Assume objects are perpendicular if they appear so.

Picture: Measuring Vision-Language STEM Skills of Neural Models (159)

Choices: [image (160), image (161), image (162), image (163)]

Answer index: 0

Subject: Math

Description: Are there more circles or triangles?

Picture: Measuring Vision-Language STEM Skills of Neural Models (164)

Choices: [circles, triangles, ]

Answer index: 0

Subject: Math

Description: Look at this shape: Which image shows a reflection?

Picture: Measuring Vision-Language STEM Skills of Neural Models (165)

Choices: [C, A, B, ]

Answer index: 2

Subject: Math

Description: Is the function f(x) continuous?

Picture: Measuring Vision-Language STEM Skills of Neural Models (166)

Choices: [no, yes, ]

Answer index: 0

Subject: Math

Description: An ice cream sundae costs 1 dollar and 41 cents. Do you have enough money to buy it?

Picture: Measuring Vision-Language STEM Skills of Neural Models (167)

Choices: [yes, no, ]

Answer index: 0

Subject: Math

Description: Which shape has a triangle as a face?

Picture: None

Choices: [image (168), image (169)]

Answer index: 1

Subject: Math

Description: Look at this figure: What is the shape of its bases?

Picture: Measuring Vision-Language STEM Skills of Neural Models (170)

Choices: [decagon, octagon, rectangle, circle, ]

Answer index: 2

Subject: Math

Description: The graph below shows a function. Is its inverse also a function?

Picture: Measuring Vision-Language STEM Skills of Neural Models (171)

Choices: [yes, no, ]

Answer index: 0

Subject: Math

Description: What is the range of this exponential function?

Picture: Measuring Vision-Language STEM Skills of Neural Models (172)

Choices: [image (173), image (174), image (175), image (176), image (177)]

Answer index: 0

Subject: Math

Description: Look at this graph: Is this relation a function?

Picture: Measuring Vision-Language STEM Skills of Neural Models (178)

Choices: [no, yes, ]

Answer index: 1
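
The records above share a fixed field layout (Subject, Description, Picture, Choices, Answer index). As a minimal sketch of that structure, the Python below represents one record and looks up its answer. It assumes the answer index is zero-based, which is consistent with the examples listed here (for instance, index 3 selects the 27:1 ratio for the tripled cube). The names `StemExample`, `answer_text`, and `accuracy` are hypothetical and are not taken from the released dataset code.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class StemExample:
    """One multiple-choice record (hypothetical container, not the authors' code)."""
    subject: str
    description: str
    picture: Optional[str]  # image reference, or None when the record has no picture
    choices: List[str]      # text choices; image choices would be image references
    answer_index: int       # assumed zero-based position of the correct choice

    def answer_text(self) -> str:
        """Return the choice selected by the answer index."""
        return self.choices[self.answer_index]


def accuracy(predictions: Sequence[int], examples: Sequence[StemExample]) -> float:
    """Fraction of examples whose predicted index equals the answer index."""
    if len(predictions) != len(examples):
        raise ValueError("predictions and examples must have the same length")
    if not examples:
        return 0.0
    correct = sum(p == ex.answer_index for p, ex in zip(predictions, examples))
    return correct / len(examples)


# The cube record from the Math examples above, transcribed by hand.
cube = StemExample(
    subject="Math",
    description=(
        "Look at this cube: If the side lengths are tripled, then which of "
        "the following statements about its volume will be true?"
    ),
    picture="image (145)",
    choices=[
        "The ratio of the new volume to the old volume will be 81:1.",
        "The ratio of the new volume to the old volume will be 1:8.",
        "The ratio of the new volume to the old volume will be 3:1.",
        "The ratio of the new volume to the old volume will be 27:1.",
    ],
    answer_index=3,
)

print(cube.answer_text())
print(accuracy([3], [cube]))
```

Scoring by index rather than by choice text keeps the same function usable for records whose choices are images.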

SubjectGradeSkills
Science grade-2classify-fruits-and-vegetables-as-plant-parts, classify-matter-as-solid-liquid-or-gas, classify-matter-as-solid-or-liquid, classify-rocks-and-minerals-by-color-and-shape, compare-properties-of-materials, compare-properties-of-objects, compare-temperatures-on-thermometers, find-evidence-of-changes-to-earths-surface, identify-animals-with-and-without-backbones, identify-earth-s-land-features, identify-living-and-nonliving-things, identify-magnets-that-attract-or-repel, identify-mammals-birds-fish-reptiles-and-amphibians, identify-materials-in-objects, identify-plants-and-animals, identify-properties-of-an-object, identify-pushes-and-pulls, identify-solids-and-liquids, identify-solids-liquids-and-gases, identifying-mixtures, natural-resources, predict-heat-flow, read-a-thermometer
grade-3animal-adaptations-beaks-mouths-and-necks, animal-adaptations-feet-and-limbs, animal-adaptations-skins-and-body-coverings, classify-fruits-and-vegetables-as-plant-parts, classify-matter-as-solid-liquid-or-gas, classify-rocks-and-minerals-by-color-shape-and-texture, classify-rocks-as-igneous-sedimentary-or-metamorphic, compare-ancient-and-modern-organisms-use-observations-to-support-a-hypothesis, compare-properties-of-materials, compare-properties-of-objects, compare-strengths-of-magnetic-forces, compare-temperatures-on-thermometers, find-evidence-of-changes-to-earths-surface, how-do-balanced-and-unbalanced-forces-affect-motion, identify-earth-s-land-features, identify-ecosystems, identify-living-and-nonliving-things, identify-magnets-that-attract-or-repel, identify-mammals-birds-fish-reptiles-and-amphibians, identify-materials-in-objects, identify-minerals-using-properties, identify-plants-and-animals, identify-properties-of-an-object, identify-pushes-and-pulls, identify-rocks-using-properties, identify-roles-in-food-chains, identify-solids-liquids-and-gases, identify-vertebrates-and-invertebrates, interpret-food-webs, natural-resources, predict-heat-flow, predict-temperature-changes, read-a-thermometer, use-climate-data-to-make-predictions, use-data-to-describe-u-s-climates, use-data-to-describe-world-climates, weather-and-climate-around-the-world
grade-4animal-adaptations-beaks-mouths-and-necks, animal-adaptations-feet-and-limbs, animal-adaptations-skins-and-body-coverings, classify-fruits-and-vegetables-as-plant-parts, classify-rocks-as-igneous-sedimentary-or-metamorphic, compare-amplitudes-and-wavelengths-of-waves, compare-ancient-and-modern-organisms-use-observations-to-support-a-hypothesis, compare-properties-of-materials, compare-properties-of-objects, compare-strengths-of-magnetic-forces, compare-temperatures-on-thermometers, describe-classify-and-compare-kingdoms, evaluate-natural-energy-sources, how-do-balanced-and-unbalanced-forces-affect-motion, identify-and-classify-fossils, identify-and-sort-solids-liquids-and-gases, identify-common-and-scientific-names, identify-directions-of-forces, identify-earths-land-features-using-photographs, identify-earths-land-features-using-satellite-images, identify-ecosystems, identify-living-and-nonliving-things, identify-magnets-that-attract-or-repel, identify-mammals-birds-fish-reptiles-and-amphibians, identify-minerals-using-properties, identify-phases-of-the-moon, identify-rocks-using-properties, identify-roles-in-food-chains, identify-vertebrates-and-invertebrates, interpret-food-webs, origins-of-scientific-names, predict-heat-flow, predict-temperature-changes, read-a-thermometer, use-climate-data-to-make-predictions, use-data-to-describe-climates, use-evidence-to-classify-animals, use-evidence-to-classify-mammals-birds-fish-reptiles-and-amphibians, use-scientific-names-to-classify-organisms, weather-and-climate-around-the-world
grade-5animal-adaptations-beaks-mouths-and-necks, animal-adaptations-feet-and-limbs, animal-adaptations-skins-and-body-coverings, classify-elementary-substances-and-compounds-using-models, classify-fruits-and-vegetables-as-plant-parts, classify-rocks-as-igneous-sedimentary-or-metamorphic, compare-amplitudes-and-wavelengths-of-waves, compare-ancient-and-modern-organisms-use-observations-to-support-a-hypothesis, compare-magnitudes-of-magnetic-forces, compare-properties-of-objects, describe-classify-and-compare-kingdoms, evaluate-natural-energy-sources, flowering-plant-and-conifer-life-cycles, how-do-balanced-and-unbalanced-forces-affect-motion, identify-and-classify-fossils, identify-common-and-scientific-names, identify-directions-of-forces, identify-earths-land-features-using-photographs, identify-earths-land-features-using-satellite-images, identify-ecosystems, identify-magnets-that-attract-or-repel, identify-mammals-birds-fish-reptiles-and-amphibians, identify-phases-of-the-moon, identify-rocks-and-minerals, identify-roles-in-food-chains, identify-the-photosynthetic-organism, identify-vertebrates-and-invertebrates, match-chemical-formulas-to-ball-and-stick-models, moss-and-fern-life-cycles, origins-of-scientific-names, predict-heat-flow, predict-temperature-changes, use-data-to-describe-climates, use-evidence-to-classify-animals, use-evidence-to-classify-mammals-birds-fish-reptiles-and-amphibians, use-scientific-names-to-classify-organisms, weather-and-climate-around-the-world
grade-6analyze-data-to-compare-properties-of-planets, classify-elementary-substances-and-compounds-using-models, classify-rocks-as-igneous-sedimentary-or-metamorphic, classify-symbiotic-relationships, compare-ages-of-fossils-in-a-rock-sequence, compare-amplitudes-wavelengths-and-frequencies-of-waves, compare-concentrations-of-solutions, compare-magnitudes-of-magnetic-forces, compare-thermal-energy-transfers, describe-populations-communities-and-ecosystems, describe-tectonic-plate-boundaries-around-the-world, describe-the-effects-of-gene-mutations-on-organisms, diffusion-across-membranes, flowering-plant-and-conifer-life-cycles, identify-and-compare-air-masses, identify-common-and-scientific-names, identify-earths-land-features-using-photographs, identify-earths-land-features-using-satellite-images, identify-ecosystems, identify-elementary-substances-and-compounds-using-models, identify-how-particle-motion-affects-temperature-and-pressure, identify-phases-of-the-moon, identify-rocks-and-minerals, identify-the-photosynthetic-organism, match-chemical-formulas-to-ball-and-stick-models, moss-and-fern-life-cycles, origins-of-scientific-names, predict-heat-flow-and-temperature-changes, use-data-to-describe-climates, use-scientific-names-to-classify-organisms, weather-and-climate-around-the-world
grade-7analyze-data-to-compare-properties-of-planets, angiosperm-and-conifer-life-cycles, classify-elementary-substances-and-compounds-using-models, classify-rocks-as-igneous-sedimentary-or-metamorphic, classify-symbiotic-relationships, compare-ages-of-fossils-in-a-rock-sequence, compare-amplitudes-wavelengths-and-frequencies-of-waves, compare-concentrations-of-solutions, compare-magnitudes-of-magnetic-forces, compare-thermal-energy-transfers, describe-populations-communities-and-ecosystems, describe-tectonic-plate-boundaries-around-the-world, describe-the-effects-of-gene-mutations-on-organisms, diffusion-across-membranes, identify-and-compare-air-masses, identify-chemical-formulas-for-ball-and-stick-models, identify-common-and-scientific-names, identify-ecosystems, identify-how-particle-motion-affects-temperature-and-pressure, identify-phases-of-the-moon, identify-rocks-and-minerals, identify-the-photosynthetic-organism, moss-and-fern-life-cycles, origins-of-scientific-names, predict-heat-flow-and-temperature-changes, use-data-to-describe-climates, use-scientific-names-to-classify-organisms
grade-8analyze-data-to-compare-properties-of-planets, angiosperm-and-conifer-life-cycles, classify-elementary-substances-and-compounds-using-models, classify-symbiotic-relationships, compare-ages-of-fossils-in-a-rock-sequence, compare-amplitudes-wavelengths-and-frequencies-of-waves, compare-concentrations-of-solutions, compare-magnitudes-of-magnetic-forces, compare-thermal-energy-transfers, describe-populations-communities-and-ecosystems, describe-tectonic-plate-boundaries-around-the-world, describe-the-effects-of-gene-mutations-on-organisms, diffusion-across-membranes, identify-and-compare-air-masses, identify-chemical-formulas-for-ball-and-stick-models, identify-common-and-scientific-names, identify-ecosystems, identify-how-particle-motion-affects-temperature-and-pressure, identify-phases-of-the-moon, identify-rocks-and-minerals, identify-the-photosynthetic-organism, moss-and-fern-life-cycles, origins-of-scientific-names, predict-heat-flow-and-temperature-changes, use-data-to-describe-climates, use-punnett-squares-to-calculate-probabilities-of-offspring-types, use-punnett-squares-to-calculate-ratios-of-offspring-types, use-scientific-names-to-classify-organisms
Technology - cables, font, icons, logo, parts, peripherals, photo, web, others
Engineering grade-5 identify-laboratory-tools
grade-6 evaluate-tests-of-engineering-design-solutions, identify-control-and-experimental-groups, identify-independent-and-dependent-variables, identify-laboratory-tools, identify-the-experimental-question, laboratory-safety-equipment
grade-7 evaluate-tests-of-engineering-design-solutions, identify-control-and-experimental-groups, identify-independent-and-dependent-variables, identify-laboratory-tools, identify-the-experimental-question, laboratory-safety-equipment
grade-8 identify-control-and-experimental-groups, identify-laboratory-tools, identify-the-experimental-question, laboratory-safety-equipment

Subject Grade Skills
Math algebra-1 compare-linear-functions-graphs-and-equations, compare-linear-functions-tables-graphs-and-equations, describe-linear-and-exponential-growth-and-decay, domain-and-range-of-absolute-value-functions-graphs, domain-and-range-of-exponential-functions-graphs, domain-and-range-of-square-root-functions-graphs, factor-quadratics-using-algebra-tiles, identify-direct-variation-and-inverse-variation, identify-functions, identify-functions-vertical-line-test, identify-linear-and-exponential-functions-from-graphs, identify-linear-and-exponential-functions-from-tables, identify-linear-functions-from-graphs-and-equations, identify-linear-functions-from-tables, identify-linear-quadratic-and-exponential-functions-from-graphs, identify-linear-quadratic-and-exponential-functions-from-tables, identify-proportional-relationships, interpret-a-scatter-plot, interpret-the-slope-and-y-intercept-of-a-linear-function, linear-functions-over-unit-intervals, match-exponential-functions-and-graphs-ii, model-and-solve-linear-equations-using-algebra-tiles, multiply-two-binomials-using-algebra-tiles, perimeter-and-area-changes-in-scale, perimeter-area-and-volume-changes-in-scale, special-right-triangles, surface-area-and-volume-changes-in-scale, write-compound-inequalities-from-graphs
algebra-2 classify-variation, describe-linear-and-exponential-growth-and-decay, domain-and-range-of-absolute-value-functions-graphs, domain-and-range-of-exponential-and-logarithmic-functions, domain-and-range-of-radical-functions, factor-quadratics-using-algebra-tiles, find-inverse-functions-and-relations, find-solutions-using-a-table, graphs-of-angles, identify-the-direction-a-parabola-opens, linear-functions-over-unit-intervals, match-exponential-functions-and-graphs, outliers-in-scatter-plots, solve-a-triangle
calculus describe-linear-and-exponential-growth-and-decay, determine-continuity-on-an-interval-using-graphs, determine-continuity-using-graphs, determine-one-sided-continuity-using-graphs, domain-and-range, domain-and-range-of-exponential-and-logarithmic-functions, find-inverse-functions-and-relations, find-limits-at-vertical-asymptotes-using-graphs, identify-functions, identify-graphs-of-continuous-functions

Subject Grade Skills
Math grade-1addition-sentences-up-to-10-what-does-the-model-show, addition-sentences-up-to-10-which-model-matches, addition-sentences-using-number-lines-sums-up-to-20, am-or-pm, certain-probable-unlikely-and-impossible, compare-clocks, compare-money-amounts, compare-objects-length-and-height, compare-sides-and-corners, compare-size-weight-and-capacity, compare-vertices-edges-and-faces, comparing-review, count-sides-and-corners, count-to-fill-a-ten-frame, cubes-and-rectangular-prisms, equal-sides, estimate-to-the-nearest-ten, even-or-odd, find-the-next-shape-in-a-growing-pattern, find-the-next-shape-in-a-pattern, flip-turn-and-slide, holds-more-or-less, identify-faces-of-three-dimensional-shapes, identify-fourths, identify-halves, identify-halves-and-fourths, identify-halves-thirds-and-fourths, identify-shapes-traced-from-solids, identify-thirds, interpret-bar-graphs-ii, light-and-heavy, match-analog-and-digital-clocks, match-analog-clocks-and-times, match-digital-clocks-and-times, more-less-and-equally-likely, name-the-three-dimensional-shape, name-the-two-dimensional-shape, names-and-values-of-all-coins, names-and-values-of-common-coins, open-and-closed-shapes, ordinal-numbers, purchases-do-you-have-enough-money, read-a-calendar, read-a-calendar-ii, rhombuses, select-three-dimensional-shapes, select-two-dimensional-shapes, shapes-of-everyday-objects, simple-fractions-what-fraction-does-the-shape-show, square-corners, subtraction-sentences-up-to-10-which-model-matches, subtraction-sentences-using-number-lines-up-to-10, subtraction-sentences-using-number-lines-up-to-20, symmetry, time-and-clocks-word-problems, times-of-everyday-events, two-dimensional-and-three-dimensional-shapes, which-bar-graph-is-correct, which-picture-graph-is-correct, which-table-is-correct, which-tally-chart-is-correct, wide-and-narrow
grade-2am-or-pm, certain-probable-unlikely-and-impossible, choose-the-appropriate-measuring-tool, compare-clocks, compare-sides-and-vertices, compare-vertices-edges-and-faces, correct-amount-of-change, cubes, equal-sides, equivalent-amounts-of-money-up-to-1-dollar, estimate-to-the-nearest-ten, even-or-odd, find-the-next-shape-in-a-growing-pattern, find-the-next-shape-in-a-repeating-pattern, flip-turn-and-slide, fractions-of-a-group, fractions-of-a-whole-modeling-word-problems, greatest-and-least-word-problems-up-to-100, greatest-and-least-word-problems-up-to-1000, how-much-more-to-make-a-dollar, identify-faces-of-three-dimensional-shapes, identify-fourths, identify-halves, identify-halves-thirds-and-fourths, identify-lines-of-symmetry, identify-multiplication-sentences-for-equal-groups, identify-repeated-addition-in-arrays-sums-to-10, identify-repeated-addition-in-arrays-sums-to-25, identify-shapes-traced-from-solids, identify-the-fraction, identify-thirds, interpret-bar-graphs-ii, interpret-tally-charts, match-addition-sentences-and-models-sums-to-10, match-analog-and-digital-clocks, match-analog-clocks-and-times, match-digital-clocks-and-times, more-less-and-equally-likely, name-the-three-dimensional-shape, name-the-two-dimensional-shape, names-and-values-of-all-coins, names-and-values-of-common-coins, ordinal-numbers-up-to-10th, place-value-models-up-to-hundreds, place-value-tens-and-ones, place-value-up-to-hundreds, place-value-up-to-thousands, purchases-do-you-have-enough-money-up-to-1-dollar, purchases-do-you-have-enough-money-up-to-5-dollars, read-a-calendar, read-a-calendar-ii, read-a-thermometer, select-figures-with-a-given-area, select-three-dimensional-shapes, shapes-of-everyday-objects, skip-counting-stories, symmetry, which-bar-graph-is-correct, which-picture-shows-more-up-to-5-dollars, which-shape-illustrates-the-fraction, which-table-is-correct, which-tally-chart-is-correct, write-subtraction-sentences-to-describe-pictures-up-to-18, 
write-subtraction-sentences-to-describe-pictures-up-to-two-digits
grade-3acute-obtuse-and-right-triangles, am-or-pm, angles-greater-than-less-than-or-equal-to-a-right-angle, certain-probable-unlikely-and-impossible, choose-the-appropriate-measuring-tool, compare-area-and-perimeter-of-two-figures, compare-fractions-in-recipes, compare-fractions-using-models, compare-fractions-using-number-lines, coordinate-planes-as-maps, correct-amount-of-change, division-input-output-tables-find-the-rule, find-the-next-shape-in-a-pattern, fractions-of-a-group-denominators-2-3-4-6-8, fractions-of-a-group-unit-fractions, identify-equivalent-fractions-on-number-lines, identify-faces-of-three-dimensional-shapes, identify-multiplication-expressions-for-arrays, identify-multiplication-expressions-for-equal-groups, identify-parallelograms, identify-rhombuses, identify-three-dimensional-shapes, identify-trapezoids, identify-two-dimensional-shapes, identify-unit-fractions-on-number-lines, interpret-line-graphs, is-it-a-polygon, lines-line-segments-and-rays, match-analog-and-digital-clocks, match-clocks-and-times, match-fractions-to-models-halves-thirds-and-fourths, match-mixed-numbers-to-models, multiplication-input-output-tables-find-the-rule, open-and-closed-shapes, parallel-perpendicular-and-intersecting-lines, parallel-sides-in-quadrilaterals, purchases-do-you-have-enough-money-up-to-10-dollars, read-a-calendar, read-a-thermometer, reading-schedules, reflection-rotation-and-translation, scalene-isosceles-and-equilateral-triangles, select-figures-with-a-given-area, select-fractions-equivalent-to-whole-numbers-using-models, shapes-of-everyday-objects, symmetry, which-picture-shows-more
grade-4acute-obtuse-and-right-triangles, acute-right-obtuse-and-straight-angles, angles-as-fractions-of-a-circle, angles-of-90-180-270-and-360-degrees, classify-triangles, compare-area-and-perimeter-of-two-figures, compare-decimals-using-models, compare-fractions-in-recipes, compare-fractions-using-models, compare-fractions-with-like-numerators-or-denominators-using-models, decompose-fractions-into-unit-fractions-using-models, elapsed-time, estimate-angle-measurements, find-the-next-shape-in-a-pattern, fractions-of-a-whole-word-problems, identify-equivalent-fractions-using-number-lines, identify-faces-of-three-dimensional-figures, identify-lines-of-symmetry, identify-parallel-perpendicular-and-intersecting-lines, identify-parallelograms, identify-rhombuses, identify-three-dimensional-figures, identify-trapezoids, interpret-bar-graphs, interpret-stem-and-leaf-plots, is-it-a-polygon, measure-angles-with-a-protractor, multiplication-input-output-tables-find-the-rule, multiply-fractions-by-whole-numbers-using-models, multiply-unit-fractions-by-whole-numbers-using-models, nets-of-three-dimensional-figures, parallel-perpendicular-and-intersecting-lines, parallel-sides-in-quadrilaterals, points-lines-line-segments-rays-and-angles, properties-of-three-dimensional-figures, rotational-symmetry, scalene-isosceles-and-equilateral-triangles, sides-and-angles-of-quadrilaterals, transportation-schedules, what-decimal-number-is-illustrated
grade-5acute-obtuse-and-right-triangles, adjust-a-budget, angles-of-90-180-270-and-360-degrees, classify-triangles, compare-decimals-using-grids, compare-fractions-and-mixed-numbers, compare-patterns, fractions-of-a-whole-word-problems, identify-parallelograms, identify-rhombuses, identify-three-dimensional-figures, identify-trapezoids, interpret-bar-graphs, is-it-a-polygon, line-symmetry, mean-find-the-missing-number, median-find-the-missing-number, multiplication-input-output-tables-find-the-rule, multiply-unit-fractions-by-whole-numbers-using-models, multiplying-fractions-by-whole-numbers-choose-the-model, nets-of-three-dimensional-figures, parallel-perpendicular-and-intersecting-lines, parallel-sides-in-quadrilaterals, parts-of-a-circle, points-lines-line-segments-rays-and-angles, range-find-the-missing-number, reflection-rotation-and-translation, regular-and-irregular-polygons, rotational-symmetry, rotational-symmetry-amount-of-rotation, scalene-isosceles-and-equilateral-triangles, three-dimensional-figures-viewed-from-different-perspectives, types-of-angles, understanding-probability
grade-6 absolute-value-and-integers-word-problems, changes-in-mean-median-mode-and-range, classify-rational-numbers-using-a-diagram, classify-triangles, compare-and-order-rational-numbers-using-number-lines, compare-area-and-perimeter-of-two-figures, compare-checking-accounts, front-side-and-top-view, identify-complementary-supplementary-vertical-adjacent-and-congruent-angles, identify-equivalent-expressions-using-strip-models, identify-polyhedra, identify-trapezoids, interpret-bar-graphs, interpret-double-bar-graphs, interpret-graphs-of-proportional-relationships, interpret-histograms, line-symmetry, mean-median-mode-and-range-find-the-missing-number, model-and-solve-equations-using-algebra-tiles, nets-of-three-dimensional-figures, occupations-education-and-income, quadrants, rational-numbers-find-the-sign, reflection-rotation-and-translation, rotational-symmetry, rotational-symmetry-amount-of-rotation, similar-and-congruent-figures, understanding-area-of-a-triangle, understanding-area-of-trapezoids, understanding-percents-strip-models, which-figure-is-being-described, which-is-the-better-coupon, which-model-represents-the-ratio
grade-7apply-addition-and-subtraction-rules, apply-multiplication-and-division-rules, bases-of-three-dimensional-figures, changes-in-mean-median-mode-and-range, classify-quadrilaterals, classify-rational-numbers-using-a-diagram, compare-and-order-integers, cross-sections-of-three-dimensional-figures, describe-a-sequence-of-transformations, front-side-and-top-view, identify-alternate-interior-and-alternate-exterior-angles, identify-complementary-supplementary-vertical-and-adjacent-angles, identify-equivalent-linear-expressions-using-algebra-tiles, identify-linear-and-nonlinear-functions, identify-reflections-rotations-and-translations, identify-trapezoids, identify-trends-with-scatter-plots, interpret-circle-graphs, interpret-graphs-of-proportional-relationships, line-symmetry, make-predictions-with-scatter-plots, mean-median-mode-and-range-find-the-missing-number, model-and-solve-equations-using-algebra-tiles, nets-of-three-dimensional-figures, parallel-perpendicular-and-intersecting-lines, parts-of-a-circle, perimeter-and-area-changes-in-scale, rational-numbers-find-the-sign, rotational-symmetry, rotational-symmetry-amount-of-rotation, similar-and-congruent-figures, simplify-expressions-by-combining-like-terms-with-algebra-tiles, transversals-of-parallel-lines-name-angle-pairs, which-is-the-better-coupon
grade-8angle-angle-criterion-for-similar-triangles, apply-addition-and-subtraction-rules, apply-addition-subtraction-multiplication-and-division-rules, apply-multiplication-and-division-rules, base-plans, changes-in-mean-median-mode-and-range, classify-quadrilaterals, compare-and-order-integers, compare-linear-functions-graphs-and-equations, compare-linear-functions-tables-graphs-and-equations, congruent-triangles-sss-sas-and-asa, describe-a-sequence-of-transformations, front-side-and-top-view, identify-alternate-interior-and-alternate-exterior-angles, identify-complementary-supplementary-vertical-adjacent-and-congruent-angles, identify-congruent-figures, identify-functions-graphs, identify-linear-and-nonlinear-functions-graphs-and-equations, identify-linear-and-nonlinear-functions-tables, identify-lines-of-best-fit, identify-reflections-rotations-and-translations, identify-similar-triangles, identify-trapezoids, identify-trends-with-scatter-plots, interpret-graphs-of-proportional-relationships, interpret-the-slope-and-y-intercept-of-a-linear-function, irrational-numbers-on-number-lines, line-symmetry, make-predictions-with-scatter-plots, mean-median-mode-and-range-find-the-missing-number, model-and-solve-equations-using-algebra-tiles, multiply-polynomials-using-algebra-tiles, nets-of-three-dimensional-figures, parts-of-a-circle, parts-of-three-dimensional-figures, perimeter-and-area-changes-in-scale, quadrants-and-axes, rotational-symmetry, rotational-symmetry-amount-of-rotation, similar-and-congruent-figures, transversals-of-parallel-lines-name-angle-pairs
kindergartenaddition-sentences-up-to-10-what-does-the-model-show, addition-sentences-up-to-10-which-model-matches, addition-sentences-up-to-5-what-does-the-model-show, addition-sentences-up-to-5-which-model-matches, am-or-pm, are-there-enough, circles, classify-shapes-by-color, coin-names-penny-through-quarter, compare-sides-and-corners, compare-size-weight-and-capacity, compare-two-groups-of-coins-pennies-through-dimes, cones, count-corners, count-cubes-up-to-10, count-cubes-up-to-5, count-dots-0-to-5, count-dots-up-to-10, count-money-pennies-and-nickels, count-money-pennies-through-dimes, count-on-ten-frames-up-to-10, count-pictures-up-to-10, count-pictures-up-to-3, count-pictures-up-to-5, count-scattered-shapes-up-to-10, count-scattered-shapes-up-to-5, count-shapes-in-rings-up-to-10, count-shapes-in-rows-up-to-10, count-shapes-in-rows-up-to-5, count-shapes-up-to-3, count-sides, count-sides-and-corners, count-to-100, count-to-fill-a-ten-frame, cubes, curved-parts, cylinders, different, equal-sides, fewer-and-more-compare-by-counting, fewer-and-more-compare-by-matching, fewer-and-more-compare-in-a-mixed-group, fewer-and-more-up-to-20, fewer-more-and-same, flat-and-solid-shapes, hexagons, holds-more-or-less, identify-halves-thirds-fourths, identify-pictures-with-symmetry, identify-shapes-traced-from-solids, inside-and-outside, introduction-to-symmetry, light-and-heavy, match-analog-and-digital-clocks, match-analog-clocks-and-times, match-digital-clocks-and-times, more-or-less-likely, name-the-three-dimensional-shape, name-the-two-dimensional-shape, one-less-with-pictures-up-to-10, one-less-with-pictures-up-to-5, one-more-and-one-less-with-pictures-up-to-10, one-more-with-pictures-up-to-10, one-more-with-pictures-up-to-5, ordinal-numbers-up-to-fifth, ordinal-numbers-up-to-tenth, rectangles, represent-numbers-up-to-10, represent-numbers-up-to-20, represent-numbers-with-pictures-up-to-3, represent-numbers-with-pictures-up-to-5, represent-numbers-with-shapes-up-to-3, 
represent-numbers-with-shapes-up-to-5, select-three-dimensional-shapes, select-two-dimensional-shapes, shapes-of-everyday-objects, spheres, square-corners, squares, subtraction-sentences-up-to-10-what-does-the-model-show, subtraction-sentences-up-to-10-which-model-matches, subtraction-sentences-up-to-5-what-does-the-model-show, subtraction-sentences-up-to-5-which-model-matches, take-apart-10-words, take-apart-numbers-up-to-10-words, take-apart-numbers-up-to-5-words, tall-and-short, times-of-everyday-events, triangles, wide-and-narrow
pre-kaddition-sentences-up-to-10-what-does-the-model-show, addition-sentences-up-to-10-which-model-matches, addition-sentences-up-to-5-what-does-the-model-show, addition-sentences-up-to-5-which-model-matches, are-there-enough, circles, circles-squares-and-triangles, circles-squares-triangles-and-rectangles, classify-shapes-by-color, compare-size-weight-and-capacity, cones, count-corners, count-cubes-up-to-10, count-cubes-up-to-5, count-dots-up-to-10, count-dots-up-to-3, count-dots-up-to-5, count-on-ten-frames-up-to-10, count-on-ten-frames-up-to-3, count-on-ten-frames-up-to-5, count-pennies, count-pictures-up-to-10, count-pictures-up-to-3, count-pictures-up-to-5, count-scattered-shapes-up-to-10, count-scattered-shapes-up-to-5, count-shapes-in-rings-up-to-10, count-shapes-in-rows-up-to-10, count-shapes-in-rows-up-to-5, count-shapes-up-to-3, count-sides, count-sides-and-corners, cubes, cylinders, different, dimes-and-quarters, fewer, fewer-and-more-compare-by-counting, fewer-and-more-compare-by-matching, fewer-and-more-compare-in-a-mixed-group, fewer-more-and-same, flat-and-solid-shapes, holds-more-or-less, identify-shapes-traced-from-solids, inside-and-outside, light-and-heavy, more, name-the-shape, name-the-solid-shape, one-less-with-pictures-up-to-10, one-less-with-pictures-up-to-5, one-more-with-pictures-up-to-10, one-more-with-pictures-up-to-5, ordinal-numbers-up-to-fifth, ordinal-numbers-up-to-tenth, pennies-and-nickels, pennies-nickels-dimes-and-quarters, rectangles, represent-numbers-up-to-10, represent-numbers-up-to-20, represent-numbers-with-pictures-up-to-3, represent-numbers-with-pictures-up-to-5, represent-numbers-with-shapes-up-to-3, represent-numbers-with-shapes-up-to-5, select-solid-shapes, shapes-of-everyday-objects, spheres, squares, subtraction-sentences-up-to-10-which-model-matches, subtraction-sentences-up-to-5-which-model-matches, tall-and-short, tally-marks-up-to-10, triangles, what-comes-next, wide-and-narrow
precalculus determine-continuity-on-an-interval-using-graphs, determine-continuity-using-graphs, determine-one-sided-continuity-using-graphs, find-limits-at-vertical-asymptotes-using-graphs, identify-graphs-of-continuous-functions, outliers-in-scatter-plots, solve-a-triangle

Subject: Engineer

Skill: evaluate-tests-of-engineering-design-solutions

Description: Which of the following could Eliana’s test show? Eliana was taking part in her school’s engineering competition. To win the competition, she needed to build the popsicle-stick bridge that would hold the most weight. She could use only 200 popsicle sticks. She had two different design ideas. She had to pick one of the designs to use in the competition. To test which design was strongest, Eliana built two prototypes, each with 200 popsicle sticks. She then added 1 kg weights to each prototype until one of them broke. The passage below describes how the engineering-design process was used to test a solution to a problem. Read the passage. Then answer the question below.

Picture: Measuring Vision-Language STEM Skills of Neural Models (179)

Choices: [how much weight a bridge built with 300 popsicle sticks could hold, which design could hold more weight, ]

Answer index: 1
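
The records in this appendix share a fixed set of fields (Subject, Skill, Description, Picture, Choices, Answer index). As a minimal sketch, assuming the answer index is 0-based (which is consistent with the record above, where index 1 selects the second of two choices, and with other records that use index 0), one record could be represented as follows. The class name `StemRecord` and its `answer` method are hypothetical and are not taken from any released code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StemRecord:
    """One example, mirroring the fields printed in this appendix (hypothetical class)."""
    subject: str
    skill: str
    description: str
    picture: Optional[str]  # None when the choices themselves are images
    choices: List[str]
    answer_index: int       # assumed to be a 0-based position in `choices`

    def answer(self) -> str:
        # Look up the correct choice from its index.
        return self.choices[self.answer_index]


record = StemRecord(
    subject="Engineer",
    skill="evaluate-tests-of-engineering-design-solutions",
    description="Which of the following could Eliana's test show?",
    picture="<image placeholder>",  # the actual image is not reproduced here
    choices=[
        "how much weight a bridge built with 300 popsicle sticks could hold",
        "which design could hold more weight",
    ],
    answer_index=1,
)
print(record.answer())
```

Under the 0-based reading, the lookup returns the second choice, which matches the passage: both prototypes use 200 sticks, so the test cannot say anything about a 300-stick bridge.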

Subject: Engineer

Skill: identify-control-and-experimental-groups

Description: In this experiment, which were part of a control group? The passage below describes an experiment. Madelyn has a bubble machine and wants to know how to make the bubbles last longer. She read that bubbles burst when the liquid that makes up the bubbles evaporates. Madelyn knew that when liquids are warmer, they evaporate faster. So, she wondered if she could make her bubbles last longer by cooling the bubble solution. Madelyn cooled six bottles of bubble solution to 30°F below room temperature. She left another six bottles of bubble solution at room temperature. Then, she measured how long bubbles made from the solution in each bottle lasted.

Picture: Measuring Vision-Language STEM Skills of Neural Models (180)

Choices: [the bottles that were cooled down, the bottles that were at room temperature, ]

Answer index: 1

Subject: Engineer

Skill: identify-independent-and-dependent-variables

Description: Hint: An independent variable is a variable whose effect you are investigating. A dependent variable is a variable that you measure. Which of the following was a dependent variable in this experiment? The passage below describes an experiment. Read the passage and think about the variables that are described. Giardia is a microscopic parasite that lives in water and can infect humans. Dr. Roth designed a drinking straw that contained a filter to remove Giardia from water. Dr. Roth wanted to know if a longer filtering straw would remove more Giardia. Dr. Roth made six filtering straws: three that were five inches long and three that were ten inches long. She prepared six one-liter batches of water, each containing 10,000 Giardia. Then, Dr. Roth passed one batch of water through each straw. After each batch passed through the straw, she used a microscope to count the number of Giardia that remained in a small sample of the water.

Picture: Measuring Vision-Language STEM Skills of Neural Models (181)

Choices: [the number of Giardia that remained in the water, the length of the filtering straw, ]

Answer index: 0

Subject: Engineer

Skill: identify-laboratory-tools

Description: Select the round-bottom flask.

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (182), Measuring Vision-Language STEM Skills of Neural Models (183), Measuring Vision-Language STEM Skills of Neural Models (184), Measuring Vision-Language STEM Skills of Neural Models (185), ]

Answer index: 1

Subject: Engineer

Skill: identify-the-experimental-question

Description: Identify the question that Jeffrey’s experiment can best answer. The passage below describes an experiment. Read the passage and then follow the instructions below. Jeffrey mixed bacteria into a nutrient-rich liquid where the bacteria could grow. He poured four ounces of the mixture into each of ten glass flasks. In five of the ten flasks, he also added one teaspoon of cinnamon. He allowed the bacteria in the flasks to grow overnight in a 37°C room. Then, Jeffrey used a microscope to count the number of bacteria in a small sample from each flask. He compared the amount of bacteria in the liquid with cinnamon to the amount of bacteria in the liquid without cinnamon.

Picture: Measuring Vision-Language STEM Skills of Neural Models (186)

Choices: [Do more bacteria grow in liquid with cinnamon than in liquid without cinnamon?, Does temperature affect how much bacteria can grow in liquid?, ]

Answer index: 0

Subject: Engineer

Skill: laboratory-safety-equipment

Description: Select the apron.

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (187), Measuring Vision-Language STEM Skills of Neural Models (188), Measuring Vision-Language STEM Skills of Neural Models (189), Measuring Vision-Language STEM Skills of Neural Models (190), ]

Answer index: 3

Subject: Math

Skill: absolute-value-and-integers-word-problems

Description: Debbie likes watching the show Engineering Marvels. In last night’s episode, the engineering team visited a tall skyscraper and a deep mine. A banner at the bottom of the screen showed the elevation of each location the team visited. Which location is closer to sea level?

Picture: Measuring Vision-Language STEM Skills of Neural Models (191)

Choices: [bottom of the mine, top of the skyscraper, ]

Answer index: 0

Subject: Math

Skill: acute-obtuse-and-right-triangles

Description: What kind of triangle is this?

Picture: Measuring Vision-Language STEM Skills of Neural Models (192)

Choices: [obtuse, right, acute, ]

Answer index: 0

Subject: Math

Skill: acute-right-obtuse-and-straight-angles

Description: Is this angle acute, right, obtuse, or straight?

Picture: Measuring Vision-Language STEM Skills of Neural Models (193)

Choices: [straight, obtuse, acute, right, ]

Answer index: 1

Subject: Math

Skill: addition-sentences-up-to-10-what-does-the-model-show

Description: Which addition sentence does the picture show?

Picture: Measuring Vision-Language STEM Skills of Neural Models (194)

Choices: [4+3=7, 5+2=7, ]

Answer index: 0

Subject: Math

Skill: addition-sentences-up-to-10-which-model-matches

Description: Which shows 8+1=9?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (195), Measuring Vision-Language STEM Skills of Neural Models (196), ]

Answer index: 0

Subject: Math

Skill: addition-sentences-up-to-5-what-does-the-model-show

Description: Which addition sentence does the picture show?

Picture: Measuring Vision-Language STEM Skills of Neural Models (197)

Choices: [4+1=5, 3+1=4, ]

Answer index: 0

Subject: Math

Skill: addition-sentences-up-to-5-which-model-matches

Description: Which shows 2+1=3?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (198), Measuring Vision-Language STEM Skills of Neural Models (199), ]

Answer index: 0

Subject: Math

Skill: addition-sentences-using-number-lines-sums-up-to-20

Description: Which addition sentence does this model show?

Picture: Measuring Vision-Language STEM Skills of Neural Models (200)

Choices: [3+3=6, 5+6=11, 5+4=9, 5+3=8, ]

Answer index: 3

Subject: Math

Skill: adjust-a-budget

Description: This table shows Angie’s February budget. What could Angie do to balance her budget?

Picture: Measuring Vision-Language STEM Skills of Neural Models (201)

Choices: [spend only $60 at the mall, and teach a baking class for $30, decorate more custom cookies to earn another $35, and spend only $60 at the mall, spend $20 more on the baking kit, and teach a baking class for $30, decorate more custom cookies to earn another $35, and spend $20 more on the baking kit, ]

Answer index: 1

Subject: Math

Skill: am-or-pm

Description: Farmer Keenan is getting up to go milk his cows. It is just before sunrise. His watch shows: What time is it?

Picture: Measuring Vision-Language STEM Skills of Neural Models (202)

Choices: [4:30 P.M., 4:30 A.M., ]

Answer index: 1

Subject: Math

Skill: angle-angle-criterion-for-similar-triangles

Description: FGH and JKL are shown below. Which statement is true?

Picture: Measuring Vision-Language STEM Skills of Neural Models (203)

Choices: [FGH is similar to JKL., FGH is not similar to JKL., There is not enough information to determine whether the triangles are similar., ]

Answer index: 2

Subject: Math

Skill: angles-as-fractions-of-a-circle

Description: What fraction of the circle does this angle cut out?

Picture: Measuring Vision-Language STEM Skills of Neural Models (204)

Choices: [1/4, 3/4, 1 whole, 1/2, ]

Answer index: 0

Subject: Math

Skill: count-money-pennies-and-nickels

Description: How much money is there?

Picture: Measuring Vision-Language STEM Skills of Neural Models (205)

Choices: [6¢, 11¢, 16¢, ]

Answer index: 1

Subject: Math

Skill: count-money-pennies-through-dimes

Description: How much money is there?

Picture: Measuring Vision-Language STEM Skills of Neural Models (206)

Choices: [18¢, 16¢, 19¢, ]

Answer index: 0

Subject: Math

Skill: count-on-ten-frames-up-to-10

Description: How many dots are on the frame?

Picture: Measuring Vision-Language STEM Skills of Neural Models (207)

Choices: [4, 9, 2, 8, 1, 7, 5, 10, 6, 3, ]

Answer index: 1
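
The number of choices varies across records, from two to ten in the examples so far, so chance level is not the same for every question. As a rough sketch, assuming a guesser that picks uniformly among the listed choices (the procedure behind the paper's random-guess baseline is not described in this section), the expected accuracy is the mean of 1/k over questions with k choices. The function name below is hypothetical.

```python
from fractions import Fraction
from typing import List


def random_guess_accuracy(num_choices: List[int]) -> Fraction:
    """Expected accuracy of uniform random guessing over a set of questions.

    Each entry of `num_choices` is the number of answer options for one question.
    """
    return sum(Fraction(1, k) for k in num_choices) / len(num_choices)


# Choice counts taken from four records above: 2, 3, 4 and 10 options.
counts = [2, 3, 4, 10]
acc = random_guess_accuracy(counts)
print(f"{float(acc):.3f}")
```

Because the mean is taken per question, subjects dominated by two-choice questions have a higher chance level than subjects with many options per question.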

Subject: Math

Skill: count-on-ten-frames-up-to-3

Description: How many dots are on the frame?

Picture: Measuring Vision-Language STEM Skills of Neural Models (208)

Choices: [3, 2, 1, ]

Answer index: 1

Subject: Math

Skill: count-on-ten-frames-up-to-5

Description: How many squares are on the frame?

Picture: Measuring Vision-Language STEM Skills of Neural Models (209)

Choices: [4, 5, 1, 2, 3, ]

Answer index: 1

Subject: Math

Skill: count-pennies

Description: How much money is there?

Picture: Measuring Vision-Language STEM Skills of Neural Models (210)

Choices: [7¢, 6¢, , ]

Answer index: 2

Subject: Math

Skill: count-pictures-up-to-10

Description: How many parrots are there?

Picture: Measuring Vision-Language STEM Skills of Neural Models (211)

Choices: [1, 4, 3, 6, 9, 5, 7, 10, 8, 2, ]

Answer index: 4

Subject: Math

Skill: count-pictures-up-to-3

Description: How many butterflies are there?

Picture: Measuring Vision-Language STEM Skills of Neural Models (212)

Choices: [2, 1, 3, ]

Answer index: 2

Subject: Math

Skill: count-pictures-up-to-5

Description: How many snowmen are there?

Picture: Measuring Vision-Language STEM Skills of Neural Models (213)

Choices: [2, 5, 1, 4, 3, ]

Answer index: 4

Subject: Math

Skill: count-scattered-shapes-up-to-10

Description: How many shapes are there?

Picture: Measuring Vision-Language STEM Skills of Neural Models (214)

Choices: [2, 3, 8, 1, 9, 10, 4, 6, 7, 5, ]

Answer index: 5

Subject: Math

Skill: count-scattered-shapes-up-to-5

Description: How many rectangles are there?

Picture: Measuring Vision-Language STEM Skills of Neural Models (215)

Choices: [2, 4, 5, 3, 1, ]

Answer index: 2

Subject: Math

Skill: count-shapes-in-rings-up-to-10

Description: How many triangles are there?

Picture: Measuring Vision-Language STEM Skills of Neural Models (216)

Choices: [9, 3, 5, 2, 7, 6, 8, 1, 4, 10, ]

Answer index: 4

Subject: Math

Skill: count-shapes-in-rows-up-to-10

Description: How many hearts are there?

Picture: Measuring Vision-Language STEM Skills of Neural Models (217)

Choices: [5, 1, 7, 9, 8, 10, 3, 4, 2, 6, ]

Answer index: 3

Subject: Math

Skill: count-shapes-in-rows-up-to-5

Description: How many shapes are there?

Picture: Measuring Vision-Language STEM Skills of Neural Models (218)

Choices: [3, 2, 1, 4, 5, ]

Answer index: 0

Subject: Math

Skill: count-shapes-up-to-3

Description: How many triangles are there?

Picture: Measuring Vision-Language STEM Skills of Neural Models (219)

Choices: [2, 1, 3, ]

Answer index: 0

Subject: Math

Skill: count-sides

Description: Which shape has 4 sides?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (220), Measuring Vision-Language STEM Skills of Neural Models (221), Measuring Vision-Language STEM Skills of Neural Models (222), ]

Answer index: 1

Subject: Math

Skill: count-sides-and-corners

Description: Which shape has 5 corners?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (223), Measuring Vision-Language STEM Skills of Neural Models (224), Measuring Vision-Language STEM Skills of Neural Models (225), ]

Answer index: 1

Subject: Math

Skill: count-to-100

Description: How many dots are there?

Picture: Measuring Vision-Language STEM Skills of Neural Models (226)

Choices: [46, 49, 44, ]

Answer index: 0

Subject: Math

Skill: identify-alternate-interior-and-alternate-exterior-angles

Description: line{RT} and line{UW} are parallel lines. Which angles are alternate interior angles?

Picture: Measuring Vision-Language STEM Skills of Neural Models (227)

Choices: [angle{TSV} and angle{UVS}, angle{TSV} and angle{TSQ}, angle{TSV} and angle{RSV}, angle{TSV} and angle{WVS}, ]

Answer index: 0

Subject: Math

Skill: identify-complementary-supplementary-vertical-adjacent-and-congruent-angles

Description: Which angle is vertical to angle{3}?

Picture: Measuring Vision-Language STEM Skills of Neural Models (228)

Choices: [angle{6}, angle{5}, angle{4}, angle{2}, ]

Answer index: 0

Subject: Math

Skill: identify-complementary-supplementary-vertical-and-adjacent-angles

Description: Which angles are adjacent to each other?

Picture: Measuring Vision-Language STEM Skills of Neural Models (229)

Choices: [angle{13} and angle{7}, angle{15} and angle{14}, angle{8} and angle{4}, angle{10} and angle{4}, ]

Answer index: 1

Subject: Math

Skill: identify-congruent-figures

Description: Are these shapes congruent?

Picture: Measuring Vision-Language STEM Skills of Neural Models (230)

Choices: [no, yes, ]

Answer index: 1

Subject: Math

Skill: identify-direct-variation-and-inverse-variation

Description: Which equation shows direct variation?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (231), Measuring Vision-Language STEM Skills of Neural Models (232), ]

Answer index: 1

Subject: Math

Skill: identify-equivalent-expressions-using-strip-models

Description: This model represents the expression x+x+1+1. Which expression is equivalent to x+x+1+1?

Picture: Measuring Vision-Language STEM Skills of Neural Models (233)

Choices: [4x, 2x+3, 3x+1, 2x+2, ]

Answer index: 3

Subject: Math

Skill: identify-equivalent-fractions-on-number-lines

Description: Is 1/2 equivalent to 1/3 ?

Picture: Measuring Vision-Language STEM Skills of Neural Models (234)

Choices: [no, yes, ]

Answer index: 0

Subject: Math

Skill: identify-equivalent-fractions-using-number-lines

Description: Is 2/3 equivalent to 4/6 ?

Picture: Measuring Vision-Language STEM Skills of Neural Models (235)

Choices: [yes, no, ]

Answer index: 0

Subject: Math

Skill: identify-equivalent-linear-expressions-using-algebra-tiles

Description: These tiles represent the expression 3x+5x. Which expression is equivalent to 3x+5x?

Picture: Measuring Vision-Language STEM Skills of Neural Models (236)

Choices: [x+8, 2x, 8x, 8x+2, ]

Answer index: 2

Subject: Math

Skill: identify-faces-of-three-dimensional-figures

Description: Which shape has a circle as a face?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (237), Measuring Vision-Language STEM Skills of Neural Models (238), ]

Answer index: 1

Subject: Math

Skill: identify-faces-of-three-dimensional-shapes

Description: Which shape has a circle as a face?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (239), Measuring Vision-Language STEM Skills of Neural Models (240), ]

Answer index: 1

Subject: Math

Skill: identify-fourths

Description: Look at the colored part of each shape. Which shape shows one-fourth?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (241), Measuring Vision-Language STEM Skills of Neural Models (242), Measuring Vision-Language STEM Skills of Neural Models (243), Measuring Vision-Language STEM Skills of Neural Models (244), ]

Answer index: 3

Subject: Math

Skill: identify-functions

Description: Look at this graph: Is this relation a function?

Picture: Measuring Vision-Language STEM Skills of Neural Models (245)

Choices: [yes, no, ]

Answer index: 1

Subject: Math

Skill: identify-functions-graphs

Description: Which of these relations is a function?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (246), Measuring Vision-Language STEM Skills of Neural Models (247), Measuring Vision-Language STEM Skills of Neural Models (248), Measuring Vision-Language STEM Skills of Neural Models (249), ]

Answer index: 3

Subject: Math

Skill: identify-functions-vertical-line-test

Description: Which of these relations is a function?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (250), Measuring Vision-Language STEM Skills of Neural Models (251), Measuring Vision-Language STEM Skills of Neural Models (252), Measuring Vision-Language STEM Skills of Neural Models (253), ]

Answer index: 3

Subject: Math

Skill: identify-graphs-of-continuous-functions

Description: Is the function f(x) continuous?

Picture: Measuring Vision-Language STEM Skills of Neural Models (254)

Choices: [yes, no, ]

Answer index: 0

Subject: Math

Skill: identify-halves

Description: Look at the colored part of each shape. Which shape shows one-half?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (255), Measuring Vision-Language STEM Skills of Neural Models (256), Measuring Vision-Language STEM Skills of Neural Models (257), Measuring Vision-Language STEM Skills of Neural Models (258), ]

Answer index: 0

Subject: Math

Skill: identify-halves-and-fourths

Description: Which figure shows fourths?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (259), Measuring Vision-Language STEM Skills of Neural Models (260), ]

Answer index: 1

Subject: Math

Skill: lines-line-segments-and-rays

Description: What is this?

Picture: Measuring Vision-Language STEM Skills of Neural Models (261)

Choices: [line, line segment, ray, ]

Answer index: 1

Subject: Math

Skill: make-predictions-with-scatter-plots

Description: Based on the scatter plot below, which is a better prediction for x when y = 46?

Picture: Measuring Vision-Language STEM Skills of Neural Models (262)

Choices: [50, 98, ]

Answer index: 0

Subject: Math

Skill: match-addition-sentences-and-models-sums-to-10

Description: Which shows 2+2=4?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (263), Measuring Vision-Language STEM Skills of Neural Models (264), ]

Answer index: 0

Subject: Math

Skill: match-analog-and-digital-clocks

Description: Look at the analog clock: Which digital clock shows the same time?

Picture: Measuring Vision-Language STEM Skills of Neural Models (265)

Choices: [Measuring Vision-Language STEM Skills of Neural Models (266), Measuring Vision-Language STEM Skills of Neural Models (267), Measuring Vision-Language STEM Skills of Neural Models (268), ]

Answer index: 0

Subject: Math

Skill: match-analog-clocks-and-times

Description: What time does the clock show?

Picture: Measuring Vision-Language STEM Skills of Neural Models (269)

Choices: [5:00, 4:30, ]

Answer index: 0

Subject: Math

Skill: match-clocks-and-times

Description: What time does the clock show?

Picture: Measuring Vision-Language STEM Skills of Neural Models (270)

Choices: [eight fifty, seven fifty, nine forty, ]

Answer index: 1

Subject: Math

Skill: match-digital-clocks-and-times

Description: Which clock shows six thirty-five?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (271), Measuring Vision-Language STEM Skills of Neural Models (272), Measuring Vision-Language STEM Skills of Neural Models (273), ]

Answer index: 1

Subject: Math

Skill: match-exponential-functions-and-graphs

Description: formula_desc.png

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (274), Measuring Vision-Language STEM Skills of Neural Models (275), Measuring Vision-Language STEM Skills of Neural Models (276), Measuring Vision-Language STEM Skills of Neural Models (277), ]

Answer index: 0

Subject: Math

Skill: match-exponential-functions-and-graphs-ii

Description: formula_desc.png

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (278), Measuring Vision-Language STEM Skills of Neural Models (279), Measuring Vision-Language STEM Skills of Neural Models (280), Measuring Vision-Language STEM Skills of Neural Models (281), ]

Answer index: 2

Subject: Math

Skill: match-fractions-to-models-halves-thirds-and-fourths

Description: Look at the colored part of each shape. Which shape shows one-fourth?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (282), Measuring Vision-Language STEM Skills of Neural Models (283), Measuring Vision-Language STEM Skills of Neural Models (284), Measuring Vision-Language STEM Skills of Neural Models (285), ]

Answer index: 2

Subject: Math

Skill: match-mixed-numbers-to-models

Description: Which mixed number is shown?

Picture: Measuring Vision-Language STEM Skills of Neural Models (286)

Choices: [3 3/8, 4 2/8, 3 2/8, 3 5/8, ]

Answer index: 0

Subject: Math

Skill: mean-find-the-missing-number

Description: Susan has the following data: If the mean is 25, which number could r be?

Picture: Measuring Vision-Language STEM Skills of Neural Models (287)

Choices: [29, 38, ]

Answer index: 0

Subject: Math

Skill: mean-median-mode-and-range-find-the-missing-number

Description: Jayla has the following data: If the mean is 14, which number could s be?

Picture: Measuring Vision-Language STEM Skills of Neural Models (288)

Choices: [11, 3, ]

Answer index: 0

Subject: Math

Skill: measure-angles-with-a-protractor

Description: Is this angle acute, right, or obtuse?

Picture: Measuring Vision-Language STEM Skills of Neural Models (289)

Choices: [right, obtuse, acute, ]

Answer index: 2

Subject: Math

Skill: median-find-the-missing-number

Description: Danny has the following data: If the median is 97, which number could c be?

Picture: Measuring Vision-Language STEM Skills of Neural Models (290)

Choices: [98, 47, ]

Answer index: 0

Subject: Math

Skill: model-and-solve-equations-using-algebra-tiles

Description: Which equation does this set of algebra tiles represent?

Picture: Measuring Vision-Language STEM Skills of Neural Models (291)

Choices: [–4x–1=–9, –8x–1=–9, 8x–1=–9, –x–1=–10, ]

Answer index: 3

Subject: Math

Skill: model-and-solve-linear-equations-using-algebra-tiles

Description: Which equation does this set of algebra tiles represent?

Picture: Measuring Vision-Language STEM Skills of Neural Models (292)

Choices: [3x=27, 3x=24, 2x=26, 2x=24, ]

Answer index: 1

Subject: Math

Skill: more

Description: Which group has more?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (293), Measuring Vision-Language STEM Skills of Neural Models (294), ]

Answer index: 0

Subject: Math

Skill: reflection-rotation-and-translation

Description: How has this figure been transformed? It has been…

Picture: Measuring Vision-Language STEM Skills of Neural Models (295)

Choices: [translated, reflected, rotated, ]

Answer index: 1

Subject: Math

Skill: regular-and-irregular-polygons

Description: Is this shape a regular polygon?

Picture: Measuring Vision-Language STEM Skills of Neural Models (296)

Choices: [yes, no, ]

Answer index: 1

Subject: Math

Skill: represent-numbers-up-to-10

Description: Which group has 6 triangles?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (297), Measuring Vision-Language STEM Skills of Neural Models (298), ]

Answer index: 0

Subject: Math

Skill: represent-numbers-up-to-20

Description: Which picture shows 8 dots?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (299), Measuring Vision-Language STEM Skills of Neural Models (300), Measuring Vision-Language STEM Skills of Neural Models (301), ]

Answer index: 0

Subject: Math

Skill: represent-numbers-with-pictures-up-to-3

Description: Which shows 2?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (302), Measuring Vision-Language STEM Skills of Neural Models (303), ]

Answer index: 0

Subject: Math

Skill: represent-numbers-with-pictures-up-to-5

Description: Which shows 1?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (304), Measuring Vision-Language STEM Skills of Neural Models (305), ]

Answer index: 0

Subject: Math

Skill: represent-numbers-with-shapes-up-to-3

Description: Which group has 3 circles?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (306), Measuring Vision-Language STEM Skills of Neural Models (307), Measuring Vision-Language STEM Skills of Neural Models (308), ]

Answer index: 0

Subject: Math

Skill: represent-numbers-with-shapes-up-to-5

Description: Which group has 4 hexagons?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (309), Measuring Vision-Language STEM Skills of Neural Models (310), ]

Answer index: 0

Subject: Math

Skill: rhombuses

Description: Which shape is a rhombus?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (311), Measuring Vision-Language STEM Skills of Neural Models (312), ]

Answer index: 0

Subject: Math

Skill: rotational-symmetry

Description: Does this picture have rotational symmetry?

Picture: Measuring Vision-Language STEM Skills of Neural Models (313)

Choices: [no, yes, ]

Answer index: 0

Subject: Math

Skill: rotational-symmetry-amount-of-rotation

Description: This image has rotational symmetry. What is the smallest fraction of a full turn you need to rotate the image for it to look the same?

Picture: Measuring Vision-Language STEM Skills of Neural Models (314)

Choices: [1/2 of a full turn, 1/6 of a full turn, 1/4 of a full turn, 1/3 of a full turn, ]

Answer index: 0

Subject: Math

Skill: scalene-isosceles-and-equilateral-triangles

Description: Is this triangle scalene?

Picture: Measuring Vision-Language STEM Skills of Neural Models (315)

Choices: [yes, no, ]

Answer index: 1

Subject: Math

Skill: select-figures-with-a-given-area

Description: Which shape has an area of 7 square units? The shapes are made of unit squares.

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (316), Measuring Vision-Language STEM Skills of Neural Models (317), ]

Answer index: 1

Subject: Math

Skill: select-fractions-equivalent-to-whole-numbers-using-models

Description: Count the equal parts. What fraction does this picture show?

Picture: Measuring Vision-Language STEM Skills of Neural Models (318)

Choices: [2/4, 4/8, 8/2, 2/8, ]

Answer index: 2

Subject: Math

Skill: select-solid-shapes

Description: Which shape is a cone?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (319), Measuring Vision-Language STEM Skills of Neural Models (320), Measuring Vision-Language STEM Skills of Neural Models (321), ]

Answer index: 2

Subject: Math

Skill: select-three-dimensional-shapes

Description: Which shape is a rectangular prism?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (322), Measuring Vision-Language STEM Skills of Neural Models (323), Measuring Vision-Language STEM Skills of Neural Models (324), ]

Answer index: 2

Subject: Math

Skill: select-two-dimensional-shapes

Description: Which shape is a hexagon?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (325), Measuring Vision-Language STEM Skills of Neural Models (326), Measuring Vision-Language STEM Skills of Neural Models (327), ]

Answer index: 2

Subject: Math

Skill: shapes-of-everyday-objects

Description: Which is shaped like a cylinder?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (328), Measuring Vision-Language STEM Skills of Neural Models (329), Measuring Vision-Language STEM Skills of Neural Models (330), Measuring Vision-Language STEM Skills of Neural Models (331), ]

Answer index: 1

Subject: Science

Skill: animal-adaptations-feet-and-limbs

Description: Star-nosed moles are found in many parts of North America. They live in burrows. The moles eat earthworms and nuts, which they find in the soil. The feet of the star-nosed mole are adapted for digging. Which animal’s feet are also adapted for digging?

Picture: Measuring Vision-Language STEM Skills of Neural Models (332)

Choices: [Measuring Vision-Language STEM Skills of Neural Models (333), Measuring Vision-Language STEM Skills of Neural Models (334), ]

Answer index: 0

Subject: Science

Skill: animal-adaptations-skins-and-body-coverings

Description: Emerald tree boas live in the forests of South America. The tree boa is adapted to be camouflaged among green leaves. Which animal is also adapted to be camouflaged among green leaves?

Picture: Measuring Vision-Language STEM Skills of Neural Models (335)

Choices: [Measuring Vision-Language STEM Skills of Neural Models (336), Measuring Vision-Language STEM Skills of Neural Models (337), ]

Answer index: 1

Subject: Science

Skill: classify-elementary-substances-and-compounds-using-models

Description: Complete the statement. Nitrogen is ___. The model below represents a molecule of nitrogen. Nitrogen gas makes up nearly 80% of the air you breathe.

Picture: Measuring Vision-Language STEM Skills of Neural Models (338)

Choices: [an elementary substance, a compound, ]

Answer index: 0

Subject: Science

Skill: classify-fruits-and-vegetables-as-plant-parts

Description: People use lettuce plants for food. We usually eat the part of this plant that makes most of the food for the plant. Hint: A plant’s leaves make food. A plant’s seeds can grow into a new plant. Which part of the lettuce plant do we usually eat?

Picture: Measuring Vision-Language STEM Skills of Neural Models (339)

Choices: [the leaves, the seeds, ]

Answer index: 0

Subject: Science

Skill: classify-matter-as-solid-liquid-or-gas

Description: Is the water from a faucet a solid, a liquid, or a gas?

Picture: Measuring Vision-Language STEM Skills of Neural Models (340)

Choices: [a solid, a liquid, a gas, ]

Answer index: 1

Subject: Science

Skill: classify-matter-as-solid-or-liquid

Description: Is a coin a solid or a liquid?

Picture: Measuring Vision-Language STEM Skills of Neural Models (341)

Choices: [a liquid, a solid, ]

Answer index: 1

Subject: Science

Skill: classify-rocks-and-minerals-by-color-and-shape

Description: Select the black mineral.

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (342), Measuring Vision-Language STEM Skills of Neural Models (343), ]

Answer index: 0

Subject: Science

Skill: classify-rocks-and-minerals-by-color-shape-and-texture

Description: Select the brown rock.

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (344), Measuring Vision-Language STEM Skills of Neural Models (345), ]

Answer index: 1

Subject: Science

Skill: classify-rocks-as-igneous-sedimentary-or-metamorphic

Description: Diorite is a type of rock. When melted rock cools below the earth’s surface, it can form diorite. Diorite is usually made of large mineral grains. What type of rock is diorite?

Picture: Measuring Vision-Language STEM Skills of Neural Models (346)

Choices: [sedimentary, igneous, ]

Answer index: 1

Subject: Science

Skill: classify-symbiotic-relationships

Description: Which type of relationship is formed when an Alcon blue caterpillar lives in a Myrmica ant nest? Read the passage. Then answer the question. Alcon blue butterflies spend the first part of their lives as caterpillars that live with Myrmica ants. When a caterpillar lives with the ants, it mimics, or pretends to be, an ant. The caterpillar can mimic the ants by copying their smell. The caterpillar can also make noises that make it sound like a queen ant. Queen ants receive more food and better protection than any other ants in the nest. So, when the caterpillar mimics an ant, the ants feed and protect the caterpillar instead of other ants in the nest.

Picture: Measuring Vision-Language STEM Skills of Neural Models (347)

Choices: [mutualistic, commensal, parasitic, ]

Answer index: 2

Subject: Science

Skill: compare-ages-of-fossils-in-a-rock-sequence

Description: This diagram shows fossils in an undisturbed sedimentary rock sequence. Which of the following fossils is older? Select the more likely answer.

Picture: Measuring Vision-Language STEM Skills of Neural Models (348)

Choices: [Measuring Vision-Language STEM Skills of Neural Models (349), Measuring Vision-Language STEM Skills of Neural Models (350), ]

Answer index: 1

Subject: Science

Skill: compare-amplitudes-and-wavelengths-of-waves

Description: Select the wave with the greater amplitude.

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (351), Measuring Vision-Language STEM Skills of Neural Models (352), ]

Answer index: 0

Subject: Science

Skill: compare-amplitudes-wavelengths-and-frequencies-of-waves

Description: Select the graph of the wave with the greater amplitude. The graphs below describe two waves. The waves are traveling at the same speed.

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (353), Measuring Vision-Language STEM Skills of Neural Models (354), ]

Answer index: 1

Subject: Science

Skill: compare-ancient-and-modern-organisms-use-observations-to-support-a-hypothesis

Description: Which statement supports the following hypothesis? The American lobster and Homarus hakelensis have similar adaptations to survive underwater.

Picture: Measuring Vision-Language STEM Skills of Neural Models (355)

Choices: [Homarus hakelensis used its claws to find food underwater, The American lobster uses its claws to find food underwater, The American lobster uses its claws to find food underwater, ]

Answer index: 2

Subject: Science

Skill: compare-concentrations-of-solutions

Description: The diagram below is a model of two solutions. Each green ball represents one particle of solute. Which solution has a higher concentration of green particles?

Picture: Measuring Vision-Language STEM Skills of Neural Models (356)

Choices: [Solution B, Solution A, neither; their concentrations are the same, ]

Answer index: 0

Subject: Science

Skill: compare-magnitudes-of-magnetic-forces

Description: Think about the magnetic force between the magnets in each pair. Which of the following statements is true? The images below show two pairs of magnets. The magnets in different pairs do not affect each other. All the magnets shown are made of the same material, but some of them are different shapes.

Picture: Measuring Vision-Language STEM Skills of Neural Models (357)

Choices: [The magnitude of the magnetic force is smaller in Pair 2., The magnitude of the magnetic force is the same in both pairs., The magnitude of the magnetic force is smaller in Pair 1., ]

Answer index: 1

Subject: Science

Skill: compare-properties-of-materials

Description: Which is harder?

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (358), Measuring Vision-Language STEM Skills of Neural Models (359), ]

Answer index: 1

Subject: Science

Skill: compare-properties-of-objects

Description: Which property do these four objects have in common? Select the best answer.

Picture: Measuring Vision-Language STEM Skills of Neural Models (360)

Choices: [sticky, sour, soft, ]

Answer index: 2

Subject: Science

Skill: use-data-to-describe-world-climates

Description: Which statement best describes the climate of Santa Fe? Hint: Summers in the Northern Hemisphere occur in June, July, and August. Winters in the Northern Hemisphere occur in December, January, and February. Santa Fe, New Mexico, is a city in the United States. It has a semiarid climate.

Picture: Measuring Vision-Language STEM Skills of Neural Models (361)

Choices: [Winters have much lower temperatures than summers., Winters have less precipitation than summers., ]

Answer index: 0

Subject: Science

Skill: use-evidence-to-classify-animals

Description: Placental mammals are a group of animals with similar traits. The following traits can be used to identify placental mammals: They give birth to live offspring. They have fur or hair. Observe the animals and read the descriptions. Select the one animal that has all of the placental mammal traits listed above. Sea otters have very thick fur. Their fur helps keep them warm in cold water. Female sea otters give birth to live offspring in the water. Red salamanders do not have lungs! They can breathe through their moist, smooth skin. Adult red salamanders live near rivers or ponds. They lay eggs with no shells under rocks or logs. The baby red salamanders live underwater.

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (362), Measuring Vision-Language STEM Skills of Neural Models (363), ]

Answer index: 0

Subject: Science

Skill: use-evidence-to-classify-mammals-birds-fish-reptiles-and-amphibians

Description: Fish are a group of animals with similar traits. The following traits can be used to identify fish: They have fins, not limbs. They make eggs with no shells. Observe the animals and read the descriptions. Select the one animal that has all of the fish traits listed above. Thresher sharks hatch from eggs with no shells. They have a long tail and fins. They can use their tail to hit and stun their prey. Thresher sharks live in salt water. Greater flameback woodpeckers have feathers and two wings. They use their strong beaks to make holes in trees. The woodpeckers use these holes as nests for their eggs, which have white shells.

Picture: None

Choices: [Measuring Vision-Language STEM Skills of Neural Models (364), Measuring Vision-Language STEM Skills of Neural Models (365), ]

Answer index: 0

Subject: Science

Skill: use-punnett-squares-to-calculate-probabilities-of-offspring-types

Description: In a group of dachshund dogs, some individuals have rough fur and others have soft fur. In this group, the gene for the fur texture trait has two alleles. The allele for rough fur (F) is dominant over the allele for soft fur (f). This Punnett square shows a cross between two dachshund dogs. What is the probability that a dachshund dog produced by this cross will be homozygous dominant for the fur texture gene?

Picture: Measuring Vision-Language STEM Skills of Neural Models (366)

Choices: [3/4, 0/4, 2/4, 1/4, 4/4, ]

Answer index: 2

Subject: Science

Skill: use-punnett-squares-to-calculate-ratios-of-offspring-types

Description: In a group of Syrian hamsters, some individuals have short fur and others have long fur. In this group, the gene for the fur length trait has two alleles. The allele for short fur (F) is dominant over the allele for long fur (f). This Punnett square shows a cross between two Syrian hamsters. What is the expected ratio of offspring with long fur to offspring with short fur? Choose the most likely ratio.

Picture: Measuring Vision-Language STEM Skills of Neural Models (367)

Choices: [3:1, 1:3, 0:4, 2:2, 4:0, ]

Answer index: 2

Subject: Science

Skill: use-scientific-names-to-classify-organisms

Description: This organism is a mantled howler. Its scientific name is Alouatta palliata. Select the organism in the same species as the mantled howler.

Picture: Measuring Vision-Language STEM Skills of Neural Models (368)

Choices: [Measuring Vision-Language STEM Skills of Neural Models (369), Measuring Vision-Language STEM Skills of Neural Models (370), Measuring Vision-Language STEM Skills of Neural Models (371), ]

Answer index: 1

Subject: Science

Skill: weather-and-climate-around-the-world

Description: Does this passage describe the weather or the climate? Hint: Weather is what the atmosphere is like at a certain place and time. Climate is the pattern of weather in a certain place. A cloud forest is a mountain ecosystem that is home to a wide variety of species. The skies were mostly clear last week over this cloud forest, which is in Ecuador.

Picture: Measuring Vision-Language STEM Skills of Neural Models (372)

Choices: [weather, climate, ]

Answer index: 0

Subject: Technology

Skill: cables

Description: What kind of computer related plug or port do you see here?

Picture: Measuring Vision-Language STEM Skills of Neural Models (373)

Choices: [USB type-A port, HDMI plug, VGA port, USB type-C plug, ]

Answer index: 3

Subject: Technology

Skill: font

Description: Identify this font type

Picture: Measuring Vision-Language STEM Skills of Neural Models (374)

Choices: [Times Ancient Roman, Matisse ITC, Human521 BT, Bookman Old Style, ]

Answer index: 2

Subject: Technology

Skill: icons

Description: This is a(n old) logo of which famous app or program?

Picture: Measuring Vision-Language STEM Skills of Neural Models (375)

Choices: [Acrobat Reader, Google Pay, Microsoft Office PowerPoint, GoFundMe, ]

Answer index: 2

Subject: Technology

Skill: logo

Description: This is (part of) a (former) logo of which computer related brand?

Picture: Measuring Vision-Language STEM Skills of Neural Models (376)

Choices: [Imation, Cisco, Nintendo, Verbatim, ]

Answer index: 3

Subject: Technology

Skill: others

Description: What is the function of this key?

Picture: Measuring Vision-Language STEM Skills of Neural Models (377)

Choices: [Copy, Undo, Delete, Paste, ]

Answer index: 1

Subject: Technology

Skill: parts

Description: What kind of computer component do you see here?

Picture: Measuring Vision-Language STEM Skills of Neural Models (378)

Choices: [Power Supply Unit, Computer Fan, CPU Socket, Molex Connector, ]

Answer index: 2

Subject: Technology

Skill: peripherals

Description: What kind of computer peripheral do you see here?

Picture: Measuring Vision-Language STEM Skills of Neural Models (379)

Choices: [Floppy Disk, DVD Spindle, Tablet, Bluetooth Headset, ]

Answer index: 3

Subject: Technology

Skill: photo

Description: What type of video game console do you see here?

Picture: Measuring Vision-Language STEM Skills of Neural Models (380)

Choices: [Nintendo Wii, Microsoft Xbox One, Microsoft Xbox, Mattel Intellivision, ]

Answer index: 0

Subject: Technology

Skill: web

Description: What meaning or function is usually associated with this web interface symbol?

Picture: Measuring Vision-Language STEM Skills of Neural Models (381)

Choices: [Storage for deleted files, Computer games, Reload/Refresh, Send e-mail, ]

Answer index: 2
