HackerRank ASTRA Benchmark Measures AI Performance in Software Development

A New Benchmark for Evaluating AI in Software Development: HackerRank-ASTRA
The development and application of large language models (LLMs) in software development is progressing rapidly. To capture the true potential of these models for real-world applications, meaningful evaluation criteria are essential. Existing benchmarks often focus on isolated programming tasks or specific libraries, but they rarely account for the complexity of real software projects, which typically span multiple files and diverse technologies. Likewise, they seldom evaluate rigorously how consistent the generated results are across repeated runs.
The HackerRank-ASTRA Benchmark addresses this gap by introducing project-based programming tasks that reflect real-world scenarios. Unlike isolated code snippets, these tasks require the model to work across multiple files and to respect the dependencies within a project. This approach enables a more realistic assessment of the capabilities of LLMs in software development.
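To make the contrast with isolated snippets concrete, the sketch below shows one plausible way a project-based task could be represented: a problem statement, several interdependent files, and a test command that decides correctness. The field names, file paths, and command are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical representation of a project-based task. The model receives a
# problem statement together with several related files and must produce edits
# that keep the project's dependencies intact; correctness is then judged by
# running the project's own test suite. (All names below are illustrative.)
project_task = {
    "problem_statement": "Add server-side validation to the signup endpoint.",
    "files": {
        "src/routes/signup.js": "...",   # route handler the model must modify
        "src/models/user.js": "...",     # model module the handler depends on
        "test/signup.test.js": "...",    # tests that define the expected behavior
    },
    "evaluation_command": "npm test",    # scoring is based on the test run
}
```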
A central aspect of the HackerRank-ASTRA Benchmark is the evaluation of the consistency of model outputs. Each task is run 32 times (k = 32), and the mean standard deviation of the scores across these runs quantifies a model's reliability. Fluctuations in the results indicate instabilities in model behavior that can cause problems in practice. High consistency is therefore an important criterion for the use of LLMs in professional software projects.
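As a minimal sketch of how such a consistency metric can be computed, the following Python snippet collapses each task's k = 32 run scores into a standard deviation and then averages those values over all tasks. The function name and array layout are assumptions for illustration, not code from the benchmark itself.

```python
import numpy as np

def mean_standard_deviation(scores: np.ndarray) -> float:
    """Consistency sketch: `scores` has shape (num_tasks, k), where each row
    holds one model's scores for a single task across k independent runs."""
    per_task_sd = scores.std(axis=1)   # variability across the k runs of each task
    return float(per_task_sd.mean())   # lower values indicate more consistent behavior

# Illustrative usage with 65 tasks and k = 32 runs (random data, not real results)
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=(65, 32))
print(f"mean standard deviation: {mean_standard_deviation(scores):.4f}")
```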
In addition to the overall evaluation, the benchmark also enables a detailed analysis of model capabilities at the sub-skill level. By categorizing the tasks according to specific technologies and concepts, strengths and weaknesses of the models in different areas can be identified. This detailed analysis provides valuable information for the targeted further development of LLMs and the selection of the optimal model for specific use cases.
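Conceptually, such a sub-skill breakdown amounts to grouping per-task scores by their technology tag and averaging each group, as in the minimal sketch below. The tags and scores are hypothetical placeholders and do not reflect the benchmark's actual categories or results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical task records: each task carries a technology tag and a model's score.
tasks = [
    {"skill": "Node.js", "score": 0.81},
    {"skill": "React",   "score": 0.68},
    {"skill": "Node.js", "score": 0.74},
    {"skill": "Django",  "score": 0.59},
]

def scores_by_subskill(tasks):
    """Average a model's task scores within each technology category."""
    grouped = defaultdict(list)
    for task in tasks:
        grouped[task["skill"]].append(task["score"])
    return {skill: mean(values) for skill, values in grouped.items()}

print(scores_by_subskill(tasks))
# e.g. {'Node.js': 0.775, 'React': 0.68, 'Django': 0.59}
```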
Initial evaluations on 65 tasks show that the three leading models, o1, o1-preview, and Claude-3.5-Sonnet-1022, achieve comparable average scores of about 75%. No statistically significant performance differences between these models could be determined. The high consistency of Claude-3.5-Sonnet-1022 is noteworthy, however: with a low mean standard deviation (SD = 0.0497), it showed the least variability in its results, and this difference was statistically significant compared to the other models. This stability underscores the suitability of Claude-3.5-Sonnet-1022 for real-world software development projects.
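The exact statistical tests used in the paper are not reproduced here, but a comparison of this kind could, for example, combine a paired test on mean scores with a test on score variability, as in the hedged sketch below. All values are random placeholders rather than benchmark data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Placeholder per-task mean scores for two models on the same 65 tasks.
model_a_scores = rng.uniform(0.6, 0.9, size=65)
model_b_scores = rng.uniform(0.6, 0.9, size=65)

# Paired t-test: do the two models differ significantly in average score?
t_stat, p_mean = stats.ttest_rel(model_a_scores, model_b_scores)

# Placeholder per-task standard deviations (variability across the k runs).
model_a_sd = rng.uniform(0.0, 0.2, size=65)
model_b_sd = rng.uniform(0.0, 0.2, size=65)

# Levene's test: is one model's variability significantly lower than the other's?
w_stat, p_var = stats.levene(model_a_sd, model_b_sd)

print(f"mean-score difference: p = {p_mean:.3f}; variability difference: p = {p_var:.3f}")
```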
The HackerRank-ASTRA Benchmark represents an important step towards a comprehensive and practical evaluation of LLMs. By considering project-based tasks and focusing on consistency, it provides valuable insights for the further development and effective use of AI in software development. The detailed analysis at the sub-skill level also enables targeted optimization of the models and contributes to unlocking the full potential of LLMs in practice.
Outlook
The further development of benchmarks like HackerRank-ASTRA is essential to measure and promote progress in the field of AI-supported software development. Future research could focus on expanding the task pool, integrating further evaluation criteria, and investigating the influence of model size and training data on performance and consistency. The continuous improvement of evaluation methodology will help to advance the development of robust and reliable LLMs for software development and facilitate their integration into practice.
Bibliography:
Xing, J., Bhatia, M., Phulwani, S., Suresh, D., & Matta, R. (2025). HackerRank-ASTRA: Evaluating Correctness & Consistency of Large Language Models on cross-domain multi-file project problems. arXiv preprint arXiv:2502.00226.
FuelGenesis.digital. (n.d.). Search results for author: Xing. Retrieved from https://fuelgenesis.digital/?searchtype=author&query=Xing
Zhong, H., & Xie, T. (2021). Jointly Learning to Repair Code and Generate Commit Message. In Proceedings of the 28th International Conference on Software Analysis, Evolution and Reengineering (SANER '21).
Li, J., Liu, Y., Li, Z., Liu, Y., & Liu, Y. (2020). JLLAR: A Logging Recommendation Plug-in Tool for Java. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER) (pp. 601-605). IEEE.
Arxiv Daily. (2023, February 6). Thread on the HackerRank-ASTRA Benchmark. Retrieved from https://www.arxivdaily.com/thread/63800
Anala, H. (n.d.). LinkedIn profile. Retrieved from https://in.linkedin.com/in/harshita-anala-ba8229228
Satriatama, R. (n.d.). LinkedIn profile. Retrieved from https://id.linkedin.com/in/risnanda-satriatama