APBench and benchmarking large language model performance in fundamental astrodynamics problems for space engineering

Authors: Di Wu,Raymond Zhang,Enrico M. Zucchelli,Yongchao Chen,Richard Linares

Journal: Scientific Reports

Publisher: Springer Science and Business Media LLC

Publish date: 2025-3-7

ISSN: 2045-2322 DOI: 10.1038/s41598-025-91150-5

merlyn

Participant

2 months ago 0 Replies

The paper states the error margin as:
(0.1 + 0.01·log(|Answer|)), with a fixed 0.1 used instead if the computed margin falls outside the range 0–1.

In the example:
True Answer = −5.2 (implied by “Model Answer −5.2 is between −4.479 and −5.655”).
Compute: |Answer| = 5.2 → log10(5.2) ≈ 0.716 → margin = 0.1 + 0.01 × 0.716 = 0.10716.
Acceptable range = −5.2 ± 0.10716 = [−5.307, −5.093] (not [−5.655, −4.479]).

The shown interval [−5.655, −4.479] is about 5.5 times wider (width 1.176 vs. 0.214). This error is not a typo; it fundamentally misrepresents the paper's own scoring rule.
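
The arithmetic above can be checked with a short Python sketch. It assumes the reading used in this comment: a base-10 logarithm and an absolute (not relative) margin around the answer. The function name `acceptable_range` and its parameters are illustrative and not taken from the paper; a natural logarithm or a relative margin would give different numbers.

```python
import math


def acceptable_range(answer, base_margin=0.1, log_coeff=0.01):
    """Return (low, high) under the reading used in this comment.

    Assumptions (not confirmed by the paper): log is base 10, and the
    margin is an absolute offset around the answer.
    """
    margin = base_margin + log_coeff * math.log10(abs(answer))
    # Fall back to the fixed margin when the computed one leaves [0, 1].
    if not 0.0 <= margin <= 1.0:
        margin = base_margin
    return answer - margin, answer + margin


low, high = acceptable_range(-5.2)
print(f"[{low:.3f}, {high:.3f}]")  # [-5.307, -5.093]

computed_width = high - low
shown_width = -4.479 - (-5.655)  # interval quoted from the paper's example
print(f"shown / computed width = {shown_width / computed_width:.2f}")  # 5.49
```

Under these assumptions the quoted interval is roughly 5.5 times wider than the computed one.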
