ScienceGuardians

APBench and benchmarking large language model performance in fundamental astrodynamics problems for space engineering

Authors: Di Wu, Raymond Zhang, Enrico M. Zucchelli, Yongchao Chen, Richard Linares
Journal: Scientific Reports
Publisher: Springer Science and Business Media LLC
Publish date: 2025-3-7
ISSN: 2045-2322
DOI: 10.1038/s41598-025-91150-5

The paper states the error margin as:
(0.1 + 0.01·log(|Answer|)), with a fixed 0.1 if the computed margin falls outside the range 0–1.

In the example:
True Answer = –5.2 (implied by “Model Answer −5.2 is between −4.479 and −5.655”).
Compute (taking log as base 10): |Answer| = 5.2 → log(5.2) ≈ 0.716 → margin = 0.1 + 0.01×0.716 = 0.10716.
Acceptable range = –5.2 ± 0.107 = [–5.307, –5.093] (not [–5.655, –4.479]).
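
The arithmetic above can be reproduced with a short sketch. It assumes, as the calculation here does, a base-10 logarithm and a margin applied as an absolute tolerance around the answer. The function names are hypothetical and are not taken from the paper.

```python
import math

def error_margin(answer: float) -> float:
    """Margin 0.1 + 0.01*log10(|answer|), with a fixed 0.1 outside [0, 1]."""
    if answer == 0:
        # Assumption: the log is undefined at zero, so use the fixed margin.
        return 0.1
    margin = 0.1 + 0.01 * math.log10(abs(answer))
    return margin if 0.0 <= margin <= 1.0 else 0.1

def acceptable_range(answer: float) -> tuple[float, float]:
    """Interval answer ± margin (margin treated as an absolute tolerance)."""
    m = error_margin(answer)
    return (answer - m, answer + m)

margin = error_margin(-5.2)
low, high = acceptable_range(-5.2)
print(f"margin = {margin:.5f}")
print(f"range  = [{low:.3f}, {high:.3f}]")
```

Under those assumptions the sketch gives a margin of about 0.10716 and a range of roughly [–5.307, –5.093], matching the figures above.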

The shown interval [–5.655, –4.479] is about 5.5 times wider (width 1.176 vs. 0.214). This error is not a typo; it fundamentally misrepresents the paper's own scoring rule.
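
The width comparison is a quick check. The sketch below uses the two intervals quoted in this comment; the variable names are illustrative only.

```python
# Interval shown in the paper's example and the one computed in this comment.
shown = (-5.655, -4.479)
computed = (-5.307, -5.093)

shown_width = shown[1] - shown[0]
computed_width = computed[1] - computed[0]
ratio = shown_width / computed_width

print(f"shown width    = {shown_width:.3f}")
print(f"computed width = {computed_width:.3f}")
print(f"ratio          = {ratio:.1f}")
```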
