ScienceGuardians


Tool learning with large language models: a survey

Authors: Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, Ji-Rong Wen
Journal: Frontiers of Computer Science
Publisher: Springer Science and Business Media LLC
Publication date: 2025-01-13
ISSN: 2095-2228 DOI: 10.1007/s11704-024-40678-2

In the section where they define what a “tool” is, they note that different research papers use the word in different ways. For some, one tool is one single API. For others, one tool is actually a group of many APIs. This seems like a big problem, no?

But then, for their whole survey, they simply decide to treat every tool as a single API. I’m not sure this is a good idea. If the studies they are comparing don’t even agree on what they are counting, how can we trust the numbers and the analysis? It’s like comparing the price of one single apple with the price of a whole bag of apples and saying they are the same thing.

So my question is: doesn’t this choice make their whole study a bit… unstable? The foundation isn’t solid, because the main object they study, the “tool,” isn’t defined consistently across the field. How can we be sure their findings hold for the entire area of tool learning if they are mixing different meanings like this?

Maybe I don’t understand something, but this feels like a big hole.
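To make the counting concern concrete, here is a tiny sketch (my own illustration; the service and endpoint names are made up, not from the paper) of how the very same tool catalog yields different counts under the two competing definitions:

```python
# Hypothetical illustration (names are made up): the same catalog,
# counted under the two competing definitions of "tool".
catalog = {
    "WeatherService": ["get_current", "get_forecast", "get_history"],
    "TranslateService": ["translate"],
}

# Definition A: one tool = one whole API service.
tools_as_services = len(catalog)

# Definition B: one tool = one individual API endpoint.
tools_as_endpoints = sum(len(endpoints) for endpoints in catalog.values())

print(tools_as_services, tools_as_endpoints)  # 2 vs. 4 for the same catalog
```

Any statistic built on “number of tools” (benchmark size, success rate per tool, and so on) shifts depending on which definition the underlying study used.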

All Replies

Viewing 1 replies (of 1 total)

1 week, 4 days ago

In Section 5, they discuss evaluation and present a big table with many benchmarks. But I see a very dangerous pattern. Look at the “Tool Source” column: so many of these benchmarks, especially the big ones like ToolBench with its 16,000+ tools, rely on RapidAPI or other public APIs.

We all know these public APIs can be very unstable. They change, they get deprecated, they have rate limits, they sometimes just go offline. If you build a benchmark on such shifting sand, how can your results be reproducible? A result you get today might be impossible to get next month because a critical API is gone. This is a crisis for scientific comparison!

My question is: did the authors consider this “temporal validity” problem? 
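One mitigation for this temporal-validity problem (my own sketch, not something the survey proposes) is to freeze live API responses into a snapshot once, then replay the snapshot during every later evaluation, so results stay reproducible even if the public API changes or goes offline:

```python
import hashlib
import json

# Minimal sketch (my own illustration, not from the survey): record live API
# responses once, then replay them during evaluation, so the benchmark stays
# reproducible even after the underlying public API changes or disappears.
class ReplayCache:
    def __init__(self):
        self._store = {}

    def _key(self, endpoint, params):
        # Deterministic key: endpoint plus canonically serialized parameters.
        blob = json.dumps([endpoint, params], sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def record(self, endpoint, params, response):
        self._store[self._key(endpoint, params)] = response

    def replay(self, endpoint, params):
        # Raises KeyError if this exact call was never recorded.
        return self._store[self._key(endpoint, params)]

cache = ReplayCache()
cache.record("weather/current", {"city": "Paris"}, {"temp_c": 12})
assert cache.replay("weather/current", {"city": "Paris"}) == {"temp_c": 12}
```

Of course, a snapshot only preserves comparability; it cannot test a model against API behavior that changed after the snapshot was taken.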

