TY - JOUR
T1 - Hold on! Is my feedback useful? Evaluating the usefulness of code review comments
AU - Ahmed, Sharif
AU - Eisty, Nasir U.
N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
PY - 2025/6
Y1 - 2025/6
AB - Context: In collaborative software development, peer code review is beneficial only when reviewers provide useful comments. Objective: This paper investigates the usefulness of code review comments (CR comments) through textual feature-based and featureless approaches. Method: We select three available datasets from both open-source and commercial projects. Additionally, we introduce new features from software and non-software domains. Moreover, we experiment with the presence of jargon, voice, and code in CR comments and classify the usefulness of CR comments through featurization, bag-of-words, and transfer learning techniques. Results: Our models outperform the baseline and achieve state-of-the-art performance. Furthermore, the results demonstrate that the large commercial LLM GPT-4o and the non-commercial, naive featureless approach, bag-of-words with TF-IDF, are more effective for predicting the usefulness of CR comments. Conclusion: The significant improvement in predicting usefulness solely from CR comments advances research on this task. Our analyses portray the similarities and differences of domains, projects, datasets, models, and features for predicting the usefulness of CR comments.
KW - Modern code review
KW - Natural language processing
KW - Software engineering
KW - Software quality
KW - Transfer learning
UR - http://www.scopus.com/inward/record.url?scp=85218423639&partnerID=8YFLogxK
U2 - 10.1007/s10664-025-10617-1
DO - 10.1007/s10664-025-10617-1
M3 - Article
AN - SCOPUS:85218423639
SN - 1382-3256
VL - 30
JO - Empirical Software Engineering
JF - Empirical Software Engineering
IS - 3
M1 - 70
ER -