“My AI is Lying to Me”: User-reported LLM hallucinations in AI mobile app reviews

  • Takale, D., Mahalle, P. & Sule, B. Advancements and applications of generative artificial intelligence. Journal of Information Technology and Sciences 10, 20–27 (2024).

  • Ramdurai, B. & Adhithya, P. The impact, advancements and applications of generative AI. International Journal of Computer Science and Engineering 10, 1–8 (2023).

  • Wang, J. et al. Evaluation and analysis of hallucination in large vision-language models (2023). arXiv:2308.15126.

  • Nwanna, M. et al. AI-driven personalisation: Transforming user experience across mobile applications. Journal of Artificial Intelligence, Machine Learning and Data Science 3, 1930–1937 (2025).

  • Behare, N., Bhagat, S. & Sarangdhar, P. Revolutionizing Customer Experience With AI-Powered Personalization. In Strategic Brand Management in the Age of AI and Disruption, 439–462 (IGI Global Scientific Publishing, 2025).

  • Ji, Z. et al. Survey of hallucination in natural language generation. ACM Computing Surveys 55, 1–38 (2023).

  • Zhang, Y. et al. Siren’s song in the AI ocean: a survey on hallucination in large language models (2023). arXiv:2309.01219.

  • Huang, L. et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems 43, 1–55 (2025).

  • Rawte, V. et al. The troubling emergence of hallucination in large language models – an extensive definition, quantification, and prescriptive remediations. In Findings of the Association for Computational Linguistics: EMNLP 2023 (Association for Computational Linguistics, 2023).

  • Bang, Y. et al. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity (2023). arXiv:2302.04023.

  • Li, J., Cheng, X., Zhao, W., Nie, J. & Wen, J. HaluEval: A large-scale hallucination evaluation benchmark for large language models (2023). arXiv:2305.11747.

  • Zhu, Z., Yang, Y. & Sun, Z. HaluEval-Wild: Evaluating hallucinations of language models in the wild (2024). arXiv:2403.04307.

  • Shao, A. Beyond Misinformation: A Conceptual Framework for Studying AI Hallucinations in (Science) Communication (2025). arXiv:2504.13777.

  • Massenon, R. et al. Mobile app review analysis for crowdsourcing of software requirements: a mapping study of automated and semi-automated tools. PeerJ Computer Science 10, e2401 (2024).

  • Gambo, I. et al. Enhancing user trust and interpretability in AI-driven feature request detection for mobile app reviews: an explainable approach. IEEE Access (2024).

  • Dąbrowski, J., Letier, E., Perini, A. & Susi, A. Analysing app reviews for software engineering: a systematic literature review. Empirical Software Engineering 27, 43 (2022).

  • Genc-Nayebi, N. & Abran, A. A systematic literature review: Opinion mining studies from mobile app store user reviews. Journal of Systems and Software 125, 207–219 (2017).

  • Palomba, F. et al. User reviews matter! Tracking crowdsourced reviews to support evolution of successful apps. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME), 291–300 (IEEE, 2015).

  • Fan, A. et al. Large language models for software engineering: Survey and open problems. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), 31–53 (IEEE, 2023).

  • Görmez, M., Yılmaz, M. & Clarke, P. Large Language Models for Software Engineering: A Systematic Mapping Study. In European Conference on Software Process Improvement, 64–79 (Springer Nature Switzerland, Cham, 2024).

  • Khan, W., Daud, A., Khan, K., Muhammad, S. & Haq, R. Exploring the frontiers of deep learning and natural language processing: A comprehensive overview of key challenges and emerging trends. Natural Language Processing Journal 4, 100026 (2023).

  • Desai, B., Patil, K., Patil, A. & Mehta, I. Large Language Models: A Comprehensive Exploration of Modern AI’s Potential and Pitfalls. Journal of Innovative Technologies 6 (2023).

  • Koenecke, A., Choi, A., Mei, K., Schellmann, H. & Sloane, M. Careless whisper: Speech-to-text hallucination harms. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 1672–1681 (ACM, 2024).

  • Moffatt v. Air Canada. McCarthy Tétrault TechLex Blog (2024). Last accessed 2025/05/05.

  • Maynez, J., Narayan, S., Bohnet, B. & McDonald, R. On faithfulness and factuality in abstractive summarization (2020). arXiv:2005.00661.

  • Leiser, F. et al. From ChatGPT to FactGPT: A participatory design study to mitigate the effects of large language model hallucinations on users. In Proceedings of Mensch und Computer 2023, 81–90 (ACM, 2023).

  • Leiser, F. et al. Hill: A hallucination identifier for large language models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1–13 (ACM, 2024).

  • Xu, Z., Jain, S. & Kankanhalli, M. Hallucination is inevitable: An innate limitation of large language models (2024). arXiv:2401.11817.

  • Tonmoy, S. et al. A comprehensive survey of hallucination mitigation techniques in large language models (2024). arXiv:2401.01313.

  • Martino, A., Iannelli, M. & Truong, C. Knowledge injection to counter large language model (LLM) hallucination. In European Semantic Web Conference, 182–185 (Springer Nature Switzerland, Cham, 2023).

  • Agrawal, A., Suzgun, M., Mackey, L. & Kalai, A. Do Language Models Know When They’re Hallucinating References? (2023). arXiv:2305.18248.

  • Jiang, Z., Araki, J., Ding, H. & Neubig, G. How can we know when language models know? on the calibration of language models for question answering. Transactions of the Association for Computational Linguistics 9, 962–977 (2021).

  • Xiong, M. et al. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs (2023). arXiv:2306.13063.

  • Khan, J., Qayyum, S. & Dar, H. Large Language Model for Requirements Engineering: A Systematic Literature Review. Research Square (2025).

  • Min, B. et al. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys 56, 1–40 (2023).

  • Hariri, W. Unlocking the potential of ChatGPT: A comprehensive exploration of its applications, advantages, limitations, and future directions in natural language processing (2023). arXiv:2304.02017.

  • Vinothkumar, J. & Karunamurthy, A. Recent advancements in artificial intelligence technology: trends and implications. Quing: International Journal of Multidisciplinary Scientific Research and Development 2, 1–11 (2023).

  • Farquhar, S., Kossen, J., Kuhn, L. & Gal, Y. Detecting hallucinations in large language models using semantic entropy. Nature 630, 625–630 (2024).

  • Dhuliawala, S. et al. Chain-of-verification reduces hallucination in large language models (2023). arXiv:2309.11495.

  • Béchard, P. & Ayala, O. M. Reducing hallucination in structured outputs via retrieval-augmented generation (2024). arXiv:2404.08189.

  • He, B. et al. Retrieving, rethinking and revising: The chain-of-verification can improve retrieval augmented generation (2024). arXiv:2410.05801.

  • Liu, F. et al. Exploring and evaluating hallucinations in LLM-powered code generation (2024). arXiv:2404.00971.

  • Lee, Y. et al. Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation, and Challenges (2025). arXiv:2504.20799.

  • Lin, S., Hilton, J. & Evans, O. TruthfulQA: Measuring how models mimic human falsehoods (2021). arXiv:2109.07958.

  • Zheng, S., Huang, J. & Chang, K. Why Does ChatGPT Fall Short in Providing Truthful Answers? (2023). arXiv:2304.10513.

  • Guerreiro, N. et al. Mitigating Hallucinations in Neural Machine Translation through Fuzzy-match Repair. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, 123–132 (EAMT, 2023).

  • Chen, N., Lin, J., Hoi, S., Xiao, X. & Zhang, B. AR-Miner: mining informative reviews for developers from mobile app marketplace. In Proceedings of the 36th International Conference on Software Engineering, 767–778 (ACM, 2014).

  • Wu, H., Deng, W., Niu, X. & Nie, C. Identifying key features from app user reviews. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 922–932 (IEEE, 2021).

  • Guzman, E. & Maalej, W. How do users like this feature? A fine-grained sentiment analysis of app reviews. In 2014 IEEE 22nd International Requirements Engineering Conference (RE), 153–162 (IEEE, 2014).

  • Ballas, V., Michalakis, K., Alexandridis, G. & Caridakis, G. Automating mobile app review user feedback with aspect-based sentiment analysis. In International Conference on Human-Computer Interaction, 179–193 (Springer Nature Switzerland, Cham, 2024).

  • Shah, F., Sabir, A. & Sharma, R. A Fine-grained Sentiment Analysis of App Reviews using Large Language Models: An Evaluation Study (2024). arXiv:2409.07162.

  • Ossai, C. & Wickramasinghe, N. Automatic user sentiments extraction from diabetes mobile apps – An evaluation of reviews with machine learning. Informatics for Health and Social Care 48, 211–230 (2023).

  • Gambo, I. et al. Extracting Features from App Store Reviews to Improve Requirements Analysis: Natural Language Processing and Machine Learning Approach. International Journal of Computing 17, 1–19 (2025).

  • Gambo, I., Massenon, R., Ogundokun, R. O., Agarwal, S. & Pak, W. Identifying and resolving conflict in mobile application features through contradictory feedback analysis. Heliyon 10 (2024).

  • Dam, S., Hong, C., Qiao, Y. & Zhang, C. A complete survey on LLM-based AI chatbots (2024). arXiv:2406.16937.
