STREAMING SPEECH TO TEXT ON ANDROID: A SOCKET.IO BASED SERVER APPROACH FOR ANDROID MOBILE APPLICATION
DOI:
https://doi.org/10.58432/mxqehx55
Keywords:
Real-time ASR, Socket.IO, Server-side processing, Streaming Speech-to-Text, Android applications
Abstract
This paper details a robust system enabling real-time Speech-to-Text capabilities on Android devices, leveraging a Socket.IO-based server architecture to manage audio streams and integrate with advanced language models. This approach effectively addresses the inherent challenges of on-device processing, such as latency, power consumption, and computational overhead, by offloading the intensive Speech-to-Text and Natural Language Processing tasks to a scalable server infrastructure. This distributed processing paradigm ensures minimal resource drain on the client device while maximizing accuracy and responsiveness.
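The server-side handling of audio streams described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class names (`AudioSession`, `StreamingServer`), the handler names, the one-second window, and the 16 kHz 16-bit mono PCM format are all assumptions made for illustration, and `placeholder_transcribe` merely stands in for a real Speech-to-Text model. In an actual deployment, the three handlers would presumably be wired to Socket.IO connect, audio, and disconnect events.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

SAMPLE_RATE_HZ = 16_000   # assumed sample rate
BYTES_PER_SAMPLE = 2      # assumed 16-bit mono PCM
WINDOW_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE  # one second of audio


@dataclass
class AudioSession:
    """Per-client buffer of raw PCM bytes awaiting transcription."""
    buffer: bytearray = field(default_factory=bytearray)

    def add_chunk(self, chunk: bytes) -> List[bytes]:
        """Append a chunk and return every complete window now available."""
        self.buffer.extend(chunk)
        windows = []
        while len(self.buffer) >= WINDOW_BYTES:
            windows.append(bytes(self.buffer[:WINDOW_BYTES]))
            del self.buffer[:WINDOW_BYTES]
        return windows

    def flush(self) -> bytes:
        """Return whatever partial window is left and empty the buffer."""
        remaining = bytes(self.buffer)
        self.buffer.clear()
        return remaining


class StreamingServer:
    """Hypothetical event handlers; each returns the transcripts to send back."""

    def __init__(self, transcribe: Callable[[bytes], str]) -> None:
        self.transcribe = transcribe
        self.sessions: Dict[str, AudioSession] = {}

    def on_connect(self, client_id: str) -> None:
        self.sessions[client_id] = AudioSession()

    def on_audio_chunk(self, client_id: str, chunk: bytes) -> List[str]:
        session = self.sessions[client_id]
        return [self.transcribe(w) for w in session.add_chunk(chunk)]

    def on_disconnect(self, client_id: str) -> List[str]:
        session = self.sessions.pop(client_id)
        tail = session.flush()
        return [self.transcribe(tail)] if tail else []


def placeholder_transcribe(pcm: bytes) -> str:
    """Stand-in for a real ASR model call; reports the audio duration only."""
    seconds = len(pcm) / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    return f"<{seconds:.2f}s of audio>"
```

Keeping the buffering on the server means the Android client only needs to capture audio and send small chunks, which matches the abstract's goal of minimal resource drain on the device. The window length is a tunable trade-off: shorter windows reduce latency, while longer ones give the recognizer more context per call.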
License
Copyright (c) 2025 Douglas Rakasiwi Nugroho, Christopher Limawan, Kelvin (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.


