arxiv: 1907.06690 · v1 · pith:Q3LRMGYRnew · submitted 2019-07-15 · 📡 eess.SY · cs.LG · cs.SY

A Scalable Framework for Multilevel Streaming Data Analytics using Deep Learning

Pith reviewed 2026-05-24 21:18 UTC · model grok-4.3

classification 📡 eess.SY · cs.LG · cs.SY
keywords streaming data analytics · deep learning · LSTM · Spark streaming · sentiment analysis · text analytics · big data · real-time processing

The pith

A framework integrates Spark Streaming, LSTM models, and SQL tools for scalable multilevel analytics on streaming text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a framework designed to process continuous streams of text data in real time while supporting offline analysis on the same system. It combines Spark streaming for initial text handling, LSTM deep learning for sentiment analysis, and SQL tools for queries and indexing. This approach aims to meet the needs of organizations that require unified real-time and batch processing on large hybrid datasets. The authors demonstrate it through a use case focused on language understanding and market information extraction.

Core claim

Our framework combines Spark streaming for real time text processing, the Long Short Term Memory (LSTM) deep learning model for higher level sentiment analysis, and other tools for SQL-based analytical processing to provide a scalable solution for multilevel streaming text analytics.

What carries the argument

The multilevel streaming text analytics framework that combines Spark streaming for real-time processing, LSTM for sentiment analysis, and SQL-based tools for queries.
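
The layering described above can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not the authors' implementation: fixed-size micro-batches stand in for Spark Streaming, and a toy word-list scorer stands in for the LSTM sentiment model. The names `micro_batches`, `tokenize`, `score`, `run_pipeline`, and the two word lists are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

# Hypothetical toy lexicon; in the paper's framework a trained LSTM fills this role.
POSITIVE = {"gain", "growth", "strong", "profit"}
NEGATIVE = {"loss", "decline", "weak", "recall"}


@dataclass
class Scored:
    text: str
    tokens: List[str]
    sentiment: str


def micro_batches(stream: Iterable[str], size: int) -> Iterator[List[str]]:
    """Level 1 stand-in: group a stream into fixed-size micro-batches."""
    batch: List[str] = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]


def score(tokens: List[str]) -> str:
    """Level 2 stand-in: label by counting lexicon hits (not an LSTM)."""
    balance = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if balance > 0:
        return "positive"
    if balance < 0:
        return "negative"
    return "neutral"


def run_pipeline(stream: Iterable[str], batch_size: int = 2) -> List[Scored]:
    """Run each micro-batch through tokenization and then sentiment scoring."""
    results: List[Scored] = []
    for batch in micro_batches(stream, batch_size):
        for text in batch:
            tokens = tokenize(text)
            results.append(Scored(text, tokens, score(tokens)))
    return results


if __name__ == "__main__":
    headlines = ["Strong profit growth reported.", "Product recall deepens loss"]
    for item in run_pipeline(headlines):
        print(item.sentiment, "|", item.text)
```

The point of the sketch is the separation of levels: the batching step knows nothing about sentiment, so the scorer could in principle be replaced by a real model without touching the stream handling.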

If this is right

  • Businesses can extract real-time market information from hybrid streaming data without maintaining separate offline systems.
  • Sentiment analysis becomes available as a higher-level step applied directly to processed streams.
  • SQL-based queries and data indexing support analytical processing on the combined output.
  • The framework is intended to scale to text data that is demanding in volume, velocity, variety, and veracity.
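
As a rough illustration of the SQL-based querying and indexing mentioned in the list above, the sketch below stores scored documents in an in-memory SQLite database and runs an aggregate query. SQLite is an assumption chosen for self-containment; this page does not name the SQL tools the authors used. The table name `documents`, its columns, and the sample rows are hypothetical.

```python
import sqlite3


def build_store(rows):
    """Create an in-memory table of scored documents with an index on sentiment."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, "
        "source TEXT NOT NULL, "
        "sentiment TEXT NOT NULL, "
        "body TEXT NOT NULL)"
    )
    # The index supports lookups and grouping by sentiment label.
    conn.execute("CREATE INDEX idx_documents_sentiment ON documents (sentiment)")
    conn.executemany(
        "INSERT INTO documents (source, sentiment, body) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return conn


def sentiment_counts(conn):
    """Return a {sentiment: count} mapping over all stored documents."""
    cur = conn.execute(
        "SELECT sentiment, COUNT(*) FROM documents GROUP BY sentiment ORDER BY sentiment"
    )
    return dict(cur.fetchall())


def bodies_with(conn, sentiment):
    """Return document bodies carrying the given sentiment label, in insert order."""
    cur = conn.execute(
        "SELECT body FROM documents WHERE sentiment = ? ORDER BY id", (sentiment,)
    )
    return [row[0] for row in cur.fetchall()]


if __name__ == "__main__":
    sample = [
        ("newswire", "positive", "Quarterly profit rises"),
        ("blog", "negative", "Supplier reports a loss"),
        ("newswire", "positive", "Strong demand continues"),
    ]
    store = build_store(sample)
    print(sentiment_counts(store))
    print(bodies_with(store, "positive"))
```

Parameterized queries (`?` placeholders) are used throughout so that document text arriving from a stream is never interpolated into SQL.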

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same integration pattern could extend to non-text streams such as sensor or transaction data in manufacturing or finance.
  • Organizations might reduce infrastructure complexity by replacing separate real-time and batch pipelines with one unified system.
  • Performance could be further tested by swapping LSTM for other deep learning models to measure accuracy and speed trade-offs.

Load-bearing premise

The specific combination of Spark streaming, LSTM models, and SQL tools will integrate seamlessly and deliver scalable performance for real-world multilevel streaming text analytics without major unaddressed bottlenecks or integration failures.

What would settle it

Run the framework on high-velocity real-world text streams and observe whether real-time processing latency stays low while sentiment analysis accuracy holds above baseline levels.
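
A minimal measurement harness for such a test might look like the sketch below. It records per-batch latency, computes a 95th-percentile figure, and compares accuracy against a majority-class baseline. The choice of baseline and of the p95 statistic are assumptions made for illustration; the paper reports no such measurements. `predict` is any hypothetical callable that maps a batch of texts to a list of labels.

```python
import time
from collections import Counter
from statistics import quantiles


def measure(predict, batches):
    """Return (per-batch latencies in seconds, flat list of predictions)."""
    latencies, predictions = [], []
    for batch in batches:
        start = time.perf_counter()
        predictions.extend(predict(batch))
        latencies.append(time.perf_counter() - start)
    return latencies, predictions


def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def p95(latencies):
    """95th-percentile latency; falls back to the single value if only one exists."""
    if len(latencies) < 2:
        return latencies[0]
    return quantiles(latencies, n=20)[-1]


if __name__ == "__main__":
    def toy_predict(batch):
        # Hypothetical stand-in for model inference.
        return ["positive" if "up" in text else "negative" for text in batch]

    batches = [["shares up", "shares down"], ["outlook up"]]
    labels = ["positive", "negative", "negative"]
    lat, preds = measure(toy_predict, batches)
    print("p95 latency (s):", p95(lat))
    print("accuracy:", accuracy(preds, labels), "baseline:", majority_baseline(labels))
```

In a real evaluation the batches would come from a recorded high-velocity stream, and the comparison of interest is whether accuracy stays above the baseline while the latency tail stays bounded as the input rate increases.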

Figures

Figures reproduced from arXiv: 1907.06690 by Farhana Zulkernine, Haruna Isah, Shahzad Khan, Shihao Ge.

Figure 1: Architecture of the multilevel streaming analytics framework
Figure 2: Implementation of our framework with selected tools

Original abstract

The rapid growth of data in velocity, volume, value, variety, and veracity has enabled exciting new opportunities and presented big challenges for businesses of all types. Recently, there has been considerable interest in developing systems for processing continuous data streams with the increasing need for real-time analytics for decision support in the business, healthcare, manufacturing, and security. The analytics of streaming data usually relies on the output of offline analytics on static or archived data. However, businesses and organizations like our industry partner Gnowit, strive to provide their customers with real time market information and continuously look for a unified analytics framework that can integrate both streaming and offline analytics in a seamless fashion to extract knowledge from large volumes of hybrid streaming data. We present our study on designing a multilevel streaming text data analytics framework by comparing leading edge scalable open-source, distributed, and in-memory technologies. We demonstrate the functionality of the framework for a use case of multilevel text analytics using deep learning for language understanding and sentiment analysis including data indexing and query processing. Our framework combines Spark streaming for real time text processing, the Long Short Term Memory (LSTM) deep learning model for higher level sentiment analysis, and other tools for SQL-based analytical processing to provide a scalable solution for multilevel streaming text analytics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a multilevel streaming text data analytics framework that integrates Spark Streaming for real-time text processing, LSTM deep learning models for sentiment analysis and language understanding, and SQL-based tools for analytical processing and querying. It aims to unify streaming and offline analytics in a scalable manner for applications such as real-time market information, with a demonstration on a use case involving the industry partner Gnowit, including data indexing and query processing.

Significance. If the integration and scalability claims were empirically validated, the work could offer a practical, open-source-based architecture for hybrid real-time analytics in business, healthcare, and security domains. The combination of established technologies addresses a genuine need for seamless streaming/offline systems. However, the manuscript provides only a high-level systems description without supporting measurements, limiting its contribution to an architectural sketch.

major comments (2)
  1. [Abstract] Abstract: The assertion of a 'scalable solution for multilevel streaming text analytics' and 'seamless' integration of Spark streaming, LSTM, and SQL tools is presented without any reported benchmarks, throughput/latency numbers, scaling behavior, error analysis, or baseline comparisons. This directly undermines the central claim of functionality and scalability.
  2. [Demonstration / use case section] Use-case demonstration: The described demonstration of multilevel text analytics (including LSTM-based sentiment analysis, data indexing, and query processing) supplies no quantitative validation data, failure-mode analysis, or discussion of integration bottlenecks such as LSTM inference latency under streaming load or Spark checkpointing overhead.
minor comments (2)
  1. The manuscript would benefit from explicit architecture diagrams with component interfaces and data-flow arrows to clarify how the components interact in real time.
  2. Notation for the 'multilevel' analytics pipeline is used without a formal definition or layered diagram, making it difficult to assess exactly what each level comprises.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the manuscript is primarily an architectural description with a functional demonstration and lacks quantitative performance data, which limits the strength of scalability claims. We will revise to qualify those claims and add discussion of limitations.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion of a 'scalable solution for multilevel streaming text analytics' and 'seamless' integration of Spark streaming, LSTM, and SQL tools is presented without any reported benchmarks, throughput/latency numbers, scaling behavior, error analysis, or baseline comparisons. This directly undermines the central claim of functionality and scalability.

    Authors: We accept the point. The abstract overstates the contribution by using 'scalable solution' and 'seamless' without supporting measurements. The work describes a framework design and shows functional integration on a use case with Gnowit but does not evaluate performance. We will revise the abstract to state that the framework 'combines Spark Streaming, LSTM models, and SQL tools for multilevel streaming text analytics' and remove the unsupported adjectives. We will also add a limitations paragraph noting that empirical benchmarks are left for future work. revision: yes

  2. Referee: [Demonstration / use case section] Use-case demonstration: The described demonstration of multilevel text analytics (including LSTM-based sentiment analysis, data indexing, and query processing) supplies no quantitative validation data, failure-mode analysis, or discussion of integration bottlenecks such as LSTM inference latency under streaming load or Spark checkpointing overhead.

    Authors: The use-case section is intended to illustrate component integration and real-world applicability rather than to provide performance results. We agree that the absence of quantitative data and bottleneck discussion weakens the paper. We will expand the section with a qualitative discussion of design considerations (e.g., LSTM batching for streaming, Spark checkpointing) and explicitly state that no latency or throughput measurements were collected in this study, marking this as an area for future empirical validation. revision: yes
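
The two design considerations named in the second response, batching inputs for model inference and checkpointing progress, can be illustrated with a short standard-library sketch. This is not Spark's checkpointing mechanism and not the authors' code: `MicroBatcher`, `save_checkpoint`, `load_checkpoint`, and the JSON file layout are hypothetical, and the batcher checks its time limit only when a new item arrives.

```python
import json
import os
import tempfile
import time


class MicroBatcher:
    """Buffer items and release a batch when it is full or has waited too long."""

    def __init__(self, max_size, max_wait_s, clock=time.monotonic):
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self.clock = clock  # injectable so the timing logic can be tested
        self._items = []
        self._first_at = None

    def add(self, item):
        """Buffer one item; return a batch if a flush condition holds, else None."""
        if not self._items:
            self._first_at = self.clock()
        self._items.append(item)
        full = len(self._items) >= self.max_size
        stale = self.clock() - self._first_at >= self.max_wait_s
        return self.flush() if full or stale else None

    def flush(self):
        """Return the buffered items and reset the buffer."""
        batch, self._items, self._first_at = self._items, [], None
        return batch


def save_checkpoint(path, offset):
    """Persist the last processed offset via write-to-temp then atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump({"offset": offset}, handle)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved offset, or 0 when no checkpoint file exists yet."""
    try:
        with open(path) as handle:
            return json.load(handle)["offset"]
    except FileNotFoundError:
        return 0
```

The trade-off is the usual one: larger batches amortize per-call inference cost but add waiting time, while more frequent checkpoints shorten recovery at the price of extra writes. Quantifying either would require the measurements the authors defer to future work.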

Circularity Check

0 steps flagged

No circularity: purely descriptive systems framework with no derivations or predictions

This is a high-level systems paper proposing an architectural framework that combines Spark streaming, LSTM models, and SQL tools for multilevel text analytics. The abstract and description contain no equations, no fitted parameters, no predictions derived from data, and no mathematical derivation chain. The central claim is an integration sketch rather than a result that reduces to its own inputs by construction. No self-citations are load-bearing for any uniqueness theorem or ansatz, and the paper does not rename known results or smuggle assumptions via citation. The finding of zero circularity therefore stands, which is the normal case for non-mathematical framework descriptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a systems integration description with no mathematical model, derivations, or formal claims, so no free parameters, axioms, or invented entities are present.
