A Scalable Framework for Multilevel Streaming Data Analytics using Deep Learning

Farhana Zulkernine; Haruna Isah; Shahzad Khan; Shihao Ge

arXiv: 1907.06690 · v1 · pith:Q3LRMGYRnew · submitted 2019-07-15 · 📡 eess.SY · cs.LG · cs.SY

Shihao Ge, Haruna Isah, Farhana Zulkernine, Shahzad Khan

Pith reviewed 2026-05-24 21:18 UTC · model grok-4.3

classification 📡 eess.SY cs.LG cs.SY

keywords: streaming data analytics, deep learning, LSTM, Spark streaming, sentiment analysis, text analytics, big data, real-time processing

The pith

A framework integrates Spark streaming, LSTM models, and SQL tools to handle scalable multilevel streaming text analytics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a framework designed to process continuous streams of text data in real time while supporting offline analysis on the same system. It combines Spark streaming for initial text handling, LSTM deep learning for sentiment analysis, and SQL tools for queries and indexing. This approach aims to meet the needs of organizations that require unified real-time and batch processing on large hybrid datasets. The authors demonstrate it through a use case focused on language understanding and market information extraction.

Core claim

Our framework combines Spark streaming for real time text processing, the Long Short Term Memory (LSTM) deep learning model for higher level sentiment analysis, and other tools for SQL-based analytical processing to provide a scalable solution for multilevel streaming text analytics.

What carries the argument

The multilevel streaming text analytics framework that combines Spark streaming for real-time processing, LSTM for sentiment analysis, and SQL-based tools for queries.
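
As an illustration of the three levels named above, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes in-memory lists in place of Spark Streaming micro-batches, a hypothetical word-list scorer (`score_sentiment`) in place of the LSTM model, and SQLite in place of the paper's SQL tooling. All function, table, and column names are invented for the example.

```python
import re
import sqlite3

# Hypothetical stand-in vocabulary; the paper uses an LSTM model, not a word list.
POSITIVE = {"good", "great", "gain", "strong"}
NEGATIVE = {"bad", "poor", "loss", "weak"}


def preprocess(text):
    """Level 1: lowercase and tokenize one raw text record."""
    return re.findall(r"[a-z']+", text.lower())


def score_sentiment(tokens):
    """Level 2: placeholder scorer (NOT an LSTM) returning a sentiment label."""
    balance = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if balance > 0:
        return "positive"
    if balance < 0:
        return "negative"
    return "neutral"


def run_pipeline(micro_batches, conn):
    """Level 3: store and index each scored micro-batch so it can be queried with SQL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs "
        "(id INTEGER PRIMARY KEY, text TEXT, sentiment TEXT)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sentiment ON docs (sentiment)")
    for batch in micro_batches:
        rows = [(text, score_sentiment(preprocess(text))) for text in batch]
        conn.executemany("INSERT INTO docs (text, sentiment) VALUES (?, ?)", rows)
        conn.commit()


# Two small micro-batches stand in for a continuous text stream.
stream = [
    ["Strong quarter, great gain for the sector", "Weak demand and a poor outlook"],
    ["The board met on Tuesday"],
]
conn = sqlite3.connect(":memory:")
run_pipeline(stream, conn)

# An analytical query over the combined output of all batches.
counts = dict(conn.execute("SELECT sentiment, COUNT(*) FROM docs GROUP BY sentiment"))
print(counts)
```

The point of the sketch is the shape of the data flow: each batch is cleaned, scored, and written to a queryable store, so the same table serves both the live stream and later offline queries.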

If this is right

Businesses can extract real-time market information from hybrid streaming data without maintaining separate offline systems.
Sentiment analysis becomes available as a higher-level step applied directly to processed streams.
SQL-based queries and data indexing support analytical processing on the combined output.
The system scales to text streams that are demanding in velocity, volume, variety, and veracity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same integration pattern could extend to non-text streams such as sensor or transaction data in manufacturing or finance.
Organizations might reduce infrastructure complexity by replacing separate real-time and batch pipelines with one unified system.
Performance could be further tested by swapping LSTM for other deep learning models to measure accuracy and speed trade-offs.

Load-bearing premise

The specific combination of Spark streaming, LSTM models, and SQL tools will integrate seamlessly and deliver scalable performance for real-world multilevel streaming text analytics without major unaddressed bottlenecks or integration failures.

What would settle it

Run the framework on high-velocity real-world text streams and observe whether real-time processing latency stays low while sentiment analysis accuracy holds above baseline levels.
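
A test of this kind could be organized along the following lines. This is a minimal sketch under stated assumptions, not a benchmark from the paper: the `measure` helper, the toy classifier, the sample texts, and the baseline value of 0.5 are all hypothetical, and a real run would replace them with the deployed pipeline, labelled stream data, and an agreed baseline.

```python
import statistics
import time


def measure(batches, labels, classify, baseline_accuracy):
    """Time each micro-batch and compare overall accuracy with a baseline."""
    latencies, correct, total = [], 0, 0
    for batch, batch_labels in zip(batches, labels):
        start = time.perf_counter()
        predictions = [classify(text) for text in batch]
        latencies.append(time.perf_counter() - start)
        correct += sum(p == y for p, y in zip(predictions, batch_labels))
        total += len(batch)
    accuracy = correct / total
    return {
        "median_latency_s": statistics.median(latencies),
        "max_latency_s": max(latencies),
        "accuracy": accuracy,
        "beats_baseline": accuracy > baseline_accuracy,
    }


def toy_classify(text):
    """Hypothetical stand-in for the sentiment model under test."""
    return "negative" if "loss" in text.lower() else "positive"


# Hypothetical labelled micro-batches; real tests would use recorded stream data.
batches = [["Record profit", "Heavy loss reported"], ["Loss widens", "Shares rally"]]
labels = [["positive", "negative"], ["negative", "positive"]]

report = measure(batches, labels, toy_classify, baseline_accuracy=0.5)
print(report)
```

Reporting the maximum alongside the median matters here, because a streaming system can look fast on average while individual batches stall.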

Figures

Figures reproduced from arXiv: 1907.06690 by Farhana Zulkernine, Haruna Isah, Shahzad Khan, Shihao Ge.

**Figure 2.** Implementation of our framework with selected tools.

Original abstract

The rapid growth of data in velocity, volume, value, variety, and veracity has enabled exciting new opportunities and presented big challenges for businesses of all types. Recently, there has been considerable interest in developing systems for processing continuous data streams with the increasing need for real-time analytics for decision support in the business, healthcare, manufacturing, and security. The analytics of streaming data usually relies on the output of offline analytics on static or archived data. However, businesses and organizations like our industry partner Gnowit, strive to provide their customers with real time market information and continuously look for a unified analytics framework that can integrate both streaming and offline analytics in a seamless fashion to extract knowledge from large volumes of hybrid streaming data. We present our study on designing a multilevel streaming text data analytics framework by comparing leading edge scalable open-source, distributed, and in-memory technologies. We demonstrate the functionality of the framework for a use case of multilevel text analytics using deep learning for language understanding and sentiment analysis including data indexing and query processing. Our framework combines Spark streaming for real time text processing, the Long Short Term Memory (LSTM) deep learning model for higher level sentiment analysis, and other tools for SQL-based analytical processing to provide a scalable solution for multilevel streaming text analytics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper describes a Spark Streaming plus LSTM architecture for real-time text analytics but supplies no benchmarks, metrics, or new techniques.

The main thing to know is that this paper describes wiring together Spark Streaming for real-time text processing, an LSTM model for sentiment analysis, and SQL tools for queries in a multilevel setup. It claims the result is a scalable framework, yet the text contains no performance numbers, latency figures, throughput tests, or comparisons to support that claim.

The work is an applied engineering description tied to an industry-partner use case at Gnowit rather than a research result. It outlines the components and how they fit for streaming text analytics, including indexing and query handling. That kind of concrete architecture sketch can be useful as a reference for teams building similar pipelines.

Beyond the specific combination for this business scenario, nothing in the methods or approach is new. The technologies are standard and the integration follows patterns already common in industry applications. The paper mentions comparing leading open-source options but gives no details on the comparisons or the reasons for the final choices.

The central soft spot is the complete lack of evaluation. Scalability and seamless operation are asserted without any data on how the system behaves under load, how LSTM inference affects streaming latency, or where bottlenecks appear. This makes the main claims impossible to assess.

The paper is aimed at practitioners or engineers who want an example of tool assembly for real-time analytics. Researchers will not find new algorithms, formal results, or validated insights. It does not deserve peer review because the contribution stays at the level of an untested design description with no empirical grounding to evaluate.

Referee Report

2 major / 2 minor

Summary. The paper proposes a multilevel streaming text data analytics framework that integrates Spark Streaming for real-time text processing, LSTM deep learning models for sentiment analysis and language understanding, and SQL-based tools for analytical processing and querying. It aims to unify streaming and offline analytics in a scalable manner for applications such as real-time market information, with a demonstration on a use case involving the industry partner Gnowit, including data indexing and query processing.

Significance. If the integration and scalability claims were empirically validated, the work could offer a practical, open-source-based architecture for hybrid real-time analytics in business, healthcare, and security domains. The combination of established technologies addresses a genuine need for seamless streaming/offline systems. However, the manuscript provides only a high-level systems description without supporting measurements, limiting its contribution to an architectural sketch.

major comments (2)

[Abstract] Abstract: The assertion of a 'scalable solution for multilevel streaming text analytics' and 'seamless' integration of Spark streaming, LSTM, and SQL tools is presented without any reported benchmarks, throughput/latency numbers, scaling behavior, error analysis, or baseline comparisons. This directly undermines the central claim of functionality and scalability.
[Demonstration / use case section] Use-case demonstration: The described demonstration of multilevel text analytics (including LSTM-based sentiment analysis, data indexing, and query processing) supplies no quantitative validation data, failure-mode analysis, or discussion of integration bottlenecks such as LSTM inference latency under streaming load or Spark checkpointing overhead.

minor comments (2)

The manuscript would benefit from explicit architecture diagrams with component interfaces and data-flow arrows to clarify how the components interact in real time.
Notation for the 'multilevel' analytics pipeline is used without a formal definition or layered diagram, making it difficult to assess exactly what each level comprises.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the manuscript is primarily an architectural description with a functional demonstration and lacks quantitative performance data, which limits the strength of scalability claims. We will revise to qualify those claims and add discussion of limitations.

Point-by-point responses

Referee: [Abstract] Abstract: The assertion of a 'scalable solution for multilevel streaming text analytics' and 'seamless' integration of Spark streaming, LSTM, and SQL tools is presented without any reported benchmarks, throughput/latency numbers, scaling behavior, error analysis, or baseline comparisons. This directly undermines the central claim of functionality and scalability.

Authors: We accept the point. The abstract overstates the contribution by using 'scalable solution' and 'seamless' without supporting measurements. The work describes a framework design and shows functional integration on a use case with Gnowit but does not evaluate performance. We will revise the abstract to state that the framework 'combines Spark Streaming, LSTM models, and SQL tools for multilevel streaming text analytics' and remove the unsupported adjectives. We will also add a limitations paragraph noting that empirical benchmarks are left for future work.
Revision: yes

Referee: [Demonstration / use case section] Use-case demonstration: The described demonstration of multilevel text analytics (including LSTM-based sentiment analysis, data indexing, and query processing) supplies no quantitative validation data, failure-mode analysis, or discussion of integration bottlenecks such as LSTM inference latency under streaming load or Spark checkpointing overhead.

Authors: The use-case section is intended to illustrate component integration and real-world applicability rather than to provide performance results. We agree that the absence of quantitative data and bottleneck discussion weakens the paper. We will expand the section with a qualitative discussion of design considerations (e.g., LSTM batching for streaming, Spark checkpointing) and explicitly state that no latency or throughput measurements were collected in this study, marking this as an area for future empirical validation.
Revision: yes
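
The "LSTM batching for streaming" consideration mentioned in the response can be pictured with a short sketch. It is an assumption-laden illustration, not code from the paper: `predict_batch` is a hypothetical stand-in for one batched forward pass of a model, and the batch size of 2 is arbitrary. The idea is that grouping incoming records lets the model be invoked once per group rather than once per record.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Yield lists of up to batch_size records from any iterable, in order."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def predict_batch(texts):
    """Hypothetical stand-in for one batched model call; returns token counts."""
    return [len(text.split()) for text in texts]


def score_stream(records, batch_size):
    """Score a stream in micro-batches and count how many model calls were made."""
    results, calls = [], 0
    for batch in micro_batches(records, batch_size):
        results.extend(predict_batch(batch))
        calls += 1
    return results, calls


results, calls = score_stream(["a b", "c", "d e f", "g h", "i"], batch_size=2)
print(results, calls)
```

Larger batches reduce the number of model calls but make each record wait longer before it is scored, which is the latency trade-off the referee asks the authors to discuss.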

Circularity Check

0 steps flagged

No circularity: purely descriptive systems framework with no derivations or predictions

Full rationale

This is a high-level systems paper proposing an architectural framework that combines Spark Streaming, LSTM models, and SQL tools for multilevel text analytics. The abstract and description contain no equations, no fitted parameters, no predictions derived from data, and no mathematical derivation chain. The central claim is an integration sketch rather than a result that reduces to its own inputs by construction. No self-citations are load-bearing for any uniqueness theorem or ansatz, and the paper does not rename known results or smuggle assumptions via citation. Zero circularity is the normal case for non-mathematical framework descriptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a systems integration description with no mathematical model, derivations, or formal claims, so no free parameters, axioms, or invented entities are present.

pith-pipeline@v0.9.0 · 5766 in / 1027 out tokens · 26552 ms · 2026-05-24T21:18:47.923311+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 4 internal anchors

[1] C. Ballard et al., IBM InfoSphere Streams: Assembling continuous insight in the information revolution. IBM Redbooks, 2012
[2] R. Agerri, X. Artola, Z. Beloki, G. Rigau, and A. Soroa, "Big data for Natural Language Processing: A streaming approach," Knowledge-Based Systems, vol. 79, pp. 36-42, 2015
[3] P. Vossen et al., "NewsReader: Using knowledge resources in a cross-lingual reading machine to generate more knowledge from massive streams of news," Knowledge-Based Systems, vol. 110, pp. 60-85, 2016
[4] L. R. Nair, S. D. Shetty, and S. D. Shetty, "Applying spark based machine learning model on streaming big data for health status prediction," Computers & Electrical Engineering, vol. 65, pp. 393-399, 2018
[5] H. Isah and F. Zulkernine, "A Scalable and Robust Framework for Data Stream Ingestion," in 2018 IEEE International Conference on Big Data (Big Data), 2018: IEEE, pp. 2900-2905
[6] A. G. Psaltis, Streaming Data: Understanding the Real-Time Pipeline. Manning Publications Company, 2017
[7] W. Dean, Fast Data Architectures for streaming applications. Lightbend and O'Reilly, 2016
[8] S. Ryza, U. Laserson, S. Owen, and J. Wills, Advanced analytics with spark: patterns for learning from data at scale. O'Reilly Media, Inc., 2017
[9] B. Pang and L. Lee, "Opinion mining and sentiment analysis," Foundations and Trends® in Information Retrieval, vol. 2, no. 1–2, pp. 1-135, 2008
[10] L. Zhang, S. Wang, and B. Liu, "Deep learning for sentiment analysis: A survey," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 8, no. 4, p. e1253, 2018
[11] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015
[12] Y. Wang et al., "BigDL: A distributed deep learning framework for big data," arXiv preprint arXiv:1804.05839, 2018
[13] H. K. J. An, "Two Tales of the World: Comparison of Widely Used World News Datasets GDELT and EventRegistry," 2016
[14] P. Karuna, M. Rana, and H. Purohit, "Citizenhelper: A streaming analytics system to mine citizen and web data for humanitarian organizations," in Eleventh International AAAI Conference on Web and Social Media, 2017
[15] A. Severyn and A. Moschitti, "Twitter sentiment analysis with deep convolutional neural networks," in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2015: ACM, pp. 959-962
[16] K. S. Tai, R. Socher, and C. D. Manning, "Improved semantic representations from tree-structured long short-term memory networks," arXiv preprint arXiv:1503.00075, 2015
[17] A. Radford, R. Jozefowicz, and I. Sutskever, "Learning to generate reviews and discovering sentiment," arXiv preprint arXiv:1704.01444, 2017
[18] B. Krause, L. Lu, I. Murray, and S. Renals, "Multiplicative LSTM for sequence modelling," arXiv preprint arXiv:1609.07959, 2016
[19] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," URL https://s3-us-west-2. amazonaws. com/openai -assets/research- covers/languageunsupervised/language understanding paper. pdf, 2018
[20] A. M. Dai and Q. V. Le, "Semi-supervised sequence learning," in Advances in neural information processing systems, 2015, pp. 3079-3087
[21] J. Howard and S. Ruder, "Universal language model fine-tuning for text classification," arXiv preprint arXiv:1801.06146, 2018
[22] J. Dean et al., "Large scale distributed deep networks," in Advances in neural information processing systems, 2012, pp. 1223-1231
[23] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural computation, vol. 9, no. 8, pp. 1735-1780, 1997
[24] M. Zaharia et al., "Apache spark: a unified engine for big data processing," Communications of the ACM, vol. 59, no. 11, pp. 56-65, 2016
[25] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing," IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55-75, 2018
[26] Z. Matei, D. Tathagata, A. Michael, and X. Reynold. "Structured Streaming In Apache Spark." @databricks
