arxiv: 2402.06698 · v1 · pith:CV32L7QYnew · submitted 2024-02-09 · 💱 q-fin.ST

FNSPID: A Comprehensive Financial News Dataset in Time Series

classification 💱 q-fin.ST
keywords financial, fnspid, market, stock, dataset, news, analysis, sentiment

Financial market predictions utilize historical data to anticipate future stock prices and market trends. Traditionally, these predictions have focused on the statistical analysis of quantitative factors, such as stock prices, trading volumes, inflation rates, and changes in industrial production. Recent advancements in large language models motivate integrated financial analysis that combines sentiment data, particularly market news, with numerical factors. Nonetheless, this methodology is frequently constrained by the scarcity of large datasets that combine quantitative data with qualitative sentiment information. To address this challenge, we introduce a large-scale financial dataset, the Financial News and Stock Price Integration Dataset (FNSPID). It comprises 29.7 million stock prices and 15.7 million time-aligned financial news records for 4,775 S&P500 companies, covering the period from 1999 to 2023, sourced from 4 stock market news websites. We demonstrate that FNSPID surpasses existing stock market datasets in scale and diversity while uniquely incorporating sentiment information. Through financial analysis experiments on FNSPID, we show that (1) the dataset's size and quality significantly boost market prediction accuracy and (2) adding sentiment scores modestly enhances the performance of the transformer-based model; we also provide (3) a reproducible procedure for updating the dataset. Completed work, code, documentation, and examples are available at github.com/Zdong104/FNSPID. FNSPID offers unprecedented opportunities for the financial research community to advance predictive modeling and analysis.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Point-in-Time Financial RAG with Frozen LLMs and Market-Feedback Adaptive Retrieval

    cs.CL 2026-05 unverdicted novelty 6.0

    Adding a Bayesian source memory for market-feedback adaptive retrieval to a frozen LLM improves macro-F1 from 0.438 to 0.471 and portfolio Sharpe from 0.52 to 0.84 in point-in-time financial event-impact prediction.

  2. From Time Series Analysis to Question Answering: A Survey in the LLM Era

    cs.LG 2025-06 accept novelty 6.0

    A survey proposing a taxonomy of Injective, Bridging, and Internal Alignment paradigms to evolve TSA into user-driven Time Series Question Answering with LLMs.

  3. Semantic State Abstraction Interfaces for LLM-Augmented Portfolio Decisions: Multi-Axis News Decomposition and RL Diagnostics

    cs.LG 2026-05 unverdicted novelty 5.0

    SSAI maps news into four factors (sentiment, risk, confidence, volatility) for trading, but factor portfolios, ridge models, and RL agents show no reliable edge over baselines after coverage controls and costs.