Clustering Activity-Travel Behavior Time Series using Topological Data Analysis

Jingyue Zhang; Karthik Konduri; Nalini Ravishanker; Renjie Chen

arxiv: 1907.07603 · v1 · pith:7U2PJHBSnew · submitted 2019-07-17 · 📊 stat.ML · cs.LG · stat.AP

Renjie Chen, Jingyue Zhang, Nalini Ravishanker, Karthik Konduri

Pith reviewed 2026-05-24 19:59 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · stat.AP

keywords activity-travel behavior · topological data analysis · time series clustering · K-means · National Household Travel Survey · cohort differences · travel patterns · categorical time series

The pith

Activity-travel patterns in U.S. national surveys from 1990 to 2017 form three clusters when processed with time series and topological features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a divide-and-combine strategy for clustering long sequences of daily activity and travel choices. Features are first pulled from each time series using standard time-series tools together with topological data analysis, then K-means is run on the reduced space. When this pipeline is applied to five waves of the National Household Travel Survey, the data separate into three groups. The resulting groups also line up with documented differences across the survey cohorts. The same procedure is presented as usable for any other categorical time series that records transportation behavior.

Core claim

A divide-and-combine K-means procedure applied to features extracted by time series analysis and topological data analysis partitions activity-travel time series into three clusters; the same clustering recovers cohort-level distinctions present in the National Household Travel Survey waves collected between 1990 and 2017.

What carries the argument

Divide-and-combine K-means operating on a feature vector obtained by combining time-series descriptors with topological data analysis summaries of categorical activity-travel sequences.
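
As a rough illustration of what a divide-and-combine K-means could look like, the sketch below splits the feature vectors into chunks, runs K-means on each chunk, and then clusters the pooled chunk centroids. This is an assumption about the general shape of such a scheme, not the authors' algorithm: the function names, the farthest-point initialisation, the interleaved chunking, and the toy 2-D points standing in for the time-series and TDA features are all hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    # Index of the nearest centroid for every point.
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, iters=50):
    # Deterministic farthest-point initialisation (an illustrative choice).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    # Lloyd iterations: reassign points, then recompute cluster means.
    for _ in range(iters):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids


def divide_and_combine_kmeans(points, k, n_chunks):
    # Divide: cluster each interleaved chunk independently.
    chunks = [points[i::n_chunks] for i in range(n_chunks)]
    local = [c for chunk in chunks for c in kmeans(chunk, k)]
    # Combine: cluster the pooled local centroids, then label every point.
    centroids = kmeans(local, k)
    return centroids, assign(points, centroids)


# Toy stand-in for the extracted features: three well-separated 2-D blobs.
rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in centers for _ in range(60)]
centroids, labels = divide_and_combine_kmeans(points, k=3, n_chunks=4)
```

The appeal of this kind of scheme is that no single K-means run has to hold the full data set; whether the combined partition matches a full-data run is exactly the sort of stability question the editorial analysis raises.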

If this is right

Activity-travel sequences across three decades reduce to three stable groups.
Observed differences between survey cohorts are recoverable from the same three-group partition.
The method extends directly to other categorical transportation time series such as driving behavior, mode choice, or vehicle ownership.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the three groups remain stable in later data, policy interventions could be designed separately for each cluster rather than for the population average.
Re-applying the pipeline to post-2017 survey waves would test whether new clusters appear once ride-hailing, remote work, or electrification become widespread.
The same feature-plus-clustering steps could be used to compare activity-travel stability across countries or cities that collect comparable diary data.

Load-bearing premise

The features taken from time series analysis and topological data analysis still contain the distinctions that actually matter for activity-travel behavior, so ordinary Euclidean K-means on those features produces groups worth interpreting.

What would settle it

The central claim would be undermined if rerunning the pipeline on the same survey waves, with a different number of clusters or with an alternative feature set, failed to recover three groups aligned with cohort differences.

Original abstract

Over the last few years, traffic data has been exploding and the transportation discipline has entered the era of big data. It brings out new opportunities for doing data-driven analysis, but it also challenges traditional analytic methods. This paper proposes a new Divide and Combine based approach to do K means clustering on activity-travel behavior time series using features that are derived using tools in Time Series Analysis and Topological Data Analysis. Clustering data from five waves of the National Household Travel Survey ranging from 1990 to 2017 suggests that activity-travel patterns of individuals over the last three decades can be grouped into three clusters. Results also provide evidence in support of recent claims about differences in activity-travel patterns of different survey cohorts. The proposed method is generally applicable and is not limited only to activity-travel behavior analysis in transportation studies. Driving behavior, travel mode choice, household vehicle ownership, when being characterized as categorical time series, can all be analyzed using the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TDA-plus-time-series pipeline for NHTS clustering claims three groups but reports no validation for K or cluster quality.

The main takeaway is a divide-and-combine K-means method that extracts features via time series tools and topological data analysis to cluster categorical activity-travel sequences from five NHTS waves. It reports three clusters that persist across decades and align with some cohort differences, yet the lack of any internal validation for the cluster count or stability is the central weakness that undercuts the headline result.

What is new is the specific transportation application of this hybrid feature set plus the scaling trick for K-means on large categorical series. TDA on time series exists elsewhere, but the concrete use on NHTS activity patterns and the cohort comparison is a fresh empirical step. The paper also states the pipeline can extend to driving behavior or mode choice series, which is a fair generality claim given the setup. It handles the big-data framing in transportation reasonably and keeps the method description accessible.

The soft spots are concentrated in the clustering validation. Nothing is said about how K=3 was selected, whether by elbow, gap statistic, or another rule, and there are no silhouette scores, stability metrics across the divide-and-combine splits, or checks that the partitions hold under modest changes to the TDA summaries. The assumption that the extracted features preserve the distinctions that matter for activity-travel behavior therefore sits untested. This matters because the three-cluster claim is the main empirical output.

Readers working in transportation data analysis or applied clustering on categorical series would find the most value here; a methods specialist might borrow the pipeline idea but would want the missing checks first. The paper shows straightforward engagement with the data and literature without internal contradictions. It deserves a serious referee because the data scale is real and the reusable aspect is concrete.

I would send it to review with a request for the validation metrics and external alignment tests in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a Divide-and-Combine procedure that extracts features from activity-travel behavior time series via Time Series Analysis and Topological Data Analysis, then applies Euclidean K-means. On five waves of the National Household Travel Survey (1990–2017) the method yields three clusters; the authors interpret this partition as evidence of stable long-term patterns and cohort differences.

Significance. If the three-cluster partition is shown to be robust, the work would supply a practical pipeline for clustering large categorical time-series data in transportation and related fields. The generality claim (applicability to driving behavior, mode choice, etc.) would be a secondary contribution.

major comments (2)

[Results] Results section: the central claim that the data support exactly three clusters is not accompanied by any reported procedure for selecting K (elbow, silhouette, gap statistic, etc.) or by internal validation metrics (silhouette scores, stability under bootstrap or perturbation of the TDA summaries). Without these, it is impossible to assess whether the reported partition is meaningful or an artifact of the feature pipeline and the Divide-and-Combine step.
[Method] Method section on feature construction: the assumption that the chosen TSA and TDA features preserve the distinctions relevant to activity-travel behavior is not tested; no ablation or sensitivity check is presented showing that the three-cluster structure survives modest changes to the persistence summaries or to the time-series feature set.
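
One of the selection rules named in the first major comment, the mean silhouette score, can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the helper names `kmeans_labels` and `mean_silhouette`, the farthest-point initialisation, and the synthetic three-blob features are hypothetical and are not taken from the manuscript.

```python
import math
import random


def kmeans_labels(points, k, iters=50):
    # Deterministic farthest-point initialisation (an illustrative choice).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def mean_silhouette(points, labels):
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Synthetic stand-in for the feature matrix: three well-separated blobs.
rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in centers for _ in range(40)]

scores = {k: mean_silhouette(points, kmeans_labels(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
```

On real survey features the curve of `scores` against K is rarely this clean, which is why the report asks for it to be shown rather than assumed.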

minor comments (2)

[Abstract] Abstract: 'do K means clustering' should read 'perform K-means clustering'.
[Method] Notation: the description of the Divide-and-Combine procedure would benefit from an explicit algorithmic outline or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify gaps in the validation of the clustering results and the robustness of the feature pipeline. We address each point below and will revise the manuscript accordingly.

Referee: [Results] Results section: the central claim that the data support exactly three clusters is not accompanied by any reported procedure for selecting K (elbow, silhouette, gap statistic, etc.) or by internal validation metrics (silhouette scores, stability under bootstrap or perturbation of the TDA summaries). Without these, it is impossible to assess whether the reported partition is meaningful or an artifact of the feature pipeline and the Divide-and-Combine step.

Authors: We agree that the original manuscript lacks an explicit procedure for selecting K and internal validation metrics. The choice of three clusters was motivated by domain knowledge from transportation studies on activity-travel patterns, but no quantitative criteria or stability checks were reported. In the revision we will add an elbow plot, silhouette scores, gap statistic, and bootstrap stability analysis on the TDA features to substantiate the partition.
Revision: yes

Referee: [Method] Method section on feature construction: the assumption that the chosen TSA and TDA features preserve the distinctions relevant to activity-travel behavior is not tested; no ablation or sensitivity check is presented showing that the three-cluster structure survives modest changes to the persistence summaries or to the time-series feature set.

Authors: We acknowledge that no ablation or sensitivity analysis on the TSA and TDA features was included. The features were selected to capture temporal and topological properties of categorical time series, but their necessity for recovering the three-cluster structure was not tested. In the revised manuscript we will add sensitivity checks by varying persistence summaries and time-series feature subsets to demonstrate robustness of the reported clusters.
Revision: yes
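
A sensitivity check of the kind discussed in these responses might be sketched as below: perturb the features slightly, recluster, and measure agreement with the baseline partition using the Rand index. Everything here is an illustrative assumption, including the Gaussian perturbation (standing in for "modest changes to the persistence summaries"), the helper names, and the synthetic features; the simulated rebuttal does not specify a procedure.

```python
import math
import random


def kmeans_labels(points, k, iters=50):
    # Deterministic farthest-point initialisation (an illustrative choice).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def rand_index(a, b):
    # Fraction of point pairs on which two partitions agree
    # (both place the pair together, or both place it apart).
    n = len(a)
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i in range(n) for j in range(i + 1, n))
    return agree / (n * (n - 1) / 2)


def perturbation_stability(points, k, noise_sd, n_rounds, seed=0):
    rng = random.Random(seed)
    baseline = kmeans_labels(points, k)
    scores = []
    for _ in range(n_rounds):
        # Hypothetical perturbation: small Gaussian noise on every feature.
        noisy = [tuple(x + rng.gauss(0, noise_sd) for x in p) for p in points]
        scores.append(rand_index(baseline, kmeans_labels(noisy, k)))
    return sum(scores) / len(scores)


# Synthetic stand-in for the feature matrix: three well-separated blobs.
rng = random.Random(1)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in centers for _ in range(30)]

stability = perturbation_stability(points, k=3, noise_sd=0.1, n_rounds=5)
```

A stability value near 1 means the partition barely moves under perturbation; values that drop sharply as `noise_sd` grows would suggest the clusters depend on fine details of the feature construction.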

Circularity Check

0 steps flagged

No significant circularity; result is empirical output of feature extraction + clustering pipeline

The paper describes an empirical workflow: extract features via time-series analysis and topological data analysis, then apply K-means (with a Divide-and-Combine procedure) to NHTS waves. The three-cluster grouping is reported as the direct output of this pipeline on the data, not as a quantity derived from or forced by any fitted parameter, self-definition, or prior self-citation. No equations, uniqueness theorems, or ansatzes are shown that reduce the cluster labels or the K=3 choice to the inputs by construction. The supporting claim about cohort differences is presented as a post-hoc observation rather than a load-bearing premise. This matches the default case of a self-contained empirical analysis.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract supplies no explicit parameter counts or invented entities; the central claim rests on the unstated premise that TDA features are informative for this domain.

free parameters (1)

number of clusters K
Set to 3 for the reported analysis; value is data-dependent and not derived from first principles.

axioms (1)

domain assumption: Activity-travel records can be faithfully represented as categorical time series whose topological and statistical features capture behaviorally relevant differences.
Invoked by the choice of feature extraction step before clustering.

pith-pipeline@v0.9.0 · 5702 in / 1208 out tokens · 18762 ms · 2026-05-24T19:59:18.643770+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

We convert the WFT of the time series into a first-order persistence landscape ... $PL(n,\ell) = \min(V_1(n,\ell), V_2(n,\ell))_+$ ... used as features in the clustering algorithm.
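
For readers unfamiliar with persistence landscapes, the sketch below evaluates the standard construction on a grid: each birth-death pair contributes a tent function, the positive part of min(t - birth, death - t), and the k-th order landscape is the k-th largest tent value at each grid point. This follows the textbook definition and is only assumed to correspond to the quoted PL(n,ℓ) formula; the mapping of V1 and V2 to the two arms of the tent, the function names, and the example pairs are hypothetical.

```python
def tent(birth, death, t):
    # Positive part of min(t - birth, death - t).
    return max(0.0, min(t - birth, death - t))


def landscape(pairs, grid, order=1):
    # k-th order landscape: k-th largest tent value at each grid point.
    values = []
    for t in grid:
        heights = sorted((tent(b, d, t) for b, d in pairs), reverse=True)
        values.append(heights[order - 1] if len(heights) >= order else 0.0)
    return values


# Two illustrative birth-death pairs, one nested inside the other.
pairs = [(0.0, 4.0), (1.0, 3.0)]
grid = [0.0, 1.0, 2.0, 3.0, 4.0]

first_order = landscape(pairs, grid, order=1)   # [0.0, 1.0, 2.0, 1.0, 0.0]
second_order = landscape(pairs, grid, order=2)  # [0.0, 0.0, 1.0, 0.0, 0.0]
```

Sampling the landscape on a fixed grid yields a fixed-length numeric vector, which is what makes it usable as input to a Euclidean method such as K-means.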

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.