
arxiv: 1907.05281 · v1 · pith:A3FV36TJnew · submitted 2019-07-02 · 💻 cs.CV · eess.IV

Human Body Parts Tracking: Applications to Activity Recognition

Pith reviewed 2026-05-25 10:59 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords human body parts tracking · activity recognition · blob tracking · foreground silhouette · background subtraction · 2D Gaussian modeling · partial occlusions · real-time video

The pith

Torso blob tracking on foreground silhouettes anchors real-time tracking of head, arms and legs for activity recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a real-time system that first isolates the human silhouette through background subtraction and then locates the torso with blob tracking to determine its position and size in each video frame. Other body parts are placed at fixed offsets relative to this torso and represented as 2D-Gaussian blobs whose parameters capture location, size and pose. A refinement step cleans the foreground mask before these placements occur. The resulting tracks are asserted to remain accurate across illumination changes and partial occlusions and are demonstrated on simple activity sequences such as carrying an object or opening a container.
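The first stage in that description, background subtraction, can be sketched as follows. This is a minimal illustration, assuming a single static reference frame and a hand-picked intensity threshold; the paper does not say which background model it uses, and the names `foreground_mask` and `threshold` are hypothetical.

```python
def foreground_mask(frame, background, threshold=30):
    """Return a binary mask: 1 where the frame differs from the background.

    frame, background: equal-sized 2D lists of grayscale intensities (0-255).
    threshold: assumed value for illustration; the paper does not give one.
    """
    return [
        [1 if abs(f - b) > threshold else 0 for f, b in zip(frow, brow)]
        for frow, brow in zip(frame, background)
    ]


# Toy 3x4 scene: a bright object appears in the middle two columns.
background = [[10, 10, 10, 10],
              [10, 10, 10, 10],
              [10, 10, 10, 10]]
frame = [[10, 200, 210, 10],
         [12, 190, 205, 10],
         [10, 10, 15, 10]]

mask = foreground_mask(frame, background)
for row in mask:
    print(row)
```

A fixed reference frame is the simplest possible choice and is sensitive to lighting changes, which is why the quality of this mask matters for every later stage.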

Core claim

The HBPT system obtains the torso location and size via blob tracking on the foreground silhouette in every frame, places the remaining body parts at fixed relative positions, models each part with a 2D-Gaussian blob, and uses the resulting tracks to recognize activities such as approaching an object, carrying an object, and opening a box or suitcase while remaining accurate under varying illumination and partial occlusions.

What carries the argument

Torso blob tracking on the refined foreground silhouette, which fixes the reference frame for placing and modeling all other body parts as 2D-Gaussian blobs.
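How a torso box could fix the reference frame for the other parts is sketched below. The offsets and scale factors in `PART_LAYOUT` are purely illustrative assumptions, since the paper does not list its values, and `Box` and `place_parts` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Box:
    cx: float  # center x in pixels
    cy: float  # center y in pixels (image y grows downward)
    w: float   # width in pixels
    h: float   # height in pixels


# Hypothetical layout: (dx, dy, scale_w, scale_h), all in torso units.
# These numbers are assumptions for illustration, not the paper's values.
PART_LAYOUT = {
    "head":      (0.0, -0.75, 0.5, 0.5),
    "left_arm":  (-0.75, 0.0, 0.5, 1.0),
    "right_arm": (0.75, 0.0, 0.5, 1.0),
    "left_leg":  (-0.25, 1.0, 0.5, 1.0),
    "right_leg": (0.25, 1.0, 0.5, 1.0),
}


def place_parts(torso):
    """Derive every part box from the torso box using fixed relative offsets."""
    parts = {}
    for name, (dx, dy, sw, sh) in PART_LAYOUT.items():
        parts[name] = Box(
            cx=torso.cx + dx * torso.w,
            cy=torso.cy + dy * torso.h,
            w=sw * torso.w,
            h=sh * torso.h,
        )
    return parts


torso = Box(cx=100.0, cy=120.0, w=40.0, h=60.0)
parts = place_parts(torso)
for name, box in parts.items():
    print(name, box)
```

Because every part is a pure function of the torso box, any error in torso location or size propagates to all parts at once.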

If this is right

  • Body-part tracks produced by the system can be fed directly into activity classifiers for tasks such as carrying objects or opening containers.
  • The same torso reference allows consistent part placement even when illumination varies between frames.
  • Partial occlusions that leave the torso visible still permit recovery of the remaining part locations.
  • The 2D-Gaussian representation supplies both position and pose information usable for real-time recognition.
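One common way to obtain a 2D-Gaussian blob from a silhouette region is to take the mean and covariance of its pixel coordinates. The sketch below assumes that convention and reads "pose" as the orientation of the major axis; the paper does not define how pose is encoded, so both choices are assumptions, and `fit_gaussian_blob` is a hypothetical name.

```python
import math


def fit_gaussian_blob(pixels):
    """Fit a 2D Gaussian to a list of (x, y) pixel coordinates.

    Returns (mean, covariance, angle), where angle is the orientation of
    the major axis in radians, measured from the x axis.
    """
    n = len(pixels)
    mx = sum(x for x, _ in pixels) / n
    my = sum(y for _, y in pixels) / n
    sxx = sum((x - mx) ** 2 for x, _ in pixels) / n
    syy = sum((y - my) ** 2 for _, y in pixels) / n
    sxy = sum((x - mx) * (y - my) for x, y in pixels) / n
    # Orientation of the principal axis of the 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mx, my), ((sxx, sxy), (sxy, syy)), angle


# A thin diagonal strip of pixels, standing in for an arm region.
arm_pixels = [(i, i) for i in range(10)]
mean, cov, angle = fit_gaussian_blob(arm_pixels)
print(mean, cov, math.degrees(angle))
```

The mean gives location, the covariance gives size and elongation, and the angle gives a coarse in-plane pose for the part.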

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If torso detection remains the only robust cue, the method will degrade in crowded scenes where multiple overlapping silhouettes appear.
  • Replacing the fixed relative offsets with learned kinematic constraints could extend the approach to more articulated motions without changing the core silhouette pipeline.
  • The Gaussian blob output could serve as input features for downstream probabilistic trackers that handle full occlusions.
  • Evaluating the system on standard public activity datasets would reveal whether the reported robustness generalizes beyond the sequences shown.

Load-bearing premise

The torso can be reliably located and sized by blob tracking on the foreground silhouette obtained by background subtraction.

What would settle it

A test video in which the blob tracker produces an incorrect torso location or size under partial occlusion or changed lighting, causing the derived positions of head, arms and legs to deviate enough that the intended activity labels are no longer recovered.

Figures

Figures reproduced from arXiv: 1907.05281 by Aras R. Dargazany.

Figure 3.1.1. Silhouette based shape features used in [1]: (a) input image, (b) detected foreground region, (c) its ...
Figure 3.1.2. An example showing how body parts are labeled in [1]: (a) original image, (b) detected silhouette, (c) ...
Figure 3.1.3. Examples of using the silhouette model to locate the body parts in different actions in [1].
Figure 3.1.5. Silhouette analysis in real images: (a) input image, (b) detected foreground region, (c) contour of silhouette ...
Figure 3.1.6. Detection and tracking of body parts using Silhouette: (a) partitioning of convex points ...
Figure 3.2.1. Tracking of right leg and left leg. (a) previous frame with a specified region of right leg, (b) detected righ...
Figure 3.2.3. Tracking of head and right hand as two close body parts: (a) previous frame with a specified region of face, (b ...
Figure 4.1.1. System overview, (a) 2D representation of human body parts model, (b) flowchart of proposed ...
Figure 5.2. The block diagrams of foreground detection module (a) and person blob detection module (b).
Figure 5.3.1. The block diagrams of person blob tracking pipeline.
Figure 5.4.1. The process of refining the foreground mask. (a) input frame, (b) dilated foreground mask, (c) eroding the ...
Figure 5.5.1. Tracking the torso blob using the person blob. (a) the input frame, (b) the detected foreground, (c) the person ...
Figure 5.6.1. Building the blob model of human body parts by the proposed HBPT. (a) input frame, (b) applied initial torso ...
Figure 5.6.2. The complete blob model of human body parts by the proposed HBPT. (a) input frame, (b) learned scene, (c) applied initial torso blob on the refined foreground mask and dividing the legs region by bounding box, (d) superimposing the blob model on input frame.
Figure 6.1.1. Depth estimation using Kinect, (a) input frame, (b) point cloud in a colorized map, (c) point cloud in a ...
Figure 6.1.2. Approaching box recognition, (a) successful detection of approaching box, (b) depth used when the person ...
Figure 6.2.1. Opening box recognition, (a) computing the ...
Figure 6.3.1. Recognizing object carrying using the proposed HBPT, (a) getting close to the object, (b) approaching the object, (c) recognition of object carrying by depth data from Kinect when the person and object are getting closer to camera, (d) the object and person are really close to camera.
Original abstract

As cameras and computers became popular, the applications of computer vision techniques attracted attention enormously. One of the most important applications in the computer vision community is human activity recognition. In order to recognize human activities, we propose a human body parts tracking system that tracks human body parts such as head, torso, arms and legs in order to perform activity recognition tasks in real time. This thesis presents a real-time human body parts tracking system (i.e. HBPT) from video sequences. Our body parts model is mostly represented by body components such as legs, head, torso and arms. The body components are modeled using torso location and size which are obtained by a torso tracking method in each frame. In order to track the torso, we are using a blob tracking module to find the approximate location and size of the torso in each frame. By tracking the torso, we will be able to track other body parts based on their location with respect to the torso on the detected silhouette. In the proposed method for human body part tracking, we are also using a refining module to improve the detected silhouette by refining the foreground mask (i.e. obtained by background subtraction) in order to detect the body parts with respect to torso location and size. Having found the torso size and location, the region of each human body part on the silhouette will be modeled by a 2D-Gaussian blob in each frame in order to show its location, size and pose. The proposed approach described in this thesis tracks accurately the body parts in different illumination conditions and in the presence of partial occlusions. The proposed approach is applied to activity recognition tasks such as approaching an object, carrying an object and opening a box or suitcase.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a real-time human body parts tracking system (HBPT) for activity recognition. Body components (head, torso, arms, legs) are modeled relative to the torso, whose location and size are obtained via blob tracking on the foreground silhouette from background subtraction; a refining module improves the mask, parts are placed at fixed relative positions on the silhouette, and each is represented as a 2D Gaussian. The approach is claimed to track accurately under varying illumination and partial occlusions and is applied to tasks such as approaching an object, carrying an object, and opening a box.

Significance. If the tracking pipeline were shown to be robust, the method could contribute to real-time activity recognition pipelines that rely on explicit body-part localization. The procedural description offers no machine-checked proofs, reproducible code, parameter-free derivations, or falsifiable quantitative predictions, so significance cannot be evaluated from the supplied material.

major comments (3)
  1. [Abstract] The claim that 'the proposed approach described in this thesis tracks accurately the body parts in different illumination conditions and in the presence of partial occlusions' is unsupported; the manuscript supplies no quantitative tracking metrics (e.g., MOTA, precision-recall, pixel error), no datasets, no baseline comparisons, and no ablation of the refining module.
  2. [Method description (torso tracking module)] The torso-location step (blob tracking on the background-subtracted silhouette) is load-bearing for the entire pipeline and for the illumination/occlusion robustness claim, yet the description provides neither the exact blob-tracking algorithm nor any validation that this step remains reliable when background subtraction fails under the very illumination changes the paper asserts it handles.
  3. [Body-part placement step] Placement of remaining parts at 'fixed relative positions' with respect to the detected torso on the silhouette is presented without any mechanism for handling the partial occlusions that directly corrupt the silhouette used for both torso sizing and relative placement.
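The kind of quantitative evidence the first major comment asks for could be as simple as the two measures sketched below: mean pixel error of part centers and bounding-box overlap. The annotations are made up for illustration, and `mean_pixel_error` and `box_iou` are hypothetical helper names; nothing here comes from the manuscript.

```python
import math


def mean_pixel_error(predicted, ground_truth):
    """Average Euclidean distance between matching part centers.

    predicted, ground_truth: dicts mapping part name -> (x, y) in pixels.
    """
    names = sorted(ground_truth)
    total = sum(math.dist(predicted[n], ground_truth[n]) for n in names)
    return total / len(names)


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Made-up annotations for a single frame, for illustration only.
truth = {"head": (100, 40), "torso": (100, 120)}
pred = {"head": (103, 44), "torso": (100, 120)}
print(mean_pixel_error(pred, truth))
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

Reporting such numbers separately for normal, low-light and occluded sequences would speak directly to the robustness claim.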
minor comments (2)
  1. [Gaussian modeling paragraph] Notation for the 2D-Gaussian blobs (means, covariances, how pose is encoded) is never defined.
  2. [Refining module] The refining module is invoked repeatedly but never specified (algorithm, parameters, or pseudocode).
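Since the refining module is not specified, the sketch below shows one plausible reading suggested by the refinement figure caption, which mentions a dilated foreground mask followed by erosion. The 3x3 neighborhood, the single pass, and the names `dilate`, `erode` and `refine` are all assumptions.

```python
def _morph(mask, op):
    """Apply a 3x3 max (dilate) or min (erode) filter to a binary mask.

    Windows are clipped at the image border.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                mask[yy][xx]
                for yy in range(max(0, y - 1), min(h, y + 2))
                for xx in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = op(window)
    return out


def dilate(mask):
    return _morph(mask, max)


def erode(mask):
    return _morph(mask, min)


def refine(mask):
    """Dilation followed by erosion (a morphological closing)."""
    return erode(dilate(mask))


# A 3x3 ring with a one-pixel hole, as left by imperfect subtraction.
mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
refined = refine(mask)
for row in refined:
    print(row)
```

Closing fills small holes and gaps inside the silhouette without growing its outline, which is the usual motivation for this order of operations.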

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment point by point below, indicating planned revisions where appropriate. We acknowledge several limitations in the current manuscript regarding evaluation and methodological detail.

Point-by-point responses
  1. Referee: [Abstract] the claim that 'the proposed approach described in this thesis tracks accurately the body parts in different illumination conditions and in the presence of partial occlusions' is unsupported; the manuscript supplies no quantitative tracking metrics (e.g., MOTA, precision-recall, pixel error), no datasets, no baseline comparisons, and no ablation of the refining module.

    Authors: We agree that the manuscript provides no quantitative metrics, datasets, baselines or ablations, and that the accuracy claim in the abstract is therefore unsupported by such evidence. The claim derives from qualitative visual inspection of tracking results on the demonstrated activity examples. We will revise the abstract to remove the unsupported accuracy claim and instead describe the approach as having been applied to activity recognition tasks involving varying illumination and partial occlusions, based on the presented examples. revision: yes

  2. Referee: [Method description (torso tracking module)] The torso-location step (blob tracking on the background-subtracted silhouette) is load-bearing for the entire pipeline and for the illumination/occlusion robustness claim, yet the description provides neither the exact blob-tracking algorithm nor any validation that this step remains reliable when background subtraction fails under the very illumination changes the paper asserts it handles.

    Authors: The torso tracking relies on standard connected-component blob detection applied to the foreground mask after background subtraction. We acknowledge that the manuscript does not specify the exact algorithm (e.g., parameters, distance metrics or update rules) nor include targeted validation showing reliability when background subtraction degrades under illumination variation. This constitutes a genuine gap in the method description. We will expand the torso-tracking subsection with additional implementation details drawn from the original thesis work where possible. revision: partial

  3. Referee: [Body-part placement step] Placement of remaining parts at 'fixed relative positions' with respect to the detected torso on the silhouette is presented without any mechanism for handling the partial occlusions that directly corrupt the silhouette used for both torso sizing and relative placement.

    Authors: Body-part regions are assigned at fixed offsets relative to the detected torso on the refined silhouette. The refining module improves the foreground mask, yet we agree there is no explicit mechanism (such as occlusion-aware adjustment or fallback estimation) to compensate when occlusions corrupt the silhouette used for sizing and placement. The robustness claim rests on observed behavior in the example sequences rather than a dedicated algorithmic safeguard. We will add a limitations paragraph clarifying this point and noting that severe occlusions may affect placement accuracy. revision: partial
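The second response refers to standard connected-component blob detection without giving details. A minimal sketch of that step is shown here, assuming 4-connectivity and assuming the person corresponds to the largest component; both are illustrative choices, `largest_blob` is a hypothetical name, and frame-to-frame association is not covered.

```python
from collections import deque


def largest_blob(mask):
    """Return (area, (x_min, y_min, x_max, y_max)) of the largest
    4-connected foreground component, or None if the mask is empty."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = None
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            seen[sy][sx] = True
            queue = deque([(sx, sy)])
            area, x0, y0, x1, y1 = 0, sx, sy, sx, sy
            while queue:
                x, y = queue.popleft()
                area += 1
                x0, y0 = min(x0, x), min(y0, y)
                x1, y1 = max(x1, x), max(y1, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < w and 0 <= ny < h
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if best is None or area > best[0]:
                best = (area, (x0, y0, x1, y1))
    return best


# One isolated noise pixel and one 2x3 blob.
mask = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
]
blob = largest_blob(mask)
print(blob)
```

The bounding box of the selected component gives an approximate location and size, which is the quantity the pipeline needs from this step.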

standing simulated objections not resolved
  • Provision of quantitative tracking metrics, datasets, baseline comparisons or ablation studies, as none were performed in the original work.

Circularity Check

0 steps flagged

No circularity: purely procedural description with no equations or derivations

Full rationale

The manuscript describes a body-parts tracking pipeline (torso blob tracking on background-subtracted silhouette, relative part placement, 2D-Gaussian modeling, and a refining module) but contains no equations, no fitted parameters, no derivations, and no self-citations. The accuracy claim under illumination changes and occlusions is asserted as an empirical outcome of the described steps rather than shown to reduce to those steps by construction. Because there is no derivation chain at all, none of the enumerated circularity patterns apply.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard computer vision domain assumptions about silhouette quality and torso-centric body modeling; no free parameters, new entities, or ad-hoc axioms are stated in the abstract.

axioms (2)
  • domain assumption: Torso location and size obtained by blob tracking can serve as a reliable anchor for locating other body parts on the silhouette.
    Central modeling choice stated in the abstract.
  • domain assumption: Background subtraction yields a usable foreground mask under the target conditions.
    Required for the initial silhouette.

pith-pipeline@v0.9.0 · 5835 in / 1176 out tokens · 24015 ms · 2026-05-25T10:59:10.574144+00:00 · methodology


Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages

  1. [1]

    W4: Real-Time Surveillance of People and Their Activities

    I. Haritaoglu, D. Harwood, and L.S. Davis, “W4: Real-Time Surveillance of People and Their Activities”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 809-830, Aug. 2000

  2. [2]

    Human Activity Recognition Using Multidimensional Indexing

    J. Ben-Arie, Z. Wang, P. Pandit and S. Rajaram, “Human Activity Recognition Using Multidimensional Indexing”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 8, pp. 1091-1104, Aug. 2002

  3. [3]

    Distinctive Image Features from Scale-Invariant Keypoints

    D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints”, International Journal of Computer Vision 60(2), 91–110, 2004

  4. [4]

    SURF: Speeded-Up Robust Features

    Bay, H., Tuytelaars, T., & Van Gool, L., “SURF: Speeded-Up Robust Features”, 9th European Conference on Computer Vision, Vol. 110, pp. 346-359, 2008

  5. [5]

    Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration

    M. Muja and D. G. Lowe, “Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration”, International Conference on Computer Vision Theory and Applications (VISAPP'09), 2009

  6. [6]

    Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography

    M. A. Fischler and R. C. Bolles, “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography”, Comm. of the ACM 24: 381–395, June 1981

  7. [7]

    An Extended Set of Haar-like Features for Rapid Object Detection

    R. Lienhart, Jochen Maydt, “An Extended Set of Haar-like Features for Rapid Object Detection”, ICIP, 2002

  8. [8]

    Kernel-based object tracking

    D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking”, PAMI, 2003

  9. [9]

    A Bayesian Formulation for 3D Articulated Upper Body Segmentation and Tracking from Dense Disparity Maps

    R. D. Cavin, A. V. Nefian and N. Goel, “A Bayesian Formulation for 3D Articulated Upper Body Segmentation and Tracking from Dense Disparity Maps”, ICIP, 2003

  10. [10]

    Multi-bandwidth Kernel-Based Object Tracking

    A. Dargazany, A. Soleimani, A. Ahmadyfard, “Multi-bandwidth Kernel-Based Object Tracking”, Hindawi Publishing Corporation, Advances in Artificial Intelligence, Article ID 175603, 15 pages, 2010

  11. [11]

    Estimating anthropometry and pose from a single image

    C. Barron and I. Kakadiaris. Estimating anthropometry and pose from a single image. In Computer Vision and Pattern Recognition, pages 669–676, 2000

  12. [12]

    Tracking people with twists and exponential maps

    C. Bregler and J. Malik. Tracking people with twists and exponential maps. In Computer Vision and Pattern Recognition, pages 8–15, 1998

  13. [13]

    Stereo based gesture recognition invariant to 3D pose and lighting

    R. Grzeszczuk, G. Bradski, M.H. Chu, and J.Y. Bouguet. Stereo based gesture recognition invariant to 3D pose and lighting. In International Conference on Computer Vision and Pattern Recognition, pages 826–833, 2000

  14. [14]

    Tracking self-occluding articulated objects in dense disparity maps

    N. Jojic, B. Brumitt, B. Meyers, S. Harris, and T. Huang. Tracking self-occluding articulated objects in dense disparity maps. In International Conference on Computer Vision, pages 123–130, 1999

  15. [15]

    Detection and estimation of pointing gestures in dense disparity maps

    N. Jojic, B. Brumitt, B. Meyers, S. Harris, and T. Huang. Detection and estimation of pointing gestures in dense disparity maps. In International Conference on Face and Gesture Recognition, pages 468–475, 2000

  16. [16]

    Model-based estimation of 3D human motion

    I. Kakadiaris and D. Metaxas. Model-based estimation of 3D human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1453–1459, 2000

  17. [17]

    A statistical upper body model for 3D static and dynamic gesture recognition from stereo sequences

    A. V. Nefian, R. Grzeszczuk, and V. Eruhimov. “A statistical upper body model for 3D static and dynamic gesture recognition from stereo sequences”, In IEEE International Conference on Image Processing, pages 601–607, 2001

  18. [18]

    A framework for modeling the appearance of 3D articulated figures

    H. Sidenbladh, F. De La Torre, and M. J. Black. A framework for modeling the appearance of 3D articulated figures. In Automatic Face and Gesture Recognition, pages 368–375, 2000

  19. [19]

    Pfinder: Real-time tracking of the human body

    C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:780–785, July 1997

  20. [20]

    Kernel-Based Hand Tracking

    Aras Dargazany, Ali Soleimani, “Kernel-Based Hand Tracking”, INSInet Publication, Australian Journal of Basic and Applied Sciences, 2009

  21. [21]

    Recursive Estimation of Motion, Structure, and Focal Length

    A. Azarbayejani and A. Pentland, “Recursive Estimation of Motion, Structure, and Focal Length,” Trans. Pattern Analysis and Machine Intelligence, vol. 17, no. 6, pp. 562–575, June 1995

  22. [22]

    An Efficient Method for Contour Tracking Using Active Shape Models

    A. Baumberg and D. Hogg, “An Efficient Method for Contour Tracking Using Active Shape Models,” Proc. Workshop Motion of Nonrigid and Articulated Objects. Los Alamitos, Calif.: IEEE CS Press, 1994

  23. [23]

    Segmenting Simply Connected Moving Objects in a Static Scene

    M. Bichsel, “Segmenting Simply Connected Moving Objects in a Static Scene,” Trans. Pattern Analysis and Machine Intelligence, vol. 16, no. 11, pp. 1138–1142, Nov. 1994

  24. [24]

    Pfinder: Real-Time Tracking of the Human Body

    C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, “Pfinder: Real-Time Tracking of the Human Body”, Trans. Pattern Analysis and Machine Intelligence, vol. 19, 1997

  25. [25]

    Adaptive background mixture models for real-time tracking

    C. Stauffer, W. Grimson, “Adaptive background mixture models for real-time tracking”, CVPR, 1998

  26. [26]

    Foreground Object Detection from Videos Containing Complex Background

    L. Li, W. Huang, I. Y.H. Gu, Q. Tian, “Foreground Object Detection from Videos Containing Complex Background”, ACM, 2003

  27. [27]

    Tracking and Matching Connected Components from 3D Video

    D. da Silva Pires, R. Cesar-Jr., “Tracking and Matching Connected Components from 3D Video”, CVPR, 2005

  28. [28]

    Real Time Hand Tracking by Combining Particle Filtering and Mean Shift

    C. Shan, Y. Wei, T. Tan, F. Ojardias, “Real Time Hand Tracking by Combining Particle Filtering and Mean Shift”, Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004