arxiv: 1907.00488 · v1 · pith:V3UDSSP2new · submitted 2019-06-30 · 💻 cs.CL · cs.CY · cs.DL · cs.IR

Topic Modeling the Reading and Writing Behavior of Information Foragers

Pith reviewed 2026-05-25 12:17 UTC · model grok-4.3

classification 💻 cs.CL · cs.CY · cs.DL · cs.IR
keywords information foraging, topic modeling, LDA, Charles Darwin, reading behavior, knowledge construction, innovation, exploration exploitation

The pith

LDA topic models of Darwin's reading records and drafts show how individuals build knowledge by balancing exploration against exploitation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper uses topic modeling on historical texts to examine information foraging, where agents survey existing knowledge while searching for new ideas. It focuses on Charles Darwin's detailed 23-year record of books read alongside his drafts and revisions of The Origin of Species. The same approach is applied to Thomas Jefferson's correspondence and to collective neuroscience citation patterns. These cases illustrate how reading and writing together construct personal knowledge bases and contribute to innovation.

Core claim

LDA topic modeling applied to the texts each author read and wrote represents the information environment and the cognitive search process. Case studies of Darwin characterize his reading behavior, show its interaction with drafts and revisions of The Origin of Species, and extend the dataset to later work. Analysis of Jefferson's letters broadens the data type beyond books, while neuroscience citations move from individual to collective behavior. Together these studies reveal the interplay between individual and collective phenomena where innovation takes place.

What carries the argument

LDA topic modeling on the texts each author read and wrote, which constructs a representation of the information environment and the balance of exploration against exploitation.
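
One way to make the exploration-exploitation balance concrete is the surprise measure named in the Figure 4.4 caption reproduced on this page: a divergence, in bits, between the topic distribution of one text and that of the text read just before it (text-to-text) or of everything read so far (text-to-past). The sketch below is a minimal illustration under stated assumptions, not the paper's implementation. The function names, the smoothing constant `eps`, the direction of the Kullback-Leibler divergence, the use of a plain mean for the past, and the toy three-topic mixtures are all hypothetical.

```python
import math

def kl_bits(p, q, eps=1e-12):
    """KL divergence D(p || q) in bits; eps (an assumption) guards against log(0)."""
    return sum(pi * math.log2((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def text_to_text(docs):
    """Surprise of each text relative to the text read immediately before it."""
    return [kl_bits(docs[i], docs[i - 1]) for i in range(1, len(docs))]

def text_to_past(docs):
    """Surprise of each text relative to the mean topic mixture of all earlier texts."""
    n_topics = len(docs[0])
    out = []
    for i in range(1, len(docs)):
        past = [sum(d[k] for d in docs[:i]) / i for k in range(n_topics)]
        out.append(kl_bits(docs[i], past))
    return out

# Hypothetical topic mixtures for four texts in reading order (each sums to 1).
reading_path = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],  # close to the previous text: low surprise (exploitation)
    [0.1, 0.1, 0.8],  # a jump to a different topic: high surprise (exploration)
    [0.1, 0.2, 0.7],
]

t2t = text_to_text(reading_path)
t2p = text_to_past(reading_path)
print([round(x, 3) for x in t2t])
print([round(x, 3) for x in t2p])
```

In this toy path the third text produces by far the largest value in both series, which is the pattern the caption labels exploration.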

If this is right

  • Reading behavior can be tracked quantitatively across years for a single historical figure using the same modeling approach.
  • Topic overlap between readings and specific drafts can identify how new ideas enter and shape written work.
  • The method extends from book records to correspondence and from single authors to group citation networks.
  • Individual foraging patterns contribute measurably to collective innovation processes.
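
The topic-overlap point in the list above can be illustrated with a symmetric measure. The Figure 5.6 caption reproduced on this page names the Jensen-Shannon distance for comparing texts; the sketch below computes it for two topic mixtures. It is a minimal illustration, assuming base-2 logarithms and taking the distance to be the square root of the Jensen-Shannon divergence. The mixtures and variable names are hypothetical and are not values from the paper.

```python
import math

def kl_bits(p, q):
    """KL divergence D(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence; values lie in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)
    return math.sqrt(max(jsd, 0.0))  # clamp tiny negative rounding error

# Hypothetical topic mixtures over four topics.
readings_to_date = [0.5, 0.3, 0.2, 0.0]
draft_a = [0.4, 0.4, 0.2, 0.0]  # shares most topic mass with the readings
draft_b = [0.1, 0.2, 0.2, 0.5]  # puts half its mass on a topic absent from the readings

d_a = js_distance(readings_to_date, draft_a)
d_b = js_distance(readings_to_date, draft_b)
print(round(d_a, 3), round(d_b, 3))
```

Unlike the Kullback-Leibler divergence, this distance is symmetric and stays finite when one mixture assigns zero mass to a topic, which makes it a convenient choice for pairwise heatmaps.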

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar modeling could be applied to modern researchers' reading lists and citation histories to study current knowledge construction.
  • Digital library systems might incorporate topic-based recommendations that support balanced exploration and exploitation.
  • The framework connects individual search strategies to broader economic models of decision-making under incomplete information.

Load-bearing premise

LDA topic modeling on the texts each author read and wrote accurately represents the information environment and the cognitive search process of balancing exploration against exploitation.

What would settle it

If the topics extracted from Darwin's readings and drafts fail to align with the documented sequence of his ideas leading to The Origin of Species, the modeling would not capture the intended search process.

Figures

Figures reproduced from arXiv: 1907.00488 by Jaimie Murdock.

Figure 4.7: The AIC of the independent selection for a 2-epoch model is also shown. Note
Figure 2.1: The term “state” is a highly probable term for several topics inferred from the Stanford Encyclopedia of Philosophy. It appears in topic 14, indicating political states; topic 55, indicating psychological states; and topic 60, indicating physical states. Notice that both state and states appear in each topic’s top 10 terms. This is because we do not employ stemming in our corpus preparation steps. Terms…
Figure 2: summarizes the various geometries defined by the VSM, LSA, pLSI, and LDA semantic
Figure 2.2: Geometries of Various Semantic Models. On the top row, the vector space and latent semantic analysis models are shown using the representation of
Figure 2.3: Priors can be symmetric or asymmetric across document-topic or word-topic distributions. Wallach, Mimno, and McCallum (2009) examined the effect of prior selection on model robustness to changes in the number of topics and to highly-skewed word distributions. They found that changes to β, the word-topic Dirichlet prior, had no significant effect. However, changes to α, the document-topic Dirichlet prior …
Figure 2.3: Examples of Dirichlet priors. In this example we compare two dice factories. On the top is a modern factory that uses plastics and laser cutting to ensure consistency across dice. Each row on the right represents the probability distribution of a single die made in this factory, showing there are still some variations among the dice with a prior of α = 600. This is in contrast to the bottom, which repre…
Figure 2.4: Three methods for working with multiple corpora. a) Corpus merging takes two corpora and trains a single model over their joint vocabulary. b) Model comparison takes two corpora and trains two separate models, with independent vocabularies. c) Query sampling takes a model trained on an initial corpus and projects a new document into that model space, using only the vocabulary from the initial corpus. 2.6…
Figure 3.1: An example of KL divergence. The table above shows the word distributions for two parents in a game of 20 Questions, limited to a very simple vocabulary of 3 words. Each parent’s distribution has an entropy of 1.5 bits, meaning on average, it will take a child with an optimal script 1.5 questions to get to the noun. Using the optimal encoding for parent 1 with parent 2’s distribution will result in an av…
Figure 4.1: Page from Darwin’s reading notebooks. Page 3a of Darwin’s first notebook (DAR 119), during which he began to track the exact reading dates. Note the reading of Malthus’s On Population on October 3, 1838. Photo courtesy of Cambridge University Libraries.
Figure 4: shows the density of Darwin’s readings modeled. Notice the large jump in 1840 corresponds
Figure 4.2: Reading density of Darwin’s reading notebooks, smoothed over a 6-month window. The dashed line shows the 665 titles here modeled, while the thin solid line represents all 915 titles in the reading notebooks.
Figure 4.3: Publication and reading dates. Scatter plot of the publication and reading dates of the titles in Darwin’s reading list. The 665 modeled titles are shown with dots, while the remaining 250 titles are shown as xs. The solid line indicates when the reading date and publication date are equal. The dashed line indicates a linear regression over the dots (r² = 0.1992), and the dotted line indicates a linear …
Figure 4.4: Epochs of exploration and exploitation in Darwin’s reading choices. Text-to-text (top) and text-to-past (bottom) cumulative surprise over the reading path, in bits. More negative (downward) slope indicates lower surprise (exploitation); more positive (upward) slope indicates greater surprise (exploration). The three epochs, identified by an unsupervised Bayesian model, are marked as alternating shaded re…
Figure 4.5
Figure 4.6: Darwin’s reading order more exploratory than the culture’s production. Text-to-text (top) and text-to-past (bottom) cumulative surprise over the reading order (solid) and the publication order (dashed). More negative (downward) slope indicates lower surprise (exploitation); more positive (upward) slope indicates greater surprise (exploration). In both cases, Darwin’s cumulative surprise is higher than th…
Figure 4.7: Two-epoch model likelihoods. Fisher maximum-likelihood estimation for a 2-epoch BEE model over the text-to-text and text-to-past k = 80 models of 647 of Darwin’s readings. The darker line indicates the window of the 5-year minimum epoch length. Note the phase transition at the 325th volume in the text-to-past case (bottom) and the 548th volume in the text-to-text case (top). Note also that the text-to-pa…
Figure 5.1: Mean probability and entropy of topics. This scatter plot shows the mean probability of topics across all documents (X-axis) and the Shannon entropy across documents (Y-axis) of all topics in the 200-topic model. Topics in the upper right appear frequently and in many texts. The majority of the topics appear with very low probability and are unevenly distributed across the texts. top three books in which…
Figure 5.2: Perplexity landscape of 1,100 samples of The Origin of Species. The colored Xs correspond to points falling in the eight clusters represented in
Figure 5.3: Cluster analysis of The Origin of Species. This “violin” plot shows the distribution of perplexity (fit to the document) by topic cluster for The Origin of Species. The number below each cluster shows the number of samples classified in that group, and the surface area of the bulging part of each violin is proportional to this number. The horizontal line in the center of each violin shows the median per…
Figure 5: shows that with respect to what Darwin had read at any given time (position along the
Figure 5: shows the divergence of Darwin’s writings with respect to the order his readings were published,
Figure 5.4: KL divergence between Darwin’s readings and writings by reading order. Vertical dashed lines indicate date of publication. The Origin diverges more from the readings-to-date than either of the two previous drafts at all time points. However, each successive draft diverges less from the readings-to-date at the time of writing. The curves have been smoothed by averaging over 5 years of readings. The grey rectan…
Figure 5.5: KL divergence between Darwin’s readings and writings by publication order. Vertical dashed lines indicate date of publication. The Origin diverges more from the publications-to-date than either of the two previous drafts at all time points.
Figure 5.6: Similarity between Darwin’s writings and Wallace’s essay. This heatmap shows the Jensen-Shannon Distance between Darwin’s various drafts, The Origin of Species, and Wallace’s contemporaneous essay. The Essay of ’44 and Sketch of ’42 are the closest two texts. 5.4 Wallace and Darwin Regardless of the primary cause of Darwin’s delay, his sudden rush to publication is often attributed to the co-discovery of…
Figure 5.7: Divergences between Darwin’s writings and Wallace’s essay. This heatmap shows the asymmetric Kullback-Leibler divergence between Darwin’s essay drafts, The Origin of Species, and Wallace’s contemporaneous essay. how far Wallace had come. To investigate further, we examine the asymmetries in the KL divergences among these texts (Figure 5.7). We compare two scenarios: reading Wallace’s text after encount…
Figure 5: figure 5.7) and reading Darwin’s text after encountering Wallace’s manuscript (last column, figure 5.7).
Figure 6.1: The number of pre-1871 books in Darwin’s Library by publication date. The shaded region indicates the time period when Darwin maintained his reading notebooks. That most of the books in the Library come from this time period indicates the significance of it to Darwin’s scholarship. the reading notebooks collection. I located 265 new volumes for inclusion in the dataset, for a total of 912 volumes (see […
Figure 6.2: Validation of Darwin’s reading notebook experiments with new topic models. The blue lines indicate the model from
Figure 6.3: Verification of null model using additional volumes. The new null (bottom line) includes an additional 130 volumes. It indicates that the additional volumes further removed correlations between random volumes, i.e., the additional volumes demonstrated a stronger correlation among Darwin’s readings.
Figure 6.4: Darwin’s reading rate. The number of books Darwin read in a given 1-year period. The average rate was 33 books per year before he started writing The Origin of Species, but slowed down dramatically after that point to 15 books per year.
Figure 7.1: Reproduction of Jefferson’s Library at the Library of Congress. On August 24, 1814, British troops burned Washington D.C., including the United States Capitol, which housed the Library of Congress. Thomas Jefferson’s personal library was sold to Congress for $23,950 in 1815 to rebuild the collection. That library is shown here, reconstructed in its original catalog order and using items from the original…
Figure 7.2: Temporal distribution of Jefferson’s Retirement Library and Letters. The left side shows the distribution of publication years in Jefferson’s Retirement Library. Note that most books were published during his retirement, i.e., after 1809. The right side shows the distribution of authorship years in our sample of 1,731 letters. Note that there are fewer letters in this sample from retirement. consists of …
Figure 7.3: Date of least-divergent 5-year publication window to 90-day correspondence window. We find that by and large, Jefferson was closest to the volumes published in 1802. An expanded corpus of books going past Jefferson’s death may be more interesting for our question of whether he was “ahead of his time.” to indicate “the time” Jefferson was “of” at any given moment. These results are shown in
Figure 8.1: Cumulative citation-to-abstract divergence for 106,726 authors. Over time, the population’s divergence between the abstracts an author wrote against the abstracts of articles cited decreases. This suggests a pattern of overall exploitation among published neuroscientists. The lines above and below show the 95% confidence intervals. date” for each cited paper as the publication year of the citing article.…
Figure 8: shows the average with the 95% confidence interval, finding that over the course of an author’s
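
The Figure 3.1 caption explains KL divergence through a game of 20 Questions over a three-word vocabulary, with each parent's distribution having an entropy of 1.5 bits. The table itself is not reproduced on this page, so the sketch below uses hypothetical distributions chosen only because they have that entropy. The cross-entropy and divergence it prints are properties of these made-up numbers, not figures from the paper.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits: average questions under the optimal script for p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Average questions when words follow p but the script is optimal for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical word distributions over a 3-word vocabulary.
parent_1 = [0.5, 0.25, 0.25]
parent_2 = [0.25, 0.25, 0.5]

h1 = entropy_bits(parent_1)
h2 = entropy_bits(parent_2)
# Words drawn from parent 2, questions asked with parent 1's optimal script.
ce = cross_entropy_bits(parent_2, parent_1)
kl = ce - h2  # extra questions paid for using the wrong script

print(h1, h2, ce, kl)  # 1.5 1.5 1.75 0.25
```

With these numbers, using parent 1's script on parent 2's words costs 1.75 questions on average instead of 1.5; the 0.25-bit gap is the KL divergence from parent 2 to parent 1.
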
Original abstract

The general problem of "information foraging" in an environment about which agents have incomplete information has been explored in many fields, including cognitive psychology, neuroscience, economics, finance, ecology, and computer science. In all of these areas, the searcher aims to enhance future performance by surveying enough of existing knowledge to orient themselves in the information space. Individuals can be viewed as conducting a cognitive search in which they must balance exploration of ideas that are novel to them against exploitation of knowledge in domains in which they are already expert. In this dissertation, I present several case studies that demonstrate how reading and writing behaviors interact to construct personal knowledge bases. These studies use LDA topic modeling to represent the information environment of the texts each author read and wrote. Three studies revolve around Charles Darwin. Darwin left detailed records of every book he read for 23 years, from disembarking from the H.M.S. Beagle to just after publication of The Origin of Species. Additionally, he left copies of his drafts before publication. I characterize his reading behavior, then show how that reading behavior interacted with the drafts and subsequent revisions of The Origin of Species, and expand the dataset to include later readings and writings. Then, through a study of Thomas Jefferson's correspondence, I expand the study to non-book data. Finally, through an examination of neuroscience citation data, I move from individual behavior to collective behavior in constructing an information environment. Together, these studies reveal "the interplay between individual and collective phenomena where innovation takes place" (Tria et al. 2014).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents case studies applying LDA topic modeling to Charles Darwin's reading records and drafts (including pre- and post-Origin of Species), Thomas Jefferson's correspondence, and neuroscience citation networks. These are used to characterize individual reading/writing behaviors as cognitive search balancing exploration and exploitation, then to illustrate interactions that construct personal knowledge bases and the interplay between individual and collective phenomena in innovation.

Significance. If the LDA-derived topics can be shown to validly represent information environments and search dynamics, the work would offer a computational lens on knowledge construction across historical and modern datasets; the multi-scale design (individual to collective) is a potential strength, but the absence of validation, quantitative metrics, or error analysis in the described studies limits immediate impact.

major comments (3)
  1. [Abstract] Abstract and method description: the central claim that LDA topic models of read/written texts accurately represent both the external information environment and the internal cognitive search process (exploration vs. exploitation) is load-bearing for all three case studies, yet no validation against independent measures of novelty, expertise, or search dynamics is described; LDA yields unsupervised co-occurrence clusters whose mapping to cognitive constructs remains untested.
  2. [Abstract] Abstract: the headline result that the studies reveal 'the interplay between individual and collective phenomena where innovation takes place' rests on descriptive case studies, but the abstract supplies no equations, fitted parameters, quantitative results, error analysis, or statistical tests, preventing verification that the observed patterns support the claimed interplay.
  3. [Case Studies] Darwin case study (and extension to Jefferson/neuroscience): domain-specific issues such as 19th-century language shift, non-book formats, and citation-text peculiarities are not addressed, yet these directly affect whether the topic clusters can be interpreted as faithful representations of the information environment.
minor comments (2)
  1. [Abstract] The abstract references expanding the Darwin dataset to later readings/writings but provides no details on how the LDA models were trained, preprocessed, or evaluated across the different corpora.
  2. [Abstract] Citation to Tria et al. 2014 is used to frame the collective claim, but the manuscript does not specify how the present LDA results quantitatively connect to or extend that prior work.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate planned revisions where appropriate to clarify the methodological approach and the limitations of the case studies.

Point-by-point responses
  1. Referee: [Abstract] Abstract and method description: the central claim that LDA topic models of read/written texts accurately represent both the external information environment and the internal cognitive search process (exploration vs. exploitation) is load-bearing for all three case studies, yet no validation against independent measures of novelty, expertise, or search dynamics is described; LDA yields unsupervised co-occurrence clusters whose mapping to cognitive constructs remains untested.

    Authors: The manuscript uses LDA to derive topic distributions as a proxy for the information environment based on textual co-occurrences, with interpretations of exploration/exploitation grounded in alignment with each individual's documented historical activities and outputs. We acknowledge that no independent quantitative validation (e.g., against expert ratings or behavioral metrics) is presented. We will revise the methods and discussion sections to explicitly note the unsupervised nature of the clusters and the interpretive basis for cognitive mapping, while adding a limitations subsection on this point. revision: yes

  2. Referee: [Abstract] Abstract: the headline result that the studies reveal 'the interplay between individual and collective phenomena where innovation takes place' rests on descriptive case studies, but the abstract supplies no equations, fitted parameters, quantitative results, error analysis, or statistical tests, preventing verification that the observed patterns support the claimed interplay.

    Authors: The work consists of descriptive case studies illustrating patterns across scales rather than a parametric or statistical modeling paper; the abstract therefore provides a high-level summary. The full manuscript details the LDA-derived topics and their correspondence to known shifts in reading/writing behavior. We will revise the abstract to reference the multi-scale design and the specific qualitative patterns (e.g., topic transitions in Darwin's drafts) that support the claimed interplay. revision: partial

  3. Referee: [Case Studies] Darwin case study (and extension to Jefferson/neuroscience): domain-specific issues such as 19th-century language shift, non-book formats, and citation-text peculiarities are not addressed, yet these directly affect whether the topic clusters can be interpreted as faithful representations of the information environment.

    Authors: These domain factors can influence topic coherence and are not explicitly discussed in the current text. Standard LDA preprocessing was applied across datasets. We will add targeted discussion in the methods and each case-study section addressing language evolution, non-book formats, and citation characteristics, along with any mitigation steps used. revision: yes
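
One inexpensive check that would speak to the referee's request for statistical tests is a shuffled-order null: compare the mean text-to-text divergence along the recorded reading order with the same statistic over random permutations of the same texts. The sketch below is illustrative only. It is not taken from the paper or the rebuttal, and the function names, the toy topic mixtures, the permutation count, and the fixed seed are all assumptions.

```python
import math
import random

def kl_bits(p, q):
    """KL divergence D(p || q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_step_surprise(docs):
    """Mean divergence of each text from the one read immediately before it."""
    steps = [kl_bits(docs[i], docs[i - 1]) for i in range(1, len(docs))]
    return sum(steps) / len(steps)

def permutation_p_value(docs, n_perm=2000, seed=0):
    """One-sided p-value: share of shuffled orders at least as unsurprising as observed."""
    rng = random.Random(seed)
    observed = mean_step_surprise(docs)
    hits = 0
    for _ in range(n_perm):
        shuffled = docs[:]
        rng.shuffle(shuffled)
        if mean_step_surprise(shuffled) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical reading order: six texts near one topic, then six near another.
cluster_a = [[0.7 - 0.01 * i, 0.2 + 0.01 * i, 0.1] for i in range(6)]
cluster_b = [[0.1, 0.2 + 0.01 * i, 0.7 - 0.01 * i] for i in range(6)]
reading_order = cluster_a + cluster_b

observed, p_value = permutation_p_value(reading_order)
print(round(observed, 3), round(p_value, 4))
```

A small p-value here only says that the recorded order is less surprising, step to step, than a random ordering of the same texts; it does not by itself validate the mapping from topic divergence to cognitive search.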

Circularity Check

0 steps flagged

No circularity: descriptive LDA case studies with external citation

Full rationale

The paper applies standard LDA topic modeling to historical reading/writing records (Darwin's books and drafts, Jefferson correspondence, neuroscience citations) to produce descriptive characterizations of information environments. No equations, fitted parameters, or predictions are defined in terms of the outputs themselves. The central claim quotes an external 2014 paper (Tria et al.) rather than relying on self-citation chains or uniqueness theorems. All steps remain interpretive applications of an off-the-shelf unsupervised method to external data sources; nothing reduces by construction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. The central modeling step implicitly assumes LDA topics capture semantically meaningful information environments.

axioms (1)
  • Domain assumption: LDA topic modeling can represent the information environment of the texts each author read and wrote.
    Invoked to characterize reading behavior and its interaction with writing.

pith-pipeline@v0.9.0 · 5810 in / 1115 out tokens · 19197 ms · 2026-05-25T12:17:10.872033+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What the tags mean:
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 8 internal anchors

  1. [1]

    Formation of Scientific Fields as a Universal Topological Transition

    DOI: 10.1002/hbm.24038. Allen, Colin, Hongliang Luo, Jaimie Murdock, Jianghuai Pu, Xiaohong Wang, Yanjie Zhai, and Kun Zhao (2017). “Topic Modeling the H `an di ˘an Ancient Classics”. Journal of Cultural Analytics . DOI: 10 . 22148/16.016. Andrieu, Christophe, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan (2003). “An Introduction to Markov chain ...

  2. [2]

    BTM: Topic Modeling over Short Texts

    DOI: 10.1016/0010-0285(73)90004-2. Cheng, Xueqi, Xiaohui Yan, Yanyan Lan, and Jiafeng Guo (2014). “BTM: Topic Modeling over Short Texts”. IEEE Transactions on Knowledge and Data Engineering 26.12, pp. 2928–2941. DOI: 10 . 1109/TKDE.2014.2313872. Chi, Ed H., Peter Pirolli, and James Pitkow (2001). “Using Information Scent to Model User Information Needs an...

  3. [3]

    The Darwin Reading Notebooks (1838–1860)

    DOI: 10.1002/smj.738. van Hulle, Dirk (2014). Modern Manuscripts: The Extended Mind and Creative Undoing from Darwin to Beckett and Beyond. Historicizing Modernism. Bloomsbury Academic. Van Wyhe, John (2013). Dispelling the Darkness: Voyage in the Malay Archipelago and the Discovery of Evolution by Wallace and Darwin. World Scientific Publishing Company In...

  4. [4]

    The Wisdom of the Crowd in Combinatorial Problems

    DOI: 10.1038/nmeth.1635. Yi, Sheng Kung Michael, Mark Steyvers, Michael D. Lee, and Matthew J. Dry (2012). “The Wisdom of the Crowd in Combinatorial Problems”. Cognitive Science 36.3, pp. 452–470. DOI: 10.1111/j.1551- 6709.2011.01223.x. Youn, Hyejin, Deborah Strumsky, Luis M.A. A Bettencourt, and Jos´e Lobo (2015). “Invention as a combi- natorial process:...

  5. [5]

    Topic Modeling the H\`an di\u{a}n Ancient Classics

    Colin Allen, Hongliang Luo, Jaimie Murdock, Jianghuai Pu, Xiaohong Wang, Yanjie Zhai, and Zhao Kun. “Topic Modeling the Hàn diăn Ancient Classics (ࠡܞׅݱ.”) Journal of Cultural Analytics (2017). doi: 10.22148/16.016. arXiv: 1702.00860

  6. [6]

    Multi-level computational methods for interdisciplinary research in the HathiTrust Digital Library

    Jaimie Murdock, Colin Allen, Katy Börner, Robert Light, Simon McAlister, Robert Rose, Doori Rose, Jun Otsuka, David Bourget, John Lawrence, Andrew Ravenscroft, and Chris Reed. “Multi-level Computational Methods for Interdisciplinary Research in the HathiTrust Digital Library”. PLOS ONE 12.9 (2017), e0184188. doi: 10.1371/journal.pone.0184188. arXiv: 1702.01090

  7. [7]

    Exploration and Exploitation of Victorian Science in Darwin's Reading Notebooks

    Jaimie Murdock, Colin Allen, and Simon DeDeo. “Exploration and Exploitation of Victorian Science in Darwin’s Reading Notebooks”. Cognition 159 (2017), pp. 117–126. doi: 10.1016/ j.cognition.2016.11.012. arXiv: 1509.07175

  8. [8]

    The Wisdom of the Few? "Supertaggers" in Collaborative Tagging Systems

    Jared Lorince, Sam Zorowitz, Jaimie Murdock, and Peter Todd. “Wisdom of the Few? “Su- pertaggers” in Collaborative Tagging Systems”. Journal of Web Science 1.1 (2015). doi: 10.1561/106.00000002. arXiv: 1502.02777. Book Chapters

  9. [9]

    Evaluating Dynamic Ontologies

    Jaimie Murdock, Cameron Buckner, and Colin Allen. “Evaluating Dynamic Ontologies”. In: Knowledge Discovery, Knowledge Engineering, and Knowledge Management: Second Interna- tional Joint Conference, IC3K 2010, Valencia, Spain, October 25-28, 2010, Revised Selected Papers. Ed. by Ana Fred, Jan L. G. Dietz, Kecheng Liu, and Joaquim Filipe. Vol. 272. Com- mun...

  10. [10]

    Computational Discovery

    Jaimie Murdock. “Computational Discovery”. In: The Dynamics of Science: Computational Frontiers in History and Philosophy of Science . Ed. by Grant Ramsay and Andreas De Block. University of Pittsburgh Press, accepted

  11. [11]

    Darwin’s Semantic Voyage

    Jaimie Murdock, Colin Allen, and Simon DeDeo. “Darwin’s Semantic Voyage”. In: The Dy- namics of Science: Computational Frontiers in History and Philosophy of Science . Ed. by Grant Ramsay and Andreas De Block. University of Pittsburgh Press, accepted. Conference Papers

  12. [12]

    The Development of Darwin's Origin of Species

    Jaimie Murdock, Colin Allen, and Simon DeDeo. “Quantitative and Qualitative Approaches to the Development of Darwin’s Origin of Species”. Current Research in Digital History 1 (2018). doi: 10.31835/crdh.2018.14. arXiv: 1802.09944

  13. [13]

    Towards Publishing Secure Capsule-Based Analysis

    Jaimie Murdock, Jacob Jett, Tim Cole, Yu Ma, J. Stephen Downie, and Beth Plale. “Towards Publishing Secure Capsule-Based Analysis”. 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL). Toronto, Ontario, Canada, June 2017. doi: 10.1109/JCDL.2017.7991585

  14. [14]

    “Supertagger

    Jared Lorince, Sam Zorowitz, Jaimie Murdock, and Peter Todd. ““Supertagger” behavior in building folksonomies”. WebSci ’14: Proceedings of the 2014 ACM Conference on Web Science. Bloomington, Indiana, USA, June 2014, pp. 129–138. doi: 10.1145/2615569.2615686

  15. [15]

    Containing the Semantic Explosion

    Jaimie Murdock, Cameron Buckner, and Colin Allen. “Containing the Semantic Explosion”. Proceedings of the WWW2012 conference workshop PhiloWeb 2012: “Web and Philosophy, Why and What For?” Ed. by Alexandre Monnin, Harry Halpin, and Leslie Carr. Vol. 859. CEUR Workshop Proceedings. Apr. 2012

  16. [16]

    InPhO for All: Why APIs Matter

    Jaimie Murdock and Colin Allen. “InPhO for All: Why APIs Matter”. Journal of the Chicago Colloquium on Digital Humanities and Computer Science 1.3 (Nov. 2011)

  17. [17]

    Identifying Species by Genetic Clustering

    Jaimie Murdock and Larry S. Yaeger. “Identifying Species by Genetic Clustering”. Advances in Artificial Life: 20th Anniversary Edition - Back to the Origins of Alife, ECAL 2011 . Paris, France: MIT Press, Aug. 2011, pp. 564–572

  18. [18]

    Two Methods for Evaluating Dynamic Ontologies

    Jaimie Murdock, Cameron Buckner, and Colin Allen. “Two Methods for Evaluating Dynamic Ontologies”. Proceedings of the 2nd International Conference on Knowledge Engineering and Ontology Development (KEOD). Ed. by Ana Fred and Joaquim Filipe. Vol. 272. Revised and expanded in [5]. Valencia, Spain: Springer-Verlag, Oct. 2010, pp. 110–122. doi: 10.5220/ 000...

  19. [19]

    Enhancing Access to Digital Media: The Language Application Grid in the HTRC Data Capsule

    James Pustejovsky, Marc Verhagen, Keongmin Rim, Yu Ma, Liang Ran, Samitha Liyanage, Jaimie Murdock, Robert H. McDonald, and Beth Plale. “Enhancing Access to Digital Media: The Language Application Grid in the HTRC Data Capsule”. PEARC ’17: Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact....

  20. [20]

    Visualization Techniques for Topic Model Checking

    Jaimie Murdock and Colin Allen. “Visualization Techniques for Topic Model Checking”. AAAI-15: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. Demo of https://hypershelf.org. Jan. 2015, pp. 4284–4285

  21. [21]

    Topic Exploration with the HTRC Data Capsule for Non-Consumptive Research

    Jaimie Murdock, Jiaan Zeng, and Robert H. McDonald. “Topic Exploration with the HTRC Data Capsule for Non-Consumptive Research”. JCDL ’15: Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries. Tutorial. Knoxville, Tennessee, USA, June 2015, p. 295. doi: 10.1145/2756406.2756929

  22. [22]

    LODE: Linking Digital Humanities Content to the Web of Data

    Timo Sztyler, Jakob Huber, Jan Noessner, Jaimie Murdock, Colin Allen, and Mathias Niepert. “LODE: Linking Digital Humanities Content to the Web of Data”. JCDL ’14: Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries. Demo. London, United Kingdom, Sept. 2014, pp. 423–424

  23. [23]

    Mapping the Intersection of Science and Philosophy

    Jaimie Murdock, Robert Light, Colin Allen, and Katy Börner. “Mapping the Intersection of Science and Philosophy”. JCDL ’13: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries. Poster. Indianapolis, Indiana, USA, July 2013, pp. 405–406. doi: 10.1145/2467696.2467777

  24. [24]

    Genetic Clustering for the Identification of Species

    Jaimie Murdock and Larry S. Yaeger. “Genetic Clustering for the Identification of Species”. GECCO ’11: Proceedings of the 13th Annual Conference Companion on Genetic and Evolutionary Computation. Poster. Expanded in [13]. Dublin, Ireland, July 2011, pp. 29–30. doi: 10.1145/2001858.2001875

  25. [25]

    InPhO: A System for Collaboratively Populating and Extending a Dynamic Ontology

    Mathias Niepert, Cameron Buckner, Jaimie Murdock, and Colin Allen. “InPhO: A System for Collaboratively Populating and Extending a Dynamic Ontology”. JCDL ’08: Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries. Pittsburgh, Pennsylvania, USA, June 2008, p. 429. doi: 10.1145/1378889.1378976.

    Miscellaneous

  26. [26]

    Towards Evaluation of Cultural-scale Claims in Light of Topic Model Sampling Effects

    Jaimie Murdock, Jiaan Zeng, and Colin Allen. Towards Evaluation of Cultural-scale Claims in Light of Topic Model Sampling Effects. Advanced Collaborative Support Technical Report. HathiTrust Research Center, June 2016. arXiv: 1512.05004

  27. [27]

    LODE: Linking Digital Humanities Content to the Web of Data

    Jakob Huber, Timo Sztyler, Jan Noessner, Jaimie Murdock, Colin Allen, and Mathias Niepert. LODE: Linking Digital Humanities Content to the Web of Data. Expanded pre-print of abstract published in [18]. 2014. arXiv: 1406.0216

  28. [28]

    Computational Philosophy and the Examined Text: A Tale of Two Encyclopedias

    Colin Allen, Jaimie Murdock, Cameron Buckner, and Robert Rose. “Computational Philosophy and the Examined Text: A Tale of Two Encyclopedias”. American Philosophical Association Newsletter on Philosophy and Computers 12.2 (2013), pp. 28–30

  29. [29]

    Cross-Cutting Categorization Schemes in the Digital Humanities

    Colin Allen and the InPhO Group. “Cross-Cutting Categorization Schemes in the Digital Humanities”. Isis 104.3 (2013), pp. 573–583. doi: 10.1086/673276.

    TEACHING Indiana University, Bloomington, IN Foundations in Science and Mathematics Introduction to Computer Science June 2016 Introduction to Computer Science June 2015 Department of History and Philosop...