{"work":{"id":"a1fd09f1-b62b-4aca-a5ef-dd2b50ad08b5","openalex_id":null,"doi":null,"arxiv_id":null,"raw_key":"raw:3eeb6cc3436cd43bb7a41789","title":"Advances in neural information processing systems , volume=","authors":null,"authors_text":"Attention is all you need , author=","year":null,"venue":null,"abstract":null,"external_url":null,"cited_by_count":null,"metadata_source":"raw_reference","metadata_fetched_at":"2026-05-27T06:38:54.295934+00:00","pith_arxiv_id":null,"created_at":"2026-05-10T18:26:22.474706+00:00","updated_at":"2026-05-27T06:38:54.295934+00:00","title_quality_ok":false,"display_title":"Advances in neural information processing systems , volume=","render_title":"Advances in neural information processing systems , volume="},"hub":{"state":{"work_id":"a1fd09f1-b62b-4aca-a5ef-dd2b50ad08b5","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":143,"external_cited_by_count":null,"distinct_field_count":20,"first_pith_cited_at":"2022-11-30T17:33:28+00:00","last_pith_cited_at":"2026-05-22T17:28:53+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-02T13:24:43.507945+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":6},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":6},{"context_polarity":"use_method","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Advances in neural information processing systems , volume=","claims":[{"claim_text":"Careful design and use of systems like the co-scientist are crucial to mitigate these risks. AI as a catalyst for both scientific discovery and equity.Despite these risks, AI holds immense potential to democratize access to scientific information and accelerate discovery, particularly benefiting historically neglected and resource-constrained areas [142, 143]. In essence, AI can \"raise the tide\" of scientific progress, lifting all boats, especially those that have historically been left behind. ","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"TBPTT with truncation depth K computes gradients on exactly this truncated graph, hence equals∇ θLK(θ). A.2. Chain-Rule Derivation via Jacobian Products and Adjoint Recursion We provide an explicit chain-rule expansion for the gradient computed by TBPTT and show that it equals∇ θLK(θ). Notation.LetC i ∈R dC be the vectorized interface state. Define the Jacobian of the state transition: Ji := ∂Ci ∂Ci−1 ∈R dC ×dC ,(6) and the sensitivity of the transition to parameters: Ui := ∂Ci ∂θ .(7) For a fix","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"j∈N(i) softmaxj(αij)vij.(8) where the superscript(c→n)stands for center to neighbors andN(i)is the neighbor ID set of theith node. (b)Neighbor to neighbor attention: The neighbor to neighbor attention of theith node works similarly as the center to neighbor attention and can be formulated as the following: qij =W ′ qhij,k ik =W ′ khik,v ik =W ′ vhik,(9) βjk = g(d jk ) q⊤ ijkik √dh ,(10) o(n→n) i = 1 |N(i)| X j,k∈N(i),j̸=k softmaxk(βjk )vik.(11) wherei̸=j̸=kandW ′ q,W ′ k andW ′ v are learnable w","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Constructing paired neutral and stigmatizing narratives. Clinical experts with domain expertise in medical stigma constructed a foundational neutral vignette for each of the four evaluated conditions. Drawing on established tax- onomies of SL in medical documentations (Harrigian et al., 2023; Goddu et al., 2018; McArthur et al., 2026), we fo- cused on three primary phenotypes: (1) doubt (questioning the validity of patient-reported symptoms), (2) blame (at- tributing treatment non-adherence to p","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"of estimation error of two nuisance parameters - propensity score eP ∗ and outcome regression P ∗ Y|A,X . The requirement of op(n−1/2) convergence rate for this double product is standard. Therefore, now ignoring the established op(n−1/2) term, it remains to show that Sε(bP1,bP0)−S ε(P ∗ 1 , P ∗ 0 )− h (P ∗ 1 − bP1)υ1 + (P ∗ 0 − bP0)υ0 i =o p(n−1/2).(36) The above is true via the Hadamard differentiability of (α, β)7→S ε(α, β). For any (µ, ν)∈ P(Y)× P(Y) and (γ1, γ2)∈ M0,µ × M0,µ, consider paths","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"Optimizer Adam, lr =3×10 −4 Adam, lr =3×10 −4 Batch Size 64 128 Training Steps 1M 1M Diffusion Timesteps 150 200 Observation HorizonT o 10 4 (Kitchen), 2 (LIBERO), 10 (others) Planning HorizonT p 16 (Cheetah), 32 (Ant) 16 (Kitchen), 10 (LIBERO), 32 (others) Execution HorizonT o 1 8 (Kitchen, LIBERO), 10 (others) Guidance CG,ω= 1.5 CFG Inverse Dynamics Model - 2-layer embed (64), 3-layer MLP (128), 1M steps Refinement Loss Cofficient 0.1 0.1 Table 12: Planner configurations for type (i): full tra","claim_type":"background","confidence":0.7,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Advances in neural information processing systems , volume= because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (6 contexts).","role_counts":[{"n":6,"context_role":"background"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-22T22:14:01.433342+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"d0097eda-c1b6-4d87-b0b6-c96fdde9ee61","orcid":null,"display_name":"Attention is all you need"},{"id":"fe4d5dbf-e369-4296-8f93-544d5ed81b09","orcid":null,"display_name":"author="}]},"error":null,"updated_at":"2026-05-22T22:14:01.422498+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T18:46:46.126552+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Advances in neural information processing systems , volume=","work_id":"12f5a236-ef7a-4d13-b4de-b51465a6f977","shared_citers":9},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":8},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":7},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":7},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":6},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":6},{"title":"Neurocomputing , volume=","work_id":"0cfd4ae7-837c-4c4f-bf6d-afc10f5a8ed1","shared_citers":6},{"title":"OpenAI blog , volume=","work_id":"31dc92c3-2fc3-432b-8d63-b7ee13f53a9c","shared_citers":6},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":6},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":5},{"title":"Layer Normalization","work_id":"20a2d720-0046-4c7c-bcd6-327ec8143f69","shared_citers":5},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":5},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":5},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":5},{"title":"2016 , publisher=","work_id":"cf0899e0-53ee-4591-aae4-f38fa5ac12ad","shared_citers":4},{"title":"Advances in neural information processing systems , volume=","work_id":"48172a5a-0dfc-45cf-9fdc-988f99c16450","shared_citers":4},{"title":"Journal of machine learning research , volume=","work_id":"d1b76c93-2608-4f12-9b0e-1c71236cad23","shared_citers":4},{"title":"Longformer: The Long-Document Transformer","work_id":"abea7a44-6668-4de7-aab6-f53a6e5aa088","shared_citers":4},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":4},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":4},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":4},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":4},{"title":"Scaling Learning Algorithms Towards","work_id":"bb2761cc-98d0-411b-92f6-803773d64460","shared_citers":4},{"title":"The Pile: An 800GB Dataset of Diverse Text for Language Modeling","work_id":"9b10667a-da61-4358-aceb-10578234d45d","shared_citers":4}],"time_series":[{"n":2,"year":2023},{"n":4,"year":2024},{"n":1,"year":2025},{"n":32,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T18:46:24.812195+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T18:46:33.824857+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Advances in neural information processing systems , volume=","claims":[{"claim_text":"Careful design and use of systems like the co-scientist are crucial to mitigate these risks. AI as a catalyst for both scientific discovery and equity.Despite these risks, AI holds immense potential to democratize access to scientific information and accelerate discovery, particularly benefiting historically neglected and resource-constrained areas [142, 143]. In essence, AI can \"raise the tide\" of scientific progress, lifting all boats, especially those that have historically been left behind. ","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"TBPTT with truncation depth K computes gradients on exactly this truncated graph, hence equals∇ θLK(θ). A.2. Chain-Rule Derivation via Jacobian Products and Adjoint Recursion We provide an explicit chain-rule expansion for the gradient computed by TBPTT and show that it equals∇ θLK(θ). Notation.LetC i ∈R dC be the vectorized interface state. Define the Jacobian of the state transition: Ji := ∂Ci ∂Ci−1 ∈R dC ×dC ,(6) and the sensitivity of the transition to parameters: Ui := ∂Ci ∂θ .(7) For a fix","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"j∈N(i) softmaxj(αij)vij.(8) where the superscript(c→n)stands for center to neighbors andN(i)is the neighbor ID set of theith node. (b)Neighbor to neighbor attention: The neighbor to neighbor attention of theith node works similarly as the center to neighbor attention and can be formulated as the following: qij =W ′ qhij,k ik =W ′ khik,v ik =W ′ vhik,(9) βjk = g(d jk ) q⊤ ijkik √dh ,(10) o(n→n) i = 1 |N(i)| X j,k∈N(i),j̸=k softmaxk(βjk )vik.(11) wherei̸=j̸=kandW ′ q,W ′ k andW ′ v are learnable w","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Constructing paired neutral and stigmatizing narratives. Clinical experts with domain expertise in medical stigma constructed a foundational neutral vignette for each of the four evaluated conditions. Drawing on established tax- onomies of SL in medical documentations (Harrigian et al., 2023; Goddu et al., 2018; McArthur et al., 2026), we fo- cused on three primary phenotypes: (1) doubt (questioning the validity of patient-reported symptoms), (2) blame (at- tributing treatment non-adherence to p","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"of estimation error of two nuisance parameters - propensity score eP ∗ and outcome regression P ∗ Y|A,X . The requirement of op(n−1/2) convergence rate for this double product is standard. Therefore, now ignoring the established op(n−1/2) term, it remains to show that Sε(bP1,bP0)−S ε(P ∗ 1 , P ∗ 0 )− h (P ∗ 1 − bP1)υ1 + (P ∗ 0 − bP0)υ0 i =o p(n−1/2).(36) The above is true via the Hadamard differentiability of (α, β)7→S ε(α, β). For any (µ, ν)∈ P(Y)× P(Y) and (γ1, γ2)∈ M0,µ × M0,µ, consider paths","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"Optimizer Adam, lr =3×10 −4 Adam, lr =3×10 −4 Batch Size 64 128 Training Steps 1M 1M Diffusion Timesteps 150 200 Observation HorizonT o 10 4 (Kitchen), 2 (LIBERO), 10 (others) Planning HorizonT p 16 (Cheetah), 32 (Ant) 16 (Kitchen), 10 (LIBERO), 32 (others) Execution HorizonT o 1 8 (Kitchen, LIBERO), 10 (others) Guidance CG,ω= 1.5 CFG Inverse Dynamics Model - 2-layer embed (64), 3-layer MLP (128), 1M steps Refinement Loss Cofficient 0.1 0.1 Table 12: Planner configurations for type (i): full tra","claim_type":"background","confidence":0.7,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Advances in neural information processing systems , volume= because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (6 contexts).","role_counts":[{"n":6,"context_role":"background"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-22T22:14:01.429011+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Advances in neural information processing systems , volume=","claims":[],"why_cited":"Pith tracks Advances in neural information processing systems , volume= because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T18:46:24.738175+00:00"}},"summary":{"title":"Advances in neural information processing systems , volume=","claims":[],"why_cited":"Pith tracks Advances in neural information processing systems , volume= because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Advances in neural information processing systems , volume=","work_id":"12f5a236-ef7a-4d13-b4de-b51465a6f977","shared_citers":9},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":8},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":7},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":7},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":6},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":6},{"title":"Neurocomputing , volume=","work_id":"0cfd4ae7-837c-4c4f-bf6d-afc10f5a8ed1","shared_citers":6},{"title":"OpenAI blog , volume=","work_id":"31dc92c3-2fc3-432b-8d63-b7ee13f53a9c","shared_citers":6},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":6},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":5},{"title":"Layer Normalization","work_id":"20a2d720-0046-4c7c-bcd6-327ec8143f69","shared_citers":5},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":5},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":5},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":5},{"title":"2016 , publisher=","work_id":"cf0899e0-53ee-4591-aae4-f38fa5ac12ad","shared_citers":4},{"title":"Advances in neural information processing systems , volume=","work_id":"48172a5a-0dfc-45cf-9fdc-988f99c16450","shared_citers":4},{"title":"Journal of machine learning research , volume=","work_id":"d1b76c93-2608-4f12-9b0e-1c71236cad23","shared_citers":4},{"title":"Longformer: The Long-Document Transformer","work_id":"abea7a44-6668-4de7-aab6-f53a6e5aa088","shared_citers":4},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":4},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":4},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":4},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":4},{"title":"Scaling Learning Algorithms Towards","work_id":"bb2761cc-98d0-411b-92f6-803773d64460","shared_citers":4},{"title":"The Pile: An 800GB Dataset of Diverse Text for Language Modeling","work_id":"9b10667a-da61-4358-aceb-10578234d45d","shared_citers":4}],"time_series":[{"n":2,"year":2023},{"n":4,"year":2024},{"n":1,"year":2025},{"n":32,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"d0097eda-c1b6-4d87-b0b6-c96fdde9ee61","orcid":null,"display_name":"Attention is all you need","source":"manual","import_confidence":0.72},{"id":"fe4d5dbf-e369-4296-8f93-544d5ed81b09","orcid":null,"display_name":"author=","source":"manual","import_confidence":0.72}]}}