{"work":{"id":"72f42543-17d5-49aa-ba5a-25d67ffbb88a","openalex_id":null,"doi":null,"arxiv_id":"1812.01717","raw_key":null,"title":"Towards Accurate Generative Models of Video: A New Metric & Challenges","authors":null,"authors_text":"Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, Sylvain Gelly","year":2018,"venue":"cs.CV","abstract":"Recent advances in deep generative models have lead to remarkable progress in synthesizing high quality images. Following their successful application in image processing and representation learning, an important next step is to consider videos. Learning generative models of video is a much harder task, requiring a model to capture the temporal dynamics of a scene, in addition to the visual presentation of objects. While recent attempts at formulating generative models of video have had some success, current progress is hampered by (1) the lack of qualitative metrics that consider visual quality, temporal coherence, and diversity of samples, and (2) the wide gap between purely synthetic video data sets and challenging real-world data sets in terms of complexity. To this extent we propose Fr\\'{e}chet Video Distance (FVD), a new metric for generative models of video, and StarCraft 2 Videos (SCV), a benchmark of game play from custom starcraft 2 scenarios that challenge the current capabilities of generative models of video. We contribute a large-scale human study, which confirms that FVD correlates well with qualitative human judgment of generated videos, and provide initial benchmark results on SCV.","external_url":"https://arxiv.org/abs/1812.01717","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T05:00:21.368414+00:00","pith_arxiv_id":"1812.01717","created_at":"2026-05-09T06:05:34.370348+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"Towards Accurate Generative Models of Video: A New Metric & Challenges","render_title":"Towards Accurate Generative Models of Video: A New Metric & Challenges"},"hub":{"state":{"work_id":"72f42543-17d5-49aa-ba5a-25d67ffbb88a","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":100,"external_cited_by_count":null,"distinct_field_count":6,"first_pith_cited_at":"2022-04-07T14:08:02+00:00","last_pith_cited_at":"2026-05-22T14:51:22+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-11T18:48:46.211508+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":16},{"context_role":"method","n":16},{"context_role":"baseline","n":3},{"context_role":"dataset","n":1}],"polarity_counts":[{"context_polarity":"background","n":17},{"context_polarity":"use_method","n":16},{"context_polarity":"baseline","n":3}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Towards Accurate Generative Models of Video: A New Metric & Challenges","claims":[{"claim_text":"Recent advances in deep generative models have lead to remarkable progress in synthesizing high quality images. Following their successful application in image processing and representation learning, an important next step is to consider videos. Learning generative models of video is a much harder task, requiring a model to capture the temporal dynamics of a scene, in addition to the visual presentation of objects. While recent attempts at formulating generative models of video have had some success, current progress is hampered by (1) the lack of qualitative metrics that consider visual quali","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"8 554M 250 371k LDM-4Δ𝑔∗ [56] 3.6 247.7 400M 250 51.5k DiT-XL/2Δ𝑔∗ [53] 2.3 278.2 675M 250 59.5k MDTΔ𝑔∗ [18] 1.8 283.0 676M 250>59k MaskDiTΔ𝑔∗ [82] 2.3 276.6 736M 250>28k CDMΔ [30] 4.9 158.7 - 8100 - RINΔ [36] 3.4 182.0 410M 1000 334k Simple DiffusionΔ𝑔 [33] 2.4 256.3 2B 512 - VDM++Δ𝑔 [39] 2.1 267.7 2B 512 - EDiffΔ𝑔 [25] 2.1 - 450M 50 119k LPDM-ADMΔ𝑔 [72] 2.7 - - 50 7.8k MARΔ𝑔 [44]✓1.8 296.0 479M 128 - VQVAE-2Δ [55]✓31.1∼45 13.5B 5120 - VQGANΔ [15]✓15.8 78.3 1.4B 256 - MaskGITΔ[7] 6.2 182.1 227M","claim_type":"baseline","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"with20frames at64×64resolution. We follow the DiffATS pipeline in Fig. 2 directly. Metrics.For unconditional image synthesis, we report the Fréchet Inception Distance (FID) [ 17] with both Inception-V3 [55] and DINOv2 [40] embeddings; the latter follows [54] for a less biased evaluation. For video generation, we report the standard Fréchet Video Distance (FVD) [59]. Results.Generated samples are shown in Fig. 6 (App. D). Tab. 2 shows that DiffATS attains the best FID and FVD among autoencoder-fr","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"frame using Peak Signal-to-Noise Ratio (PSNR) and Learned Perceptual Image Patch Similarity (LPIPS) across four cardinal viewpoints. For video-level evalu- ation, we autoregressively generate a 200-frame trajectory comprising Forward, Backward, Left, and Idle segments, encompassing one start, one180◦ turn, one 90◦ turn, and one stop. We compute FVD [63] between rendered videos from four cardinal viewpoints and ground-truth references containing equivalent ac- tion events and the same number of f","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"we hold out every 10th timestamp for testing, excluding both the vehicle and infrastructure images at these specific timestamps from the training set. We measure reconstruction quality using PSNR, SSIM, and LPIPS. Since dynamic objects represent the core challenge in asynchronous cooperative reconstruction, we specifically report these metrics on dynamic areas, masked using the provided 2D bounding boxes [31]. Furthermore, we report FVD [18] and RAFT-EPE [17] on these dynamic areas to evaluate t","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"resolution. At inference, we sample 35 diffusion steps; generating one video takes approximately 15 minutes on a single A100 GPU. More implementation details are presented in Sec. A. 4.2. Experiment Settings Evaluation metrics.We evaluate our model across four different aspects:Video quality: PSNR and SSIM against reference videos, and FID [33] and FVD [66] for distribution-level similarity.Camera accuracy: rotation and translation errors [1, 2, 31] between reference poses and poses estimated fr","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"low resolution samples at frameskip 4 to 64x128x128 samples at frameskip 1 using a 9x128x128 diffusion model. 4 4 Experiments We report our results on video diffusion models for unconditional video generation (Section 4.1), conditional video generation (video prediction) (Section 4.2), and text-conditioned video genera- tion (Section 4.3). We evaluate our models using standard metrics such as FVD [54], FID [19], and IS [43]; details on evaluation are provided below alongside each benchmark. Samp","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Towards Accurate Generative Models of Video: A New Metric & Challenges because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (16 contexts).","role_counts":[{"n":16,"context_role":"background"},{"n":15,"context_role":"method"},{"n":3,"context_role":"baseline"},{"n":1,"context_role":"dataset"}]},"error":null,"updated_at":"2026-05-25T05:05:38.638810+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"bbf4c12a-b635-44de-a65c-9db61f95e252","orcid":null,"display_name":"Thomas Unterthiner"},{"id":"2e12c5a4-5577-4a0b-9f48-e56d2efeca34","orcid":null,"display_name":"Sjoerd van Steenkiste"},{"id":"fe58ce88-c1e6-4d31-95aa-cc0187070f47","orcid":null,"display_name":"Karol Kurach"},{"id":"b572e91c-3e32-4113-a63b-63dd0d180b43","orcid":null,"display_name":"Raphael Marinier"},{"id":"377c3da2-4679-44fd-89c7-e325138f41ea","orcid":null,"display_name":"Marcin Michalski"},{"id":"17c19b6d-a124-4406-af3e-4c62326e63ef","orcid":null,"display_name":"Sylvain Gelly"}]},"error":null,"updated_at":"2026-05-25T05:05:38.634622+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T12:50:50.035352+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":23},{"title":"Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets","work_id":"4f68eada-27e3-437a-a2fe-6e4ca524d0d3","shared_citers":22},{"title":"CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer","work_id":"f38fc088-12aa-4bf4-9ecd-08d3e797ccb7","shared_citers":13},{"title":"HunyuanVideo: A Systematic Framework For Large Video Generative Models","work_id":"881efa7e-7e73-4c66-9cc3-2803e551061c","shared_citers":10},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":8},{"title":"Imagen Video: High Definition Video Generation with Diffusion Models","work_id":"bb20d241-dc6f-4b0a-b071-fd43a2cbd57f","shared_citers":8},{"title":"Score-Based Generative Modeling through Stochastic Differential Equations","work_id":"d9110e53-a5d4-4794-a4c5-a575e91c31ad","shared_citers":8},{"title":"Auto-Encoding Variational Bayes","work_id":"97d95295-30e1-42b4-bbf6-85f0fa4edb44","shared_citers":7},{"title":"CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers","work_id":"2dbd6bcd-fc98-4fbf-b586-f6d94fe1abd2","shared_citers":7},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":7},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":7},{"title":"CameraCtrl: Enabling Camera Control for Text-to-Video Generation","work_id":"1c05c278-c023-4ef0-a359-25a41f1065eb","shared_citers":6},{"title":"Classifier-Free Diffusion Guidance","work_id":"acf2c588-c088-4a6c-938e-150ad7c666d7","shared_citers":6},{"title":"Cosmos World Foundation Model Platform for Physical AI","work_id":"a2dba24c-318d-476a-8b21-4289c265810c","shared_citers":6},{"title":"Latent video diffusion models for high-fidelity video generation with arbitrary lengths","work_id":"23338b3d-620a-4954-904f-bab6a577b8a5","shared_citers":6},{"title":"Make-A-Video: Text-to-Video Generation without Text-Video Data","work_id":"52a801fc-a707-45a1-a8cd-0d6702f124ab","shared_citers":6},{"title":"Seedance 1.0: Exploring the Boundaries of Video Generation Models","work_id":"b2e36b5d-99e4-45b4-9358-64f6d3501983","shared_citers":6},{"title":"Video Diffusion Models","work_id":"02e03469-549e-4b5a-9bf0-ac6617a89882","shared_citers":6},{"title":"VideoGPT: Video Generation using VQ-VAE and Transformers","work_id":"703c74c3-fa5e-455c-8c00-697c83511fcf","shared_citers":6},{"title":"AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning","work_id":"1f9d1d3b-a6d6-45a9-9f13-51393c03be8a","shared_citers":5},{"title":"Diffusion models are real-time game engines","work_id":"3f074579-63c2-40bf-b5c8-c6a8c39d9319","shared_citers":5},{"title":"Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow","work_id":"a1989e1b-d66d-4533-be3a-fb9c5fd62290","shared_citers":5},{"title":"Latte: Latent Diffusion Transformer for Video Generation","work_id":"5328e907-7278-4781-a2bb-c5ef40dc87fb","shared_citers":5},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":5}],"time_series":[{"n":2,"year":2022},{"n":2,"year":2023},{"n":3,"year":2024},{"n":1,"year":2025},{"n":47,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T12:50:34.528647+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T12:50:42.954559+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Towards Accurate Generative Models of Video: A New Metric & Challenges","claims":[{"claim_text":"Recent advances in deep generative models have lead to remarkable progress in synthesizing high quality images. Following their successful application in image processing and representation learning, an important next step is to consider videos. Learning generative models of video is a much harder task, requiring a model to capture the temporal dynamics of a scene, in addition to the visual presentation of objects. While recent attempts at formulating generative models of video have had some success, current progress is hampered by (1) the lack of qualitative metrics that consider visual quali","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"8 554M 250 371k LDM-4Δ𝑔∗ [56] 3.6 247.7 400M 250 51.5k DiT-XL/2Δ𝑔∗ [53] 2.3 278.2 675M 250 59.5k MDTΔ𝑔∗ [18] 1.8 283.0 676M 250>59k MaskDiTΔ𝑔∗ [82] 2.3 276.6 736M 250>28k CDMΔ [30] 4.9 158.7 - 8100 - RINΔ [36] 3.4 182.0 410M 1000 334k Simple DiffusionΔ𝑔 [33] 2.4 256.3 2B 512 - VDM++Δ𝑔 [39] 2.1 267.7 2B 512 - EDiffΔ𝑔 [25] 2.1 - 450M 50 119k LPDM-ADMΔ𝑔 [72] 2.7 - - 50 7.8k MARΔ𝑔 [44]✓1.8 296.0 479M 128 - VQVAE-2Δ [55]✓31.1∼45 13.5B 5120 - VQGANΔ [15]✓15.8 78.3 1.4B 256 - MaskGITΔ[7] 6.2 182.1 227M","claim_type":"baseline","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"with20frames at64×64resolution. We follow the DiffATS pipeline in Fig. 2 directly. Metrics.For unconditional image synthesis, we report the Fréchet Inception Distance (FID) [ 17] with both Inception-V3 [55] and DINOv2 [40] embeddings; the latter follows [54] for a less biased evaluation. For video generation, we report the standard Fréchet Video Distance (FVD) [59]. Results.Generated samples are shown in Fig. 6 (App. D). Tab. 2 shows that DiffATS attains the best FID and FVD among autoencoder-fr","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"frame using Peak Signal-to-Noise Ratio (PSNR) and Learned Perceptual Image Patch Similarity (LPIPS) across four cardinal viewpoints. For video-level evalu- ation, we autoregressively generate a 200-frame trajectory comprising Forward, Backward, Left, and Idle segments, encompassing one start, one180◦ turn, one 90◦ turn, and one stop. We compute FVD [63] between rendered videos from four cardinal viewpoints and ground-truth references containing equivalent ac- tion events and the same number of f","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"we hold out every 10th timestamp for testing, excluding both the vehicle and infrastructure images at these specific timestamps from the training set. We measure reconstruction quality using PSNR, SSIM, and LPIPS. Since dynamic objects represent the core challenge in asynchronous cooperative reconstruction, we specifically report these metrics on dynamic areas, masked using the provided 2D bounding boxes [31]. Furthermore, we report FVD [18] and RAFT-EPE [17] on these dynamic areas to evaluate t","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"resolution. At inference, we sample 35 diffusion steps; generating one video takes approximately 15 minutes on a single A100 GPU. More implementation details are presented in Sec. A. 4.2. Experiment Settings Evaluation metrics.We evaluate our model across four different aspects:Video quality: PSNR and SSIM against reference videos, and FID [33] and FVD [66] for distribution-level similarity.Camera accuracy: rotation and translation errors [1, 2, 31] between reference poses and poses estimated fr","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"low resolution samples at frameskip 4 to 64x128x128 samples at frameskip 1 using a 9x128x128 diffusion model. 4 4 Experiments We report our results on video diffusion models for unconditional video generation (Section 4.1), conditional video generation (video prediction) (Section 4.2), and text-conditioned video genera- tion (Section 4.3). We evaluate our models using standard metrics such as FVD [54], FID [19], and IS [43]; details on evaluation are provided below alongside each benchmark. Samp","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Towards Accurate Generative Models of Video: A New Metric & Challenges because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (16 contexts).","role_counts":[{"n":16,"context_role":"background"},{"n":15,"context_role":"method"},{"n":3,"context_role":"baseline"},{"n":1,"context_role":"dataset"}]},"error":null,"updated_at":"2026-05-25T05:05:38.642777+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Towards Accurate Generative Models of Video: A New Metric & Challenges","claims":[{"claim_text":"Recent advances in deep generative models have lead to remarkable progress in synthesizing high quality images. Following their successful application in image processing and representation learning, an important next step is to consider videos. Learning generative models of video is a much harder task, requiring a model to capture the temporal dynamics of a scene, in addition to the visual presentation of objects. While recent attempts at formulating generative models of video have had some success, current progress is hampered by (1) the lack of qualitative metrics that consider visual quali","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Towards Accurate Generative Models of Video: A New Metric & Challenges because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T12:50:36.758338+00:00"}},"summary":{"title":"Towards Accurate Generative Models of Video: A New Metric & Challenges","claims":[{"claim_text":"Recent advances in deep generative models have lead to remarkable progress in synthesizing high quality images. Following their successful application in image processing and representation learning, an important next step is to consider videos. Learning generative models of video is a much harder task, requiring a model to capture the temporal dynamics of a scene, in addition to the visual presentation of objects. While recent attempts at formulating generative models of video have had some success, current progress is hampered by (1) the lack of qualitative metrics that consider visual quali","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Towards Accurate Generative Models of Video: A New Metric & Challenges because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":23},{"title":"Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets","work_id":"4f68eada-27e3-437a-a2fe-6e4ca524d0d3","shared_citers":22},{"title":"CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer","work_id":"f38fc088-12aa-4bf4-9ecd-08d3e797ccb7","shared_citers":13},{"title":"HunyuanVideo: A Systematic Framework For Large Video Generative Models","work_id":"881efa7e-7e73-4c66-9cc3-2803e551061c","shared_citers":10},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":8},{"title":"Imagen Video: High Definition Video Generation with Diffusion Models","work_id":"bb20d241-dc6f-4b0a-b071-fd43a2cbd57f","shared_citers":8},{"title":"Score-Based Generative Modeling through Stochastic Differential Equations","work_id":"d9110e53-a5d4-4794-a4c5-a575e91c31ad","shared_citers":8},{"title":"Auto-Encoding Variational Bayes","work_id":"97d95295-30e1-42b4-bbf6-85f0fa4edb44","shared_citers":7},{"title":"CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers","work_id":"2dbd6bcd-fc98-4fbf-b586-f6d94fe1abd2","shared_citers":7},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":7},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":7},{"title":"CameraCtrl: Enabling Camera Control for Text-to-Video Generation","work_id":"1c05c278-c023-4ef0-a359-25a41f1065eb","shared_citers":6},{"title":"Classifier-Free Diffusion Guidance","work_id":"acf2c588-c088-4a6c-938e-150ad7c666d7","shared_citers":6},{"title":"Cosmos World Foundation Model Platform for Physical AI","work_id":"a2dba24c-318d-476a-8b21-4289c265810c","shared_citers":6},{"title":"Latent video diffusion models for high-fidelity video generation with arbitrary lengths","work_id":"23338b3d-620a-4954-904f-bab6a577b8a5","shared_citers":6},{"title":"Make-A-Video: Text-to-Video Generation without Text-Video Data","work_id":"52a801fc-a707-45a1-a8cd-0d6702f124ab","shared_citers":6},{"title":"Seedance 1.0: Exploring the Boundaries of Video Generation Models","work_id":"b2e36b5d-99e4-45b4-9358-64f6d3501983","shared_citers":6},{"title":"Video Diffusion Models","work_id":"02e03469-549e-4b5a-9bf0-ac6617a89882","shared_citers":6},{"title":"VideoGPT: Video Generation using VQ-VAE and Transformers","work_id":"703c74c3-fa5e-455c-8c00-697c83511fcf","shared_citers":6},{"title":"AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning","work_id":"1f9d1d3b-a6d6-45a9-9f13-51393c03be8a","shared_citers":5},{"title":"Diffusion models are real-time game engines","work_id":"3f074579-63c2-40bf-b5c8-c6a8c39d9319","shared_citers":5},{"title":"Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow","work_id":"a1989e1b-d66d-4533-be3a-fb9c5fd62290","shared_citers":5},{"title":"Latte: Latent Diffusion Transformer for Video Generation","work_id":"5328e907-7278-4781-a2bb-c5ef40dc87fb","shared_citers":5},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":5}],"time_series":[{"n":2,"year":2022},{"n":2,"year":2023},{"n":3,"year":2024},{"n":1,"year":2025},{"n":47,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"fe58ce88-c1e6-4d31-95aa-cc0187070f47","orcid":null,"display_name":"Karol Kurach","source":"manual","import_confidence":0.72},{"id":"377c3da2-4679-44fd-89c7-e325138f41ea","orcid":null,"display_name":"Marcin Michalski","source":"manual","import_confidence":0.72},{"id":"b572e91c-3e32-4113-a63b-63dd0d180b43","orcid":null,"display_name":"Raphael Marinier","source":"manual","import_confidence":0.72},{"id":"2e12c5a4-5577-4a0b-9f48-e56d2efeca34","orcid":null,"display_name":"Sjoerd van Steenkiste","source":"manual","import_confidence":0.72},{"id":"17c19b6d-a124-4406-af3e-4c62326e63ef","orcid":null,"display_name":"Sylvain Gelly","source":"manual","import_confidence":0.72},{"id":"bbf4c12a-b635-44de-a65c-9db61f95e252","orcid":null,"display_name":"Thomas Unterthiner","source":"manual","import_confidence":0.72}]}}