FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis: Settings

This paper proposes a consistent V2V synthesis framework by jointly leveraging spatial conditions and temporal optical flow clues within the source video.



:::info (1) Feng Liang, The University of Texas at Austin and Work partially done during an internship at Meta GenAI (Email: jeffliang@utexas.edu);

(2) Bichen Wu, Meta GenAI and Corresponding author;

(3) Jialiang Wang, Meta GenAI;

(4) Licheng Yu, Meta GenAI;

(5) Kunpeng Li, Meta GenAI;

(6) Yinan Zhao, Meta GenAI;

(7) Ishan Misra, Meta GenAI;

(8) Jia-Bin Huang, Meta GenAI;

(9) Peizhao Zhang, Meta GenAI (Email: stzpz@meta.com);

(10) Peter Vajda, Meta GenAI (Email: vajdap@meta.com);

(11) Diana Marculescu, The University of Texas at Austin (Email: dianam@utexas.edu).

:::

5. Experiments

5.1. Settings

Implementation Details. We train our model on 100k videos from Shutterstock [1]. For each training video, we sequentially sample 16 frames at an interval of {2, 4, 8} frames, corresponding to clips lasting {1, 2, 4} seconds (assuming a 30-FPS source). All images, including input frames, spatial condition images, and flow-warped frames, are center-cropped to a resolution of 512×512. We train with a batch size of 1 per GPU, for a total batch size of 8 across 8 GPUs. We use the AdamW optimizer [28] with a learning rate of 1e-5 for 100k iterations. As detailed in our method, we jointly train the main U-Net and the ControlNet U-Net branches with v-parameterization [41]. Training takes four days on a single node with eight A100-80GB GPUs.
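For concreteness, here is a minimal Python sketch of the clip sampling and cropping described above. The function names are illustrative (not from the FlowVid codebase), and the spacing arithmetic assumes a 30-FPS source video as stated in the text.

```python
import random

NUM_FRAMES = 16
INTERVALS = (2, 4, 8)  # frame gaps -> 1 s, 2 s, 4 s clips at 30 FPS

def sample_clip_indices(num_video_frames: int) -> list[int]:
    """Pick a random interval, then take 16 sequential frames at that spacing."""
    interval = random.choice(INTERVALS)
    span = (NUM_FRAMES - 1) * interval
    start = random.randint(0, num_video_frames - span - 1)
    return [start + i * interval for i in range(NUM_FRAMES)]

def center_crop_box(height: int, width: int) -> tuple[int, int, int, int]:
    """Return (top, left, bottom, right) of the largest centered square;
    the crop would then be resized to 512x512 (e.g., with PIL or OpenCV)."""
    side = min(height, width)
    top, left = (height - side) // 2, (width - side) // 2
    return top, left, top + side, left + side
```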


During generation, we first generate keyframes with our trained model and then use an off-the-shelf frame interpolation model, such as RIFE [21], to generate the non-keyframes. By default, we produce 16 keyframes at an interval of 4, corresponding to a 2-second clip at 8 FPS; we then use RIFE to interpolate the result to 32 FPS. We employ classifier-free guidance [15] with a scale of 7.5 and 20 inference sampling steps, and we use the Zero SNR noise scheduler [27]. Following FateZero [35], we also fuse the self-attention features obtained during DDIM inversion of the corresponding keyframes from the input video. We evaluate our FlowVid with two different spatial conditions: canny edge maps [5] and depth maps [38]. A comparison of these controls can be found in Section 5.4.
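Putting these defaults together, a minimal sketch of the generate-then-interpolate pipeline might look like the following. Here `edit_keyframes` and `rife_interpolate` are placeholder names standing in for the trained model and an off-the-shelf RIFE wrapper, not real APIs; only the numbers (16 keyframes, interval 4, guidance 7.5, 20 steps, 8→32 FPS) come from the text.

```python
NUM_KEYFRAMES = 16   # keyframes edited by the model
KEY_INTERVAL = 4     # source-frame gap between keyframes
KEY_FPS = 8          # 16 keyframes over a 2 s clip
TARGET_FPS = 32      # after 4x frame interpolation

def edit_keyframes(frames, prompt, guidance_scale=7.5, num_steps=20,
                   condition="depth"):
    """Placeholder for the trained model: classifier-free guidance at 7.5,
    20 sampling steps, Zero SNR schedule, canny- or depth-conditioned."""
    raise NotImplementedError  # assumed model call, not a real API

def rife_interpolate(frames, factor):
    """Placeholder for RIFE frame interpolation (here 8 FPS -> 32 FPS)."""
    raise NotImplementedError  # assumed wrapper, not RIFE's real interface

def synthesize(video_frames, prompt):
    # 1) Take 16 keyframes at interval 4: a 2-second clip at 8 FPS.
    keys = [video_frames[i]
            for i in range(0, NUM_KEYFRAMES * KEY_INTERVAL, KEY_INTERVAL)]
    # 2) Edit the keyframes with the trained V2V model.
    edited = edit_keyframes(keys, prompt)
    # 3) Interpolate the edited keyframes up to the target frame rate.
    return rife_interpolate(edited, factor=TARGET_FPS // KEY_FPS)
```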

Evaluation. We select 25 object-centric videos from the public DAVIS dataset [34], covering humans, animals, etc. We manually design 115 prompts for these videos, spanning stylization to object swaps. In addition, we collect 50 Shutterstock videos [1] with 200 designed prompts. We conduct both qualitative (see Section 5.2) and quantitative (see Section 5.3) comparisons with state-of-the-art methods, including Rerender [49], CoDeF [32], and TokenFlow [13], using their official code with default settings.


:::info This paper is available on arXiv under a CC 4.0 license.

:::
