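Global pose normalization. In the global pose normalization stage, the differences in body shape and position between the source and target persons within a given frame are computed, and the source pose stick figure is transformed into one that matches the target person's body shape and position.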

Let us solve the problem of Yang Chaoyue not being able to dance!

Given frame y from the original target video, we use pose detector P to obtain a corresponding pose stick figure x=P(y). Finally, we design a system to learn the mapping from the normalized pose stick figures to images of the target person with adversarial training.

Over the past few years, several frameworks, many (but not all) of which use GANs, have been developed to solve such mappings, including pix2pix. Adding the temporal smoothing setup does not seem to decrease the reconstructed pose distances significantly; however, including the face GAN adds substantial improvements overall, especially for the face and hand keypoints.

We now describe every component of our system in detail. In addition, we release a first-of-its-kind open-source dataset of videos that can be legally used for training and motion transfer.

We predict two consecutive frames for temporally coherent video results and introduce a separate pipeline for realistic face synthesis.

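This is in contrast to approaches over the last two decades which employ nearest neighbor search. We found pre-smoothing pose keypoints to be immensely helpful in reducing jittering in our outputs.

Mapping from normalized pose stick figures to images of the target person. This stage uses a generative adversarial model that is trained to map normalized pose stick figures to images of the target person.

Given a frame y from the target video, a pretrained pose detector P produces the corresponding pose stick figure x = P(y). During training, the corresponding (x, y) pairs are used to learn a mapping G from the pose stick figure x to a synthesized image of the target person, G(x). Through adversarial training with the discriminator D and a perceptual reconstruction loss computed with a pretrained VGGNet, the generator G is optimized so that its output is close to the real image y. The discriminator tries to distinguish "real" image pairs, e.g. (x, y), from "fake" image pairs, e.g. (x, G(x)).

Transfer works much like training: the pose detector P extracts a pose stick figure x' from a given frame y' of the source video. Because the body size and position in x' differ from those of the person in the target video, a global pose normalization transforms x' to be more consistent with the target person; the result is denoted x. Feeding x into the trained model G yields an image of the target person, G(x), which corresponds to frame y' of the source video.

Pose detection uses a pretrained model P (for example the open-source project OpenPose) to obtain accurate estimates of the x, y coordinates of the body joints. Connecting the joints gives the pose stick figure, as shown in Figure 3. During training, the pose stick figure is the input to the generator G. During transfer, P extracts the estimate x' from the source subject, and normalization matches it to the target person. References on pose estimation are listed at the end of the article.

Normalization starts by finding the minimum and maximum ankle keypoint positions in the source video and in the target video (the position closest to the camera is the maximum, the farthest is the minimum). The method is simple: the ankle keypoint nearest the bottom of the image is the maximum, and the other is the minimum.

In the normalization formula, two terms are the minimum and maximum ankle keypoint positions in the target video, and two are the corresponding positions in the source video. A further term is the average ankle position in the source video, and the last is the offset of the pose position in the current source frame relative to the first frame (the paper does not state this; it is my interpretation).

By modifying the adversarial training of pix2pixHD, the model can generate temporally coherent video frames and synthesize realistic face images.

In the original conditional GAN, the generator G plays against a multi-scale discriminator D = (D1, D2, D3). Besides the adversarial loss, the original pix2pixHD objective contains two further terms: the discriminator feature-matching loss proposed in pix2pixHD, and the perceptual reconstruction loss, which compares features at different layers of a pretrained VGGNet.

To generate video sequences, the paper changes the single-image generation of the original pix2pixHD so that it produces temporally coherent adjacent frames (Figure 4). The model predicts two consecutive frames. The first output G(x_{t-1}) is conditioned on the corresponding pose stick figure x_{t-1} and a blank image z (all zeros, used as a placeholder because there is no frame at t-2). The second output G(x_t) is conditioned on x_t and G(x_{t-1}). Correspondingly, the discriminator now judges both the realism and the temporal coherence of the real sequence (x_{t-1}, x_t, y_{t-1}, y_t) against the fake sequence (x_{t-1}, x_t, G(x_{t-1}), G(x_t)). The new objective is obtained by adding a temporal smoothing loss to the original pix2pixHD objective.

After the generator G has produced the full image, a small region centered on the face, G(x)_F, is cropped and fed, together with the corresponding region of the pose stick figure, x_F, into a second generator, which outputs a face residual r. The final output is the residual added to the original values of that region, i.e. r + G(x)_F. As in the original pix2pix objective, the discriminator tries to distinguish "real" face pairs from "fake" face pairs.

Generated images have no corresponding ground-truth images to evaluate against. To assess the quality of individual images, the paper measures Structural Similarity (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS). The temporal coherence of the output videos is assessed qualitatively. References for SSIM and LPIPS are listed at the end of the article.

Table 1 reports results computed on the generated images of the target person after cropping them to the bounding box of the normalized pose stick figure. T.S. denotes the variant in which the generator uses temporal smoothing. T.S.+Face is the full model, with both temporal smoothing and face generation.

Table 3 reports the pose distance d. If the body parts are synthesized correctly, the pose stick figure detected in the synthesized image should be very close to the pose stick figure used as the conditioning input. To evaluate this pose consistency, the paper defines a pose distance metric. For two poses p and p′, each with n joints p1,...,pn and p′1,...,p′n, the distance is the sum of the L2 distances between the corresponding joints pk=(xk,yk) and p′k=(x′k,y′k), normalized by the number of keypoints. Table 4 reports, averaged per image, the number of joints that the pose detector finds in the source motion image but fails to find in the generated image.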
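The two evaluation measures described above can be sketched in a few lines. This is a minimal illustration rather than the authors' evaluation code: the function names `pose_distance` and `count_missed` are hypothetical, poses are assumed to be lists of `(x, y)` tuples, and a joint the detector failed to find is assumed to be represented by `None`.

```python
import math


def pose_distance(p, q):
    """Mean L2 distance between corresponding joints of two poses.

    p and q are lists of (x, y) tuples with the same length n.
    """
    if len(p) != len(q) or not p:
        raise ValueError("poses must be non-empty and have the same number of joints")
    total = sum(math.hypot(px - qx, py - qy) for (px, py), (qx, qy) in zip(p, q))
    return total / len(p)


def count_missed(source_pose, generated_pose):
    """Count joints detected in the source pose but missing in the generated one.

    Assumption: a missing detection is stored as None.
    """
    return sum(
        1
        for s, g in zip(source_pose, generated_pose)
        if s is not None and g is None
    )
```

A low `pose_distance` between the conditioning pose and the pose re-detected on the generated frame suggests the body was synthesized in the intended configuration; `count_missed` would then be averaged over all frames of a video.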

Although our setup can produce plausible results in many cases, occasionally our results suffer from several issues. We show that our face GAN produces convincing facial features and improves upon the results of the full image GAN in our ablation studies detailed in Section 7.1. Our goal is therefore to discover an image-to-image translation. To transfer motion between two video subjects in a frame-by-frame manner, we must learn a mapping between images of the two individuals. This paper presents a simple method for "do as I do" motion transfer: given a source video of a person dancing, we can transfer that performance to a novel (amateur) target after only a few minutes of the target subject performing standard moves.

For the image translation stage of our pipeline, we adapt the architectures proposed by Wang et al. Of course, this technique is useful in practice: for example, it could directly drive the gestures of a virtual host and make live streams look more natural. We add two components to improve the quality of our results: To encourage the temporal smoothness of our generated videos, we condition the prediction at each frame on that of the previous time step.
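Conditioning each frame on the previous prediction can be sketched as a simple loop. The code below is an illustration under stated assumptions, not the released implementation: `generate_video`, `blank_like`, and `toy_generator` are hypothetical names, images are plain nested lists of floats, and `toy_generator` is only a stand-in for the trained network G. The first frame has no predecessor, so an all-zero image is used as a placeholder.

```python
def blank_like(image):
    """Return an all-zero image with the same shape as the given nested list."""
    return [[0.0 for _ in row] for row in image]


def generate_video(generator, poses):
    """Generate frames in order, feeding each output back as the next condition."""
    frames = []
    previous = None
    for pose in poses:
        if previous is None:
            # No earlier frame exists, so condition on a zero placeholder.
            previous = blank_like(pose)
        frame = generator(pose, previous)
        frames.append(frame)
        previous = frame
    return frames


def toy_generator(pose, previous):
    """Stand-in for G: blends the pose map with the previous frame (illustrative only)."""
    return [
        [0.5 * a + 0.5 * b for a, b in zip(pose_row, prev_row)]
        for pose_row, prev_row in zip(pose, previous)
    ]
```

Because every frame depends on the one before it, frames have to be produced sequentially at inference time; the loop makes that dependency explicit.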

Python 3.6. CUDA 9.0.176.

Again, scores are generally favorable for all ablations, although the full model with both the temporal smoothing and face GAN setups obtains the best scores with the biggest discrepancy in the face region.

The demo video, however, can only be reached through a VPN. In short, the method transfers motion between two people: the motion information is carried over to a target person.


To ensure the quality of the frames, we filmed our target subject for around 20 minutes of real time footage at 120 frames per second which is possible with some modern cell phone cameras. To calculate the scale, we cluster the heights around the minimum ankle position and the maximum ankle position and find the maximum height for each cluster for each video.

For example, Video Rewrite creates videos of a subject saying a phrase they did not originally utter by finding frames where the mouth position matches the desired speech. Previous work on image and video generation has proposed several deep generative models, including the popular generative adversarial networks.

To avoid dealing with missing detections (i.e. …

Even though we try to inject temporal coherence through our setup and presmoothing keypoints, our results often still suffer from jittering. Using pose detections as an intermediate representation between source and target, we learn a mapping from pose images to a target subject's appearance. With this aligned data we are able to learn an image-to-image translation model between pose stick figures and images of our target person in a supervised way.
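The text mentions pre-smoothing the pose keypoints but does not say which filter is used. As one possible reading, the sketch below applies a centered moving average to the trajectory of a single joint. The choice of a moving average, the function name `smooth_keypoints`, and the `window` parameter are all assumptions made for illustration.

```python
def smooth_keypoints(track, window=3):
    """Centered moving average over one joint's (x, y) positions across frames.

    Assumption: the window shrinks at the start and end of the track, so the
    output has the same length as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(track)):
        lo = max(0, i - half)
        hi = min(len(track), i + half + 1)
        n = hi - lo
        mean_x = sum(point[0] for point in track[lo:hi]) / n
        mean_y = sum(point[1] for point in track[lo:hi]) / n
        smoothed.append((mean_x, mean_y))
    return smoothed
```

A larger window suppresses more jitter but also blurs fast movements, so the value would need tuning against the footage.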

We pose this problem as a …

1. Required environment:


Qualitatively, the temporal smoothing setup helps with smooth motion, color consistency across frames, and also in individual frame synthesis. D attempts to distinguish between "real" image pairs (i.e. (pose stick figure x, ground truth image y)) and "fake" image pairs (i.e. (x, G(x))).


Scores on full images are even more similar between our ablations, as all ablations have no difficulty generating the static background. Therefore, our model is trained to produce personalized videos of a specific target subject.

"Everybody Dance Now", arXiv 1808.07371, 2018. Implementation accompanying paper: "Everybody dance now."


We characterize our transformation in terms of scale and translation in the y direction, which is calculated for each frame.

Everybody Dance Now.

Caroline Chan, Shiry Ginosar, Tinghui Zhou, Alexei A. Efros (UC Berkeley).

We adapt architectures from various models for different stages of our pipeline.

Another approach uses optical flow as a descriptor to match different subjects performing similar actions, allowing "Do as I do" and "Do as I say" retargeting. With our framework, we create a variety of videos, enabling untrained amateurs to spin and twirl like ballerinas, perform martial arts kicks or dance as vibrantly as pop stars.


This is achieved by disrupting deep neural network (DNN)-based face detection and facial landmark extraction method with specially designed imperceptible adversarial perturbations to reduce the quality of the detected faces.

The final output is the addition of the residual with the original face region, r + G(x)_F, and this change is reflected in the relevant region of the full image. In addition, we run the pose detector P on the outputs of each system, and compare these reconstructed keypoints to the pose detections of the original input video.
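Adding the face residual back into the full frame can be illustrated as follows. This is a sketch under simplifying assumptions: images are single-channel nested lists, the face region is an axis-aligned rectangle whose top-left corner is given by `top` and `left`, and the function name `apply_face_residual` is hypothetical.

```python
def apply_face_residual(full_image, residual, top, left):
    """Return a copy of full_image with the residual added inside the face region.

    full_image: output of the main generator, as a list of rows.
    residual:   face residual r, covering only the cropped face region.
    top, left:  position of the face crop inside the full image (assumed known).
    """
    output = [row[:] for row in full_image]  # leave the input untouched
    for i, residual_row in enumerate(residual):
        for j, value in enumerate(residual_row):
            output[top + i][left + j] = full_image[top + i][left + j] + value
    return output
```

Only pixels inside the crop change; the rest of the frame is passed through exactly as the main generator produced it.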

Full paper - https://arxiv.org/pdf/1808.07371.pdf

Website - https://carolineec.github.io/everybody_dance_now/

To extract pose keypoints for the body, face, and hands we use architectures provided by a state-of-the-art pose detector, OpenPose. Note the original_img is not necessary at test time and is provided only for reference.

The results of our ablation study are presented … However, we do not have corresponding pairs of images of the two subjects performing the same motions to supervise learning this translation directly. In order for the source pose to better align with the filming setup of the target, we apply a global pose normalization Norm to transform the source's original pose x′ to be more consistent with the poses in the target video x.
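One way to realize such a normalization is a linear mapping driven by the ankle position, giving a per-frame vertical translation and scale. The sketch below is an interpretation, not the authors' code: the function name `normalize_pose`, the dictionary keys, the use of subject heights at the far and close positions to derive the scale, and the choice to scale about the ankle point are all assumptions.

```python
def normalize_pose(pose, anchor, src, tgt):
    """Map a source pose into the target video's frame of reference.

    pose:   list of (x, y) joints for the current source frame.
    anchor: (x, y) of the source ankle in this frame; scaling is done about it.
    src, tgt: dicts with 'ankle_min', 'ankle_max' (ankle y far from and close to
              the camera) and 'height_far', 'height_close' (subject height at
              those two positions). These keys are hypothetical.
    """
    ankle_x, ankle_y = anchor
    span = src["ankle_max"] - src["ankle_min"]
    if span == 0:
        raise ValueError("source ankle range must be non-zero")
    # Relative position of the source ankle between its far and close extremes.
    t = (ankle_y - src["ankle_min"]) / span
    # Linear mapping of the ankle into the target's range (translation in y).
    target_ankle_y = tgt["ankle_min"] + t * (tgt["ankle_max"] - tgt["ankle_min"])
    # Interpolate the scale between the far and close height ratios.
    scale_far = tgt["height_far"] / src["height_far"]
    scale_close = tgt["height_close"] / src["height_close"]
    scale = scale_far + t * (scale_close - scale_far)
    return [
        (ankle_x + scale * (x - ankle_x), target_ankle_y + scale * (y - ankle_y))
        for x, y in pose
    ]
```

Computing `t` per frame is what makes the transformation frame-dependent: a dancer who moves toward the camera in the source video is scaled and shifted according to where the target subject would appear at the same relative depth.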

Ubuntu 18.04 (16.04 should also work).


Since normalized poses for transfer are often similar to those seen in training, we attribute this observation to the underlying difference between how our target and transfer subjects move given their unique body structure. Once the minimum and maximum ankle positions of each subject are found, we carry out a linear mapping between the minimum and maximum ankle positions of each video (i.e.

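So far we have obtained the pose through P, which we call x'. However, differences in the person's position in the frame and in body height are bound to affect the quality of the generated result. Here, the authors use two …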

Although our method is quite simple, it produces surprisingly compelling results (see video).

During training, we use corresponding (x,y) pairs to learn a mapping G which synthesizes images of the target person given pose stick figure x. Previous works have found that generating coherent raw audio waveforms with GANs is challenging.

This task is called human video generation. We therefore design our intermediate representation to be pose stick figures such as in Figure 2. We approach this problem as video-to-video translation using pose as an intermediate representation.


We also provide code for creating both training and testing datasets (including global pose normalization) in the data_prep folder.


Through adversarial training with discriminator D and a perceptual reconstruction loss dist using a pretrained VGGNet, we optimize the generator G to output images close to the ground truth target image y. Overall, our model is able to create reasonable and arbitrarily long videos of a target person dancing given body movements to follow through an input video of another subject dancing.

Image quality assessment: from error visibility to structural similarity.

