Activity analysis

HAKE
http://hake-mvig.cn/home/ Human Activity Knowledge Engine (HAKE) aims at promoting the human activity/action understanding. As a large-scale knowledge base, HAKE is built upon existing activity datasets, and supplies human instance action labels and corresponding body part level atomic action labels (Part States). With the power of HAKE, our baseline methods have outperformed state-of-the-art approaches on several wide-used activity and Human-Object Interaction benchmarks. HAKE is still under construction, we will keep enriching and enlarging it. And we hope the volunteers from all the communities to work with us. We will make the dataset publicly available this summer.

http://www-personal.umich.edu/~ywchao/hico/ data sets

https://github.com/DirtyHarryLYL/Transferable-Interactiveness-Network

https://github.com/DirtyHarryLYL/HAKE.git

https://github.com/ywchao/ho-rcnn

facebook
https://github.com/facebookresearch/ActivityNet-Entities This repo hosts the dataset and evaluation scripts used in our paper Grounded Video Description (GVD). We also released the source code of GVD in this repo.

ActivityNet-Entities, is based on the video description dataset ActivityNet Captions and augments it with 158k bounding box annotations, each grounding a noun phrase (NP). Here we release the complete set of NP-based annotations as well as the pre-processed object-based annotations.

https://github.com/facebookresearch/grounded-video-description based on https://github.com/jiasenlu/NeuralBabyTalk

https://github.com/jiasenlu/AdaptiveAttention Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as "the" and "of". Other words that may seem visual can often be predicted reliably just from the language model e.g., "sign" after "behind a red stop" or "phone" following "talking on a cell". In this paper, we propose a novel adaptive attention model with a visual sentinel. At each time step, our model decides whether to attend to the image (and if so, to which regions) or to the visual sentinel. The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation. We test our method on the COCO image captioning 2015 challenge dataset and Flickr30K. Our approach sets the new state-of-the-art by a significant margin.

https://cs.stanford.edu/people/ranjaykrishna/densevid/

image captioning
https://arxiv.org/abs/1906.06632 Image captioning has attracted considerable attention in recent years. However, little work has been done for game image captioning which has some unique characteristics and requirements. In this work we propose a novel game image captioning model which integrates bottom-up attention with a new multi-level residual top-down attention mechanism. Firstly, a lower-level residual top-down attention network is added to the Faster R-CNN based bottom-up attention network to address the problem that the latter may lose important spatial information when extracting regional features. Secondly, an upper-level residual top-down attention network is implemented in the caption generation network to better fuse the extracted regional features for subsequent caption prediction. We create two game datasets to evaluate the proposed model. Extensive experiments show that our proposed model outperforms existing baseline models.

movement prediction
https://github.com/Yijunmaverick/FlowGrounded-VideoPrediction

https://arxiv.org/pdf/1807.09755.pdf Abstract. Existing video prediction methods mainly rely on observing multiple historical frames or focus on predicting the next one-frame. In this work, we study the problem of generating consecutive multiple future frames by observing one single still image only. We formulate the multi-frame prediction task as a multiple time step ﬂow (multi-ﬂow) prediction phase followed by a ﬂow-to-frame synthesis phase. The multi-ﬂow prediction is modeled in a variational probabilistic manner with spatialtemporal relationships learned through 3D convolutions. The ﬂow-toframe synthesis is modeled as a generative process in order to keep the predicted results lying closer to the manifold shape of real video sequence. Such a two-phase design prevents the model from directly looking at the high-dimensional pixel space of the frame sequence and is demonstrated to be more eﬀective in predicting better and diverse results. Extensive experimental results on videos with diﬀerent types of motion show that the proposed algorithm performs favorably against existing methods in terms of quality, diversity and human perceptual evaluation. Keywords: Future prediction, conditional variational autoencoder, 3D convolutions.

https://github.com/anuragranj/end2end-spynet

https://arxiv.org/abs/1611.00850 We learn to compute optical flow by combining a classical spatial-pyramid formulation with deep learning. This estimates large motions in a coarse-to-fine approach by warping one image of a pair at each pyramid level by the current flow estimate and computing an update to the flow. Instead of the standard minimization of an objective function at each pyramid level, we train one deep network per level to compute the flow update. Unlike the recent FlowNet approach, the networks do not need to deal with large motions; these are dealt with by the pyramid. This has several advantages. First, our Spatial Pyramid Network (SPyNet) is much simpler and 96% smaller than FlowNet in terms of model parameters. This makes it more efficient and appropriate for embedded applications. Second, since the flow at each pyramid level is small (< 1 pixel), a convolutional approach applied to pairs of warped images is appropriate. Third, unlike FlowNet, the learned convolution filters appear similar to classical spatio-temporal filters, giving insight into the method and how to improve it. Our results are more accurate than FlowNet on most standard benchmarks, suggesting a new direction of combining classical flow methods with deep learning.

comp vision for video training
https://github.com/kenshohara/3D-ResNets-PyTorch The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels. Recently, the performance levels of 3D CNNs in the field of action recognition have improved significantly. However, to date, conventional research has only explored relatively shallow 3D architectures. We examine the architectures of various 3D CNNs from relatively shallow to very deep ones on current video datasets. Based on the results of those experiments, the following conclusions could be obtained: (i) ResNet-18 training resulted in significant overfitting for UCF-101, HMDB-51, and ActivityNet but not for Kinetics. (ii) The Kinetics dataset has sufficient data for training of deep 3D CNNs, and enables training of up to 152 ResNets layers, interestingly similar to 2D ResNets on ImageNet. ResNeXt-101 achieved 78.4% average accuracy on the Kinetics test set. (iii) Kinetics pretrained simple 3D architectures outperforms complex 2D architectures, and the pretrained ResNeXt-101 achieved 94.5% and 70.2% on UCF-101 and HMDB-51, respectively. The use of 2D CNNs trained on ImageNet has produced significant progress in various tasks in image. We believe that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet, and stimulate advances in computer vision for videos. The codes and pretrained models used in this study are publicly available. this https URL