Activity analysis

HAKE
http://hake-mvig.cn/home/ Human Activity Knowledge Engine (HAKE) aims to promote human activity/action understanding. As a large-scale knowledge base, HAKE is built upon existing activity datasets and supplies human instance action labels and corresponding body-part-level atomic action labels (Part States). With HAKE, our baseline methods have outperformed state-of-the-art approaches on several widely used activity and Human-Object Interaction benchmarks. HAKE is still under construction; we will keep enriching and enlarging it, and we hope volunteers from all communities will work with us. We will make the dataset publicly available this summer.

http://www-personal.umich.edu/~ywchao/hico/ HICO and HICO-DET datasets

https://github.com/DirtyHarryLYL/Transferable-Interactiveness-Network

https://github.com/DirtyHarryLYL/HAKE.git

https://github.com/ywchao/ho-rcnn

facebook
https://github.com/facebookresearch/ActivityNet-Entities This repo hosts the dataset and evaluation scripts used in our paper Grounded Video Description (GVD). We also released the source code of GVD in this repo.

ActivityNet-Entities is based on the video description dataset ActivityNet Captions and augments it with 158k bounding box annotations, each grounding a noun phrase (NP). Here we release the complete set of NP-based annotations as well as the pre-processed object-based annotations.

https://github.com/facebookresearch/grounded-video-description based on https://github.com/jiasenlu/NeuralBabyTalk

https://github.com/jiasenlu/AdaptiveAttention Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as "the" and "of". Other words that may seem visual can often be predicted reliably just from the language model e.g., "sign" after "behind a red stop" or "phone" following "talking on a cell". In this paper, we propose a novel adaptive attention model with a visual sentinel. At each time step, our model decides whether to attend to the image (and if so, to which regions) or to the visual sentinel. The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation. We test our method on the COCO image captioning 2015 challenge dataset and Flickr30K. Our approach sets the new state-of-the-art by a significant margin.
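
The sentinel idea in the abstract above can be sketched in a few lines. This is only an illustration, not the code from the AdaptiveAttention repo: the function names are made up here, and the attention scores are passed in as plain numbers, whereas in the paper they come from learned layers. One softmax is taken over the region scores plus one extra sentinel score; the sentinel's weight (called `beta` below) says how much the decoder relies on the language side rather than the image.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_context(region_feats, region_scores, sentinel, sentinel_score):
    """Mix image-region features with a sentinel vector.

    region_feats:   list of k feature vectors (lists of floats)
    region_scores:  k unnormalized attention scores (assumed given)
    sentinel:       sentinel vector, same dimension as the features
    sentinel_score: unnormalized score for the sentinel (assumed given)

    Returns (context, beta), where beta is the weight on the sentinel.
    """
    # One softmax over the k regions plus the sentinel slot.
    weights = softmax(list(region_scores) + [sentinel_score])
    beta = weights[-1]
    dim = len(sentinel)
    # Start from the sentinel contribution, then add weighted regions.
    context = [beta * sentinel[d] for d in range(dim)]
    for w, feat in zip(weights[:-1], region_feats):
        for d in range(dim):
            context[d] += w * feat[d]
    return context, beta
```

A high `sentinel_score` pushes `beta` toward 1, so the context is dominated by the sentinel; this corresponds to the "non-visual word" case such as "the" or "of".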

https://cs.stanford.edu/people/ranjaykrishna/densevid/

image captioning
https://arxiv.org/abs/1906.06632 Image captioning has attracted considerable attention in recent years. However, little work has been done for game image captioning which has some unique characteristics and requirements. In this work we propose a novel game image captioning model which integrates bottom-up attention with a new multi-level residual top-down attention mechanism. Firstly, a lower-level residual top-down attention network is added to the Faster R-CNN based bottom-up attention network to address the problem that the latter may lose important spatial information when extracting regional features. Secondly, an upper-level residual top-down attention network is implemented in the caption generation network to better fuse the extracted regional features for subsequent caption prediction. We create two game datasets to evaluate the proposed model. Extensive experiments show that our proposed model outperforms existing baseline models.

movement prediction
https://github.com/Yijunmaverick/FlowGrounded-VideoPrediction

https://arxiv.org/pdf/1807.09755.pdf Abstract. Existing video prediction methods mainly rely on observing multiple historical frames or focus on predicting the next one-frame. In this work, we study the problem of generating consecutive multiple future frames by observing one single still image only. We formulate the multi-frame prediction task as a multiple time step flow (multi-flow) prediction phase followed by a flow-to-frame synthesis phase. The multi-flow prediction is modeled in a variational probabilistic manner with spatial-temporal relationships learned through 3D convolutions. The flow-to-frame synthesis is modeled as a generative process in order to keep the predicted results lying closer to the manifold shape of real video sequence. Such a two-phase design prevents the model from directly looking at the high-dimensional pixel space of the frame sequence and is demonstrated to be more effective in predicting better and diverse results. Extensive experimental results on videos with different types of motion show that the proposed algorithm performs favorably against existing methods in terms of quality, diversity and human perceptual evaluation. Keywords: Future prediction, conditional variational autoencoder, 3D convolutions.

https://github.com/anuragranj/end2end-spynet

https://arxiv.org/abs/1611.00850 We learn to compute optical flow by combining a classical spatial-pyramid formulation with deep learning. This estimates large motions in a coarse-to-fine approach by warping one image of a pair at each pyramid level by the current flow estimate and computing an update to the flow. Instead of the standard minimization of an objective function at each pyramid level, we train one deep network per level to compute the flow update. Unlike the recent FlowNet approach, the networks do not need to deal with large motions; these are dealt with by the pyramid. This has several advantages. First, our Spatial Pyramid Network (SPyNet) is much simpler and 96% smaller than FlowNet in terms of model parameters. This makes it more efficient and appropriate for embedded applications. Second, since the flow at each pyramid level is small (< 1 pixel), a convolutional approach applied to pairs of warped images is appropriate. Third, unlike FlowNet, the learned convolution filters appear similar to classical spatio-temporal filters, giving insight into the method and how to improve it. Our results are more accurate than FlowNet on most standard benchmarks, suggesting a new direction of combining classical flow methods with deep learning.
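
The coarse-to-fine loop described in the abstract above can be sketched on 1D signals. This is a toy illustration, not the SPyNet code: all function names are invented for this note, the per-level network is replaced by a caller-supplied `estimate_update` callback, and the signal length is assumed to be divisible by 2 ** (levels - 1). At each level the flow from the coarser level is upsampled (and doubled, since displacements double at twice the resolution), the second signal is warped by it, and a residual update is added.

```python
import math


def downsample(sig):
    """Halve the resolution by averaging neighbouring pairs."""
    return [(sig[2 * i] + sig[2 * i + 1]) / 2 for i in range(len(sig) // 2)]


def upsample_flow(flow):
    """Double the resolution; displacements also double in magnitude."""
    out = []
    for f in flow:
        out.extend([2 * f, 2 * f])
    return out


def warp(sig, flow):
    """Sample sig at x + flow[x] with linear interpolation, clamped to the ends."""
    n = len(sig)
    out = []
    for x in range(n):
        pos = min(max(x + flow[x], 0.0), n - 1.0)
        i0 = int(math.floor(pos))
        i1 = min(i0 + 1, n - 1)
        t = pos - i0
        out.append((1 - t) * sig[i0] + t * sig[i1])
    return out


def coarse_to_fine(img1, img2, levels, estimate_update):
    """Run the pyramid loop; estimate_update stands in for the per-level network.

    estimate_update(img1_level, warped_img2_level, flow) must return a list
    of residual flow values with the same length as flow.
    """
    pyr1, pyr2 = [img1], [img2]
    for _ in range(levels - 1):
        pyr1.append(downsample(pyr1[-1]))
        pyr2.append(downsample(pyr2[-1]))

    flow = [0.0] * len(pyr1[-1])  # zero flow at the coarsest level
    for level in range(levels - 1, -1, -1):
        if level != levels - 1:
            flow = upsample_flow(flow)
        warped = warp(pyr2[level], flow)
        update = estimate_update(pyr1[level], warped, flow)
        flow = [f + u for f, u in zip(flow, update)]
    return flow
```

Because each level only has to explain the residual left after warping, the update at every level can stay small even when the total motion is large, which is the point the abstract makes about not needing to handle large motions directly.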