ACMMM2017

Abstract:
Convolutional Neural Network (CNN) based methods have shown significant performance gains in visual tracking in recent years. However, because objects undergo many unpredictable changes online, such as abrupt motion, background clutter and large deformation, visual tracking remains a challenging task. We propose a novel algorithm, namely Deep Location-Specific Tracking, which decomposes the tracking problem into a localization task and a classification task, and trains an individual network for each task. The localization network exploits the information in the current frame and provides a specific location to improve the probability of successful tracking, while the classification network finds the target among many examples generated around the target location in the previous frame, as well as the one estimated from the localization network in the current frame. CNN based trackers often have a massive number of trainable parameters and are prone to over-fitting to particular object states, leading to lower precision or tracking drift. We address this problem by learning a classification network based on 1 × 1 convolution and global average pooling. Extensive experimental results on popular benchmark datasets show that the proposed tracker achieves competitive results without using additional tracking videos for fine-tuning. The code is available at https://github.com/ZjjConan/DLST
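The classification head mentioned in this abstract can be illustrated with a minimal sketch. It assumes a feature map stored as nested lists and uses made-up weights; the names `conv1x1` and `global_average_pool` are illustrative and are not taken from the DLST code. A 1 × 1 convolution mixes channels independently at each pixel, and global average pooling reduces each output channel to a single score, so the head needs only one weight per (output channel, input channel) pair plus biases.

```python
def conv1x1(features, weights, biases):
    """1x1 convolution: mix channels independently at every pixel.

    features: [channel][row][col]; weights: [out_channel][in_channel].
    """
    height, width = len(features[0]), len(features[0][0])
    return [
        [[bias + sum(w * features[c][y][x] for c, w in enumerate(w_row))
          for x in range(width)]
         for y in range(height)]
        for w_row, bias in zip(weights, biases)
    ]


def global_average_pool(features):
    """Collapse each channel to its spatial mean, giving one score per channel."""
    return [sum(map(sum, plane)) / (len(plane) * len(plane[0])) for plane in features]


# Toy 2-channel, 2x2 feature map and hypothetical weights for two output scores.
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[0.0, 1.0], [0.0, 1.0]]]
weights = [[1.0, 0.0], [0.0, 2.0]]
biases = [0.0, 1.0]
scores = global_average_pool(conv1x1(features, weights, biases))
print(scores)
```

Because the pooled score does not depend on where in the map a response occurs, such a head has far fewer parameters than a fully connected classifier over the same feature map.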

Abstract:
Weather recognition is important in practice, but this task has not been thoroughly explored so far. The current trend is to treat it as a single classification problem, i.e., determining whether a given image belongs to a certain weather category or not. However, weather recognition differs significantly from traditional image classification, since several weather features may appear simultaneously. In this case, a simple classification result is insufficient to describe the weather condition. To address this issue, we propose to provide auxiliary weather-related information for comprehensive weather description. Specifically, semantic segmentation of weather-cues, such as blue sky and white clouds, is exploited as an auxiliary task in this paper. Moreover, a convolutional neural network (CNN) based multi-task framework is developed which aims to concurrently tackle the weather category classification task and the weather-cues segmentation task. Due to the intrinsic relationships between these two tasks, exploring auxiliary semantic segmentation of weather-cues can also help to learn discriminative features for the classification task, and thus obtain superior accuracy. To verify the effectiveness of the proposed approach, extra segmentation masks of weather-cues are generated manually on an existing weather image dataset. Experimental results have demonstrated the superior performance of our approach. The enhanced dataset, source code and pre-trained models are available at https://github.com/wzgwzg/Multitask_Weather.

Abstract:
In this paper we present a novel network architecture, called Multi-Scale Cascade Network (MSC-Net), to identify the most visually conspicuous objects in an image. Our network consists of several stages (sub-networks) for handling saliency detection across different scales. All these sub-networks form a cascade structure (in a coarse-to-fine manner) where the same underlying convolutional feature representations are fully shared. Compared with existing CNN-based saliency models, the MSC-Net can naturally enable the learning process in the finer cascade stages to encode more global contextual information while progressively incorporating the saliency prior knowledge obtained from coarser stages and thus lead to better detection accuracy. We also design a novel refinement module to further filter out errors by considering the intermediate feedback information. Our MSC-Net is highly integrated, end-to-end trainable, and very powerful. The proposed method achieves state-of-the-art performance on five widely-used salient object detection benchmarks, outperforming existing methods and also maintaining high efficiency. Code and pre-trained models are available at https://github.com/lixin666/MSC-NET.

Abstract:
Recently, deep neural networks (DNNs) have been drawing a lot of attention because of their applications. However, they require substantial computational resources and a laborious process to set up an execution environment based on hardware acceleration such as GPGPU. Therefore, providing DNN applications to end-users is very hard. To solve this problem, we have developed an installation-free web browser-based DNN execution framework, WebDNN. WebDNN optimizes the trained DNN model to compress model data and accelerate the execution. It executes the DNN model with a novel JavaScript API to achieve zero-overhead execution. Empirical evaluations show that it achieves an acceleration of more than two hundred times. WebDNN is an open source framework and you can download it from https://github.com/mil-tokyo/webdnn.

Abstract:
Binary Neural Networks (BNNs) can drastically reduce memory size and accesses by applying bit-wise operations instead of standard arithmetic operations. They can therefore significantly improve efficiency and lower energy consumption at runtime, which enables the application of state-of-the-art deep learning models on low power devices. BMXNet is an open-source BNN library based on MXNet, which supports both XNOR-Networks and Quantized Neural Networks. The developed BNN layers can be seamlessly applied with other standard library components and work in both GPU and CPU mode. BMXNet is maintained and developed by the multimedia research group at Hasso Plattner Institute and released under the Apache license. Extensive experiments validate the efficiency and effectiveness of our implementation. The BMXNet library, several sample projects, and a collection of pre-trained binary deep models are available for download at https://github.com/hpi-xnor.
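The replacement of arithmetic by bit-wise operations can be sketched in a few lines. This is a generic illustration of the XNOR-and-popcount dot product for vectors with entries in {-1, +1}, not BMXNet's implementation; the helper names `pack_signs` and `binary_dot` are hypothetical.

```python
def pack_signs(values):
    """Pack a vector of signs into an int: bit i is 1 when values[i] is positive."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two {-1, +1} vectors of length n from their packed bits."""
    mask = (1 << n) - 1
    # XNOR marks positions where the signs agree; counting them is a popcount.
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")
    # Each agreement contributes +1 and each disagreement -1.
    return 2 * matches - n


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
packed = binary_dot(pack_signs(a), pack_signs(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
print(packed, reference)
```

The packed form stores one bit per weight or activation and replaces the multiply-accumulate loop with one XOR, one NOT and one bit count per machine word.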

Abstract:
An interactive installation exploring the moment where an individual, through sight, touch, and sound, can identify and build a relationship with the almost indescribable processes of "machining". By executing the physicality of code and examining the phantom-objectivity of that code, one can connect to the cycle between the digital and the physical, finding exactly where the technological identity is located.

Abstract:
Cross-modal retrieval aims to enable a flexible retrieval experience across different modalities (e.g., texts vs. images). The core of cross-modal retrieval research is to learn a common subspace where the items of different modalities can be directly compared to each other. In this paper, we present a novel Adversarial Cross-Modal Retrieval (ACMR) method, which seeks an effective common subspace based on adversarial learning. Adversarial learning is implemented as an interplay between two processes. The first process, a feature projector, tries to generate a modality-invariant representation in the common subspace and to confuse the other process, a modality classifier, which tries to discriminate between different modalities based on the generated representation. We further impose triplet constraints on the feature projector in order to minimize the gap among the representations of all items from different modalities with the same semantic labels, while maximizing the distances among semantically different images and texts. Through the joint exploitation of the above, the underlying cross-modal semantic structure of multimedia data is better preserved when this data is projected into the common subspace. Comprehensive experimental results on four widely used benchmark datasets show that the proposed ACMR method is superior in learning an effective subspace representation and that it significantly outperforms state-of-the-art cross-modal retrieval methods.

Abstract:
One of the fundamental taboos of modern virtual reality (VR) is invading or occupying another person's personal virtual space. À Quatre Mains aims to tackle this user experience puzzle by giving a purpose to the shared space. Inspired by the connection of musicians playing together in a small area, À Quatre Mains takes the idea of four-hand piano playing to another level and proposes an immersive instrument played by multiple users in the same shared space. In the À Quatre Mains VR experience, the musicians are presented with a half dome whose structure is made up of multiple hexagonal sections. Each hexagon is a synthesizer featuring multiple modes. Using the tracked hand controls available to current high-end consumer VR systems, the synthesizers can be triggered by any participant. This gives the musicians two options: ignore each other and suffer possible cacophony, or observe the other and create music. This encourages the performers to explore other ways of connecting through sound, space and the magic that is VR.

Abstract:
In this panel, we attempt to review and discuss the recent theoretical and technological advances and trends in cross-media. Integrating data-driven machine learning with human knowledge can effectively lead to explainable, robust, and general models. Thus, effectively employing the interaction between cross-media data during inference and reasoning becomes a challenge in populating the cross-media knowledge graph. Some other fundamental and controversial issues, such as leveraging auxiliary information to boost cross-media understanding and the existence of a unified framework to bridge the gap between modalities, will also be discussed in this panel.

Abstract:
This work is concerned with the protection of intellectual property rights and privacy against unpermitted uses of digital cameras. A technique of multispectral coded illumination (MSCI) with LED lights is proposed to defeat cameras capturing indoor scenes by inducing annoying color artifacts into the acquired images or video frames. The main idea of MSCI is to temporally modulate LED lights of different colors at certain frequencies so that they interfere with the rolling shutter of the camera, but at the same time the coded illumination appears to human eyes the same as steady white lighting.

Abstract:
This work makes the first attempt to generate an articulated human motion sequence from a single image. On the one hand, we utilize paired inputs, including human skeleton information as motion embedding and a single human image as appearance reference, to generate novel motion frames based on the conditional GAN infrastructure. On the other hand, a triplet loss is employed to pursue appearance smoothness between consecutive frames. As the proposed framework is capable of jointly exploiting the image appearance space and the articulated/kinematic motion space, it generates realistic articulated motion sequences, in contrast to most previous video generation methods, which yield blurred motion effects. We test our model on two human action datasets, KTH and Human3.6M, and the proposed framework generates very promising results on both datasets.

Abstract:
Fine-Grained Visual Categorization (FGVC) has achieved significant progress recently. However, the number of fine-grained species could be huge and dynamically increasing in real scenarios, making it difficult to recognize unseen objects under the current FGVC framework. This raises the open issue of performing large-scale fine-grained identification without a complete training set. To address this issue, we propose a retrieval task named One-Shot Fine-Grained Instance Retrieval (OSFGIR). "One-Shot" denotes the ability to identify unseen objects through a fine-grained retrieval task assisted by an incomplete auxiliary training set. This paper first presents a detailed description of the OSFGIR task and our collected OSFGIR-378K dataset. Next, we propose the Convolutional and Normalization Networks (CN-Nets), learned on the auxiliary dataset, to generate a concise and discriminative representation. Finally, we present a coarse-to-fine retrieval framework consisting of three components: coarse retrieval, fine-grained retrieval, and query expansion. The framework progressively retrieves images with similar semantics and performs fine-grained identification. Experiments show that our OSFGIR framework achieves significantly better accuracy and efficiency than existing FGVC and image retrieval methods, and could thus be a better solution for large-scale fine-grained object identification.

Abstract:
This work focuses on Magic-wall, an automatic system for visualizing the effect of room decoration. Given an image of an indoor scene and a preferred color, Magic-wall can automatically locate the wall regions in the image and smoothly replace the existing color with the required one. The key idea of the proposed Magic-wall is to leverage visual semantics to guide the entire process of color substitution, including wall segmentation and color replacement. We propose an edge-aware fully convolutional neural network (FCN) for indoor semantic scene parsing, in which a novel edge-prior branch is introduced to better identify the boundaries of different semantic regions. To accurately localize the wall regions, we adopt a semantic-dependent optimization strategy, which pays more attention to pixels belonging to the wall by assigning them larger optimization weights than those from other semantic regions. Finally, to naturally replace the color of the original walls, a simple yet effective color space conversion method is proposed for replacement with brightness preservation. We build a new indoor scene dataset upon ADE20K for training and testing, which includes 6 semantic labels. Extensive experimental evaluations and visualizations demonstrate that the proposed Magic-wall is effective and can automatically generate a set of visually pleasing results.
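The final recoloring step can be illustrated with a small sketch. The abstract does not say which color space is used, so this example assumes HSV: hue and saturation come from the requested color, while each wall pixel keeps its own value (brightness) so that shading survives. The function names and the boolean wall mask are hypothetical stand-ins for the segmentation output.

```python
import colorsys


def recolor_pixel(pixel_rgb, target_rgb):
    """Take hue and saturation from the target color, keep the pixel's own brightness."""
    _, _, value = colorsys.rgb_to_hsv(*pixel_rgb)
    hue, saturation, _ = colorsys.rgb_to_hsv(*target_rgb)
    return colorsys.hsv_to_rgb(hue, saturation, value)


def recolor_wall(image, wall_mask, target_rgb):
    """Recolor only the pixels flagged as wall; RGB components are floats in [0, 1]."""
    return [
        [recolor_pixel(px, target_rgb) if is_wall else px
         for px, is_wall in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, wall_mask)
    ]


# One row: a mid-gray wall pixel and a non-wall pixel that must stay untouched.
image = [[(0.5, 0.5, 0.5), (0.2, 0.3, 0.4)]]
wall_mask = [[True, False]]
result = recolor_wall(image, wall_mask, (1.0, 0.0, 0.0))
print(result)
```

A mid-gray wall pixel becomes a half-bright red rather than full red, which is the behavior one would expect from a brightness-preserving replacement.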

Abstract:
The region-based image retrieval (RBIR) technique is revisited. In early attempts at RBIR in the late 90s, researchers found many ways to specify region-based queries and spatial relationships; however, the ways to characterize the regions, such as color histograms, were very poor at that time. Here, we revisit RBIR by incorporating semantic specification of objects and intuitive specification of spatial relationships. Our contributions are the following. First, to support multiple aspects of semantic object specification (category, instance, and attribute), we propose a multitask CNN feature that allows us to use deep learning techniques and to jointly handle multi-aspect object specification. Second, to help users specify spatial relationships among objects in an intuitive way, we propose techniques for recommending spatial relationships. In particular, by mining the search results, a system can recommend feasible spatial relationships among the objects. The system can also recommend likely spatial relationships from the assigned object category names, based on a language prior. Moreover, object-level inverted indexing supports very fast shortlist generation, and re-ranking based on spatial constraints provides users with an instant RBIR experience.

Abstract:
Pl@ntNet is a world-scale participatory platform and information system dedicated to the monitoring of plant biodiversity through image-based plant identification. The mobile front-end of Pl@ntNet has now been downloaded by more than 4 million users in about 170 countries, and an active community of contributors produces and revises new observations every day (it is often referred to as the "shazam of plants"). This paper presents a business proposal allowing enterprises or organizations to set up their own private collaborative workflow within the Pl@ntNet information system. The main added value is to allow them to work on their own business object (e.g. plant disease diagnostic, deficiency measurements, railway lines maintenance, fishery surveillance, etc.) and with their own community of contributors and end-users (employees, sales representatives, clients, observers network, etc.). This business idea answers a growing demand in agriculture and environmental economics. Actors in these domains are beginning to realize that machine learning techniques are mature enough, but the lack of training data and of efficient tools to collect them is a major obstacle. A collaborative platform like Pl@ntNet is the ideal tool to bridge this gap. It initiates a powerful positive feedback loop, boosting the production of training data while improving the work of the employees.

Abstract:
Existing generative adversarial network (GAN) methods use different criteria to distinguish between real and fake samples, such as probability [9], energy [44], or other losses [30]. In this paper, by employing the merits of deep metric learning, we propose a novel metric-based generative adversarial network (MBGAN), which uses a distance criterion to distinguish between real and fake samples. Specifically, the discriminator of MBGAN adopts a triplet structure and learns a deep nonlinear transformation, which maps input samples into a new feature space. In the transformed space, the distance between real samples is minimized, while the distance between real and fake samples is maximized. Similar to the adversarial procedure of existing GANs, a generator is trained to produce synthesized examples that are close to real examples, while a discriminator is trained to maximize the distance between real and fake samples up to a large margin. Meanwhile, instead of using a fixed margin, we adopt a data-dependent margin [30], so that the generator can focus on improving the synthesized samples with poor quality, instead of wasting energy on well-produced samples. Our proposed method is verified on various benchmarks, such as CIFAR-10, SVHN and CelebA, and generates high-quality samples.
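The distance criterion with a triplet structure can be sketched as a standard hinge-style triplet loss. This is a generic illustration and an assumption about the general shape of the objective, not the MBGAN loss itself: the embeddings are plain lists standing in for the discriminator's transformed features, and the margin is passed in as a number because the abstract does not give the form of the data-dependent margin.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor_real, other_real, fake, margin):
    """Hinge triplet loss: pull the two real embeddings together and push the
    fake embedding at least `margin` farther from the anchor than the real one."""
    d_real = euclidean(anchor_real, other_real)
    d_fake = euclidean(anchor_real, fake)
    return max(0.0, d_real - d_fake + margin)


anchor, positive, negative = [0.0, 0.0], [3.0, 4.0], [6.0, 8.0]
satisfied = triplet_loss(anchor, positive, negative, margin=2.0)
violated = triplet_loss(anchor, positive, negative, margin=7.0)
print(satisfied, violated)
```

With a small margin the triplet already satisfies the constraint and contributes no loss; with a larger margin the same triplet is penalized, which is how a per-sample margin could shift training effort toward poorly separated samples.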

Abstract:
Image color editing techniques such as color transfer, HDR tone mapping, dehazing, and white balance have been widely used and investigated in recent decades. However, naively applying them to videos frame-by-frame often leads to flickering or color inconsistency. To solve this problem in general, earlier methods rely on temporal filtering or warping from the previous frame, but they still fail in cases of occlusion and produce blurry results. We introduce a new framework for these challenges: (1) We develop an online keyframe strategy to keep track of the dynamic objects, so that more temporal information can be acquired than from a single previous frame. (2) To preserve image details, a local color affine model is employed. The main concept of this post-processing step is to capture the color transformation from editing algorithms while maintaining the detail structures of the raw image. Practically, our approach takes a raw video and its per-frame processed version, and generates a temporally consistent output. In addition, we propose a video quality metric to evaluate temporal coherence. Extensive experiments and a subjective test show the superiority of the proposed framework with respect to color fidelity, detail preservation, and temporal consistency.

Abstract:
Traditionally, kernel learning methods require positive definiteness of the kernel, which is too strict and excludes many sophisticated similarities that are indefinite. To utilize those indefinite kernels, indefinite learning methods are of great interest. This paper aims at the extension of logistic regression from positive definite kernels to indefinite ones. The proposed model, named indefinite kernel logistic regression (IKLR), stays consistent with regular KLR in formulation, but it essentially becomes non-convex. Thanks to the positive decomposition of an indefinite kernel, IKLR can be transformed into a difference of two convex models, which allows the use of the concave-convex procedure. Moreover, aiming at large-scale problems in practice, a concave-inexact-convex procedure (CCICP) algorithm with an inexact solving scheme is proposed with convergence guarantees. Experimental results on multi-modal datasets demonstrate the superiority of the proposed IKLR model over kernel logistic regression with positive definite kernels and other state-of-the-art indefinite learning based methods.

Abstract:
We present an end-to-end system for streaming Cinematic Virtual Reality (VR) content (also called 360 or omnidirectional content). Content is captured and ingested at a resolution of 16K at 25 Hz and streamed towards untethered mobile VR devices. Besides the usual navigation interactions such as panning and tilting offered by common VR systems, we also provide zooming. This allows the VR client to fetch high quality pixels captured at a spatial resolution of 16K, which greatly increases perceived quality compared to a 4K VR streaming solution. Since current client devices are not capable of receiving and decoding a 16K video, several optimizations are provided to stream only the pixels required for the user's current viewport, while meeting the strict latency and bandwidth requirements of a high-quality immersive VR experience.
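Streaming only the pixels needed for the current viewport is commonly done by cutting the panorama into tiles and requesting the tiles the viewport overlaps. The sketch below is a simplified illustration of that selection step and not the system described here: it assumes an equirectangular frame split into a regular grid and approximates the viewport as a rectangle in yaw/pitch space. The function name and grid layout are hypothetical.

```python
import math


def visible_tiles(yaw_deg, pitch_deg, hfov_deg, vfov_deg, cols, rows):
    """Return sorted (row, col) indices of grid tiles overlapping the viewport.

    Columns cover yaw 0..360 degrees; rows cover pitch +90 (row 0) down to -90.
    """
    tile_w, tile_h = 360.0 / cols, 180.0 / rows
    # The horizontal extent may wrap around the 360-degree seam.
    first_col = math.floor((yaw_deg - hfov_deg / 2.0) / tile_w)
    last_col = math.ceil((yaw_deg + hfov_deg / 2.0) / tile_w) - 1
    col_set = {c % cols for c in range(first_col, last_col + 1)}
    # The vertical extent is clamped at the poles.
    top = min(90.0, pitch_deg + vfov_deg / 2.0)
    bottom = max(-90.0, pitch_deg - vfov_deg / 2.0)
    first_row = max(0, math.floor((90.0 - top) / tile_h))
    last_row = min(rows - 1, math.ceil((90.0 - bottom) / tile_h) - 1)
    return sorted((r, c) for r in range(first_row, last_row + 1) for c in col_set)


# 8 x 4 grid of 45-degree tiles, user looking straight ahead across the seam.
front = visible_tiles(0.0, 0.0, 90.0, 90.0, cols=8, rows=4)
print(front, len(front), "of", 8 * 4, "tiles")
```

In this toy setting a 90-degree viewport touches 4 of 32 tiles; zooming in would narrow the field of view, which leaves bandwidth for fetching those few tiles at a higher resolution.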

Abstract:
As a bridge to connect vision and language, visual relations between objects in the form of relation triplets ⟨subject, predicate, object⟩, such as "person-touch-dog" and "cat-above-sofa", provide a more comprehensive visual content understanding beyond objects. In this paper, we propose a novel vision task named Video Visual Relation Detection (VidVRD) to perform visual relation detection in videos instead of still images (ImgVRD). Compared to still images, videos provide a more natural set of features for detecting visual relations, such as dynamic relations like "A-follow-B" and "A-towards-B", and temporally changing relations like "A-chase-B" followed by "A-hold-B". However, VidVRD is technically more challenging than ImgVRD due to the difficulties in accurate object tracking and diverse relation appearances in the video domain. To this end, we propose a VidVRD method, which consists of object tracklet proposal, short-term relation prediction and greedy relational association. Moreover, we contribute the first dataset for VidVRD evaluation, which contains 1,000 videos with manually labeled visual relations, to validate our proposed method. On this dataset, our method achieves the best performance in comparison with the state-of-the-art baselines.

Abstract:
Image filtering is helpful to numerous multimedia, computer vision and graphics tasks. Linear translation-invariant filters with manually designed kernels have been widely used. However, their performance suffers from content-blindness, that is, they treat noise, textures and structures identically. To mitigate content-blindness, a family of filters called joint/guided filters has attracted much attention from the community; their principle is to transfer the structure in the reference image to the target one. The main drawback of most joint/guided filters is that they ignore structural inconsistency between the reference and target signals, which can be, for example, color, infrared and depth images captured under different conditions. Simply adopting such guidance very likely leads to unsatisfactory results. To address the above issues, this paper designs a simple yet effective filter, named mutually guided image filter (muGIF), which jointly preserves mutual structures, avoids being misled by inconsistent structures and smooths flat regions. The proposed muGIF is very flexible and can operate in one of three modes: dynamic only (self-guided), static/dynamic and dynamic/dynamic. Although the objective of muGIF is non-convex in nature, by subtly decomposing the objective, we can solve it effectively and efficiently. The advantages of muGIF in terms of effectiveness and flexibility are demonstrated over other state-of-the-art alternatives on a variety of applications.

Abstract:
Recently, deep neural network based hashing methods have greatly improved multimedia retrieval performance by simultaneously learning feature representations and binary hash functions. Inspired by the latest advances in asymmetric hashing schemes, in this work we propose a novel Deep Asymmetric Pairwise Hashing approach (DAPH) for supervised hashing. The core idea is that two deep convolutional models are jointly trained such that their output codes for a pair of images reveal the similarity indicated by their semantic labels. A pairwise loss is carefully designed to preserve the pairwise similarities between images and to incorporate the independence and balance criteria for hash code learning. By taking advantage of the flexibility of asymmetric hash functions, we devise an efficient alternating algorithm to jointly optimize the asymmetric deep hash functions and high-quality binary codes. Experiments on three image benchmarks show that DAPH achieves state-of-the-art performance on large-scale image retrieval.

Abstract:
Image captioning has so far been explored mostly in English, as most available datasets are in this language. However, the application of image captioning should not be restricted by language. Only a few studies have been conducted on image captioning in a cross-lingual setting. Unlike these works, which manually build a dataset for a target language, we aim to learn a cross-lingual captioning model entirely from machine-translated sentences. To overcome the lack of fluency in the translated sentences, we propose in this paper a fluency-guided learning framework. The framework comprises a module to automatically estimate the fluency of the sentences and another module to utilize the estimated fluency scores to effectively train an image captioning model for the target language. As experiments on two bilingual (English-Chinese) datasets show, our approach improves both the fluency and the relevance of the generated captions in Chinese, without using any manually written sentences from the target language.

Abstract:
Neural Style Transfer based on Convolutional Neural Networks (CNN) aims to synthesize a new image that retains the high-level structure of a content image, rendered in the low-level texture of a style image. This is achieved by constraining the new image to have high-level CNN features similar to the content image, and lower-level CNN features similar to the style image. However, in the traditional optimization objective, low-level features of the content image are absent, and the low-level features of the style image dominate the low-level detail structures of the new image. Hence, in the synthesized image, many details of the content image are lost, and many inconsistent and unpleasing artifacts appear. As a remedy, we propose to steer image synthesis with a novel loss function: the Laplacian loss. The Laplacian matrix ("Laplacian" for short), produced by a Laplacian operator, is widely used in computer vision to detect edges and contours. The Laplacian loss measures the difference of the Laplacians, and correspondingly the difference of the detail structures, between the content image and a new image. It is flexible and compatible with the traditional style transfer constraints. By incorporating the Laplacian loss, we obtain a new optimization objective for neural style transfer named Lapstyle. Minimizing this objective produces a stylized image that better preserves the detail structures of the content image and eliminates the artifacts. Experiments show that Lapstyle produces more appealing stylized images with fewer artifacts, without compromising their "stylishness".
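The idea of comparing Laplacians can be sketched on small grayscale images stored as nested lists. The 4-neighbor kernel and the sum of squared differences used below are assumptions made for illustration; the abstract does not specify the exact kernel, any pooling before filtering, or the weighting used in Lapstyle.

```python
# Common 4-neighbor discrete Laplacian kernel (an assumption for this sketch).
LAPLACIAN_KERNEL = [[0, 1, 0],
                    [1, -4, 1],
                    [0, 1, 0]]


def laplacian(image):
    """Filter a grayscale image (list of rows) with the 3x3 kernel, interior pixels only."""
    height, width = len(image), len(image[0])
    return [
        [sum(LAPLACIAN_KERNEL[dy][dx] * image[y + dy - 1][x + dx - 1]
             for dy in range(3) for dx in range(3))
         for x in range(1, width - 1)]
        for y in range(1, height - 1)
    ]


def laplacian_loss(content, stylized):
    """Sum of squared differences between the Laplacians of two same-sized images."""
    lap_c, lap_s = laplacian(content), laplacian(stylized)
    return sum((a - b) ** 2
               for row_c, row_s in zip(lap_c, lap_s)
               for a, b in zip(row_c, row_s))


flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
spike = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(laplacian(flat), laplacian(spike), laplacian_loss(flat, spike))
```

A flat region has zero Laplacian regardless of its brightness, so the loss responds to differences in edges and fine detail rather than to overall color, which leaves the style terms free to change color and texture.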

Abstract:
Automatically describing videos with natural language is a crucial challenge of video understanding. Compared to images, videos have a specific spatial-temporal structure and information from various modalities. In this paper, we propose a Multirate Multimodal Approach for video captioning. Considering that the speed of motion in videos varies constantly, we utilize a Multirate GRU to capture the temporal structure of videos. It encodes video frames at different intervals and has a strong ability to deal with motion speed variance. As videos contain cues from different modalities, we design a dedicated multimodal fusion method. By incorporating visual, motion, and topic information together, we construct a well-designed video representation. The video representation is then fed into an RNN-based language model to generate natural language descriptions. We evaluate our approach for video captioning on "Microsoft Research - Video to Text" (MSR-VTT), a large-scale video benchmark for video understanding. Our approach also performs strongly in the 2nd MSR Video to Language Challenge.

Abstract:
Social multimedia refers to the multimedia content (text, images, and videos) generated by social network users for social interactions. The increasing popularity of online social networks leads to a significant amount of multimedia content generated by online social network users. Researchers from both industry and academia have been working on a broad range of projects related to analyzing and understanding online multimedia content, including real world activity prediction and content recommendation. In particular, understanding online users' opinions or sentiments is a fundamental task that can benefit many applications, such as political campaigning and commercial marketing. We present a few recent advances in social multimedia sentiment analysis. Specifically, this tutorial consists of three parts. The first part is on visual sentiment analysis. We will introduce the task of visual sentiment analysis, its main challenges, and the state-of-the-art approaches. We will include several representative approaches to manually designing visual features for this task as well as some approaches using deep neural networks. The second part is on building multimedia sentiment analysis datasets. We will introduce the challenges and solutions in the construction of different large-scale datasets for sentiment analysis. The final part is mainly on multimodal models for sentiment analysis. We will introduce some recent research projects on multimodal model design and learning. In addition, we will share some applications of sentiment analysis, as well as thoughts on current challenges and future directions.

Abstract:
The current state-of-the-art tele-medicine applications only allow audiovisual communication between a doctor and the patient, necessitating a clinician to physically examine the patient. The doctor relies on the physical examination performed by the clinician, along with the audiovisual dialogue with the patient. In this paper, a Haptic-enabled Tele-Immersive Musculoskeletal Examination (H-TIME) system is introduced that allows doctors to physically examine musculoskeletal conditions of patients remotely, by looking at the 3D reconstructed model of the patient in the virtual world and physically feeling the patient's range of mobility using a haptic device. The proposed bidirectional haptic rendering in H-TIME allows the doctor to remotely evaluate a patient who suffers from problems in the upper extremities, such as the shoulder, elbow or wrist. A real-world user study was performed with doctors and patients, and it highlighted the potential of the proposed system. The study indicated a high degree of correlation between the in-person and H-TIME evaluations of the patient. Both the doctors and the patients involved in the study felt that the system could potentially replace in-person consultations someday.

Abstract:
In multimedia analysis, the task of domain adaptation is to adapt the feature representation learned in the source domain with rich label information to the target domain with little or even no label information. Significant research endeavors have been devoted to aligning the feature distributions between the source and the target domains in the top fully connected layers based on unsupervised DNN-based models. However, the domain adaptation has been arbitrarily constrained near the output ends of the DNN models, which results in inadequate knowledge transfer in the DNN-based domain adaptation process, especially near the input end. We develop an attention transfer process for convolutional domain adaptation. The domain discrepancy, measured by the correlation alignment loss, is minimized on the second-order correlation statistics of the attention maps for both source and target domains. We then propose the Deep Unsupervised Convolutional Domain Adaptation (DUCDA) method, which jointly minimizes the supervised classification loss of labeled source data and the unsupervised correlation alignment loss measured on both convolutional layers and fully connected layers. The multi-layer domain adaptation process collaboratively reinforces each individual domain adaptation component and significantly enhances the generalization ability of the CNN models. Extensive cross-domain object classification experiments show that DUCDA outperforms other state-of-the-art approaches, and validate the promise of DUCDA for large scale real world applications.
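A correlation alignment loss on second-order statistics can be sketched as follows. This example uses the commonly cited CORAL form, the squared Frobenius distance between source and target feature covariances scaled by 1/(4d²), as an assumption; the abstract does not spell out the exact normalization or how the attention maps are flattened into feature vectors, and the function names are illustrative.

```python
def covariance(samples):
    """Unbiased covariance matrix of row-wise samples (n rows, d features)."""
    n, d = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    return [
        [sum((row[i] - means[i]) * (row[j] - means[j]) for row in samples) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]


def coral_loss(source, target):
    """Squared Frobenius distance between the two covariances, scaled by 1 / (4 d^2)."""
    d = len(source[0])
    cov_s, cov_t = covariance(source), covariance(target)
    squared = sum((cov_s[i][j] - cov_t[i][j]) ** 2 for i in range(d) for j in range(d))
    return squared / (4 * d * d)


# Toy features: same per-feature variance, opposite correlation between features.
source = [[0.0, 0.0], [2.0, 2.0]]
target = [[0.0, 0.0], [2.0, -2.0]]
print(coral_loss(source, target), coral_loss(source, source))
```

The two toy domains have identical per-feature variances, so a loss on first-order statistics or variances alone would see no gap; only the off-diagonal covariance terms reveal the discrepancy.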

Abstract:
Online multimedia has been growing rapidly due to ubiquitous mobile phones, widely deployed surveillance cameras, dashcams and mini-drones. When one takes photographs or videos at a public location, it is highly likely that some other people ("bystanders") also appear in the visual data. The data may be available online, such as shared by social media, and questions about privacy arise. This panel discusses the issues about privacy in online multimedia from legal, technological, and social aspects.

Abstract:
Altering the lyrics of famous songs is a common creative and communicative act, often used for purposes that go beyond simple amusement, such as the creation of companion music for advertisements. In this case, the altered song commonly refers to the advertised product or idea. Here we present a system that can automatically reproduce this process: it starts from a novel text, i.e. the daily news, identifies the key concepts therein contained, expands them and then uses this word cloud to replace some of the words in a song, obeying lexical, metrical and rhyming constraints. The new song is then created by merging these new lyrics, sung by a speech synthesizer, with the original backing music. Our evaluation shows that songs created by the system increase the recall of the news they were created from.

Abstract:
Smartphones are the central component of our modern, connected life. We carry them from the moment we wake up till the time we go to bed. Continuous technology innovation has created ever more sophisticated phones. Their small size belies a complexity that is largely unknown to the user -- making it almost impossible to discover and use these expanded capabilities. Designers are forced to make tradeoffs between adding more capabilities and burying the functionality in menu hierarchies. Users are frustrated when forced to accept either of these choices. This is the fundamental limitation of modern touch based interfaces to smartphones.

Abstract:
Temporal action detection is a very important yet challenging problem, since videos in real applications are usually long, untrimmed and contain multiple action instances. This problem requires not only recognizing action categories but also detecting the start time and end time of each action instance. Many state-of-the-art methods adopt the "detection by classification" framework: first generate proposals, and then classify them. The main drawback of this framework is that the boundaries of action instance proposals are fixed during the classification step. To address this issue, we propose a novel Single Shot Action Detector (SSAD) network based on 1D temporal convolutional layers, which skips the proposal generation step by directly detecting action instances in untrimmed video. To design an SSAD network that works effectively for temporal action detection, we empirically search for the best network architecture, since no existing models can be directly adopted. Moreover, we investigate input feature types and fusion strategies to further improve detection accuracy. We conduct extensive experiments on two challenging datasets: THUMOS 2014 and MEXaction2. When setting the Intersection-over-Union threshold to 0.5 during evaluation, SSAD significantly outperforms other state-of-the-art systems by increasing mAP from 19.0% to 24.6% on THUMOS 2014 and from 7.4% to 11.0% on MEXaction2.
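The Intersection-over-Union criterion used in the evaluation above can be sketched for one-dimensional temporal segments as follows. The function names and the example segments are illustrative assumptions. Only the 0.5 threshold comes from the abstract.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) segments given in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def is_correct_detection(pred, gt, threshold=0.5):
    """A detection counts as correct when its IoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold


# Hypothetical segments: the first prediction overlaps too little, the second enough.
loose = temporal_iou((0.0, 10.0), (5.0, 15.0))
tight = temporal_iou((2.0, 8.0), (0.0, 10.0))
```

In a full mAP computation each ground-truth instance can be matched by at most one detection of the same class. That bookkeeping is omitted here.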

Abstract:
Temporal attention has been widely used in video description to adaptively focus on important frames. However, most existing methods based on temporal attention suffer from recognition errors and missing details, because only coarse frame-level global features are employed. Inspired by recent successful work in image description using spatial attention, we propose a spatial-temporal attention (STAT) method to address these problems. First, we take advantage of object-level local features to address the problem of missing details. Second, the STAT method selects relevant local features by spatial attention and then attends to important frames by temporal attention to recognize related semantics. The proposed two-stage attention mechanism can recognize the salient objects more precisely with high recall and automatically focus on the most relevant spatial-temporal segments given the sentence context. Extensive experiments on two well-known benchmarks suggest that the STAT method outperforms the state-of-the-art methods on MSVD with a BLEU4 score of 0.511, and achieves a superior BLEU4 score of 0.374 on MSR-VTT-10K. Compared to the method without local features, the relative improvements derived from our STAT method are 10.1% and 0.8%, respectively, on the two benchmarks. Compared to the method using only temporal attention, the relative improvements derived from our STAT method are 18.3% and 9.0%, respectively, on the two benchmarks.

Abstract:
In this work, we introduce a framework that performs hand shadow puppetry using robotic arms. Human arms are not flexible and dynamic enough to produce many complicated shadow poses, so we use robotic arms instead. First, we construct a shadow library to cover all the possible shadow images for a single robotic hand. For the input image, we extract the shadow image using a salient object detector. Then we match shape correspondences between the input shadow image and the ones in the shadow library. Finally, we convert the parameters of the best matching shadow in the library into commands that control the physical robotic arms.

Abstract:
In this paper a novel incremental dimensionality reduction (DR) technique called incremental accelerated kernel discriminant analysis (IAKDA) is proposed. Consisting of the eigenvalue decomposition of a relatively small-size matrix and the recursive block Cholesky factorization of the kernel matrix, a nonlinear DR transformation is efficiently computed at each incremental step. Moreover, employing factorization techniques of excellent numerical stability, IAKDA effectively removes data nonlinearities in the low dimensional subspace. Experimental evaluation on various multimedia tasks and datasets confirms that the proposed approach combined with linear support vector machines (LSVMs) offers improved mean average precision (MAP) and provides an impressive training time speedup over batch KDA and also over traditional LSVM and kernel SVM (KSVM).
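To make the incremental step concrete, here is a minimal sketch of a standard block Cholesky update, in which the factor of an enlarged kernel matrix is obtained from the existing factor instead of being recomputed. It illustrates the general technique under assumed names and a toy matrix, and is not necessarily the exact recursion used in IAKDA.

```python
import math


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite matrix a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def forward_solve(l, b):
    """Solve l x = b for a lower-triangular l and a single right-hand side b."""
    x = []
    for i in range(len(l)):
        s = sum(l[i][k] * x[k] for k in range(i))
        x.append((b[i] - s) / l[i][i])
    return x


def extend_cholesky(l11, k12, k22):
    """Extend the factor l11 of K11 to the factor of [[K11, K12], [K12^T, K22]].

    k12 is n x m (old samples vs. new samples), k22 is m x m (new vs. new).
    """
    n, m = len(l11), len(k22)
    # Row r of L21 solves l11 x = (column r of K12).
    l21 = [forward_solve(l11, [k12[i][r] for i in range(n)]) for r in range(m)]
    # Schur complement K22 - L21 L21^T, factorized on its own.
    schur = [
        [k22[r][c] - sum(l21[r][k] * l21[c][k] for k in range(n)) for c in range(m)]
        for r in range(m)
    ]
    l22 = cholesky(schur)
    top = [row + [0.0] * m for row in l11]
    bottom = [l21[r] + l22[r] for r in range(m)]
    return top + bottom


# Toy kernel matrix (assumed): two existing samples plus one new sample.
old_factor = cholesky([[4.0, 2.0], [2.0, 5.0]])
new_factor = extend_cholesky(old_factor, [[2.0], [3.0]], [[6.0]])
```

Only the new rows are computed at each step, so for m new samples and n old ones the update costs on the order of n^2 m operations plus an m x m factorization, rather than a full factorization of the enlarged matrix.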

Abstract:
In hospitals all around the world, medical multimedia information systems have gained high importance over the last few years. One of the reasons is that an increasing number of interventions are performed in a minimally invasive way. These endoscopic inspections and surgeries are performed with a tiny camera -- the endoscope -- which produces a video signal that is used to control the intervention. Apart from the viewing purpose, the video signal is also used for automatic content analysis during the intervention as well as for post-surgical usage, such as communicating operation techniques, planning future interventions, and medical forensics. Another reason is video documentation, which is even enforced by law in some countries. The problem, however, is the sheer amount of unstructured medical videos that are added to the multimedia archive on a daily basis. Without proper management and a multimedia information system, the medical videos cannot be used efficiently for post-surgical scenarios. It is therefore already foreseeable that medical multimedia information systems will gain even more attraction in the next few years. In this tutorial we will introduce the audience to this challenging new field, describe the domain-specific characteristics and challenges of medical multimedia data, introduce related use cases, and talk about existing works -- contributed by the medical imaging and robotics community, but also already partly from the multimedia community -- as well as the many open issues and challenges that bear high research potential.

Abstract:
Recognizing Families In the Wild (RFIW) is organized as a Data Challenge Workshop in conjunction with ACM MM 2017. The workshop is scheduled for the afternoon of October 27th. RFIW is the 1st large-scale kinship recognition challenge and is made up of 2 tracks, kinship verification and family classification. In total, 12 final submissions were made. This big data challenge was achieved with our FIW dataset which is, by far, the largest image collection of its kind. Potential next steps for FIW are abundant.

Abstract:
Recognizing visual contents in unconstrained videos has become a very important problem for many applications, such as Web video search and recommendation, smart advertising, robotics, etc. This workshop and challenge aims at exploring new challenges and approaches for large-scale video classification with a large number of classes from open source videos in a realistic setting, based upon an extension of the Fudan-Columbia Video Dataset (FCVID). This newly collected dataset contains over 8000 hours of video data from YouTube and Flickr, annotated into 500 categories. We hope this dataset can stimulate innovative research on this challenging and important problem.

Abstract:
Pervasive indoor localization (PIL) aims to locate an indoor mobile-phone user without any infrastructure assistance. Conventional PIL approaches employ a single probe (i.e., target) measurement to localize by identifying its best match out of a fingerprint gallery. However, a single measurement usually captures limited and inadequate location features. More importantly, the reliance on a single measurement bears the inherent risk of being inaccurate and unreliable, due to the fact that the measurement could be noisy and even corrupted.
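The conventional single-probe matching described above can be sketched as a nearest-neighbor search over a fingerprint gallery. The location names, the signal-strength values, and the choice of squared Euclidean distance are all assumptions made for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length measurement vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def best_match(probe, gallery):
    """Return the gallery location whose stored fingerprint is closest to the probe."""
    return min(gallery, key=lambda location: squared_distance(probe, gallery[location]))


# Hypothetical gallery: signal strengths (dBm) from two access points per location.
gallery = {
    "lobby": [-40.0, -70.0],
    "lab": [-65.0, -45.0],
}
estimated = best_match([-42.0, -68.0], gallery)
```

Because the estimate depends on one probe vector, a single noisy or corrupted reading can move the nearest neighbor to the wrong location, which is the weakness the abstract points out.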

Abstract:
This paper proposes a novel recursive hashing scheme, in contrast to conventional "one-off" hashing algorithms. Inspired by the human "nonsalient-to-salient" perception path, the proposed hashing scheme generates a series of binary codes based on progressively expanded salient regions. Because it is built on a recurrent deep network, i.e., an LSTM structure, the binary codes generated from later output nodes naturally inherit information aggregated from previous codes while exploring novel information from the extended salient region, and the scheme therefore has good scalability. The proposed deep hashing network is trained end-to-end by minimizing a triplet ranking loss. Extensive experimental results on several image retrieval benchmarks demonstrate a good performance gain over state-of-the-art image retrieval methods as well as the scalability of the scheme.
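A triplet ranking loss of the kind named above can be sketched in its common hinge form: the anchor code should be closer to a similar (positive) code than to a dissimilar (negative) code by at least a margin. The margin value, the squared Euclidean distance, and the toy codes are assumptions, since the abstract does not give the exact formulation.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two code vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_ranking_loss(anchor, positive, negative, margin=2.0):
    """Hinge loss: zero once the negative is farther than the positive by the margin."""
    return max(
        0.0,
        margin + squared_distance(anchor, positive) - squared_distance(anchor, negative),
    )


# Toy 4-bit codes in {-1, +1} (assumed): the positive differs in 1 bit, the negative in 4.
anchor = [1.0, 1.0, -1.0, -1.0]
positive = [1.0, 1.0, -1.0, 1.0]
negative = [-1.0, -1.0, 1.0, 1.0]
satisfied = triplet_ranking_loss(anchor, positive, negative)
violated = triplet_ranking_loss(anchor, negative, positive)
```

For codes in {-1, +1} the squared Euclidean distance equals four times the Hamming distance, so minimizing this loss on relaxed real-valued outputs also orders the binarized codes by Hamming distance.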

Abstract:
In this paper, we propose an autoencoder-based generative adversarial network (GAN) for automatic image generation, which is called "stylized adversarial autoencoder". Different from existing generative autoencoders, which typically impose a prior distribution over the latent vector, the proposed approach splits the latent variable into two components: style feature and content feature, both encoded from real images. The split of the latent vector enables us to adjust the content and the style of the generated image arbitrarily by choosing different exemplary images. In addition, a multiclass classifier is adopted in the GAN network as the discriminator, which makes the generated images more realistic. We performed experiments on handwritten digit, scene text and face datasets, in which the stylized adversarial autoencoder achieves superior results for image generation and remarkably improves the corresponding supervised recognition task.

Abstract:
In this paper, we focus on improving Event Extraction (EE) by incorporating visual knowledge with words and phrases from text documents. We first discover visual patterns from large-scale text-image pairs in a weakly-supervised manner and then propose a multimodal event extraction algorithm where the event extractor is jointly trained with textual features and visual patterns. Extensive experimental results on benchmark data sets demonstrate that the proposed multimodal EE method can achieve significantly better performance on event extraction: absolute 7.1% F-score gain on event trigger labeling and 8.5% F-score gain on event argument labeling.

Abstract:
We consider the problem of extracting text instances of predefined categories from the Web. Instances of a category may be scattered across thousands of independent sources in many different formats with potential noises, which makes open-domain information extraction a challenging problem. Learning syntactic rules like "cities such as _" or "_ is a city" in a semi-supervised manner using a few labeled examples is usually unreliable because 1) high quality syntactic rules are rare and 2) the learning task is usually underconstrained. To address these problems, in this paper we propose to learn multimodal rules to combat the difficulty of syntactic rules. The multimodal rules are learned from information sources of different modalities, which is motivated by an intuition that information that is difficult to disambiguate correctly in one modality may be easily recognized in another. To demonstrate the effectiveness of this method, we have built a sophisticated end-to-end multimodal information extraction system that takes unannotated raw web pages as input, and generates a set of extracted instances as outputs. More specifically, our system learns reliable relationship between multimodal information by multimodal relation analysis on big unstructured data. Based on the learned relationship, we further train a set of multimodal rules for information extraction. Experimental evaluation shows that a greater accuracy for information extraction can be achieved by multimodal learning.

Abstract:
The huge variance of human pose and the misalignment of detected human images significantly increase the difficulty of person Re-Identification (Re-ID). Moreover, efficient Re-ID systems are required to cope with the massive visual data being produced by video surveillance systems. To address these two problems, this work proposes a Global-Local-Alignment Descriptor (GLAD) and an efficient indexing and retrieval framework, respectively. GLAD explicitly leverages the local and global cues in the human body to generate a discriminative and robust representation. It consists of part extraction and descriptor learning modules, where several part regions are first detected and then deep neural networks are designed for representation learning on both the local and global regions. A hierarchical indexing and retrieval framework is designed to eliminate the huge redundancy in the gallery set, and accelerate the online Re-ID procedure. Extensive experimental results show that GLAD achieves competitive accuracy compared to the state-of-the-art methods. Our retrieval framework significantly accelerates the online Re-ID procedure without loss of accuracy. Therefore, this work has the potential to work better on person Re-ID tasks in real scenarios.

Abstract:
Cloud gaming has gained significant popularity recently due to many important benefits, such as removal of device constraints, instant-on play and cross-platform support. Its intensive resource demands and dynamic workloads make cloud gaming well suited to an elastic cloud platform. Facing a large user population, a fundamental problem is how to provide satisfactory cloud gaming service at modest cost. We observe that software maintenance cost could be substantial compared to server running cost in cloud gaming. In this paper, we address the server provisioning problem for cloud gaming to optimize both server running cost and software maintenance cost. We find that the distribution of game software among servers triggers a trade-off between the software maintenance cost and server running cost. We formulate the problem with a stochastic model and employ queueing theory to conduct a solid theoretical analysis. We then propose several classes of algorithms to approximate the optimal solution. The proposed algorithms are evaluated by extensive experiments using real-world parameters. The results show that the proposed algorithms are computationally efficient, nearly cost-optimal and highly robust to dynamic changes.

Abstract:
Retargeting aims at adapting an original high-resolution photo/video to a low-resolution screen with an arbitrary aspect ratio. Conventional approaches are generally based on desktop PCs, since the computation might be intolerable for mobile platforms (especially when retargeting videos). Besides, only low-level visual features are exploited typically, whereas human visual perception is not well encoded. In this paper, we propose a novel retargeting framework which fast shrinks photo/video by leveraging human gaze behavior. Specifically, we first derive a geometry-preserved graph ranking algorithm, which efficiently selects a few salient object patches to mimic human gaze shifting path (GSP) when viewing each scenery. Afterward, an aggregation-based CNN is developed to hierarchically learn the deep representation for each GSP. Based on this, a probabilistic model is developed to learn the priors of the training photos which are marked as aesthetically-pleasing by professional photographers. We utilize the learned priors to efficiently shrink the corresponding GSP of a retargeted photo/video to be maximally similar to those from the training photos. Extensive experiments have demonstrated that: 1) our method consumes less than 35ms to retarget a 1024 × 768 photo (or a 1280 × 720 video frame) on popular iOS/Android devices, which is orders of magnitude faster than the conventional retargeting algorithms; 2) the retargeted photos/videos produced by our method outperform its competitors significantly based on the paired-comparison-based user study; and 3) the learned GSPs are highly indicative of human visual attention according to the human eye tracking experiments.

Abstract:
A scene is usually an abstract concept that consists of several less abstract entities such as objects or themes. It is very difficult to reason about scenes from visual features due to the semantic gap between the abstract scenes and low-level visual features. Some alternative works recognize scenes with a two-step framework by representing images with intermediate representations of objects or themes. However, object co-occurrences between scenes may lead to ambiguity in scene recognition. In this paper, we propose a framework to represent images with intermediate (object) representations with spatial layout, i.e., an object-to-object relation (OOR) representation. In order to better capture the spatial information, the proposed OOR is adapted to RGB-D data. In the proposed framework, we first apply an object detection technique on RGB and depth images separately. Then the detected results of both modalities are combined with an RGB-D proposal fusion process. Based on the detected results, we extract the semantic OOR feature and regional convolutional neural network (CNN) features located by bounding boxes. Finally, the different features are concatenated and fed to the classifier for scene recognition. The experimental results on the SUN RGB-D and NYUD2 datasets illustrate the efficiency of the proposed method.

Abstract:
In this paper, we explore ways to address the challenges such as data bias caused by the lack of data on person re-identification problem. We propose a data generation framework from both intra- and inter-view aspects for data augmentation to advance the performance of the existing person re-identification algorithms. Specifically, for intra-view data generation, the proposed method generates useful predicted sequences within a camera view for certain person data expansion. The generated sequences well preserve the movement information of the camera and objects, which expands the original data with longer sequence length to tackle the problem caused by insufficient data from the root. For more challenging datasets which suffer from background clutters, we propose an inter-view image generation with automatic end-to-end background substitution to eliminate the influence by the background and increase the diversity of the training data as well, which makes the recognition system learn to focus on the regions of objects and image features related to identity. We then propose a flexible data augmentation method based on our data generation approaches to improve the performance of the person re-identification and analyze the advantages and applicability of these approaches respectively. Evaluated on the challenging re-id datasets, our method outperforms existing state-of-the-art approaches without any network structure modification on the baseline neural network. Cross-datasets evaluation results show that our method has favorable generalization ability and is potentially helpful for solving similar recognition tasks due to the common issue of insufficient data.

Abstract:
The ACM Special Interest Group on Multimedia (SIGMM) is pleased to present this year's Rising Star Award in multimedia computing, communications and applications to Dr. Liangliang Cao for his significant contributions in large-scale multimedia recognition and social media mining. The ACM SIGMM Rising Star Award recognizes a young researcher who has made outstanding research contributions to the field of multimedia computing, communication and applications during the early part of his or her career. Dr. Cao has published extensively in top multimedia related journals and conferences, including 15 papers in ACM Multimedia and 3 in ACM ICMR. To date, he has garnered 3400+ citations on over 70 papers and 10 patents. This impressive record at an early stage of his career demonstrates the impact of his research and his contributions to our field of Multimedia. In his young research career, Dr. Cao has made unique and significant contributions in industrial settings. Most notably, he was the project key person for ALADDIN on the IBM-Columbia team, the lead contributor who made the IBM IMARS system 100 times faster, and the lead contributor of a clothes & fashion search app at Yahoo Taiwan, which has received 80+ local media reports. He is a cofounder of HelloVera.AI, where he is working as CTO and Chief Scientist. ....

Abstract:
In this paper, we summarize our work on cross-media retrieval, where the queries and the retrieved content are of different media types. We study cross-media retrieval in the context of two popular applications in multimedia retrieval, i.e., image retrieval by textual queries and sentence retrieval by visual queries. For image retrieval by textual queries, we propose text2image, which converts computing cross-media relevance between images and textual queries to comparing the visual similarity among images. We also propose cross-media relevance fusion, a conceptual framework that combines multiple cross-media relevance estimators. These two techniques have resulted in a winning entry in the Microsoft Image Retrieval Challenge at ACM MM 2015. For sentence retrieval by visual queries, we propose to compute cross-media relevance in a visual space exclusively. We contribute Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input. With the proposed Word2VisualVec model, we won the Video to Text Description task at TRECVID 2016.

Abstract:
Multimodal data streams are essential for analyzing personal life, environmental conditions, and social situations. Since these data streams have different granularities and semantics, the semantic gap becomes even more formidable. To make sense of all the multimodal correlated streams we must first synchronize them in the context of the application, and then analyze them to extract meaningful information. In this paper, we consider the problem of modeling an individual by using daily activity in order to understand their health and behavior. The first step is to correlate diverse data streams with atomic-interval, and segment a person's day into her daily activities. We collect the diverse data streams from the person's smartphone to classify every atomic-interval into a daily activity. Next, we use an interval growing technique for determining daily-activity-intervals and their attributes. Then, these daily-activity-intervals are labeled as the daily activities by using Bagging Formal Concept Analysis (BFCA). Finally, we build a personal chronicle, which is a person's time-ordered list of daily activities. This personal chronicle can then be used to model the person using learning techniques applied to daily activities in the chronicle and relating them to biomedical or behavioral signals. We present the results for this daily activity segmentation and recognition by using lifelogs of 23 participants.

Abstract:
Scale analysis plays a vital role in pedestrian detection. Conventional approaches usually directly concatenate multi-scale outputs, which is only capable of modeling first-order dependency among various scales. In contrast, this work proposes a novel scale-context modeling scheme that exploits the highly nonlinear dependency among scales. The proposed scheme aggregates output response maps from intermediate results of convolutional layers via a bi-directional recurrent sub-network. Scale information can therefore flow among different layers, and the implicit dependency structure in the scale space is disclosed, which yields more consistent detection. Experimental results on the Caltech Pedestrian detection benchmark demonstrate the superior detection (state-of-the-art miss rate of 8.56%) of the proposed method over prior art.

Abstract:
We present a method for activity recognition that first estimates the activity performer's location and uses it with input data for activity recognition. Existing approaches directly take video frames or entire video for feature extraction and recognition, and treat the classifier as a black box. Our method first locates the activities in each input video frame by generating an activity mask using a conditional generative adversarial network (cGAN). The generated mask is appended to color channels of input images and fed into a VGG-LSTM network for activity recognition. To test our system, we produced two datasets with manually created masks, one containing Olympic sports activities and the other containing trauma resuscitation activities. Our system makes activity prediction for each video frame and achieves performance comparable to the state-of-the-art systems while simultaneously outlining the location of the activity. We show how the generated masks facilitate the learning of features that are representative of the activity rather than accidental surrounding information.
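The step of appending the generated mask to the color channels can be sketched as turning each 3-channel pixel into a 4-channel pixel. The nested-list image layout (height x width x channels) and the function name are assumptions for illustration. A real pipeline would use tensors and a learned cGAN mask.

```python
def append_mask(image, mask):
    """Attach a per-pixel mask value as a fourth channel to an H x W x 3 image."""
    if len(image) != len(mask) or any(
        len(row) != len(mask_row) for row, mask_row in zip(image, mask)
    ):
        raise ValueError("image and mask must have the same height and width")
    return [
        [list(pixel) + [m] for pixel, m in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


# Hypothetical 1 x 2 frame: the mask marks only the first pixel as activity.
frame = [[[255, 0, 0], [0, 255, 0]]]
activity_mask = [[1, 0]]
masked_frame = append_mask(frame, activity_mask)
```

Feeding the mask as an extra input channel leaves the color information intact, so the recognition network can learn how much weight to give the masked region rather than having the background removed outright.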

Abstract:
The ubiquity of online fashion shopping demands effective recommendation services for customers. In this paper, we study two types of fashion recommendation: (i) suggesting an item that matches existing components in a set to form a stylish outfit (a collection of fashion items), and (ii) generating an outfit with multimodal (images/text) specifications from a user. To this end, we propose to jointly learn a visual-semantic embedding and the compatibility relationships among fashion items in an end-to-end fashion. More specifically, we consider a fashion outfit to be a sequence (usually from top to bottom and then accessories) and each item in the outfit as a time step. Given the fashion items in an outfit, we train a bidirectional LSTM (Bi-LSTM) model to sequentially predict the next item conditioned on previous ones to learn their compatibility relationships. Further, we learn a visual-semantic space by regressing image features to their semantic representations aiming to inject attribute and category information as a regularization for training the LSTM. The trained network can not only perform the aforementioned recommendations effectively but also predict the compatibility of a given outfit. We conduct extensive experiments on our newly collected Polyvore dataset, and the results provide strong qualitative and quantitative evidence that our framework outperforms alternative methods.

Abstract:
UnrealCV is a project to help computer vision researchers build virtual worlds using Unreal Engine 4 (UE4). It extends UE4 with a plugin that provides (1) a set of UnrealCV commands to interact with the virtual world, and (2) communication between UE4 and an external program, such as Caffe. UnrealCV can be used in two ways. The first is to use a compiled game binary with UnrealCV embedded. This is as simple as running a game, and no knowledge of Unreal Engine is required. The second is to install the UnrealCV plugin into UE4 and use the UE4 editor to build a new virtual world. UnrealCV is open-source software under the MIT license. Since the initial release in September 2016, it has gathered an active community of users, including students and researchers.

Abstract:
This paper presents a novel video search system that can provide users with diversified and summarized results automatically. Users frequently confront the dilemmas of specifying conditions for video search because too many or few conditions would lead to either insufficient or redundant results. We solve this problem by introducing novel definitions to explore hidden knowledge inherent in underlying datasets. We demonstrate our methods by implementing a prototype system with real-world surveillance videos.

Abstract:
The art of cooking is always fascinating. Nevertheless, reproducing a delicious dish that one has never encountered before is not easy. Even if the name of the dish is known and the corresponding recipe can be retrieved, the right ingredients for cooking the dish may not be available due to factors such as geographic region or season. Furthermore, knowing how to cut, cook and control timing may be challenging for someone who has no cooking experience. In this paper, an all-around cooking assistant mobile app, named Pic2Dish, is developed to help users who would like to cook a dish but neither know the name of the dish nor have cooking skills. By inputting a picture of the dish and the list of ingredients at hand, the user lets Pic2Dish automatically recognize the dish name and recommend a customized recipe together with video clips that show how to cook the dish. Importantly, the recommended recipe is modified from a retrieved recipe that best matches the given dish, with missing ingredients being replaced by available ingredients that match the dish context and taste. The whole process involves the recognition of dishes with a convolutional neural network, classification of key and non-key ingredients, and context analysis of ingredient relationships and their cooking/cutting methods. User studies, in which real users were recruited to cook dishes using Pic2Dish, show the usefulness of the app.

Abstract:
We propose a new rate control algorithm for interactive real-time multimedia traffic, the FRACTaL algorithm. In our approach, the endpoint sends Forward Error Correction (FEC) packets not only for better error-resilience but also to probe for available bandwidth. The sender varies the amount of FEC to meet the sending rate calculated by the congestion control without changing the media rate. We evaluate our proposal in an emulated networking environment, using a set of reference test scenarios and compare it to the SCReAM congestion control algorithm. We find that FRACTaL performs better than SCReAM when competing with TCP flows, i.e., it is able to obtain its "fair" share. In other (non-TCP) scenarios, it achieves lower loss rates at comparable path utilization and queuing delay, i.e., FRACTaL delivers better and consistent media quality.
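The idea of filling the gap between the congestion controller's target rate and a fixed media rate with FEC can be sketched as below. This is a simplified illustration under assumed names and units (kbps). It does not model how FRACTaL computes the target rate or when the media rate itself is eventually adapted.

```python
def fec_rate_kbps(target_rate_kbps, media_rate_kbps):
    """FEC bitrate needed so that media plus FEC meets the target sending rate."""
    return max(0.0, target_rate_kbps - media_rate_kbps)


def fec_overhead_ratio(target_rate_kbps, media_rate_kbps):
    """FEC bitrate expressed as a fraction of the media bitrate."""
    return fec_rate_kbps(target_rate_kbps, media_rate_kbps) / media_rate_kbps


# Hypothetical probe: the controller allows 1200 kbps while media stays at 1000 kbps.
probe_fec = fec_rate_kbps(1200.0, 1000.0)
no_headroom = fec_rate_kbps(800.0, 1000.0)
```

Probing with redundancy instead of extra media bits means that packets lost while testing for more bandwidth are repair packets, or media packets that the repair packets help recover, so the probe itself does not degrade the media.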

Abstract:
With the increasing demand for massive multimodal data storage and organization, cross-modal retrieval based on hashing techniques has drawn much attention. It takes the binary codes of one modality as the query to retrieve the relevant hashing codes of another modality. However, the existing binary constraint makes it difficult to find the optimal cross-modal hashing function. Most approaches choose to relax the constraint and apply a thresholding strategy to the real-valued representation instead of directly solving the original objective. In this paper, we first provide a concrete analysis of the effectiveness of multimodal networks in preserving inter- and intra-modal consistency. Based on this analysis, we propose a Deep Binary Reconstruction (DBRC) network that can directly learn the binary hashing codes in an unsupervised fashion. Its superiority comes from a simple but efficient activation function, named Adaptive Tanh (ATanh). The ATanh function can adaptively learn the binary codes and be trained via back-propagation. Extensive experiments on three benchmark datasets demonstrate that DBRC outperforms several state-of-the-art methods in both image2text and text2image retrieval tasks.
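
For readers unfamiliar with hashing-based retrieval, the relax-then-threshold baseline mentioned above can be sketched as follows. This illustrates the conventional approach, not DBRC or ATanh; the threshold at zero and all function names are assumptions.

```python
from typing import List, Sequence


def binarize(values: Sequence[float]) -> List[int]:
    """Threshold a real-valued representation at zero to get a binary code."""
    return [1 if v >= 0 else 0 for v in values]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def retrieve(query_code: Sequence[int],
             database_codes: Sequence[Sequence[int]],
             top_k: int) -> List[int]:
    """Indices of the top_k database codes closest to the query in Hamming space."""
    order = sorted(range(len(database_codes)),
                   key=lambda i: hamming(query_code, database_codes[i]))
    return order[:top_k]


if __name__ == "__main__":
    # Hypothetical image-side query and text-side database representations.
    query = binarize([0.3, -1.2, 0.0, -0.1])
    database = [binarize(v) for v in ([0.9, -0.4, 0.2, -0.7],
                                      [-0.5, 0.8, -0.3, 0.6],
                                      [0.1, -0.2, -0.9, -0.3])]
    print(retrieve(query, database, top_k=2))
```

The thresholding step is where quantization error enters in this baseline, which is the issue that directly learning binary codes is meant to avoid.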

Abstract:
Future frame prediction for video sequences is a challenging problem in computer vision that is worth exploring. Existing methods often learn motion information for the entire image to predict the next frames. However, different objects in the same scene often move and deform in different ways. Considering the human visual system, one often pays attention to the key objects that contain crucial motion signals, rather than compressing an entire image into a static representation. Motivated by this property of human perception, in this work, we develop a novel object-centric video prediction model that dynamically learns local motion transformations for key object regions with visual attention. By iteratively transforming objects in the original input frames, the next frame can be produced. Specifically, we design an attention module with replaceable strategies to attend to objects in video frames automatically. Our method does not require any annotated data during the training procedure. To produce sharp predictions, adversarial training is adopted in our work. We evaluate our model on the Moving MNIST and UCF101 datasets and report competitive results compared to prior methods. The generated frames demonstrate that our model can characterize motion for different objects and produce plausible future frames.

Abstract:
In content-based image retrieval, the most challenging (and ambiguous) part is to define the similarity between images. For humans, such similarity can be defined with respect to where they pay attention and what semantic attributes they understand. Inspired by this fact, this paper presents two-stream attentive CNNs for image retrieval. As humans do, the proposed network has two streams that simultaneously handle two tasks. The Main stream focuses on extracting discriminative visual features that are tightly correlated with semantic attributes. Meanwhile, the Auxiliary stream aims to facilitate the Main stream by redirecting the feature extraction operation mainly to the image content that humans may pay attention to. By fusing these two streams into the Main and Auxiliary CNNs (MAC), image similarity can be computed as humans do, by preserving the conspicuous content and suppressing the irrelevant regions. Extensive experiments show that the proposed model achieves impressive performance in image retrieval on four public datasets.

Abstract:
Modeling hairstyles for classification, synthesis and image editing has many practical applications. However, existing hairstyle datasets, such as the Beauty e-Expert dataset, are too small for developing and evaluating computer vision models, especially recent deep generative models such as the generative adversarial network (GAN). In this paper, we contribute a new large-scale hairstyle dataset called Hairstyle30k, which is composed of 30k images containing 64 different types of hairstyles. To enable automated generation and modification of hairstyles in images, we also propose a novel GAN model termed Hairstyle GAN (H-GAN), which can be learned efficiently. Extensive experiments on the new dataset as well as existing benchmark datasets demonstrate the effectiveness of the proposed H-GAN model.

Abstract:
With the proliferation of e-commerce websites and the ubiquitousness of smart phones, cross-domain image retrieval using images taken by smart phones as queries to search products on e-commerce websites is emerging as a popular application. One challenge of this task is to locate the attention of both the query and database images. In particular, database images, e.g. of fashion products, on e-commerce websites are typically displayed with other accessories, and the images taken by users contain noisy background and large variations in orientation and lighting. Consequently, their attention is difficult to locate. In this paper, we exploit the rich tag information available on the e-commerce websites to locate the attention of database images. For query images, we use each candidate image in the database as the context to locate the query attention. Novel deep convolutional neural network architectures, namely TagYNet and CtxYNet, are proposed to learn the attention weights and then extract effective representations of the images. Experimental results on public datasets confirm that our approaches have significant improvement over the existing methods in terms of the retrieval accuracy and efficiency.

Abstract:
Approximate Nearest Neighbour (ANN) search is an important research topic in multimedia and computer vision fields. In this paper, we propose a new deep supervised quantization method by Self-Organizing Map (SOM) to address this problem. Our method integrates the Convolutional Neural Networks (CNN) and Self-Organizing Map into a unified deep architecture. The overall training objective includes supervised quantization loss and classification loss. With the supervised quantization loss, we minimize the differences on the maps between similar image pairs, and maximize the differences on the maps between dissimilar image pairs. By optimization, the deep architecture can simultaneously extract deep features and quantize the features into the suitable nodes in the Self-Organizing Map. The experiments on several public standard datasets prove the superiority of our approach over the existing ANN search methods. Besides, as a byproduct, our deep architecture can be directly applied to classification task and visualization with little modification, and promising performances are demonstrated on these tasks in the experiments.

Abstract:
Data clustering is a fundamental operation in data analysis. For handling large-scale data, the standard k-means clustering method is not only slow, but also memory-inefficient. We propose an efficient clustering method for billion-scale feature vectors, called PQk-means. By first compressing input vectors into short product-quantized (PQ) codes, PQk-means achieves fast and memory-efficient clustering, even for high-dimensional vectors. Similar to k-means, PQk-means repeats the assignment and update steps, both of which can be performed in the PQ-code domain. Experimental results show that even short-length (32 bit) PQ-codes can produce competitive results compared with k-means. This result is of practical importance for clustering in memory-restricted environments. Using the proposed PQk-means scheme, the clustering of one billion 128D SIFT features with K = 10^5 is achieved within 14 hours, using just 32 GB of memory consumption on a single computer.
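
To make "clustering in the PQ-code domain" more concrete, the sketch below shows generic product-quantization encoding and an assignment step that compares codes through precomputed lookup tables. It is an illustration under assumptions, not the PQk-means code: the codebooks are given rather than trained, the update step is omitted, and all names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = List[List[Vector]]  # codebook[m][k]: k-th codeword of subspace m


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vec: Vector, codebook: Codebook) -> List[int]:
    """Split vec into len(codebook) sub-vectors and store the nearest codeword id of each."""
    d_sub = len(vec) // len(codebook)
    code = []
    for m, sub_cb in enumerate(codebook):
        sub = vec[m * d_sub:(m + 1) * d_sub]
        code.append(min(range(len(sub_cb)), key=lambda k: sq_dist(sub, sub_cb[k])))
    return code


def build_tables(codebook: Codebook) -> List[List[List[float]]]:
    """tables[m][i][j]: squared distance between codewords i and j of subspace m."""
    return [[[sq_dist(a, b) for b in sub_cb] for a in sub_cb] for sub_cb in codebook]


def symmetric_distance(code_a, code_b, tables) -> float:
    """Approximate squared distance between two PQ codes using table lookups only."""
    return sum(tables[m][ka][kb] for m, (ka, kb) in enumerate(zip(code_a, code_b)))


def assign(codes, center_codes, tables) -> List[int]:
    """Assignment step: index of the nearest center code for every input code."""
    return [min(range(len(center_codes)),
                key=lambda c: symmetric_distance(code, center_codes[c], tables))
            for code in codes]


if __name__ == "__main__":
    # Toy setting: 2-D vectors, 2 subspaces, 2 codewords per subspace.
    codebook = [[[0.0], [10.0]], [[0.0], [10.0]]]
    tables = build_tables(codebook)
    codes = [pq_encode(v, codebook) for v in ([1.0, 9.0], [8.0, 9.5], [0.5, 0.2])]
    print(codes)
    print(assign(codes, [[0, 0], [1, 1]], tables))
```

Because the assignment step touches only small integer codes and lookup tables, the original vectors never need to be held in memory, which is the source of the memory savings described above.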

Abstract:
Recently, some cross-modal hashing methods have been devised for the cross-modal search task. Essentially, given a similarity matrix, most of these methods tackle a discrete optimization problem by separating it into two stages, i.e., first relaxing the binary constraints and finding a solution of the relaxed optimization problem, then quantizing the solution to obtain the binary codes. This scheme generates a large quantization error. Some discrete optimization methods have been proposed to tackle this; however, the generation of the binary codes is independent of the features in the original space, which makes it not robust to noise. To address these problems, in this paper, we propose a novel supervised cross-modal hashing method, Semi-Relaxation Supervised Hashing (SRSH). It can learn the hash functions and the binary codes simultaneously. At the same time, to tackle the optimization problem, it relaxes a part of the binary constraints, instead of all of them, by introducing an intermediate representation variable. By doing this, the quantization error can be reduced and the optimization problem can also be easily solved by an iterative algorithm proposed in this paper. Extensive experimental results on three benchmark datasets demonstrate that SRSH can obtain competitive results and outperform state-of-the-art unsupervised and supervised cross-modal hashing methods.

Abstract:
Reading the human mind has been a hot topic in the last decades, and recent research in neuroscience has found evidence for the possibility of decoding, from neuroimaging data, how the human brain works. At the same time, the recent rediscovery of deep learning, combined with the large interest of the scientific community in generative methods, has enabled the generation of realistic images by learning a data distribution from noise. The quality of generated images increases when the input data conveys information on the visual content of images. Leveraging these recent trends, in this paper we present an approach for generating images using visually-evoked brain signals recorded through an electroencephalograph (EEG). More specifically, we recorded EEG data from several subjects while they observed images on a screen and tried to regenerate the seen images. To achieve this goal, we developed a deep-learning framework consisting of an LSTM stacked with a generative method, which learns a more compact and noise-free representation of EEG data and employs it to generate the visual stimuli evoking specific brain responses.

Abstract:
Social media information is distributed across different Online Social Networks (OSNs). This paper addresses the problem of integrating cross-OSN information to facilitate an immersive social media search experience. We exploit hashtags, which are widely used to annotate and organize multi-modal items in different OSNs, as the bridge for information aggregation and organization. A three-stage solution framework is proposed for hashtag representation, clustering and demonstration. Given an event query, the related items from three OSNs, Twitter, Flickr and YouTube, are organized in a cluster-hashtag-item hierarchy for display. The effectiveness of the proposed solution is validated by qualitative and quantitative experiments on hundreds of trending event queries.

Abstract:
Anomalous events detection in real-world video scenes is a challenging problem due to the complexity of "anomaly" as well as the cluttered backgrounds, objects and motions in the scenes. Most existing methods use hand-crafted features in local spatial regions to identify anomalies. In this paper, we propose a novel model called Spatio-Temporal AutoEncoder (ST AutoEncoder or STAE), which utilizes deep neural networks to learn video representation automatically and extracts features from both spatial and temporal dimensions by performing 3-dimensional convolutions. In addition to the reconstruction loss used in existing typical autoencoders, we introduce a weight-decreasing prediction loss for generating future frames, which enhances the motion feature learning in videos. Since most anomaly detection datasets are restricted to appearance anomalies or unnatural motion anomalies, we collected a new challenging dataset comprising a set of real-world traffic surveillance videos. Several experiments are performed on both the public benchmarks and our traffic dataset, which show that our proposed method remarkably outperforms the state-of-the-art approaches.

Abstract:
Analyzing videos has been one of the fundamental problems of computer vision and multimedia content analysis for decades. The task is very challenging, as video is an information-intensive medium with large variations and complexities. Thanks to the recent development of deep learning techniques, researchers in both the computer vision and multimedia communities are now able to boost the performance of video analysis significantly and initiate new research directions to analyze video content. This tutorial will present recent advances under the umbrella of video understanding, starting from a unified deep learning toolkit, the Microsoft Cognitive Toolkit (CNTK), which supports popular model types such as convolutional nets and recurrent networks, moving on to the fundamental challenges of video representation learning, video classification and recognition, and finally to the emerging area of video and language.

Abstract:
Image captioning has attracted ever-increasing research attention in multimedia and computer vision. To encode the visual content, existing approaches typically utilize the off-the-shelf deep Convolutional Neural Network (CNN) model to extract visual features, which are sent to Recurrent Neural Network (RNN) based textual generators to output word sequence. Some methods encode visual objects and scene information with attention mechanism more recently. Despite the promising progress, one distinct disadvantage lies in distinguishing and modeling key semantic entities and their relations, which are in turn widely regarded as the important cues for us to describe image content. In this paper, we propose a novel image captioning model, termed StructCap. It parses a given image into key entities and their relations organized in a visual parsing tree, which is transformed and embedded under an encoder-decoder framework via visual attention. We give an end-to-end formulation to facilitate joint training of visual tree parser, structured semantic attention and RNN-based captioning modules. Experimental results on two public benchmarks, Microsoft COCO and Flickr30K, show that the proposed StructCap model outperforms the state-of-the-art approaches under various standard evaluation metrics.

Abstract:
Face aging, which renders aging faces for an input face, has attracted extensive attention in multimedia research. Recently, several methods based on conditional Generative Adversarial Nets (GANs) have achieved great success. They can generate images fitting the real face distributions conditioned on each individual age group. However, these methods fail to capture the transition patterns, e.g., the gradual shape and texture changes between adjacent age groups. In this paper, we propose novel Contextual Generative Adversarial Nets (C-GANs) to specifically take these patterns into consideration. The C-GANs consist of a conditional transformation network and two discriminative networks. The conditional transformation network imitates the aging procedure with several specially designed residual blocks. The age discriminative network guides the synthesized face to fit the real conditional distribution. The transition pattern discriminative network is novel, aiming to distinguish the real transition patterns from the fake ones. It serves as an extra regularization term for the conditional transformation network, ensuring that the generated image pairs fit the corresponding real transition pattern distribution. Experimental results demonstrate that the proposed framework produces appealing results in comparison with the state-of-the-art and ground truth. We also observe a performance gain for cross-age face verification.

Abstract:
As a crucial challenge for video understanding, exploiting the spatial-temporal structure of video has attracted much attention recently, especially on video captioning. Inspired by the insight that people always focus on certain interested regions of video content, we propose a novel approach which will automatically focus on regions-of-interest and catch their temporal structures. In our approach, we utilize a specific attention model to adaptively select regions-of-interest for each video frame. Then a Dual Memory Recurrent Model (DMRM) is introduced to incorporate temporal structure of global features and regions-of-interest features in parallel, which will obtain rough understanding of video content and particular information of regions-of-interest. Since the attention model could not always catch the right interests, we additionally adopt semantic supervision to attend to interested regions more correctly. We evaluate our method for video captioning on two public benchmarks: the Microsoft Video Description Corpus (MSVD) and the Montreal Video Annotation Dataset (M-VAD). The experiments demonstrate that catching temporal regions-of-interest information really enhances the representation of input videos and our approach obtains the state-of-the-art results on popular evaluation metrics like BLEU-4, CIDEr, and METEOR.

Abstract:
In generic visual tracking, traditional appearance based trackers suffer from distracting factors like bad lighting or major target deformation, etc., as well as insufficiency of training data. In this work, we propose to exploit the category-specific semantics to boost visual object tracking, and develop a new visual tracking model that augments the appearance based tracker with a top-down reasoning component. The continuous feedback from this reasoning component guides the tracker to reliably identify candidate regions with consistent semantics across frames and localize the target object instance more robustly and accurately. Specifically, a generic object recognition model and a semantic activation map method are deployed to provide effective top-down reasoning about object locations for the tracker. In addition, we develop a voting based scheme for the reasoning component to infer the object semantics. Therefore, even without sufficient training data, the tracker can still obtain reliable top-down clues about the objects. Together with the appearance clues, the tracker can localize objects accurately even in presence of various major distracting factors. Extensive evaluations on two large-scale benchmark datasets, OTB2013 and OTB2015, clearly demonstrate that the top-down reasoning substantially enhances the robustness of the tracker and provides state-of-the-art performance.

Abstract:
Recent years have witnessed the success of the emerging hash-based approximate nearest neighbor search techniques in large-scale image retrieval. However, for large-scale video search, most of the existing hashing methods mainly focus on the visual content contained in the still frames, without considering their temporal relations. Therefore, they usually suffer greatly from the insufficient capability of capturing the intrinsic video similarities, from both the visual and the temporal aspects. To address the problem, we propose a temporal binary coding solution in an unsupervised manner, which simultaneously considers the intrinsic relations among the visual content and the temporal consistency among the successive frames. To capture the inherent data similarities among videos, we adopt the sparse, nonnegative feature to characterize the common local visual content and approximate their intrinsic similarities using a low-rank matrix. Then a standard graph-based loss is adopted to guarantee that the learnt hash codes can well preserve the similarities. Furthermore, we introduce a subspace rotation to model the small variation among the successive frames, and thus essentially preserve the temporal consistency in Hamming space. Finally, we formulate the video hashing problem as a joint learning of the binary codes, the hash functions and the temporal variation, and devise an alternating optimization algorithm that enjoys fast training and discriminative hash functions. Extensive experiments on three large video datasets demonstrate the proposed method significantly outperforms a number of state-of-the-art hashing methods.

Abstract:
In this paper, a deep end-to-end network for sketch recognition, named Deep Visual-Sequential Fusion model (DVSF) is proposed to model the visual and sequential patterns of the strokes. To capture the intermediate states of sketches, a three-way representation learner is first utilized to extract the visual features. These deep features are simultaneously fed into the visual and sequential networks to capture spatial and temporal properties, respectively. More specifically, visual networks are novelly proposed to learn the stroke patterns by stacking the Residual Fully-Connected (R-FC) layers, which integrate ReLU and Tanh activation functions to achieve the sparsity and generalization ability. To learn the patterns of stroke order, sequential networks are constructed by Residual Long Short-Term Memory (R-LSTM) units, which optimize the network architecture by skip connection. Finally, the visual and sequential representations of the sketches are seamlessly integrated with a fusion layer to obtain the final results. Experiments conducted on the benchmark sketch dataset TU-Berlin demonstrate the effectiveness of the proposed method, which outperforms the state-of-the-art approaches.

Abstract:
Monocular simultaneous localization and mapping (SLAM) is a key enabling technique for many augmented reality (AR) applications. However, conventional methods for monocular SLAM can obtain only sparse or semi-dense maps in highly-textured image areas. Poorly-textured regions, which widely exist in indoor and man-made urban environments, can hardly be reconstructed, impeding interactions between virtual objects and real scenes in AR apps. In this paper, we present a novel method for real-time monocular dense mapping based on the piecewise planarity assumption for poorly-textured regions. Specifically, a semi-dense map for highly-textured regions is first calculated by pixel matching and triangulation [6, 7]. Large textureless regions extracted by Maximally Stable Color Regions (MSCR) [11], which is a homogeneous-color region detector, are approximated using piecewise planar models, which are estimated from the corresponding semi-dense 3D points and the proposed multi-plane segmentation algorithm. Plane models associated with the same 3D area across multiple overlapping views are linked and fused to ensure a consistent and accurate 3D reconstruction. Experimental results on two public datasets [15, 23] demonstrate that our method is 2.3X~2.9X faster than the state-of-the-art method DPPTAM [2], while achieving better reconstruction accuracy and completeness. We also apply our method to a real AR application, and live experiments with a hand-held camera demonstrate the effectiveness and efficiency of our method in practical scenarios.

Abstract:
Touch Me Here is an interactive performance art piece that invites the participant to virtually touch and paint on the body of a female artist who is standing at a distance, in the same closed intimate space. Using the artist's body as a canvas, the participant makes a virtual body-art piece with a custom-built augmented reality painting application. This interactive digital artwork explores and challenges the ethics of live-interaction, while simultaneously blurring the line between the private and public spheres. Other major themes of this work include: the dialectic between virtual and real, micro and macro gestures, and motions and emotions. Touch Me Here is a digital reenactment of Valie Export's Tap and Touch Cinema (1968).

Abstract:
Although the problem of automatic video summarization has recently received a lot of attention, the problem of creating a video summary that also highlights elements relevant to a search query has been less studied. We address this problem by posing query-relevant summarization as a video frame subset selection problem, which lets us optimise for summaries which are simultaneously diverse, representative of the entire video, and relevant to a text query. We quantify relevance by measuring the distance between frames and queries in a common textual-visual semantic embedding space induced by a neural network. In addition, we extend the model to capture query-independent properties, such as frame quality. We compare our method against previous state of the art on textual-visual embeddings for thumbnail selection and show that our model outperforms them on relevance prediction. Furthermore, we introduce a new dataset, annotated with diversity and query-specific relevance labels. On this dataset, we train and test our complete model for video summarization and show that it outperforms standard baselines such as Maximal Marginal Relevance.
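
The Maximal Marginal Relevance baseline named above trades query relevance against redundancy with already selected items. The sketch below is a generic greedy MMR selector, not the paper's model. It assumes frames and the query are already embedded in a common vector space, and it uses cosine similarity and a trade-off weight `lam`, both chosen here for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(frames: Sequence[Sequence[float]],
               query: Sequence[float],
               k: int,
               lam: float = 0.7) -> List[int]:
    """Greedily pick up to k frame indices by Maximal Marginal Relevance.

    score(i) = lam * sim(frame_i, query)
               - (1 - lam) * max over selected j of sim(frame_i, frame_j)
    """
    selected: List[int] = []
    candidates = list(range(len(frames)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(frames[i], query)
            redundancy = max((cosine(frames[i], frames[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    # Hypothetical 2-D embeddings: frames 0 and 1 are near-duplicates.
    frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    query = [1.0, 0.2]
    print(mmr_select(frames, query, k=2, lam=0.5))  # balances relevance and diversity
    print(mmr_select(frames, query, k=2, lam=1.0))  # relevance only
```

With `lam=1.0` the selector degenerates to ranking by relevance and picks both near-duplicate frames, while a lower `lam` makes the second pick a dissimilar frame.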

Abstract:
Discriminative localization is essential for the fine-grained image classification task, which aims to recognize hundreds of subcategories within the same basic-level category. Key differences among subcategories are subtle and local, and are reflected in the discriminative regions of objects. Existing methods generally adopt a two-stage learning framework: the first stage localizes the discriminative regions of objects, and the second encodes the discriminative features for training classifiers. However, these methods generally have two limitations: (1) Separation of the two-stage learning is time-consuming. (2) Dependence on object and parts annotations for discriminative localization learning leads to heavily labor-consuming labeling. It is highly challenging to address these two important limitations simultaneously, and existing methods focus on only one of them. Therefore, this paper proposes a discriminative localization approach via saliency-guided Faster R-CNN to address the above two limitations at the same time. Our main novelties and advantages are: (1) An end-to-end network based on Faster R-CNN is designed to simultaneously localize discriminative regions and encode discriminative features, which accelerates classification speed. (2) Saliency-guided localization learning is proposed to localize the discriminative region automatically, avoiding labor-consuming labeling. Both are jointly employed to simultaneously accelerate classification speed and eliminate dependence on object and parts annotations. Compared with the state-of-the-art methods on the widely-used CUB-200-2011 dataset, our approach achieves both the best classification accuracy and efficiency.

Abstract:
Multimodal Deep Boltzmann Machines (DBMs) have demonstrated huge successes in multimodal representation learning tasks. During inference, DBMs function as Recurrent Neural Nets (RNNs) because of the intractable distributions. To learn the parameters, optimizations can alternatively be operated on these surrogate RNNs with "truncated message passing". As a consequence, the gradient will propagate through a long chain without any local guidance, which can potentially affect the optimization procedure. In this paper, we address this problem by adding skip connections during back-propagation while keeping the forward propagation (inference) untouched. With skip connections, we implicitly assign local "targets" for the states of intermediate inference loops to approach. Applied to different training criteria on different data sets, we demonstrate that the proposed algorithms can consistently help to train better models at a lower cost of training time. Experimental results show that our algorithms can achieve state-of-the-art performance on the Multimedia Information Retrieval (MIR) Flickr data set.

Abstract:
Multi-shot person Re-IDentification (Re-ID) has recently received more research attention as its problem setting is more realistic compared to single-shot Re-ID in terms of application. While many large-scale single-shot Re-ID human image datasets have been released, most existing multi-shot Re-ID video sequence datasets contain only a few (i.e., several hundred) human instances, which hinders further improvement of multi-shot Re-ID performance. To this end, we propose a deep cross-modality alignment network, which jointly explores both human sequence pairs and image pairs to facilitate training better multi-shot human Re-ID models, i.e., via transferring knowledge from image data to sequence data. To mitigate the modality-to-modality mismatch issue, the proposed network is equipped with an image-to-sequence adaptation module called the cross-modality alignment sub-network, which maps each human image into a pseudo human sequence to facilitate knowledge transfer and joint training. Extensive experimental results on several multi-shot person Re-ID benchmarks demonstrate the great performance gain brought by the proposed network.

Abstract:
Nowadays, as a beauty-enhancing product, clothing plays an important role in human's social life. In fact, the key to a proper outfit usually lies in the harmonious clothing matching. Nevertheless, not everyone is good at clothing matching. Fortunately, with the proliferation of fashion-oriented online communities, fashion experts can publicly share their fashion tips by showcasing their outfit compositions, where each fashion item (e.g., a top or bottom) usually has an image and context metadata (e.g., title and category). Such rich fashion data offer us a new opportunity to investigate the code in clothing matching. However, challenges co-exist with opportunities. The first challenge lies in the complicated factors, such as color, material and shape, that affect the compatibility of fashion items. Second, as each fashion item involves multiple modalities (i.e., image and text), how to cope with the heterogeneous multi-modal data also poses a great challenge. Third, our pilot study shows that the composition relation between fashion items is rather sparse, which makes traditional matrix factorization methods not applicable. Towards this end, in this work, we propose a content-based neural scheme to model the compatibility between fashion items based on the Bayesian personalized ranking (BPR) framework. The scheme is able to jointly model the coherent relation between modalities of items and their implicit matching preference. Experiments verify the effectiveness of our scheme, and we deliver deep insights that can benefit future research.
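
The Bayesian personalized ranking (BPR) framework mentioned above trains a model so that an observed (compatible) pair scores higher than an unobserved one. The sketch below shows only the standard pairwise BPR loss. It assumes items are already represented as latent vectors and uses a dot product as the compatibility score; both are illustrative assumptions and not the paper's network.

```python
import math
from typing import Sequence


def compatibility(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed compatibility score: dot product of two latent item vectors."""
    return sum(x * y for x, y in zip(a, b))


def bpr_loss(top: Sequence[float],
             pos_bottom: Sequence[float],
             neg_bottom: Sequence[float]) -> float:
    """Pairwise BPR loss -log(sigmoid(m_pos - m_neg)), computed in a stable form."""
    margin = compatibility(top, pos_bottom) - compatibility(top, neg_bottom)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical latent vectors for one top and two candidate bottoms.
    top = [1.0, 0.0]
    matched_bottom = [2.0, 0.0]
    unmatched_bottom = [0.5, 0.0]
    print(bpr_loss(top, matched_bottom, unmatched_bottom))  # below log(2)
    print(bpr_loss(top, unmatched_bottom, matched_bottom))  # above log(2)
```

The loss depends only on the difference between the two scores, so it needs no explicit negative labels, which suits sparse implicit data such as shared outfit compositions.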

Abstract:
Exploiting the temporal dependency among video frames or subshots is very important for the task of video summarization. In practice, RNNs are good at modeling temporal dependency and have achieved strong performance in many video-based tasks, such as video captioning and classification. However, RNNs are not capable enough to handle the video summarization task, since traditional RNNs, including LSTM, can only deal with short videos, while the videos in the summarization task are usually longer. To address this problem, we propose a hierarchical recurrent neural network for video summarization, called H-RNN in this paper. Specifically, it has two layers, where the first layer is utilized to encode short video subshots cut from the original video, and the final hidden state of each subshot is input to the second layer to calculate its confidence of being a key subshot. Compared to traditional RNNs, H-RNN is more suitable for video summarization, since it can exploit long temporal dependencies among frames while significantly reducing the number of computation operations. The results on two popular datasets, the Combined dataset and the VTW dataset, demonstrate that the proposed H-RNN outperforms the state-of-the-art methods.

Abstract:
This paper proposes a novel deep framework of multi-network joint learning for large-scale cross-modal retrieval. Most existing cross-modal methods do not take memory requirements into account during training and testing. Hence, they are generally implemented on small-scale data. Moreover, they take feature learning and latent space embedding as two separate steps, which cannot generate features specific to the cross-modal task. To alleviate these problems, we first decompose the multiplication and inversion of some big matrices, usually involved in existing methods, into those of many sub-matrices. Each sub-matrix is targeted to handle one image-sentence pair, for which we further design a novel sampling strategy to select the most representative samples to construct the cross-modal ranking loss and within-modal discriminant loss functions. In this way, the proposed model consumes less memory each time, such that it can scale to large-scale data. Furthermore, we apply the proposed discriminative ranking loss to effectively unify two heterogeneous networks, a deep residual network for images and a long short-term memory network for sentences, into an end-to-end deep learning architecture. Finally, we can simultaneously obtain features specific to the cross-modal task and learn a shared latent space for images and sentences. Extensive evaluations on two large-scale cross-modal datasets show that the proposed method brings substantial improvements over other state-of-the-art ranking methods.

Abstract:
In deep learning scenarios, a lot of labeled samples are needed to train the models. However, in practical applications, since the objects to be recognized are complex and non-uniformly distributed, it is difficult to get enough labeled samples at one time. Active learning can improve accuracy with fewer training labels, which makes it one of the promising solutions to this problem. Inspired by the human cognitive process of acquiring additional knowledge gradually, we propose a novel deep active learning method through Cognitive Information Parcels (CIPs) based on the analysis of the model's cognitive errors and the expert's instruction. The transformation of the cognitive parcels is defined, and the corresponding representation feature of the objects is obtained to identify the model's cognitive error information. Experiments show that the samples selected based on the CIPs can benefit target recognition and boost the deep model's performance efficiently. The characterization of cognitive knowledge can effectively prevent other samples from disturbing the cognitive property of the model. We believe that our work could provide a line of thought about the use of cognitive knowledge in the deep learning field.

Abstract:
Like the traditional long videos, micro-videos are the unity of textual, acoustic, and visual modalities. These modalities sequentially tell a real-life event from distinct angles. Yet, unlike the traditional long videos with rich content, micro-videos are very short, lasting for 6-15 seconds, and they hence usually convey one or a few high-level concepts. In the light of this, we have to characterize and jointly model the sparseness and multiple sequential structures for better micro-video understanding. To accomplish this, in this paper, we present an end-to-end deep learning model, which packs three parallel LSTMs to capture the sequential structures and a convolutional neural network to learn the sparse concept-level representations of micro-videos. We applied our model to the application of micro-video categorization. Besides, we constructed a real-world dataset for sequence modeling and released it to facilitate other researchers. Experimental results demonstrate that our model yields better performance than several state-of-the-art baselines.

Abstract:
Fine-grained object recognition is challenging due to large intra-class variation and inter-class ambiguity. A good algorithm should be able to: 1) discover discriminative local details and 2) align and aggregate these local discriminative patch-level features in an effective way to facilitate object level classification. Towards this end, we propose a novel local feature discovery, discriminative alignment and aggregation framework, inspired by the recent success of deep recurrent attention model. First, we develop a novel attribute-guided attentive network to sequentially discover informative parts/regions, by seeking a good registration between attentive regions and predefined object attributes. This could be considered as a semantic guided salient region discovery and alignment network, which might be more robust than conventional attention model. Second, these discovered regions are actively and progressively fed into a recurrent neural network, to yield the object-level representation. This could be considered as a discriminant aggregation network and informative patch-level features are propagated and accumulated to the deeper nodes of the recurrent network for final classification. We extensively test our framework on two fine-grained image benchmarks and the results demonstrate the effectiveness of the proposed framework.

Abstract:
Thanks to the recent developments of Convolutional Neural Networks, the performance of face verification methods has increased rapidly. In a typical face verification method, feature normalization is a critical step for boosting performance. This motivates us to introduce and study the effect of normalization during training. But we find this is non-trivial, despite normalization being differentiable. We identify and study four issues related to normalization through mathematical analysis, which yields understanding and helps with parameter settings. Based on this analysis we propose two strategies for training using normalized features. The first is a modification of softmax loss, which optimizes cosine similarity instead of inner-product. The second is a reformulation of metric learning by introducing an agent vector for each class. We show that both strategies, and small variants, consistently improve performance by between 0.2% and 0.4% on the LFW dataset based on two models. This is significant because the performance of the two models on the LFW dataset is close to saturation at over 98%.
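To make the first strategy concrete, the following is a minimal sketch of a softmax cross-entropy computed on cosine similarities rather than inner products. It is an illustration under assumptions, not the authors' implementation: the `scale` factor, its default value, and the function names are hypothetical.

```python
import math

def l2_normalize(v, eps=1e-12):
    # Divide a vector by its Euclidean norm; eps guards against division by zero.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def cosine_softmax_loss(feature, class_weights, label, scale=10.0):
    # Logits are scaled cosine similarities between the feature and each class vector.
    # `scale` is an assumed hyperparameter that widens the range of the logits.
    f = l2_normalize(feature)
    logits = [scale * sum(a * b for a, b in zip(f, l2_normalize(w)))
              for w in class_weights]
    # Numerically stable log-sum-exp, then the usual cross-entropy for the true label.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]
```

Because both the feature and the class vectors are normalized, rescaling the feature (for example, `[3.0, 0.1]` versus `[30.0, 1.0]`) leaves the loss essentially unchanged, which is the property that distinguishes this variant from an inner-product softmax.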

Abstract:
Many commercial video players rely on a bitrate adaptation algorithm to adapt the video bitrate to dynamic network conditions. To achieve a high quality of experience, a bitrate adaptation algorithm is required to strike a balance between response agility and video quality stability. Existing online algorithms select bitrates according to instantaneous throughput and buffer occupancy, achieving an agile reaction to changes but inducing video quality fluctuations due to the highly dynamic reference signals. In this paper, the idea of multi-step prediction is proposed to guide a better tradeoff, and the bitrate selection is formulated as a predictive control problem. On this basis, a generalized predictive control based approach is developed to calculate the optimal bitrate by minimizing the cost function over a moving look-ahead horizon. Finally, the proposed algorithm is implemented on a reference video player, with performance evaluations conducted using realistic bandwidth traces. Experimental results show that the multi-step predictive control adaptation algorithm can achieve zero rebuffering events and a 63.3% reduction in bitrate switches.
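The look-ahead idea can be sketched as a receding-horizon search. Note that this is an exhaustive search over candidate plans, not the generalized predictive control formulation of the paper; the cost terms, the penalty weights, the buffer model, and the name `plan_bitrate` are all assumptions made for illustration.

```python
from itertools import product

def plan_bitrate(bitrates, predicted_throughput, buffer_s, last_bitrate,
                 chunk_s=2.0, rebuffer_penalty=10.0, switch_penalty=1.0):
    # Enumerate every bitrate plan over the horizon and keep the cheapest one.
    # Bitrates and throughput share one unit (e.g. Mbps); buffer and chunks are in seconds.
    best_cost, best_plan = float("inf"), None
    horizon = len(predicted_throughput)
    for plan in product(bitrates, repeat=horizon):
        cost, buf, prev = 0.0, buffer_s, last_bitrate
        for rate, tput in zip(plan, predicted_throughput):
            download_s = rate * chunk_s / tput      # time needed to fetch one chunk
            stall = max(0.0, download_s - buf)      # playback stalls if the buffer drains
            buf = max(0.0, buf - download_s) + chunk_s
            # Assumed cost: penalize stalls and switches, reward higher bitrate.
            cost += rebuffer_penalty * stall + switch_penalty * abs(rate - prev) - rate
            prev = rate
        if cost < best_cost:
            best_cost, best_plan = cost, plan
    # Only the first decision is applied; the plan is recomputed for the next chunk.
    return best_plan[0], best_plan
```

With ample predicted throughput the search keeps the highest bitrate, while with a predicted throughput below the lowest bitrate it falls back to that lowest bitrate to limit stalls. The search cost grows exponentially with the horizon, so this form is only practical for short horizons.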

Abstract:
Unlike traditional long videos, micro-videos are much shorter and usually recorded at a specific place with mobile devices. To better understand the semantics of a micro-video and facilitate downstream applications, it is crucial to estimate the venue where the micro-video is recorded, for example, in a concert or on a beach. However, according to our statistics over two million micro-videos, only 1.22% of them were labeled with location information. For the remaining large number of micro-videos without location information, we have to rely on their content to estimate their venue categories. This is a highly challenging task, as micro-videos are naturally multi-modal (with textual, visual, and acoustic content), and more importantly, the quality of each modality varies greatly for different micro-videos.

Abstract:
This paper aims to generate a better representation of visual arts, which plays a key role in visual arts analysis. Museums and galleries hold a large number of artworks in their databases. Hiring art experts to analyze them (e.g., classification, annotation) is time-consuming and expensive, and the analytic results are not stable because they depend heavily on the experience of the art experts. The problem of generating a better representation of visual arts is of great interest to us because of its application potential and interesting research challenges: both the content information and the unique style information within an artwork should be summarized and learned when generating the representation. For example, by studying a vast number of artworks, art experts summarize and refine their knowledge of the unique characteristics of each kind of visual art to carry out analysis, which is non-trivial for a computer. In this paper, we present a unified framework, called DeepArt, to learn joint representations that can simultaneously capture the content and style of visual arts. This framework learns unique characteristics of visual arts directly from a large-scale visual arts dataset, and it is more flexible and accurate than traditional handcrafted approaches. We also introduce Art500k, a large-scale visual arts dataset containing over 500,000 artworks, which are annotated with detailed labels of artist, art movement, genre, etc. Extensive empirical studies and evaluations based on our framework and Art500k demonstrate the superiority of our framework and the usefulness of Art500k. A practical system for visual arts retrieval and annotation is implemented based on our framework and dataset. Code, data and system are publicly available at http://deepart.ece.ust.hk.

Abstract:
In this paper, we introduce NUBOMEDIA, an open source elastic cloud Platform as a Service (PaaS) specifically designed for real-time interactive multimedia and WebRTC services. NUBOMEDIA exposes its capabilities through simple Application Programming Interfaces (APIs), making it possible to deploy and execute developers' applications. To that end, NUBOMEDIA combines the simplicity and ease of development of API services with the flexibility of PaaS infrastructures. Once an application is implemented, developers just need to deploy it on top of NUBOMEDIA, which provides elasticity as a service and reliable communication.

Abstract:
In order to support users in the tagging process and recommendation, we have proposed two tag ranking algorithms, Document Frequency-Weights from regression and Folk Popularity Rank, which can extract tags that greatly influence popularity. We have developed a tag recommendation system using the proposed algorithms. The recommended tags serve not only as appropriate annotations but also to boost popularity.

Abstract:
Simultaneous localization and mapping (SLAM) via a monocular camera is a key enabling technique for many augmented reality (AR) applications. In this work, we present a monocular SLAM system which can provide real-time dense mapping even for challenging poorly-textured regions based on the piecewise planarity approximation. Specifically, our system consists of three modules. First, a tracking module based on the direct method [3] continuously estimates camera poses with respect to the scene. Second, a semi-dense mapping module takes the estimated camera pose as input and calculates depths of highly-textured pixels based on pixel matching and triangulation. Third, a dense mapping module approximates textureless regions identified by a homogeneous-color region detector using piecewise plane models. The 3D piecewise planes are reconstructed via the proposed multi-plane segmentation and multi-plane fusion algorithms. Live experiments in a real AR demo with a hand-held camera demonstrate the effectiveness and efficiency of our method in practical scenarios.

Abstract:
This demo showcases a real-time visualisation displaying the level of engagement of a group of people attending a jazz concert. Based on wearable sensor technology and machine learning principles, we present how this visualisation for enhancing events was developed following a user-centric approach. We describe the process of running an experiment using our custom physiological sensor platform, gathering requirements for the visualisation and finally implementing said visualisation. The end result is a collaborative artwork that enhances people's immersion in cultural events.

Abstract:
Face aging, also known as age progression, is attracting more and more research interest. It has plenty of applications in various domains, including cross-age face recognition, finding lost children, and entertainment. In recent years, face aging has witnessed various breakthroughs and a number of face aging models have been proposed. Face aging, however, is still a very challenging task in practice for various reasons. First, faces may have many different expressions and lighting conditions, which pose great challenges to modeling the aging patterns. Besides, the training data are usually very limited, and the face images for the same person only cover a narrow range of ages.

Abstract:
We present a smart audio guide that adapts itself to the environment the user is navigating. The system automatically builds a point-of-interest database using Wikipedia and Google APIs as sources. We rely on a computer vision system to overcome likely sensor limitations and to determine with high accuracy whether the user is facing a certain landmark or none at all. Thanks to this, the guide presents audio descriptions at the most appropriate moment without any user intervention, using text-to-speech to augment the experience.

Abstract:
We introduce a novel multi-modal system for auto-curating golf highlights that fuses information from players' reactions (celebration actions), spectators (crowd cheering), and the commentator (tone of voice and word analysis) to determine the most interesting moments of a game. The start of a highlight is determined with additional metadata (player's name and the hole number), allowing personalized content summarization and retrieval. Our system was demonstrated at Masters 2017, a major golf tournament, generating real-time highlights from four live video streams over four days.

Abstract:
Beautifying female facial images usually requires professional editing software, which is relatively difficult for common users. In this demo, we introduce a practical system for automatic and personalized facial makeup recommendation and synthesis. First, a model describing the relations among facial features, facial attributes and makeup attributes is learned as the makeup recommendation model for suggesting the most suitable makeup attributes. Then the recommended makeup attributes are seamlessly synthesized onto the input facial image.

Abstract:
For sketch-based image retrieval (SBIR), we propose a generative adversarial network trained on a large number of sketches and their corresponding real images. To imitate the human search process, we attempt to match candidate images with the imaginary image in the user's mind instead of the sketch query, i.e., not only the shape information of sketches but also their possible content information is considered in SBIR. Specifically, a conditional generative adversarial network (cGAN) is employed to enrich the content information of sketches and recover the imaginary images, and two VGG-based encoders, which work on real and imaginary images respectively, are used to constrain their perceptual consistency from the view of feature representations. During SBIR, we first generate an imaginary image from a given sketch via cGAN, and then take the output of the learned encoder for imaginary images as the feature of the query sketch. Finally, we build an interactive SBIR system that shows encouraging performance.

Abstract:
Outlier detection is a crucial part of robust evaluation for crowdsourceable assessment of Quality of Experience (QoE) and has attracted much attention in recent years. In this paper, we propose some simple and fast algorithms for outlier detection and robust QoE evaluation based on the nonconvex optimization principle. Several iterative procedures are designed with or without knowing the number of outliers in samples. Theoretical analysis is given to show that such procedures can reach statistically good estimates under mild conditions. Finally, experimental results with simulated and real-world crowdsourcing datasets show that the proposed algorithms can produce performance similar to that of the Huber-LASSO approach in robust ranking, yet with nearly 8 or 90 times speed-up, without or with prior knowledge of the sparsity size of outliers, respectively. Therefore, the proposed methodology provides a set of helpful tools for robust QoE evaluation with crowdsourcing data.
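As a rough illustration of an iterative procedure that assumes the number of outliers is known, the sketch below alternates between estimating a location parameter and hard-thresholding the residuals. It is a simplified analogue on a one-dimensional mean, not the paper's algorithm for paired-comparison QoE data; the model `y_i = theta + gamma_i + noise` and the name `robust_mean` are assumptions.

```python
def robust_mean(samples, num_outliers, iterations=50):
    # Model (assumed): y_i = theta + gamma_i + noise, with gamma sparse (nonzero only on outliers).
    n = len(samples)
    gamma = [0.0] * n
    theta = sum(samples) / n
    for _ in range(iterations):
        # Step 1: re-estimate theta after subtracting the current outlier corrections.
        theta = sum(y - g for y, g in zip(samples, gamma)) / n
        # Step 2: hard thresholding, keeping only the num_outliers largest residuals.
        residuals = [y - theta for y in samples]
        largest = sorted(range(n), key=lambda i: abs(residuals[i]),
                         reverse=True)[:num_outliers]
        gamma = [0.0] * n
        for i in largest:
            gamma[i] = residuals[i]
    outliers = sorted(i for i in range(n) if gamma[i] != 0.0)
    return theta, outliers

theta, outliers = robust_mean([1.0, 1.2, 0.9, 1.1, 9.0, 1.0, -7.0], num_outliers=2)
```

On this toy input the two extreme values (indices 4 and 6) are flagged, and `theta` approaches the mean of the remaining five samples.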

Abstract:
Phantom limb pain, or simply phantom pain, is a severe chronic pain that is experienced as a vivid sensation of pain in a missing body part. Epidemiological studies based on large samples indicate that the short-term incidence rate of phantom limb pain is 72% [13], while the long-term incidence rate (6 months after amputation) is 67% [5, 13]. The wide spectrum of treatments developed for alleviating phantom limb pain includes the traditional mirror box therapy as well as recently developed virtual reality-based methods. Most of the virtual reality-based methods rely on 3D CAD models of the virtual limb, animating them using motion data acquired either from the patient's existing anatomical limb or from myoelectric activity at the patient's stump (of the amputated limb). Since motion activity is typically captured using body sensors (electromyography (EMG) or inertial sensors), these methods are considered invasive approaches. Further, in the case of virtual reality-based methods, the dependency on pre-built 3D models degrades the immersive experience due to mismatches in skin color and clothes, an artificial and rigid look, and misalignment of the phantom limb.

Abstract:
Convolutional Neural Networks (CNNs) are a very powerful approach to extracting discriminative local descriptors for effective image search. Recent work adopts fine-tuning strategies to further improve the discriminative power of the descriptors. Taking a different approach, in this paper, we propose a novel framework to achieve competitive retrieval performance. Firstly, we propose various masking schemes, namely SIFT-mask, SUM-mask, and MAX-mask, to select a representative subset of local convolutional features and remove a large number of redundant features. We demonstrate that this can effectively address the burstiness issue and improve retrieval accuracy. Secondly, we propose to employ recent embedding and aggregating methods to further enhance feature discriminability. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art retrieval accuracy.

Abstract:
Food is rich in visible (e.g., colour, shape) and procedural (e.g., cutting, cooking) attributes. Proper leveraging of these attributes, particularly the interplay among ingredients, cutting and cooking methods, for health-related applications has not been previously explored. This paper investigates cross-modal retrieval of recipes, specifically to retrieve a text-based recipe given a food picture as query. As a similar ingredient composition can end up as wildly different dishes depending on the cooking and cutting procedures, the difficulty of retrieval originates from fine-grained recognition of rich attributes from pictures. With a multi-task deep learning model, this paper provides insights on the feasibility of predicting ingredient, cutting and cooking attributes for food recognition and recipe retrieval. In addition, localization of ingredient regions is also possible even when region-level training examples are not provided. Experimental results validate the merit of rich attributes compared to recently proposed ingredient-only retrieval techniques.

Abstract:
Image diffusion plays a fundamental role in the task of image denoising. The recently proposed trainable nonlinear reaction diffusion (TNRD) model defines a simple but very effective framework for image denoising. However, as the TNRD model is a local model, whose diffusion behavior is purely controlled by information from local patches, it is prone to creating artifacts in homogeneous regions and over-smoothing highly textured regions, especially in the case of strong noise levels. Meanwhile, it is widely known that the non-local self-similarity (NSS) prior stands as an effective image prior for image denoising, which has been widely exploited in many non-local methods. In this work, we are motivated to embed the NSS prior into the TNRD model to tackle its weaknesses. In order to preserve the expected property that end-to-end training remains available, we exploit the NSS prior by defining a set of non-local filters, and derive our proposed trainable non-local reaction diffusion (TNLRD) model for image denoising. Together with the local filters and influence functions, the non-local filters are learned by employing loss-specific training. The experimental results show that the trained TNLRD model produces visually plausible recovered images with more textures and fewer artifacts, compared to its local versions. Moreover, the trained TNLRD model can achieve strongly competitive performance against recent state-of-the-art image denoising methods in terms of peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM).
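For readers unfamiliar with the first evaluation metric, PSNR is conventionally defined as 10 log10(MAX^2 / MSE). The sketch below computes it for flattened pixel lists; the function name and the default peak value of 255 (8-bit images) are assumptions, and SSIM is omitted because it needs considerably more machinery.

```python
import math

def psnr(reference, estimate, peak=255.0):
    # Peak signal-to-noise ratio in dB between two equally sized, flattened images.
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / mse)
```

Higher values indicate a denoised image closer to the reference; a mean squared error of 1 on 8-bit data corresponds to roughly 48.13 dB.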

Abstract:
This paper presents the method that underlies our submission to the popularity prediction task of the Social Media Prediction Challenge 2017. The task is designed to predict the impact of sharing different posts for a publisher on social media. There are many factors that influence image popularity; these include not only the visual features of the image, but also the social features, such as user characteristics of its poster and even the upload time. In this project, we propose a fast and effective framework for popularity prediction. First, we investigate and extract visual and social features of images. For the visual features, we introduce 1) global feature descriptors, such as Local Binary Pattern and Color Names, 2) local feature descriptors, such as Local Maximal Occurrence, and 3) deep features. For the social features, we adopt user features (average views, group count, and member count), post features (title length, description length, and tag count), and time features (month, weekday, day, and hour). Furthermore, we feed a fusion of multiple features to Linear Regression, Matrix Factorization based on Time and feature Cluster, and Support Vector Regression models respectively, and present a comparative analysis of the prediction results. Finally, we choose the best model to predict the popularity scores of the test images. Experimental results demonstrate that our method can achieve 0.8581, 1.4062 and 0.8625 in terms of Spearman Ranking Correlation, Mean Absolute Error, and Mean Squared Error, respectively.
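The three reported metrics follow standard definitions, sketched below with the standard library only. The function names are illustrative, and ties in the Spearman computation are handled with average ranks, which is one common convention and an assumption here.

```python
import math

def average_ranks(values):
    # Rank values from 1..n, giving tied values the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(truth, predicted):
    # Spearman correlation = Pearson correlation computed on the ranks.
    ra, rb = average_ranks(truth), average_ranks(predicted)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)

def mean_absolute_error(truth, predicted):
    return sum(abs(t - p) for t, p in zip(truth, predicted)) / len(truth)

def mean_squared_error(truth, predicted):
    return sum((t - p) ** 2 for t, p in zip(truth, predicted)) / len(truth)
```

Spearman correlation rewards getting the popularity ordering right regardless of scale, whereas the two error metrics measure how far the predicted scores are from the true ones.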

Abstract:
This paper gives an overview of the First International Workshop on Multimedia Verification, organized as part of the 2017 ACM Multimedia Conference. The paper outlines the current verification scene and needs, discusses the goals of the workshop, and presents the workshop's program, consisting of two invited keynote talks and three presentations of full papers that have been accepted at the workshop.

Abstract:
Training deep learning based video classifiers for action recognition requires a large amount of labeled videos. The labeling process is labor-intensive and time-consuming. On the other hand, a large number of weakly-labeled images are uploaded to the Internet by users every day. To harness the rich and highly diverse set of Web images, a scalable approach is to crawl these images to train deep learning based classifiers, such as Convolutional Neural Networks (CNNs). However, due to the domain shift problem, the performance of deep classifiers trained on Web images tends to degrade when they are directly deployed to videos. One way to address this problem is to fine-tune the trained models on videos, but a sufficient amount of annotated videos is still required. In this work, we propose a novel approach to transfer knowledge from the image domain to the video domain. The proposed method can adapt to the target domain (i.e. video data) with a limited amount of training data. Our method maps the video frames into a low-dimensional feature space using the class-discriminative spatial attention map for CNNs. We design a novel Siamese EnergyNet structure to learn energy functions on the attention maps by jointly optimizing two loss functions, such that the attention map corresponding to a ground truth concept would have higher energy. We conduct extensive experiments on two challenging video recognition datasets (i.e. TVHI and UCF101), and demonstrate the efficacy of our proposed method.

Abstract:
This paper presents a unified framework to learn to quantify perceptual attributes (e.g., safety, attractiveness) of physical urban environments using crowd-sourced street-view photos without human annotations. The contributions of this work are twofold. First, we collect a large-scale urban image dataset in multiple major cities in the U.S.A., which consists of multiple street-view photos for every place. Instead of using subjective annotations as in previous works, which are neither accurate nor consistent, we collect for every place the safety score from the government's crime event records as an objective safety indicator. Second, we observe that the place-centric perception task is by nature a multi-instance regression problem, since the labels are only available for places (bags), rather than images or image regions (instances). We thus introduce a deep convolutional neural network (CNN) to parameterize the instance-level scoring function, and develop an EM algorithm that alternately estimates the primary instances (images or image regions) which affect the safety scores and trains the proposed network. Our method is capable of localizing interesting images and image regions for each place. We evaluate the proposed method on a newly created dataset and a public dataset. Comparative results show that our method clearly outperforms the alternative perception methods and, more importantly, is capable of generating region-level safety scores to facilitate interpretation of the perception process.

Abstract:
Live streaming video presents new challenges for retrieval and content understanding. Its live nature means that video representations should be relevant to current content, and not necessarily to past content. We investigate retrieval of previously unseen queries for live video content. Drawing from existing whole-video techniques, we focus on adapting image-trained semantic models to the video domain. We introduce the use of future frame representations as a supervision signal for learning temporally aware semantic representations on unlabeled video data. Additionally, we introduce an approach for broadening a query's representation within a pre-constructed semantic space, with the aim of increasing overlap between embedded visual semantics and the query semantics. We demonstrate the efficacy of these contributions for unseen query retrieval on live videos. We further explore their applicability to tasks such as no-example whole-video action classification and no-example live video action prediction, and demonstrate state-of-the-art results.

Abstract:
In cross-domain recommendation, data sparsity becomes more and more serious when the ratings are expressed numerically, e.g., 5-star grades. In this work, we focus on borrowing the knowledge from other domains in the form of binary ratings, such as likes and dislikes for certain items. Most existing works conventionally assume that multiple domains share some common latent information across users and items. In practice, however, the related domains not only share the common latent features of users and items, but also share some knowledge of rating patterns. Furthermore, conventional methods do not consider the hierarchical structures (i.e., genre, sub-genre, detailed category) in real-world recommendation systems. In this paper, we propose a novel Deep Low-rank Sparse Collective Factorization (DLSCF) to facilitate cross-domain recommendation. Specifically, the low-rank sparse decomposition is adopted to capture the most shared rating patterns with a low-rank constraint while integrating the domain-specific patterns with a group-sparse scheme. Furthermore, we factorize the rating pattern matrix in multiple layers to obtain the user/item latent category affiliation matrices, which could indicate the affiliation relation between latent categories and latent sub-categories. Experimental results on the MoviePilot and Netflix datasets demonstrate the effectiveness of our proposed algorithm at various sparsity levels, by comparing it with several state-of-the-art approaches.

Abstract:
Automated assessment of visual sentiment has many applications, such as monitoring social media and facilitating online advertising. In current research on automated visual sentiment assessment, images are mainly input and processed as a whole. However, human attention is biased, and a focal region with high acuity can disproportionately influence visual sentiment. To investigate how attention influences visual sentiment, we conducted experiments that reveal critical insights into human perception. We discover that negative sentiments are elicited by the focal region without a notable influence of contextual information, whereas positive sentiments are influenced by both focal and contextual information. Building on these insights, we create new deep convolutional neural networks for sentiment prediction that have additional channels devoted to encoding focal information. On two benchmark datasets, the proposed models demonstrate superior performance compared with the state-of-the-art methods. Extensive visualizations and statistical analyses indicate that the focal channels are more effective on images with focal objects, especially for images that also elicit negative sentiments.

Abstract:
Weighted patch representation of the target object has been proven to be effective for suppressing the background effects in visual tracking. In this paper, we propose a novel approach, called spatially Regularized Graph Learning (ReGLe), to automatically explore the intrinsic relationship among patches with both global and local cues for robust object representation. In particular, the target object bounding box is partitioned into a set of non-overlapping image patches, which are taken as graph nodes, and each of them is associated with a weight to represent how likely it is to belong to the target object. To improve the accuracy of node weight computation, we dynamically learn the edge weights (i.e., the appearance compatibility of two nodes) according to both global and local relationships among patches. First, we pursue the low-rank representation for capturing the global low-dimensional subspace structure of patches. Second, we encode the local information into the low-rank representation by exploiting the fact that neighboring nodes usually have similar appearance. Finally, we utilize the representations to learn their affinities (i.e., graph edge weights). The node and edge weights are jointly optimized by a designed ADMM (Alternating Direction Method of Multipliers) algorithm, and the object feature representation is updated by imposing the weights of patches on the extracted image features. The object location is finally predicted by maximizing the classification score in the structured SVM. Extensive experiments demonstrate the effectiveness of the proposed approach on the tracking benchmark datasets OTB100 and Temple-Color.

Abstract:
Current work on image emotion recognition mainly classifies images into one dominant emotion category, or regresses images to average dimension values, by assuming that the emotions perceived by different viewers closely agree with each other. However, due to the influence of various personal and situational factors, such as cultural background and social interactions, different viewers may react to the same image in totally different ways from the emotional perspective. In this paper, we propose to formulate the image emotion recognition task as a probability distribution learning problem. Motivated by the fact that image emotions can be conveyed through different visual features, such as aesthetics and semantics, we present a novel framework by fusing multi-modal features to tackle this problem. In detail, a weighted multi-modal conditional probability neural network (WMMCPNN) is designed as the learning model to associate the visual features with emotion probabilities. By jointly exploring the complementarity and learning the optimal combination coefficients of different modality features, WMMCPNN could effectively utilize the representation ability of each uni-modal feature. We conduct extensive experiments on three publicly available benchmarks and the results demonstrate that the proposed method significantly outperforms the state-of-the-art approaches for emotion distribution prediction.

Abstract:
Knowledge representation learning (KRL) encodes enormous structured information with entities and relations into a continuous low-dimensional semantic space. Most conventional methods focus solely on learning knowledge representation from a single modality and neglect the complementary information from others. The increasingly rich multi-modal data available on the Internet also drives us to explore a novel multi-modal approach to KRL that overcomes the limitations of previous single-modal methods. This paper proposes a novel multi-modal knowledge representation learning (MM-KRL) framework which attempts to handle knowledge from both textual and visual web data. It consists of two stages, i.e., webly-supervised multi-modal relationship mining, and bi-enhanced cross-modal knowledge representation learning. Compared with existing knowledge representation methods, our framework has several advantages: (1) It can effectively and automatically mine multi-modal knowledge with structured textual and visual relationships from the web. (2) It is able to learn a common knowledge space which is independent of both task and modality by the proposed Bi-enhanced Cross-modal Deep Neural Network (BC-DNN). (3) It has the ability to represent unseen multi-modal relationships by transferring the learned knowledge with isolated seen entities and relations into unseen relationships. We build a large-scale multi-modal relationship dataset (MMR-D), and the experimental results show that our framework achieves excellent performance in zero-shot multi-modal retrieval and visual relationship recognition.

Abstract:
Automatic generation of video captions is a challenging task, as video is an information-intensive medium with complex variations. Most existing methods, either based on language templates or sequence learning, have treated video as a flat data sequence while ignoring its intrinsically multimodal nature. Observing that different modalities (e.g., frame, motion, and audio streams), as well as the elements within each modality, contribute differently to sentence generation, we present a novel deep framework to boost video captioning by learning Multimodal Attention Long-Short Term Memory networks (MA-LSTM). Our proposed MA-LSTM fully exploits both multimodal streams and temporal attention to selectively focus on specific elements during sentence generation. Moreover, we design a novel child-sum fusion unit in the MA-LSTM to effectively combine different encoded modalities into the initial decoding states. Different from existing approaches that employ the same LSTM structure for different modalities, we train modality-specific LSTMs to capture the intrinsic representations of individual modalities. The experiments on two benchmark datasets (MSVD and MSR-VTT) show that our MA-LSTM significantly outperforms the state-of-the-art methods, with 52.3 BLEU@4 and 70.4 CIDEr-D on the MSVD dataset.

Abstract:
Profilio Company is a startup in its early business development stage that has developed a profiling solution for the field of paid social media advertising. In particular, the solution is designed for the enrichment of Customer Relationship Management data and for the segmentation of customer audiences. Three different Proofs of Concept with different clients have shown that the solution reduces the costs of paid social media advertising in different settings and with different advertising targets, especially when starting from large audiences. In this paper we report the details of Profilio's business idea, the development of Profilio's technologies and the results of the Proofs of Concept.

Abstract:
The study of virality and information diffusion is a topic rapidly gaining traction in the computational social sciences. Computer vision and social network analysis research have also focused on understanding the impact of content and information diffusion in making content viral, but prior approaches have not performed as well as those for other traditional classification tasks. In this paper, we present a novel pairwise reformulation of the virality prediction problem as an attribute prediction task and develop a novel algorithm to model image virality on online media using a pairwise neural network. Our model provides significant insights into the features that are responsible for promoting virality and surpasses the existing state-of-the-art by a 12% average improvement in prediction. We also investigate the effect of external category supervision on relative attribute prediction and observe an increase in prediction accuracy across several attribute learning datasets.

Abstract:
With the recent availability of commodity Virtual Reality (VR) products, immersive video content is receiving significant interest. However, producing high-quality VR content often requires upgrading the entire production pipeline, which is costly and time-consuming. In this work, we propose using video feeds from regular broadcasting cameras to generate immersive content. We utilize the motion of the main camera to generate a wide-angle panorama. Using various techniques, we remove the parallax and align all video feeds. We then overlay parts from each video feed on the main panorama using Poisson blending. We examined our technique on various sports including basketball, ice hockey and volleyball. Subjective studies show that most participants rated their immersive experience when viewing our generated content between Good and Excellent. In addition, most participants rated their sense of presence as similar to that of ground-truth content captured using a GoPro Omni 360 camera rig.

Abstract:
360-degree videos are encoded for adaptive streaming by first projecting the spherical surface onto two-dimensional frames, then encoding these as standard video segments. During playback of these 360-degree videos, the video player renders the portion of the spherical surface in the direction of the user's view. These user viewports typically cover only a small portion of the 360 degree surface, causing much of the downloaded bandwidth to be wasted. Tile-based approaches can reduce the wasted bandwidth by cutting video spatially into motion-constrained rectangles. Streaming logic then only needs to download the tiles necessary to render the viewport seen by the user. Existing tile-based approaches cut 360-degree videos into tiles of fixed sizes. These fixed-size tiling approaches, however, suffer from reduced encoding efficiency. Tiling cuts away portions of the video that can be copied by the encoder from adjacent frames or within the current frame that are needed for effective video compression.
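The fixed-size tiling baseline described above can be sketched as follows: given a viewport, find the grid tiles that must be downloaded. This is an illustrative simplification, not the method of the paper. It assumes an equirectangular frame split into a uniform `cols` x `rows` grid and treats the viewport as an axis-aligned yaw/pitch rectangle, which ignores projection distortion near the poles; the function name and parameters are hypothetical.

```python
import math

def tiles_for_viewport(yaw, pitch, hfov, vfov, cols, rows):
    """Return the set of (row, col) tiles overlapping a viewport (all angles in degrees)."""
    tile_w = 360.0 / cols
    tile_h = 180.0 / rows

    # Vertical extent measured downward from the top edge (pitch = +90).
    y_top = 90.0 - min(90.0, pitch + vfov / 2.0)
    y_bottom = 90.0 - max(-90.0, pitch - vfov / 2.0)
    first_row = int(y_top // tile_h)
    end_row = min(rows, max(first_row + 1, math.ceil(y_bottom / tile_h)))

    # Horizontal extent; yaw wraps around at 360 degrees.
    left = (yaw - hfov / 2.0) % 360.0
    right = left + min(hfov, 360.0)
    first_col = int(left // tile_w)
    end_col = max(first_col + 1, math.ceil(right / tile_w))

    return {(r, c % cols)
            for r in range(first_row, end_row)
            for c in range(first_col, end_col)}

# A 90 x 90 degree viewport looking straight ahead on an 8 x 4 grid.
needed = tiles_for_viewport(yaw=0, pitch=0, hfov=90, vfov=90, cols=8, rows=4)
```

In this toy setting the client fetches 4 of the 32 tiles, which illustrates the bandwidth saving; the encoding-efficiency penalty discussed in the abstract comes from the encoder being unable to predict across tile boundaries.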

Abstract:
The well-established film grammar is often used to change visual and audio elements of videos to evoke audiences' emotional experience. Such film grammar, referred to as domain knowledge, is crucial for affective video content analyses, but has not been thoroughly explored yet. In this paper, we propose a novel method to analyze video affective content through exploring domain knowledge. Specifically, taking visual elements as an example, we first infer probabilistic dependencies between visual elements and emotions from the summarized film grammar. Then, we transfer the domain knowledge as constraints, and formulate affective video content analyses as a constrained optimization problem. Experiments on the LIRIS-ACCEDE database and the DEAP database demonstrate that the proposed affective content analyses method can successfully leverage well-established film grammar for better emotion classification from video content.

Abstract:
With the decreasing price of Head-Mounted Displays (HMDs), 360-degree videos are becoming popular. To provide a highly immersive experience, streaming such videos over the Internet with state-of-the-art streaming architectures requires much more bandwidth than the median user's access bandwidth. To decrease bandwidth consumption while providing high immersion to users, scientists and specialists have proposed to prepare and encode 360-degree videos into quality-variable video versions and to implement viewport-adaptive streaming. Quality-variable versions are different versions of the same video with non-uniformly spread quality: there exist some so-called Quality Emphasized Regions (QERs). With viewport-adaptive streaming, the client, based on head movement prediction, downloads the video version whose high-quality region is closest to where the user will look. In this paper we propose a generic theoretical model to find the optimal set of quality-variable video versions based on traces of head positions of users watching a 360-degree video. We propose extensions to adapt the model to popular quality-variable version implementations such as tiling and offset projection. We then solve a simplified version of the model with two quality levels and restricted shapes for the QER. With this simplified model, we show that an optimal set of four quality-variable video versions prepared by a streaming server, together with perfect head movement prediction, allows for 45% bandwidth savings when displaying video with the same average quality as state-of-the-art solutions, or allows an increase of 102% in displayed quality for the same bandwidth budget.
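The client-side choice in viewport-adaptive streaming can be illustrated with a short sketch: pick the version whose QER center is angularly closest to the predicted head direction. The four QER centers below are hypothetical placeholders, not the optimal set computed in the paper, and the (yaw, pitch) convention in degrees is an assumption.

```python
import math

def angular_distance(a, b):
    """Great-circle angle in degrees between two (yaw, pitch) directions."""
    yaw_a, pitch_a = (math.radians(v) for v in a)
    yaw_b, pitch_b = (math.radians(v) for v in b)
    cos_angle = (math.sin(pitch_a) * math.sin(pitch_b)
                 + math.cos(pitch_a) * math.cos(pitch_b) * math.cos(yaw_a - yaw_b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))

def pick_version(predicted_head, qer_centers):
    """Index of the version whose QER center is nearest the predicted gaze."""
    return min(range(len(qer_centers)),
               key=lambda i: angular_distance(predicted_head, qer_centers[i]))

# Hypothetical set of four versions with QERs spaced evenly along the equator.
QER_CENTERS = [(0, 0), (90, 0), (180, 0), (270, 0)]
choice = pick_version((100, 10), QER_CENTERS)
```

Using the great-circle angle rather than a plain yaw difference keeps the choice correct across the 0/360 degree seam and for gaze directions away from the equator.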

Abstract:
Video consumption is being shifted from sit-and-watch to selective skimming. Existing video player interfaces, however, only provide indirect manipulation to support this emerging behavior. Video summarization alleviates this issue to some extent, shortening a video based on the desired length of a summary as an input variable. But an optimal length of a summarized video is often not available in advance. Moreover, the user cannot edit the summary once it is produced, limiting its practical applications. We argue that video summarization should be an interactive, mixed-initiative process in which users have control over the summarization procedure while algorithms help users achieve their goal via video understanding. In this paper, we introduce ElasticPlay, a mixed-initiative approach that combines an advanced video summarization technique with direct interface manipulation to help users control the video summarization process. Users can specify a time budget for the remaining content while watching a video; our system then immediately updates the playback plan using our proposed cut-and-forward algorithm, determining which parts to skip or to fast-forward. This interactive process allows users to fine-tune the summarization result with immediate feedback. We show that our system outperforms existing video summarization techniques on the TVSum50 dataset. We also report two lab studies (22 participants) and a Mechanical Turk deployment study (60 participants), and show that the participants responded favorably to ElasticPlay.

Abstract:
The spatial relationships among objects provide rich clues to object contexts for visual recognition. In this paper, we propose to learn a Semantic Feature Map (SFM) with deep neural networks to model the spatial object contexts for better understanding of image and video contents. Specifically, we first extract high-level semantic object features from the input image with convolutional neural networks for every object proposal, and organize them into the designed SFM so that spatial information among objects is preserved. To fully exploit the spatial relationships among objects, we employ either Fully Convolutional Networks (FCN) or Long-Short Term Memory (LSTM) on top of the SFM for final recognition. For better training, we also introduce a multi-task learning framework to train the model in an end-to-end manner. It is composed of an overall image classification loss as well as a grid labeling loss, which predicts the object label at each SFM grid cell. Extensive experiments are conducted to verify the effectiveness of the proposed approach. For image classification, very promising results are obtained on the Pascal VOC 2007/2012 and MS-COCO benchmarks. We also directly transfer the SFM learned on the image domain to the video classification task. The results on the CCV benchmark demonstrate the robustness and generalization capability of the proposed approach.

Abstract:
Multi-label active learning for image classification has attracted great attention in recent years, and many relevant works continue to be published. However, some problems remain to be solved: existing multi-label active learning algorithms do not account for the cleanness of sample data, and their approaches to label correlation mining are defective. For one thing, sample data is usually contaminated in reality, which disturbs the estimation of the data distribution and further hinders model training. For another, previous approaches to label relationship exploration are purely based on the observed label distribution of an incomplete training set, which cannot provide sufficiently efficient information. To address these issues, we propose a novel adaptive low-rank multi-label active learning algorithm, called LRMAL. Specifically, we first use low-rank matrix recovery to learn an effective low-rank feature representation from the noisy data. In a subsequent sampling phase, we exploit its advantages to evaluate the general informativeness of each unlabeled example-label pair. Based on an intrinsic mapping relation between the example space and the label space of a given multi-label dataset, we recover the incomplete labels of a training set for more comprehensive label correlation mining. Furthermore, to reduce the redundancy among the selected example-label pairs, we use a diversity measurement to diversify the sampled data. Finally, an effective sampling strategy is developed by integrating these two aspects of potential information with uncertainty based on an adaptive integration scheme. Experimental results demonstrate the effectiveness of our approach.

Abstract:
Subspace representations have been widely applied for videos in many tasks. In particular, the subspace-based query-by-image video retrieval (QBIVR), facing high challenges on similarity-preserving measurements and efficient retrieval schemes, urgently needs considerable research attention. In this paper, we propose a novel subspace-based QBIVR framework to enable efficient video search. We first define a new geometry-preserving distance metric to measure the image-to-video distance, which transforms the QBIVR task to be the Maximum Inner Product Search (MIPS) problem. The merit of this distance metric lies in that it helps to preserve the genuine geometric relationship between query images and database videos to the greatest extent. To boost the efficiency of solving the MIPS problem, we introduce two asymmetric hashing schemes which can bridge the domain gap of images and videos properly. The first approach, termed Inner-product Binary Coding (IBC), achieves high-quality binary codes by learning the binary codes and coding functions simultaneously without continuous relaxations. The other one, Bilinear Binary Coding (BBC) approach, employs compact bilinear projections instead of a single large projection matrix to further improve the retrieval efficiency. Extensive experiments on four real-world video datasets verify the effectiveness of our proposed approaches, as compared to the state-of-the-art methods.
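For reference, the Maximum Inner Product Search (MIPS) problem mentioned above can be stated as an exhaustive baseline in a few lines. This brute-force scan is only a sketch of the problem itself; it is not the IBC or BBC hashing scheme, which exist precisely to avoid scanning every database item. The vectors are hypothetical.

```python
def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mips(query, database):
    """Exhaustive MIPS: index of the database vector with the largest inner product."""
    return max(range(len(database)),
               key=lambda i: inner_product(query, database[i]))

# Hypothetical 2-D embeddings: one query image and three database videos.
query = [1.0, 1.0]
videos = [[1.0, 1.0], [3.0, 3.0], [0.0, 2.0]]
best = mips(query, videos)
```

Note that MIPS is not the same as nearest-neighbor search: here the first video coincides with the query, yet the second video wins because its larger norm yields a larger inner product.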

Abstract:
In this paper we present a cyber-physiotherapy system (CyPhy) that brings daily rehabilitation to the patient's home with supervision from a trained therapist. CyPhy is able to capture and record RGB-D, skeleton, and physiotherapy-related medical sensing data streams from the patient's exercises using multiple cameras and body sensors. With hours of exercises captured every day from multiple cameras for every patient, therapists spend a huge amount of their time watching videos to monitor the correctness of patients' moves. This becomes even more challenging in the presence of multiple cameras, where the therapist might not know which camera stream shows the incorrect motion. In this paper, we explore the multicamera summarization problem from various aspects: (1) We first explore the types of exercises that benefit the most from using multiple cameras; (2) We propose a method to detect incorrect motion from multiple cameras in rehabilitation exercises; (3) We show how the analysis of incorrect motion is used to summarize the video and recommend the camera view that best visualizes the mistake. Our method for detecting incorrect motion achieves more than 92% accuracy over a wide range of thresholds, with a significant improvement of 20% over a single camera and 10% over the closest approach that uses multiple cameras.

Abstract:
With advances of recent technologies, augmented reality systems and autonomous vehicles gained a lot of interest from academics and industry. Both these areas rely on scene geometry understanding, which usually requires depth map estimation. However, in case of systems with limited computational resources, such as smartphones or autonomous robots, high resolution dense depth map estimation may be challenging. In this paper, we study the problem of semi-dense depth map interpolation along with low resolution depth map upsampling. We present an end-to-end learnable residual convolutional neural network architecture that achieves fast interpolation of semi-dense depth maps with different sparse depth distributions: uniform, sparse grid and along intensity image gradient. We also propose a loss function combining classical mean squared error with perceptual loss widely used in intensity image super-resolution and style transfer tasks. We show that with some modifications, this architecture can be used for depth map super-resolution. Finally, we evaluate our results on both synthetic and real data, and consider applications for autonomous vehicles and creating AR/MR video games.

Abstract:
Face analytics benefits many multimedia applications. It consists of a number of tasks, such as facial emotion recognition and face parsing, and most existing approaches generally treat these tasks independently, which limits their deployment in real scenarios. In this paper we propose an integrated Face Analytics Network (iFAN), which is able to perform multiple tasks jointly for face analytics with a novel carefully designed network architecture to fully facilitate the informative interaction among different tasks. The proposed integrated network explicitly models the interactions between tasks so that the correlations between tasks can be fully exploited for performance boost. In addition, to solve the bottleneck of the absence of datasets with comprehensive training data for various tasks, we propose a novel cross-dataset hybrid training strategy. It allows "plug-in and play" of multiple datasets annotated for different tasks without the requirement of a fully labeled common dataset for all the tasks. We experimentally show that the proposed iFAN achieves state-of-the-art performance on multiple face analytics tasks using a single integrated model. Specifically, iFAN achieves an overall F-score of 91.15% on the Helen dataset for face parsing, a normalized mean error of 5.81% on the MTFL dataset for facial landmark localization and an accuracy of 45.73% on the BNU dataset for emotion recognition with a single model.

Abstract:
Metric learning is an important issue in the person verification problem, which aims to determine whether a pair of face or human body images shows the same person. Due to their low running cost, non-iterative statistical inference methods for metric learning are efficient and effective for large-scale datasets and for person verification applications with online updating. The KISSME method is a typical one that constructs the metric based on two assumptions: that the discrepancy spaces of negative pairs and of positive pairs are both Gaussian. However, we find that, in fact, the distribution of discrepancies of positive pairs might tend toward the Laplace distribution rather than the Gaussian distribution. Based on this finding, we propose a metric learning method that exploits Gaussian-Laplace distribution statistical inference, where the Gaussian distribution of negative discrepancies and the Laplace distribution of positive discrepancies are considered together. Experiments conducted on two human body datasets (VIPeR and Market-1501) and one face dataset (LFW) show its superiority in terms of effectiveness and efficiency as compared with the state-of-the-art approaches, regardless of whether the appearance description is handcrafted or deep learned.

Abstract:
Virtual reality and 360-degree video streaming are growing rapidly, yet, streaming high-quality 360-degree video is still challenging due to high bandwidth requirements. Existing solutions reduce bandwidth consumption by streaming high-quality video only for the user's viewport. However, adding the spatial domain (viewport) to the video adaptation space prevents the existing solutions from buffering future video chunks for a duration longer than the interval that user's viewport is predictable. This makes playback more prone to video freezes due to rebuffering, which severely degrades the user's Quality of Experience especially under challenging network conditions. We propose a new method that alleviates the restrictions on buffer duration by utilizing scalable video coding. Our method significantly reduces the occurrence of rebuffering on links with varying bandwidth without compromising playback quality or bandwidth efficiency compared to the existing solutions. We demonstrate the efficiency of our proposed method using experimental results with real world cellular network bandwidth traces.

Abstract:
Interactive segmentation consists in building a pixel-wise partition of an image, into foreground and background regions, with the help of user inputs. Most state-of-the-art algorithms use scribble-based interactions to build foreground and background models, and very few of these works focus on the usability of the scribbling interaction. In this paper we study the outlining interaction, which is very intuitive to non-expert users on touch devices. We present an algorithm, built upon the existing GrabCut algorithm, which infers both foreground and background models from a single outline. We conducted a user study with 20 participants to demonstrate the usability of this interaction, and its performance for the task of interactive segmentation.

Abstract:
The topic diversity of open-domain videos leads to various vocabularies and linguistic expressions in describing video contents, and therefore makes the video captioning task even more challenging. In this paper, we propose a unified caption framework, M&M TGM, which mines multimodal topics from data in an unsupervised fashion and guides the caption decoder with these topics. Compared to pre-defined topics, the mined multimodal topics are more semantically and visually coherent and can better reflect the topic distribution of videos. We formulate the topic-aware caption generation as a multi-task learning problem, in which we add a parallel task, topic prediction, in addition to the caption task. For the topic prediction task, we use the mined topics as the teacher to train a student topic prediction model, which learns to predict the latent topics from multimodal contents of videos. The topic prediction provides intermediate supervision to the learning process. As for the caption task, we propose a novel topic-aware decoder to generate more accurate and detailed video descriptions with the guidance from latent topics. The entire learning procedure is end-to-end and it optimizes both tasks simultaneously. The results from extensive experiments conducted on the MSR-VTT and Youtube2Text datasets demonstrate the effectiveness of our proposed model. M&M TGM not only outperforms prior state-of-the-art methods on multiple evaluation metrics and on both benchmark datasets, but also achieves better generalization ability.

Abstract:
Exploiting multimodal features has become a standard approach towards many video applications, including the video captioning task. One problem with the existing work is that it models the relevance of each type of features evenly, which neutralizes the impact of each individual modality to the word to be generated. In this paper, we propose a novel Modal Attention Network (MANet) to account for this issue. Our MANet extends the standard encoder-decoder network by adapting the attention mechanism to video modalities. As a result, MANet emphasizes the impact of each modality with respect to the word to be generated. Experimental results show that our MANet effectively utilizes multimodal features to generate better video descriptions. Especially, our MANet system was ranked among the top three systems at the 2nd Video to Language Challenge in both automatic metrics and human evaluations.

Abstract:
Generating natural language descriptions for videos (a.k.a. video captioning) has attracted much research attention in recent years, and many models have been proposed to improve caption performance. However, due to the rapid progress in dataset expansion and feature representation, newly proposed caption models have been evaluated under different settings, which makes it unclear whether the contributions come from the features or from the models. Therefore, in this work we aim to gain a deep understanding of "where are we" in the current development of video captioning. First, we carry out extensive experiments to identify the contribution of different components in the video captioning task and make a fair comparison among several state-of-the-art video caption models. Second, we discover that these state-of-the-art models are complementary, so that we can benefit from the "wisdom of the crowd" through ensembling and reranking. Finally, we give a preliminary answer to the question "how far are we from the human-level performance in general" via a series of carefully designed experiments. In summary, our caption models achieve state-of-the-art performance on the MSR-VTT 2017 challenge, which is comparable with the average human-level performance on current caption metrics. However, our analysis also shows that we still have a long way to go, such as further improving the generalization ability of current caption models.

Abstract:
This paper presents a brief summary of the first workshop on Visual Analysis for Smart and Connected Communities (VSCC'2017), which is held in conjunction with the ACM Conference on Multimedia, 2017. VSCC'2017 is the first workshop on visual analysis for smart and connected communities, and is aimed at the creation of a multi-discipline community with the concentration on visual data analysis. The topics addressed in VSCC'2017 cover all aspects of smart communities, including safety, security, retrieval, transportation, information technologies, Internet of Things, etc. The focus is on the development of fundamental theories, algorithms, models towards the parsing of visual data generated by various digital devices (e.g., smart phones, vehicles) in smart communities.

Abstract:
Photo composition is an important factor affecting the aesthetics in photography. However, it is a highly challenging task to model the aesthetic properties of good compositions due to the lack of globally applicable rules to the wide variety of photographic styles. Inspired by the thinking process of photo taking, we formulate the photo composition problem as a view finding process which successively examines pairs of views and determines their aesthetic preferences. We further exploit the rich professional photographs on the web to mine unlimited high-quality ranking samples and demonstrate that an aesthetics-aware deep ranking network can be trained without explicitly modeling any photographic rules. The resulting model is simple and effective in terms of its architectural design and data sampling method. It is also generic since it naturally learns any photographic rules implicitly encoded in professional photographs. The experiments show that the proposed view finding network achieves state-of-the-art performance with sliding window search strategy on two image cropping datasets.
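The pairwise view-comparison idea can be sketched with a margin-based ranking loss and a sliding-window style search over candidate crops. The hinge form of the loss, the margin value and the scoring function below are assumptions for illustration; the abstract does not specify the exact loss or network used.

```python
def pairwise_hinge_loss(score_preferred, score_other, margin=1.0):
    """Zero once the preferred view outscores the other by at least `margin`."""
    return max(0.0, margin - (score_preferred - score_other))

def best_view(candidates, score_fn):
    """Search step: return the candidate crop with the highest score."""
    return max(candidates, key=score_fn)

# Hypothetical candidate crops as (x, y, width, height) boxes.
crops = [(0, 0, 10, 10), (5, 5, 20, 20)]

def toy_score(box):
    """Stand-in for a trained ranking network: simply prefers larger crops."""
    return box[2] * box[3]

chosen = best_view(crops, toy_score)
```

Training on pairs only requires knowing which of two views is better, which is why professional photographs (preferred) and crops derived from them (less preferred) can serve as ranking samples without explicit composition rules.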

Abstract:
In recent years, there has been an astounding pace of advances in the areas of embedded sensors and artificial intelligence technologies. These breakthrough developments are increasingly enabling us to create devices and systems that can sense and understand the world around them. In this keynote, we will highlight some of the current state-of-the-art results in the field of enhancing and augmenting human sensation and perceptual processes with applications based on novel transduction devices and artificial intelligence technologies. The presentation will also highlight the emerging trends in these technologies as well as the associated business impact and opportunities.

Abstract:
Fashion landmarks are functional key points defined on clothes, such as corners of neckline, hemline, and cuff. They have been recently introduced [18] as an effective visual representation for fashion image understanding. However, detecting fashion landmarks is challenging due to background clutter, human poses, and scales. To remove the above variations, previous works usually assumed that bounding boxes of clothes are provided in training and testing as additional annotations, which are expensive to obtain and inapplicable in practice. This work addresses unconstrained fashion landmark detection, where clothing bounding boxes are provided in neither training nor testing. To this end, we present a novel Deep LAndmark Network (DLAN), where bounding boxes and landmarks are jointly estimated and trained iteratively in an end-to-end manner. DLAN contains two dedicated modules: a Selective Dilated Convolution for handling scale discrepancies, and a Hierarchical Recurrent Spatial Transformer for handling background clutter. To evaluate DLAN, we present a large-scale fashion landmark dataset, namely the Unconstrained Landmark Database (ULD), consisting of 30K images. Statistics show that ULD is more challenging than existing datasets in terms of image scales, background clutter, and human poses. Extensive experiments demonstrate the effectiveness of DLAN over the state-of-the-art methods. DLAN also exhibits excellent generalization across different clothing categories and modalities, making it extremely suitable for real-world fashion analysis.

Abstract:
Retrieving image content with a natural language expression is an emerging interdisciplinary problem at the intersection of multimedia, natural language processing and artificial intelligence. Existing methods tackle this challenging problem by learning features from the visual and linguistic domains independently while the critical semantic correlations bridging two domains have been under-explored in the feature learning process. In this paper, we propose to exploit sharable semantic attributes as "anchors" to ensure the learned features are well aligned across domains for better object retrieval. We define "attributes" as the common concepts that are informative for object retrieval and can be easily learned from both visual content and language expression. In particular, diverse and complex attributes (e.g., location, color, category, interaction between object and context) are modeled and incorporated to promote cross-domain alignment for feature learning from multiple perspectives. Based on the sharable attributes, we propose a deep Attribute-Preserving Metric learning (AP-Metric) framework that jointly generates unique query-sensitive region proposals and conducts novel cross-modal feature learning that explicitly pursues consistency over semantic attribute abstraction within both domains for deep metric learning. Benefiting from the cross-modal semantic correlations, our proposed framework can localize challenging visual objects to match complex query expressions within cluttered background accurately. The overall framework is end-to-end trainable. Extensive evaluations on popular datasets including ReferItGame, RefCOCO, and RefCOCO+ well demonstrate its superiority. Notably, it achieves state-of-the-art performance on the challenging ReferItGame dataset.

Abstract:
Predicting the walking path of a pedestrian in crowds is a pivotal step towards understanding his/her behavior. This is one of the recently emerging tasks in computer vision and has scarcely been addressed to date. In this paper, we put forth a deep spatio-temporal learning-forecasting approach, which is composed of two modules. First, displacement information from pedestrians' walking history is extracted and fed into a convolutional layer in order to learn the undergoing motion patterns and produce high-level representations. Second, unlike the mainstream literature which learns the temporal or the spatial dynamics among the pedestrians separately, we propose to embed both components into a single framework via a Long Short-Term Memory based architecture that takes as input the previously extracted high-level motion cues and outputs the potential future walking routes of all pedestrians in one shot. We evaluate our approach on three large benchmark datasets, and show that it introduces large margin improvements with respect to recent works in the literature, both in short and long-term forecasting scenarios.

Abstract:
A major challenge that arises in Weakly Supervised Object Detection (WSOD) is that only image-level labels are available, whereas WSOD trains instance-level object detectors. A typical approach to WSOD is to 1) generate a series of region proposals for each image and assign the image-level label to all the proposals in that image; 2) train a classifier using all the proposals; and 3) use the classifier to select proposals with high confidence scores as the positive instances for another round of training. In this way, the image-level labels are iteratively transferred to instance-level labels.
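
The three-step loop described above can be sketched on toy data. This is only an illustration under stated assumptions: each proposal is reduced to a single hypothetical 1-D feature, a nearest-centroid scorer stands in for the classifier, and the helper names (`train_centroids`, `score`, `wsod_rounds`) are invented for this example rather than taken from the paper.

```python
def train_centroids(samples):
    """samples: list of (feature, label) with label in {0, 1}; returns class means."""
    pos = [f for f, y in samples if y == 1]
    neg = [f for f, y in samples if y == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)


def score(feature, centroids):
    """Confidence that a proposal is positive (stand-in for a real classifier)."""
    pos_mean, neg_mean = centroids
    # Higher when the feature is closer to the positive mean than to the negative one.
    return abs(feature - neg_mean) - abs(feature - pos_mean)


def wsod_rounds(images, rounds=3, keep=1):
    """images: list of (proposal_features, image_level_label)."""
    # Step 1: every proposal inherits the label of its image.
    selected = {i: list(props) for i, (props, _) in enumerate(images)}
    for _ in range(rounds):
        # Step 2: train on the proposals currently treated as labeled instances.
        samples = [(f, images[i][1]) for i, props in selected.items() for f in props]
        centroids = train_centroids(samples)
        # Step 3: in positive images, keep only the most confident proposals
        # as positive instances for the next round.
        for i, (props, label) in enumerate(images):
            if label == 1:
                ranked = sorted(props, key=lambda f: score(f, centroids), reverse=True)
                selected[i] = ranked[:keep]
    return selected


# Toy data: object proposals have features near 1.0, background near 0.0.
toy_images = [
    ([0.1, 0.2, 0.9], 1),
    ([0.0, 0.15, 1.0], 1),
    ([0.1, 0.2, 0.05], 0),
    ([0.0, 0.1, 0.2], 0),
]
result = wsod_rounds(toy_images)
```

After a few rounds, the positive images retain only the proposal that looks least like the background, which is the sense in which image-level labels are transferred to instance-level labels.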

Abstract:
Image matting plays an important role in image and video editing. However, the formulation of image matting is inherently ill-posed. Traditional methods usually rely on user interaction, in the form of trimaps and strokes, and cannot run on a mobile phone in real time. In this paper, we propose a real-time automatic deep matting approach for mobile devices. By leveraging densely connected blocks and dilated convolution, a light fully convolutional network is designed to predict a coarse binary mask for portrait images. A feathering block, which is edge-preserving and matting adaptive, is further developed to learn the guided filter and transform the binary mask into an alpha matte. Finally, an automatic portrait animation system based on fast deep matting is built on mobile devices, which does not need any interaction and can realize real-time matting at 15 fps. The experiments show that the proposed approach achieves results comparable to those of the state-of-the-art matting solvers.

Abstract:
Most previous studies that model users' preferences on images assume that there is a consistent ranking of images and that users' preferences are transitive. That is, if a user prefers image A over B and B over C, the user must also prefer A over C. This condition holds when a user compares images from a single angle. However, if there are multiple angles to consider, users' preferences may not be transitive at all. Thus, it is interesting to know whether users' pairwise preferences on images can be intransitive, and how such personalized intransitivity can be modeled.
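
To make the notion of intransitivity concrete, the following minimal sketch checks a set of pairwise preferences for cycles of length three. The data layout (a set of `(preferred, other)` tuples) and the function name `find_intransitive_triples` are assumptions made for this illustration; the abstract does not specify how preferences are stored or modeled.

```python
from itertools import permutations


def find_intransitive_triples(prefers):
    """prefers: set of (a, b) pairs meaning the user likes image a over image b.

    Returns each preference cycle a > b > c > a once, written with its
    alphabetically smallest item first.
    """
    items = sorted({x for pair in prefers for x in pair})
    cycles = []
    for a, b, c in permutations(items, 3):
        # Only report the rotation that starts at the smallest item.
        if a < b and a < c:
            if (a, b) in prefers and (b, c) in prefers and (c, a) in prefers:
                cycles.append((a, b, c))
    return cycles


# A user whose pairwise choices are consistent with one ranking A > B > C.
transitive_user = {("A", "B"), ("B", "C"), ("A", "C")}
# A user who likes A over B, B over C, and yet C over A.
cyclic_user = {("A", "B"), ("B", "C"), ("C", "A")}

print(find_intransitive_triples(transitive_user))
print(find_intransitive_triples(cyclic_user))
```

No single ranking of A, B and C can reproduce the second user's choices, which is why a model built on a consistent ranking cannot fit such data.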

Abstract:
People spend considerable effort managing the impressions they give others. Social psychologists have shown that people manage these impressions differently depending upon their personality. Facebook and other social media provide a new forum for this fundamental process; hence, understanding people's behaviour on social media could provide interesting insights on their personality. In this paper we investigate automatic personality recognition from Facebook profile pictures. We analyze the effectiveness of four families of visual features and we discuss some human interpretable patterns that explain the personality traits of the individuals. For example, extroverts and agreeable individuals tend to have warm colored pictures and to exhibit many faces in their portraits, mirroring their inclination to socialize; while neurotic ones have a prevalence of pictures of indoor places. Then, we propose a classification approach to automatically recognize personality traits from these visual features. Finally, we compare the performance of our classification approach to the one obtained by human raters and we show that computer-based classifications are significantly more accurate than averaged human-based classifications for Extraversion and Neuroticism.

Abstract:
"Drag A Star 3.0" is a site-specific interactive artwork set up in a café umbrella context. Audiences are meant to sit down and relax under the umbrella while interacting with the piece. With their smartphones, audience members can generate their own uniquely designed star and send it to the star field. They can even embed a wish in their star, just like the old myth of wishing upon a shooting star. Generated stars are stored in the web server database, then retrieved and displayed as stars in the night sky. Thus, every star tells a story. Audience members can catch a shooting star by performing a simple dragging gesture, and can then read others' wishes or even reply to them. This art piece combines different technologies, including projection mapping display, a mobile application, a web-based messaging system and a web server database. Every action by the users is stored in the web server database, and all these actions determine the visual components of the star field. These participatory social interactions connect audiences from the past, present and future.

Abstract:
This research project is based on 32 years of Kuchera-Morin's research and practice in spatio-temporal music composition and media arts. The project is an immersive interactive visual, sonic computational instrument presented as an installation, which includes the development of an open-source computational language, and Kuchera-Morin's immersive interactive visual/sonic composition PROBABLY/POSSIBLY? Using the mathematics of quantum mechanics, the immersive instrument and computational language facilitate the creation of new, unique visual/sonic art forms. This project allows the artist to drive scientific and technological research for creative expression. This same technology is giving physicists insight into higher dimensional representation. The immersive visual/sonic instrument and language are based on the time-dependent Schrödinger equation splitting a hydrogen-like atom's electron in superposition in various orbitals. The immersive media composition, PROBABLY/POSSIBLY? can be interactively performed using our multimodal computational platform and open source language. The instrument/installation can also be used to compose and perform a number of art works based on the time-dependent Schrödinger equation.

Abstract:
In this paper, we proposed a novel method for effective salient object detection by designing a chained multi-scale fully convolutional network (CMSFCN). CMSFCN contained multiple single-scale fully convolutional networks (SSFCNs), which were integrated successively by using chained connections and generated saliency prediction results from coarse to fine. The chained connections not only combined the saliency prediction result from the previous SSFCN with the input image of the current SSFCN, but also combined the intermediate features from the previous SSFCN and the current SSFCN. With these chained connections, the sequential SSFCNs in CMSFCN automatically learned complementary and discriminative features to improve the saliency predictions progressively. Therefore, after jointly training CMSFCN in an end-to-end manner, precise saliency prediction results were produced in a coarse-to-fine fashion. Compared with seven state-of-the-art CNN based salient object detection approaches over five benchmark datasets, experimental results demonstrated the efficiency and effectiveness of CMSFCN.

Abstract:
Existing image classification systems often suffer from re-training models for novel unseen classes. Zero-shot learning (ZSL) aims to recognise these unseen classes directly using trained models with a further inference procedure. However, existing approaches rely heavily on human-defined class-attribute associations to achieve the inference, which significantly increases the annotation cost. This paper aims to address ZSL on non-attribute tasks, i.e., only training images with labels are used, as in most supervised settings. Our main contributions are: 1) to circumvent expensive attributes, we propose to use semantic similes that directly indicate the unseen-to-seen associations; 2) a novel similarity-based representation is proposed to represent both visual images and semantic similes in a unified embedding space; 3) in order to reduce the annotation cost, we use only a few similes to infer a class-level prototype for each unseen class. On two popular benchmarks, AwA and aPY, extensive experiments show that our method can significantly improve the state-of-the-art results using only two similes for each unseen class. Furthermore, we revisit the Caltech 101 dataset without attributes. Our ZSL results can exceed those of previous supervised methods.

Abstract:
This paper introduces our augmented reality platform, Aristo, which aims to provide users with physical feedback when interacting with virtual objects. We use Vivepaper, a product we launched on Aristo in 2016, to illustrate the platform's performance requirements and key algorithms. We specifically depict Vivepaper's tracking and gesture recognition algorithms, which involve several trade-offs between speed and accuracy to achieve an immersive experience.

Abstract:
Anomaly detection is of great interest to big data applications, and both supervised and unsupervised learning have been applied for anomaly detection. However, it still remains a challenging problem because: (1) for supervised learning, it is difficult to acquire training data for anomaly samples; while (2) for unsupervised learning, the performance may not be satisfactory due to the lack of training data. To address these limitations, we propose a hybrid solution that uses both normal (positive) data and unlabeled data (which could be positive or negative) for semi-supervised anomaly detection. In particular, we introduce a new framework based on Positive and Unlabeled (PU) Learning using multi-features to detect anomalies. We extend previous PU learning methods to (1) better address the unbalanced class problem, which is typical for anomaly detection, and (2) handle multiple features for anomaly detection. An iterative algorithm is proposed to learn the anomaly classifier incrementally from the labeled normal data and the unlabeled data. Our proposed method is verified on three benchmark datasets and one synthetic dataset. Experimental results show that our method outperforms existing methods under different class priors and different proportions of given positive classes.

Abstract:
In this paper, we investigate the cross-database micro-expression recognition problem, where the training and testing samples are from two different micro-expression databases. Under this setting, the training and testing samples would have different feature distributions and hence the performance of most existing micro-expression recognition methods may decrease greatly. To solve this problem, we propose a simple yet effective method called Target Sample Re-Generator (TSRG). By using TSRG, we are able to re-generate the samples from the target micro-expression database so that the re-generated target samples share the same or similar feature distributions with the original source samples. For this reason, we can then use the classifier learned from the labeled source samples to accurately predict the micro-expression categories of the unlabeled target samples. To evaluate the performance of the proposed TSRG method, extensive cross-database micro-expression recognition experiments based on the SMIC and CASME II databases are conducted. Compared with recent state-of-the-art cross-database emotion recognition methods, the proposed TSRG achieves more promising results.

Abstract:
Video over HTTP adaptive streaming has become the most popular vehicle for delivering media content on mobile platforms. However, today's mobile video streaming is excessively tailored for visual quality, imposing a heavy burden on users' data budgets. In this paper, we aim to improve the bitrate efficiency of mobile video streaming by taking human visual acuity into account, i.e., preferably without sacrificing viewing quality. First, we conduct an in-depth analysis of mobile HTTP adaptive video streaming with a focus not only on how it works, but also on the significance of bitrate saving. Second, we identify a novel research problem on excessive visual quality which leads to bitrate-inefficient video streaming, and propose a flexible system called EyeTube to address it. Specifically, we apply dynamic resolution scaling on mobile video streaming to trade off the bitrate efficiency and user viewing experience. Third, we derive general principles for achieving bitrate-efficient mobile video streaming, and apply the principles to an open source web browser, i.e., Chromium, to verify their applicability. An end-to-end EyeTube system is implemented on Samsung smartphones, and its efficiency is evaluated against 10 popular YouTube videos. Experimental results show that all the bitrates of the 10 videos can be reduced by at least 54.2% on average and up to 90.9% at most when the resolution is quartered. A user study with 40 respondents has indicated that our system can achieve good performance on both bitrate saving and high viewing quality.

Abstract:
To improve the discrimination of attribute representation, in this paper, we propose to extend the traditional attribute representations via embedding the latent high-order structure between attributes. Specifically, our aim is to construct the Latent Extended Attribute Features (LEAF) for visual classification. Since only weak labels exist for each attribute, we first propose a feature selection method to explore the common feature structures across categories. After that, the attribute classifiers are trained based on the selected features. Then, the category-specific graph is introduced, which is composed of single attributes and their co-occurrence attribute pairs. This attribute graph is used as the initial representation of each image. Given our aim, we should discover the discriminative latent structure between attributes and train robust category classifiers. To that end, we develop a joint learning objective function which is composed of the high-order representation mining term and the classifier training term. The mining term can both preserve category-specific information and discover the common structure between categories. Based on the discovered representation, robust visual classifiers can be trained by the classifier term. Finally, an alternating optimization method is designed to seek the optimal solution of our objective function. Experimental results on challenging datasets demonstrate the advantages of our proposed model over existing work.

Abstract:
Video question answering is a challenging task in visual information retrieval, which provides the accurate answer from the referenced video contents according to the given question. However, existing visual question answering approaches mainly tackle the problem of static image question answering, and may be ineffective when applied directly to video question answering because they do not sufficiently model the temporal dynamics of video. In this paper, we study the problem of video question answering from the viewpoint of hierarchical dual-level attention network learning. We obtain the object appearance and movement information in the video based on both frame-level and segment-level feature representation methods. We then develop the hierarchical dual-level attention networks to learn the question-aware video representations with word-level and question-level attention mechanisms. We next devise the question-level fusion attention mechanism for our proposed networks to learn the question-aware joint video representation for video question answering. We construct two large-scale video question answering datasets. The extensive experiments validate the effectiveness of our method.

Abstract:
Quality, cost, and accessibility together form an iron triangle that has prevented healthcare from achieving accelerated advancement in the last few decades. Improving any one of the three iron triangle vertices may lead to degradation of the other two. For example, a policy that increases healthcare accessibility would lower quality and/or increase cost. Thanks to recent breakthroughs in artificial intelligence (AI) and virtual reality (VR), this iron triangle may finally be shattered. In this talk, I will share our experience of developing DeepQ, an AI platform, for supporting AI-aided diagnosis and VR-aided surgery. I will present two healthcare initiatives that we have been undertaking since 2013: XPRIZE Tricorder and VR surgery, and explain how AI and VR play pivotal roles in improving diagnosis accuracy and treatment effectiveness. More specifically, I will depict how we have dealt with not only big data machine learning, but also small data learning, which is typical in the medical domain. The talk concludes with roadmaps and a list of open research issues in multimodal signal processing, fusion, and mining to achieve precision medicine and precision surgery.

Abstract:
In this paper we are interested in recognizing human actions from sequences of 3D skeleton data. For this purpose we combine a 3D Convolutional Neural Network with body representations based on Euclidean Distance Matrices (EDMs), which have been recently shown to be very effective at capturing the geometric structure of the human pose. One inherent limitation of the EDMs, however, is that they are defined up to a permutation of the skeleton joints, i.e., randomly shuffling the ordering of the joints yields many different representations. In order to address this issue we introduce a novel architecture that simultaneously, and in an end-to-end manner, learns an optimal transformation of the joints, while optimizing the rest of the parameters of the convolutional network. The proposed approach achieves state-of-the-art results on 3 benchmarks, including the recent NTU RGB-D dataset, for which we improve on previous LSTM-based methods by more than 10 percentage points, also surpassing other CNN-based methods while using almost 1000 times fewer parameters.
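
The permutation issue mentioned above can be illustrated with a small sketch. It builds the EDM of a toy four-joint "skeleton" (the coordinates and the permutation are made up for this example) and shows that shuffling the joint order produces a different matrix, which equals the original only after its rows and columns are re-indexed by the same permutation.

```python
import math


def edm(joints):
    """Euclidean Distance Matrix: entry [i][j] is the distance between joints i and j."""
    return [[math.dist(p, q) for q in joints] for p in joints]


# Hypothetical 3D joint positions, chosen only for illustration.
skeleton = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
perm = [2, 0, 3, 1]  # an arbitrary re-ordering of the joints
shuffled = [skeleton[i] for i in perm]

D = edm(skeleton)
D_shuffled = edm(shuffled)

# Re-index rows and columns of the original EDM with the same permutation.
D_relabelled = [[D[i][j] for j in perm] for i in perm]

print(D_shuffled == D)             # the raw matrices differ
print(D_shuffled == D_relabelled)  # they agree once rows/columns are permuted
```

A convolutional network sees `D` and `D_shuffled` as different inputs even though they describe the same pose, which is the motivation for choosing the joint ordering carefully or, as in the abstract, learning it.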

Abstract:
This paper introduces a novel approach for generating videos called Synchronized Deep Recurrent Attentive Writer (Sync-DRAW). Sync-DRAW can also perform text-to-video generation which, to the best of our knowledge, makes it the first approach of its kind. It combines a Variational Autoencoder (VAE) with a Recurrent Attention Mechanism in a novel manner to create a temporally dependent sequence of frames that are gradually formed over time. The recurrent attention mechanism in Sync-DRAW attends to each individual frame of the video in synchronization, while the VAE learns a latent distribution for the entire video at the global level. Our experiments with Bouncing MNIST, KTH and UCF-101 suggest that Sync-DRAW is efficient in learning the spatial and temporal information of the videos and generates frames with high structural integrity, and can generate videos from simple captions on these datasets.

Abstract:
DASH, or Dynamic Adaptive Streaming over HTTP, relies on a rate adaptation component to decide on which representation to download for each video segment. A plethora of rate adaptation algorithms has been proposed in recent years. The decisions of which bitrate to download made by these algorithms largely depend on several factors: estimated network throughput, buffer occupancy, and buffer capacity. Yet, these algorithms are not informed by a fundamental relationship between these factors and the chosen bitrate, and as a result, we found that they do not perform consistently in all scenarios, and require parameter tuning to work well under different buffer capacity. In this paper, we model a DASH client as an M/D/1/K queue, which allows us to calculate the expected buffer occupancy given a bitrate choice, network throughput, and buffer capacity. Using this model, we propose QUETRA, a simple rate adaptation algorithm. We evaluated QUETRA under a diverse set of scenarios and found that, despite its simplicity, it leads to better quality of experience (7% - 140%) than existing algorithms.

Abstract:
Advertisements (ads) often include strongly emotional content to leave a lasting impression on the viewer. This work (i) compiles an affective ad dataset capable of evoking coherent emotions across users, as determined from the affective opinions of five experts and 14 annotators; (ii) explores the efficacy of convolutional neural network (CNN) features for encoding emotions, and observes that CNN features outperform low-level audio-visual emotion descriptors [9] upon extensive experimentation; and (iii) demonstrates how enhanced affect prediction facilitates computational advertising, and leads to better viewing experience while watching an online video stream embedded with ads based on a study involving 17 users. We model ad emotions based on subjective human opinions as well as objective multimodal features, and show how effectively modeling ad emotions can positively impact a real-life application.

Abstract:
Compared with normal modalities, the representations of paintings are much more complex due to their large intra-class and small inter-class variation. This poses more difficulties in the task of authorship identification. In this paper, we propose a multi-task multi-range (MTMR) representation framework and try to resolve this issue in two ways. First, we investigate how to improve the representation through multi-task learning. Specifically, we attempt to optimize authorship identification with subtly correlated identification tasks such as style, genre and date. Second, in order to make the representation more comprehensive and reduce the information loss from image scaling, we propose a multi-range structure which is composed of local, regional and global representations. Experiments on the two most representative large-scale painting datasets, Rijksmuseum Challenge and Wikiart, have shown that our method significantly outperforms the existing methods. To give better understanding and provide more effective predictions, we utilize random forest as the feature ranking method to analyze the importance of different features and apply external knowledge matching to further examine the predictions. Moreover, the framework's effects of identifying the authorship are visualized on the paintings' artist-characteristic regions and t-SNE is further applied to perform artist-based cluster analysis. Extensive validation has demonstrated that the proposed framework yields superior performance in the challenging task of painting authorship identification.

Abstract:
Recently we have observed emerging uses of deep learning techniques in multimedia systems. Developing a practical deep learning system is arduous and complex. It involves labor-intensive tasks for constructing sophisticated neural networks, coordinating multiple network models, and managing a large amount of training-related data. To facilitate such a development process, we propose TensorLayer, which is a Python-based versatile deep learning library. TensorLayer provides high-level modules that abstract sophisticated operations towards neuron layers, network models, training data and dependent training jobs. In spite of offering simplicity, it has transparent module interfaces that allow developers to flexibly embed low-level controls within a backend engine, with the aim of supporting fine-grain tuning towards training. Real-world cluster experiment results show that TensorLayer is able to achieve competitive performance and scalability in critical deep learning tasks. TensorLayer was released in September 2016 on GitHub. Since then, it has become one of the most popular open-source deep learning libraries used by researchers and practitioners.

Abstract:
Despite significant progress of deep learning in the field of computer vision, there has not been a software library that covers these methods in a unifying manner. We introduce ChainerCV, a software library that is intended to fill this gap. ChainerCV supports numerous neural network models as well as software components needed to conduct research in computer vision. These implementations emphasize simplicity, flexibility and good software engineering practices. The library is designed to perform on par with the results reported in published papers and its tools can be used as a baseline for future research in computer vision. Our implementation includes sophisticated models like Faster R-CNN and SSD, and covers tasks such as object detection and semantic segmentation.

Abstract:
Anomaly detection and localization in surveillance videos have attracted broad attention in both academia and industry for their importance to public safety, yet they remain challenging. In this demonstration, we propose an anomaly detection algorithm called 2stream-VAE/GAN by embedding VAE/GAN in a two-stream architecture. By taking both spatial and temporal information into consideration, normality can be captured and anomaly detection can be achieved. With an outlier detection rule, the system automatically locates anomalies based on a pre-trained model, which suits both streaming and local videos well.

Abstract:
We propose a new smart lock system that uses a user's own item as a smart key to the lock system. The system marks the item with an ink dot, named mIDoT-key (pronounced "my-dot-key"), and the assigned lock is opened by identifying the microscopic image of the dot. Unlike conventional smart keys, new users do not have to get an IC card or install new software to use our system. Thus, the registration process is much easier for new users. This instant registration is beneficial to access control of many rooms. New visitors of these rooms can be supplied with a mIDoT registration key for easy access. We demonstrate a miniature door lock that can be opened with various items marked with the mIDoT-keys.

Abstract:
HTTP Adaptive Streaming has become the de facto solution for delivering video over the Internet due to its ability to enhance consumers' Quality of Experience (QoE). Nevertheless, it cannot improve the actual delivered video quality, which is limited by the available client-server throughput. In comparison, multiple-server and P2P streaming offer the opportunity to obtain enhanced QoE by benefiting from expanded bandwidth, link diversity and reliability in distributed streaming infrastructures. We present a prototype for a hybrid P2P/multi-server quality-adaptive streaming solution, simultaneously using several servers and peers, and trading off the server infrastructure capacities and QoE gains.

Abstract:
Image and video super-resolution (SR) has been explored for several decades. However, few works are integrated into practical systems for real-time image and video SR. In this work, we present a real-time deep video SpaTial Resolution UpConversion SysTem (STRUCT++). Our demo system achieves real-time performance (50 fps on CPU for CIF sequences and 45 fps on GPU for HDTV videos) and provides several functions: 1) batch processing; 2) full resolution comparison; 3) local region zooming in. These functions are convenient for super-resolution of a batch of videos (at most 10 videos in parallel), comparisons with other approaches and observations of local details of the SR results. The system is built on a Global context aggregation and Local queue jumping Network (GLNet). It has a thinner and deeper network structure to aggregate global context with an additional local queue jumping path to better model local structures of the signal. GLNet achieves state-of-the-art performance for real-time video SR.

Abstract:
Dynamic music emotion prediction is to recognize the continuous emotion information in music, which is necessary for music retrieval and recommendation. In this paper, we adopt the dimensional valence-arousal (V-A) emotion model to represent the dynamic emotion in music. In our opinion, music and V-A emotion label do not have the one-to-one correspondence in the time domain, while the expression of music emotion at one moment is the accumulation of previous music content for a period of time, so we propose Long Short-Term Memory (LSTM) based sequence-to-one mapping for dynamic music emotion prediction. Based on this sequence-to-one music emotion mapping, it is proved that different time scales' preceding content has an influence on the LSTM model's performance, so we further propose the Multi-scale Context based Attention (MCA) for dynamic music emotion prediction. We evaluate our proposed method on the database of Emotion in Music task at MediaEval 2015, and the results show that our proposed method outperforms most of the models using the same features and achieves a competitive performance with the state-of-the-art methods.
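
The sequence-to-one idea described above, in which the emotion label at one moment is paired with a window of preceding music content, can be sketched as a data-preparation step. This is only an illustration: the function name `sequence_to_one_pairs`, the toy frame names and the context lengths are assumptions for this example, and the LSTM and attention model themselves are not shown.

```python
def sequence_to_one_pairs(features, labels, context):
    """Pair each label with the `context` most recent feature frames.

    features: per-frame feature values, labels: per-frame V-A annotations.
    Frames without a full window of preceding content are skipped.
    """
    pairs = []
    for t in range(context - 1, len(features)):
        window = features[t - context + 1 : t + 1]  # frames up to and including t
        pairs.append((window, labels[t]))
    return pairs


# Toy per-frame features and labels (placeholders, not real audio descriptors).
frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
emotion = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

short_scale = sequence_to_one_pairs(frames, emotion, context=2)
long_scale = sequence_to_one_pairs(frames, emotion, context=4)

print(short_scale[0])  # (['f0', 'f1'], 0.2)
print(long_scale[0])   # (['f0', 'f1', 'f2', 'f3'], 0.4)
```

Calling the helper with several `context` values yields the differently scaled views of the preceding content that a multi-scale model could then combine.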

Abstract:
High Dynamic Range (HDR) displays can show images with higher color contrast levels and peak luminosities than the commonly used Low Dynamic Range (LDR) displays. Although HDR displays are still expensive, they will reach the consumer market in the coming years. Unfortunately, most video content is recorded and/or graded in LDR format. Typically, dynamic range expansion using an Inverse Tone Mapping Operator (iTMO) is required to show LDR content on HDR displays. The most common type of artifact derived from dynamic range expansion is false contouring, which negatively affects the overall image quality. In this paper, we propose a new fast iterative false-contour removal method for inverse tone mapped HDR content. We consider the false-contour removal as a signal reconstruction problem, and we solve it using an iterative Projection Onto Convex Sets (POCS) minimization algorithm. Unlike most other false-contour removal techniques, we define reconstruction constraints taking into account the iTMO used. Experimental results demonstrate the effectiveness of the proposed method in removing false contours while preserving details in the image. In order to speed up execution, the proposed method was implemented to run on a GPU. We were able to show that it can be used to remove false contours in real time from inverse tone mapped high-definition HDR video sequences at 24 fps.

Abstract:
Multi-task learning aims to boost the performance of multiple prediction tasks by appropriately sharing relevant information among them. However, it often suffers from the negative transfer problem. Moreover, due to the diverse learning difficulties and convergence rates of different tasks, jointly optimizing multiple tasks is very challenging. To solve these problems, we present a weighted multi-task deep convolutional neural network for person attribute analysis. A novel validation loss trend algorithm is proposed, for the first time, to dynamically and adaptively update the weight for learning each task in the training process. Extensive experiments on the CelebA, Market-1501 attribute and Duke attribute datasets clearly show that state-of-the-art performance is obtained, which validates the effectiveness of our proposed framework.

Abstract:
With the ever-increasing number of car-mounted electronic devices that are accessed, managed, and controlled with smartphones, car apps are becoming an important part of the automotive industry. Audio classification is one of the key components of car apps as a front-end technology to enable human-app interactions. Existing approaches for audio classification, however, fall short as the unique and time-varying audio characteristics of car environments are not appropriately taken into account. Leveraging recent advances in mobile sensing technology that allow for effective and accurate driving environment detection, in this paper, we develop an audio classification framework for mobile apps that categorizes an audio stream into music, speech, speech+music, and noise, adapting to different driving environments. A case study is performed with four different driving environments, i.e., highway, local road, crowded city, and stopped vehicle. More than 420 minutes of audio data are collected, including various genres of music, speech, speech+music, and noise from the driving environments. The results demonstrate that the proposed approach improves the average classification accuracy by up to 166% and 64% for speech and speech+music, respectively, compared with a non-adaptive approach in our experimental settings.

Abstract:
Cross-media retrieval aims at seeking the semantic association between different media types. Most existing methods have paid much attention to learning mapping functions or finding the optimal spaces, but have neglected how people accurately cognize images and texts. This paper proposes a brain-inspired cross-media retrieval framework to learn rich semantic embeddings of multimedia. Rather than directly using off-the-shelf image features, we combine the visual and descriptive senses for an image from the view of human perception via a joint model, called multi-sensory fusion network (MSFN). A topic model based TextNet maps texts into the same semantic space as images according to their shared ground truth labels. Moreover, in order to overcome the limitations of insufficient data for training neural networks and less complexity in text form, we introduce a large-scale image-text dataset, called the Britannica dataset. Extensive experiments show the effectiveness of our framework for different lengths of texts on three benchmark datasets as well as the Britannica dataset. Most notably, we report the best known average results of Img2Text and Text2Img compared with several state-of-the-art methods.

Abstract:
We are creating multimedia content every day and everywhere. While automatic content generation has posed a fundamental challenge to the multimedia community for decades, recent advances in deep learning have made this problem feasible. For example, Generative Adversarial Networks (GANs) are a rewarding approach to synthesizing images. Nevertheless, capitalizing on GANs to generate videos is not trivial. The difficulty originates from the intrinsic structure of a video, which is a sequence of visually coherent and semantically dependent frames. This motivates us to explore semantic and temporal coherence in designing GANs to generate videos. In this paper, we present novel Temporal GANs conditioning on Captions, namely TGANs-C, in which the input to the generator network is a concatenation of a latent noise vector and a caption embedding, which is then transformed into a frame sequence with 3D spatio-temporal convolutions. Unlike the naive discriminator, which only judges pairs as fake or real, our discriminator additionally notes whether the video matches the correct caption. In particular, the discriminator network consists of three discriminators: a video discriminator that distinguishes real videos from generated ones and optimizes video-caption matching, a frame discriminator that discriminates between real and fake frames and aligns frames with the conditioning caption, and a motion discriminator that emphasizes that adjacent frames in the generated videos should be smoothly connected as in real ones. We qualitatively demonstrate the capability of our TGANs-C to generate plausible videos conditioned on the given captions on two synthetic datasets (SBMG and TBMG) and one real-world dataset (MSVD). Moreover, quantitative experiments on MSVD are performed to validate our proposal via Generative Adversarial Metric and human study.

Abstract:
Translating and summarizing a video into natural language is an interesting and challenging visual task. In this work, a novel framework is built to generate sentences for videos with more coherence and semantics. A long short-term memory (LSTM) network with an improved factored way is first developed, which takes inspiration from the conventional factored way and from the common practice of presenting multi-modal features at the first time step in LSTM for video captioning. An LSTM network with the combination of improved factored and un-factored ways is exploited, and a voting strategy is employed to predict the words. Then, residual learning, inspired by the residual network (ResNet), is used to enhance the gradient signals, and a deeper LSTM network is constructed. Furthermore, several convolutional neural network (CNN) features from deep models with different architectures are fused to capture more comprehensive and complementary visual information. Experiments are conducted on the MSR-VTT2016 and MSR-VTT2017 grand challenge datasets to demonstrate the effectiveness of each presented technique as well as the superiority compared to other state-of-the-art methods.

Abstract:
Emerging commercial live content broadcasting platforms are facing great challenges to accommodate large scale dynamic viewer populations. Existing solutions constantly suffer from balancing the cost of deploying at the edge close to the viewers and the quality of content delivery. We propose LiveJack, a novel network service to allow CDN servers to seamlessly leverage ISP edge cloud resources. LiveJack can elastically scale the serving capacity of CDN servers by integrating Virtual Media Functions (VMF) in the edge cloud to accommodate flash crowds for very popular contents. LiveJack introduces minor application layer changes for streaming service providers and is completely transparent to end users. We have prototyped LiveJack in both LAN and WAN environments. Evaluations demonstrate that LiveJack can increase CDN server capacity by more than six times, and can effectively accommodate highly dynamic workloads with an improved service quality.

Abstract:
Personalized video recommender systems play an essential role in bridging users and videos. However, most existing video recommendation methods assume that user profiles (interests) are static. In fact, the static assumption is inadequate to reflect users' dynamic interests as time goes by, especially in the online video recommendation scenarios with dramatic changes of video contents and frequent drift of users' interests over different topics. To overcome the above issue, we propose a dynamic recurrent neural network to model users' dynamic interests over time in a unified framework for personalized video recommendation. Furthermore, to build a much more comprehensive recommendation system, the proposed model is designed to exploit video semantic embedding, user interest modeling, and user relevance mining jointly to model users' preferences. By considering these three factors, the RNN model becomes an interest network which can capture users' high level interests effectively. Extensive experimental results on both single-network and cross-network video recommendation scenarios demonstrate the superior performance of the proposed model compared with other state-of-the-art algorithms.

Abstract:
Driven by the increasingly popular image-dominated social networks, such as Instagram, Pinterest and Chictopica, sharing of daily-life street photos now plays an influential role in fashion adoption between fashion trend-setters and followers. In this work, we propose a deep learning based fine-grained embedding learning approach for street fashion analysis by leveraging user-generated street fashion data. Specifically, we present QuadNet, an effective CNN based image embedding network driven by both a multi-task classification loss and a neighbor-constrained similarity loss. The latter is computed with a novel quadruplet loss function, which considers both hard and soft positive neighbors as well as a negative neighbor for each anchor image. The embedded feature learned from co-optimization is effective for both the fine-grained classification task and the image retrieval task. Quantitative evaluation on a newly collected large-scale multi-task street photo dataset shows that our QuadNet outperforms the state-of-the-art triplet network by a significant margin. In order to further evaluate the effectiveness of the learned embedding, we analyze and trace the fashion trends of New York City from 2011 to 2016. In our analysis, we are able to identify some short-term and long-term fashion styles.
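
A quadruplet loss over an anchor, a hard positive, a soft positive, and a negative can be sketched as below. The exact formulation is not given in the abstract. This version is an assumption: two hinge terms that ask the hard positive to be closer to the anchor than the soft positive, and the soft positive to be closer than the negative. The margin names and values are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def quadruplet_loss(anchor, hard_pos, soft_pos, negative,
                    margin_inner=0.2, margin_outer=0.1):
    """Assumed ranking: d(anchor, hard) < d(anchor, soft) < d(anchor, negative)."""
    d_hard = euclidean(anchor, hard_pos)
    d_soft = euclidean(anchor, soft_pos)
    d_neg = euclidean(anchor, negative)
    inner = max(0.0, d_hard - d_soft + margin_inner)  # hard vs. soft positive
    outer = max(0.0, d_soft - d_neg + margin_outer)   # soft positive vs. negative
    return inner + outer


anchor = (0.0, 0.0)
# Correctly ordered neighbors incur no loss.
loss_ordered = quadruplet_loss(anchor, (0.1, 0.0), (0.5, 0.0), (1.0, 0.0))
# Swapping the hard positive and the negative violates both constraints.
loss_violated = quadruplet_loss(anchor, (1.0, 0.0), (0.5, 0.0), (0.1, 0.0))
```

Compared with a triplet loss, the extra term lets the embedding express graded similarity rather than a single positive/negative split.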

Abstract:
Virtual Reality (VR) devices are becoming accessible to a large public, which is going to increase the demand for 360° VR videos. VR videos are often characterized by a poor quality of experience, due to the high bandwidth required to stream the 360° video. To overcome this issue, we spatially divide the VR video into tiles, so that each temporal segment is composed of several spatial tiles. Only the tiles belonging to the viewport, the region of the video watched by the user, are streamed at the highest quality. The other tiles are instead streamed at a lower quality. We also propose an algorithm to predict the future viewport position and minimize quality transitions during viewport changes. The video is delivered using the server push feature of the HTTP/2 protocol. Instead of retrieving each tile individually, the client issues a single push request to the server, so that all the required tiles are automatically pushed back to back. This approach increases the achieved throughput, especially in mobile, high RTT networks. In this paper, we detail the proposed framework and present a prototype developed to test its performance using real-world 4G bandwidth traces. In particular, our approach can save up to 35% of bandwidth without severely impacting the quality viewed by the user, when compared to a traditional non-tiled VR streaming solution. Moreover, in high RTT conditions, our HTTP/2 approach can reach 3 times the throughput of tiled streaming over HTTP/1.1, and consistently reduce freeze time. These results represent a major improvement for the efficient delivery of 360° VR videos over the Internet.
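
The viewport-dependent quality assignment can be illustrated with a small sketch. The grid size, the normalized coordinates, the quality labels, and the function name `assign_tile_qualities` are assumptions. The sketch also ignores the horizontal wrap-around of a real 360° frame and does not model viewport prediction or HTTP/2 push.

```python
def assign_tile_qualities(cols, rows, viewport, high="high", low="low"):
    """Mark tiles that overlap the viewport for high quality, the rest for low.

    viewport is (left, top, right, bottom) in normalized [0, 1] frame coordinates.
    """
    left, top, right, bottom = viewport
    qualities = {}
    for r in range(rows):
        for c in range(cols):
            t_left, t_right = c / cols, (c + 1) / cols
            t_top, t_bottom = r / rows, (r + 1) / rows
            # Two axis-aligned rectangles overlap when they overlap on both axes.
            overlaps = (t_left < right and t_right > left
                        and t_top < bottom and t_bottom > top)
            qualities[(r, c)] = high if overlaps else low
    return qualities


# A 4x2 tiling with a viewport covering part of the upper middle of the frame.
qualities = assign_tile_qualities(cols=4, rows=2, viewport=(0.3, 0.1, 0.6, 0.4))
high_tiles = sorted(tile for tile, q in qualities.items() if q == "high")
```

Here only two of the eight tiles would be requested at the highest quality, which is where the bandwidth saving comes from.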

Abstract:
We present PD-Survey, a platform to conduct surveys across a network of interactive screens. Our research is motivated by the fact that obtaining and analyzing data about users of public displays requires significant effort; e.g., running long-term observations or post-hoc analyses of video/interaction logs. As a result, research is often constrained to a single installation within a particular context, neither accounting for a diverse audience (children, shoppers, commuters) nor for different situations (waiting vs. passing by) or times of the day. As displays become networked, one way to address this challenge is through surveys on displays, where audience feedback is collected in situ. Since current tools do not appropriately address the requirements of a display network, we implemented a tool for use on public displays and report on its design and development. Our research is complemented by two in-the-wild deployments that (a) investigate different channels for feedback collection, (b) showcase how the work of researchers is supported, and (c) testify that the platform can easily be extended with novel features.

Abstract:
In recent years, cross-modal scene retrieval has attracted more attention. However, most existing approaches neglect the semantic relationship between objects in a scene together with the embedded spatial layouts. Moreover, these methods mostly apply the batch learning strategy, which is not suitable for processing streaming data. To address the aforementioned problems, we propose a new framework for online cross-modal scene retrieval based on binary representations and a semantic graph. Specifically, we adopt cross-modal hashing based on the quantization loss of different modalities. By introducing the semantic graph, we are able to extract rich semantics and measure their correlation across different modalities. Furthermore, we propose a two-step optimization procedure based on stochastic gradient descent for online update. Experimental results on four datasets show the superiority of our approach over the state of the art.

Abstract:
On-demand video streaming is a popular application which accounts for a large share of today's Internet traffic. Dynamic Adaptive Streaming over HTTP (DASH) is the major streaming technology used by large content providers. However, this technology suffers from performance problems when multiple clients are streaming on shared network links. Aiming at improving the viewers' Quality of Experience, this thesis studies how DASH Assisting Network Elements (DANEs) optimize bottleneck network links and improve DASH streaming performance. The DANE is aware of DASH traffic on the network link, and partitions available network resources between DASH players and other traffic. Two DANE prototypes have been proposed as part of this thesis. In experiments with DASH players in wired and wireless networks, it is shown that the DANEs increase video bitrate, reduce quality switches, and improve fairness between players. Additionally, Markov models have been developed to explore sharing policies for DANEs. The models are used to determine the effect of those policies on streaming performance and to optimize network resource sharing for DASH players. Thorough validations with real DASH players show that the Markov models are highly accurate. In the remaining last year of the PhD program, I will apply DANE technology to cellular (5G) networks and use the Markov models to obtain an optimal sharing strategy.

Abstract:
The number of "hits" has been widely regarded as the lifeblood of many web systems, e.g., e-commerce systems, advertising systems and multimedia consumption systems. However, users will not hit an item if they cannot see it, or if they are not interested in it. Recommender systems play a critical role in discovering items of interest from a near-infinite inventory and exhibiting them to potential users. Yet, two issues are crippling recommender systems. One is "how to handle new users", and the other is "how to surprise users". The former is well known as cold-start recommendation, and the latter can be investigated as long-tail recommendation. This paper, for the first time, proposes a novel approach which can simultaneously handle both cold-start and long-tail recommendation in a unified objective.

Abstract:
Deep architectures using identity skip-connections have demonstrated groundbreaking performance in the field of image classification. Recently, empirical studies suggested that identity skip-connections enable ensemble-like behaviour of shallow networks, and that depth is not the sole ingredient for their success. Therefore, we examine the potential of identity skip-connections for the task of Speech Emotion Recognition (SER), where moderately deep temporal architectures are often employed. To this end, we propose a novel architecture which regulates unimpeded feature flows and captures long-term dependencies via gate-based skip-connections and a memory mechanism. Our proposed architecture is compared to other state-of-the-art methods of SER and is evaluated on large aggregated corpora recorded in different contexts. Our proposed architecture outperforms the state-of-the-art methods by 9-15% and achieves an Unweighted Accuracy of 80.5% in an imbalanced class distribution. In addition, we examine a variant adopting simplified skip-connections of Residual Networks (ResNet) and show that gate-based skip-connections are more effective than simplified skip-connections.

Abstract:
Understanding the semantics in videos is a complex but crucial task in video analysis. This paper focuses on localizing category-independent events, actions or other semantics in an untrimmed video, referred to as salient temporal proposal localization. Traditional methods like sliding window have a high computational cost due to the dense sampling of different video segments. We propose a reinforcement learning based method that trains a localizer to learn a search policy. Instead of exploring every video segment, the policy finds an optimal search path in a tree structure to locate a salient proposal based on the currently observed video segment, thereby reducing the number of video segments fed into the proposal detector. In each search step, a localizer is trained to iteratively select the next sub-region containing salient proposals to continue the search, and a proposal detector is trained to recognize salient proposals from the sub-regions. The experiments demonstrate that our method is able to precisely detect salient proposals with a comparable recall and with far fewer candidate windows.

Abstract:
Frames of free viewpoint video (FVV) synthesized with depth image-based rendering (DIBR) mainly contain special local artifacts such as geometric distortions, in which the shape of objects may be stretched or bent. Human observers tend to perceive such severe local deformations rather than the consistent shifting artifacts that are penalized by most existing metrics. The elastic metric is capable of measuring the difference in stretching or bending between two curves, and thus is suitable for evaluating such geometric distortions. In this paper, an elastic metric based image quality assessment (EM-IQA) scheme is proposed by first selecting local distortion regions and then quantifying the deformations of curves. According to the experimental results on the IRCCyn/IVC DIBR image database, the proposed EM-IQA outperforms the state-of-the-art metrics designed for synthesized images and obtains a gain of 6.97% in Pearson correlation compared to the second-best performing metric, MP-PSNRreduced.

Abstract:
Nowadays, multi-track conferences make it difficult for attendees to plan a day of attending relevant sessions/papers, because a large number of sessions are scheduled on a single day and many of them take place at the same time. To address this problem, we introduce MatPlanner, a mobile application which provides an alternative way of scheduling a day at a conference by understanding the attendee's interests and preferences. MatPlanner learns an event-topic interest matrix from a few sampled events selected by each user, then recommends relevant events and displays the final output. More importantly, MatPlanner's interface enables users to resolve conflicting events by showing them in the same panel, which makes them easy to compare. Interview results suggested a high level of satisfaction among participants in scheduling a day.

Abstract:
In this demo we present a system for immersive experiences in museums using Voice Commands (VCs) and Virtual Reality (VR). The system has been specifically designed for use by people with motor disabilities. Natural interaction is provided through Automatic Speech Recognition (ASR) and allows users to experience VR environments while wearing a Head Mounted Display (HMD), i.e., the Oculus Rift. Insights gathered during the implementation and results from an initial usability evaluation are reported.

Abstract:
Visual content description has been attracting broad research attention in the multimedia community because it deeply uncovers the intrinsic semantic facets of visual data. Most existing approaches formulate visual captioning as a machine translation task (i.e., from vision to language) via a top-down paradigm with global attention, which fails to distinguish visual and non-visual parts during word generation. In this work, we propose a novel adaptive attention strategy for visual captioning, which can selectively attend to salient visual content based on linguistic knowledge. Specifically, we design a key control unit, termed visual gate, to adaptively decide "when" and "what" the language generator attends to during the word generation process. We map all the preceding outputs of the language generator into a latent space to derive the representation of sentence structures, which assists the "visual gate" in choosing appropriate attention timing. Meanwhile, we employ a bottom-up workflow to learn a pool of semantic attributes that serve as the propositional attention resources. We evaluate the proposed approach on two commonly used benchmarks, i.e., MSCOCO and MSVD. The experimental results demonstrate the superiority of our proposed approach compared to several state-of-the-art methods.

Abstract:
We address the question of what visual cues, including scene objects and demographic attributes, contribute to the automatic inference of perceived ambiance in social media venues. We first use a state-of-art, deep scene semantic parsing method and a face attribute extractor to understand how different cues present in a scene relate to human perception of ambiance on Foursquare images of social venues. We then analyze correlational links between visual cues and thirteen ambiance variables, as well as the ability of the semantic attributes to automatically infer place ambiance. We study the effect of the type and amount of image data used for learning, and compare regression results to previous work, showing that the proposed approach results in marginal-to-moderate performance increase for up to ten of the ambiance dimensions, depending on the corpus.

Abstract:
We consider the bandwidth allocation problem in automated video surveillance systems, in which a monitoring station analyzes the video streams captured and delivered wirelessly by multiple cameras. In contrast with prior studies, we provide a detailed experimental analysis of cross-layer optimization by developing a real system and conducting extensive experiments. In addition, we present an enhanced cross-layer optimization solution that allocates bandwidth to different cameras in a manner that optimizes the overall detection accuracy. The solution works with the popular HTTP streaming approach and includes a new online scheme for estimating the effective airtime of the network. The results show that the proposed solution significantly improves the detection accuracy.

Abstract:
Hashing methods play an important role in large-scale image retrieval. Traditional hashing methods use hand-crafted features to learn hash functions, which cannot capture high-level semantic information. Deep hashing algorithms use deep neural networks to learn feature representations and hash functions simultaneously. Most of these algorithms exploit supervised information to train the deep network. However, supervised information is expensive to obtain. In this paper, we propose a pseudo label based unsupervised deep discriminative hashing algorithm. First, we cluster images via K-means and the cluster labels are treated as pseudo labels. Then we train a deep hashing network with pseudo labels by minimizing the classification loss and quantization loss. Experiments on two datasets demonstrate that our unsupervised deep discriminative hashing method outperforms the state-of-the-art unsupervised hashing methods.
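
The role of a quantization loss in deep hashing can be sketched as follows. The abstract does not define the loss, so this uses one common form as an assumption: the mean squared gap between the real-valued network outputs and their sign-binarized codes. The function names are hypothetical, and the K-means pseudo-labeling and classification loss are not modeled.

```python
def binarize(outputs):
    """Map real-valued hash-layer outputs to binary codes in {-1, +1}."""
    return [1 if v >= 0 else -1 for v in outputs]


def quantization_loss(outputs):
    """Assumed form: mean squared gap between outputs and their binary codes."""
    codes = binarize(outputs)
    return sum((v - b) ** 2 for v, b in zip(outputs, codes)) / len(outputs)


def hamming_distance(code_a, code_b):
    """Number of positions at which two binary codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


query_outputs = [0.9, -0.8, 0.1, -1.0]
query_code = binarize(query_outputs)
loss = quantization_loss(query_outputs)
distance = hamming_distance(query_code, binarize([0.7, 0.6, -0.2, -0.9]))
```

Minimizing such a term pushes outputs toward ±1, so that little information is lost when codes are binarized for Hamming-distance retrieval.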

Abstract:
With the remarkable growth of adaptive streaming media applications, especially the wide usage of dynamic adaptive streaming schemes over HTTP (DASH), it becomes ever more important to understand the perceptual quality-of-experience (QoE) of end users, who may be constantly experiencing adaptations (switchings) of video bitrate, spatial resolution, and frame rate from one time segment to another on a scale of a few seconds. This is a sophisticated and challenging problem, for which existing visual studies provide very limited guidance. Here we build a new adaptive streaming video database and carry out a series of subjective experiments to understand human QoE behaviors in this multi-dimensional adaptation space. Our study leads to several useful findings. First, our path-analytic results show that quality deviation introduced by quality adaptation is asymmetric with respect to the adaptation direction (positive or negative), and is further influenced by the intensity of quality change (intensity), dimension of adaptation (type), intrinsic video quality (level), content, and the interactions between them. Second, we find that for the same intensity of quality adaptation, a positive adaptation occurring in the low-quality range has more impact on QoE, suggesting an interesting Weber's law effect, while this phenomenon is reversed for a negative adaptation. Third, existing objective video quality assessment models are very limited in predicting time-varying video quality.

Abstract:
Game development is a complex and labor-intensive endeavor. Game environments, storylines, audio, and character behaviors are carefully crafted, requiring graphics artists, storytellers, and software developers to work in unison. Often games end up with a delicate mix of hard-wired behavior in the form of traditional code and somewhat more responsive behavior in the form of large collections of complex rules. Similarly, audio, video, and graphics are carefully and manually curated and synchronized with game actions. The addition of Virtual Reality (VR) and Augmented Reality (AR) has only added to the many challenges that game developers and storytellers face.

Abstract:
The overwhelming volume and complexity of information in online applications make recommendation essential for users to find information of interest. However, two major limitations that coexist in real-world applications, (1) incomplete user profiles and (2) the dynamic nature of user preferences, continue to degrade recommender quality in aspects such as timeliness, accuracy, diversity and novelty. To address both limitations in a single solution, we propose a novel cross-network time-aware recommender solution. The solution first learns historical user models in the target network by aggregating user preferences from multiple source networks. Second, user-level time-aware latent factors are learnt to develop current user models from the historical models and conduct timely recommendations. We illustrate our solution by using auxiliary information from the Twitter source network to improve recommendations for the YouTube target network. Experiments conducted using multiple time-aware and cross-network baselines under different time granularities show that the proposed solution achieves superior performance in terms of accuracy, novelty and diversity.

Abstract:
In this paper, we describe the first-ever machine-human collaboration in creating a real movie trailer (officially released by 20th Century Fox). We introduce an intelligent system designed to understand and encode patterns and types of emotions in horror movies that are useful in trailers. We perform multi-modal semantics extraction, including audio-visual sentiments and scene analysis, and employ a statistical approach to model the key defining components that characterize horror movie trailers. The system was applied to a full-length feature film, "Morgan", released in 2016, where the system identified 10 moments as best candidates for a trailer. We partnered with a professional filmmaker who arranged and edited each of the moments together to construct a comprehensive trailer, completing the entire processing as well as the final trailer assembly within 24 hours. We discuss disruptive opportunities for the film industry and the tremendous media impact of the AI trailer. We confirm the effectiveness of our trailer with a very supportive user study. Finally, based on our close interaction with the film industry, we also introduce and investigate the novel paradigm of tropes within the context of movies for advancing content creation.

Abstract:
In this paper, we propose a novel graph model, called weighted sparse representation regularized graph, to learn a robust object representation using multispectral (RGB and thermal) data for visual tracking. In particular, the tracked object is represented with a graph with image patches as nodes. This graph is dynamically learned from two aspects. First, the graph affinity (i.e., graph structure and edge weights) that indicates the appearance compatibility of two neighboring nodes is optimized based on the weighted sparse representation, in which the modality weight is introduced to leverage RGB and thermal information adaptively. Second, each node weight, which indicates how likely the node is to belong to the foreground, is propagated from the others along the graph affinity. The optimized patch weights are then imposed on the extracted RGB and thermal features, and the target object is finally located by adopting the structured SVM algorithm. Moreover, we also contribute a comprehensive dataset for RGB-T tracking. Compared with existing ones, the new dataset has the following advantages: 1) Its size is sufficiently large for large-scale performance evaluation (total frame number: 210K, maximum frames per video pair: 8K). 2) The alignment between RGB-T video pairs is highly accurate, which does not need pre- and post-processing. 3) The occlusion levels are annotated for analyzing the occlusion-sensitive performance of different methods. Extensive experiments on both public and newly created datasets demonstrate the effectiveness of the proposed tracker against several state-of-the-art tracking methods.

Abstract:
In this paper, we propose a unified framework combining residual learning and random forest regression for the social media prediction task. Given a post including a photo and its social information, the primary goal is to predict the view count of the post. In this regression problem, we first predict the view count from the social information using a random forest regressor. Since the regressor tends to learn a relatively smooth model to avoid overfitting, the extremely high/low view counts of the posts are hard to predict. We solve this problem by using residual learning to refine the prediction. Based on this initial prediction, the residual between the prediction and its ground truth is calculated. Then, the image and its social information are fed to a 13-layer ResNet to predict the residual value, which compensates the initial prediction for extremely high/low view counts. Experiments show that the proposed method significantly outperforms other methods.
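
The two-stage refinement can be sketched in a few lines. The numbers and function names are illustrative assumptions. The first-stage regressor and the residual network are represented only by their outputs, since the abstract gives no further detail about either model.

```python
def residual_targets(initial_predictions, ground_truth):
    """Training targets for the second stage: truth minus first-stage prediction."""
    return [y - p for p, y in zip(initial_predictions, ground_truth)]


def refine(initial_predictions, predicted_residuals):
    """Final prediction: first-stage output compensated by the predicted residual."""
    return [p + r for p, r in zip(initial_predictions, predicted_residuals)]


# A smooth first-stage model under-predicts the popular post and
# over-predicts the unpopular one (illustrative values).
initial = [5.0, 5.5]
truth = [9.0, 1.5]
targets = residual_targets(initial, truth)

# With a perfect residual predictor, the refined output recovers the truth.
final = refine(initial, targets)
```

In practice the second-stage model only approximates these targets, so the refinement reduces rather than eliminates the error on extreme view counts.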

Abstract:
Popularity prediction, which aims at predicting target items' total interactions with users, is an important problem that has attracted a lot of attention in recent years. It can benefit many real applications, such as cold-start recommendation [8] and online advertising [4]. The Social Media Prediction Task-1 (SMP-T1) of the ACM Multimedia 2017 Grand Challenge is designed to predict the popularity of photos published by users in social media.

Abstract:
Social media websites have become an important channel for content sharing and communication between users on social networks. The shared images on the websites, even the ones from the same user, tend to receive quite diverse distributions of views. This raises the problem of image popularity prediction on social media. To address this important research topic, we explore three essential components that have considerable impact on image popularity, namely user profile, post metadata, and photo aesthetics. Moreover, we make use of state-of-the-art predictive modeling approaches to demonstrate the effectiveness of our proposed features in predicting image popularity. We then evaluate the proposed method on a large number of real image posts from Flickr. The experimental results show significant statistical evidence that incorporating the proposed features with an ensemble learning method that combines predictions from support vector regression (SVR) and classification and regression tree (CART) models offers a satisfactory popularity prediction. By understanding the social behavior and the underlying structure of content popularity, our research results can also contribute to designing better algorithms for important applications like content recommendation and advertisement placement.

Abstract:
Person re-identification (re-ID), which aims at spotting a person of interest across multiple camera views, has gained more and more attention in the computer vision community. In this paper, we propose a novel deep Siamese architecture based on a convolutional neural network (CNN) and multi-level similarity perception. According to the distinct characteristics of diverse feature maps, we apply different similarity constraints to both low-level and high-level feature maps during the training stage. Therefore, our network can efficiently learn discriminative feature representations at different levels, which significantly improves the re-ID performance. Besides, our framework has two additional benefits. Firstly, classification constraints can be easily incorporated into the framework, forming a unified multi-task network with similarity constraints. Secondly, as similarity-comparable information has been encoded in the network's learning parameters via back-propagation, pairwise input is not necessary at test time. That means we can extract features of each gallery image and build an index in an off-line manner, which is essential for large-scale real-world applications. Experimental results on multiple challenging benchmarks demonstrate that our method achieves splendid performance compared with the current state-of-the-art approaches.

Abstract:
The organisation of personal data is receiving increasing research attention due to the challenges we face in gathering, enriching, searching, and visualising such data. Given the increasing ease with which personal data is being gathered by individuals, the concept of a lifelog digital library of rich multimedia and sensory content for every individual is fast becoming a reality. The LTA 2017 workshop aims to bring together academics and practitioners to discuss approaches to lifelog data analytics and applications, and to debate the opportunities and challenges for researchers in this new and challenging area.

Abstract:
AltMM 2017, the 2nd International Workshop on Multimedia Alternate Realities at ACM Multimedia aims to provide a forum for researchers and practitioners concerned with multimedia that enables experiencing "alternate realities". Such experiences may allow us to access other worlds, to live other people's stories, to communicate with or experience alternate realities. Different spaces, times or situations can be entered thanks to multimedia contents and systems, which coexist with our current reality, and are sometimes so vivid and engaging that we feel we are living in them. Advances in multimedia are making it possible to create immersive experiences that may involve the user in a different or augmented world, as an alternate reality.

Abstract:
Projecting stereoscopic content onto large general outdoor surfaces, say building facades, presents many challenges, particularly when using the red-cyan anaglyph stereo representation, if colour and depth perception are to remain as accurate as possible. In this paper, we address the challenges relating to long-range projection mapping of stereoscopic content in outdoor areas and present a complete framework for the automatic adjustment of the content to compensate for any adverse projection surface behaviour. We formulate the problem of modeling the projection surface as one of simultaneous recovery of shape and appearance. Our system is composed of two standard fixed cameras, a long-range fixed projector, and a roving video camera for multi-view capture. The overall computational framework comprises four modules: calibration of a long-range vision system using the structure from motion technique, dense 3D reconstruction of the projection surface from calibrated camera images, modeling the light behaviour of the projection surface using roving camera images, and iterative adjustment of the stereoscopic content. In addition to cleverly adapting some of the established computer vision techniques, the system design we present is distinct from previous work. The proposed framework has been tested in real-world applications with two non-trivial user experience studies, and the results reported show considerable improvements in the quality of 3D depth and colour perceived by human participants.

Abstract:
Recent Deep Neural Networks (DNNs) are able to successfully extract the global art style from a painting and apply it to a separate content image. However, the foreground and background of a painting or image typically have distinctly different styles, i.e., textures and color compositions. In this paper, a method is proposed to consider the background and foreground styles separately when painting using artificial neural networks.

Abstract:
In this paper, we address the new problem of the prediction of human intentions. There is neuro-psychological evidence that actions performed by humans are anticipated by peculiar motor acts which are discriminant of the type of action going to be performed afterwards. In other words, an actual intention can be forecast by looking at the kinematics of the immediately preceding movement. To prove it in a computational and quantitative manner, we devise a new experimental setup where, without using contextual information, we predict human intentions all originating from the same motor act. We pose the problem as a classification task and we introduce a new multi-modal dataset consisting of a set of motion capture marker 3D data and 2D video sequences, where, by only analysing very similar movements in both training and test phases, we are able to predict the underlying intention, i.e., the future, never observed action. We also present an extensive experimental evaluation as a baseline, customizing state-of-the-art techniques for both 3D and 2D data analysis. Realizing that video processing methods lead to inferior performance but show complementary information with respect to 3D data sequences, we developed a 2D+3D fusion analysis where we achieve better classification accuracies, attesting the superiority of the multimodal approach for the context-free prediction of human intentions.

Abstract:
Most existing work in visual question answering (VQA) is dedicated to improving the performance of answer prediction, while leaving the explanation of answers unexploited. We argue that exploiting the explanations of question answering not only makes VQA explainable, but also quantitatively improves the prediction performance. In this paper, we propose a novel network architecture, termed Neural Pivot Network (NPN), towards simultaneous VQA and generating explanations in a multi-task learning architecture. NPN is trained by using both image-caption and image-question-answer pairs. In principle, CNN-based deep visual features are extracted and sent to both the VQA channel and the captioning module, the latter of which serves as a pivot to bridge the source image module to the target QA predictor. Such an innovative design enables us to introduce large-scale image-captioning training sets, e.g., MS-COCO Caption and Visual Genome Caption, together with cutting-edge image captioning models to benefit VQA learning. Quantitatively, the proposed NPN performs significantly better than alternatives and state-of-the-art schemes trained on VQA datasets only. Besides, by investigating the by-product of experiments, in-depth digests can be provided along with the answers.

Abstract:
Content-based visual landmark search (CBVLS) enjoys great importance in many practical applications. In this paper, we propose a novel discrete hashing with pair-exemplar (DHPE) to support scalable and efficient large-scale CBVLS. Our approach mainly solves two essential problems in scalable landmark hashing: 1) Intra-landmark visual diversity, and 2) Discrete optimization of hashing codes. Motivated by the characteristic of landmark, we explore the consistent preferences of tourists on landmark as pair-exemplars for scalable discrete hashing learning. In this paper, a pair-exemplar is comprised of a canonical view and the corresponding representative tags. Canonical view captures the key visual component of landmarks, and representative tags potentially involve landmark-specific semantics that can cope with the visual variations of intra-landmark. Based on pair-exemplars, a unified hashing learning framework is formulated to combine visual preserving with exemplar graph and the semantic guidance from representative tags. Further, to guarantee direct semantic transfer for hashing codes and remove information redundancy, we design a novel optimization method based on augmented Lagrange multiplier to explicitly deal with the discrete constraint, the bit-uncorrelated constraint and balance constraint. The whole learning process has linear computation complexity and enjoys desirable scalability. Experiments demonstrate the superior performance of DHPE compared with state-of-the-art methods.

Abstract:
We develop a novel visual model which can recognize protesters, describe their activities by visual attributes and estimate the level of perceived violence in an image. Studies of social media and protests use natural language processing to track how individuals use hashtags and links, often with a focus on those items' diffusion. These approaches, however, may not be effective in fully characterizing actual real-world protests (e.g., violent or peaceful) or estimating the demographics of participants (e.g., age, gender, and race) and their emotions. Our system characterizes protests along these dimensions. We have collected geotagged tweets and their images from 2013-2017 and analyzed multiple major protest events in that period. A multi-task convolutional neural network is employed in order to automatically classify the presence of protesters in an image and predict its visual attributes, perceived violence and exhibited emotions. We also release the UCLA Protest Image Dataset, our novel dataset of 40,764 images (11,659 protest images and hard negatives) with various annotations of visual attributes and sentiments. Using this dataset, we train our model and demonstrate its effectiveness. We also present experimental results from various analyses of geotagged image data in several prevalent protest events. Our dataset will be made accessible at https://www.sscnet.ucla.edu/comm/jjoo/mm-protest/.

Abstract:
Microblogs have become popular media for news propagation in recent years. Meanwhile, numerous rumors and fake news also bloom and spread wildly on the open social media platforms. Without verification, they could seriously jeopardize the credibility of microblogs. We observe that an increasing number of users are using images and videos to post news in addition to texts. Tweets or microblogs are commonly composed of text, image and social context. In this paper, we propose a novel Recurrent Neural Network with an attention mechanism (att-RNN) to fuse multimodal features for effective rumor detection. In this end-to-end network, image features are incorporated into the joint features of text and social context, which are obtained with an LSTM (Long-Short Term Memory) network, to produce a reliable fused classification. The neural attention from the outputs of the LSTM is utilized when fusing with the visual features. Extensive experiments are conducted on two multimedia rumor datasets collected from Weibo and Twitter. The results demonstrate the effectiveness of the proposed end-to-end att-RNN in detecting rumors with multimodal contents.

Abstract:
It is very challenging to evaluate the creative work of artificial intelligence, such as algorithmic composition. Due to the nature of creativity, most existing criteria of music analysis, for example, similarity of the data, cannot be used directly to measure the quality of a new piece of music composed by a computer. Subjective evaluation based on questionnaires lacks quantitative evaluation with solid evidence. To address these difficulties, this paper proposes a novel computational model combined with a novel psychological paradigm. Utilizing brain imaging techniques, the proposed evaluation method can provide a reliable musicality score for machine-composed music.

Abstract:
Recently accumulated massive amounts of geo-tagged photos provide an excellent opportunity to understand human behaviors and can be used for personalized tour recommendation. However, no existing work has considered the visual content information in these photos for tour recommendation. We believe the visual features of photos provide valuable information for measuring user / Point-of-Interest (POI) similarities, which is challenging due to data sparsity. To this end, in this paper, we propose a visual feature enhanced tour recommender system, named 'Photo2Trip', to utilize the visual contents and collaborative filtering models for recommendation. Specifically, we first extract various visual features from photos taken by tourists. Then, we propose a Visual-enhanced Probabilistic Matrix Factorization model (VPMF), which integrates visual features into the collaborative filtering model, to learn user interests by leveraging the historical travel records. Moreover, user interests together with trip constraints are formalized as an optimization problem for trip planning. Finally, the experimental results on real-world data show that our proposed visual-enhanced personalized tour recommendation method outperforms other benchmark methods in terms of recommendation accuracy. The results also show that visual features are effective in alleviating the data sparsity and cold start problems in personalized tour recommendation.

Abstract:
Although deep convolutional neural networks (CNNs) have significantly boosted the performance of many computer vision tasks, their complexities (the size or the number of parameters) are also dramatically increased even with slight performance improvement. However, the larger network leads to more computation requirements, which are unfavorable to resource-constrained scenarios, such as the widely used embedded systems. In this paper, we tentatively explore the essential effect of CNN parameter layout, i.e., the allocation of parameters in the convolution layers, on the discriminative capability of CNN. Instead of enlarging the breadth or depth of networks, we attempt to improve the discriminative ability of CNN by changing its parameter layout under a strict size constraint. Toward this end, a novel energy function is proposed to represent the CNN parameter layout, which makes it possible to model the relationship between the allocation of parameters in the convolution layers and the discriminative ability of CNN. According to extensive experimental results with plain CNN models and Residual Nets, we find that the higher the energy of a specific CNN parameter layout is, the better its discriminative ability is. Following this finding, we propose a novel approach to learn a better parameter layout. Experimental results on two public image classification datasets show that the CNN models with the learned parameter layouts achieve better image classification results under a strict size constraint.

Abstract:
Multimedia data is by nature heterogeneous, conveying semantic information through multiple cues. Text analysis of closed captions has already brought us an understanding of the spoken information. Today's advances in computer vision now enable us to look for relevant semantic information from the visual content of real-world archives. Combining these two levels of extracted information to make sense of an archive still remains a challenge. Multiplex networks, which model multiple families of interactions in a graph, can capture and combine both sources of semantics. We can leverage these objects to extract hierarchies and integrate them in an interactive heterogeneous "visual cloud". Inspired by word clouds, these clouds allow users to grasp visual and textual semantic information captured from a multimedia collection all at once. The interaction then enables direct access to the relevant video. We demonstrate our system with the exploration of a Japanese news archive.

Abstract:
Telehealth is a healthcare service that relies on exchanging information from one place to another to improve a patient's health status. In this demonstration, we aim to provide benefits similar to instantly bringing a doctor into the field to provide the right treatment at the right time for time-sensitive injuries. We present a telehealth system called Teleconsultant that enables near real-time communication between paramedics and doctors via videos captured from wearable cameras. This is crucial in acute situations when the paramedic needs immediate assistance from a remote doctor that could help save patients' lives. Teleconsultant captures video through body cameras worn by the paramedics; we refer to this video as wearable video. The video is transmitted over a heterogeneous wireless network to the remote doctor. Along the network path, video is analyzed in real-time to: (1) enhance video quality (e.g., video stabilization), and (2) detect time-sensitive injuries (e.g., stroke) so that remote doctors can be alerted and prepared when the patient arrives at the hospital by ambulance. We demonstrate an end-to-end system to enable streaming of wearable video from the incident site to the hospital using body cameras worn by paramedics. Additionally, we demonstrate a framework for in-stream processing of the wearable video and we show two real-time video processing functions: stroke detection and video stabilization.

Abstract:
In this demo, we present a real-time surveillance video parsing (RSVP) system to parse surveillance videos. Surveillance video parsing, which aims to segment the video frames into several labels, e.g., face, pants, left-legs, has wide applications, especially in the security field. However, it is very tedious and time-consuming to annotate all the frames in a video. We design an RSVP system to parse surveillance videos in real-time. The RSVP system requires only one labeled frame in the training stage. The RSVP system jointly considers the segmentation of preceding frames when parsing one particular frame within the video. The RSVP system has proved to be effective and efficient in real applications.

Abstract:
In this demo, we present T2U, a cross-platform video recommendation system with a novel interface. First, we propose a cross-platform content association method using deep neural networks. Based on the proposed deep cross-platform association model, we then propose a cross-platform video recommendation system. Different from existing video websites, T2U bridges user information across social and video platforms and is able to effectively solve the data sparsity and cold start problems. In addition, our system provides a novel user-friendly interface. It is designed to intuitively show users not only the recommended videos but also the reasons for recommending them, thus significantly enhancing the user experience.

Abstract:
Online reviews are prevalent. When recounting their experience with a product, service, or venue, in addition to textual narration, a reviewer frequently includes images as a photographic record. While textual sentiment analysis has been widely studied, in this paper we are interested in visual sentiment analysis to infer whether a given image included as part of a review expresses the overall positive or negative sentiment of that review. Visual sentiment analysis can be formulated as image classification using deep learning methods such as Convolutional Neural Networks (CNNs). However, we observe that the sentiment captured within an image may be affected by three factors: image factor, user factor, and item factor. Essentially, only the first factor has been taken into account by previous works on visual sentiment analysis. We develop item-oriented and user-oriented CNNs that we hypothesize would better capture the interaction of image features with specific expressions of users or items. Experiments on images from restaurant reviews show these to be more effective at classifying the sentiments of review images.

Abstract:
The data generated on social media sites continues to grow at an increasing rate, and more than 36% of tweets contain images, which makes the dominance of multimedia content evident. This massive user-generated content has become a reflection of world events. In order to consume this plethora of data more effectively, summarization of these events is needed. However, very few studies have exploited the images attached to social media events to summarize them using "mid-level visual elements". These are the entities which are both representative and discriminative to the target dataset besides being human-readable and hence more informative.

Abstract:
Topological data analysis (TDA) is a branch of mathematics that analyzes the shape of high-dimensional data sets using geometry and algebra. TDA is used for data visualization which represents the relationship among elements using a network. Traditionally, TDA is quadratic in complexity and not commonly used for natural language processing. In this research, we visualize the relationship among words in a text block, words in a corpus and text blocks in a corpus. A text block represents a unit of a corpus, such as a web page in a web corpus, a chapter or section in a book corpus or a document in a media corpus. This research proposes circular topology for representing words both for Local Context (LC) and Global Context (GC). Each text block is a set of sentences forming the LC. We found that feature words are extracted successfully from our LC analysis. The occurrence of extracted featured words in the corpus formed the GC. We evaluate this proposed simplified topological analysis on 3 different corpora: a single book corpus, a book corpus consisting of 7 books having 6020 narrations and a web corpus consisting of 990 web pages. The peripheral nature of the LC reduced the vocabulary size of the corpus significantly in O(nm) time, where n is the number of text blocks and m is the number of nouns in a sentence. GC analysis of featured words reflected useful properties of featured word movement which can be used to analyze topic evolution. GC analysis of text block points aims to find closely related text blocks within a radius. This reflected interesting results that need further supervised investigation. Research on topology-driven natural language processing is in its infancy. This article contributes to this research field by introducing a method motivated by TDA to represent and visualize the peripheral nature of text blocks and corpora, by achieving success in dimensional reduction using local analysis and by simplifying the approach of complex topological analysis through localization.

Abstract:
Real-world multimedia data is often composed of multiple modalities such as an image or a video with associated text (e.g., captions, user comments, etc.) and metadata. Such multimodal data packages are prone to manipulations, where a subset of these modalities can be altered to misrepresent or repurpose data packages, with possible malicious intent. It is therefore important to develop methods to assess or verify the integrity of these multimedia packages. Using computer vision and natural language processing methods to directly compare the image (or video) and the associated caption to verify the integrity of a media package is only possible for a limited set of objects and scenes. In this paper we present a novel deep-learning-based approach that uses a reference set of multimedia packages to assess the semantic integrity of multimedia packages containing images and captions. We construct a joint embedding of images and captions with deep multimodal representation learning on the reference dataset in a framework that also provides image-caption consistency scores (ICCSs). The integrity of query media packages is assessed as the inlierness of the query ICCSs with respect to the reference dataset. We present the MultimodAl Information Manipulation dataset (MAIM), a new dataset of media packages from Flickr, which we are making available to the research community. We use both the newly created dataset as well as Flickr30K and MS COCO datasets to quantitatively evaluate our proposed approach. The reference dataset does not contain unmanipulated versions of tampered query packages. Our method is able to achieve F-1 scores of 0.75, 0.89 and 0.94 on MAIM, Flickr30K and MS COCO, respectively, for detecting semantically incoherent media packages.
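The scoring idea above (a consistency score per image-caption pair, judged against the scores of a reference set) can be sketched as follows. The abstract does not specify how the ICCS or the inlierness is computed, so the choices here are assumptions: cosine similarity between hypothetical image and caption embeddings serves as the consistency score, and an empirical quantile over the reference scores serves as the inlierness measure.

```python
import math

def cosine(u, v):
    """Cosine similarity, used here as an assumed image-caption consistency score."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def inlierness(score, reference_scores):
    """Fraction of reference scores that are <= the query score (empirical quantile)."""
    return sum(r <= score for r in reference_scores) / len(reference_scores)

# Hypothetical consistency scores of unmanipulated reference packages.
reference_scores = [0.80, 0.85, 0.90, 0.95]

# Hypothetical joint-embedding vectors for two query packages.
consistent = cosine([1.0, 0.0], [0.9, 0.1])   # caption matches the image
mismatched = cosine([1.0, 0.0], [0.0, 1.0])   # caption was swapped

print(inlierness(consistent, reference_scores))
print(inlierness(mismatched, reference_scores))
```

A query whose score falls below almost all reference scores would be flagged as a likely manipulated package; the threshold for that decision is left open here.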

Abstract:
Recently, image question answering (ImageQA) has gained lots of attention in the research community. However, as its natural extension, video question answering (VideoQA) is less explored. Although both tasks look similar, VideoQA is more challenging mainly because of the complexity and diversity of videos. As such, simply extending the ImageQA methods to videos is insufficient and suboptimal. In particular, working with video requires modeling its inherent temporal structure and analyzing the diverse information it contains. In this paper, we consider exploiting the appearance and motion information residing in the video with a novel attention mechanism. More specifically, we propose an end-to-end model which gradually refines its attention over the appearance and motion features of the video using the question as guidance. The question is processed word by word until the model generates the final optimized attention. The weighted representation of the video, as well as other contextual information, is used to generate the answer. Extensive experiments show the advantages of our model compared to other baseline models. We also demonstrate the effectiveness of our model by analyzing the refined attention weights during the question answering procedure.

Abstract:
Convolutional Neural Networks (CNNs) have revolutionized the research in computer vision, due to their ability to capture complex patterns, resulting in high inference accuracies. However, the increasingly complex nature of these neural networks means that they are particularly suited for server computers with powerful GPUs. We envision that deep learning applications will be eventually and widely deployed on mobile devices, e.g., smartphones, self-driving cars, and drones. Therefore, in this paper, we aim to understand the resource requirements (time, memory) of CNNs on mobile devices. First, by deploying several popular CNNs on mobile CPUs and GPUs, we measure and analyze the performance and resource usage for every layer of the CNNs. Our findings point out the potential ways of optimizing the performance on mobile devices. Second, we model the resource requirements of the different CNN computations. Finally, based on the measurement, profiling, and modeling, we build and evaluate our modeling tool, Augur, which takes a CNN configuration (descriptor) as the input and estimates the compute time and resource usage of the CNN, to give insights about whether and how efficiently a CNN can be run on a given mobile platform. In doing so Augur tackles several challenges: (i) how to overcome profiling and measurement overhead; (ii) how to capture the variance in different mobile platforms with different processors, memory, and cache sizes; and (iii) how to account for the variance in the number, type and size of layers of the different CNN configurations.
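To make the notion of estimating cost from a CNN descriptor concrete, the sketch below counts parameters and multiply-accumulate operations (MACs) for one convolution layer using the standard formulas. It is not Augur's model, which the abstract does not detail; the function name and the device throughput figure are hypothetical.

```python
def conv_cost(h, w, c_in, c_out, k, stride=1, pad=0):
    """Output shape, parameter count and MACs of a square-kernel convolution layer."""
    out_h = (h + 2 * pad - k) // stride + 1
    out_w = (w + 2 * pad - k) // stride + 1
    params = k * k * c_in * c_out + c_out          # weights plus one bias per filter
    macs = out_h * out_w * k * k * c_in * c_out    # one MAC per weight per output pixel
    return {"out": (out_h, out_w, c_out), "params": params, "macs": macs}

# Example descriptor: 227x227x3 input, 96 filters of size 11x11, stride 4.
layer = conv_cost(227, 227, 3, 96, k=11, stride=4)
print(layer)

# Rough time estimate under an assumed sustained throughput (hypothetical figure).
assumed_macs_per_second = 5e9
print(layer["macs"] / assumed_macs_per_second, "seconds")
```

A pure operation count like this ignores memory traffic and cache effects, which is one reason measurement-based profiling of the kind the abstract describes is needed on real mobile platforms.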

Abstract:
Ever since the emergence of digitization, we've used the term multimedia to represent a combination of different kinds of media types, such as images, audio, and videos. As new sensing technologies emerge and are now becoming omnipresent in daily lives, the definition, role and significance of multimedia is changing. Multimedia now represents the means for communicating, cooperating, and also for monitoring numerous aspects of daily life, at various levels of granularity and application, ranging from personal to societal. With this shift, we have since moved from comprehending single media and its state toward comprehending media in terms of its use context.

Abstract:
The seventh Audio-Visual Emotion Challenge and workshop AVEC 2017 was held in conjunction with ACM Multimedia'17. This year, the AVEC series addresses two distinct sub-challenges: emotion recognition and depression detection. The Affect Sub-Challenge is based on a novel dataset of human-human interactions recorded 'in-the-wild', whereas the Depression Sub-Challenge is based on the same dataset as the one used in AVEC 2016, with human-agent interactions. In this summary, we mainly describe participation and its conditions.

Abstract:
Cloud gaming is promising to provide high-quality game services by outsourcing game execution to the cloud so that users can access games via thin clients (e.g., smartphones or tablets). However, existing cloud gaming systems suffer from low GPU utilization in the virtualized environment. Moreover, GPU resources are scheduled in units of virtual machines (VMs) and this kind of coarse-grained scheduling at the VM-level fails to fully exploit GPU processing capacity. In this paper, we present ShareRender, a cloud gaming system that offloads graphics workloads within VMs directly to GPUs, bypassing GPU virtualization. For each game running in a VM, ShareRender starts a graphics wrapper to intercept frame rendering requests and assign them to render agents responsible for frame rendering on GPUs. Thanks to the flexible workload assignment among multiple render agents, ShareRender enables fine-grained resource sharing at the frame-level to significantly improve GPU utilization. Furthermore, we design an online algorithm to determine workload assignment and migration of render agents, which considers the tradeoff between minimizing the number of active servers and keeping the agent migration cost low. We conduct experiments on real deployment and trace-driven simulations to evaluate the performance of ShareRender under different system settings. The results show that ShareRender outperforms the existing video-streaming-based cloud gaming system by over 4 times.

Abstract:
Recently, there has been significant interest in 360-degree panorama video. However, such videos usually require an extremely high bitrate, which hinders their wide distribution over the Internet. Tile-based viewport adaptive streaming is a promising way to deliver 360-degree video because it downloads only the requested portion. But it is not trivial for it to achieve good Quality of Experience (QoE) because Internet request-reply delay is usually much higher than motion-to-photon latency. In this paper, we leverage a probabilistic approach to pre-fetch tiles to counter viewport prediction error, and design a QoE-driven viewport adaptation system, 360ProbDASH. It treats the user's head movements as probabilistic events, and constructs a probabilistic model to depict the distribution of viewport prediction error. A QoE-driven optimization framework is proposed to minimize the total expected distortion of pre-fetched tiles. Besides, to smooth border effects of mixed-rate tiles, the spatial quality variance is also minimized. With the requirement of short-term viewport prediction under a small buffer, it applies a target-buffer-based rate adaptation algorithm to ensure continuous playback. We implement a 360ProbDASH prototype and carry out extensive experiments on a simulation test-bed and the real-world Internet with real users' head movement traces. The experimental results demonstrate that 360ProbDASH achieves almost 39% gains in viewport PSNR and a 46% reduction in spatial quality variance compared with existing viewport adaptation methods.
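The notion of minimizing expected distortion over pre-fetched tiles can be illustrated with a small sketch. It is not the 360ProbDASH optimizer: it uses a simple greedy heuristic, omits the spatial quality variance term, and the viewing probabilities, quality levels and bitrate budget are hypothetical values.

```python
def allocate(probs, levels, budget):
    """Greedily pick a quality level per tile to reduce expected distortion.

    probs:  viewing probability of each tile
    levels: list of (bitrate, distortion) pairs, ordered by increasing bitrate
    budget: total bitrate available for all tiles
    """
    choice = [0] * len(probs)                 # start every tile at the lowest level
    used = levels[0][0] * len(probs)
    if used > budget:
        raise ValueError("budget too small for the lowest quality level")
    while True:
        best, best_gain = None, 0.0
        for i, p in enumerate(probs):
            nxt = choice[i] + 1
            if nxt >= len(levels):
                continue                      # tile already at the top level
            extra = levels[nxt][0] - levels[choice[i]][0]
            if used + extra > budget:
                continue                      # upgrade does not fit
            # Expected distortion reduction per extra unit of bitrate.
            gain = p * (levels[choice[i]][1] - levels[nxt][1]) / extra
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            return choice
        used += levels[choice[best] + 1][0] - levels[choice[best]][0]
        choice[best] += 1

def expected_distortion(probs, levels, choice):
    return sum(p * levels[c][1] for p, c in zip(probs, choice))

probs = [0.9, 0.5, 0.1]                        # hypothetical viewing probabilities
levels = [(1, 10.0), (2, 6.0), (4, 3.0)]       # hypothetical (bitrate, distortion)
choice = allocate(probs, levels, budget=7)
print(choice, expected_distortion(probs, levels, choice))
```

Tiles that are more likely to be viewed receive higher quality levels, while unlikely tiles are still fetched at the lowest level as a safeguard against prediction error.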

Abstract:
To guarantee a satisfying Quality of Experience (QoE) for consumers, it is required to measure image quality efficiently and reliably. The neglect of the high-level semantic information may result in predicting a clear blue sky as bad quality, which is inconsistent with human perception. Therefore, in this paper, we tackle this problem by exploiting the high-level semantics and propose a novel no-reference image quality assessment method for realistic blur images. Firstly, the whole image is divided into multiple overlapping patches. Secondly, each patch is represented by the high-level feature extracted from the pre-trained deep convolutional neural network model. Thirdly, three different kinds of statistical structures are adopted to aggregate the information from different patches, which mainly contain some common statistics (i.e., the mean & standard deviation, quantiles and moments). Finally, the aggregated features are fed into a linear regression model to predict the image quality. Experiments show that, compared with low-level features, high-level features indeed play a more critical role in resolving the aforementioned challenging problem for quality estimation. Besides, the proposed method significantly outperforms the state-of-the-art methods on two realistic blur image databases and achieves comparable performance on two synthetic blur image databases.
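The aggregation step (the third step above) can be sketched as follows. This is an illustration only: the per-patch vectors are hypothetical placeholders for CNN features, and the exact statistics, quantile levels and moment orders used in the paper are not given in the abstract, so the mean, population standard deviation and quartiles are assumed here.

```python
from statistics import mean, pstdev, quantiles

def aggregate(patch_features):
    """Aggregate per-patch feature vectors into one image-level vector.

    patch_features: list of equal-length vectors, one per patch (at least two patches).
    For every feature dimension, the mean, standard deviation and three quartiles
    over all patches are appended to the output.
    """
    aggregated = []
    for dim in zip(*patch_features):          # iterate over feature dimensions
        q1, q2, q3 = quantiles(dim, n=4)      # quartiles across patches
        aggregated.extend([mean(dim), pstdev(dim), q1, q2, q3])
    return aggregated

# Four hypothetical patches with two-dimensional features.
patches = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
image_vector = aggregate(patches)
print(image_vector)
```

The resulting fixed-length vector does not depend on the number of patches, so images of different sizes can be fed to the same regression model; higher-order moments could be appended in the same loop.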

Abstract:
Convolutional Neural Networks (CNNs) have been widely used and achieve impressive performance, typically at the cost of very expensive computation. Some methods accelerate CNN training with distributed GPUs, i.e., GPUs deployed on multiple servers. Unfortunately, they need to transmit a large amount of data among servers, which leads to long data transmission time and long GPU idle time. To address this, we propose a novel hybrid parallelism architecture named "Wheel" to accelerate CNN training by reducing the transmitted data and fully using GPUs simultaneously. Specifically, Wheel first partitions the layers of a CNN into two kinds of modules, a convolutional module and a fully-connected module, and deploys them following the proposed hybrid parallelism. In this way, Wheel transmits only a few CNN parameters among different servers and transmits most of the parameters within the same server, so the time to transmit data is significantly reduced. Second, to keep each GPU fully busy and reduce idle time, Wheel devises an alternating strategy that deploys multiple workers on each GPU. Once one worker is suspended to receive data, another worker on the same GPU starts executing its computing task. The workers on each GPU run concurrently and repeatedly, like wheels. Experiments show that the proposed scheme outperforms state-of-the-art parallel approaches.

Abstract:
Human beings have developed a diverse food culture. Many factors, such as ingredients, visual appearance, courses (e.g., breakfast and lunch), flavor, and geographical regions, affect our food perception and choice. In this work, we focus on multi-dimensional food analysis based on these food factors to benefit various applications such as summary and recommendation. To this end, we propose a delicious recipe analysis framework to incorporate various types of continuous and discrete attribute features and multi-modal information from recipes. First, we develop a Multi-Attribute Theme Modeling (MATM) method, which can incorporate arbitrary types of attribute features to jointly model them and the textual content. We then utilize a multi-modal embedding method to build the correlation between the learned textual theme features from MATM and visual features from the deep learning network. By learning attribute-theme relations and multi-modal correlation, we are able to support different applications, including (1) flavor analysis and comparison for better understanding the flavor patterns from different dimensions, such as the region and course, (2) region-oriented multi-dimensional food summary with both multi-modal and multi-attribute information, and (3) multi-attribute oriented recipe recommendation. Furthermore, our proposed framework is flexible and enables easy incorporation of arbitrary types of attributes and modalities. Qualitative and quantitative evaluation results have validated the effectiveness of the proposed method and framework on the collected Yummly dataset.

Abstract:
Spatial and temporal patterns inherent in facial behavior carry crucial information for distinguishing posed from spontaneous expressions, but have not been thoroughly exploited yet. To address this issue, we propose a novel dynamic model, termed the interval temporal restricted Boltzmann machine (IT-RBM), to jointly capture the global spatial patterns and complex temporal patterns embedded in posed and spontaneous expressions, respectively, in order to distinguish between them. Specifically, we consider a facial expression as a complex activity that consists of temporally overlapping or sequential primitive facial events, which are defined as the motion of feature points. We propose using Allen's Interval Algebra to represent the complex temporal patterns existing in facial events through a two-layer Bayesian network. Furthermore, we propose employing a multi-value restricted Boltzmann machine to capture intrinsic global spatial patterns among facial events. Experimental results on three benchmark databases, the UvA-NEMO smile database, the DISFA+ database, and the SPOS database, demonstrate that the proposed interval temporal restricted Boltzmann machine can successfully capture the intrinsic spatial-temporal patterns in facial behavior, and thus outperforms state-of-the-art work on distinguishing posed from spontaneous expressions.
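
For readers unfamiliar with Allen's Interval Algebra, the sketch below classifies the temporal relation between two intervals into one of the thirteen standard Allen relations. It is a generic illustration only: the interval endpoints stand in for hypothetical facial-event start and end frames, and how the paper encodes these relations in its Bayesian network is not shown here.

```python
def allen_relation(a, b):
    """Return the Allen relation that holds from interval a to interval b.

    Each interval is a (start, end) tuple with start < end, for example the
    first and last frame of a (hypothetical) primitive facial event.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    # From here on the intervals overlap and share no endpoint.
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    return "overlaps" if a_start < b_start else "overlapped-by"


# Two hypothetical events given as (first_frame, last_frame).
print(allen_relation((0, 4), (2, 6)))
```

Because exactly one of the thirteen relations holds for any pair of proper intervals, each pair of events can be treated as a single discrete variable.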

Abstract:
The outputs of the higher layers of deep pre-trained convolutional neural networks (CNNs) have consistently been shown to provide a rich representation of an image for use in recognition tasks. This study explores the suitability of such an approach for speech-based emotion recognition tasks. First, we detail a new acoustic feature representation, denoted as deep spectrum features, derived from feeding spectrograms through a very deep image classification CNN and forming a feature vector from the activations of the last fully connected layer. We then compare the performance of our novel features with standardised brute-force and bag-of-audio-words (BoAW) acoustic feature representations for 2- and 5-class speech-based emotion recognition in clean, noisy and denoised conditions. The presented results show that image-based approaches are a promising avenue of research for speech-based recognition tasks. Key results indicate that deep spectrum features are comparable in performance with the other tested acoustic feature representations in train-test conditions matched for noise type; however, the BoAW paradigm is better suited to cross-noise-type train-test conditions.

Abstract:
An automated process that can suggest a soundtrack to a user-generated video (UGV) and make the UGV a music-compliant professional-like video is challenging but desirable. To this end, this paper presents an automatic music video (MV) generation system that conducts soundtrack recommendation and video editing simultaneously. First, a long UGV is divided into a sequence of fixed-length short (e.g., 2 seconds) segments, and then a multi-task deep neural network (MDNN) is applied to predict the pseudo acoustic (music) features (also called the pseudo song) from the visual (video) features of each video segment. In this way, the distance between any pair of video and music segments of the same length can be computed in the music feature space. Second, the sequence of pseudo acoustic (music) features of the UGV and the sequence of the acoustic (music) features of each music track in the music collection are temporally aligned by the dynamic time warping (DTW) algorithm with a pseudo-song-based deep similarity matching (PDSM) metric. Third, for each music track, the video editing module selects and concatenates the segments of the UGV based on the target and concatenation costs given by a pseudo-song-based deep concatenation cost (PDCC) metric according to the DTW-aligned result to generate a music-compliant professional-like video. Finally, all the generated MVs are ranked, and the best MV is recommended to the user. The MDNN for pseudo song prediction and the PDSM and PDCC metrics are trained on an annotated official music video (OMV) corpus. The results of objective and subjective experiments demonstrate that the proposed system performs well and can generate appealing MVs with better viewing and listening experiences.
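
The alignment step can be sketched with textbook dynamic time warping. In the sketch below, the `distance` argument is a placeholder for the learned PDSM metric, which is not reproduced here; a plain absolute difference over scalar features is used purely for illustration, and the function name `dtw_align` is hypothetical.

```python
def dtw_align(seq_a, seq_b, distance):
    """Align two sequences with classic dynamic time warping.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    matching seq_a[i] to seq_b[j]. `distance` is any pairwise cost function.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = minimal cost of aligning the first i and first j items.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance seq_a only
                cost[i][j - 1],      # advance seq_b only
            )
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Hypothetical one-dimensional "pseudo song" and music feature sequences.
video_features = [1, 2, 3]
music_features = [1, 2, 2, 3]
total, path = dtw_align(video_features, music_features, lambda x, y: abs(x - y))
print(total, path)
```

The returned path is what a downstream editing step would consume: it says which video segment corresponds to which music segment after warping.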

Abstract:
360° videos are a new kind of medium that gives viewers a sense of real immersion as they glimpse the action from all angles and directions. Naturally, professional and amateur film-makers are actively adopting this new medium for transformative storytelling. Despite this phenomenal progress in 360° video creation, current understanding of users' viewing experience of these videos is limited. In this paper, we present the first comparative study on the user experience with 360° videos on mobile devices using different interaction techniques. We observed 18 participants' interaction with six 360° videos with different viewport characteristics (static or moving) on a smartphone, a tablet and a head mounted display (HMD), respectively, and measured how they interact with the content. We then conducted semi-structured interviews with the participants in which they explained their interaction with and viewing experience of 360° videos across the three devices. Our findings show that 360° videos with moving viewports elicit higher engagement from the viewers, and offer a superior viewing experience. However, these videos are cognitively demanding and require constant user attention. Our participants preferred the condition with dynamic peephole interaction on a smartphone for watching 360° videos due to the simplicity of exploration and familiarity with navigation controls. Many participants reported that the HMD offers the most immersive experience; however, it comes at the expense of higher cognitive burden, motion sickness and physical discomfort.

Abstract:
Similar to the concept of a cocktail or mocktail, we present Vocktail (a.k.a. Virtual Cocktail) - an interactive drinking utensil that digitally simulates multisensory flavor experiences. The Vocktail system utilizes three common sensory modalities, taste, smell, and visual (color), to create virtual flavors and augment the existing flavors of a beverage. The system is coupled with a mobile application that enables users to create customized virtual flavor sensations by configuring each of the stimuli via Bluetooth. The system consists of a cocktail glass that is seamlessly fused into a 3D printed structure, which holds the electronic control module, three scent cartridges, and three micro air-pumps. When a user drinks from the system, the visual (RGB light projected on the beverage), taste (electrical stimulation at the tip of the tongue), and smell stimuli (emitted by micro air-pumps) are combined to create a virtual flavor sensation, thus altering the flavor of the beverage. In summary, this paper discusses 1) technical details of the Vocktail system and 2) user experiments that investigate the influences of these multimodal stimuli on the perception of virtual flavors in terms of five primary tastes (i.e. salty, sweet, bitter, sour, and umami). Our results suggest that the combination of these stimuli delivers richer flavor experiences, as compared to separately simulating individual modalities, and indicate the types of pairings that can be formed between smell and electric taste stimuli.

Abstract:
Second screen applications are becoming key for broadcasters exploiting the convergence of TV and Internet. Authoring such applications however remains costly. In this paper, we present a second screen authoring application that leverages multimedia content analytics and social media monitoring. A back-office is dedicated to easy and fast content ingestion, segmentation, description and enrichment with links to entities and related content. From the back-end, broadcasters can push enriched content to front-end applications providing customers with highlights, entity and content links, overviews of social network, etc. The demonstration operates on political debates ingested during the 2017 French presidential election, enabling insights on the debates.

Abstract:
Deep neural networks (DNN) have performed impressively in the processing of multimedia signals. Most DNN-based approaches were developed to handle real-valued data; very few have been designed for complex-valued data, despite their being essential for processing various types of multimedia signal. Accordingly, this work presents a complex-valued deep recurrent neural network (C-DRNN) for singing voice separation. The C-DRNN operates on the complex-valued short-time discrete Fourier transform (STFT) domain. A key aspect of the C-DRNN is that the activations and weights are complex-valued. The goal herein is to reconstruct the singing voice and the background music from a mixed signal. For error back-propagation, CR-calculus is utilized to calculate the complex-valued gradients of the objective function. To reinforce model regularity, two constraints are incorporated into the objective function of the C-DRNN. The first is an additional masking layer that ensures the sum of separated sources equals the input mixture. The second is a discriminative term that preserves the mutual difference between two separated sources. Finally, the proposed method is evaluated using the MIR-1K dataset and a singing voice separation task. Experimental results demonstrate that the proposed method outperforms the state-of-the-art DNN-based methods.
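
The first constraint, a masking layer whose two outputs sum to the input mixture, can be illustrated with a soft time-frequency mask. The sketch below is an assumption about the general form of such a layer (magnitude-ratio masking applied to complex STFT bins), not the paper's exact formulation; the function name `apply_soft_mask` is hypothetical.

```python
def apply_soft_mask(voice_est, music_est, mixture):
    """Rescale two raw source estimates so that they sum to the mixture.

    All arguments are equal-length lists of complex STFT bins. Each bin of
    the mixture is split between the two sources in proportion to the
    magnitudes of the raw estimates (an assumed, commonly used mask form).
    """
    voice_out, music_out = [], []
    for v, m, x in zip(voice_est, music_est, mixture):
        denom = abs(v) + abs(m)
        # If both estimates are zero, split the bin evenly.
        voice_share = abs(v) / denom if denom > 0 else 0.5
        voice_bin = voice_share * x
        voice_out.append(voice_bin)
        music_out.append(x - voice_bin)  # remainder goes to the music
    return voice_out, music_out


# Hypothetical raw network outputs and mixture for two STFT bins.
voice, music = apply_soft_mask([3 + 4j, 0j], [5j, 0j], [2 + 2j, 1 - 1j])
print(voice, music)
```

Computing the second source as the remainder of the mixture makes the sum constraint hold by construction, up to floating-point rounding.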

Abstract:
In the emerging crowd sourced live cast services, numerous amateur broadcasters live stream their video content to worldwide viewers and constantly interact with them through chat messages. Live video content is transcoded into multiple quality versions to better serve viewers with different network and device configurations. Cloud computing becomes a natural choice to handle these computationally intensive tasks due to its elasticity and the "pay-as-you-go" billing model. However, given the large number of concurrent channels and the diverse viewer geo-distributions in this new crowd sourced live cast service, even the cloud becomes significantly expensive to cover the whole community and inadequate in fulfilling the latency requirement. In this paper, after observing the abundant computational resources residing in end viewers, we propose a Cloud-Crowd collaborative system, C2, which combines end viewers with the cloud to perform video transcoding in a cost-efficient way. To quantify the heterogeneity and uncertainty of viewers and overcome the asymmetric information barrier, we incorporate statistical descriptions into our bidding language and design truthful auctions to recruit stable viewers with appropriate incentives. We further tailor redundancy strategies for workloads with different Quality of Service requirements to improve the stability of our system. Desirable economic properties, such as social efficiency, ex-post incentive compatibility, and individual rationality, are proved to be guaranteed in our studied scenarios. Using traces captured from the popular Twitch platform, we show that C2 achieves up to 93% more cost saving than a pure cloud-based solution, and significantly outperforms other baseline approaches in both social welfare and system stability.

Abstract:
We developed a system for visualizing stone trajectories in curling games for live broadcasts. Robustly tracking a moving stone from curling video sequences is difficult because the stone is frequently hidden by the brushes held by the players and the players' bodies during their sweeping actions. Although a number of methods for visual object tracking have been proposed, real-time tracking under heavy occlusion is still a challenging task. We thus propose an online machine learning method for tracking a curling stone to deal with changes in its appearance. The method creates a candidate-object image, which eliminates background noises, and is used as input to the kernelized correlation filter (KCF) tracker. Coordinate transformation is also applied to the system to improve its operability. Experimental results showed that our stone tracker is more accurate and faster than other conventional tracking methods. The developed system was used at All Japan Curling Championships 2017 to display stone trajectories during live broadcasts.

Abstract:
With recent developments and advances in distance learning and MOOCs, the number of open educational videos on the Internet has grown dramatically in the past decade. However, most of these videos are lengthy and lack high-quality indexing and annotations, which triggers an urgent demand for efficient and effective tools that facilitate video content navigation and exploration. In this paper, we propose a novel visual navigation system for exploring open educational videos. The system tightly integrates multimodal cues obtained from the visual, audio and textual channels of the video and presents them with a series of interactive visualization components. With the help of this system, users can explore the video content at multiple levels of detail to identify content of interest with ease. Extensive experiments and comparisons against previous studies demonstrate the effectiveness of the proposed system.

Abstract:
Messages like "If You Drink Don't Drive", "Each water drop count" or "Smoking causes cancer" are often paired with visual content in order to persuade an audience to perform specific actions, such as clicking a link, retweeting a post or purchasing a product. Despite its usefulness, the current way of discovering actionable images is entirely manual and typically requires marketing experts to filter thousands of candidate images. To help understand the audience, marketers and social scientists have for years been investigating the role of personality in personalized services by leveraging AI technologies and social network data. In this work, we analyze how personality affects user actions on images in a social network website, and which visual stimuli contained in image content influence actions from users with certain Big Five traits. To achieve this goal, we ground this research in psychological studies which investigate the interplay between personality and emotions. Given a public Twitter dataset containing 1.6 million user-image timeline retweet actions, we carried out two extensive statistical analyses, which show significant correlation between personality traits and affective visual concepts in image content. We then proposed a novel model that combines user personality traits and image visual concepts for the task of predicting user actions in advance. This work is the first attempt to integrate personality traits and multimedia features, and takes an important step towards building personalized systems for automatically discovering actionable multimedia content.

Abstract:
Popularity prediction on social media has attracted extensive attention due to its widespread applications, such as online marketing and economic trends. In this paper, we describe the solution of our team, CASIA-NLPR-MMC, for the Social Media Prediction (SMP) challenge. This challenge is designed to predict the popularity of social media posts. We present a stacking framework that combines a diverse set of models to predict the popularity of images on Flickr using user-centered, image content and image context features. Several individual models are employed to score the popularity of an image at an earlier stage, and then a stacking model based on Support Vector Regression (SVR) is utilized to train a meta model over the individual models trained beforehand. The Spearman's Rho of this stacking model is 0.88 and the mean absolute error is about 0.75 on our test set. On the official final-released test set, the Spearman's Rho is 0.7927 and the mean absolute error is about 1.1783. The results on the provided dataset demonstrate the effectiveness of our proposed approach for image popularity prediction.
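
The two evaluation measures reported above, Spearman's Rho and mean absolute error, can be computed as in the following minimal sketch. It uses only the standard library, the function names are hypothetical, and the scores in the usage lines are made-up illustration values rather than challenge data.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average rank of the tied group
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(predicted, actual):
    """Spearman's Rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(actual))


def mean_absolute_error(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Made-up popularity scores for four posts.
predicted = [1.0, 2.0, 3.0, 4.0]
actual = [1.5, 2.5, 3.5, 4.5]
print(spearman_rho(predicted, actual), mean_absolute_error(predicted, actual))
```

The two measures are complementary: Spearman's Rho only checks whether posts are ranked in the right order, while the mean absolute error penalizes how far each predicted score is from the true one.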

Abstract:
The ability to semantically interpret hand-drawn line sketches, although very challenging, can pave the way for novel applications in multimedia. We propose SKETCHPARSE, the first deep-network architecture for fully automatic parsing of freehand object sketches. SKETCHPARSE is configured as a two-level fully convolutional network. The first level contains shared layers common to all object categories. The second level contains a number of expert sub-networks. Each expert specializes in parsing sketches from object categories which contain structurally similar parts. Effectively, the two-level configuration enables our architecture to scale up efficiently as additional categories are added. We introduce a router layer which (i) relays sketch features from shared layers to the correct expert and (ii) eliminates the need to manually specify the object category during inference. To bypass laborious part-level annotation, we sketchify photos from semantic object-part image datasets and use them for training. Our architecture also incorporates object pose prediction as a novel auxiliary task which boosts overall performance while providing supplementary information regarding the sketch. We demonstrate SKETCHPARSE's abilities (i) on two challenging large-scale sketch datasets, (ii) in parsing unseen, semantically related object categories, and (iii) in improving fine-grained sketch-based image retrieval. As a novel application, we also outline how SKETCHPARSE's output can be used to generate caption-style descriptions for hand-drawn sketches.

Abstract:
This paper presents FaceCollage, a robust and real-time system for head reconstruction that can be used to create easy-to-deploy telepresence systems, using a pair of consumer-grade RGBD cameras that provide a wide range of views of the reconstructed user. A key feature is that the system is very simple to rapidly deploy, with autonomous calibration and requiring minimal intervention from the user, other than casually placing the cameras. This system is realized through three technical contributions: (1) a fully automatic calibration method, which analyzes and correlates the left and right RGBD faces just by the face features; (2) an implementation that exploits the parallel computation capability of GPU throughout most of the system pipeline, in order to attain real-time performance; and (3) a complete integrated system on which we conducted various experiments to demonstrate its capability, robustness, and performance, including testing the system on twelve participants with visually-pleasing results.

Abstract:
This project, with its multiple phases, transforms the 2016 United States Presidential Election Twitter data into a large-scale installation to probe the question of how artificial intelligence via the ways of social media assumes form and transforms the shaping of the future of a nation. By mapping election data into flickering lights, clicking sounds, and the exchange of fluid between IV bags, the installation recounts Twitter election-related activities from February 2016 through the election date of November 8, 2016. By identifying major Twitter influencers in this period, uncovering propagation patterns in the AI-enabled Twitter landscape, and differentiating human tweets from robotic (Twitter bot) tweets, the installation exposes the inner mechanisms of a world where true human activity and artificially intelligent automation mutually influence each other and propagate inseparably as a combined force. The installation allows the examination of machine world infiltration that shifted the generative entropic propagation of campaign messaging on social media, and provides a physical space for contemplating the significant challenges social media pose in our understanding of the social fabric and the radical transformation of the ways in which we now relate to each other.

Abstract:
The 2017 winner of the prestigious ACM Special Interest Group on Multimedia (SIGMM) award for Outstanding Technical Contributions to Multimedia Computing, Communications and Applications is Prof. Dr. Arnold Smeulders. The award is given in recognition of his outstanding and pioneering contributions to defining and bridging the semantic gap in content-based image retrieval. During the early years of his scientific career Dr. Smeulders studied the invariant fundamentals of lines, shapes, textures and colors. It resulted in several PAMI papers (IEEE Transactions on Pattern Analysis and Machine Intelligence) that are still being cited today. Besides the written record, Arnold always had the drive to showcase academic results in real-world systems. In 1989 he introduced the Diagnostic Encyclopedia Workstation: a system containing 3,000 images from pathology combined with what we would now call a handcrafted ontology. Already then he showed his great ability to generalize results as in 1991 he launched one of the world's first image search engines that combined automatic indexing, interactive retrieval, and evaluation. During this period he was also instrumental in building our community, organizing the first conferences, and defining the semantic gap as the fundamental problem of image retrieval. The end of this era culminated in what became the most cited paper of our discipline: Content-based image retrieval at the end of the early years. ....

Abstract:
ACM Special Interest Group on Multimedia (SIGMM) is pleased to present the 2017 SIGMM Outstanding Ph.D. Thesis Award to Dr. Chien-Nan (Shannon) Chen. The award committee considers Dr. Chien-Nan Chen's dissertation entitled "Semantics-Aware Content Delivery Framework For 3D Tele-Immersion" worthy of the recognition as the thesis is the first to consider semantic information in full-aspect with 3DTI systems. Dr. Chen's thesis includes confound semantic factors regarding user (role, preferences, view), activity (range, speed, posture, position, number of participants, repetitiveness), and environment (network capability, computing capability, power limitation) into the design and development of our 3DTI systems. Less than a year after its publication, the methodology advanced by this thesis has been adopted by three real-life multimedia applications. One of them is Facebook's 360 video streaming, which adopted user-objective awareness into its dynamic streaming for 360 content. The second application also comes from Facebook's 360 media. In the latest F8 conference (Facebook's annual developer conference) of 2017, they announced that the new dynamic streaming standard will encompass content-semantics awareness into its new view prediction model, which leads to more effective pre-buffering of 360 video content. The third application is a theatrics social experiment involving hundreds of Shakespeare enthusiasts in The Miracle Theatre, UK, which referenced the 3DTI amphitheater developed in this thesis which adopts semantic awareness on prioritized content dissemination. ....

Abstract:
Real-time bidding (RTB) has become a new norm in display advertising, where a publisher uses auction models to sell online users' page views to advertisers. In RTB, the ad with the highest bid price will be displayed to the user. This ad displaying process is biased towards the publisher. In fact, the benefits of the advertiser and the user have rarely been discussed. Towards global optimization, we argue that all stakeholders' benefits should be considered. To this end, we propose a novel computation framework in which multimedia techniques and auction theory are integrated. This doctoral research mainly focuses on 1) figuring out the multimedia metrics that affect the effectiveness of online advertising; 2) integrating the discovered metrics into the RTB framework. We have presented some preliminary results and discussed future directions.

Abstract:
Exploiting both LTE and Wi-Fi links simultaneously enhances the performance of video streaming services in a smartphone. However, it is challenging to achieve seamless and high quality video while saving battery energy and LTE data usage to prolong the usage time of a smartphone. In this paper, we propose REQUEST, a video chunk request policy for Dynamic Adaptive Streaming over HTTP (DASH) in a smartphone, which can utilize both LTE and Wi-Fi. REQUEST enables seamless DASH video streaming with near optimal video quality under given budgets of battery energy and LTE data usage. Through extensive simulation and measurement in a real environment, we demonstrate that REQUEST significantly outperforms other existing schemes in terms of average video bitrate, rebuffering, and resource waste.

Abstract:
The current body of research on Dynamic Adaptive Streaming over HTTP (DASH) contributes various adaptation algorithms aiming to optimize performance metrics such as the Quality of Experience. Intuitively, the heterogeneity of the streaming environment and the underlying technologies lead many of the developed approaches to possess clear performance affinities denoted here as sweet spots. We observe, however, that systematic comparisons of these algorithms are usually conducted within homogeneous player environments.

Abstract:
Detecting leadership while understanding the underlying behavior is an important research topic, particularly for social and organizational psychology, and has started to attract attention from the social signal processing research community as well. It is known that visual activity is a useful cue for investigating social interactions, even though previously applied nonverbal features based on head/body actions did not perform well enough for the identification of emergent leaders (ELs) in small group meetings. Starting from these premises, in this study, we propose an effective method that uses 2D body pose based nonverbal features to represent the visual activity of a person. Our results suggest that i) overall, the proposed nonverbal features derived from body pose perform better than existing visual activity based features, ii) it is possible to improve classification results by applying unsupervised feature learning as a preprocessing step, and iii) the proposed nonverbal features are able to improve the EL identification performance of other types of nonverbal features when they are used together.

Abstract:
The ego-noise generated by the motors and propellers of a micro aerial vehicle (MAV) masks the environmental sounds and considerably degrades the quality of the on-board sound recording. Sound enhancement approaches generally require knowledge of the direction of arrival of the target sound sources, which are difficult to estimate due to the low signal-to-noise-ratio (SNR) caused by the ego-noise and the interferences between multiple sources. To address this problem, we propose a multi-modal analysis approach that jointly exploits audio and video to enhance the sounds of multiple targets captured from an MAV equipped with a microphone array and a video camera. We first address audio-visual calibration via camera resectioning, audio-visual temporal alignment and geometrical alignment to jointly use the features in the audio and video streams, which are independently generated. The spatial information from the video is used to assist sound enhancement by tracking multiple potential sound sources with a particle filter. Then we infer the directions of arrival of the target sources from the video tracking results and extract the sound from the desired direction with a time-frequency spatial filter, which suppresses the ego-noise by exploiting its time-frequency sparsity. Experimental demonstration results with real outdoor data verify the robustness of the proposed multi-modal approach for multiple speakers in extremely low-SNR scenarios.

Abstract:
In this paper, we address the challenging problem of vehicle license plate image super-resolution. Unlike existing image super-resolution approaches, which rely on a single image, we propose to leverage complementary information from multiple images to recover the license plate numbers. To achieve this goal, we design a principled license plate image super-resolution framework which is composed of two components: progressive vehicle search and Domain Priori GAN (DP-GAN). In particular, we design a null space based progressive vehicle search approach to retrieve the relevant images captured by different cameras given one vehicle with a low-resolution license plate. To handle the extremely varied license plate images caused by different sensors, times, depths, and viewpoints, we also propose a DP-GAN framework to generate multiple spatial correspondences and high-resolution plate images. In the generator network of DP-GAN, a license plate synthesis pipeline is exploited to generate nearly canonical license plates. In the discriminator network, a spatial split layer is designed to simultaneously preserve the global and local manufacture standards of the license plate. Finally, a multiple-image super-resolution GAN is exploited to combine all the synthetic license plates into one high-resolution image. Unlike previous super-resolution criteria, which mainly focus on pixel-level detail recovery, we use the downstream tasks, i.e., license plate recognition and vehicle search, as criteria. The results on a newly collected real-world dataset demonstrate that the proposed method achieves beyond human-level license plate super-resolution performance for automatic license plate recognition and vehicle search.

Abstract:
The sheer amount of human-centric multimedia content has led to increased research on human behavior understanding. Most existing methods model behavioral sequences without considering temporal saliency. This work is motivated by the psychological observation that temporally selective attention enables the human perceptual system to process the most relevant information. In this paper, we introduce a new approach, named Temporally Selective Attention Model (TSAM), designed to selectively attend to salient parts of human-centric video sequences. Our TSAM models learn to recognize affective and social states using a new loss function called speaker-distribution loss. Extensive experiments show that our model achieves state-of-the-art performance on rapport detection and multimodal sentiment analysis. We also show that our speaker-distribution loss function can generalize to other computational models, improving the prediction performance of deep averaging networks and Long Short-Term Memory (LSTM) networks.

Abstract:
The South African research community has strong individual interests in pattern recognition and machine learning, but to date has had limited interactions with the worldwide multimedia research community. In an attempt to redress this, this workshop aims to introduce a selection of South African researchers to the multimedia community, and expose the multimedia community to a range of multimedia-related work, primarily from South Africa.

Abstract:
Multimedia scientists have largely focused their research on the recognition of tangible properties of data, such as objects and scenes. Recently, the field has started evolving towards the modeling of more complex properties. For example, the understanding of social, affective and subjective attributes of data has attracted the attention of many research teams at the crossroads of computer vision, multimedia, and social sciences. These intangible attributes include, for example, visual beauty, video popularity, or user behavior. Multiple, diverse challenges arise when modeling such properties from multimedia data. Issues concern technical aspects such as reliable groundtruth collection, the effective learning of subjective properties, or the impact of context in subjective perception. The first edition of the ACM MM'17 MUSA2 workshop has gathered together high-quality research works focusing on the computational understanding of intangible properties from multimodal data, including visual emotions, user intent, human relationships, and personality.

Abstract:
In this work, we present a new view on automatic speaker diarisation, i.e., assessing "who speaks when", based on the recognition of speaker traits such as age, gender, voice likability, and personality. Traditionally, speaker diarisation is accomplished using low-level audio descriptors (e.g., cepstral or spectral features), neglecting the fact that speakers can be well discriminated by humans according to various perceived characteristics. Thus, we advocate a novel paralinguistic approach that combines speaker diarisation with speaker characterisation by automatically identifying the speakers according to their individual traits. In a three-tier processing flow, speaker segmentation by voice activity detection (VAD) is initially performed to detect speaker turns. Next, speaker attributes are predicted using pre-trained paralinguistic models. To tag the speakers, clustering algorithms are applied to the predicted traits. We evaluate our methods against state-of-the-art open source and commercial systems on a corpus of realistic, spontaneous dyadic conversations recorded in the wild from three different cultures (Chinese, English, German). Our results provide clear evidence that using paralinguistic features for speaker diarisation is a promising avenue of research.

Abstract:
Organic light-emitting diode (OLED) has been widely recognized as the next-generation mobile display technology. Recently, smartphone manufacturers have been pushing up the pixel density of OLED displays. Unfortunately, such an effort does not necessarily improve everyday viewing because of the limits of human visual acuity. Instead, a high pixel density OLED display can drain battery power even more quickly, since the power dissipation of OLED is determined by the number of displayed pixels and their RGB values, or subpixels. This paper presents a new design dimension to remedy this prevailing issue by leveraging the intuition that shutting off redundant subpixels of the display content on OLED can reduce power consumption without impacting viewing perception. We introduce ShutPix, a power-saving display system for OLED smartphones that can optimally shut off the redundant subpixels before the content is displayed. Inspired by the motivational studies, ShutPix is empowered by a suite of designs based on visual acuity, human perception, and content redundancy. Experimental results show that ShutPix can, on average, reduce display power by 21% and system power by 15% without degrading the user viewing experience.

Abstract:
Despite significant research efforts in pedestrian detection over the past decade, there is still a ten-fold performance gap between the state-of-the-art methods and human perception. Deep learning methods can provide good performance but suffer from high computational complexity, which prohibits their deployment on affordable systems with limited computational resources. In this paper, we propose a pedestrian detection framework that provides a major fillip to the robustness and run-time efficiency of the recent top-performing non-deep learning Filtered Channel Feature (FCF) approach. The proposed framework overcomes the computational bottleneck of existing FCF methods by exploiting vector form filters to efficiently extract more discriminative channel features for pedestrian detection. A novel dual-stage group cost-sensitive RealBoost algorithm is used to explore different costs among different types of misclassification in the boosting process in order to improve detection performance. In addition, we propose two strategies, selective classification and selective scale processing, to further accelerate the detection process at the channel feature level and image pyramid level respectively. Experiments on the Caltech and INRIA datasets show that the proposed method achieves the highest detection performance among all the state-of-the-art non-CNN methods and is about 148X faster than the existing best performing FCF method on the Caltech dataset.

Abstract:
Over the last decade, automatic emotion recognition has become well established. The gold standard target is thereby usually calculated based on multiple annotations from different raters. All related efforts assume that the emotional state of a human subject can be identified by a 'hard' category or a unique value. This assumption tries to ease the human observer's subjectivity when observing patterns such as the emotional state of others. However, as the number of annotators cannot be infinite, uncertainty remains in the emotion target even if calculated from several, yet few human annotators. The common procedure to use this same emotion target in the learning process thus inevitably introduces noise in terms of an uncertain learning target. In this light, we propose a 'soft' prediction framework to provide a more human-like and comprehensive prediction of emotion. In our novel framework, we provide an additional target to indicate the uncertainty of human perception based on the inter-rater disagreement level, in contrast to the traditional framework which is merely producing one single prediction (category or value). To exploit the dependency between the emotional state and the newly introduced perception uncertainty, we implement a multi-task learning strategy. To evaluate the feasibility and effectiveness of the proposed soft prediction framework, we perform extensive experiments on a time- and value-continuous spontaneous audiovisual emotion database including late fusion results. We show that the soft prediction framework with multi-task learning of the emotional state and its perception uncertainty significantly outperforms the individual tasks in both the arousal and valence dimensions.
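As a rough illustration of the two learning targets described above, the sketch below derives a gold standard and a perception-uncertainty value for each frame from several raters' annotations. Taking the mean as the gold standard and the standard deviation across raters as the disagreement measure is an assumption made for this example; the abstract does not give the exact formulas, and the function name `soft_targets` and the toy ratings are hypothetical.

```python
from statistics import mean, pstdev


def soft_targets(ratings_per_frame):
    """Return one (gold_standard, uncertainty) pair per frame.

    ratings_per_frame: list of lists, one inner list of rater values per frame.
    Assumption: gold standard = mean rating, uncertainty = population
    standard deviation across raters (a simple inter-rater disagreement proxy).
    """
    targets = []
    for ratings in ratings_per_frame:
        gold = mean(ratings)           # single 'hard' target used traditionally
        uncertainty = pstdev(ratings)  # additional 'soft' target
        targets.append((gold, uncertainty))
    return targets


if __name__ == "__main__":
    # Hypothetical arousal ratings from three raters for two frames.
    frames = [[0.2, 0.2, 0.2], [0.0, 0.5, 1.0]]
    for gold, unc in soft_targets(frames):
        print(f"gold={gold:.3f} uncertainty={unc:.3f}")
```

In a multi-task setup, both values of each pair would serve as regression targets for a shared model, so that frames on which raters disagree are marked as less certain instead of being treated as equally reliable.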

Abstract:
3D volumetric object generation/prediction from a single 2D image is a challenging but meaningful task in 3D visual computing. In this paper, we propose a novel neural network architecture, named "3DensiNet", which uses a density heat-map as an intermediate supervision tool for 2D-to-3D transformation. Specifically, we first present a 2D density heat-map to 3D volumetric object encoding-decoding network, which outperforms the classical 3D autoencoder. Then we show that using a 2D image to predict its density heat-map via a 2D-to-2D encoding-decoding network is feasible. In addition, we leverage an adversarial loss to fine-tune our network, which makes the generated/predicted 3D voxel objects more similar to the ground truth voxel objects. Experimental results on 3D volumetric prediction from 2D images demonstrate the superior performance of 3DensiNet over other state-of-the-art techniques in handling 3D volumetric object generation/prediction from a single 2D image.

Abstract:
A hybrid model for social media popularity prediction is proposed by combining Convolutional Neural Network (CNN) with XGBoost. The CNN model is exploited to learn high-level representations from the social cues of the data. These high-level representations are used in XGBoost to predict the popularity of the social posts. We evaluate our approach on a real-world Social Media Prediction (SMP) dataset, which consists of 432K Flickr images. The experimental results show that the proposed approach is effective, achieving the following performance: Spearman's Rho: 0.7406, MSE: 2.7293, MAE: 1.2475.
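For readers unfamiliar with the three reported metrics, the following is a minimal sketch of how they can be computed from ground-truth and predicted popularity scores. The function names and toy values are illustrative assumptions and are not taken from the paper; the Spearman implementation assumes that tied values receive the average of their ranks.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(y_true, y_pred):
    """Spearman's Rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(y_true), average_ranks(y_pred))


def mse(y_true, y_pred):
    """Mean squared error."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


if __name__ == "__main__":
    # Hypothetical popularity scores for four posts.
    truth = [1.0, 2.0, 3.0, 4.0]
    pred = [1.5, 2.5, 2.0, 5.0]
    print(spearman_rho(truth, pred), mse(truth, pred), mae(truth, pred))
```

Spearman's Rho rewards predictions that rank posts in the right order regardless of scale, while MSE and MAE measure how far the predicted scores are from the true ones, which is why the three are reported together.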

Abstract:
The ubiquity of mobile devices has led to unprecedented growth of personal photo collections on the phone. One significant pain point for today's mobile users is instantly finding the specific photos they want. Existing applications (e.g., Google Photo and OneDrive) have predominantly focused on cloud-based solutions, while leaving the client-side challenges (e.g., query formulation, photo tagging and search, etc.) unsolved. This considerably hinders the user experience on the phone. In this paper, we present an innovative personal photo search system on the phone, which enables instant and accurate photo search by visual query suggestion and joint text-image hashing. Specifically, the system is characterized by several distinctive properties: 1) visual query suggestion (VQS) to facilitate the formulation of queries in a joint text-image form, 2) light-weight convolutional and sequential deep neural networks to extract representations for both photos and queries, and 3) joint text-image hashing (with compact binary codes) to facilitate binary image search and VQS. It is worth noting that all the components run on the phone with client optimization by deep learning techniques. We have collected 270 photo albums taken by 30 mobile users (corresponding to 37,000 personal photos) and conducted a series of field studies. We show that our system outperforms the existing client-based solutions by 10× in terms of search efficiency and achieves 92.3% precision in terms of search accuracy, leading to a remarkably better user experience of photo discovery on the phone.

Abstract:
In this paper, we present DeepCADx, a computer-aided prostate detection and diagnosis (CADx) system powered by novel deep convolutional neural networks (CNNs). Specifically, the developed DeepCADx system processes multi-parametric magnetic resonance imaging (mp-MRI) sequences in three major steps: 1) pre-processing, which registers images from different modalities and detects prostates, 2) multimodal CNNs, which jointly identify images containing prostate cancers (PCa) and generate cancer response maps (CRMs) with each pixel indicating the probability of being cancerous, and 3) post-processing, which localizes lesions in CRMs and assesses the aggressiveness (i.e., Gleason score) of each localized lesion using multimodal CNN features and a 5-class SVM classifier.

Abstract:
Image splicing is a very common image manipulation technique that is sometimes used for malicious purposes. A splicing detection and localization algorithm usually takes an input image and produces a binary decision indicating whether the input image has been manipulated, and also a segmentation mask that corresponds to the spliced region. Most existing splicing detection and localization pipelines suffer from two main shortcomings: 1) they use handcrafted features that are not robust against subsequent processing (e.g., compression), and 2) each stage of the pipeline is usually optimized independently. In this paper we extend the formulation of the underlying splicing problem to consider two input images, a query image and a potential donor image. Here the task is to estimate the probability that the donor image has been used to splice the query image, and obtain the splicing masks for both the query and donor images. We introduce a novel deep convolutional neural network architecture, called Deep Matching and Validation Network (DMVN), which simultaneously localizes and detects image splicing. The proposed approach does not depend on handcrafted features and uses raw input images to create deep learned representations. Furthermore, the DMVN is end-to-end optimized to produce the probability estimates and the segmentation masks. Our extensive experiments demonstrate that this approach outperforms state-of-the-art splicing detection methods by a large margin in terms of both AUC score and speed.

Abstract:
Foveated rendering leverages the human visual system to increase video quality under limited computing resources for Virtual Reality (VR). More specifically, it increases the frame rate and the video quality of the foveal vision by lowering the resolution of the peripheral vision. Optimizing foveated rendering systems is, however, not an easy task, because there are numerous parameters that need to be carefully chosen, such as the number of layers, the eccentricity degrees, and the resolution of the peripheral region. Furthermore, there is no standard and efficient way to evaluate the Quality of Experience (QoE) of foveated rendering systems. In this paper, we propose a framework to compare the performance of different subjective assessment methods on foveated rendering systems. We consider two performance metrics, efficiency and consistency, using the perceptual ratio, which is the probability that the foveated rendering is perceivable by users. A regression model is proposed to model the relationship between the human perceived quality and foveated rendering parameters. Our comprehensive study and analysis reveal several insights: 1) there is no absolutely superior subjective assessment method, 2) subjects need to make more observations to confirm that the foveated rendering is imperceptible than to confirm that it is perceptible, 3) subjects barely notice the foveated rendering with an eccentricity degree of 7.5 degrees+ and a peripheral region with a resolution of 540p+, and 4) QoE levels are highly dependent on the individuals and scenes. Our findings are crucial for optimizing foveated rendering systems for future VR applications.

Abstract:
Spatial Magnetic Field Visualization is an interactive kinetic art installation driven by magnetic field data from Nature. It is a physical space that emulates the electromagnetic connection between the Sun and Earth, the invisible yet ubiquitous forces in nature, which have a profound effect on those of us residing on Earth. One purpose behind this project is to create conversations about this scientific topic in the realm of media art through a three-dimensional physical visualization. The Earth's magnetic field is like a living organism that goes through ups and downs. The geomagnetic field is ever-changing and requires constant observation. The impact of what is now called "space weather" on human life and technology (e.g., GPS, radio communication, power transmission, etc.) is substantial, significant enough for former president Obama to call for an executive order in preparation for space weather-related disasters. This project uses painted magnetic balls as pixels in the spatial dimension and attempts to visualize the effect of this scientific phenomenon in three-dimensional space.

Abstract:
Photographs are one of the most fundamental ways for human beings to capture their social experiences, and smiling is one of the most common actions associated with photo-taking. Photos thus provide a unique opportunity to study the phenomenon of different people mixing and also the smiles expressed by individuals in these social settings. In this work, we study whether a social media-based computational framework can be employed to obtain smile and diversity scores at very fine, individual relationship resolution, and study their associations. We analyze two data sets from different social networks, Twitter and Instagram, over different time periods. Primarily looking at photographs, using computer vision APIs, we capture the diversity of social interactions in terms of age, gender, and race of those present, and smile levels. Analysis of both data sets suggests similar and significant findings: (a) people, in general, tend to smile more in the presence of others; and (b) people tend to smile more in more diverse company. The results can help scale, test, and validate multiple theories related to affect and diversity in sociology, psychology, biology, and urban planning, and inform future mechanisms for encouraging people to smile more often in everyday settings.

Abstract:
Educational and Knowledge Technologies (EdTech), especially in connection with multimedia content and the vision of mobile and personalized learning, is a hot topic in both academia and the business start-up ecosystem. This is driven and enabled, on the one hand, by the development and widespread availability of multimedia materials and MOOCs, which represent multimedia content produced specifically for supporting e-learning; and, on the other hand, by the ever-increasing availability of all sorts of information on the Internet and in social media channels (e.g., lectures, research papers, user-generated videos, news items), which, despite not directly targeting e-learning, can prove to be valuable complements to the more targeted learning materials. Although the availability of such content is not a problem these days, finding the right content and associating different relevant pieces of multimedia so as to enable a comprehensive learning experience on a chosen subject is by no means a trivial task. This workshop presents research in areas related to multimedia-based educational and knowledge technologies, particularly on the use of multimedia search and retrieval, analysis and understanding, browsing, summarization, recommendation, and visualization technologies on multimedia content available in specialized learning platforms, the Web, mobile devices and/or social networks for supporting personalized and adaptive e-learning and training.