Deep neural networks (DNNs) have shown unprecedented success in various computer vision applications such as image classification and object detection. However, it is still a common (yet inconvenient) practice to prepare at least tens of thousands of labeled images to fine-tune a network on every task before the model is ready to use. Recent studies show that a DNN depends strongly on its training dataset, and that the learned features cannot be easily transferred to a different but relevant task without fine-tuning.
In this paper, we propose a simple yet powerful remedy, called Adaptive Batch Normalization (AdaBN), to increase the generalization ability of a DNN. Our approach is based on the well-known Batch Normalization technique, which has become a standard component in modern deep learning. In contrast to other deep learning domain adaptation methods, our method does not require additional components and is parameter-free. It achieves state-of-the-art performance despite its surprising simplicity. Furthermore, we demonstrate that our method is complementary to other existing methods: combining AdaBN with existing domain adaptation treatments may further improve model performance.
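The abstract does not spell out the adaptation step, so the following is only a minimal sketch under one stated assumption: that adapting batch normalization means re-estimating the per-feature mean and variance on target-domain data while leaving the learned scale and shift untouched. The function names, the synthetic activations, and the feature count are hypothetical and chosen purely for illustration.

```python
import numpy as np

def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    """Standard batch-normalization transform for activations of shape (N, C)."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def target_statistics(target_activations):
    """Assumed adaptation step: per-feature mean and variance of target-domain activations."""
    return target_activations.mean(axis=0), target_activations.var(axis=0)

rng = np.random.default_rng(0)
# Synthetic activations standing in for one layer's inputs in each domain.
source = rng.normal(loc=0.0, scale=1.0, size=(1000, 4))
target = rng.normal(loc=3.0, scale=2.0, size=(1000, 4))

# Scale and shift are kept fixed; no new parameters are learned.
gamma, beta = np.ones(4), np.zeros(4)

src_mean, src_var = source.mean(axis=0), source.var(axis=0)
tgt_mean, tgt_var = target_statistics(target)

# Normalizing target data with source statistics leaves a domain shift in place.
mismatched = batch_norm(target, src_mean, src_var, gamma, beta)
# Normalizing with target statistics restores zero mean and unit variance.
adapted = batch_norm(target, tgt_mean, tgt_var, gamma, beta)
```

In this toy setting, `mismatched` keeps a large per-feature offset while `adapted` is standardized, which is the intuition behind swapping statistics rather than adding trainable components.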
In this paper we provide an extensive evaluation of fixation prediction and salient object segmentation algorithms, as well as statistics of major datasets. Our analysis identifies a serious design flaw of existing salient object benchmarks, which we call the dataset design bias: they overemphasize stereotypical concepts of saliency. The dataset design bias not only creates a discomforting disconnection between fixations and salient object segmentation, but also misleads algorithm design.
Based on our analysis, we propose a new high-quality dataset that offers both fixation and salient object segmentation ground truth. With fixations and salient objects presented simultaneously, we are able to bridge the gap between fixations and salient objects, and propose a novel method for salient object segmentation. Finally, we report significant benchmark progress on 3 existing salient object segmentation datasets.
For an ill-posed problem like boundary detection, human-labeled datasets play a critical role. Compared with the active research on finding a better boundary detector to refresh the performance record, there is surprisingly little discussion of the boundary detection benchmark itself. The goal of this paper is to identify the potential pitfalls of today's most popular boundary benchmark, BSDS 300. We first introduce a psychophysical experiment to show that many of the "weak" boundary labels are unreliable and may contaminate the benchmark. Then we analyze the computation of the F-measure and point out that the current benchmarking protocol encourages an algorithm to bias towards those problematic "weak" boundary labels. With this evidence, we focus on a new problem of detecting strong boundaries as one alternative. Finally, we assess the performance of 9 major algorithms under different ways of utilizing the dataset, suggesting new directions for improvement.
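To make the F-measure argument concrete, here is a small sketch of how the score responds when "weak" labels are counted as ground truth. The F-measure is taken as the usual weighted harmonic mean of precision and recall; the pixel counts are invented for illustration and are not results from the paper, and the pixel-matching and thresholding steps of the actual benchmark are omitted.

```python
def f_measure(true_pos, false_pos, false_neg, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    detected = true_pos + false_pos
    labeled = true_pos + false_neg
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / labeled if labeled else 0.0
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical detector: finds 80 of 100 strong boundary pixels, with 10 false alarms.
strong_hits, false_alarms, strong_misses = 80, 10, 20
# Hypothetical 100 additional weak labels, none of which the detector fires on.
weak_misses = 100

# Scored against strong labels only.
f_strong_only = f_measure(strong_hits, false_alarms, strong_misses)
# Scored against strong and weak labels together: every ignored weak label lowers recall.
f_all_labels = f_measure(strong_hits, false_alarms, strong_misses + weak_misses)
```

Under these made-up counts the same detector scores noticeably lower once weak labels enter the ground truth, so a protocol that includes them rewards detectors that also respond to weak boundaries.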
Human-labeled datasets, along with their corresponding evaluation algorithms, play an important role in boundary detection. Here we present a psychophysical experiment that addresses the reliability of such benchmarks. To evaluate boundary detection algorithms more reliably, we propose a computational framework that removes inappropriate human labels and estimates the intrinsic properties of boundaries.
We introduce a simple image descriptor referred to as the image signature. We show, within the theoretical framework of sparse signal mixing, that this quantity spatially approximates the foreground of an image. We experimentally investigate whether this approximate foreground overlaps with visually conspicuous image locations by developing a saliency algorithm based on the image signature. This saliency algorithm predicts human fixation points best among competitors on the Bruce and Tsotsos benchmark data set and does so in much shorter running time. In a related experiment, we demonstrate with a change blindness data set that the distance between images induced by the image signature is closer to human perceptual distance than can be achieved using other saliency algorithms, pixel-wise, or GIST descriptor methods.
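The abstract does not define the descriptor, so the sketch below rests on an assumption: that the image signature is the element-wise sign of the image's 2-D discrete cosine transform, and that a saliency map is obtained by inverting the signature and squaring the result. The helper names are hypothetical, the DCT is built from an explicit orthonormal matrix to keep the example NumPy-only, and any final smoothing of the map is left out.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)     # rescale the DC row so that c @ c.T is the identity
    return c

def image_signature(img):
    """Assumed definition: sign of the 2-D DCT coefficients of a grayscale image."""
    h, w = img.shape
    return np.sign(dct_matrix(h) @ img @ dct_matrix(w).T)

def signature_saliency(img):
    """Invert the signature and square it to get an unsmoothed saliency map."""
    h, w = img.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    signature = np.sign(ch @ img @ cw.T)
    reconstruction = ch.T @ signature @ cw   # inverse 2-D DCT of the signature
    return reconstruction * reconstruction

# Synthetic example: a small bright patch on a smooth cosine background.
rows = np.arange(32)[:, None]
cols = np.arange(32)[None, :]
demo_image = np.cos(np.pi * (2 * cols + 1) * 3 / 64.0) + 0.0 * rows
demo_image[12:16, 18:22] += 2.0

demo_signature = image_signature(demo_image)
demo_saliency = signature_saliency(demo_image)
```

Only the signs of the coefficients are kept, so the descriptor costs one bit per coefficient, which is consistent with the short running time the abstract reports.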
Detecting moving objects against dynamic backgrounds remains a challenge in computer vision and robotics. This paper presents a surprisingly simple algorithm to detect objects in such conditions. Based on theoretical analysis, we show that 1) the displacement of the foreground and the background can be represented by the phase change of Fourier spectra, and 2) the motion of background objects can be extracted by Phase Discrepancy in an efficient and robust way. The algorithm does not rely on prior training on particular features or categories of an image and can be implemented in 9 lines of MATLAB code. In addition to the algorithm, we provide a new database for moving object detection with 20 video clips, 11 subjects, and 4785 bounding boxes to be used as a public benchmark for algorithm evaluation.
A visual attention system should respond placidly when common stimuli are presented, while at the same time staying alert to anomalous visual inputs. In this paper, a dynamic visual attention model based on the rarity of features is proposed. We introduce the Incremental Coding Length (ICL) to measure the perspective entropy gain of each feature. The objective of our model is to maximize the entropy of the sampled visual features. In order to optimize energy consumption, the limited amount of energy of the system is redistributed amongst features according to their Incremental Coding Length. By selecting features with large coding length increments, the computational system can achieve attention selectivity in both static and dynamic scenes. We demonstrate that the proposed model achieves superior accuracy in comparison to mainstream approaches in static saliency map generation. Moreover, we show that our model captures several less-reported dynamic visual search behaviors, such as attentional swing and inhibition of return.
In this paper, we present a novel approach to generate thumbnail images. Our method crops an image into a smaller but more informative region in the thumbnail representation. From the perspective of information theory, we propose a novel approach to generate bottom-up saliency in a global manner. In our method, we evaluate the statistical distribution of feature maps, and use its coding length as a measurement for image cropping. The experimental results offer viewers a more effective representation of images.
In this paper, we propose a method to manipulate the colors of an image. Based on a library of natural color images, our system evolves several prototypes of the library's color distribution, which we call "color concepts". By applying these color concepts to an input image, a user can easily change the mood of image colors in a global manner. Our results on photographs and paintings indicate that this method is capable of high-quality color manipulations.
The ability of the human visual system to detect visual saliency is extraordinarily fast and reliable. However, computational modeling of this basic intelligent behavior still remains a challenge. This paper presents a simple method for visual saliency detection. Our model is independent of features, categories, or other forms of prior knowledge of the objects. By analyzing the log-spectrum of an input image, we extract the spectral residual of the image in the spectral domain, and propose a fast method to construct the corresponding saliency map in the spatial domain. We test this model on both natural pictures and artificial images such as psychological patterns. The results indicate that our method detects saliency quickly and robustly.
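The pipeline described above (log-spectrum, spectral residual, saliency map in the spatial domain) can be sketched in a few lines of NumPy. This is an illustrative reading rather than the authors' reference code: the residual is assumed to be the log-amplitude spectrum minus its local average, the original phase is assumed to be kept for the inverse transform, and the 3x3 averaging window, the Gaussian width, and the helper names are assumptions made for this example.

```python
import numpy as np

def box_filter3(a):
    """3x3 mean filter with edge replication."""
    padded = np.pad(a, 1, mode="edge")
    h, w = a.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def gaussian_blur(a, sigma):
    """Separable Gaussian blur; the kernel is clamped to fit inside the image."""
    radius = min(max(1, int(3 * sigma)), (min(a.shape) - 1) // 2)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, a)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, rows)

def spectral_residual_saliency(img, sigma=2.5):
    """Saliency map from the spectral residual of a grayscale image."""
    spectrum = np.fft.fft2(img.astype(float))
    log_amplitude = np.log(np.abs(spectrum) + 1e-8)   # small offset avoids log(0)
    phase = np.angle(spectrum)
    # Residual: what remains after removing the locally averaged log-spectrum.
    residual = log_amplitude - box_filter3(log_amplitude)
    # Back to the spatial domain using the residual amplitude and the original phase.
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    saliency = gaussian_blur(saliency, sigma)
    return saliency / saliency.max()

# Synthetic test image: faint noise with one bright square.
rng = np.random.default_rng(0)
demo_image = 0.1 * rng.random((64, 64))
demo_image[24:40, 24:40] += 1.0
demo_saliency = spectral_residual_saliency(demo_image)
```

The sketch uses no learned features or object categories: two FFTs and two small filters are the whole computation, which matches the speed claim in the abstract.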
What does a human's eye tell a human's brain? In this paper, we analyze the information capacity of visual attention. Our hypothesis is that the limit of perceptible spatial frequency is related to observing time. Given more time, one can obtain higher resolution, that is, higher spatial frequency information, of the presented visual stimuli. We designed an experiment to simulate natural viewing conditions, in which time-dependent characteristics of attention can be evoked, and we recorded the temporal responses of 6 subjects. Based on the experimental results, we propose a person-independent model that characterizes the behavior of the eyes, relating visual spatial resolution to the duration of attentional concentration. This model suggests that the information capacity of visual attention is time-dependent.