### Transcription

Journal of Vision (2008) 8(7):32, 1–20
http://journalofvision.org/8/7/32/

SUN: A Bayesian framework for saliency using natural statistics

Lingyun Zhang, Department of Computer Science and Engineering, UCSD, La Jolla, CA, USA

Matthew H. Tong, Department of Computer Science and Engineering, UCSD, La Jolla, CA, USA

Tim K. Marks, Department of Computer Science and Engineering, UCSD, La Jolla, CA, USA

Honghao Shan, Department of Computer Science and Engineering, UCSD, La Jolla, CA, USA

Garrison W. Cottrell, Department of Computer Science and Engineering, UCSD, La Jolla, CA, USA

We propose a definition of saliency by considering what the visual system is trying to optimize when directing attention. The resulting model is a Bayesian framework from which bottom-up saliency emerges naturally as the self-information of visual features, and overall saliency (incorporating top-down information with bottom-up saliency) emerges as the pointwise mutual information between the features and the target when searching for a target. An implementation of our framework demonstrates that our model's bottom-up saliency maps perform as well as or better than existing algorithms in predicting people's fixations in free viewing. Unlike existing saliency measures, which depend on the statistics of the particular image being viewed, our measure of saliency is derived from natural image statistics, obtained in advance from a collection of natural images. For this reason, we call our model SUN (Saliency Using Natural statistics). A measure of saliency based on natural image statistics, rather than based on a single test image, provides a straightforward explanation for many search asymmetries observed in humans; the statistics of a single test image lead to predictions that are not consistent with these asymmetries.
In our model, saliency is computed locally, which is consistent with the neuroanatomy of the early visual system and results in an efficient algorithm with few free parameters.

Keywords: saliency, attention, eye movements, computational modeling

Citation: Zhang, L., Tong, M. H., Marks, T. K., Shan, H., & Cottrell, G. W. (2008). SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision, 8(7):32, 1–20, http://journalofvision.org/8/7/32/, doi:10.1167/8.7.32.

Received December 4, 2007; published December 16, 2008. ISSN 1534-7362 © ARVO

#### Introduction

The surrounding world contains a tremendous amount of visual information, which the visual system cannot fully process (Tsotsos, 1990). The visual system thus faces the problem of how to allocate its processing resources to focus on important aspects of a scene. Despite the limited amount of visual information the system can handle, sampled by discontinuous fixations and covert shifts of attention, we experience a seamless, continuous world. Humans and many other animals thrive using this heavily downsampled visual information. Visual attention as overtly reflected in eye movements partially reveals the sampling strategy of the visual system and is of great research interest as an essential component of visual cognition. Psychologists have investigated visual attention for many decades using psychophysical experiments, such as visual search tasks, with carefully controlled stimuli. Sophisticated mathematical models have been built to account for the wide variety of human performance data (e.g., Bundesen, 1990; Treisman & Gelade, 1980; Wolfe, Cave, & Franzel, 1989). With the development of affordable and easy-to-use modern eye-tracking systems, the locations that people fixate when they perform certain tasks can be explicitly recorded and can provide insight into how people allocate their attention when viewing complex natural scenes.
The proliferation of eye-tracking data over the last two decades has led to a number of computational models attempting to account for the data and addressing the question of what attracts attention. Most models have focused on bottom-up attention, where the subjects are free-viewing a scene and salient objects attract attention. Many of these saliency models use
findings from psychology and neurobiology to construct plausible mechanisms for guiding attention allocation (Itti, Koch, & Niebur, 1998; Koch & Ullman, 1985; Wolfe et al., 1989). More recently, a number of models attempt to explain attention based on more mathematically motivated principles that address the goal of the computation (Bruce & Tsotsos, 2006; Chauvin, Herault, Marendaz, & Peyrin, 2002; Gao & Vasconcelos, 2004, 2007; Harel, Koch, & Perona, 2007; Kadir & Brady, 2001; Oliva, Torralba, Castelhano, & Henderson, 2003; Renninger, Coughlan, Verghese, & Malik, 2004; Torralba, Oliva, Castelhano, & Henderson, 2006; Zhang, Tong, & Cottrell, 2007). Both types of models tend to rely solely on the statistics of the current test image for computing the saliency of a point in the image. We argue here that natural statistics (the statistics of visual features in natural scenes, which an organism would learn through experience) must also play an important role in this process.

In this paper, we make an effort to address the underlying question: What is the goal of the computation performed by the attentional system? Our model starts from the simple assumption that an important goal of the visual system is to find potential targets and builds up a Bayesian probabilistic framework of what the visual system should calculate to optimally achieve this goal. In this framework, bottom-up saliency emerges naturally as self-information. When searching for a particular target, top-down effects from a known target emerge in our model as a log-likelihood term in the Bayesian formulation. The model also dictates how to combine bottom-up and top-down information, leading to pointwise mutual information as a measure of overall saliency.
We develop a bottom-up saliency algorithm that performs as well as or better than state-of-the-art saliency algorithms at predicting human fixations when free-viewing images.

Whereas existing bottom-up saliency measures are defined solely in terms of the image currently being viewed, ours is instead defined based on natural statistics (collected from a set of images of natural scenes), to represent the visual experience an organism would acquire during development. This difference is most notable when comparing with models that also use a Bayesian formulation (e.g., Torralba et al., 2006) or self-information (e.g., Bruce & Tsotsos, 2006). As a result of using natural statistics, our model provides a straightforward account of many human search asymmetries that cannot be explained based on the statistics of the test image alone. Unlike many models, our measure of saliency only involves local computation on images, with no calculation of global image statistics, saliency normalization, or winner-take-all competition. This makes our algorithm not only more efficient, but also more biologically plausible, as long-range connections are scarce in the lower levels of the visual system. Because of the focus on learned statistics from natural scenes, we call our saliency model SUN (Saliency Using Natural statistics).

#### Previous work

In this section we discuss previous saliency models that have achieved good performance in predicting human fixations in viewing images. The motivation for these models has come from psychophysics and neuroscience (Itti & Koch, 2001; Itti et al., 1998), classification optimality (Gao & Vasconcelos, 2004, 2007), the task of looking for a target (Oliva et al., 2003; Torralba et al., 2006), or information maximization (Bruce & Tsotsos, 2006).
Many models of saliency implicitly assume that covert attention functions much like a spotlight (Posner, Rafal, Choate, & Vaughan, 1985) or a zoom lens (Eriksen & St James, 1986) that focuses on salient points of interest in much the same way that eye fixations do. A model of saliency, therefore, can function as a model of both overt and covert attention. For example, although originally intended primarily as a model for covert attention, the model of Koch and Ullman (1985) has since been frequently applied as a model of eye movements. Eye fixations, in contrast to covert shifts of attention, are easily measured and allow a model's predictions to be directly verified. Saliency models are also compared against human studies where either a mix of overt and covert attention is allowed or covert attention functions on its own. The similarity between covert and overt attention is still debated, but there is compelling evidence that the similarities commonly assumed may be valid (for a review, see Findlay & Gilchrist, 2003).

Itti and Koch's saliency model (Itti & Koch, 2000, 2001; Itti et al., 1998) is one of the earliest and the most used for comparison in later work. The model is an implementation of and expansion on the basic ideas first proposed by Koch and Ullman (1985). The model is inspired by the visual attention literature, such as feature integration theory (Treisman & Gelade, 1980), and care is taken in the model's construction to ensure that it is neurobiologically plausible. The model takes an image as input, which is then decomposed into three channels: intensity, color, and orientation. A center-surround operation, implemented by taking the difference of the filter responses from two scales, yields a set of feature maps. The feature maps for each channel are then normalized and combined across scales and orientations, creating conspicuity maps for each channel.
The conspicuous regions of these maps are further enhanced by normalization, and the channels are linearly combined to form the overall saliency map. This process allows locations to vie for conspicuity within each feature dimension but has separate feature channels contribute to saliency independently; this is consistent with the feature integration theory. This model has been shown to be successful in predicting human fixations and to be useful in object detection (Itti & Koch, 2001; Itti et al., 1998; Parkhurst, Law, & Niebur, 2002). However, it can be criticized as being ad hoc, partly because the overarching goal of the
system (i.e., what it is designed to optimize) is not specified, and it has many parameters that need to be hand-selected.

Several saliency algorithms are based on measuring the complexity of a local region (Chauvin et al., 2002; Kadir & Brady, 2001; Renninger et al., 2004; Yamada & Cottrell, 1995). Yamada and Cottrell (1995) measure the variance of 2D Gabor filter responses across different orientations. Kadir and Brady (2001) measure the entropy of the local distribution of image intensity. Renninger and colleagues measure the entropy of local features as a measure of uncertainty, and the most salient point at any given time during their shape learning and matching task is the one that provides the greatest information gain conditioned on the knowledge obtained during previous fixations (Renninger et al., 2004; Renninger, Verghese, & Coughlan, 2007). All of these saliency-as-variance/entropy models are based on the idea that the entropy of a feature distribution over a local region measures the richness and diversity of that region (Chauvin et al., 2002), and intuitively a region should be salient if it contains features with many different orientations and intensities. A common critique of these models is that highly textured regions are always salient regardless of their context. For example, human observers find an egg in a nest highly salient, but local-entropy-based algorithms find the nest to be much more salient than the egg (Bruce & Tsotsos, 2006; Gao & Vasconcelos, 2004).

Gao and Vasconcelos (2004, 2007) proposed a specific goal for saliency: classification. That is, a goal of the visual system is to classify each stimulus as belonging to a class of interest (or not), and saliency should be assigned to locations that are useful for that task.
This was first used for object detection (Gao & Vasconcelos, 2004), where a set of features are selected to best discriminate the class of interest (e.g., faces or cars) from all other stimuli, and saliency is defined as the weighted sum of feature responses for the set of features that are salient for that class. This forms a definition that is inherently top-down and goal directed, as saliency is defined for a particular class. Gao and Vasconcelos (2007) define bottom-up saliency using the idea that locations are salient if they differ greatly from their surroundings. They use difference of Gaussians (DoG) filters and Gabor filters, measuring the saliency of a point as the Kullback–Leibler (KL) divergence between the histogram of filter responses at the point and the histogram of filter responses in the surrounding region. This addresses a previously mentioned problem commonly faced by complexity-based models (as well as some other saliency models that use linear filter responses as features): these models always assign high saliency scores to highly textured areas. In the Discussion section, we will discuss a way that the SUN model could address this problem by using non-linear features that model complex cells or neurons in higher levels of the visual system.

Oliva and colleagues proposed a probabilistic model for visual search tasks (Oliva et al., 2003; Torralba et al., 2006). When searching for a target in an image, the probability of interest is the joint probability that the target is present in the current image, together with the target's location (if the target is present), given the observed features.
This can be calculated using Bayes' rule:

$$
p(O = 1, L \mid F, G) = \underbrace{\frac{1}{p(F \mid G)}}_{\text{bottom-up saliency as defined by Torralba et al.}} \, p(F \mid O = 1, L, G)\, p(L \mid O = 1, G)\, p(O = 1 \mid G), \tag{1}
$$

where $O = 1$ denotes the event that the target is present in the image, $L$ denotes the location of the target when $O = 1$, $F$ denotes the local features at location $L$, and $G$ denotes the global features of the image. The global features $G$ represent the scene gist. Their experiments show that the gist of a scene can be quickly determined, and the focus of their work largely concerns how this gist affects eye movements. The first term on the right side of Equation 1 is independent of the target and is defined as bottom-up saliency; Oliva and colleagues approximate this conditional probability distribution using the current image's statistics. The remaining terms on the right side of Equation 1 respectively address the distribution of features for the target, the likely locations for the target, and the probability of the target's presence, all conditioned on the scene gist. As we will see in the Bayesian framework for saliency section, our use of Bayes' rule to derive saliency is reminiscent of this approach. However, the probability of interest in the work of Oliva and colleagues is whether or not a target is present anywhere in the test image, whereas the probability we are concerned with is the probability that a target is present at each point in the visual field. In addition, Oliva and colleagues condition all of their probabilities on the values of global features. Conditioning on global features/gist affects the meaning of all terms in Equation 1, and justifies their use of current image statistics for bottom-up saliency. In contrast, SUN focuses on the effects of an organism's prior visual experience.

Bruce and Tsotsos (2006) define bottom-up saliency based on maximum information sampling.
Information, in this model, is computed as Shannon's self-information, $-\log p(F)$, where $F$ is a vector of the visual features observed at a point in the image. The distribution of the features is estimated from a neighborhood of the point, which can be as large as the entire image. When the neighborhood of each point is indeed defined as the entire image of interest, as implemented in Bruce and Tsotsos (2006), the definition of saliency becomes identical to the bottom-up saliency term in Equation 1 from the work of
Oliva and colleagues (Oliva et al., 2003; Torralba et al., 2006). It is worth noting, however, that the feature spaces used in the two models are different. Oliva and colleagues use biologically inspired linear filters of different orientations and scales. These filter responses are known to correlate with each other; for example, a vertical bar in the image will activate a filter tuned to vertical bars but will also activate (to a lesser degree) a filter tuned to 45-degree-tilted bars. The joint probability of the entire feature vector is estimated using multivariate Gaussian distributions (Oliva et al., 2003) and later multivariate generalized Gaussian distributions (Torralba et al., 2006). Bruce and Tsotsos (2006), on the other hand, employ features that were learned from natural images using independent component analysis (ICA). These have been shown to resemble the receptive fields of neurons in primary visual cortex (V1), and their responses have the desirable property of sparsity. Furthermore, the features learned are approximately independent, so the joint probability of the features is just the product of each feature's marginal probability, simplifying the probability estimation without making unreasonable independence assumptions.

The Bayesian Surprise theory of Itti and Baldi (2005, 2006) applies a similar notion of saliency to video. Under this theory, organisms form models of their environment and assign probability distributions over the possible models. Upon the arrival of new data, the distribution over possible models is updated with Bayes' rule, and the KL divergence between the prior distribution and posterior distribution is measured. The more the new data forces the distribution to change, the larger the divergence. These KL scores of different distributions over models combine to produce a saliency score.
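
The update-then-compare idea behind Bayesian surprise can be illustrated with a small numerical sketch. This is not Itti and Baldi's implementation: the three-model hypothesis space, the likelihood values, the function names, and the choice to measure the divergence of the posterior relative to the prior are all assumptions made here for illustration.

```python
import numpy as np

def bayesian_update(prior, likelihood):
    """Posterior over models via Bayes' rule: p(M | D) is proportional to p(D | M) p(M)."""
    unnormalized = prior * likelihood
    return unnormalized / unnormalized.sum()

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions; terms with p == 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical prior belief over three candidate models of the environment.
prior = np.array([0.5, 0.3, 0.2])

# Data that every model explains equally well leaves the beliefs unchanged.
uninformative_likelihood = np.array([0.5, 0.5, 0.5])
# Data that only the third model explains well forces a large belief change.
informative_likelihood = np.array([0.01, 0.1, 0.9])

post_uninformative = bayesian_update(prior, uninformative_likelihood)
post_informative = bayesian_update(prior, informative_likelihood)

# Assumed direction: divergence of the posterior from the prior.
surprise_uninformative = kl_divergence(post_uninformative, prior)
surprise_informative = kl_divergence(post_informative, prior)
```

In this toy setting the uninformative observation yields essentially zero surprise, while the observation that shifts belief toward the third model yields a clearly positive value.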
Itti and Baldi's implementation of this theory leads to an algorithm that, like the others described here, defines saliency as a kind of deviation from the features present in the immediate neighborhood, but extending the notion of neighborhood to the spatiotemporal realm.

#### Bayesian framework for saliency

We propose that one goal of the visual system is to find potential targets that are important for survival, such as food and predators. To achieve this, the visual system must actively estimate the probability of a target at every location given the visual features observed. We propose that this probability, or a monotonic transformation of it, is visual saliency.

To formalize this, let $z$ denote a point in the visual field. A point here is loosely defined; in the implementation described in the Implementation section, a point corresponds to a single image pixel. (In other contexts, a point could refer to other things, such as an object; Zhang et al., 2007.) We let the binary random variable $C$ denote whether or not a point belongs to a target class, let the random variable $L$ denote the location (i.e., the pixel coordinates) of a point, and let the random variable $F$ denote the visual features of a point. Saliency of a point $z$ is then defined as $p(C = 1 \mid F = f_z, L = l_z)$, where $f_z$ represents the feature values observed at $z$, and $l_z$ represents the location (pixel coordinates) of $z$. This probability can be calculated using Bayes' rule:

$$
s_z = p(C = 1 \mid F = f_z, L = l_z) = \frac{p(F = f_z, L = l_z \mid C = 1)\, p(C = 1)}{p(F = f_z, L = l_z)}. \tag{2}
$$

We assume¹ for simplicity that features and location are independent and conditionally independent given $C = 1$:

$$
p(F = f_z, L = l_z) = p(F = f_z)\, p(L = l_z), \tag{3}
$$

$$
p(F = f_z, L = l_z \mid C = 1) = p(F = f_z \mid C = 1)\, p(L = l_z \mid C = 1). \tag{4}
$$

This entails the assumption that the distribution of a feature does not change with location.
For example, Equation 3 implies that a point in the left visual field is just as likely to be green as a point in the right visual field. Furthermore, Equation 4 implies (for instance) that a point on a target in the left visual field is just as likely to be green as a point on a target in the right visual field. With these independence assumptions, Equation 2 can be rewritten as:

$$
\begin{aligned}
s_z &= \frac{p(F = f_z \mid C = 1)\, p(L = l_z \mid C = 1)\, p(C = 1)}{p(F = f_z)\, p(L = l_z)} && (5)\\
&= \frac{p(F = f_z \mid C = 1)}{p(F = f_z)} \cdot \frac{p(L = l_z \mid C = 1)\, p(C = 1)}{p(L = l_z)} && (6)\\
&= \underbrace{\frac{1}{p(F = f_z)}}_{\substack{\text{Independent of target}\\ \text{(bottom-up saliency)}}} \; \underbrace{\underbrace{p(F = f_z \mid C = 1)}_{\text{Likelihood}} \; \underbrace{p(C = 1 \mid L = l_z)}_{\text{Location prior}}}_{\substack{\text{Dependent on target}\\ \text{(top-down knowledge)}}}. && (7)
\end{aligned}
$$

To compare this probability across locations in an image, it suffices to estimate the log probability (since the logarithm is a monotonically increasing function). For this
reason, we take the liberty of using the term saliency to refer both to $s_z$ and to $\log s_z$, which is given by:

$$
\log s_z = \underbrace{-\log p(F = f_z)}_{\text{Self-information}} + \underbrace{\log p(F = f_z \mid C = 1)}_{\text{Log likelihood}} + \underbrace{\log p(C = 1 \mid L = l_z)}_{\text{Location prior}}. \tag{8}
$$

The first term on the right side of this equation, $-\log p(F = f_z)$, depends only on the visual features observed at the point and is independent of any knowledge we have about the target class. In information theory, $-\log p(F = f_z)$ is known as the self-information of the random variable $F$ when it takes the value $f_z$. Self-information increases when the probability of a feature decreases; in other words, rarer features are more informative. We have already discussed self-information in the context of previous work, but as we will see later, SUN's use of self-information differs from that of previous approaches.

The second term on the right side of Equation 8, $\log p(F = f_z \mid C = 1)$, is a log-likelihood term that favors feature values that are consistent with our knowledge of the target. For example, if we know that the target is green, then the log-likelihood term will be much larger for a green point than for a blue point. This corresponds to the top-down effect when searching for a known target, consistent with the finding that human eye movement patterns during iconic visual search can be accounted for by a maximum likelihood procedure for computing the most likely location of a target (Rao, Zelinsky, Hayhoe, & Ballard, 2002).

The third term in Equation 8, $\log p(C = 1 \mid L = l_z)$, is independent of visual features and reflects any prior knowledge of where the target is likely to appear. It has been shown that if the observer is given a cue of where the target is likely to appear, the observer attends to that location (Posner & Cohen, 1984). For simplicity and fairness of comparison with Bruce and Tsotsos (2006), Gao and Vasconcelos (2007), and Itti et al. (1998), we assume location invariance (no prior information about the locations of potential targets) and omit the location prior; in the Results section, we will further discuss the effects of the location prior.

After omitting the location prior from Equation 8, the equation for saliency has just two terms, the self-information and the log-likelihood, which can be combined:

$$
\begin{aligned}
\log s_z &= \underbrace{-\log p(F = f_z)}_{\substack{\text{Self-information}\\ \text{(bottom-up saliency)}}} + \underbrace{\log p(F = f_z \mid C = 1)}_{\substack{\text{Log likelihood}\\ \text{(top-down knowledge)}}} && (9)\\
&= \log \frac{p(F = f_z \mid C = 1)}{p(F = f_z)} && (10)\\
&= \underbrace{\log \frac{p(F = f_z, C = 1)}{p(F = f_z)\, p(C = 1)}}_{\substack{\text{Pointwise mutual information}\\ \text{(overall saliency)}}}. && (11)
\end{aligned}
$$

The resulting expression, which is called the pointwise mutual information between the visual feature and the presence of a target, is a single term that expresses overall saliency. Intuitively, it favors feature values that are more likely in the presence of a target than in a target's absence.

When the organism is not actively searching for a particular target (the free-viewing condition), the organism's attention should be directed to any potential targets in the visual field, despite the fact that the features associated with the target class are unknown. In this case, the log-likelihood term in Equation 8 is unknown, so we omit this term from the calculation of saliency (this can also be thought of as assuming that for an unspecified target, the likelihood distribution is uniform over feature values). In this case, the overall saliency reduces to just the self-information term: $\log s_z = -\log p(F = f_z)$. We take this to be our definition of bottom-up saliency. It implies that the rarer a feature is, the more it will attract our attention.

This use of $-\log p(F = f_z)$ differs somewhat from how it is often used in the Bayesian framework. Often, the goal of the application of Bayes' rule when working with images is to classify the provided image.
In that case, the features are given, and the $-\log p(F = f_z)$ term functions as a (frequently omitted) normalizing constant. When the task is to find the point most likely to be part of a target class, however, $-\log p(F = f_z)$ plays a much more significant role as its value varies over the points of the image. In this case, its role in normalizing the likelihood is more important as it acts to factor in the potential usefulness of each feature to aid in discrimination. Assuming that targets are relatively rare, a target's feature is most useful if that feature is comparatively rare in the background environment, as otherwise the frequency with which that feature appears is likely to be more distracting than useful. As a simple illustration of this, consider that even if you know with absolute certainty that the target is red, i.e., $p(F = \text{red} \mid C = 1) = 1$, that fact is useless if everything else in the world is red as well.

When one considers that an organism may be interested in a large number of targets, the usefulness of rare features becomes even more apparent; the value of $p(F = f_z \mid C = 1)$ will vary for different target classes, while $p(F = f_z)$ remains the same regardless of the choice of targets. While there are specific distributions of $p(F = f_z \mid C = 1)$ for which SUN's bottom-up saliency measure would be unhelpful in finding targets, these are special cases that are not likely to hold in general (particularly in the free-viewing condition, where the set of potential targets is largely unknown). That is, minimizing $p(F = f_z)$ will generally advance the goal of increasing the ratio $p(F = f_z \mid C = 1) / p(F = f_z)$,
implying that points with rare features should be found "interesting."

Note that all of the probability distributions described here should be learned by the visual system through experience. Because the goal of the SUN model is to find potential targets in the surrounding environment, the probabilities should reflect the natural statistics of the environment and the learning history of the organism, rather than just the statistics of the current image. (This is especially obvious for the top-down terms, which require learned knowledge of the targets.)

In summary, calculating the probability of a target at each point in the visual field leads naturally to the estimation of information content. In the free-viewing condition, when there is no specific target, saliency reduces to the self-information of a feature. This implies that when one's attention is directed only by bottom-up saliency, moving one's eyes to the most salient points in an image can be regarded as maximizing information sampling, which is consistent with the basic assumption of Bruce and Tsotsos (2006). When a particular target is being searched for, on the other hand, our model implies that the best features to attend to are those that have the most mutual information with the target. This has been shown to be very useful in object detection with objects such as faces and cars (Ullman, Vidal-Naquet, & Sali, 2002).

In the rest of this paper, we will concentrate on bottom-up saliency for static images. This corresponds to the free-viewing condition, when no particular target is of interest. Although our model was motivated by the goal-oriented task of finding important targets, the bottom-up component remains task blind and has the statistics of natural scenes as its sole source of world knowledge.
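
The two measures summarized above, self-information for free viewing and pointwise mutual information for target search, can be sketched numerically as follows. This is only an illustration under stated assumptions: the synthetic Laplacian samples standing in for filter responses on natural images, the Gaussian "target" responses, the histogram density estimate, the bin edges, and all function names are hypothetical and are not taken from the SUN implementation.

```python
import numpy as np

def fit_histogram(samples, edges, eps=1e-6):
    """Estimate a discrete p(F = f) over bins from training feature responses."""
    counts, _ = np.histogram(samples, bins=edges)
    probs = counts.astype(float) + eps  # small constant avoids log(0) in empty bins
    return probs / probs.sum()

def bin_index(values, edges):
    """Map feature values to bin indices, clipping values outside the bin range."""
    idx = np.digitize(values, edges) - 1
    return np.clip(idx, 0, len(edges) - 2)

def self_information(values, p_f, edges):
    """Bottom-up saliency: -log p(F = f), with p learned in advance."""
    return -np.log(p_f[bin_index(values, edges)])

def pointwise_mutual_information(values, p_f, p_f_given_target, edges):
    """Overall saliency: log p(F = f | C = 1) - log p(F = f)."""
    idx = bin_index(values, edges)
    return np.log(p_f_given_target[idx]) - np.log(p_f[idx])

rng = np.random.default_rng(0)
edges = np.linspace(-5.0, 5.0, 41)

# Assumption: sparse, zero-centered responses stand in for natural statistics.
natural_responses = rng.laplace(loc=0.0, scale=0.5, size=100_000)
# Assumption: a hypothetical target class produces responses near 3.0.
target_responses = rng.normal(loc=3.0, scale=0.5, size=10_000)

p_f = fit_histogram(natural_responses, edges)
p_f_given_target = fit_histogram(target_responses, edges)

# A common response (0.0) and a rare, target-like response (3.0).
test_values = np.array([0.0, 3.0])
bottom_up = self_information(test_values, p_f, edges)
overall = pointwise_mutual_information(test_values, p_f, p_f_given_target, edges)
```

Because the distributions are estimated once from the training samples rather than from the test values, scoring a new point only requires a table lookup, which mirrors the local, precomputed character of the measure described in the text.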
In the next section, we provide a simple and efficient algorithm for bottom-up saliency that (as we demonstrate in the Results section) produces state-of-the-art performance in predicting human fixations.

[…] Gabor filters (e.g., Itti & Koch, 2001; Itti et al., 1998; Oliva et al., 2003; Torralba et al., 2006). In Bruce and Tsotsos (2006), the features are calculated as the responses to filters learned from natural images using independent component analysis (ICA). In this paper, we conduct experiments with both kinds of features.

Below, we describe the SUN algorithm for estimating the bottom-up saliency that we derived in the Bayesian framework for saliency section, $-\log p(F = f_z)$. Here, a point $z$ corresponds to a pixel in the image. For the remainder of the paper, we will drop the subscript $z$ for notational simplicity. In this algorithm, $F$ is a random vector of filter responses, $F = [F_1, F_2, \ldots]$, where the random variable $F_i$ represents the response of the $i$th filter at a pixel, and $f = [f_1, f_2, \ldots]$ are the values of these filter responses at this pixel location.

##### Method 1: Difference of Gaussians filters

As noted above, many existing models use a collection of DoG (difference of Gaussians) and/or Gabor filters as the first step of processing the input images. These filters are popular due to their resemblance to the receptive fields of neurons in the early stages of the visual system, namely the lateral geniculate nucleus of the thalamus (LGN) and the primary visual cortex (V1). DoGs, for example, give the well-known "Mexican hat" center-surround filter. Here, we apply DoGs to the intensity and color channels of an image. A more complicated feature set composed of a mix of DoG and Gabor filters was also initially evaluated, but results were similar to those of the simple DoG filters used h

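
To make the "Mexican hat" center-surround shape concrete, the following is a minimal NumPy sketch of a single DoG filter applied to one intensity channel. The kernel size, the two standard deviations, the zero-padded filtering routine, and the function names are assumptions chosen for illustration; they are not the scales or channels used in the paper.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalized 2D Gaussian kernel with odd side length `size`."""
    axis = np.arange(size) - size // 2
    xx, yy = np.meshgrid(axis, axis)
    kernel = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return kernel / kernel.sum()

def dog_kernel(size, sigma_center, sigma_surround):
    """Center-surround filter: narrow Gaussian minus wide Gaussian."""
    return gaussian_kernel(size, sigma_center) - gaussian_kernel(size, sigma_surround)

def filter_same(image, kernel):
    """Correlate `image` with `kernel` using zero padding; output matches input size."""
    pad = kernel.shape[0] // 2
    padded = np.pad(image, pad)
    out = np.zeros_like(image, dtype=float)
    height, width = image.shape
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            out += kernel[dy, dx] * padded[dy:dy + height, dx:dx + width]
    return out

# Hypothetical parameters: a 15x15 kernel with sigmas of 1 and 2 pixels.
kernel = dog_kernel(size=15, sigma_center=1.0, sigma_surround=2.0)

# A single bright dot on a dark background excites the filter center.
dot_image = np.zeros((32, 32))
dot_image[16, 16] = 1.0
dot_response = filter_same(dot_image, kernel)

# A uniform image produces (near) zero response away from the borders,
# because the two normalized Gaussians cancel.
flat_response = filter_same(np.ones((32, 32)), kernel)
```

The responses of a bank of such filters at a pixel would play the role of the feature vector $f$ whose probability is then evaluated under the learned natural statistics.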