Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts

Julia Kruk1, Jonah Lubin2, Karan Sikka1, Xiao Lin1, Dan Jurafsky3, Ajay Divakaran1
†, equal contribution
1 SRI International, Princeton, NJ
2 The University of Chicago, Chicago, Illinois
3 Stanford University, Stanford, CA

Abstract

Computing author intent from multimodal data like Instagram posts requires modeling a complex relationship between text and image. For example, a caption might evoke an ironic contrast with the image, so neither caption nor image is a mere transcript of the other. Instead they combine, via what has been called meaning multiplication (Bateman, 2014), to create a new meaning that has a more complex relation to the literal meanings of text and image. Here we introduce a multimodal dataset of 1299 Instagram posts labeled for three orthogonal taxonomies: the authorial intent behind the image-caption pair, the contextual relationship between the literal meanings of the image and caption, and the semiotic relationship between the signified meanings of the image and caption. We build a baseline deep multimodal classifier to validate the taxonomy, showing that employing both text and image improves intent detection by 9.6% compared to using only the image modality, demonstrating the commonality of non-intersective meaning multiplication. The gain with multimodality is greatest when the image and caption diverge semiotically. Our dataset offers a new resource for the study of the rich meanings that result from pairing text and image.

1 Introduction

Multimodal social platforms such as Instagram let content creators combine visual and textual modalities.
The resulting widespread use of text+image makes interpreting author intent in multimodal messages an important task for NLP for document understanding.

There are many recent language processing studies of images accompanied by basic text labels or captions (Chen et al., 2015; Faghri et al., 2018, inter alia). But prior work on image–text pairs has generally been asymmetric, regarding either image or text as the primary content, and the other as mere complement. Scholars from semiotics as well as computer science have pointed out that this is insufficient; often text and image are not combined by a simple addition or intersection of the component meanings (Bateman, 2014; Marsh and Domas White, 2003; Zhang et al., 2018).

Rather, determining author intent with text+image content requires a richer kind of meaning composition that has been called meaning multiplication (Bateman, 2014): the creation of new meaning through integrating image and text. Meaning multiplication includes simple meaning intersection or concatenation (a picture of a dog with the label “dog”, or the label “Rufus”). But it also includes more sophisticated kinds of composition, such as irony or indirection, where the text+image integration requires inference that creates a new meaning. For example, in Figure 1, a picture of a young woman smoking is given two different hypothetical captions that result in different composed meanings. In Pairing I, the image and text are parallel, with the picture used to highlight relaxation through smoking. Pairing II uses the tension between her image and the implications of her actions to highlight the dangers of smoking.

Figure 1: Image-Caption meaning multiplication: A change in the caption completely changes the overall meaning of the image-caption pair.

Work done while Julia (from Cornell University) and Jonah were interns at SRI International.
†Corresponding author, [email protected]

Computational models that detect complex relationships between text and image and how they cue author intent could be significant for many areas, including the computational study of advertising, the detection and study of propaganda, and our deeper understanding of many other kinds of persuasive text, as well as allowing NLP applications to news media to move beyond pure text.

To better understand author intent given such meaning multiplication, we create three novel taxonomies related to the relationship between text and image and their combination/multiplication in Instagram posts, designed by modifying existing taxonomies (Bateman, 2014; Marsh and Domas White, 2003) from semiotics, rhetoric, and media studies. Our taxonomies measure the authorial intent behind the image-caption pair and two kinds of text-image relations: the contextual relationship between the literal meanings of the image and caption, and the semiotic relationship between the signified meanings of the image and caption. We then introduce a new dataset, MDID (Multimodal Document Intent Dataset), with 1299 Instagram posts covering a variety of topics, annotated with labels from our three taxonomies.

Finally, we build a deep neural network model for annotating Instagram posts with the labels from each taxonomy, and show that combining text and image leads to better classification, especially when the caption and the image diverge. While our goal here is to establish a computational framework for investigating multimodal meaning multiplication, in other pilot work we have begun to consider some applications, such as using intent for social media event detection and for user engagement prediction.
Both these directions highlight the importance of the intent and semiotic structure of a social media posting in determining its influence on the social network as a whole.

2 Prior Work

A wide variety of work in multiple fields has explored the relationship between text and image and extracting meaning, although often assigning a subordinate role to either text or images, rather than the symmetric relationship in media such as Instagram posts. The earliest work in the Barthesian tradition focuses on advertisements, in which the text serves as merely another connotative aspect to be incorporated into a larger connotative meaning (Heath et al., 1977). Marsh and Domas White (2003) offer a taxonomy of the relationship between image and text by considering image/illustration pairs found in textbooks or manuals. We draw on their taxonomy, although as we will see, the connotational aspects of Instagram posts require some additions.

For our model of speaker intent, we draw on the classic concept of illocutionary acts (Austin, 1962) to develop a new taxonomy of illocutionary acts focused on the kinds of intentions that tend to occur on social media. For example, we rarely see commissive posts on Instagram and Facebook because of the focus on information sharing and constructions of self-image.

Computational approaches to multi-modal document understanding have focused on key problems such as image captioning (Chen et al., 2015; Faghri et al., 2018), visual question answering (Goyal et al., 2017; Zellers et al., 2018; Hudson and Manning, 2019), or extracting the literal or connotative meaning of a post (Soleymani et al., 2017). More recent work has explored the role of image as context for interaction and pragmatics, either in dialog (Mostafazadeh et al., 2016, 2017), or as a prompt for users to generate descriptions (Bisk et al., 2019).
Another important direction has looked at an image’s perlocutionary force (how it is perceived by its audience), including aspects such as memorability (Khosla et al., 2015), saliency (Bylinskii et al., 2018), popularity (Khosla et al., 2014) and virality (Deza and Parikh, 2015; Alameda-Pineda et al., 2017).

Some prior work has focused on intention. Joo et al. (2014) and Huang and Kovashka (2016) study prediction of intent behind politician portraits in the news. Hussain et al. (2017) study the understanding of image and video advertisements, predicting topic, sentiment, and intent. Alikhani et al. (2019) introduce a corpus of the coherence relationships between recipe text and images. Our work builds on Siddiquie et al. (2015), who focused on a single type of intent (detecting politically persuasive video on the internet), and even more closely on Zhang et al. (2018), who study visual rhetoric as interaction between the image and the text slogan in advertisements. They categorize image-text relationships into parallel equivalent (image and text deliver the same point at equal strength), parallel non-equivalent (image and text deliver the same point at different levels) and non-parallel (text or image alone is insufficient in point delivery). They also identify the novel issue of understanding the complex, non-literal ways in which text and image interact. Weiland et al. (2018) study the non-literal meaning conveyed by image-caption pairs and draw on a knowledge base to generate the gist of the image-caption pair.

3 Taxonomies

As Berger (1972) points out in discussing the relationship between one image and its caption:

    It is hard to define exactly how the words have changed the image but undoubtedly they have. (p. 28).

We propose three taxonomies in an attempt to answer Berger’s implicit question: two (contextual and semiotic) to capture different aspects of the relationship between the image and the caption, and one to capture speaker intent.

Figure 2: Examples of multimodal document intent: advocative, provocative, expressive and promotive content.

3.1 Intent Taxonomy

The proposed intent taxonomy is a generalization and elaboration of existing rhetorical categories pertaining to illocution that targets multimodal social networks like Instagram. We developed a set of eight illocutionary intents from our examination and clustering of a large body of representative Instagram content, informed by previous studies of intent in Instagram posts. There is some overlap between categories; to bound the burden on the annotators, however, we asked them to identify intent for the image-caption pairing as a whole and not for the individual components. For example, drawing on Goffman’s idea of the presentation of self (Goffman, 1978), Mahoney et al. (2016) in their study of Scottish political Instagram posts define acts like Presentation of Self, which, following Hogan (2010), we refer to as exhibition, or Personal Political Expression, which we generalize to advocative.
Following are our final eight labels; Figure 2 shows some examples.

1. advocative: advocate for a figure, idea, movement, etc.
2. promotive: promote events, products, organizations etc.
3. exhibitionist: create a self-image reflecting the person, state etc. for the user using selfies, pictures of belongings (e.g. pets, clothes) etc.
4. expressive: express emotion, attachment, or admiration at an external entity or group.
5. informative: relay information regarding a subject or event using factual language.
6. entertainment: entertain using art, humor, memes, etc.
7. provocative/discrimination: directly attack an individual or group.
8. provocative/controversial: be shocking.

3.2 The Contextual Taxonomy

The contextual relationship taxonomy captures the relationship between the literal meanings of the image and text. We draw on the three top-level categories of the Marsh and Domas White (2003) taxonomy, which distinguished images that are minimally related to the text, highly related to the text, and related but going beyond it. These three classes, reflecting Marsh et al.’s primary interest in illustration, frame the image only as subordinate to the text. We slightly generalize the three top-level categories of the taxonomy of Marsh and Domas White (2003) to make them symmetric for the Instagram domain:

Minimal Relationship: The literal meanings of the caption and image overlap very little. For example, a selfie of a person at a waterfall with the caption “selfie”. While such a terse caption does nevertheless convey a lot of information, it still leaves out details such as the location, description of the scene, etc. that are found in typical loquacious Instagram captions.

Close Relationship: The literal meanings of the caption and the image overlap considerably. For example, a selfie of a person at a crowded waterfall, with the caption “Selfie at Hemlock Falls on a crowded sunny day”.

Transcendent Relationship: The literal meaning of one modality picks up and expands on the literal meaning of the other. For example, a selfie of a person at a crowded waterfall with the caption “Selfie at Hemlock Falls on a sunny and crowded day. Hemlock Falls is a popular picnic spot. There are hiking and biking trails, and a great restaurant 3 miles down the road.”

Note that while the labels “minimal” and “close” could be thought of as lying on a continuous scale indicating semantic overlap, the label “transcendent” indicates an expansion of the meaning that cannot be captured by such a continuous scale.

Contextual Relationship
Category        # Samples
Minimal         372
Close           585
Transcendent    147

Table 1: Counts of different labels in the Multimodal Document Intent Dataset (MDID).

3.3 The Semiotic Taxonomy

The contextual taxonomy described above does not deal with the more complex forms of “meaning multiplication” illustrated in Figure 1. For example, an image of three frolicking puppies with the caption “My happy family” sends a message of pride in one’s pets that is not directly reflected in either modality taken by itself.
First, it forces the reader to step back and consider what is being signified by the image and the caption, in effect offering a meta-comment on the text-image relation. Second, there is a tension between what is signified (a family and a litter of young animals respectively) that results in a richer idiomatic meaning.

Our third taxonomy therefore captures the relationship between what is signified by the respective modalities, their semiotics. We draw on the earlier 3-way distinction of Kloepfer (1977) as modeled by Bateman (2014) and the two-way (parallel vs. non-parallel) distinction of Zhang et al. (2018) to classify the semiotic relationship of image/text pairs as divergent, parallel and additive. A divergent relationship occurs when the image and text semiotics pull in opposite directions, creating a gap between the meanings suggested by the image and text. A parallel relationship occurs when the image and text independently contribute to the same meaning. An additive relationship occurs when the image and text semiotics amplify or modify each other.

The semiotic classification is not always homologous to the contextual. For example, an image of a mother feeding her baby with a caption “My new small business needs a lot of tender loving care” would have a minimal contextual relationship. Yet because both signify loving care and the image intensifies the caption’s sentiment, the semiotic relationship is additive. Or a lavish formal farewell scene at an airport with the caption “Parting is such sweet sorrow” has a close contextual relationship because of the high overlap in literal meaning, but the semiotics would be additive, not parallel, since the image shows only the leave-taking, while the caption suggests love (or ironic lack thereof) for the person leaving.
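One way to make the independence of the three label sets concrete is to treat each taxonomy as its own enumeration and attach all three labels to a post. The sketch below is only an illustration: the class names, the `AnnotatedPost` record and its fields are hypothetical and are not taken from the MDID release. The example record encodes the "small business" pairing described above, whose contextual label is minimal while its semiotic label is additive; its intent is left unset because the text does not state one.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Intent(Enum):
    """The eight intent labels of Section 3.1."""
    ADVOCATIVE = "advocative"
    PROMOTIVE = "promotive"
    EXHIBITIONIST = "exhibitionist"
    EXPRESSIVE = "expressive"
    INFORMATIVE = "informative"
    ENTERTAINMENT = "entertainment"
    PROVOCATIVE_DISCRIMINATION = "provocative/discrimination"
    PROVOCATIVE_CONTROVERSIAL = "provocative/controversial"


class Contextual(Enum):
    """Relationship between the literal meanings (Section 3.2)."""
    MINIMAL = "minimal"
    CLOSE = "close"
    TRANSCENDENT = "transcendent"


class Semiotic(Enum):
    """Relationship between the signified meanings (Section 3.3)."""
    DIVERGENT = "divergent"
    PARALLEL = "parallel"
    ADDITIVE = "additive"


@dataclass(frozen=True)
class AnnotatedPost:
    """Hypothetical record type: one image-caption pair with its three labels."""
    caption: str
    image_description: str          # stand-in for the actual image
    contextual: Contextual
    semiotic: Semiotic
    intent: Optional[Intent] = None  # None when no intent label is given


# The mother-and-baby example from the text: little literal overlap,
# yet image and caption amplify each other's signified meaning.
example = AnnotatedPost(
    caption="My new small business needs a lot of tender loving care",
    image_description="a mother feeding her baby",
    contextual=Contextual.MINIMAL,
    semiotic=Semiotic.ADDITIVE,
)
```

Keeping the three label sets as separate types reflects the point made above that the semiotic classification cannot be derived from the contextual one.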

Figure 3 further illustrates the proposed semiotic classification. The first three image-caption pairs (ICPs) exemplify the three semiotic relationships. To give further insights into the rich complexity of the divergent category, the six ICPs below showcase the kinds of divergent relationships we observed most frequently on Instagram.

ICP I exploits the tension between the reference to retirement expressed in the caption and the youth projected by the two young women in the image to convey irony and thus humor in what is perhaps a birthday greeting or announcement. Many ironic and humorous posts exhibit divergent semiotic relationships. ICP II has the structure of a classic Instagram meme, where the focus is on the image, and the caption is completely unrelated to the image. This is also exhibited in the divergent “Good Morning” caption in the top row. ICP III is an example of a divergent semiotic relationship within an exhibitionist post. A popular communicative practice on Instagram is to combine selfies with a caption that is some sort of inside joke. The inside joke in ICP III is a lyric from a song a group of friends found funny and discussed the night this photo was taken. ICP IV is an aesthetic photo of a young woman, paired with a caption that has no semantic elements in common with the photo. The caption may be a prose excerpt, the author’s reflection on what the image made them think or feel, or perhaps just a pairing of pleasant visual stimulus with pleasant literary material. This divergent relationship is often found in photography, artistic and other entertainment posts. ICP V uses one of the most common divergent relationships, in which exhibitionist visual material is paired with reflections or motivational captions. ICP V is thus similar to ICP III, but without the inside jokes/hidden meanings common to ICP III. ICP VI is an exhibitionist post that seems to be common recently among public figures on Instagram. The image appears to be a classic selfie or often a professionally taken image of the individual, but the caption refers to that person’s opinions or agenda(s). This relationship is divergent (there are no common semantic elements in the image and caption), but the pair paints a picture of the individual’s current state or future plans.

Figure 3: The top three images exemplify the semiotic categories. Images I-VI show instances of divergent semiotic relationships.

4 The MDID Dataset

Our dataset, MDID (the Multimodal Document Intent Dataset), consists of 1299 public Instagram posts that we collected with the goal of developing a rich and diverse set of posts for each of the eight illocutionary types in our intent taxonomy. For each intent we collected at least 16 hashtags or users likely to yield a high proportion of posts that could be labeled by that heading.

For the advocative intent, we selected mostly hashtags advocating and spanning political or social ideology such as #pride and #maga. For the promotive intent we relied on the #ad tag that Instagram has recently begun requiring for sponsored posts. For exhibitionist intent we used tags that focused on the self as the most important aspect of the post such as #selfie and #ootd (outfit of the day). The expressive posts were retrieved via tags that actively expressed a stance or an affective intent, such as #lovehim or #merrychristmas. Informative posts were taken from informative accounts such as news websites. Entertainment posts drew on an eclectic group of tags such as #meme, #earthporn, #fatalframes. Finally, provocative posts were extracted via tags that either expressed a controversial or provocative message or that would draw people into being influenced or provoked by the post (#redpill, #antifa, #eattherich, #snowflake).

Data Labeling: Data was pre-processed (for example to convert all albums to single image-caption pairs). We developed a simple annotation toolkit that displayed an image–caption pair and asked the annotator to confirm whether the pair was relevant (contains both an image and text in English) and if so to identify the post’s intent (advocative, promotive, exhibitionist, expressive, informative, entertainment, provocative), contextual relationship (minimal, close, transcendent), and semiotic relationship (divergent, parallel, additive). Two of the authors collaborated on the labelers’ manual and then labeled the data by consensus, and any label on which the authors disagreed after discussion was removed. Dataset statistics are shown in Table 1; see intent.html for the data.

5 Computational Model

We train and test a deep convolutional neural network (DCNN) model on the dataset, both to offer a baseline model for users of the dataset, and to further explore our hypothesis about meaning multiplication.

Our model can take as input either image (Img), text (Txt) or both (Img+Txt), and consists of modality-specific encoders, a fusion layer, and a class prediction layer. We use the ResNet-18 network pre-trained on ImageNet as the image encoder (He et al., 2016). For encoding captions, we use a standard pipeline that employs an RNN model on word embeddings. We experiment with both word2vec type (word token-based) embeddings trained from scratch (Mikolov et al., 2013) and pre-trained character-based contextual embeddings (ELMo) (Peters et al., 2018). For our purpose ELMo character embeddings are more useful since they increase robustness to noisy and often misspelled Instagram captions. For the combined model, we implement a simple fusion strategy that first linearly projects encoded vectors from both the modalities into the same embedding space and then adds the two vectors.
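The project-then-add fusion just described can be sketched in a few lines of plain Python. This is a toy illustration rather than the authors' implementation: the dimensions are deliberately tiny, the weights are random instead of learned, and the helper names (`init_linear`, `linear`, `fuse`) are hypothetical. The single-modality branch assumes that only the available modality is projected, as the implementation details in Section 6.1 state.

```python
import random

# Toy sizes for illustration only; they are not the paper's dimensions.
IMG_DIM, TXT_DIM, SHARED_DIM, N_CLASSES = 8, 6, 4, 3


def init_linear(out_dim, in_dim, rng):
    """Return (weights, bias) for an affine map with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(params, x):
    """Compute W x + b for params = (W, b), with W stored as a list of rows."""
    weights, bias = params
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def fuse(img_vec, txt_vec, img_proj, txt_proj):
    """Project each available modality into the shared space, then add."""
    projected = []
    if img_vec is not None:
        projected.append(linear(img_proj, img_vec))
    if txt_vec is not None:
        projected.append(linear(txt_proj, txt_vec))
    if not projected:
        raise ValueError("at least one modality is required")
    # Element-wise sum over however many modalities are present.
    return [sum(values) for values in zip(*projected)]


rng = random.Random(0)
img_proj = init_linear(SHARED_DIM, IMG_DIM, rng)       # image -> shared space
txt_proj = init_linear(SHARED_DIM, TXT_DIM, rng)       # text  -> shared space
classifier = init_linear(N_CLASSES, SHARED_DIM, rng)   # fused -> class scores

img_vec = [rng.random() for _ in range(IMG_DIM)]  # stand-in for image features
txt_vec = [rng.random() for _ in range(TXT_DIM)]  # stand-in for caption features

fused = fuse(img_vec, txt_vec, img_proj, txt_proj)
scores = linear(classifier, fused)  # one unnormalized score per class
```

Because the two projections share an output dimension, the sum is well defined even though the encoders produce vectors of different sizes, and dropping one modality simply leaves the other projection unchanged.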
Although naive, this strategy has been shown to be effective at a variety of tasks such as Visual Question Answering (Nguyen and Okatani, 2018) and image-caption matching (Ahuja et al., 2018). We then use the fused vector to predict class-wise scores using a fully connected layer.

6 Experiments

We evaluate our models on predicting intent, semiotic relationships, and image-text relationships from Instagram posts, using image only, text only, and both modalities.

6.1 Dataset, Evaluation and Implementation

We use the 1299-sample MDID dataset (Section 4). We only use corresponding image and text information for each post and do not use other meta-data to preserve the focus on image-caption joint meaning. We perform basic pre-processing on the captions such as removing stopwords and non-alphanumeric characters. We do not perform any pre-processing for images.

Due to the small dataset, we perform 5-fold cross-validation for our experiments, reporting average performance across all splits. We report classification accuracy (ACC) and also area under the ROC curve (AUC) (since AUC is more robust to class-skew), using macro-average across all classes (Jeni et al., 2013; Stager et al., 2006).

We use a pre-trained ResNet-18 model as the image encoder. For word token based embeddings we use 300 dimensional vectors trained from scratch. For ELMo we use a publicly available API (footnote 1) and use a pre-trained model with two layers resulting in a 2048 dimensional input. We use a bidirectional GRU as the RNN model with 256 dimensional hidden layers. We set the dimensionality of the common embedding space in the fusion layer to 128. In case there is a single modality, the fusion layer only projects features from that modality. We train with the Adam optimizer with a learning rate of 0.00005, which is decayed by 0.1 after every 15 epochs. We report results with the best model selected based on performance on a mini validation set.

6.2 Quantitative Results

We show results in Table 2.
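The training schedule and evaluation protocol of Section 6.1 can be sketched without any deep learning library. The code below is a hedged illustration, not the authors' code: the function names are hypothetical, "decayed by 0.1" is read as multiplying the learning rate by 0.1 every 15 epochs, the folds are shuffled and unstratified (the paper does not say how splits were drawn), and macro-average AUC is computed one-vs-rest with ties counted as half.

```python
import random


def step_decay_lr(epoch, base_lr=5e-5, gamma=0.1, step=15):
    """Learning rate at a given epoch: multiplied by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k shuffled, unstratified folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(y_true, score_rows, n_classes):
    """Unweighted mean of one-vs-rest AUCs over all classes."""
    per_class = []
    for c in range(n_classes):
        labels = [1 if y == c else 0 for y in y_true]
        scores = [row[c] for row in score_rows]
        per_class.append(binary_auc(labels, scores))
    return sum(per_class) / n_classes
```

With these definitions, a model that assigns identical scores to every example obtains a macro-average AUC of 0.5 regardless of class balance, which matches the 50.0 chance level reported for AUC and illustrates why AUC is less sensitive to class skew than accuracy.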
For the intent taxonomy, images are more informative than (word2vec) text (76.0% for Img vs. 72.7% for Txt-emb), but with ELMo, text outperforms just using images (82.6% for Txt-ELMo vs. 76.0% for Img). ELMo similarly improves performance on the contextual taxonomy but not the semiotic taxonomy.

For the semiotic taxonomy, ELMo and word2vec embeddings perform similarly (67.8% for Txt-emb vs. 66.5% for Txt-ELMo), suggesting that individual words are sufficient for the semiotic labeling task, and the presence of the sentence context (as in ELMo) is not needed.

Combining visual and textual modalities helps across the board. For example, for the intent taxonomy the joint model Img+Txt-ELMo achieves 85.6% compared to 82.6% for Txt-ELMo. Images seem to help even more when using a word embedding based text model (80.8% for Img+Txt-emb vs. 72.7% for Txt-emb). Joint models also improve over single-modality models on labeling the image-text relationship and the semiotic taxonomy. We show class-wise performances with the single- and multi-modality models in Table 3. It is particularly interesting that in the semiotic taxonomy, multimodality helps the most with divergent semiotics (gain of 4.4% compared to the image-only model).

Method          Intent                     Semiotic                   Contextual
                ACC          AUC           ACC          AUC           ACC          AUC
Chance          28.1         50.0          64.5         50.0          53.0         50.0
Img             42.9 (0.0)   76.0 (0.5)    61.5 (0.0)   59.8 (3.0)    52.5 (0.0)   62.5 (1.3)
Txt-emb         42.9 (0.0)   72.7 (1.5)    58.9 (0.0)   67.8 (1.7)    60.7 (0.5)   74.9 (3.0)
Txt-ELMo        52.7 (0.0)   82.6 (1.2)    61.7 (0.0)   66.5 (1.9)    65.4 (0.0)   78.5 (2.1)
Img+Txt-emb     48.1 (0.0)   80.8 (1.2)    60.4 (0.0)   69.7 (1.8)    60.8 (0.0)   76.0 (2.5)
Img+Txt-ELMo    56.7 (0.0)   85.6 (1.3)    61.8 (0.0)   67.8 (1.8)    63.6 (0.5)   79.0 (1.4)

Table 2: Results with various DCNN models: image-only (Img), text-only (Txt-emb and Txt-ELMo), and combined (Img+Txt-emb and Img+Txt-ELMo). Here emb is the model using standard word (token) based embeddings, while ELMo is the model using the pre-trained ELMo based word embeddings (Peters et al., 2018). The numbers in parentheses are standard deviations across 5 folds.

1 ster/tutorials/how to/

Figure 4: Confusion between intent classes for the intent classification task.
The confusion matrix was obtained using the Img+Txt-ELMo model and the results are averaged over the 5 splits.

6.3 Discussion

In general, using both text and image is helpful, a fact that is unsurprising since combinations of text and image are known to increase performance on tasks such as predicting post popularity or user personality (Hessel et al., 2017; Wendlandt et al., 2017). Most telling, however, were the differences in this helpfulness across items. In the semiotic category the greatest gain came when the text-image semiotics were “divergent”. By contrast, multimodal models help less when the image and text are additive, and help the least when the image and text are parallel and provide less novel information. Similarly, with contextual relationships, multimodal analysis helps the most with the “minimal” category (1.6%). This further supports the idea that on social media such as Instagram, the relation between image and text can be richly divergent and thereby form new meanings.

The category confusion matrix in Figure 4 provides further insights. The least confused category is informative. Informative posts are least similar to the rest of Instagram, since they consist of detached, objective posts, with little use of first person pronouns like “I” or “me.” Promotive posts are also relatively easy to detect, since they are formally informative, telling the viewer the advantages and logistics of an item or event, with the addition of a persuasive intent reflecting the poster’s personal opinions (“I love this watch”). We found that the entertainment label is often misapplied; perhaps to some extent all posts have a goal of entertaining, and any analysis must account for this filter of entertainment. The exhibitionist intent tends to be predicted well, likely due to its visual and textual signifiers of individuality (e.g. selfies are almost always exhibitionist, as are captions like “I love my new hair”). There is a great deal of confusion, however, between the expressive and exhibitionist categories, since the only distinction lies in whether the post is about a general topic or about the poster, and between the provocative and advocative categories, perhaps because both often seek to prove points in a similar way.

Contextual      Minimal   Close   Transcendent   Mean
Img             60.9      60.5    66.1           62.5
Txt-ELMo        79.7      73.8    82.0           78.5
Img+Txt-ELMo    81.3      74.6    81.2           79.0

Table 3: Class-wise results (AUC) for the three taxonomies with different DCNN models on the MDID dataset. Except for the semiotic taxonomy we used ELMo text representations (based on the performance in Table 2).

With the contextual and semiotic taxonomies, some good results are obtained with text alone. In the “transcendent” contextual case, it is not necessarily surprising that using text alone enables 82% AUC because whenever a caption is really long or has many adverbs, adjectives or abstract concepts, it is highly likely to be transcendent. In the “divergent” semiotics case, we were surprised that text alone would predict divergence with 72.7% AUC. Examining these cases showed that many of them had lexical cues suggesting irony or sarcasm, allowing the system to infer that the image will diverge in keeping with the irony. There is, however, a consistent improvement when both modalities are used for both taxonomies.

6.4 Sample Outputs

We show some sample successful outputs of the (multimodal) model in Figure 5, in which the highest probability class in each of the three dimensions corresponds to our gold labels. The top-left image-caption pair (Image I) is classified as exhibitionist, closely followed by expressive; it is a picture of someone’s home with a caption describing a domestic experience. The semiotic relationship is classified as additive; the image and caption together signify the concept of spending winter at home with pets before the fireplace.
The contextual relationship is classified as transcendent; the caption indeed goes well beyond the image.

The top-right image-caption pair (Image II) is classified as entertainment; the image-caption pair works as an ironic reference to dancing (“yeet”) grandparents, who are actually reading, in language usually used by young people that a typical grandparent would never use. The semiotic relationship is classified as divergent and the contextual relationship is classified as minimal; there is semantic and semiotic divergence of the image-caption pair caused by the juxtaposition of youthful references with older people.

To further understand the role of meaning multiplication, we consider the change in intent and semiotic relationships when the same image of the British Royal Family is matched with two different captions in the bottom row of Figure 5 (Image IV). In both cases the semiotic relationship is parallel, perhaps due to the match between the multi-figure portrait setting and the word family. But the other two dimensions show differences. When the caption is “the royal family” our system classifies the intent as
