Transformer-Based Zero-Shot Detection via Contrastive Learning

Liu, Wei; Chen, Hui; Ma, Yongqiang; Wang, Jianji; Zheng, Nanning

doi:10.1007/978-3-031-08333-4_26

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 646)

Included in the following conference series:

IFIP International Conference on Artificial Intelligence Applications and Innovations

Abstract

Zero-Shot Detection (ZSD) is a challenging computer vision problem that requires simultaneous classification and localization of previously unseen objects via auxiliary information. Most existing methods learn a biased visual-semantic mapping function that favors seen classes at test time, and they focus only on regions of interest while ignoring contextual information in the image. To tackle these problems, we propose a novel framework for ZSD named Transformer-based Zero-Shot Detection via Contrastive Learning (TZSDC). TZSDC contains four components: a transformer-based backbone, a Foreground-Background (FB) separation module, an Instance-Instance Contrastive Learning (IICL) module, and a Knowledge-Transfer (KT) module. The transformer backbone encodes long-range contextual information with less inductive bias. The FB module separates foreground from background by scoring objectness. The IICL module optimizes the visual structure in the embedding space to make it more discriminative, and the KT module transfers knowledge from seen classes to unseen classes via category similarity. Together, these modules enable accurate alignment between contextual visual features and semantic features. Experiments on MSCOCO validate the effectiveness of the proposed method for ZSD and generalized ZSD.

1 Introduction

In recent years, deep learning has made great progress in object detection [1, 8]. However, these methods rely heavily on large-scale annotated data, and their performance drops rapidly when such data is insufficient [12, 13]. In practice, it is difficult for detectors to generalize to new target domains where annotated data is scarce or absent. Humans, by contrast, can readily recognize a new class by analogy with similar objects they already know.

To address these problems, Zero-Shot Object Detection (ZSD) [7, 9, 14, 16, 17] has been proposed to classify and localize unseen classes when only seen classes are available during training. Most ZSD models [9, 14, 16] learn a visual-semantic mapping function from the visual data and related semantic information of seen classes. At the testing stage, they use the learned model to map visual features into an embedding space and perform a nearest neighbor search to predict unseen classes. Several studies [7, 25] use a generative model to synthesize features of unseen classes and then retrain a classifier on them, turning zero-shot learning into supervised learning.

These methods [9, 16] learn the model on seen classes while ignoring the semantic information available for unseen classes. The model therefore becomes significantly biased towards seen classes at test time, which greatly degrades the performance of ZSD and generalized ZSD (GZSD). Besides, current zero-shot detection networks, which are adapted from one-stage or two-stage detectors, focus only on local information near an object’s region of interest and do not explicitly encode long-range dependencies between objects, which are crucial for detecting multiple objects in an image.

In this paper, we develop a novel framework for ZSD called Transformer-based Zero-Shot Detection via Contrastive Learning (TZSDC), which consists of four modules: a transformer-based detector named Deformable DETR [26], a Foreground-Background (FB) separation module, an Instance-Instance Contrastive Learning (IICL) module, and a Knowledge-Transfer (KT) module. We use Deformable DETR to encode the input images into contextual features. To alleviate the confusion between unseen classes and backgrounds, the FB module makes full use of the existing visual background to compute an objectness score for the output query embeddings. Meanwhile, to make visual features in the embedding space more discriminative, the IICL module performs contrastive learning between instances to optimize the visual manifold structure, so that instances of the same class are more compact and instances of different classes are farther apart. To alleviate the bias of the learned model towards seen classes, the KT module transfers knowledge from seen classes to unseen classes via category similarity.

The main contributions of the paper can be summarized as follows: (i) We propose a novel framework, TZSDC, that integrates the transformer and contrastive learning into zero-shot detection, achieving an accurate visual-semantic alignment. (ii) We design an FB module to alleviate the confusion between unseen classes and the background, and a KT module to transfer knowledge from seen classes to unseen classes through category similarity. (iii) Experiments on MSCOCO verify that the proposed method can effectively improve the performance on ZSD and GZSD tasks.

2 Related Work

Object Detection. Object detection has received considerable attention and developed rapidly in the past few years. Traditional object detection frameworks fall mainly into two types: one-stage methods such as SSD [11], YOLO [18], and FCOS [21], and two-stage methods such as Faster R-CNN [19] and R-FCN [3]. In general, they generate bounding boxes, determine which boxes contain objects, and then classify the high-confidence boxes. However, due to the design of convolution, they focus only on local information near the region of interest. In recent years, transformers [2, 26] have developed rapidly in computer vision. DETR [2] applies the transformer to object detection, and Deformable DETR [26] adopts the idea of deformable convolution [4], which integrates multi-scale information and accelerates the convergence of DETR. These transformer-based detectors can encode long-range dependencies, at multiple scales in the case of Deformable DETR, to enrich contextual information. In this work, we choose Deformable DETR as our basic detection framework.

Zero-Shot Learning (ZSL). ZSL is a classic task in computer vision. It aims to train networks on seen examples and reason about unseen classes with the help of semantic information. ZSL methods can be divided into embedding models and generative models. Embedding models [5, 20] learn a mapping function that projects visual features and semantic descriptors into an embedding space, and then classify by searching for the nearest semantic descriptor in that space. In our work, we adopt a basic visual-semantic embedding model, take the latent space as the embedding space, and exploit the similarity between seen and unseen classes to explicitly transfer knowledge from the source classes to the target classes, promoting better visual-semantic alignment.

Zero-Shot Object Detection (ZSD). ZSD is a recently proposed task that aims to identify and localize unseen objects. Most ZSD methods focus on learning embedding functions from visual space to semantic space. MS-Zero [6] designed an asymmetric mapping method to reduce the impact of new noise on the classifier, which first maps visual features to semantic space and then maps semantic features to visual space. Polarity Loss [15] was proposed to find a more suitable alignment of visual and semantic information; it builds on Focal Loss to address the imbalance between positive and negative samples. BLRPN-ZSD [24] designed a background perceptron that uses external annotations to resolve the confusion between unseen classes and backgrounds. In our work, we introduce a foreground objectness branch that learns from existing visual background data to better separate unseen classes from the background. At the same time, we introduce a contrastive network in the classification branch to explicitly transfer knowledge from the source classes to the target classes, which helps alleviate the domain transfer problem and the visual-semantic gap.

3 Method

Problem Settings. In ZSD, we are given S seen classes in $\mathcal {Y}^{s}$ and U unseen classes in $\mathcal {Y}^{u}$, where seen classes and unseen classes are disjoint, that is, $\mathcal {Y}^{s} \cap \mathcal {Y}^{u}=\emptyset $ and $\mathcal {Y}^{s} \cup \mathcal {Y}^{u} = \mathcal {Y}$. We use $\mathcal {Y}^{s}=\left\{ Y_{1}, Y_{2}, \cdots , Y_{S}\right\} $ to represent the seen classes and $\mathcal {Y}^{u}=\left\{ Y_{S+1}, Y_{S+2}, \cdots , Y_{S+U}\right\} $ to represent the unseen classes. Let $\mathcal {D}^{tr} = \left\{ \left( {x}_{i}, y_{i}\right) \mid {x}_{i} \in \mathcal {X}, y_{i} \in \mathcal {Y}^{s}\right\} _{i=1}^{N}$ be the training dataset containing N images. During training and testing, semantic word-vectors $A=\left\{ \boldsymbol{a}_{c}\right\} _{c=1}^{S+U}$ are provided for each class $c \in \mathcal {Y}^{s} \cup \mathcal {Y}^{u}$ to enable knowledge transfer. The task of ZSD is to learn a detector that recognizes and localizes unseen classes during testing.

3.1 Overall Architecture

The overall framework for zero-shot detection is shown in Fig. 1. It adapts the standard Deformable DETR [26] backbone to ZSD by introducing (i) a Foreground-Background (FB) separation module to reduce the confusion between unseen classes and backgrounds; (ii) an Instance-Instance/Instance-Semantic Contrastive Learning (IICL/ISCL) module to optimize the visual manifold structure in the embedding space; (iii) a Knowledge Transfer (KT) module to transfer knowledge from seen to unseen classes via category similarity.

Given an input image $\boldsymbol{x} \in \mathcal {X}$, ResNet-101 extracts multi-scale features F, to which a sine-cosine position encoding E is added to preserve position information. The multi-scale features with position encoding are then fed into the transformer encoder and decoder, which use deformable attention modules inspired by deformable convolution [4]. Driven by the cross-attention and self-attention mechanisms, the decoder converts a set of M learnable object queries into a set of M query embeddings $\mathcal {Q}=\left\{ \boldsymbol{q}_{i}\right\} _{i=1}^{M},\boldsymbol{q}_{i}\in \mathbb {R}^{D} $, which capture the relative positional relationships between objects. The query embeddings $\mathcal {Q}$ are then fed into the regressor, the FB module, the IICL module, and the KT module. The FB module computes an objectness score for each output query embedding, using the existing visual background data to separate foreground from background with a binary cross-entropy loss. The mapping functions $E_{v}$ and $E_{s}$ map the query embeddings $\mathcal {Q}=\left\{ \boldsymbol{q}_{i}\right\} _{i=1}^{M}$ and the semantic embeddings $A=\left\{ \boldsymbol{a}_{c}\right\} _{c=1}^{S+U}$ into a common embedding space. In the IICL module, query embeddings with the same label are regarded as positive samples $\boldsymbol{z}^{+}$ and the others as negative samples $\boldsymbol{z}^{-}$, and instance-instance contrastive learning is performed to optimize the visual manifold structure. In addition, the ISCL module selects the semantic feature $\boldsymbol{s}_{i}$ corresponding to the class of $\boldsymbol{z}_{i}$ as the only positive sample and the remaining $S-1$ semantic features as negative samples to perform instance-semantic contrastive learning. Meanwhile, unseen semantic embeddings are used in the KT module to enable knowledge transfer via the category similarity between seen and unseen embeddings.

3.2 FB Module

The decoder outputs $\mathcal {Q}=\left\{ \boldsymbol{q}_{i}\right\} _{i=1}^{M},\boldsymbol{q}_{i}\in \mathbb {R}^{D} $ from M learnable object queries, each of which has a corresponding bounding box and category. The classifier then assigns each query embedding to one of $S+U+1$ classes: S seen classes, U unseen classes, and background. However, most query embeddings will be predicted as background due to the lack of supervision from visual images of unseen classes. To alleviate the confusion between unseen classes and backgrounds, we introduce $FB: \mathbb {R}^{D} \rightarrow [0,1]$ to separate the foreground and background.

Since M is generally larger than the number of categories $S+U$, we regard queries without actual categories as background. The FB module computes an objectness score $o_{i}$ for each query embedding $\boldsymbol{q}_{i}$. Its objective is to assign higher confidence to query embeddings corresponding to foreground objects than to those corresponding to the background. Therefore, the foreground and background separation loss function is defined as follows:

$$\begin{aligned} \mathcal {L}_{{FB}}=- \sum _{i=1}^{M} \left[ m_{i} \log o_{i}+\left( 1-m_{i}\right) \log \left( 1-o_{i}\right) \right] \end{aligned}$$

(1)

where:

$$\begin{aligned} m_{i}= {\left\{ \begin{array}{ll}1, &{} y_{i}\ \text {is the foreground} \\ 0, &{} y_{i}\ \text {is the background}\end{array}\right. } \end{aligned}$$

(2)
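As a minimal sketch of Eq. (1) and Eq. (2), the snippet below evaluates the binary cross-entropy over M objectness scores. The function name `fb_loss`, the clipping constant `eps`, and the example scores are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def fb_loss(objectness, is_foreground, eps=1e-7):
    """Binary cross-entropy of Eq. (1), summed over the M queries.

    objectness:    shape (M,), scores o_i in [0, 1]
    is_foreground: shape (M,), mask m_i (1 = foreground, 0 = background)
    eps:           clipping constant (an assumption, added to avoid log(0))
    """
    o = np.clip(np.asarray(objectness, dtype=float), eps, 1.0 - eps)
    m = np.asarray(is_foreground, dtype=float)
    return float(-np.sum(m * np.log(o) + (1.0 - m) * np.log(1.0 - o)))

# Two foreground and two background queries (made-up scores).
mask = [1, 1, 0, 0]
confident = fb_loss([0.9, 0.8, 0.2, 0.1], mask)
confused = fb_loss([0.1, 0.2, 0.8, 0.9], mask)
print(confident, confused)
```

The loss is small when foreground queries receive high objectness and background queries receive low objectness, and it grows when the two are confused.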

3.3 IICL/ISCL Module and KT Module

Due to the gap between visual and semantic features, using either the visual space or the semantic space as the common embedding space is not ideal. To align the two spaces, we use two functions $E_{v}$ and $E_{s}$ to map the query embeddings $\mathcal {Q}=\left\{ \boldsymbol{q}_{i}\right\} _{i=1}^{M}$ and the semantic embeddings $A=\left\{ \boldsymbol{a}_{c}\right\} _{c=1}^{S+U}$ into a common embedding space, where the manifold of visual and semantic features is optimized.

$$\begin{aligned} \boldsymbol{z}_{i}=E_{v}\left( \boldsymbol{q}_{i}\right) \end{aligned}$$

(3)

$$\begin{aligned} \boldsymbol{s}_{j}=E_{s}\left( \boldsymbol{a}_{j}\right) \end{aligned}$$

(4)

where $\boldsymbol{q}_{i}$ represents the i-th query embedding and $\boldsymbol{a}_{j}$ represents the semantic feature of the j-th class.
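The two mappings of Eq. (3) and Eq. (4) can be sketched as one fully-connected layer each, which is one reading of the implementation details given in Sect. 4.1. The query dimension 512, M = 100, the 300-dimensional word vectors, and the 80 classes are taken from the paper; the common embedding size `EMB`, the random weights, and the random inputs are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

D, A_DIM = 512, 300     # query and word-vector dimensions (from Sect. 4.1)
EMB = 256               # size of the common space (hypothetical)
M, NUM_CLASSES = 100, 80

# One fully-connected layer per mapping; weights are random stand-ins
# for parameters that would be learned during training.
W_v, b_v = rng.normal(scale=0.02, size=(D, EMB)), np.zeros(EMB)
W_s, b_s = rng.normal(scale=0.02, size=(A_DIM, EMB)), np.zeros(EMB)

def E_v(q):
    """Eq. (3): map query embeddings q_i to z_i."""
    return q @ W_v + b_v

def E_s(a):
    """Eq. (4): map semantic embeddings a_j to s_j."""
    return a @ W_s + b_s

q = rng.normal(size=(M, D))                # stand-in decoder outputs
a = rng.normal(size=(NUM_CLASSES, A_DIM))  # stand-in word vectors

z, s = E_v(q), E_s(a)
zeta = z @ s.T                             # dot-product scores z_i . s_j
print(z.shape, s.shape, zeta.shape)
```

After the mapping, visual and semantic features share one space, so a plain dot product can compare any query with any class.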

Instance-Instance Contrastive Learning (IICL). Instances with similar semantic attributes are usually close together in the embedding space, which leads to misclassifications. To reduce such misclassifications, we need to optimize the manifold structure of visual features in the embedding space. Contrastive loss promotes both alignment (closeness of features from positive pairs) and uniformity of the feature distribution [22], so we use it to learn discriminative features. Given an input image, query embeddings with the same label are regarded as positive samples $\boldsymbol{z}^{+}$, and the others are regarded as negative samples $\boldsymbol{z}^{-}$. We assume that there are $P_{i}$ positive samples and $N_{i}$ negative samples for the i-th object in the input image. The IICL loss $\mathcal {L}_{II}$ is as follows:

$$\begin{aligned} \mathcal {L}_{II}=\mathbb {E}\left[ -\log \frac{\exp \left( \boldsymbol{z}_{i} \cdot \boldsymbol{z}^{+}/\tau _{v}\right) }{\sum _{k=1}^{P_{i}}\exp \left( \boldsymbol{z}_{i} \cdot \boldsymbol{z}_{k}^{+}/\tau _{v}\right) +\sum _{j=1}^{N_{i}} \exp \left( \boldsymbol{z}_{i} \cdot \boldsymbol{z}_{j}^{-}/\tau _{v}\right) }\right] \end{aligned}$$

(5)

where $\tau _{v}$ is the temperature parameter of $\mathcal {L}_{II}$.
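One possible reading of Eq. (5) is sketched below: the expectation is taken as a mean over every (anchor, positive) pair, and the denominator sums over all other instances, positive or negative. The function name `iicl_loss` and the toy embeddings are assumptions made for illustration.

```python
import numpy as np

def iicl_loss(z, labels, tau_v=0.1):
    """Instance-instance contrastive loss, one reading of Eq. (5).

    z:      shape (n, d), embedded queries z_i
    labels: shape (n,), class label of each query
    tau_v:  temperature
    """
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    if n < 2:
        return 0.0
    logits = z @ z.T / tau_v
    not_self = ~np.eye(n, dtype=bool)
    positive = (labels[:, None] == labels[None, :]) & not_self
    # Log of the denominator: log-sum-exp over every instance but the anchor.
    masked = np.where(not_self, logits, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(
        np.exp(masked - row_max).sum(axis=1, keepdims=True))
    pair_losses = (log_denom - logits)[positive]
    return float(pair_losses.mean()) if pair_losses.size else 0.0

labels = np.array([0, 0, 1, 1])
# Same-class instances coincide, different classes are orthogonal.
tight = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
# Same-class instances are orthogonal, different classes coincide.
mixed = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(iicl_loss(tight, labels), iicl_loss(mixed, labels))
```

The loss is close to zero for the compact, well-separated layout and large for the mixed layout, which matches the goal of compact classes that lie far apart.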

Instance-Semantic Contrastive Learning (ISCL). The above loss pulls positive pairs of visual features closer together. In addition, to achieve an accurate visual-semantic alignment, we use the semantic information of the source classes for supervision: we select the semantic feature $\boldsymbol{s}_{i}$ corresponding to the class of $\boldsymbol{z}_{i}$ as the only positive sample and the remaining $S-1$ semantic features as negative samples. The instance-semantic contrastive loss $\mathcal {L}_{{IS}}$ is calculated as follows:

$$\begin{aligned} \mathcal {L}_{{IS}}=\mathbb {E}\left[ -\log \frac{\exp \left( \boldsymbol{z}_{i} \cdot \boldsymbol{s}^{+}/\tau _{s}\right) }{\exp \left( \boldsymbol{z}_{i} \cdot \boldsymbol{s}^{+}/\tau _{s}\right) +\sum _{j=1}^{S-1} \exp \left( \boldsymbol{z}_{i} \cdot \boldsymbol{s}_{j}^{-}/\tau _{s}\right) }\right] \end{aligned}$$

(6)

where $\tau _{s}$ is the temperature parameter of $\mathcal {L}_{{IS}}$ and S is the number of seen classes.
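Because each instance has exactly one positive semantic feature, Eq. (6) reduces to a temperature-scaled softmax cross-entropy over the S seen-class embeddings. The sketch below assumes this reading; the function name `iscl_loss` and the toy embeddings are illustrative.

```python
import numpy as np

def iscl_loss(z, s_seen, labels, tau_s=0.1):
    """Instance-semantic contrastive loss of Eq. (6).

    z:      shape (n, d), embedded queries z_i
    s_seen: shape (S, d), embedded semantic features of the seen classes
    labels: shape (n,), index in [0, S) of the positive class of each z_i
    """
    z = np.asarray(z, dtype=float)
    s = np.asarray(s_seen, dtype=float)
    labels = np.asarray(labels)
    logits = z @ s.T / tau_s                              # (n, S)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(labels)), labels].mean())

s_seen = np.eye(3)                       # three toy seen-class embeddings
z = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(iscl_loss(z, s_seen, [0, 2]))      # instances match their class
print(iscl_loss(z, s_seen, [1, 1]))      # instances match a wrong class
```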

Knowledge Transfer (KT). The task of ZSD is to recognize and localize unseen classes. If only the visual information and semantic embeddings of seen classes are used during training, the model is easily biased toward seen classes. To alleviate this bias, we use the category similarity between seen classes and unseen classes to transfer knowledge from the source classes to the target classes.

We assume that the semantic attribute of an unseen class can be obtained as a linear combination of the attributes of seen classes, an assumption that is also widely adopted in ZSL [23]. For example, a “zebra” has a shape like a “horse” and is black and white like a “panda”. Inspired by this observation, we use least squares regression (LSR) to obtain the reconstruction coefficient of each seen-class semantic attribute. The reconstruction coefficient is the category similarity, which is calculated as follows:

$$\begin{aligned} \boldsymbol{d}_{u}=\arg \min _{\boldsymbol{d}_{u}}\left\| \boldsymbol{a}_{u}-\sum _{k=1}^{S} \boldsymbol{a}_{k} d_{u k}\right\| _{2}^{2}+\beta \left\| \boldsymbol{d}_{u}\right\| _{2} \end{aligned}$$

(7)

where $d_{u k}$ is the category similarity between the u-th unseen class and the k-th seen class, $\boldsymbol{a}_{u}\in \left\{ \boldsymbol{a}_{c}\right\} _{c=S+1}^{S+U}$, $\boldsymbol{a}_{k}\in \left\{ \boldsymbol{a}_{c}\right\} _{c=1}^{S}$, and $\beta $ is the regularization coefficient. Once we have the similarity $\boldsymbol{d}_{u}$ between unseen classes and seen classes, we can use the images of seen classes to learn the similar unseen classes. The knowledge transfer loss $\mathcal {L}_{{KT}}$ is defined as:

$$\begin{aligned} \mathcal {L}_{{KT}}=-\frac{1}{N} \sum _{i=1}^{N} \sum _{j=S+1}^{S+U} \left[ d_{j y_{i}} \log \widetilde{\zeta _{i j}}+\left( 1-d_{j y_{i}}\right) \log \left( 1-\widetilde{\zeta _{i j}}\right) \right] \end{aligned}$$

(8)

where $\zeta _{i j} = \boldsymbol{z}_{i} \cdot \boldsymbol{s}_{j}$, with $\boldsymbol{z}_{i}$ and $\boldsymbol{s}_{j}$ calculated by Eq. (3) and Eq. (4), and $\widetilde{\zeta _{i j}}$ is the normalized value of $\zeta _{i j}$.
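The KT module can be sketched in two steps: estimating the category similarity of Eq. (7) and evaluating the loss of Eq. (8). Two assumptions are made here and are not confirmed by the paper. First, the penalty is taken as the squared norm $\beta \Vert \boldsymbol{d}_{u}\Vert _{2}^{2}$ (ridge regression), because that variant has a closed-form solution, whereas Eq. (7) as printed uses the unsquared norm. Second, the normalization of $\zeta _{i j}$ is taken to be a sigmoid. The function names and the toy vectors are hypothetical.

```python
import numpy as np

def category_similarity(a_seen, a_unseen, beta=1e-3):
    """Reconstruction coefficients d_u of unseen classes from seen classes.

    Assumption: squared-norm penalty (ridge), solved in closed form as
    (A A^T + beta I) d_u = A a_u for each unseen class u.
    """
    A = np.asarray(a_seen, dtype=float)       # (S, dim), rows a_k
    U = np.asarray(a_unseen, dtype=float)     # (U, dim), rows a_u
    gram = A @ A.T + beta * np.eye(A.shape[0])
    return np.linalg.solve(gram, A @ U.T).T   # (U, S), entry [u, k] = d_uk

def kt_loss(z, s_unseen, seen_labels, d, eps=1e-7):
    """Knowledge-transfer loss of Eq. (8).

    Assumption: the normalization of zeta is a sigmoid.
    z: (N, dim) instances of seen classes, seen_labels: (N,) their labels,
    s_unseen: (U, dim) unseen semantic features, d: (U, S) similarities.
    """
    zeta = np.asarray(z, dtype=float) @ np.asarray(s_unseen, dtype=float).T
    zeta_norm = np.clip(1.0 / (1.0 + np.exp(-zeta)), eps, 1.0 - eps)
    target = np.asarray(d, dtype=float)[:, seen_labels].T   # (N, U): d_{j, y_i}
    bce = target * np.log(zeta_norm) + (1.0 - target) * np.log(1.0 - zeta_norm)
    return float(-bce.sum(axis=1).mean())

# Toy word vectors (made up): "zebra" is mostly horse-like, partly panda-like.
a_seen = np.array([[1.0, 0.0, 0.0],      # horse
                   [0.0, 1.0, 0.0],      # panda
                   [0.0, 0.0, 1.0]])     # cat
a_unseen = np.array([[0.7, 0.3, 0.0]])   # zebra
d = category_similarity(a_seen, a_unseen)
print(np.round(d, 3))
```

The similarities act as soft targets: an instance of a seen class is pushed toward the semantic feature of an unseen class in proportion to how similar the two classes are. Note that regression coefficients are not guaranteed to lie in [0, 1], so a practical implementation may need to rescale them before using them as targets.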

Regression Loss. As with Deformable DETR [26], we use a linear combination of the L1 loss and the IoU loss as our regression loss $\mathcal {L}_{reg}$:

$$\begin{aligned} \mathcal {L}_{reg}=\lambda _{ iou } \mathcal {L}_{i o u}\left( b_{i}, \hat{b}_{i}\right) +\lambda _{L 1}\left\| \left( b_{i}-\hat{b}_{i}\right) \right\| _{1} \end{aligned}$$

(9)
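A minimal sketch of Eq. (9) for a single predicted box follows. It assumes boxes in corner format (x1, y1, x2, y2) and an IoU loss of the form 1 - IoU; neither detail is specified in the paper. The default weights 2.0 and 5.0 are the values reported in Sect. 4.1.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def regression_loss(pred, target, lambda_iou=2.0, lambda_l1=5.0):
    """Eq. (9): weighted sum of an IoU loss (assumed 1 - IoU) and the L1 loss."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return lambda_iou * (1.0 - box_iou(pred, target)) + lambda_l1 * l1

print(regression_loss([0.0, 0.0, 2.0, 2.0], [0.0, 0.0, 2.0, 2.0]))
print(regression_loss([0.0, 0.0, 2.0, 2.0], [1.0, 1.0, 3.0, 3.0]))
```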

3.4 Training and Inference

Training. The proposed method includes FB loss $\mathcal {L}_{FB}$, regression loss $\mathcal {L}_{reg}$, IICL loss $\mathcal {L}_{II}$, ISCL loss $\mathcal {L}_{{IS}}$, KT loss $\mathcal {L}_{{KT}}$. The total loss function is as follows:

$$\begin{aligned} \mathcal {L}_{total} = \mathcal {L}_{reg} + \mathcal {L}_{{IS}} + \alpha \mathcal {L}_{{KT}} + \gamma \mathcal {L}_{II} +\lambda \mathcal {L}_{FB}. \end{aligned}$$

(10)

where $\alpha ,\gamma ,\lambda $ are hyper-parameters that balance the loss terms. We train our model using a two-stage training approach. In the first stage, we use all seen classes to train a Deformable DETR framework that contains only the FB module. In the second stage, we replace the classifier in Deformable DETR with our IICL module, ISCL module, and KT module. The model is then fine-tuned starting from the first-stage parameters.
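The weighting in Eq. (10) can be written out as a small helper. The default weights 0.2, 0.3, and 0.1 are the values reported in Sect. 4.1; the function name and the example loss values are made up for illustration.

```python
def total_loss(l_reg, l_is, l_kt, l_ii, l_fb, alpha=0.2, gamma=0.3, lam=0.1):
    """Eq. (10): L_reg + L_IS + alpha * L_KT + gamma * L_II + lambda * L_FB."""
    return l_reg + l_is + alpha * l_kt + gamma * l_ii + lam * l_fb

# Hypothetical per-term loss values for one training step.
print(total_loss(l_reg=1.2, l_is=0.8, l_kt=0.5, l_ii=0.4, l_fb=0.3))
```

The regression and instance-semantic terms enter with unit weight, while the knowledge-transfer, instance-instance, and foreground-background terms are scaled down.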

Inference. Given a test image I, M object query embeddings $\mathcal {Q}=\left\{ \boldsymbol{q}_{i}\right\} _{i=1}^{M}$ are computed, and bounding boxes are obtained with the regressor. The object query embeddings are then mapped into the common embedding space, where the class of each query is predicted by nearest neighbor search.
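The nearest neighbor search at inference can be sketched as an argmax over similarity scores between mapped queries and class embeddings. Using the dot product as the similarity is an assumption, chosen for consistency with the training losses; the function name and the toy embeddings are illustrative.

```python
import numpy as np

def predict_classes(z, s_all):
    """Assign each mapped query z_i to its most similar class embedding.

    z:     shape (M, d), queries in the common embedding space
    s_all: shape (C, d), class embeddings in the same space
    Returns (class indices, similarity scores); the similarity is the
    dot product (an assumption).
    """
    scores = np.asarray(z, dtype=float) @ np.asarray(s_all, dtype=float).T
    return scores.argmax(axis=1), scores.max(axis=1)

s_all = np.eye(3)                                    # three toy class embeddings
z = np.array([[0.9, 0.1, 0.0], [0.1, 0.2, 0.8]])     # two toy queries
labels, scores = predict_classes(z, s_all)
print(labels, scores)
```

In ZSD the search would run over unseen-class embeddings only, and in GZSD over both seen and unseen classes.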

4 Experiments

4.1 Experimental Settings

Datasets. We evaluate our method on MSCOCO 2014 [10], which contains 82,783 training images and 40,504 validation images. For the 80 categories of MSCOCO 2014, we follow the 65/15 seen/unseen split [15]. As semantic embeddings in the classification subnet, we use 300-dimensional word2vec vectors [15].

Evaluation Protocol. For MSCOCO 2014, we choose mAP and Recall@100 as our evaluation metrics. We conduct experiments under both standard and generalized settings and evaluate the Harmonic Mean (HM) to show the performance of GZSD.
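The paper does not spell out the formula for HM; the sketch below assumes the usual GZSD definition, the harmonic mean of the seen-class and unseen-class scores. The input numbers are made up and are not results from the paper.

```python
def harmonic_mean(seen, unseen):
    """Harmonic mean of seen and unseen performance (usual GZSD definition)."""
    if seen + unseen == 0:
        return 0.0
    return 2.0 * seen * unseen / (seen + unseen)

# Made-up scores: a large seen/unseen gap pulls HM toward the weaker side.
print(harmonic_mean(40.0, 10.0))
print(harmonic_mean(25.0, 25.0))
```

HM rewards balanced performance: a model that does well on seen classes but poorly on unseen classes scores lower than one with the same arithmetic mean and a smaller gap.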

Implementation Details. We use ResNet101 pretrained on ImageNet to extract multi-scale features. The transformer encoder-decoder structure is consistent with the standard Deformable DETR. The dimension of object queries is 512 and M is set to 100. The regressor consists of 3 multi-layer perceptrons (MLP), the FB module is one fully-connected layer, and the mapping functions $E_{s}$ and $E_{v}$ are implemented with one fully-connected layer. The hyperparameters $\alpha , \gamma , \lambda $ in Eq. (10) are set to 0.2, 0.3, and 0.1. The temperature parameters $\tau _{v}, \tau _{s}$ in Eq. (5) and Eq. (6) are both set to 0.1, and the hyperparameters $\lambda _{ iou },\lambda _{L 1}$ in Eq. (9) are set to 2.0 and 5.0. The TZSDC framework is trained using the SGD optimizer, with a learning rate of 0.01 and momentum of 0.9 for 50 epochs in the first stage, and a learning rate of 0.002 and momentum of 0.999 for 20 epochs in the second stage.

4.2 Comparison with Other Methods

As shown in Table 1, we compare the performance of the proposed model with TL-ZSD [14], PL [15], BLC [24], and SU-ZSD [7] on MSCOCO for both ZSD and GZSD. Our method achieves the best performance on both mAP and recall for ZSD. Compared with the second-best method, SU-ZSD [7], the mAP of our method improves from 19.00% to 19.58% and the recall improves from 54.00% to 56.45%, which indicates that our method improves the discriminative ability for unseen classes. For GZSD, our method achieves the best performance on unseen classes. The unseen performance is improved without sacrificing much seen accuracy, and our HM value is competitive with the generative model SU-ZSD [7]. This shows that our model has learned a good visual-semantic alignment, which realizes knowledge transfer from seen classes to unseen classes.

Table 2 shows the class-wise AP performance for ZSD. Our method improves the AP of “mouse”, “hotdog”, and “hairdrier”, which are not similar to the seen classes at all. This indicates that our model extracts the contextual information of the images and is able to understand the scene; for example, where there is a computer or keyboard, there is usually a mouse. Moreover, our method achieves the best performance in 8 out of 15 categories, further demonstrating its advantage.

Table 1. Comparison with other methods for ZSD/GZSD on MSCOCO dataset. We report both mAP(%) and recall@100. Bold represents the best result.

Table 2. Class-wise AP comparison with other methods on unseen classes of MSCOCO with 65/15 split for ZSD.

4.3 Ablation Studies

To further verify the effectiveness of each component, we conduct ablation studies on the MSCOCO dataset with the 65/15 split. Table 3 shows the mAP of our model for ZSD and GZSD under different combinations of components. $\surd $ indicates the model with corresponding module loss.

The Effect of FB Module. To verify the contribution of the FB module, we remove it during training. The ZSD performance and the unseen performance in GZSD drop from 19.58% to 19.25% and from 19.20% to 18.97%, respectively, while the seen performance improves by only 0.06%. The result shows that adding the FB module effectively reduces the confusion between unseen classes and backgrounds.

Table 3. Effectiveness of each loss term for both ZSD and GZSD, measured by the mAP on MSCOCO 2014 with 65/15 split.

The Effect of KT Module. During training, we remove the loss function $\mathcal {L}_{{KT}}$, so that only the visual features and semantic attributes of seen classes are used and the semantic features of unseen classes are not involved. The result in Table 3 shows that the ZSD performance drops by 2.35% and the unseen performance in GZSD drops sharply by 5.29%. Without explicit knowledge transfer from seen classes to unseen classes through category similarity, both ZSD and GZSD performance drop, and GZSD performance drops more sharply. This indicates that knowledge transfer has a greater impact on GZSD and can effectively alleviate the bias of the learned model toward seen classes in GZSD.

The Effect of IICL Module. After removing the IICL Module, ZSD performance and unseen performance in GZSD drop by 1.44% and 1.23%, respectively. The result shows that IICL can optimize the visual feature distribution in embedding space, enabling the model to learn more discriminative features.

4.4 Qualitative Result

To qualitatively evaluate our results, we show the detection results of our method on MSCOCO in Fig. 2. For ZSD, the image contains only unseen classes; for GZSD, the image may contain both seen and unseen classes. The results show that the proposed model is able to detect seen and unseen classes in different complex scenes, and that it can detect objects at multiple scales, such as the large-scale “train” and “bed” and the small-scale “traffic light” and “suitcase”, which verifies the effectiveness of the proposed model.

5 Conclusion

In this paper, we propose a novel framework for ZSD named Transformer-based Zero-Shot Detection via Contrastive Learning (TZSDC), which includes Deformable DETR, an FB module, an IICL module, and a KT module. Deformable DETR extracts multi-scale contextual features, the FB module separates foreground objects from the background to alleviate the confusion between unseen classes and the background, the IICL module optimizes the visual manifold structure in the embedding space to make the visual features more discriminative, and the KT module transfers knowledge from seen to unseen classes via category similarity. Experiments on MSCOCO validate the effectiveness of the proposed method for ZSD and GZSD.

References

1. Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162 (2018)
2. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
3. Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst. 29 (2016)
4. Dai, J., et al.: Deformable convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773 (2017)
5. Frome, A., Corrado, G., Shlens, J., et al.: A deep visual-semantic embedding model. In: Proceedings of the Advances in Neural Information Processing Systems, pp. 2121–2129 (2013)
6. Gupta, D., Anantharaman, A., Mamgain, N., Balasubramanian, V.N., Jawahar, C., et al.: A multi-space approach to zero-shot object detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1209–1217 (2020)
7. Hayat, N., Hayat, M., Rahman, S., Khan, S., Zamir, S.W., Khan, F.S.: Synthesizing the unseen for zero-shot object detection. In: Proceedings of the Asian Conference on Computer Vision (2020)
8. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
9. Li, Y., Shao, Y., Wang, D.: Context-guided super-class inference for zero-shot detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop, pp. 944–945 (2020)
10. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
11. Liu, W., et al.: SSD: single shot MultiBox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
12. Liu, X., Liu, X., Zhang, W., Wand, J., Wang, F.: Parallel data: from big data to data intelligence. Pattern Recogn. Artif. Intell. 30(8), 9 (2017)
13. Pang, J., Chen, K., Shi, J., Feng, H., Ouyang, W., Lin, D.: Libra R-CNN: towards balanced learning for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 821–830 (2019)
14. Rahman, S., Khan, S., Barnes, N.: Transductive learning for zero-shot object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6082–6091 (2019)
15. Rahman, S., Khan, S., Barnes, N.: Improved visual-semantic alignment for zero-shot object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 11932–11939 (2020)
16. Rahman, S., Khan, S., Porikli, F.: Zero-shot object detection: learning to simultaneously recognize and localize novel concepts. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11361, pp. 547–563. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-20887-5_34
17. Rahman, S., Khan, S.H., Porikli, F.: Zero-shot object detection: joint recognition and localization of novel concepts. Int. J. Comput. Vis. 128(12), 2979–2999 (2020)
18. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271 (2017)
19. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 28 (2015)
20. Socher, R., Ganjoo, M., Manning, C.D., Ng, A.: Zero-shot learning through cross-modal transfer. Adv. Neural Inf. Process. Syst. 26 (2013)
21. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9627–9636 (2019)
22. Wang, T., Isola, P.: Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In: International Conference on Machine Learning, pp. 9929–9939. PMLR (2020)
23. Xie, G.S., et al.: Region graph embedding network for zero-shot learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12349, pp. 562–580. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58548-8_33
24. Zheng, Y., Huang, R., Han, C., Huang, X., Cui, L.: Background learnable cascade for zero-shot object detection. In: Proceedings of the Asian Conference on Computer Vision (2020)
25. Zhu, P., Wang, H., Saligrama, V.: Don’t even look once: synthesizing features for zero-shot detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11693–11702 (2020)
26. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)

Acknowledgments

This work is supported by the National Science Foundation of China (No. 62088102), China National Postdoctoral Program for Innovative Talents from China Postdoctoral Science Foundation (No. BX2021239).

Author information

Authors and Affiliations

Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, 710049, Shaanxi, China
Wei Liu, Hui Chen, Yongqiang Ma, Jianji Wang & Nanning Zheng

Corresponding author

Correspondence to Nanning Zheng.

Editor information

Editors and Affiliations

University of Piraeus, Piraeus, Greece
Ilias Maglogiannis
Democritus University of Thrace, Xanthi, Greece
Lazaros Iliadis
University of Sunderland, Sunderland, UK
John Macintyre
Universidade do Minho, Guimaraes, Portugal
Paulo Cortez

Cite this paper

Liu, W., Chen, H., Ma, Y., Wang, J., Zheng, N. (2022). Transformer-Based Zero-Shot Detection via Contrastive Learning. In: Maglogiannis, I., Iliadis, L., Macintyre, J., Cortez, P. (eds) Artificial Intelligence Applications and Innovations. AIAI 2022. IFIP Advances in Information and Communication Technology, vol 646. Springer, Cham. https://doi.org/10.1007/978-3-031-08333-4_26

DOI: https://doi.org/10.1007/978-3-031-08333-4_26
Published: 10 June 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08332-7
Online ISBN: 978-3-031-08333-4
eBook Packages: Computer Science, Computer Science (R0), Springer Nature Proceedings Computer Science
