The success of deep learning models raises the question of how they achieve these impressive results and to what degree their performance resembles that of humans. Existing studies comparing human and machine performance have so far focused on the task of object recognition; a detailed comparison for the task of object detection is missing. Here, we compare state-of-the-art detectors with humans in terms of accuracy and similarity in attention processing. Human data were taken from the COCO-Search18 and COCO-FreeView datasets for three different tasks (detection targets present, detection targets absent, or no task given). We benchmarked the accuracy and saliency maps of six models (Faster R-CNN, YOLOv5, YOLOv8, DETR, and MDETR in detection and VQA modes) against human saliency and against DeepGaze, which directly predicts human saliency maps. Although the best models equaled or surpassed human accuracy, their performance pattern differed from that of humans. Likewise, our saliency comparison revealed crucial differences: even though the MDETR modes, when prompted to simulate different task contexts, were most similar to human saliency, they mostly failed to reach DeepGaze levels. Interestingly, increased similarity to human saliency went along with better model accuracy only for the MDETR modes. Overall, current detectors process images differently from humans; however, adding semantic information yields better alignment with human performance patterns, potentially enabling the development of more trustworthy, human-like AI systems.