The technology behind Autonomous Driving Cars (ADC) keeps improving at a fast pace, providing plenty of food for thought and R&D, and approaching day by day an inevitable inflexion point. I recently talked with people close to me who are not well versed in technological matters, and I argued that what is happening and will happen with these technologies will force us all to reconsider a lot of what we now take for granted. One of the inevitable transformations that ADC might bring to our lives will concern urban development and transport systems. My interlocutor was and still is not completely convinced, but to me the transformations might come sooner rather than later, and sooner even than the median optimist expects.
This serves well as an introduction to the paper I want to review here today in The Information Age. The paper is titled Joint Attention in Autonomous Driving (JAAD), and it is a nice analysis of precisely the challenges that still hover over the development of ADC before it can become a staple in the streets of every country and city where people choose to live or work.
Despite such success stories in autonomous control systems, designing fully autonomous vehicles for urban environments still remains an unsolved problem. Aside from challenges associated with developing suitable infrastructures and regulating the autonomous behaviors, in order to be usable in urban environments autonomous cars must have a high level of precision and meet very high safety standards.
Today one of the major dilemmas faced by autonomous vehicles is how to interact with the environment including infrastructure, cars, drivers or pedestrians. The lapses in communication can be a source of numerous erroneous behaviors such as failure to predict the movement of other vehicles or to respond to unexpected behaviors of other drivers.
The authors of the paper stress that for ADC to take over as a mainstream technology, the major issues will revolve around communication between all the agents involved: how the interaction between humans, with their perceptual errors, and the devices can be optimized to minimize “perception discrepancy”, and what might in the future really work and solve those issues permanently:
There have been a number of recent developments to address these issues. A natural solution is establishing wireless communication between traffic participants. This approach has been tested for a number of years using cellular technology. This technique enables vehicle to vehicle (V2V) and vehicle to infrastructure (V2I) communication allowing tasks such as Cooperative Adaptive Cruise Control (CACC), improving positioning technologies such as GPS, and intelligent speed adaptation in various roads. Peer to peer traffic communication is expected to enter the market by 2019.
Although V2V and V2I communications are deemed to solve a number of issues in autonomous driving, they also have a number of drawbacks. This technology relies heavily on cellular technology which is costly and has much lower reliability compared to traditional sensors such as radars and cameras. In addition, communication highly depends on all parties functioning properly. A malfunction in any communication device in any of the systems involved can lead to catastrophic safety issues.
Their proposal, after judging that current technology is inadequate to solve these issues, revolves around a novel dataset and around viewing the challenge as an instance of joint attention in the communication between the ADC and its perceptual environment:
In an attempt to better understand the problem of vehicle to vehicle (V2V) and vehicle to pedestrian (V2P) communication in the autonomous driving context we suggest viewing it as an instance of joint attention and discuss why existing approaches may not be adequate in this context. We propose a novel dataset that highlights the visual and behavioral complexity of traffic scene understanding and is potentially valuable for studying the joint attention issues.
Definition of Joint Attention
The definition of joint attention presented by the authors is well worth highlighting here, as I think it is wider than a technological setting and significant to other fields of study. Among those I would include Robotics, Cognitive Psychology, Neuroscience, Economics and Educational Studies:
According to a common definition, joint attention is the ability to detect and influence an observable attentional behavior of another agent in social interaction and acknowledge them as an intentional agent. However, it is important to note that joint attention is more than simultaneous looking, attention detection and social coordination, but also includes an intentional understanding of the observed behavior of others.
Since joint attention is a prerequisite for efficient communication, it has been gaining increasing interest in the fields of robotics and human-robot interaction. Kismet and Cog, both built at MIT in the late 1990s, were some of the first successes in social robotics. These robots were able to maintain and follow eye gaze, reacted to the behavior of their caregivers and recognized simple gestures such as declarative pointing. More recent work in this area is likewise concerned with gaze following, pointing and reaching, turn-taking and social referencing. With a few exceptions, almost all joint attention scenarios are implemented with stationary robots or robotic heads according to a recent comprehensive survey.
The JAAD Dataset and proposals
Joint attention involves quite a broad set of scenarios and perceptual challenges, coupled with the right judgement that agents (social or artificial) must make from the available cues to maximize safety in ambiguous situations:
While these are fairly typical behaviors for marked crossings, there are many more possible scenarios of communication between the traffic participants. Humans recognize a myriad of “social cues” in everyday traffic situations. Apart from establishing eye contact or waving hands, people may be making assumptions about the way a driver would behave based on visual characteristics such as the car’s make and model. Understanding these social cues is not always straightforward. Aside from visual processing challenges such as variation in lighting conditions, weather or scene clutter, there is also a need to understand the context in which the social cue is observed. For instance, if the autonomous car sees someone waving his hand, it needs to know whether it is a policeman directing traffic, a pedestrian attempting to cross the street or someone hailing a taxi.
And the technologies being deployed are still somewhat in a stage of trial and error, despite the promise they surely hold of further breakthroughs.
Today, automotive industry giants such as BMW, Tesla, Ford and Volkswagen, who are actively working on autonomous driving systems, rely on visual analysis technologies developed by Mobileye to handle obstacle avoidance, pedestrian detection or traffic scene understanding. Mobileye’s approach to solving visual tasks is to use deep learning techniques which require a large amount of data collected from hundreds of hours of driving. This system has been successfully tested and is currently being used in semi-autonomous vehicles. However, the question remains open whether deep learning suffices for achieving full autonomy in which tasks are not limited to detection of pedestrians, cars or obstacles (which are still not fully reliable), but also involve merging with ongoing traffic, dealing with unexpected behaviors such as jaywalking, responding to emergency vehicles, and yielding to other vehicles or pedestrians at intersections.
To answer this question we need to consider the following characteristics of deep learning algorithms. First, even though deep learning algorithms perform very well in tasks such as object recognition, they lack the ability to establish causal relationships between what is observed and the context in which it has occurred. This problem also has been empirically demonstrated by training neural networks over various types of data. The second limitation of deep learning is the lack of robustness to changes in visual input. This problem can occur when a deep neural network misclassifies an object due to minor changes (at a pixel level) to an image or even recognizes an object from a randomly generated image.
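To make the second limitation more tangible, here is a minimal sketch of my own; it is not taken from the paper. It assumes a toy linear "pedestrian vs. background" classifier with made-up weights and a made-up image, and nudges every pixel by a small amount in the direction that lowers the score, in the spirit of the fast gradient sign idea. The names score, perturb and demo, the 1024 pixels and the 0.02 step are all hypothetical choices for illustration; real deep networks are far more complex, but the mechanism of many tiny per-pixel changes adding up is similar.

```python
import random


def score(weights, bias, pixels):
    """Linear decision score: positive reads as 'pedestrian', otherwise 'background'."""
    return sum(w * p for w, p in zip(weights, pixels)) + bias


def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def perturb(weights, pixels, epsilon):
    """Move each pixel by at most epsilon in the direction that lowers the score."""
    return [p - epsilon * sign(w) for w, p in zip(weights, pixels)]


def demo(num_pixels=1024, epsilon=0.02, seed=0):
    """Build a toy classifier and image (both hypothetical), then perturb the image."""
    rng = random.Random(seed)
    # Made-up weights and pixel intensities; pixels stay well inside [0, 1].
    weights = [rng.uniform(-0.01, 0.01) for _ in range(num_pixels)]
    pixels = [rng.uniform(0.2, 0.8) for _ in range(num_pixels)]
    # Choose the bias so the clean image scores about +0.05 ('pedestrian').
    bias = 0.05 - sum(w * p for w, p in zip(weights, pixels))

    attacked = perturb(weights, pixels, epsilon)
    clean_score = score(weights, bias, pixels)
    attacked_score = score(weights, bias, attacked)
    max_change = max(abs(a - p) for a, p in zip(attacked, pixels))
    return clean_score, attacked_score, max_change


if __name__ == "__main__":
    clean, attacked, change = demo()
    print(f"clean score:    {clean:+.3f}")
    print(f"attacked score: {attacked:+.3f}")
    print(f"largest pixel change: {change:.3f}")
```

Running the script prints a positive score for the clean image and a negative one for the perturbed image, even though no pixel moves by more than 2% of its range. The drop in score equals the step size times the sum of the absolute weights, so the more pixels there are, the smaller the per-pixel change needed to flip the decision.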
Further still, the lack of widely available public datasets that would support the work of researchers in the field compounds those problems. The proposal made by this paper is precisely a novel dataset, more appropriate for addressing these challenges. It consists mainly of video clips featuring the kind of situations where a deep artificial neural network might improve at judging the ambiguity it encounters in real-world urban settings (with a few rural settings added for robustness):
The JAAD dataset was created to facilitate studying the behavior of traffic participants. The data consists of 346 high-resolution video clips (5-15s) with annotations showing various situations typical for urban driving. These clips were extracted from approx. 240 hours of driving videos collected in several locations. Two vehicles equipped with wide-angle video cameras were used for data collection (Table I). Cameras were mounted inside the cars in the center of the windshield below the rear view mirror. The video clips represent a wide variety of scenarios involving pedestrians and other drivers. Most of the data is collected in urban areas (downtown and suburban), only a few clips are filmed in rural locations. Many of the situations resemble the ones we have described earlier, where pedestrians wait at the designated crossings. In other samples pedestrians may be walking along the road and look back to see if there is a gap in traffic (Figure 4c), peek from behind the obstacle to see if it is safe to cross (Figure 4d), waiting to cross on a divider between the lanes, carrying heavy objects or walking with children or pets. Our dataset captures pedestrians of various ages walking alone and in groups, which may be a factor affecting their behavior. For example, elderly people and parents with children may walk slower and be more cautious. The dataset contains fewer clips of interactions with other drivers, most of them occur in uncontrolled intersections, in parking lots or when another driver is moving across several lanes to make a turn.
For further analysis of the figures, tables and references, I encourage the reader to read the full paper, which its authors conclude this way:
In this paper we presented a new dataset for the purpose of studying joint attention in the context of autonomous driving. Two types of annotations accompanying each video clip in the dataset make it suitable for pedestrian and car detection, as well as other areas of research, which could benefit from studying joint attention and human non-verbal communication, such as social robotics.
Featured & Inserted Images: Autonomous driving in urban environments: approaches, lessons and challenges