Intelligent Surveillance with Computer Vision

How is Computer Vision Used?

Computer vision refers to a set of techniques for analyzing and understanding digital images and videos, with computers acting as “eyes” that can see. It has many commercial applications: mobile phones with facial recognition, object tracking algorithms that detect shoplifters in video surveillance, automatic number plate recognition systems that law enforcement uses to identify cars involved in criminal activity, and more. The technology offers great value where security and SOP (standard operating procedure) compliance are critical, because it works with existing video surveillance to generate intelligence from footage recorded around the clock. As per the Seagate trend report, 88% of surveillance cameras record 24/7, and computer vision-based platforms offer an excellent way to mine this footage for intelligence. However, it is important to understand the practical challenges that CV faces and how they can be overcome. If you are deploying an AI-based video analytics solution in your enterprise, read on to see what you should expect from it and how to minimize false claims.

Challenges in Computer Vision Models

1. Accuracy in detection

Computer vision based on CNN models promises greater accuracy in object detection than simple image-based processing, because it takes various parameters into consideration for detection. The accuracy of object detection depends on several factors, such as the training dataset, lighting conditions, the resolution of the image captured by the camera, and CCTV angles. The model also needs to be trained separately for day and night lighting for greater accuracy. Another major challenge for computer vision algorithms is the lack of diverse, annotated datasets prepared for a particular environment. Readily available algorithms use models trained largely on images of foreign or fair-skinned people, which can introduce considerable bias because physical characteristics vary widely across populations. These factors may lead to false positives or false negatives, which makes the inferences harder to interpret.
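
One common way to quantify the effect of false positives and false negatives is with precision and recall. The sketch below is a minimal illustration of those two measures; the counts in the usage lines are made-up numbers, not results from any real deployment.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of raised detections that were correct."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Share of real events that the model actually detected."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


# Hypothetical counts from manually reviewing one day of alerts.
tp, fp, fn = 80, 20, 10
print(f"precision = {precision(tp, fp):.2f}")
print(f"recall    = {recall(tp, fn):.2f}")
```

Many false positives pull precision down, while many missed events pull recall down, so tracking both gives a clearer picture than a single accuracy figure.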

2. Compute and Bandwidth resources

Computer vision models can be finicky: they often require specific dependencies and sometimes even specialized hardware. Bulky models, such as many off-the-shelf ones, may end up consuming a lot of resources and computing power. These deep learning models can be deployed in the cloud as well as at the edge. They provide benefits such as real-time remote access and detection on video footage, and they can work with any IP-based camera that supports the ONVIF protocol. AI solutions deployed in cloud environments offer simplified management and scalable computing assets, but they require a considerable amount of processing power. Every frame of a recorded video has to be transferred to the cloud before it can be processed, and an Internet connection is required at all times. The bandwidth needed for such a data load is immense, because the system is meant to capture and run inference on 30 images per second per camera (30 FPS). For an average setup of 100 cameras, that comes to 30 × 100 × 86,400 = 259.2 million images per day, which makes the workload voluminous and extremely expensive.
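
The daily image volume quoted above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are illustrative, and the per-frame size used for the upload estimate is an assumed figure, since real camera streams are compressed and vary widely.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def frames_per_day(cameras: int, fps: int) -> int:
    """Total frames produced in one day by `cameras` cameras at `fps`."""
    return cameras * fps * SECONDS_PER_DAY


def daily_upload_terabytes(cameras: int, fps: int, bytes_per_frame: int) -> float:
    """Rough upload volume per day if every frame is sent to the cloud."""
    return frames_per_day(cameras, fps) * bytes_per_frame / 1e12


total = frames_per_day(cameras=100, fps=30)
print(f"{total:,} frames per day")

# ASSUMPTION: 100 KB per frame is a placeholder, not a measured value.
print(f"{daily_upload_terabytes(100, 30, 100_000):.2f} TB per day")
```

Changing the camera count or frame rate in the call shows how quickly the volume grows with either one.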

How do we Overcome these Challenges?

1. Model Selection based on the use case

With several deep learning model libraries available, model selection should focus on how well the model can be applied to the business problem. Once a model is selected, there is a cost attached to that decision. A deep learning model needs to be evaluated in terms of its dependencies, its complexity of deployment and maintenance, and its limitations. The expected training time of the model is also a factor in the selection. A developer may also need to assess the model’s performance in terms of predictive accuracy, noise, training data availability, and extensibility.

2. Training on Specific Dataset

Some algorithms require specialized data preparation so that the learning algorithm sees a comprehensive picture of the problem. Hence, before settling on a model, its data preparation needs should also be carefully evaluated. Custom deep learning models trained for a specific environment on a full-feature dataset tend to have better accuracy. The model training phase applies the following steps to a raw training dataset and outputs a model that can then be validated and tested with separate validation and testing datasets:
  1. Data filtering
  2. Data transformation
  3. Feature selection
  4. Feature engineering
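
The four steps above, followed by the split into training, validation, and testing sets, can be sketched as below. This is a simplified illustration rather than a description of any specific training pipeline: the sample fields (width, height, brightness, label) and the 70/15/15 split are assumptions chosen for the example.

```python
import random


def filter_samples(samples):
    """Data filtering: drop samples that have no label or an empty image."""
    return [s for s in samples
            if s.get("label") is not None and s["width"] > 0 and s["height"] > 0]


def transform(sample):
    """Data transformation: scale 8-bit brightness to the range [0, 1]."""
    out = dict(sample)
    out["brightness"] = sample["brightness"] / 255.0
    return out


def select_features(sample, keep=("width", "height", "brightness", "label")):
    """Feature selection: keep only the fields the model will use."""
    return {key: sample[key] for key in keep}


def engineer_features(sample):
    """Feature engineering: derive a new feature from existing ones."""
    out = dict(sample)
    out["aspect_ratio"] = sample["width"] / sample["height"]
    return out


def prepare(samples):
    """Run the four preparation steps in order."""
    return [engineer_features(select_features(transform(s)))
            for s in filter_samples(samples)]


def split(samples, train_pct=70, val_pct=15, seed=0):
    """Shuffle, then split into training, validation, and testing sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

Keeping each step as its own function makes it easy to swap in environment-specific logic, for example a stricter filter for night-time footage.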

3. Model Performance optimization

In a real-world scenario, the performance of a selected custom deep learning model can be optimized in two ways:
  1.  Maximizing accuracy of inferences
  2.  Reducing compute hardware or bandwidth for communication

1. Maximizing accuracy of inferences

The accuracy of model inference can be improved by raising data quality and quantity, by evaluating and fine-tuning different algorithms, by re-framing the problem, or by focusing on specific features. For example, recent deep learning architectures such as YOLO and MobileNet-based detectors can be evaluated for handling object identification and classification in a single pass of the network. However, they still need to be fine-tuned and implemented carefully to perform well in terms of computation speed and accuracy.

2. Reducing compute hardware or bandwidth for communication

Compute and bandwidth are critical to the performance of deep learning models. Choosing well requires deep knowledge of the use case; otherwise, you may end up selecting a model that uses up all available resources and becomes very expensive to maintain. These two parameters can be optimized through:
A. Resolution and FPS
Frames per second (FPS) cannot be compromised when capturing the action itself is important, but it can be reduced for use cases that only require a count. Feeding the model smaller images at a lower FPS can reduce CPU usage by 6 to 8%. For example, for a head detection model, a lighter architecture such as Tiny YOLO that is still deep enough to give accurate results will take less computational power.
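
To illustrate the idea, the sketch below drops frames to reach a lower target FPS and compares how many pixels per second the model has to process before and after. The function names and the example resolutions are assumptions, and raw pixel throughput is only a rough proxy for compute cost; it is not meant to reproduce the CPU figure quoted above.

```python
def sample_frames(frames, source_fps: int, target_fps: int):
    """Yield every k-th frame so that about target_fps frames per second remain."""
    if not 0 < target_fps <= source_fps:
        raise ValueError("target_fps must be between 1 and source_fps")
    stride = source_fps // target_fps
    for index, frame in enumerate(frames):
        if index % stride == 0:
            yield frame


def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Number of pixels the model must process each second."""
    return width * height * fps


# One second of footage at 30 FPS, with frames represented by their index.
kept = list(sample_frames(range(30), source_fps=30, target_fps=5))
print(kept)

# Hypothetical settings: full HD at 30 FPS versus 640x360 at 5 FPS.
full = pixels_per_second(1920, 1080, 30)
reduced = pixels_per_second(640, 360, 5)
print(f"input volume reduced {full / reduced:.0f}x")
```

A counting use case might tolerate settings like the reduced ones, whereas an action-detection use case would usually need to stay closer to the full frame rate.
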
B. Optimize bandwidth
Intelligent load sharing between the central server and the edge, with scheduled load balancing for batch and real-time processing, helps optimize the bandwidth needed. Sending only metadata to the master/central server also saves on processing power. AIVID BOTs move machine-learning tasks from the cloud to high-performance servers that are connected to cameras or an NVR. This allows the data to be processed remarkably close to the source, reducing time delays. Only the results are then sent back to the cloud for further analysis, which reduces the load on bandwidth. It also ensures that processing does not stop midway if the internet connection drops, and there is always a backup available.
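
The bandwidth saving from sending only results can be illustrated by comparing the size of a small metadata message with the size of a raw frame. This is a hedged sketch: the message fields, camera ID, and timestamp are hypothetical and do not describe AIVID's actual message format. Real camera streams are also compressed, so the true saving is smaller than this raw comparison suggests.

```python
import json


def build_metadata(camera_id: str, timestamp: str, detections: list) -> dict:
    """Bundle inference results into a small message for the central server."""
    return {"camera_id": camera_id, "timestamp": timestamp, "detections": detections}


def payload_size_bytes(metadata: dict) -> int:
    """Size of the metadata once serialized as UTF-8 JSON."""
    return len(json.dumps(metadata).encode("utf-8"))


def raw_frame_size_bytes(width: int, height: int, channels: int = 3) -> int:
    """Size of one uncompressed frame with 8 bits per channel."""
    return width * height * channels


# Hypothetical detection produced on the edge server.
metadata = build_metadata(
    camera_id="cam-01",
    timestamp="2024-01-01T09:00:00Z",
    detections=[{"label": "person", "confidence": 0.91, "bbox": [412, 180, 64, 128]}],
)

print(payload_size_bytes(metadata), "bytes of metadata")
print(raw_frame_size_bytes(1920, 1080), "bytes in one raw full HD frame")
```

Because the edge server holds the frames and forwards only messages like this one, a dropped connection delays the upload of results rather than stopping the analysis itself.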

4. Scaling CV models with Distributed Architecture

A distributed architecture can easily scale out to support thousands of locations. AIVID’s platform is built on containers, so new sites can be added simply by adding new nodes, all managed through a centralized system.

Real-world application of Computer vision

People In/Out and Gender Detection

For this business use case, we would require a head detection or face detection model working in sync with an age/gender recognition model. Although various models are freely available, they tend to fall short in real-time scenarios. Most available face detection models work best with face images of around 100×120 pixels, but in real CCTV footage faces can be as small as 30×40 pixels. The available face detection models are also trained on cropped, clean face images and hardly consider “hair as a feature”, which is essential for gender prediction. Similarly, the currently available age/gender recognition models are trained on images of foreign populations and do not perform well with the Indian population. They work best only with zoomed-in faces, and some of them are biased by gender and age. On top of this, the freely available deep architecture models that claim better accuracy are bulky, and getting real-time inference from them requires a lot of computation power. Some other models require a high-end GPU configuration to deploy or to perform efficiently in real-time scenarios. Consequently, the final solution becomes expensive for the customer.

In this scenario, AIVID maximizes the accuracy of AIVID BOTs by training them on a database of region-specific demographic features of visitors. AIVID BOTs work with existing cameras. For age/gender recognition, we have built a balanced, region-specific dataset with images covering all age categories and genders.


Using computer vision and deep learning to transform surveillance into intelligent surveillance is the next step being adopted across industries, as it is humanly impossible to go through hours of video footage manually. Most surveillance cameras capture video 24/7, but the footage is used only for post-incident analysis. Computer vision and AI-based platforms make it possible to derive rich insights from this valuable but unused data and to automate inspection and incident alerting. However, it is important to understand how the technology works and what yields the best output, so that it leverages existing infrastructure rather than becoming another overhead to manage. AIVID, an AI-based visual inspection automation platform, can be easily integrated with your existing surveillance camera setup. AIVID BOTs are our custom deep learning models for different domains, trained to detect specific activities and send out real-time alerts. To know more about how we can enable you, write to us.