Artificial Intelligence (AI) and Data

Artificial intelligence is here, and it is changing how businesses function. Few industries remain untouched: AI is accelerating the automation and transformation of business processes almost everywhere. Prominent voices in the field have called AI the new electricity or the new oil. We also hear about a future in which rapid advances in artificial intelligence lead to a singularity, the point at which machine intelligence approaches and then surpasses the way humans think, analyze and make decisions.

Machine learning is already being applied successfully in use cases such as recognizing objects in images, facial analysis, spam email detection, ad matching, fraud detection, recommendations, speech recognition, natural language processing and more. For example, Netflix uses AI to recommend movies tailored to each viewer. Amazon.com surfaces customized product recommendations based on your search and purchase history to suggest what to buy next. Facebook and Instagram use algorithms to suggest friends and customize newsfeeds, which helps grow your social network and pushes relevant content. Google uses ML for better ad targeting and search result ranking. Innovators in the manufacturing industry are using AI for supply and demand simulation to optimize inventory and operations. Companies such as Palantir are using visualization, search and analytics to discover hidden patterns in large data sets.

ML is an exciting interdisciplinary field in which robotics, language, speech, computer vision and a range of other sciences and technologies come together. Interest in applying ML methods to all kinds of problems is widespread, and ML is among the most sought-after skills today. In the last 10 years, learning algorithms have become very good at tasks such as reading handwritten characters, driving vehicles autonomously, powering virtual assistants, computer vision, search ranking and database mining. For instance, applying algorithms to electronic medical records helps detect trends and builds knowledge from historical data. Analyzing credit card and other electronic transactions helps identify potentially fraudulent activity.

These recent advances in AI are largely due to progress in big data and cloud utility computing, and to improvements in deep learning and other algorithms. Data can now be gathered and stored at low cost on cloud storage from services like Amazon AWS, Microsoft Azure and Google Cloud. Compute power and speed have improved considerably, with GPUs running ML workloads much faster than standard CPUs. Machine learning models can now be trained in hours by leveraging multiple GPUs on a grid of cheap servers. The wide adoption of connected IoT devices and sensors is increasing the volume of data by orders of magnitude. ML platforms and frameworks will continue to advance and make it easier to create models without being an expert in algorithms and AI technology. This momentum has already brought AI a lot of attention and investment. The convergence of these forces is helping the AI ecosystem grow at a fast pace.

Supervised, Unsupervised and Semi-supervised Learning

The most common form of machine learning is supervised learning, which learns a mapping from inputs to outputs. You feed the algorithm labeled examples, that is, input data paired with the known correct output, and it learns from this mapping to predict results on new data. This approach has been used successfully for use cases such as spam email detection, qualifying loan applicants based on attributes that match past history, and so on. Over the last few years, machines have become significantly better at sorting through unstructured data like images, audio and video. Detection of objects in images has become fairly accurate and at times catches things that the human eye may miss. Converting text to speech and translating between languages have also advanced considerably.
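The input-output idea can be illustrated with a toy spam filter. This is only a minimal sketch: the four labeled messages are made up, and the word-count naive Bayes classifier with add-one smoothing is one simple illustrative choice, not a method this article prescribes. The names train and predict are hypothetical.

```python
import math
from collections import Counter

# Toy labeled examples (made up for illustration): input text -> known label.
TRAINING_DATA = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {}          # label -> Counter of words
    label_counts = Counter()  # label -> number of training messages
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocabulary.add(word)
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_messages in label_counts.items():
        score = math.log(n_messages / total_messages)  # prior for this label
        # Add-one (Laplace) smoothing so an unseen word does not zero out a label.
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_DATA)
new_message_label = predict(model, "claim your free prize")
```

The model never sees "claim your free prize" during training; it labels the message from word statistics learned on the labeled examples, which is the sense in which supervised learning generalizes to new data.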

To get started with machine learning, you need to identify existing data sets and prepare them for training the algorithms. Companies have large amounts of structured data in their internal analytical and transactional databases. This data is the easiest and quickest to use as input for training neural networks and deep learning models in areas like product recommendations, personalization and fraud detection. You have to make sure the input data is correct if you want the training to produce high-quality outcomes.

Unsupervised learning involves running high volumes of unlabeled data through algorithms that detect structure and identify patterns, such as natural groupings of similar records, without human guidance.
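As a hedged illustration of finding structure in unlabeled data, here is a minimal k-means clustering sketch in plain Python. The six two-dimensional points are made up, k-means is just one of many unsupervised techniques, and the deterministic farthest-first initialization is a simplification chosen to keep the example reproducible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=10):
    """Group points into k clusters; returns (centroids, cluster index per point)."""
    # Farthest-first initialization: start with the first point, then keep
    # adding the point that is farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
    return centroids, [nearest(p, centroids) for p in points]


# Unlabeled toy data: no one tells the algorithm there are two groups of three.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = kmeans(points, k=2)
```

Here labels places the first three points in one cluster and the last three in another, even though no labels were supplied.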

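Semi-supervised learning sits between the two: it trains on a small set of labeled examples together with a much larger pool of unlabeled data. In practice it is often paired with a workflow in which the difficult cases are sent to humans and the model handles the rest.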

The human-in-the-loop approach uses human input to improve the accuracy of an algorithm's predictions. For some use cases it makes sense to use a hybrid approach in which human experts review the recommendations and predictions made by machines before they are sent out. This helps catch misjudgments made by the algorithms and also provides continuous feedback that improves the accuracy of the ML models.
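One common way to implement this hybrid review is to route predictions by model confidence. The sketch below assumes each prediction arrives as an (item_id, predicted_label, confidence) tuple; the 0.9 threshold, the transaction IDs and the function names are all hypothetical.

```python
def route_predictions(predictions, threshold=0.9):
    """Split predictions into those safe to automate and those needing review."""
    automated, needs_review = [], []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            automated.append((item_id, label))
        else:
            needs_review.append((item_id, label))
    return automated, needs_review


def record_review(training_data, item_id, human_label):
    """Store the reviewer's decision as a new labeled example for retraining."""
    training_data.append((item_id, human_label))
    return training_data


# Made-up model output for four transactions.
predictions = [
    ("txn-1", "legit", 0.99),
    ("txn-2", "fraud", 0.62),
    ("txn-3", "legit", 0.91),
    ("txn-4", "fraud", 0.55),
]
automated, needs_review = route_predictions(predictions)

# A reviewer overrides the model on txn-2; the correction becomes training data.
new_training_data = record_review([], "txn-2", "legit")
```

The threshold is the main design choice: raising it sends more cases to people and catches more misjudgments, at the cost of more manual work.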

The Importance of Data

Data is critical to achieving high accuracy in ML models and improving the quality of their predictions. Training data typically needs to be cleansed, labeled and enriched before it is ready to use. There is no substitute for the quality and depth of the data used to train algorithms. Without large, properly labeled and organized data sets, you will end up with incorrect or biased recommendations. Human-computer interaction is more important for artificial intelligence than we normally think: artificial intelligence depends on machine learning, which in turn relies on high-quality training data, much of it labeled and checked by people. Machine learning algorithms can only predict outcomes as well as the data they train on allows, and they need a continual supply of new training data to update their models and maintain high accuracy.

Building a good ML model requires deep technical and domain expertise. The typical data cleansing and enrichment process involves continuous iterations through the following steps:

  • Integrate and merge data from disparate sources into a data lake such as Amazon S3 or a data warehouse service like Amazon Redshift.
  • Clean, standardize, normalize, transform and de-duplicate data using SQL, Hive and Python scripts.
  • Enrich missing data using a combination of manual and automated data collection.
  • Verify, test and measure results.
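The steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the two sources (crm and orders), the field names, the email-based de-duplication key and the enrichment lookup are all made up, and a real pipeline would more likely run as SQL or Hive jobs against the data lake or warehouse.

```python
def clean_record(record):
    """Standardize one raw record: trim whitespace and normalize casing."""
    return {
        "email": record["email"].strip().lower(),
        "name": " ".join((record.get("name") or "").split()).title(),
        "country": (record.get("country") or "").strip().upper() or None,
    }


def merge_and_dedupe(*sources):
    """Merge several sources, keeping one record per email address."""
    merged = {}
    for source in sources:
        for raw in source:
            record = clean_record(raw)
            existing = merged.get(record["email"])
            if existing is None:
                merged[record["email"]] = record
            else:
                # Duplicate: only fill fields the existing record is missing.
                for key, value in record.items():
                    if not existing.get(key):
                        existing[key] = value
    return list(merged.values())


def enrich(records, country_lookup):
    """Fill missing countries from a separately collected lookup table."""
    for record in records:
        if record["country"] is None:
            record["country"] = country_lookup.get(record["email"])
    return records


def completeness(records, field):
    """Share of records with a non-empty value for the field (verification)."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get(field)) / len(records)


# Two made-up sources with inconsistent formatting and one duplicate customer.
crm = [
    {"email": " Ana@Example.com ", "name": "ana  lopez", "country": "es"},
    {"email": "bo@example.com", "name": "Bo Chen", "country": None},
]
orders = [
    {"email": "ana@example.com", "name": "Ana Lopez", "country": ""},
    {"email": "cy@example.com", "name": "cy  park", "country": "kr"},
]

records = merge_and_dedupe(crm, orders)
before = completeness(records, "country")
records = enrich(records, {"bo@example.com": "SG"})
after = completeness(records, "country")
```

Comparing before and after gives a simple measurable check on the enrichment step, which is the kind of result the final bullet asks you to verify on each iteration.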

Most engineers and data scientists spend a large share of their time preparing data before their ML models can use it. To make sure your AI initiatives are successful, you need to invest in a dedicated team, internal or external, that focuses on data collection and cleansing. This allows your AI experts to focus on building sophisticated algorithms instead of grappling with cleaning and preparing data.

AI and Humans

We believe in a future where humans and machines work together harmoniously. We need to leverage AI to augment humans so we can all be more productive, creative and happier. It has to be a partnership. Humans can work with computers to analyze the patterns they find and come up with actionable insights and interpretations. Computers predict based on past learning, so human judgment is critical in deciding whether a prediction still holds when the underlying parameters or the future scenario have changed. Tomorrow’s industry leaders will be the organizations that use AI alongside humans to innovate faster and make every process run more efficiently. The good news is that we are still in the early days of AI and there is a lot of progress ahead of us. The consumer will be the eventual winner, benefiting from a far superior, more personalized experience at lower price points, and that is what excites us!