The why and the how behind er², our entity resolution tool
Entity resolution, also known as record linkage, is the task of disambiguating real-world entities from data. That is to say, it’s the process of identifying and resolving multiple occurrences of a single entity to reveal a clearer picture of the information within the data. It’s simple enough conceptually, but exceedingly difficult to achieve in practice and at scale, which is why there aren’t many master data management solutions available.
Last year at ThinkData, we formed the DataLabs team to tackle this problem and provide our users with entity resolution tools that are performant, flexible, and customized to their particular use case.
Our Namara platform and its suite of tools constantly evolve, and our users have access to more than 250,000 datasets in addition to whatever data they’re using internally. With that many datasets in the mix, automating the process of solving for data variety becomes an absolute necessity.
Image: an example of duplicate records within one dataset. There are so many different ways to record a company name that automated entity resolution becomes a necessity.
We often find duplicated entities that need to be resolved or linked, whether within a single source of data or across multiple data sources. A reliable entity resolution tool is critical for making the most refined and aggregated information available to our clients and their entire organizations. After many trials and iterations (and a few failures...) we now proudly offer our entity resolution tool, er², as a core component of the new ThinkData Works Master Data Management system.
Why does anybody need entity resolution?
By a commonly cited estimate, a typical data scientist spends 80% of their time cleansing and preparing data. This is a shocking statistic. Because data has to be processed and refined before it can produce data-driven insights, a growing number of industries are adopting machine learning approaches to improve their productivity.
We deployed er² as an enterprise solution to help efficiently refine and manage data. As more and more data is added to an ecosystem, the need for an operational layer to tie it all together becomes increasingly important – it also becomes impossible for humans to manage manually. Record linkage has applications in every sector, and our current client deployments have shown us how the need for it grows along with the world of data. Data is good and more data is great, but connecting data is the key to learning from it.
How does er² resolve entities?
For entity disambiguation, we first classify the entity type (e.g. organization or address) and preprocess the data in a way that best suits the entity itself. The entity type informs how we optimize the tokens (the character sequences the input is split into) within the entity and how we distribute the computing load. We then translate each feature into a vector representation, use a compressed sparse matrix to compute pairwise similarity, and link duplicate entities on a graph structure.
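To make the vectorize, compare, and link steps concrete, here is a minimal sketch in plain Python. It is not er²'s implementation: the character-trigram vectors, the cosine similarity measure, the 0.5 threshold, and the function names (`trigram_vector`, `cosine`, `resolve`) are all assumptions chosen for illustration. Dictionaries of counts stand in for the rows of a compressed sparse matrix, a union-find structure stands in for connected components on a graph, and the entity-type classification step is skipped entirely.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def trigram_vector(name):
    """Turn a name into a sparse vector of character-trigram counts (assumed tokenization)."""
    # Normalize: lowercase, replace punctuation with spaces, collapse whitespace.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    padded = " " + " ".join(cleaned.split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def resolve(names, threshold=0.5):
    """Group names whose pairwise similarity meets the (illustrative) threshold."""
    vectors = [trigram_vector(n) for n in names]
    parent = list(range(len(names)))  # union-find forest: one node per record

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Each sufficiently similar pair is an edge; merging endpoints
    # yields the connected components of the similarity graph.
    for i, j in combinations(range(len(names)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return list(clusters.values())
```

Calling `resolve(["ThinkData Works", "thinkdata works.", "ThinkData Wroks", "Acme Corp"])` groups the three ThinkData spellings into one cluster and leaves `Acme Corp` on its own. The all-pairs loop is quadratic in the number of records, so it only suits small inputs; it is meant to show the shape of the pipeline, not how to run it at scale.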
The example in the image gives you an idea of the disparity data scientists face when it comes to the way a single company can be recorded – typos, short forms, omissions, and variations abound when working with data, within a single dataset or across multiple.
Instead of using conventional word embedding models, we designed our own embedding model for more accurate, efficient, and scalable entity resolution.
We built er² to optimally distribute workload across multiple compute nodes using Spark so that users can efficiently work with large datasets.
Connecting data from any number of sources at scale
As we grow our data variety, our data refinement capabilities using entity resolution will improve in parallel. This means that data scientists, instead of processing datasets on a case-by-case basis, will have an automated solution that creates links between data points, producing robust master data records that lead to deeper insight. The less time data scientists spend mired in prep and processing, the more time they can spend on actual data science.
The future of er²
We've achieved many of the goals we set out to hit, including besting the accuracy of leading entity resolution tools when handling dirty data. However, that doesn't mean we haven't set new goals. We're working hard to automate the process moving forward, and to create an automatic data-knowledge transformation pipeline.
By training er² on one of the largest catalogues of public data in the world, we have designed our solution around the real-world dirty data environment that most data scientists face daily. Rather than building a product in a sanitary, synthetic environment, we're designing tools that manage and neutralize data variety quickly, effectively, and at scale.
Are you working with a database that needs to be threaded together? Interested in master data records and deduplication? Contact us to learn more or book a demo, and see how ThinkData can help your data scientists be data scientists.