Every piece of the puzzle is needed in order to see the big picture
For example, one of our banking clients is interested in ingesting new streams of import/export data that we provide on Namara. By linking external import/export data to their internal client data, the bank will be able to uncover client trade activities that were previously unknown, identify true wallet size, investigate trade trends, monitor growth, map clients, and ultimately gain insight into a previously opaque part of their business. Every one of these advantages depends on linking the bank's internal client data to the external data feed provided by ThinkData.
UK companies data on Namara
To link this export/import data feed to the bank's internal client data, a third, external data source is needed: a public repository of all businesses in the UK, which can be linked to external trade data feeds. This repository provides the key that connects the trade data to the bank's internal data. The data set was harvested and organized by ThinkData and can be accessed on Namara Marketplace.
Sample of company data being gathered and organized by the Namara Platform
Global exports data feed
Let's consider a scenario in which we want to take the export data feed, filter it to UK-based companies, and connect it to the UK companies information on Namara. First, let's take a look at the data set. A sample of this feed contains trade information in 20 columns, such as:
- Exporter information (name, id, address)
- Consignee information (name, id, address)
- Information on exported goods (weight, quantity, value, etc.)
Sample of global exports data being filtered and connected to the UK companies information on Namara
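As a rough illustration of the filtering step, the sketch below keeps only the export records whose consignee is UK-based. The column names (`consignee_country`, `value_usd`), the country codes, and the sample rows are all hypothetical; the real feed has 20 columns whose exact names are not listed here, and the country may need to be derived from an address field instead.

```python
import csv
import io

# Hypothetical sample rows; the real feed has 20 columns with other names.
SAMPLE = """exporter_name,consignee_name,consignee_country,value_usd
Shenzhen Parts Co,ACME TRADING LTD,GB,12000
Hamburg Freight GmbH,Lyon Textiles SARL,FR,8300
Mumbai Spices Pvt,Northern Foods Limited,GB,4100
"""

def uk_consignee_rows(rows, country_field="consignee_country", uk_codes=("GB", "UK")):
    """Keep only rows whose consignee country code looks like the UK."""
    codes = {code.upper() for code in uk_codes}
    return [r for r in rows if r.get(country_field, "").strip().upper() in codes]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
uk_rows = uk_consignee_rows(rows)
```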
Results: Connecting global export data to UK companies data
Both the global export data and the UK companies data are missing information in some columns, so we made the connection using the company name field in the UK companies data and the consignee name field in the export data feed (the consignee being the UK company involved in the deal).
These two columns are always populated in both data sets, so our first attempt at connecting the data sets was based on matching these unstructured text fields.
An unstructured text field can contain anything and is not necessarily a cleanly recorded company name.
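Because these fields are free text, name matching usually benefits from some normalization first. The function below is a minimal sketch of that idea, not the pre-processing used in our pipeline; `normalize_name` and the `LEGAL_SUFFIXES` list are assumptions made for illustration.

```python
import re
import unicodedata

# Assumed list of legal-form suffixes; a real list would be longer.
LEGAL_SUFFIXES = {"ltd", "limited", "plc", "llp", "co", "company"}

def normalize_name(raw):
    """Lowercase, strip accents and punctuation, drop trailing legal suffixes."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not a lowercase letter, digit, or space.
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    tokens = text.split()
    # Remove suffixes such as "co" and "ltd" from the end of the name.
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)
```

With this sketch, `"ACME Trading Co., Ltd."` and `"acme trading"` reduce to the same string, which removes one common source of spurious mismatches before any fuzzy comparison takes place.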
The first step is to perform the pre-processing required to prepare these data sets for the matching pipeline. We then designed and ran a fuzzy matching algorithm that includes the following steps:
- An indexing algorithm is used to extract a good subset of similar fields. This improves performance and saves time by choosing only those pairs from the two data sets that are more likely to be a match.
- The candidate pairs are then processed by the text-matching algorithm. Our approach breaks down the names and takes into account many different scenarios in which a specific name can appear. This algorithm computes a match score between each record in the global export data set and its closest match in the UK companies data.
- The generated scores reflect the probability of a match and are then used to extract the most likely matches.
- The final result is the global export data set with an extra column containing the unique ID of the corresponding company in the UK companies data set.
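The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithm we run in production: the indexing step is a simple first-token blocking key, the score comes from the standard library's `difflib.SequenceMatcher` (a similarity ratio rather than a calibrated probability), and the names `block_key`, `link_records`, the `0.85` threshold, and the sample records are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def block_key(name):
    """Indexing step: group names by their first token (a simple blocking key)."""
    tokens = name.lower().split()
    return tokens[0] if tokens else ""

def build_index(companies):
    """Map each blocking key to a list of (company_id, company_name)."""
    index = defaultdict(list)
    for company_id, name in companies:
        index[block_key(name)].append((company_id, name))
    return index

def similarity(a, b):
    """Score in [0, 1]; a similarity ratio, not a calibrated probability."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_records(export_rows, companies, threshold=0.85):
    """Return export rows with added 'company_id' and 'match_score' fields."""
    index = build_index(companies)
    linked = []
    for row in export_rows:
        # Only compare against companies that share the blocking key.
        candidates = index.get(block_key(row["consignee_name"]), [])
        best_id, best_score = None, 0.0
        for company_id, name in candidates:
            score = similarity(row["consignee_name"], name)
            if score > best_score:
                best_id, best_score = company_id, score
        # Keep the ID only when the best score clears the threshold.
        matched = best_id if best_score >= threshold else None
        linked.append({**row, "company_id": matched, "match_score": round(best_score, 3)})
    return linked

# Hypothetical, already-normalized sample data.
companies = [("C001", "acme trading"), ("C002", "northern foods"), ("C003", "acme travel")]
exports = [
    {"consignee_name": "acme tradng"},     # small typo
    {"consignee_name": "northern foods"},  # exact match
    {"consignee_name": "zenith metals"},   # no candidate in its block
]
linked = link_records(exports, companies)
```

In this sketch the misspelled "acme tradng" still links to `C001`, while "zenith metals" is left with no ID. First-token blocking is cheap but misses pairs where the typo falls in the first word, which is one reason a production indexing step would need to be more forgiving.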
The following image shows 20 randomly selected records with their corresponding match scores. We can see that the score appropriately reflects the level of confidence in the match, and that the matching algorithm has a good tolerance for small typos and variations.
Map of UK Exporters Companies House created on Namara
For modern companies, data is the key to unlocking major opportunities, better understanding the market, and developing advanced analytics. As we move toward AI- and ML-driven tech, the quality of insights will reflect the quality of the data being used. By enriching internal company data with external data feeds, fast-moving businesses are establishing a pipeline that can deliver actionable information on demand, which gives them the time and resources needed to grow their business.
Want to learn more about er²?
Request a consultation with one of our data experts to talk about our entity resolution toolkit and how ThinkData’s tech can advance your projects. If you're interested in learning more, read how MaRS Discovery District applied er² to Ontario businesses.