From data wrangling to AI & ML enabled data curation

Authors: Peter Spicer – Chief Technology Officer & Scott Matthews – Chief Data Scientist

There is little debate that both the demand for data and its availability are growing at unprecedented rates. This is straining organizations’ legacy technology and processes as they try to remain competitive. For Capital Markets firms, new compliance and regulatory requirements, recognition of the value of new and unique data in finding alpha, and the need to reduce operational costs all demand innovation.

Business infrastructure is under increasing pressure to service a growing number of use cases where a wealth of broad and large-scale data is required.

Traditional data-driven organizations have specific data requirements. Data is often sourced from multiple vendors that supply it in different formats, and it can include proprietary elements unique to each vendor.

This presents a common challenge: “Data Curation”, the discipline of cleaning, normalizing, categorizing and labeling data as it is ingested, organized, and stored, before it is made available for analysis. It is an essential activity that keeps poor data out of the firm’s analysis, and it requires a complex set of activities and skills.

Data Curation Challenges

For data users, curation is tedious, mundane and repetitive. It is a task that does not in itself create additional value for a firm, has a high opportunity cost and often goes unrecognized and unappreciated. It is estimated that analysts spend 80% of their time on data management (curation, organization, securities master management, and more). Ensuring that data is organized and usable, particularly when it is unstructured, can be extremely time intensive and require specialized skills.

5 key challenges facing data curation and management today

1. Data Volumes

Businesses operating on legacy physical infrastructure are finding it increasingly difficult to meet organizational demands while remaining cost effective and flexible. With legacy platforms, organizations face inefficiencies and duplication from operating multiple siloed platforms and disparate data systems. As demand for storage grows, managing a growing fleet of physical on-premises hardware increases operational costs, a problem that will only get worse because demand will not fall any time soon.

For global firms trading across markets and asset classes, tick data in particular can easily reach petabyte scale when dealing with decades of history across hundreds of markets at full order book level of detail.

Clearly the issue is not new or unrecognized, as more and more organizations migrate to cloud solutions. That said, although cloud now provides almost infinite scalability, it also needs to be managed so that costs do not run away, with appropriate controls and data life cycle management processes in place. This ultimately ensures fast and efficient access for data consumers, whether a compliance team or a quant or researcher doing analytics.

In addition, as the volume and variety of data in the organization grows, firms also need greater control over data contracts, so that they get more utility from each data license and avoid costly duplication or proliferation across business units.

2. Reference Data

Vendors of market data typically collect data from hundreds of venues. Over time, vendors and industry consortiums have developed a range of coding schemes to uniquely identify tradeable instruments. Some are global (ISIN), some are country specific (Valoren), some are proprietary (e.g., RIC, the Reuters Instrument Code, and CUSIP), and some are free and open (e.g., FIGI, LEI). Some vendors have created their own schemes, perhaps initially to fill a data management requirement, but today these have become a lock-in, making it difficult for users to migrate to alternative vendors because the codes have proliferated through business processes and across systems.

Because of the range of data sources and providers, and the various forms in which data is delivered, trying to analyze, join or interpret data can be a significant challenge: one data source may use an ISIN, another a CUSIP, and another just a company name as the key to the data record.

There is little information available in the public domain that explains the logic behind the creation of these codes, which makes it extremely difficult for a user of this data to map an instrument code on an exchange to its equivalent RIC or FIGI. A significant investment of time is required to develop heuristics to map codes, plus ongoing maintenance to keep them current. This data interoperability, achieved through identifier mapping or linking, is vital to ensure organizations can make maximum use of all available data.
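As a minimal sketch of the identifier linking described above, the Python below resolves records keyed by ISIN, CUSIP or company name to a single internal ID through a cross-reference table. Everything in it is an assumption made for illustration: the table layout, the INST-… internal IDs, the function names and all identifier values are invented placeholders, not real codes or any vendor's actual mapping logic.

```python
from typing import Optional

# Hypothetical cross-reference rows: one internal ID per instrument, plus
# whichever external identifiers are known. All values are placeholders.
XREF = [
    {"internal_id": "INST-0001", "isin": "XX0000000001",
     "cusip": "000000AA1", "name": "Example Holdings"},
    {"internal_id": "INST-0002", "isin": "XX0000000002",
     "cusip": None, "name": "Sample Industries"},
]


def build_index(rows):
    """Map (scheme, identifier) pairs to the internal instrument ID."""
    index = {}
    for row in rows:
        for scheme in ("isin", "cusip", "name"):
            value = row.get(scheme)
            if value:
                index[(scheme, value.upper())] = row["internal_id"]
    return index


def resolve(index, scheme: str, value: str) -> Optional[str]:
    """Return the internal ID for an external identifier, or None if unmapped."""
    return index.get((scheme, value.strip().upper()))


INDEX = build_index(XREF)

# Three feeds describing the same instrument, each keyed differently.
prices = {"scheme": "isin", "key": "XX0000000001", "close": 10.5}
ratings = {"scheme": "cusip", "key": "000000AA1", "rating": "A"}
news = {"scheme": "name", "key": "example holdings", "headline": "placeholder"}

# Group the records under the common internal ID so they can be joined.
linked = {}
for record in (prices, ratings, news):
    internal_id = resolve(INDEX, record["scheme"], record["key"])
    linked.setdefault(internal_id, []).append(record)
```

Here all three records end up under INST-0001. In practice the hard part is building and maintaining the cross-reference table itself, which this sketch simply takes as given.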

3. Symbology

Companies often change their names, during mergers or for other reasons. Ticker change events also need to be considered, because some exchanges allow tickers to be reused, which can make the history of a company difficult to follow. Searching on a company name alone can also be problematic, as a company name may well be reused, especially across different countries.

Organizations that require historical data for use cases such as back-testing or research need to be able to distinguish between two companies that have held the same ticker code over different date ranges. The issue compounds when two data sets sourced from different vendors use different proprietary codes. Symbology mapping over the history of a data set is therefore a key factor in effectively executing compliance and analysis activity.
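To make the ticker re-use problem concrete, here is a minimal sketch of a point-in-time symbology lookup in Python. It assumes each ticker assignment is stored with an effective date range and a stable entity ID; the tickers, entity IDs, dates and names such as `TickerAssignment` and `entity_for` are hypothetical and chosen only for illustration.

```python
from datetime import date
from typing import NamedTuple, Optional


class TickerAssignment(NamedTuple):
    ticker: str
    entity_id: str        # stable ID that survives renames (hypothetical)
    start: date           # first day the ticker referred to this entity
    end: Optional[date]   # last day, or None if still in effect


# Invented history: ENT-1 traded as ABC, then changed ticker to XYZ;
# the exchange later re-issued ABC to an unrelated company, ENT-2.
HISTORY = [
    TickerAssignment("ABC", "ENT-1", date(2003, 1, 2), date(2010, 6, 30)),
    TickerAssignment("XYZ", "ENT-1", date(2010, 7, 1), None),
    TickerAssignment("ABC", "ENT-2", date(2012, 3, 1), None),
]


def entity_for(ticker: str, on: date, history=HISTORY) -> Optional[str]:
    """Return the entity that held `ticker` on the date `on`, if any."""
    for a in history:
        if a.ticker == ticker and a.start <= on and (a.end is None or on <= a.end):
            return a.entity_id
    return None


def tickers_for(entity_id: str, history=HISTORY):
    """Return an entity's ticker assignments in chronological order."""
    return sorted((a for a in history if a.entity_id == entity_id),
                  key=lambda a: a.start)
```

With this history, `entity_for("ABC", date(2005, 1, 1))` resolves to ENT-1 while `entity_for("ABC", date(2015, 1, 1))` resolves to ENT-2, so a back-test that joins on the ticker alone would silently mix two companies.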

4. Data Nuances

When organizations source data, particularly direct from source, it is critical to have ways of dealing with issues such as: timestamp inconsistencies and precision (the number of significant digits, and when, where and how the timestamp is captured); time zones and daylight-saving changes; how to backfill missing data when vendors provide it; how various market metrics are calculated and on what basis; and explicitly highlighting areas where a data archive is permanently lost. These scenarios require not only domain and data expertise but also mastery of data science and cloud computing, and knowledge of how to integrate these elements efficiently for rapid deployment. Being able to use data science to identify and deal with data anomalies in a scalable way is critical in today’s environment.
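The time zone and daylight-saving issue above can be illustrated with a short Python sketch that normalizes venue-local timestamps to UTC. It assumes Python 3.9 or later with the standard `zoneinfo` module and an available time zone database; the helper name `to_utc` and the sample timestamps are chosen purely for illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_utc(local_timestamp: str, venue_tz: str) -> datetime:
    """Interpret a naive ISO timestamp in the venue's zone and convert to UTC."""
    naive = datetime.fromisoformat(local_timestamp)
    return naive.replace(tzinfo=ZoneInfo(venue_tz)).astimezone(timezone.utc)


# The same 09:30 local time in New York, either side of the 14 March 2021
# daylight-saving change, lands on two different UTC times.
before = to_utc("2021-03-12T09:30:00", "America/New_York")
after = to_utc("2021-03-15T09:30:00", "America/New_York")

print(before.isoformat())  # 2021-03-12T14:30:00+00:00
print(after.isoformat())   # 2021-03-15T13:30:00+00:00
```

A fixed offset applied to the whole history would shift half of each year by an hour. Local times that fall inside the changeover itself (skipped or repeated hours) need an explicit policy, which this sketch does not address.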

5. Alternative Data Sets

The ability to produce data catalogues and enable data interoperability for analysis is a building block for successfully identifying alpha or trading signals. Systems today need not only to support various common data sets but also to make it easy to on-board new ones on demand and establish linkage through identifiers or entities. As data consumption grows, many alternative data sets arrive in semi-structured or unstructured states. Extracting data from images and documents, recognizing entities, and understanding the context of an entity in text are all challenges. Applying the best and most appropriate AI and machine learning methods to extract meaningful data, and to link recognized entities back to the structured data formats in use via a symbol or instrument code, is where recent advances have been made.
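The linking step at the end of that pipeline can be sketched in a few lines of Python. This is deliberately not a machine learning method: it assumes company mentions have already been extracted from text by some upstream recognizer (not shown), and it uses simple name normalization plus a dictionary lookup as a stand-in for the link back to structured identifiers. The suffix list, company names and INST-… IDs are all hypothetical.

```python
import re

# Common legal suffixes to ignore when comparing names (illustrative list).
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation",
                  "ltd", "limited", "plc", "llc"}


def normalize(name: str) -> str:
    """Lower-case, drop punctuation and strip trailing legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


# Hypothetical reference data: normalized name -> internal instrument ID.
ENTITY_IDS = {
    normalize("Example Holdings Ltd"): "INST-0001",
    normalize("Sample Industries Inc."): "INST-0002",
}


def link_entities(mentions):
    """Map each extracted mention to an internal ID, or None if unknown."""
    return {m: ENTITY_IDS.get(normalize(m)) for m in mentions}


mentions = ["EXAMPLE HOLDINGS LIMITED", "Sample Industries, Inc", "Unknown Co"]
links = link_entities(mentions)
```

The first two mentions resolve to INST-0001 and INST-0002 despite differences in case, punctuation and suffix, and the third is left unlinked. Exact matching after normalization will miss abbreviations, misspellings and ambiguous names, which is where the statistical and learned approaches mentioned above come in.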

A Way Forward

These issues are not confined to the capital markets sector: many organizations are grappling with how to move forward in this data-prolific age, how to choose the right technology and whom to trust to support migration. With over 15 years building and managing petabyte-scale time series data platforms, RoZetta Technology (RZT) has worked with clients across multiple sectors to find practical, scalable solutions. Operating the Thomson Reuters Tick History platform, for example, RZT ingested billions of transactions per day from over 450 exchanges to deliver trustworthy, reliable, analysis-ready data to thousands of quants and researchers around the world.

Given the scale of the Thomson Reuters service, it was critical to make the best use of AI and ML technology throughout automated quality control and data management workflows. This involved solutions for detecting errors, corrupted messages, and missing data, through to recognizing unexpected changes made by an exchange to schemas, formats, and codes before the impact was felt downstream.

The problem continues to grow in complexity as data of greater breadth and scale becomes accessible. However, as the data challenges have grown, technology has also advanced to provide relief. It is now possible to apply increasingly sophisticated anomaly detection and machine learning techniques to enforce quality control, arrange and curate data, link across data sets, and deliver the results in a user-friendly, consumable form.
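As a simple illustration of anomaly detection used for quality control, the Python sketch below flags outlying prices with a modified z-score based on the median absolute deviation (MAD). This is a generic textbook technique chosen for the example, not a description of the methods RZT uses; the function name, the sample prices and the conventional 3.5 threshold are assumptions.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indexes of values whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * |x - median| / MAD, which is robust to
    the very outliers it is trying to find (unlike a mean/stdev z-score).
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against; nothing can be flagged
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Invented trade prices with one obviously bad print (a misplaced decimal).
prices = [100.1, 100.2, 100.0, 100.3, 1002.0, 100.2, 100.1]
suspect = flag_outliers(prices)
print(suspect)  # [4]
```

A production check would typically work on rolling windows per instrument and combine several signals, but the same idea applies: measure each value against a robust baseline and route the exceptions for review before they reach downstream users.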

Modern platforms, specifically those built on cloud technology, also deliver cost benefits and scalability, creating a tipping point for organizations to take action now (or risk being left behind). It is possible, practically and with low risk, to build a roadmap to increase data access and interoperability at scale and speed. RoZetta Technology has a range of solutions developed for clients who seek a practical and effective next step:

DataHex – Data Management SaaS Platform

RZT’s DataHex is a fourth-generation, cloud-based data management SaaS platform able to ingest and manage a range of data types. It combines proprietary data handling algorithms and management automation with leading-edge applications to deliver optimized performance in a scalable, cost-effective operation.

Securities Master

Fuel data interoperability by linking identifiers across data sets, and enable symbology mapping over the historical tick data set back to 2003. Extensive reference data and concordance: 52+ million instruments; 12+ million principal updates annually across 250 exchanges; 9 asset classes; and all major identifiers, including ISIN, CUSIP, SEDOL, FIGI and more.

Analytics as a Service

DataHex can provision data directly into a range of leading analytics tools such as Databricks, Snowflake, BigQuery and more.

Entity recognition and mapping

Complex AI & machine learning methods are employed to recognize and extract data elements (entities and entity relationships) in semi-structured and unstructured data, enabling mapping to industry identifiers and opening up the potential of linked data.

RoZetta Technology enables complex problem solving by combining advanced analytical expertise with proven technology platform capability. We use our deep data science knowledge and over 20 years’ experience in cloud and data technologies to design and deliver transformative solutions, turning high-volume, complex situations involving structured and unstructured data into clear insights for decision makers.

Contact us to find out how RoZetta Technology can empower your business to make better decisions.