Data Validation for Fun & Profit

Mike Prettejohn
Entrepreneur & Investor
Sun 21 July 2024

Below I’ll explain what I’ve done post-Netcraft and what my plans are.

My starting position in January was as a one-man band; for someone accustomed to being supported by some of the best computer scientists in the country and to having control of a substantial, growing dataset, this felt problematic.

I learnt that this was more a question of perception when someone told me that “cash rich, data poor and people poor” sounded just perfect and not to change anything.

Undeterred, I chose to work with base data from a stock market data service by Stockopedia, which in turn is based on data provided by LSEG Data & Analytics, used by many personal finance sites including Yahoo Finance.

Internet professionals wondering why I didn’t start with a cyber security dataset, where my experience might count for more, should note that this is a small dataset with high-value content that is conveniently available to monetise without the prerequisite of spending time collecting & cultivating data, recruiting, or prospecting for & negotiating with customers; it therefore suited my circumstances.

Investment professionals wondering why I didn’t buy a more expensive service: I tried, and was told - I paraphrase slightly - “We don’t sell our service to the likes of you”.

Since January I've produced a gain of 24%, calculating performance from the point at which cash was invested to avoid understatement from having a large proportion in cash at the start of the period. This compares favourably with the World (14%), Nasdaq (20%), S&P 500 (16%), and FTSE-100 (5%) indices. Were the portfolio a cyber security company and these were its annual profits, I think it would be the third largest in the UK after Darktrace and PortSwigger; the NCC Group has turnover in excess of £300M but made a loss last year.

How are you doing it?

It is a diversified portfolio of profitable companies, each valued at over £1bn and traded on public exchanges in well-developed countries, with no use of debt, derivatives or cryptocurrencies. Companies are chosen using classical methodologies.

However, the big problem, and therefore the big opportunity, is not investment methodology but data validation.

Data Validation

The bane of investors' lives is data which is missing, incorrect or inconsistent with other data, often from the same source.

So much of an investor’s time is spent validating data that it is surprising how little is written about how to do it. Counterintuitively, investment textbooks mostly assume the investor has reliable data to work with.

The investor operates amongst all kinds of bad data, from typographical errors & simple omissions through to outright fraud. This includes stale forecasts and lazy projections for the second year out (“I’ll add 10% to this year’s numbers for now”) which increase in significance with the passage of time. Corporate brokers, where “Buy” means “they’re a client” and “Hold” means “they’re going bust”, are an Orwellian class of bad data.

Further, although the base data is expressed numerically, corroborating information can come in text, graphics and video; it is voluminous and time-consuming to read.

The biggest part of the investment process is incrementally improving and automating data validation, thereby raising the abstraction level, convenience and efficiency with which decision making can take place.
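To make the flavour of this concrete, here is a minimal sketch of the sort of automated check I mean, in Python. The field names, thresholds and record layout are purely illustrative; they are not the schema of Stockopedia, LSEG or any other provider.

```python
# A minimal sketch of automated validation checks over per-company records.
# Field names and tolerances are illustrative, not any provider's schema.

def validate_record(rec: dict) -> list[str]:
    """Return a list of human-readable problems found in one company record."""
    problems = []

    # Missing data: required fields that are absent or empty.
    for field in ("market_cap", "price", "eps", "pe_ratio"):
        if rec.get(field) in (None, ""):
            problems.append(f"missing {field}")

    # Internal inconsistency: the quoted P/E should roughly equal price / EPS.
    price, eps, pe = rec.get("price"), rec.get("eps"), rec.get("pe_ratio")
    if all(isinstance(x, (int, float)) for x in (price, eps, pe)) and eps:
        implied_pe = price / eps
        if abs(implied_pe - pe) > 0.05 * abs(pe):
            problems.append(f"quoted P/E {pe} disagrees with price/EPS {implied_pe:.1f}")

    # Staleness: forecasts that have not been refreshed for a long time.
    if rec.get("forecast_age_days", 0) > 180:
        problems.append("forecast more than six months old")

    return problems
```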

If this sounds like a relentless chore, you’ve not tried validating phishing attacks whilst keeping pace with terabytes per minute of spam.

Colleagues from Hewlett Packard Labs may be reading this and thinking “This sounds like what you were doing in 1985-7”. It is. For readers who didn’t know me until more recently; I spent much of my time at HP as an AI researcher; I attended conferences with names like “Advances in Artificial Intelligence” & “Expert Systems 85” and had a Symbolics Lisp Machine which cost more than the house I lived in. I didn’t meet Geoffrey Hinton, but met some of his colleagues from Sussex who were responsible for the delightful Pop11 programming language.

After leaving Hewlett Packard I switched focus to the Internet which I thought might happen sooner than AI; that turned out to be correct.

What technology are you using?

GPT-4o.

Whereas the strength of my previous position was the repository of code and data available to me, such that any attempt at a new service could incrementally make use of resources we already had, the strength of my current position is that I am starting from scratch, with no existing investment in technology, and can make use of the latest opportunities.

Analogously, when starting Netcraft thirty years ago it was an advantage to be starting without pre-internet business processes and technology.

The clearest analogy is that the advantage of using LLMs today is similar to the advantage gained by using a scripting language, regular expressions and HTTP/HTML in 1994 when everyone else was writing locally hosted applications in C & C++.

Best of all, James Stanley joined in March which has made everything easier. James is a special person who can write reliably working code as fast as he, or I, can think.

How much can you delegate to LLMs?

LLMs are convenient for creating structured data from unstructured data. A lot of investment material is made available as PDFs; processing them with GPT-4o is a big step forward from regex.
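As a minimal sketch of that step, assuming the OpenAI Python SDK and pypdf are installed and an API key is configured; the prompt and the fields requested are illustrative rather than my actual pipeline:

```python
# Sketch: turn an unstructured results PDF into structured JSON with GPT-4o.
import json
from pypdf import PdfReader
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_financials(pdf_path: str) -> dict:
    """Pull a few headline figures out of a results PDF as JSON."""
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)

    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # force a JSON reply
        messages=[
            {"role": "system",
             "content": "Extract revenue, operating profit and net debt from the "
                        "report below. Reply with JSON only; use null for any "
                        "figure that is not stated."},
            {"role": "user", "content": text[:100_000]},  # crude guard on context size
        ],
    )
    return json.loads(response.choices[0].message.content)
```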

One of the issues is how to handle persistence so that you do not repeatedly ask the LLM to analyse the same data. As well as the natural option of maintaining state yourself, calling the LLM with data and then updating your records with the results, it is also possible to provide the LLM with an API to your own data so that it can make the updates itself, analogous to using callbacks.
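A minimal sketch of the first option, keying a cache on a hash of the document so an unchanged filing is never sent to the model twice; the table layout and function names are mine, for illustration. The callback-style alternative would instead expose read and write operations over the same store as tools the model can call.

```python
# Sketch: persist LLM analyses so the same document is only analysed once.
import hashlib
import json
import sqlite3

db = sqlite3.connect("llm_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, result TEXT)")

def analyse_once(document: str, analyse_with_llm) -> dict:
    """Call analyse_with_llm(document) at most once per distinct document."""
    key = hashlib.sha256(document.encode()).hexdigest()
    row = db.execute("SELECT result FROM cache WHERE key = ?", (key,)).fetchone()
    if row:
        return json.loads(row[0])          # already analysed; no LLM call needed
    result = analyse_with_llm(document)    # the expensive call happens here
    db.execute("INSERT INTO cache VALUES (?, ?)", (key, json.dumps(result)))
    db.commit()
    return result
```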

It is helpful to think of ways in which LLMs can participate beyond automating existing processes. Public company annual reports often run to a couple of hundred pages or more, and human investors would, at most, skim them. However, it is cost-effective for an LLM to process, analyse, parse, summarise & rank annual reports for all of the roughly 5000 companies in my investment universe.
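As a sketch of what that batch job might look like, with summarise_report standing in for a GPT-4o call such as the one above, and the scoring scheme entirely illustrative:

```python
# Sketch: summarise and score every annual report in a universe, then rank.
def rank_universe(reports: dict[str, str], summarise_report) -> list[tuple[str, float]]:
    """reports maps ticker -> annual report text; returns tickers ranked by score."""
    scored = []
    for ticker, text in reports.items():
        result = summarise_report(text)      # e.g. {"summary": "...", "score": 0..100}
        scored.append((ticker, result["score"]))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```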

I expect that the most purposeful use of LLMs for investment analysis is proprietary and tucked away out of public sight inside hedge funds, but to illustrate some opportunities: this is an interesting paper which caught the attention of the Financial Times, and Quartr provides a service which seems designed to be convenient for LLMs to use.

Since January I’ve used GPT, Perplexity, Gemini, Copilot, GPT, Anthropic and back again to GPT. Each of them had an advantage at the time. It’s reasonable to expect that the LLMs will continue to improve at speed; coding in anticipation of improvements is a good idea.
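One way to code in anticipation of improvements is to keep the choice of model behind a single narrow interface, so that moving to a better provider is a one-line change elsewhere in the codebase. A minimal sketch, with only an OpenAI-backed implementation shown and the interface itself being my own invention for illustration:

```python
# Sketch: hide the model choice behind one interface so providers can be swapped.
from typing import Protocol

class Completer(Protocol):
    def complete(self, prompt: str) -> str: ...

class OpenAICompleter:
    """One implementation; Gemini, Claude, etc. would implement the same method."""
    def __init__(self, model: str = "gpt-4o"):
        from openai import OpenAI
        self.client = OpenAI()
        self.model = model

    def complete(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

# Everything else depends only on Completer, so upgrading is one line:
# llm: Completer = OpenAICompleter("gpt-4o")
```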

What next?

I’m enjoying discovering how much more can be seen in data when using LLMs and how to monetise the things that I see. Just like back in January, some people might think “Why change anything?”

Happiness is Best Shared

The answer is that happiness is best shared. It would be a shame if the only outcome was a slightly bigger number in an investment portfolio; I expect that with more data and more people there will be proportionately more fun.

Please get in touch if you:

I appreciate that making a profit first before starting a business may seem backwards; so far, so good.

Credits

It’s a great privilege & tremendous fun to work with James Stanley again. And James is really getting the hang of pasties, North Coast beaches and cream teas.

Stockopedia's service has been a joy to use.