Technical Blog

Machine Learning in the FinTech world

The ML special interest group at the Indian Institute of Science organised a talk by Mr Mayur Thakur and Mr Sreenath Maikkara of Goldman Sachs. Here is a summary of the talk and the key takeaways as interpreted by me.

Speaker Bio: Mayur Thakur is head of the Surveillance Analytics Group in the Global Compliance Division. The group serves as quantitative experts designing and implementing risk-based surveillance models on the firm's large-scale data. He joined Goldman Sachs as a managing director in 2014. Prior to joining the firm, Mayur worked at Google, where he designed search algorithms for more than seven years. Previously, he was an assistant professor of computer science at the University of Missouri.

The presenting team, the surveillance team, is part of a bigger block called the compliance team. This talk was a wonderful opportunity for me to understand the applications of machine learning, and data science in general, in the FinTech world.

The speaker started the talk by mentioning how the stakes were pretty high in the financial markets.

  • Fines imposed on corporations can be disproportionately large compared to the amount of money made committing the offence.

 

  • The number of regulators in the market is high. Since these companies operate internationally across different markets, they have to deal with multiple regulators in each of those markets.

The main thesis the speaker wanted his audience to take away was that “building the data pipeline is much more difficult than applying a machine learning algorithm on it”. This sounded ridiculous at the start, especially to someone working on ML algorithms in a research setting, but as the talk progressed the thesis started to make sense.

The key challenges faced by their team, which I am extrapolating to be true across the FinTech world, are as follows:

  1. Diverse data sets and formats: Data is collated from diverse sources, and each of these sources follows its own format. For example, the European desk may follow a different schema from the Tokyo desk.
  2. Scale: The data they have to deal with is huge, and it is updated very frequently.
  3. Data from the past can change: This can happen in multiple scenarios, one of them being a manual trade correction. The pipeline needs to be built so that such changes can flow through it without breaking the system.
  4. Surveillance decisions need to be debuggable: This is mostly because of regulation. One fine day the feds may come knocking on your door asking ‘Why was trade X on Oct 25 2015 not flagged?’, and you need to have enough data to explain the decisions taken by your algorithm.
  5. Time guarantees: The system is not real time, but some time guarantees are needed (say T+1).

There are multiple lines of defence. Each transaction goes through the stack shown below, but generally speaking you don’t want the SEC coming to knock at your doorstep (public safety tip).

Auditing Stack

The architecture that the team uses to overcome these challenges is shown below.

Architecture

 

The preprocessing pipeline is basically a MapReduce job. Each of the flattened tables is an HBase table.

This architecture combats the key challenges in the following manner:

  • Diverse datasets & formats -> Preprocessing + common format
  • Size of data -> Hadoop + preprocessing
  • Data from past changes -> HBase + versioning (see the sketch after this list)
  • Decisions need to be debuggable -> Bookkeeping
  • Time guarantees -> MapReduce
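The talk did not go into implementation details, so here is a rough sketch, entirely my own and in plain Python rather than Hadoop/HBase, of two of the ideas above: flattening desk-specific schemas into a common format, and keeping every version of a record so that late corrections never overwrite history. All field names and desk schemas below are made up for illustration.

```python
# Illustrative sketch only -- the real pipeline is a MapReduce job writing to
# HBase tables; here the same two ideas are mimicked in plain Python.
from collections import defaultdict

def to_common_format(record, desk):
    """'Map' step: flatten a desk-specific schema into one common schema."""
    if desk == "europe":                      # hypothetical desk schemas
        return {"trade_id": record["id"], "qty": record["quantity"],
                "price": record["px_eur"], "desk": desk}
    elif desk == "tokyo":
        return {"trade_id": record["TradeNo"], "qty": record["Size"],
                "price": record["PriceJPY"], "desk": desk}
    raise ValueError(f"unknown desk: {desk}")

class VersionedTable:
    """HBase-style cell versioning: writes append a new version, reads return the latest."""
    def __init__(self):
        self._cells = defaultdict(list)       # trade_id -> [(version, row), ...]
        self._version = 0

    def put(self, row):
        self._version += 1
        self._cells[row["trade_id"]].append((self._version, row))

    def get_latest(self, trade_id):
        return self._cells[trade_id][-1][1]

    def history(self, trade_id):              # every version is kept, for debuggability
        return self._cells[trade_id]

table = VersionedTable()
table.put(to_common_format({"id": "T1", "quantity": 300, "px_eur": 10.5}, "europe"))
table.put(to_common_format({"id": "T1", "quantity": 300, "px_eur": 10.7}, "europe"))  # manual correction
print(table.get_latest("T1")["price"])        # 10.7, while the 10.5 version remains queryable
```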

Spoofing: a case study

The problems that the surveillance team seeks to remedy are diverse; spoofing is a good representative example of this set.

What is spoofing:

  • You want to sell 300 shares of XYZ at a high price.
  • Artificially inflating the price before you sell would earn a nice profit.
  • Exchanges allow you to submit orders and later cancel them. If you first submit many ‘spoof’ (fake) buy orders, you create an illusion of buying pressure that drives up the price.
  • You then sell your 300 shares of XYZ and quickly cancel the many fake buy orders.
One of the unique challenges in detecting such transactions is the lack of training data, so supervised learning cannot be used in this scenario. The Goldman Sachs team analysed 6 regulatory enforcement cases and identified 4 characteristic features of transactions that attempt spoofing (a rough sketch of the first two follows the list):
  • Order imbalance
  • Time to cancel post execution < 1 sec
  • Level of profit generated by the trades
  • Marketability
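The model itself was not shared, but to make the features concrete, here is a hypothetical sketch (my own simplification, not the GS detector) of how the first two features could be computed for one trader's activity in a single symbol; all field names and thresholds are assumptions.

```python
# Hypothetical feature computation for spoofing detection -- illustrative only.
from datetime import datetime, timedelta

orders = [
    # Illustrative order records for one trader in one symbol.
    {"side": "buy",  "qty": 5000, "status": "cancelled",
     "placed":    datetime(2015, 10, 25, 10, 0, 0, 100000),
     "cancelled": datetime(2015, 10, 25, 10, 0, 1, 400000)},
    {"side": "sell", "qty": 300, "status": "executed",
     "placed":   datetime(2015, 10, 25, 10, 0, 0, 900000),
     "executed": datetime(2015, 10, 25, 10, 0, 1, 0)},
]

# Feature 1: order imbalance -- how lopsided is the submitted volume?
buy_qty = sum(o["qty"] for o in orders if o["side"] == "buy")
sell_qty = sum(o["qty"] for o in orders if o["side"] == "sell")
imbalance = (buy_qty - sell_qty) / max(buy_qty + sell_qty, 1)

# Feature 2: time to cancel after the genuine order executed (< 1 sec is suspicious).
execution_time = max(o["executed"] for o in orders if o["status"] == "executed")
cancel_delays = [o["cancelled"] - execution_time
                 for o in orders if o["status"] == "cancelled"]
fast_cancel = any(timedelta(0) <= d < timedelta(seconds=1) for d in cancel_delays)

# Profit and marketability would additionally need price/quote data, omitted here.
print(f"imbalance={imbalance:.2f}, cancelled_within_1s={fast_cancel}")
```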

No model for you
The model developed by GS is proprietary and not meant for public consumption (booo)

This talk provided me with key insights into how the FinTech world works; I hope this blog post does the same for you. A lot of my assumptions were challenged. For instance, who knew that markets around the world could not agree on time, creating the missing-millisecond problem?
Apparently, some exchanges do not record milliseconds when timestamping trades, which wreaks havoc on algorithms such as the one GS developed to detect spoofing: in the world of algorithmic trading the market changes in milliseconds, and the detector needs millisecond-resolution timestamps to classify a trade as spoofing. Saved by a millisecond.
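As a toy illustration of why that resolution matters (my own example, not from the talk), consider the ‘cancelled within 1 second of execution’ feature when an exchange drops the milliseconds:

```python
# Toy example: second-resolution timestamps blur the "time to cancel < 1 sec" test.
from datetime import datetime

executed  = datetime(2015, 10, 25, 10, 0, 1, 950000)   # 10:00:01.950
cancelled = datetime(2015, 10, 25, 10, 0, 2, 100000)   # 10:00:02.100
print((cancelled - executed).total_seconds())           # 0.15 -> clearly under 1 second

# The same events as an exchange without millisecond timestamps would report them:
executed_s  = executed.replace(microsecond=0)           # 10:00:01
cancelled_s = cancelled.replace(microsecond=0)          # 10:00:02
print((cancelled_s - executed_s).total_seconds())       # 1.0 -> no longer clearly under 1 second
```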

Why The Machines Won’t Win – This is a story of human ambition.

 

I have been working on the bleeding edge of AI technology for more than 4 months now. I am working on a project that seeks to further the state of the art in conversational agents, and I have been critically reading and critiquing research papers published in top-tier conferences like NIPS, ICLR and AAAI. I state all of this to establish that I know what I am talking about.

Whenever I tell someone I am working in deep learning, almost inevitably the conversation veers off to their fears of how technology is beating mankind, how this is the apocalyptic prediction coming true, and some version of the suggestion that the progress we’ve made is good enough and we must slow down, if not stop altogether. A lot of this stems from the way AI advancements are portrayed by the media: ‘Last bastion of human intelligence falls’, ‘Man vs Machine’, ‘How the Computer Beat the Go Master’, ‘The battle between humans and machines’.

What they do not tell you is that humans built the machine in the first place! So it is not AlphaGo that beat Lee Se-dol at the game of Go but the engineers at DeepMind. This is a story of human triumph, not defeat. There is no need for skepticism or cautious optimism; we should be celebrating these advancements and welcoming them with open arms.

The machines are not out to get you; these technologies are going to augment human ability, as technology has been doing since the Stone Age. Advancements in artificial intelligence make us smarter and more productive. Better search results help you find the information you need faster, more accurate speech recognition allows you to go hands free, and more reliable machine translation helps you consume information that would otherwise have been outside your reach. Technology and humans are most productive when their respective strengths are combined to give rise to an unstoppable force. Machines have memory, humans have experience; machines have precision, humans have compassion and empathy; machines bring the tools, humans bring the purpose.

These tools, combined with human ambition, perseverance and the superpower to dream big, are going to open doors to worlds we could not have imagined in our wildest dreams. So the next time you hear about an AI advancement (which is most definitely going to be tomorrow, given the crazy pace of development), instead of fearing it, think of how you can use it to augment your efforts in making the world a better place.


Effective communication for scientists

This week I had the unique opportunity to attend a talk by National Geographic magazine’s acclaimed photographer Anand Verma at the National Centre for Biological Sciences in Bangalore.

It was an amazing talk that featured Anand’s work for the National Geographic magazine, some of it still unpublished. He spoke about how he was training to be a biologist at Berkeley and took up photography as a summer job; one thing led to another, and he found that he could do the most fun parts of a biologist’s job while being a photographer.

He talked about how his projects have covered abstract, even revolting, topics such as host-manipulating parasites and bats*, and how his job is to use beauty as a weapon against apathy. The power of the image is to hold the viewer’s attention and make them realise, in those few precious seconds, that facts matter and the world is more beautiful than any of us can imagine.

Photography is not just whipping out your camera and taking a photo. A lot of research goes into each assignment, with months of background reading and observing the subjects in their habitat, so much so that you know how they will react in a given situation and can place your equipment accordingly. Each assignment is like a postdoc for him.

He found that photography is not just a medium for disseminating information but can also act as a means of discovery. Using his skills and equipment he was able to photograph young honeybees as they developed from larvae into bees. The images showed newly formed bees exchanging oral fluids; the scientists had not known that this exchange happened at such an early stage of a bee’s life.

After the talk I went up to him for a short question-and-answer round. Since he has been acting as a conduit of information between scientists and the general public, I asked what his advice would be to scientists who want to get better at communicating.

He said that, for him, the key to communication is knowing your audience. Most scientists are trained to communicate amongst themselves through papers, conferences and seminars, so when they are faced with a new audience, such as high school students, they are stumped. They end up using their subject jargon and lose the attention of their audience. It is important to know which details are important and which are just additions; once you figure that out, remove all the distractions that are not needed so that the audience can focus on the important details. He used a story on bats that he has been covering as an example. When he talks about that story he gauges the audience first, so if only 5% of the audience is ‘bat scientists’ he leaves out most of the technical details. But it is a dynamic process: suppose in his presentation he is showing a photograph that is bound to hold the attention of the audience for the next 5 seconds, he may add certain titbits that the bat scientists would truly appreciate, while at the same time not losing the rest of the audience.

I have tried to present an interpretation of the talk that may be helpful to us as computer scientists. If you would like to know the details of the work he covered in the talk, feel free to ping me. The talk surely converted me into a NatGeo magazine subscriber.

*Below I present some representative images of Anand’s work; all rights to these images belong to Anand and the National Geographic magazine.

cricket_worm
A host-manipulating parasitic worm forces its host cricket to seek out a body of water and drown itself so that the worm can emerge in an aquatic environment.
Lady_bug
The parasite forces the host ladybug to stand guard over its cocoon until it is ready to come out into the world.
honey_bees
Photograph of honeybees developing from larvae into bees, which amazed the scientists.

 

Google Brain AMA Learnings

Last year the Google Brain team organised an Ask-Me-Anything (AMA) on Reddit. It is an amazing AMA which I encourage everyone to read. However, in case you do not have the time to go through the whole thing, I present some of the key takeaways and learnings from the AMA below.

“our research directions have definitely shifted and evolved based on what we’ve learned. For example, we’re using reinforcement learning quite a lot more than we were five years ago, especially reinforcement learning combined with deep neural nets. We also have a much stronger emphasis on deep recurrent models than we did when we started the project, as we try to solve more complex language understanding problems.”

“Machine learning is equal parts plumbing, data quality and algorithm development. (That’s optimistic. It’s really a lot of plumbing and data :).“

Underrated methods:

  • Random Forests and Gradient Boosting
  • Evolutionary approaches
  • The general problem of intelligent automated collection of training data
  • Treating neural nets as parametric representations of programs, rather than parametric function approximators
  • NEAT
  • Careful cleanup of data, e.g. pouring lots of energy into finding systematic problems with metadata

Exciting Work:

  • The problem of robotics in unconstrained environments is at the perfect almost-but-not-quite-working spot right now, and that deep learning might just be the missing ingredient to make it work robustly in the real world.
  • Architecture search is an area we are very excited about. We could be getting to the point where it may soon be computationally feasible to deploy evolutionary algorithms in large scale to complement traditional deep learning pipelines.
  • Excited by the potential for new techniques (particularly generative models) to augment human creativity. For example, neural doodle, artistic style transfer, realistic generative models, the music generation work being done by Magenta.
  • All the recent work in unsupervised learning and generative models.
  • Anything related to deep reinforcement learning and low sample complexity algorithms for learning policies. We want intelligent agents that can quickly and easily adapt to new tasks.
  • Moving beyond supervised learning. I’m especially excited to see research in domains where we don’t have a clear numeric measure of success. But I’m biased… I’m working on Magenta, a Brain effort to generate art and music using deep learning and reinforcement learning.

Resources:

  • https://keras.io/ : Keras is a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research. (A minimal example appears after this list.)
  • http://www.arxiv-sanity.com/ : Get the best of arXiv; also find similar papers according to tf-idf
  • /r/MachineLearning
  • https://nucl.ai/blog/neural-doodles/ : Neural Doodles!!
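As promised above, here is a minimal, entirely generic Keras sketch of the define-compile-fit workflow; the data is random and purely illustrative, and none of it comes from the AMA itself.

```python
# Minimal illustrative Keras example: define, compile, and fit a tiny classifier
# on random data. It only demonstrates the API, nothing more.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(1000, 20)                    # 1000 fake samples, 20 features
y = (X.sum(axis=1) > 10).astype("float32")      # toy binary labels

model = Sequential([
    Dense(32, activation="relu", input_shape=(20,)),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))          # [loss, accuracy]
```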

Difference between Reinforcement Learning and Supervised Learning

During the introductory class of our Neural Networks course a classmate asked me this question. It is a really good question. I thought I knew the answer, until I sat down to write it. So I dug a little deeper and came across the paper “Reinforcement Learning and its Relationship to Supervised Learning” by Andrew G. Barto and Thomas G. Dietterich. A tl;dr version from my understanding can be found below:

http://www-anw.cs.umass.edu/pubs/2004/barto_d_04.pdf

 

Supervised Learning

In supervised learning, the learner is given training examples of the form (x_i, y_i), where each input value x_i is usually an n-dimensional vector and each output value y_i is a scalar (either a discrete-valued quantity or a real-valued quantity). It is assumed that the input values are drawn from some fixed probability distribution D(x) and the output values y_i are then assigned to them.

A supervised learning algorithm takes a set of training examples as input and produces a classifier or predictor as output.

The best possible classifier/predictor for data point x would be the true function f(x) that was used to assign the output value y to x. However, the learning algorithm only produces a “hypothesis” h(x). The difference between y and h(x) is measured by a loss function, L(y, h(x)).  The goal of supervised learning is to choose the hypothesis h that minimizes the expected loss.
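As a minimal sketch of this setup (my own example, not from the paper), here is supervised learning with a linear hypothesis class and a squared-error loss, fit by gradient descent on the empirical loss:

```python
# Supervised learning sketch: choose the hypothesis h minimizing average loss
# over training pairs (x_i, y_i). Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 3))          # inputs x_i drawn from some D(x)
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)    # outputs assigned by the "true" f(x), plus noise

# Hypothesis class: linear predictors h(x) = x . w; loss: L(y, h(x)) = (y - h(x))^2.
w = np.zeros(3)
lr = 0.1
for _ in range(500):                           # gradient descent on the empirical loss
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= lr * grad

print("learned w:", w)                         # close to true_w, i.e. h approximates f
```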

 

Reinforcement Learning: 

RL comes in when examples of desired behaviour are not available but it is possible to score examples of behaviour according to some performance criteria.

For example, if you are in an area of poor cellular network coverage, you move around and check the signal strength. You keep doing this until you find a place with adequate signal strength, or the best place available in the given circumstances. Here the information we receive does not tell us where we should go or in which direction we should move to obtain a better signal. Each reading just allows us to evaluate the goodness of our current situation. We have to move around and explore in order to determine where we should go.

Given a location x in the world and R(x), the reward at that position, the goal of RL is to determine the location x* that maximizes R and yields the maximum reward R(x*). An RL system is not given R, nor is it given training examples; instead it has the ability to take actions (choose values of x) and observe the resulting reward R(x).

RL combines search and long-term memory. Search results are stored in such a way that search effort decreases, and possibly disappears, with continued experience.
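Here is a toy sketch of the signal-strength example (entirely my own, not from the paper): the agent is never told which location is best; it can only try a location and observe R(x), and remembering past readings means less searching over time.

```python
# Toy reward-maximization sketch: explore locations, remember observed rewards,
# and increasingly exploit the best-known spot (epsilon-greedy).
import random

random.seed(0)
true_signal = {loc: random.random() for loc in range(10)}   # hidden reward function R(x)

def measure(loc):
    """Take an action: stand at `loc` and read the (noisy) signal strength."""
    return true_signal[loc] + random.gauss(0, 0.05)

memory = {}        # long-term memory: remembered signal estimate per location
epsilon = 0.2      # fraction of the time we keep exploring

for step in range(200):
    if not memory or random.random() < epsilon:
        loc = random.choice(range(10))            # explore a new reading
    else:
        loc = max(memory, key=memory.get)         # exploit the best-known location
    r = measure(loc)
    memory[loc] = memory.get(loc, r) * 0.9 + r * 0.1   # update the remembered estimate

print("best location found:", max(memory, key=memory.get))
print("true best location: ", max(true_signal, key=true_signal.get))
```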

 

Difference:

1. In RL there is no fixed distribution D(x) from which the data points x are drawn.

2. The goal in RL is not to predict the output values y for a given input x, but to find a single value x* that gives the maximum reward.

 

super tldr;

Reinforcement Learning: Examples of correct behaviour not given, but ‘goodness of current situation’ known. –> Maximize unknown reward function.

 

Supervised Learning: Examples of correct behaviour given; find the hypothesis function h which best maps input to output, while taking care to avoid overfitting.