Analysis Summary

Worth Noting — positive elements
- This video provides a detailed look at the architectural decisions and specific Clojure libraries (like Flop) used to handle real-time personalized news feeds at scale.

Be Aware — cautionary elements
- The presentation frames 'rolling your own' abstractions as universally superior, which may downplay the significant maintenance burden and 'bus factor' risks for smaller or less specialized teams.
Transcript
We're talking today about why we use Clojure and why we think we go faster with it. It's mostly going to be a software-engineering-heavy kind of talk. There'll be a few parts where it gets slightly technical, but don't worry too much if you don't understand certain things — we're bringing them up for the engineering point.

The main thing we're here to talk about is making these very tiny, fine-grained abstractions as libraries and building up your code that way. That's really what allows us to go very fast. At the beginning you pay some dues, because you build some things where some people might say, "why would you ever write that in-house? Just use X instead." But shortly afterward, as you start to tackle more and more complicated problems, these little abstractions you've built up come in handy in a way they wouldn't from a framework. And you can see these things evolving over time. Hadoop, for example: that whole ecosystem was originally — as I understand it — a terrible, gnarly codebase that Doug Cutting put together, which was then factored out over time into several different projects; now they've pulled out ZooKeeper and so on. Even things that start out very monolithic head in this direction over time, and it's to their detriment when they don't.

Clojure is particularly well-suited to this. I think functional programming in general is, but Clojure in particular is very good for some specific reasons we'll go into later, in terms of the syntax, simplicity, et cetera. For that reason we love Clojure and use it almost exclusively — we have almost no code in any language other than Clojure, with the exception of JavaScript, obviously, and Objective-C for the iPhone apps.

So, about us: what is Prismatic? What does this thing do? As Alex was saying, it's a content discovery engine — a sort of post-search, machine-learning-oriented kind of thing. It's a consumer product, and it's meant to be basically the newspaper of the world, so it's meant to grow to very, very large scale, and what it does is quite complicated. We've thought from the beginning about how to cut corners, and also how to make sure we don't paint ourselves into a corner — which would be very easy with a system like this — such that it wouldn't scale past, say, a hundred thousand users.

We personalize feeds based on your interests. We learn about your interests — I forgot to mention this — from Twitter, Facebook, et cetera. You sign up, we learn about you, we give you feeds based on your interests. It's a pretty simple proposition. Then we let you explore out to new interests, which is also really important: that means we have to tackle search, we have to know similarities between topics and so on, and let you navigate along those edges. We'll get into all of that with a live demo at the end.

Here's a quick snapshot of what the product looks like, so you can get an idea. It's very minimal and well designed — we focus a lot on design. You basically have a core news feed; nothing revolutionary there, it's just showing you articles. But notice a couple of interesting things. On that second article you have the ten related stories — that's clustering, which we'll get into in a bit, and that's very, very hard to do the way we do it. You can also see that all the articles have been labeled with their topics, in the lower-left corner of the header, and those are basically correct for the most part. You can even see here on the bottom we have bowling and interior design — you're not going to find a lot of systems in the world that will show you articles at the intersection of those, but Prismatic will. That's what it's all about.

So who builds this thing? It's me — it's just me, really. No, I'm just kidding, it's not. I actually
don't do anything; these three are the smart ones, and this is it — this is our whole Clojure team. We obviously have front-end, product-facing work, we have designers and so on, but the whole back end is built by these four people, and mostly these three. I don't do any of the fun stuff anymore; these guys get to do all the fun stuff, and all I do is things like invite-campaign code, because I can't focus long enough to tackle the hard things — so I only get trivial bug fixes now. It's awesome. And we screen for that during the hiring process.

So what do we build? We build these crawlers, which are somewhat like traditional web crawlers but a bit different: they mostly only crawl social networks, and since all the activity we're going after is changing all the time, it's a mix of crawling and polling. You're constantly finding new things, and you're constantly polling them to see what's getting updated — and we poll smartly, based on how often things change. The crawlers are, in a sense, a far more efficient way of getting all of Twitter. We have most of Twitter in memory on servers, but we don't need to take the firehose. The firehose has lots and lots of additional spammy data — most of it is just rubbish, and you'd rather not deal with processing it and learning how to drop it on the floor if you don't have to. Better not to have it come in at all. By crawling smartly, we've found ways to avoid having a lot of the rubbish come in in the first place.

Social graph analysis: this is just understanding who you are, how you're connected to your friends, and how that relates to your interests, because we use all of that for ranking. We also use it for bootstrapping — if any of you have signed up, the initial onboarding suggestions are largely based on both your social graph and the stuff you share. Topic modeling you've seen: we apply labels to documents, as we showed in the first part. And then relevance ranking: all this information is, at the end of the day, taken together to re-rank these feeds for you, on each request, in real time — news feeds re-ranked on every request. We'll get into that in a bit; it's pretty tough to do.

OK, so news feeds. The first thing we have to do to build news feeds is, as documents come in, determine whether we've seen this before or not. If we haven't, we need to do a lot of work on it: extract all the entities, do all the tokenization and everything like that, apply topic labels, cluster it and see whether it joins an existing cluster or starts a new one, et cetera. So there are all these things happening in this real-time indexing pipeline. I already mentioned the clustering; the real-time personal ranking is interesting. Every time you make a request, we have all the information about you in memory on the machine, which also has all the documents you could potentially see in that feed. So if you make a request — say for a topic feed, because we have feeds by topic, or for your home feed — we take all this information into account, along with the interests you explicitly add in the app, and all of it is used to personally rank the feed. It's a really interesting problem, because there are social signals and there's information about your interests.
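The diversification idea that comes up next — decaying an interest's weight each time it contributes a story, so the feed stays mixed across interests — can be pictured with a toy greedy ranker. This is a purely hypothetical sketch (names, weights, and the decay rule are mine, not Prismatic's actual ranking algorithm):

```java
import java.util.*;

// Hypothetical sketch: greedily pick the story whose interest has the
// highest current weight, then decay that interest's weight so other
// interests get a turn. Not Prismatic's real ranker.
public class DiversifyFeed {
    record Story(String title, String interest) {}

    static List<Story> rank(List<Story> candidates,
                            Map<String, Double> weights, double decay) {
        Map<String, Double> w = new HashMap<>(weights);
        List<Story> pool = new ArrayList<>(candidates);
        List<Story> feed = new ArrayList<>();
        while (!pool.isEmpty()) {
            Story best = Collections.max(pool, Comparator.comparingDouble(
                (Story s) -> w.getOrDefault(s.interest(), 0.0)));
            feed.add(best);
            pool.remove(best);
            w.merge(best.interest(), decay, (cur, d) -> cur * d); // decay winner
        }
        return feed;
    }

    public static void main(String[] args) {
        List<Story> candidates = List.of(
            new Story("monads", "clojure"), new Story("reducers", "clojure"),
            new Story("splits", "bowling"), new Story("sofas", "design"));
        Map<String, Double> weights =
            Map.of("clojure", 1.0, "bowling", 0.6, "design", 0.5);
        for (Story s : rank(candidates, weights, 0.5))
            System.out.println(s.interest() + ": " + s.title());
    }
}
```

Even though "clojure" starts with the largest weight, the decay interleaves the other interests instead of front-loading all the Clojure stories.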
One of the things we've found that's really neat about interests: for the home feed to really be a new experience — and not just, you know, a rejiggered version of your Twitter feed, which is what's been done a lot but not really where we're trying to go — one of the biggest problems is diversifying over your interests. I think a lot of us in this room probably follow lots of tech stuff, whether technical, or about startups, or tech as a business in general, so we see lots and lots of that kind of thing. And if I just give you a better version of that, well, that's not really very cool, to be honest — there are so many other things out there for that. One thing that could be cool is getting, like, a Clojure feed and a Haskell feed and a functional programming feed, and you can't get that somewhere else — that's pretty cool. But what's even cooler is being able to diversify over all your different interests and make sure you're getting a good mix of other stuff in there. So decaying weights as you include stories in a feed, such that you can see things across all your interests, is a really important problem.

The real-time personalized ranking is actually a really, really big deal. It's very difficult to get right, and it's an example of a problem our team is well-suited to deal with, because the people running product and the people running research are the self-same people. We get to think about the kind of vibe we're trying to create, and that comes back around and influences the research, which is really cool. And of course it all happens very fast: all that re-ranking happens in under 200 milliseconds. Obviously it's all in memory, and it's all very, very customized — so we tend to roll our own.

So yes: libraries are greater than frameworks. I think everybody mostly agrees with this at this point; I don't think it's a very controversial point. Don't take your precious functions and give them off to some weird thing to run that you don't understand — it's a terrible idea. And we are almost entirely Clojure; we really don't use hardly any Java at all. This is important to stress, because at the beginning that wasn't the case, but we found over time that you really don't need much: with some clever use of macros and some of the recent changes to Clojure over the past year, you hardly need any Java at all.

The first thing we're going to talk about is flop, and this is really hardcore fast math. This is the area where you'd think, yeah, you just have to drop out of Clojure — but we don't at all. It's basically 100% Clojure, with a tiny bit of Java for one particular thing that might be able to go away; we haven't revisited it in a while. Often, when you're doing fast math for machine learning, you're doing things with mutation. It's a necessary evil, but it can be a pain in the ass in Clojure sometimes. Here's an example of something really simple, and the big thing I don't like here is the indexing: manually indexing into collections when you're processing two of them in parallel is a really tedious, error-prone thing that I'd rather not do, to be honest. I should mention the specifics: we're just adding 5 plus j to the j-th element of v here.

Now let's look at it in flop. Even when you're type hinting, we've found you can't always get efficient code — there's nothing in Clojure that guarantees something will run fast, nothing that will blow up and barf at you if you're causing reflection. But for some of our core code, we actually want it to blow up if it's going to yield code that reflects.
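For reference, the kind of manually indexed loop the talk is complaining about looks roughly like this in plain Java (the method and variable names are illustrative, not from the talk's slides):

```java
import java.util.Arrays;

// Illustrative only: add (5 + j) to the j-th element of v -- the style
// of hand-indexed loop the talk calls tedious and error-prone.
public class Fill {
    static void addIndexOffset(double[] v) {
        for (int j = 0; j < v.length; j++) {
            v[j] += 5 + j;   // manual index bookkeeping lives in your head
        }
    }

    public static void main(String[] args) {
        double[] v = {0.0, 0.0, 0.0};
        addIndexOffset(v);
        System.out.println(Arrays.toString(v));
    }
}
```

With one array this is trivial; the talk's point is that once you iterate two or three arrays in parallel, keeping the indices straight is exactly what the flop macros hide from you.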
We don't want to run that code — so one of the things we've done in flop is exactly that. All these core macros, like the a-fill you see here, will puke if there's reflection inside them; you can't even run the code if it reflects. And you see we weasel out of the indexing, which is really all you want: get rid of all the syntax you can, and get it behind some place where you can smartly check for certain things and have it barf at you for, for example, reflection — that's the big thing you want to avoid. This is one of the rare uses of macros we have; we don't use macros much in general. They're mostly bad, except when you need them — all the typical advice.

And here's our do-array; this is the guy that's used all over the place. You can see he's doing these bindings, and here we're just printing them out. We can even bind the index if we need it to access data in some way in the computation. It's just like a doseq, except you don't want to be Cartesian here — you just want to go over the two sequences in parallel, which is what you always do.

Here's the canonical example: a dot product. We're going to get into a little math for a couple of slides, but it's going to be pretty light, and if you get lost at any point, don't worry about it — this part is really simple. The dot product is just the sum of the pairwise products of two vectors — math vectors, not Clojure vectors. We use it all the time in machine learning. In prediction, all we're trying to do is select whatever maximizes the dot product with some weight vector — that's really all. And we do it all the time in training too, where probabilities are often estimated with exponentiated dot products — which means even more dot products. The whole point: dot products happen all the time, they're in the inner loop, and they have to be very fast.

Here's the dot product in native Clojure. It's not that big a deal, but the agets I'd rather not have, and the areduce is a bit funky — it's one of those functions where you're never quite sure what the argument order is; you have to look it up in the docs, and that's not cool. And here it is in native Java — the situation is no better. I've been writing this kind of code for a very long time, and I really hate the indexing part: it's a pain in the ass and it's error-prone. We're using really simple examples to illustrate the point — this one is trivial, you're just multiplying two numbers with two indexes — but if you have something really complicated, this starts to become non-trivial and it's really easy to screw something up. So it's not just about speed; it's also about correctness. In flop, this is all we have — it looks like the math. That's about as close to the math as you're going to get: you're summing over all the w's from ws and the x's from xs, taking the products. That's it.

A little more complicated example: expected log probabilities. Here we're taking these psis; each psi is the expectation of the log of theta-i given alpha. Don't worry if you don't understand what that means, and don't worry about the fancy notation. The important part is the funky function below, digamma: for each psi we're taking the digamma of alpha-i minus the digamma of the sum of all the alphas. It might look a little funky — don't worry about that. The important part is that this digamma function is really expensive to compute. So now we're in a very similar situation to the dot product: a pretty trivial-looking top-level calculation, except that digamma guy is pretty gnarly — that's not a multiplication. You can write this kind of thing in flop again, and it's going to be very fast, and that's very important, because this is the inner loop of topic modeling — you do this about as much as you do dot products, but the digamma function is very expensive to compute, and it's only an approximation as it is. So these things were always pretty burly.

That right there is basically what the core of a lot of our code looks like — the code underlying a lot of our machine learning libraries: very simple series of functions like this, written in terms of flop abstractions, encapsulating the details of some more difficult function behind a simple call. This flop library achieves more or less the same performance as Java. It's so fast that we have no problems with any of this stuff anymore, and best of all, you can write a lot of really complicated code in a very small number of lines: optimization under 180 lines, LDA-style topic modeling under 180 lines. If you've seen this kind of code before, you know this stuff is hard — doing it in 180 lines is pretty spectacular. And it's great for a number of reasons besides speed: it's much easier to verify that things are correct this way, and this kind of code is very hard to test — you can't just unit-test this stuff and assume it's going to be fine.
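The flop version of that slide isn't shown in the transcript, but the math is standard: psi_i = digamma(alpha_i) − digamma(sum of alphas), the expected log of a Dirichlet component. Here's a hedged Java sketch using a textbook digamma approximation (recurrence plus asymptotic series); class and method names are mine, not from the talk:

```java
// Sketch of the expected-log-probability computation the talk describes.
// Illustrative only -- the talk's version is written with flop macros.
public class ExpectedLogProb {
    // digamma via psi(x) = psi(x + 1) - 1/x, then an asymptotic expansion;
    // this is the "really expensive" inner-loop function the talk mentions.
    static double digamma(double x) {
        double result = 0.0;
        while (x < 6.0) {        // shift x up until the expansion is accurate
            result -= 1.0 / x;
            x += 1.0;
        }
        double inv = 1.0 / x, inv2 = inv * inv;
        result += Math.log(x) - 0.5 * inv
                - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252));
        return result;
    }

    // psi_i = E[log theta_i | alpha] for a Dirichlet(alpha) distribution
    static double[] expectedLogProbs(double[] alpha) {
        double sum = 0.0;
        for (double a : alpha) sum += a;
        double dgSum = digamma(sum);              // hoist the shared digamma
        double[] psi = new double[alpha.length];
        for (int i = 0; i < alpha.length; i++)
            psi[i] = digamma(alpha[i]) - dgSum;
        return psi;
    }

    public static void main(String[] args) {
        // Dirichlet(1,1) is uniform on the simplex; E[log theta_i] = -1 exactly
        double[] psi = expectedLogProbs(new double[]{1.0, 1.0});
        System.out.println(Math.abs(psi[0] + 1.0) < 1e-5
                        && Math.abs(psi[1] + 1.0) < 1e-5);
    }
}
```

The trivial-looking top level (a subtraction per component) hiding an expensive special function is exactly the shape of problem the talk says flop handles without dropping to Java.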
This kind of stuff is very, very difficult to test, and a lot of the time it's probabilistic, so it's not even clear you can write a deterministic test — and you don't necessarily want to mock out all kinds of things to try to make it deterministic; that's not always a good idea. So it's very, very valuable to distill the code down to the point where you can read it and verify it's correct, one screen at a time. That's flop in a nutshell.

Now we'll go into some of our plumbing — some storage and compute abstractions that we have. Store is the first one. Store is pretty cool, actually; we use it all the time, and it's basically just key-value: an abstraction over a key-value store that can be backed by memory, the filesystem, S3, BDB, SQL — we've had everything you can possibly imagine behind this interface. We use the specific features of the underlying store, obviously; if there's a smarter way to update, or custom ways to flush, or whatever, we do things efficiently on a store-by-store basis. This is very, very useful, and we did it because of previous experience. Everybody knows get and put — that's easy to do; we call a key-value store a bucket, and that's our primitive. But the big deal is the updates. People who have worked with this stuff a lot know that the difficult thing is really the read-modify-write pattern that comes up all the time. Get and put are not such a big deal, but update is. So we have this merge bucket, which is another one of the protocols that buckets can reify, and merge basically takes care of this for you: at the time you construct your bucket, you inject it with a merge function, and that merge function allows you to merge data. In the same way you could do get or put, you can do merge, and merge is going to be smart about where the data is, how to merge it in, et cetera.

Because the thing that happens for us, obviously, is that we have huge amounts of streaming data always coming in, which we're aggregating, and you don't want to be re-reading and re-writing — even if the store you're dealing with under the hood is supposedly efficient about this kind of thing — because it doesn't know the structure of your data like you do. If you have a key-value situation and you have some funky map or other data structure that you're merging, you'd rather do that in memory: keep it in an in-memory immutable hash map and flush it periodically, rather than round-tripping to the data store and relying on it to do that for you. Also — I forgot to mention sync; we'll come back to it in a couple of slides. Sync is a big deal too: it has to deal with all the buffering and flushing, as well as all the caching and checkpointing kinds of operations that come up.

The other thing I should mention is that it's really useful to get this stuff behind a protocol, because then you can abstract the network away. Often a key-value store — or any kind of store, for that matter, whether document or one of these various hybrid stores that are coming up — doesn't have built-in network facilities, and if it does, they're not always consistent. The reason we put all of our data stores behind an API is that we can then implement those APIs for the network rather than for each individual store. Any time you hook up a new store and implement the protocol, you get everything for free, which is really nice. It also means we can deal with all the robustness and error-handling in one place instead of per store. And it means you can easily swap these things — the in-memory one is great for exactly that reason.
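The shape of that idea — get/put plus a merge function injected at construction time — can be sketched like this in Java (the interface and names here are illustrative, not Prismatic's actual store API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

// Illustrative "bucket" sketch: a key-value protocol whose merge
// operation handles the read-modify-write pattern for you.
interface Bucket<K, V> {
    V get(K key);
    void put(K key, V value);
    void merge(K key, V value);   // combine with the existing value, if any
}

class MemoryBucket<K, V> implements Bucket<K, V> {
    private final Map<K, V> data = new HashMap<>();
    private final BinaryOperator<V> mergeFn;   // injected at construction

    MemoryBucket(BinaryOperator<V> mergeFn) { this.mergeFn = mergeFn; }

    public V get(K key) { return data.get(key); }
    public void put(K key, V value) { data.put(key, value); }
    public void merge(K key, V value) {
        data.merge(key, value, mergeFn);       // bucket decides how to combine
    }
}

public class BucketDemo {
    public static void main(String[] args) {
        // merge-with-plus: counts accumulate instead of overwriting
        Bucket<String, Integer> counts = new MemoryBucket<>(Integer::sum);
        counts.merge("clojure", 1);
        counts.merge("clojure", 2);
        System.out.println(counts.get("clojure"));
    }
}
```

Because callers only see the `Bucket` interface, a networked or disk-backed implementation can be swapped in without touching any calling code — which is the point the talk makes about putting stores behind a protocol.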
We use the in-memory one all over the place in testing. It's a really easy way to do system testing for really complicated things, where you're composing a number of different services together: you have the network faked out, you have your storage faked out, and everything is just running in one process. When something is very complicated and you can get it running in a simple, in-memory fashion in one address space, that's beautiful, and it makes it way easier to verify that things are correct.

Abstracting over buffer-and-flush policies is super, super important. I'm sure many of you have written this code a whole bunch of times if you've worked with data much — we all have — and it tends to be done on a case-by-case basis in different places. It happens for messaging: for example, ZeroMQ — I don't know if you've heard of it, it's this supposedly fast queue, and they say on their webpage that it's faster than TCP. And I thought: faster than TCP? What the hell are these guys doing? So I went to look — oh, they're buffering. They're buffering messages and sending them over TCP. There's nothing new there. But the point is, we do that all the time; it happens constantly. Sometimes it's happening for storage, because it's expensive to merge something; sometimes it's happening for messages across the network. The problem is that it's tedious code to write, and you'd rather not write it for every single different purpose. This is the kind of stuff you want to abstract: get it right, nail it, and never do it again. That's what the store library — and the graph library we're going to show in a minute — are all about.
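A minimal version of such a buffer-and-flush policy — accumulate merges in memory, flush to the backing store only when a threshold is reached — might look like this (a sketch under assumed names, not the actual store library):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

// Sketch: merge writes into an in-memory buffer, flushing to the backing
// store once the buffer holds enough keys (the "flush policy").
public class BufferedStore {
    private final Map<String, Integer> backing = new HashMap<>();
    private final Map<String, Integer> buffer = new HashMap<>();
    private final BinaryOperator<Integer> mergeFn;
    private final int flushThreshold;

    BufferedStore(BinaryOperator<Integer> mergeFn, int flushThreshold) {
        this.mergeFn = mergeFn;
        this.flushThreshold = flushThreshold;
    }

    void merge(String key, int value) {
        buffer.merge(key, value, mergeFn);        // cheap in-memory merge
        if (buffer.size() >= flushThreshold) flush();
    }

    void flush() {                                // one round-trip, many keys
        for (Map.Entry<String, Integer> e : buffer.entrySet())
            backing.merge(e.getKey(), e.getValue(), mergeFn);
        buffer.clear();
    }

    Integer get(String key) {                     // read through buffer + backing
        Integer buffered = buffer.get(key), stored = backing.get(key);
        if (buffered == null) return stored;
        return stored == null ? buffered : mergeFn.apply(stored, buffered);
    }

    public static void main(String[] args) {
        BufferedStore s = new BufferedStore(Integer::sum, 2);
        s.merge("a", 1);
        s.merge("a", 4);   // same key: pre-aggregated in memory, no flush yet
        s.merge("b", 2);   // second key hits the threshold -> flush
        System.out.println(s.get("a") + " " + s.get("b"));
    }
}
```

The point of wrapping this once is the one the talk makes: the same policy then serves storage merges, network batching, and anything else, instead of being rewritten per use.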
So, on to a concrete example with store merging: indexing bigrams, really simple. First you see we set up the store — it's an in-memory store, and its merge function is our good friend merge-with plus. Pretty simple. Then we doseq over the partitioned words — the bigrams — and store, indexed on the first word, the counts of the subsequent words. That's it; it's really simple to index bigrams with this stuff.

The second example is MapReduce. We're going to do a trivial MapReduce, and then one that's going to go through three or four slides and talk about some of the real details — it gets slightly more real. It's not going to be distributed or anything, but it's going to deal with aggressive pre-aggregation. In the simple example we just have a bucket spec with a reduce function in it — again, that's all we have. We have a whole bunch of buckets that we're making — that's the bs here — and we have this worker function. The worker function just takes outputs from the mapper (the mapper can generate many; that's why you see the doseq over the key-value pairs), and we merge. Again, the whole reduce side is just taken care of by the merge. Then over here we create all our workers, using the do-work function we have; do-work deals with all the nastiness of the underlying Java executor and concurrency stuff. I should add that we don't use a ton of the Clojure concurrency primitives — nothing necessarily wrong with them or against them, it's just that a lot of the time we're doing things where we really need fine-grained control: we need to create these pools, maximize resources, and be smart about that. So a lot of the time we've found we just go straight to the executor stuff directly and control it there. Then we just reduce over the values. So that's a pretty trivial example of the MapReduce.

Now, a slightly more realistic version. It's mostly the same: we have the main aggregator bucket set up at the top, we have a worker pool that we've set up, and we've got an in-queue, because we're going to be feeding all the tasks to the workers through it, plus a sentinel value so we know when we're done. As an aside: a lot of people talk about streaming versus batch computing — everybody's talking about stream processing and blah blah blah now — and it's the same thing, except that you have a sentinel. That's how you know when you're done. That's always how it works; it's not really that big a deal — you have the same plumbing in both cases, and in two or three slides all the mysteries of how to do streaming and batch will be revealed to you. So we have our setup, and we kick off our future; the future fills the queue and sticks the sentinel value on the end — you see that there.

The second part of the setup has to do with all the latching. You need one global latch for the whole job, to know when you're done, and each mapper needs a latch as well. Then you have your terminator function, which basically wraps the mapper — you see that at the bottom — checking whether you've hit the sentinel, and if so, counting down the terminator latch. That's basically saying: once I've seen the sentinel, I know that all work to be submitted is being processed by somebody, but I don't know that I'm done, because other people can still have tasks in flight. So it's safe for me to count down my terminal latch, but it's not safe for me to declare the job done. This is just how to wrap each mapper so that each mapper knows how to check whether things are done globally. Then we create these buckets again, because each worker here has its own bucket. The whole purpose of
going through all these gymnastics at the beginning is that, when the job is running, each mapper is able to aggressively merge before any kind of barrier has to happen at the end. Everything is as done as it can be, so that by the time the very last computation happens, almost everything is complete — there's only a tiny amount of work left. You really try to remove all the barriers you can. Here we submit all the jobs: we're doseq-ing over the buckets and submitting all these tasks to the pool. Earlier we had the code for the one global latch; here you can see the countdowns on the mapper latches, and you can see that we also need to check the terminal latch, because we don't need to bother checking for more work if we know there isn't any. I don't think I need to walk through it line by line — let's skip that. At the very end you have the awaiting: you wait for the terminal latch, then you wait for all the mapper latches, then you ensure everybody is fully merged into the master bucket, and then you're done.

So that's it — it's really not that much code. It's not that hard to build this kind of stuff, and the whole reason I bring it up is, again, that it uses buckets. There are actually papers about this — about speeding up MapReduce with smarter, more efficient partial aggregations and so on — and the reason this stuff is cool is this: instead of using Hadoop or one of these other things where you have no control anymore — where if you want to try these modifications you have to go in and modify the Hadoop codebase, which is kind of funky (we've read papers about people trying to do this kind of thing with BDB, basically partial aggregations, trying to speed things up and reduce the bottleneck) — when you build these smaller abstractions and build everything else in terms of them, it's easy for you to do this kind of thing, because you already have most of the tools for it. Most of the difficult part of dealing with this is all about the storage and the aggregation, and you can see from this code that it's really not much code to handle a pretty complicated task — one that also has correct concurrency semantics in a fairly tricky scenario. Even on slides it's only a few pages of code, and in Emacs that's like half a screen. So it's really a small amount of code.

All right, next up is going into the store library. We've seen some examples; now I want to tell you what they're all about. There are two patterns that come up all the time, and we have these wrapper policies around stores that deal with both. So far we've only seen merging — there was no flushing — and one of the harder parts is actually the flushing: you're building up this merged data; how often do you flush it, where do you flush it, et cetera. The longer I've worked with this kind of stuff, the more I've realized that all the gymnastics around the data mostly relate to the buffering-and-flushing or the caching-and-checkpointing, and the two kind of go together. If you're going to cache something, you don't want to flush it all the time; when you need to persist it, you want to checkpoint it — you basically want to take a seq of everything and checkpoint it, but you don't want to drain the seq, you just want a checkpoint of it, a checkpoint-seq is what we call it. The opposite is true when you're flushing: if I have a bunch of writes coming in, I'm going to smartly pre-aggregate everything before I write it out — I'm going to merge them all together and then flush. The reason I bring this up is that we're relying on the seq underneath it.
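The whole sentinel/latch/per-worker-bucket pattern described over these slides can be sketched end to end in plain Java. This is an illustrative reconstruction (the talk's actual code is Clojure on top of Java executors, and all names here are mine): a queue of work ending in a sentinel, a terminal latch for "all work claimed", a latch for "all workers finished", and a private bucket per worker that is aggressively merged locally before the final barrier.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.*;

// Sketch of streaming-style map-reduce with a sentinel and latches.
public class SentinelMapReduce {
    static final String SENTINEL = "__DONE__";

    public static Map<String, Integer> wordCount(String[] words, int nWorkers)
            throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        CountDownLatch terminal = new CountDownLatch(1);       // sentinel seen
        CountDownLatch workersDone = new CountDownLatch(nWorkers);
        ExecutorService pool = Executors.newFixedThreadPool(nWorkers);
        // one private bucket per worker: local merges, no contention
        Map<Integer, Map<String, Integer>> buckets = new ConcurrentHashMap<>();

        for (int w = 0; w < nWorkers; w++) {
            final int id = w;
            pool.submit(() -> {
                Map<String, Integer> local = new HashMap<>();
                try {
                    while (true) {
                        String word = queue.take();
                        if (SENTINEL.equals(word)) {
                            queue.put(SENTINEL);  // let other workers see it too
                            terminal.countDown();
                            break;
                        }
                        local.merge(word, 1, Integer::sum); // partial aggregation
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                buckets.put(id, local);
                workersDone.countDown();
            });
        }

        for (String word : words) queue.put(word); // "streaming" input...
        queue.put(SENTINEL);                       // ...ends with the sentinel

        terminal.await();     // all submitted work has been claimed
        workersDone.await();  // all in-flight work has finished
        pool.shutdown();

        // final barrier: fold the already-aggregated buckets into a master
        Map<String, Integer> master = new HashMap<>();
        for (Map<String, Integer> b : buckets.values())
            b.forEach((k, v) -> master.merge(k, v, Integer::sum));
        return master;
    }

    public static void main(String[] args) throws InterruptedException {
        String[] words = {"a", "b", "a", "c", "a", "b"};
        System.out.println(wordCount(words, 3));
    }
}
```

Because each worker pre-aggregates into its own bucket, the work left at the final barrier is tiny — merging a handful of small maps — which is the point the talk makes about removing barriers.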
Even on the back side of the store, we're decoupling from that protocol and just emitting seqs, so it doesn't have to be, like, hot store-on-store action, right? You can have seqs going anywhere, and we're going to show that in the last example: after we talk about graph a little bit, I'm going to go into an example where we actually flush a seq, which then updates some custom in-memory data structure to be used in the online system.

All right, what is the graph thing? The graph thing is pretty simple: it's a better API for doing distributed computing than, like, a MapReduce kind of thing. It's more or less the same idea as Microsoft's Dryad. It's pretty simple: you have a bunch of functions, you want to compose them into a DAG, and you want to decouple the execution plan from that DAG. That's really it, and you want to optimize it for throughput. This, again, is why we don't use Hadoop or any one of those things: we want to express our computations in a way that can run in our production systems, live, with online data, getting online updates, and then serving results up to users. We want to use the same code there that we would use to run over all our historical data to do training. We don't want to split those up, and we don't want the limitations of batch updates; but at the same time we want it to be fast, and we want it to be scalable. So we kind of built up these abstractions of our own.

Let's go through a really simple example with graph. The first thing we do is fetch docs, and you can see the syntax is pretty simple as we go through the pipeline: we've got the same functions as Clojure, but with a g in front of them. We first fetch a bunch of docs, then we tag them with entities (that's what this NLP entity-extraction step is doing), and then we branch. In that branch statement you see the terminal I/O vertices: one is a bucket merge, and the other is publishing on pub/sub.
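As a rough illustration of the idea (a toy, not the real graph library: `stats-graph` and `run-graph` are made-up names), a DAG can be just a map from node id to its dependencies plus a plain function, and "compiling" it yields one ordinary function you can call and test:

```clojure
;; A DAG declared as data: node id -> [input-ids f].
(def stats-graph
  {:n    [[:xs]     (fn [xs] (count xs))]
   :sum  [[:xs]     (fn [xs] (reduce + xs))]
   :mean [[:sum :n] (fn [sum n] (/ sum n))]})

(defn run-graph
  "Evaluate every node once given an input map. Naive: assumes the
  literal map lists nodes in dependency order; a real library would
  topologically sort first."
  [graph inputs]
  (reduce (fn [env [id [deps f]]]
            (assoc env id (apply f (map env deps))))
          inputs
          graph))

(run-graph stats-graph {:xs [1 2 3 4]})
;; => {:xs [1 2 3 4], :n 4, :sum 10, :mean 5/2}
```

Because the execution plan is separate from the DAG itself, the same declaration could instead be walked by a scheduler that runs each node in its own thread pool, or on another machine.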
Right, so this is the kind of pattern we do all the time with graphs: you set up a computation, and the terminal nodes are I/O, and that's basically always a store or pub/sub for all intents and purposes. This pattern came up so much that we kind of built it in and made it really simple to do, so the inputs and outputs play nicely with it.

I should also mention the inputs. The input scenario that comes up all the time is that you start up a graph and you have some data on disk which is going to prime it, and then it's subsequently going to be getting all of its data from live events. So it's a very common scenario that you start up a graph with a priority queue at its head, and then you start up a job in the background, on a future or whatever, which goes and reads data in from disk and dumps it in, and then you have that same queue hooked up to a listener which is getting hit with inbound data. That's a pretty common scenario for us, and it's what happens, for example, when you reboot one of our core machines that runs our news feeds, the ones that build everybody's news feeds. They have to read a very large amount of data into memory; we're using the big instances, the cluster-compute ones on Amazon, so they're reading gigabytes and gigabytes of stuff into memory when the service starts up. But you don't want the service to be dead while that happens, and you also want it to be available for accepting live events, right? This is a very common situation when you're dealing with streaming data, so we've taken pains to abstract that stuff, and it's very easy to put a new service up like this; you can do it in a couple of hours.
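That priming pattern, one shared queue fed both by a background job replaying disk data and by live events, might be sketched roughly like this (illustrative only; the names are made up, and a plain blocking queue stands in for the priority queue):

```clojure
;; One queue at the head of the graph. Historical data is replayed
;; into it on a background future; live events land on the same queue.
(def events (java.util.concurrent.LinkedBlockingQueue.))

(defn prime-from-disk!
  "In production this would stream gigabytes off disk; here we fake it
  with an in-memory collection, pushed on a background future."
  [lines]
  (future (doseq [line lines] (.put events line))))

(defn on-live-event!
  "Listener callback for inbound live data: same queue, same consumer."
  [e]
  (.put events e))

(defn consume!
  "Blockingly take n items off the head of the graph."
  [n]
  (vec (repeatedly n #(.take events))))

(prime-from-disk! ["old-1" "old-2"])
(on-live-event! "live-1")
;; consume! now sees historical and live data interleaved on one queue,
;; so the service is never "dead" while it reloads.
```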
So the execution policies are also really important. For one thing, you want to be able to test this stuff. If you've written MapReduce jobs and things like that, you know that it's kind of a pain in the ass to test; people have even written weird testing harnesses for this stuff. But the main reason it's a pain is that you can't just test them like normal functions, and there's absolutely no reason why you couldn't do that, except for the fact that these things were written as frameworks by people who didn't care much about this kind of stuff. What you'd like to be able to do is define one of these graphs and then just compose the thing into one giant function to test, and that's what we can do with our trusty function called graph, which actually does what it sounds like it does. You can also run each node, each vertex, inside of a thread pool, all on one machine; so you could have 20 different nodes, each in its own thread pool, you could schedule them all in one thread pool, or you could schedule each one on a different machine. It's completely decoupled, and you can run it in a variety of different ways.

But there's another, even bigger win, I think. Abstracting the execution policy is cool, and you can fan out execution in a variety of ways; eventually (we don't do this now, it's pretty primitive) we could have a really, really smart planner. Eventually we could know the mathematical properties of each function, and we can also monitor its performance, and when you know its properties and its execution profile over time, you can really smartly and dynamically plan how and where you execute these vertices. That's really cool, but the bigger thing, which is a huge issue in production systems, is the monitoring and the visibility stuff. I would say for most of the kinds of systems I've dealt with in the past it's pretty weak. Even at Google (I was at Google for a while) they have a really elaborate system for querying over logs and such, because almost all these systems have always been
built, you know, with this metaphor that when things go astray, you just write something out to the log. I think logs are a really bad idea, and I don't know why we have them. There's a whole cottage industry called log analysis which was basically created to query over logs, and it's like, well, why the hell are you doing that in the first place? Why are we creating a whole industry to mine stuff out of data that we wrote out as strings, when we had all the information about it in memory when we wrote it there? It doesn't make any sense at all, and it's making our lives miserable. It makes debugging terrible; you can't monitor your stuff in production. So it's just, why do we live like this? Why do we permit ourselves this miserable existence? I don't understand, because we don't have to, right?

So here's what we do. Because all these nodes in the graph are just functions, you just wrap them. That's all: you just use higher-order functions, and then you trap exceptions, or you do whatever you want to do, and you capture the data, and then you can write it out. It's pretty straightforward. Here's what an example of one of our dashboards looks like. In the graph example (I forgot to mention this, I should have) you might have been wondering what those little IDs are. We're using keywords as IDs for each of our vertices. We don't have to do that, they're optional, but we always do, in order to give them descriptive names for monitoring purposes. Obviously you can just introspect and look at the function name, but that kind of breaks down with lambdas, right? Because we want to be able to do partials and lambdas and all that kind of stuff, we found that it's actually just easier to manually tag these things with IDs. It's a small amount of work given the debuggability benefits.
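Monitoring by wrapping, rather than by logging strings, might look something like this minimal sketch (`instrument` and the `stats` atom are hypothetical names, not the real monitoring code):

```clojure
;; One atom holds live stats for every instrumented node.
(def stats (atom {}))

(defn instrument
  "Wrap node function f so each call records call counts, cumulative
  time, and exception types under id, queryable from a dashboard."
  [id f]
  (fn [& args]
    (let [start (System/nanoTime)]
      (try
        (let [result (apply f args)]
          (swap! stats update-in [id :calls] (fnil inc 0))
          result)
        (catch Exception e
          (swap! stats update-in [id :errors (class e)] (fnil inc 0))
          (throw e))
        (finally
          (swap! stats update-in [id :nanos] (fnil + 0)
                 (- (System/nanoTime) start)))))))

(def parse (instrument :parse (fn [s] (Long/parseLong s))))
(parse "42")
(try (parse "oops") (catch Exception _ nil))
;; @stats now records one successful call, its timing, and one
;; NumberFormatException, all under :parse, with no log files involved.
```

Since graph nodes are plain functions, the same `instrument` wrapper applies uniformly to every vertex, which is exactly the composability point being made here.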
So you can see what we're doing: we're looking at timing, the number of times things have been executed, what their throughput looks like, what their CPU time looks like, the number of exceptions each one has thrown, and the number of each type of exception that has been thrown. All this kind of information is really great. As soon as we got this stuff, it was the first time in my life that I'd ever run anything like this in production with this level of visibility, and it's just so liberating, it really is, to be able to find where things are screwed up and notice where data is being lost. The first time we got this thing up, we literally found that we were dropping like 85% of the data on the floor. Seriously, 85 percent of articles that were coming in were just being dropped on the floor, from parsing errors or all kinds of other things. A lot of it is pretty trivial to fix when you can see it in a clean way, but you can't see it, you know, like Neo looking at the Matrix, with everything falling down in front of you in the logs; you can't find anything that way, it's just a disaster. With this it's great, you can really fix stuff, and just practically that's a huge win.

So we use graph to compute and to monitor. Again, the same abstractions are useful for anything, just like with all of FP: if you just deal with functions, you can always wrap them, you can inject them into other functions, you can do whatever you want to do. The same thing that gives you that composability for computing gives you composability for monitoring. A graph tends to always have a terminal store node at the end, or pub/sub that will pub to another graph. So, concretely, what happens with us is that data comes in from somewhere that we polled or crawled, we find out whether we've seen this link before, and so on (there's a lot going on at that stage), and then it
passes off to a big pool that deals with documents, and that knows how to fetch documents, how to label them, how to do all the text extraction and so on, and also clustering. Then it gets passed off to the services that are user-facing, which are building feeds and performing other tasks like that. Every single one of those things is running a graph, and they're oftentimes both pubbing out their data to a subsequent graph and writing it out to a store.

This allows us to quickly craft systems for all kinds of problems. Like, there was a time we decided to index Wikipedia. Wikipedia is one of these things where a lot of people have spent years doing Semantic Web kinds of stuff with it, and most of it is nonsense, so we didn't use it for a long time; we were like, this stuff's mostly rubbish. But we did decide to play with it one time, and that was an example where it was literally like four or five hours to go from "I've never seen Wikipedia data in my life" to "now we have a fully indexed Wikipedia and we're looking at n-grams." That's the whole purpose of all this stuff, right? This kind of composability allows you to tackle a new problem really quickly, because you don't have a bigger thing that's suited only to one particular task; you have lots of smaller stuff, and you can kind of nudge it together in the right way.

So now we come to the example I was telling you about, where we have both a graph and a store. The purpose is going to be doing online learning over streaming events, but the catch is that updates to the online model are very expensive, so you don't just stream data through and update the model straight away; those updates are costly, and the data is coming in very frequently. What you do is batch stuff up and then update periodically, right? So we're going to trigger the updates after we
collect enough events, at 10,000 user events or something like that. Here at the beginning we have params; params are the learned parameters for ranking, and the sufficient statistics are coming from the current iteration. We're going to update those params with the sufficient statistics every so often. You can see here we do the gmapcat over the features and extract them, then we merge the sufficient statistics here, and then, on this little cron job (I think this is kind of wishful thinking on the syntax here), we have a cron job running every 60 minutes that's updating the params. Here at the very bottom you can see the example I was talking about, where we're flushing that bucket. Just so you understand what's happening: that update-params function is actually taking a seq as its second argument and updating the in-memory params.

We do this kind of stuff all the time, where we go back and forth, so I don't want you to think that everything lives in these holy abstractions that are just perfect or whatever. That's not the case at all, and that's the whole point. Some of the data, say for example in our social feed service, is stored in these very highly customized and very, very efficient data structures that are all custom protocols and everything, right? It's not using any of these abstractions at all, but then something else that writes data to it will be. So this kind of scenario is fairly common, where you might have a bucket thing, but then something on the other side is this super-tuned, very focused solution to a particular problem. These finer-grained things are great for going back and forth like that: you don't have to depend on a particular data store, or a particular way of storing data in memory, where everything has to get munged into that paradigm. We don't do that, because of exactly this kind of thing.
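The batch-then-update idea described above can be sketched like this. This is an illustration only: the `update-params` body, the threshold, and all the names are stand-ins, not the real ranking code:

```clojure
;; Learned parameters, updated rarely; sufficient statistics, merged
;; cheaply on every event.
(def params (atom {:weight 0.0}))
(def suff-stats (atom []))

(defn update-params
  "Stand-in for an expensive model update: takes the current params
  and a seq of buffered statistics, returns new params."
  [current stats-seq]
  (update current :weight + (reduce + stats-seq)))

(defn handle-event!
  "Cheap per-event merge; the costly update fires only once enough
  events have accumulated."
  [stat]
  (swap! suff-stats conj stat)
  (when (>= (count @suff-stats) 3)          ;; e.g. 10,000 in production
    ;; reset-vals! atomically drains the buffer and hands us the batch.
    (let [[batch _] (reset-vals! suff-stats [])]
      (swap! params update-params batch))))

(doseq [s [1.0 2.0 3.0]] (handle-event! s))
;; => @params is {:weight 6.0}, and @suff-stats is empty again
```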
So I think that's it. I wanted to show you guys a quick demo, and then you can bug me with questions here. Well, not really bug me, but you know what I mean. Let's see if I can get this thing to collapse to a reasonable size... here we go. So here's the product. Oh, I'm on staging right now; that's because we have our new landing pages here. You guys want to see the new landing page? It's pretty awesome; I'll show you this really quick first. Isn't that great? Look at that. Hold on, look at the button here; I'm really stoked about that button. See the press state here? Boom. All right, okay, let's look at the real product.

So this is what it is: a pretty straightforward newsfeed type of experience. We've tried to get all the rubbish out of the way for the most part and keep it pretty simple. You have a home feed, you have favorites, shared, unread; this is just stuff that I add, right? So this takes care of the read-it-later kinds of scenarios, and so on. Over time this will get more sophisticated. The thing I've always hated is that sharing and bookmarking are basically the same thing for me: if I find something cool, I might share it out or I might bookmark it or whatever, and then I can never find all that stuff in one place. So we're trying to move toward having search over your personal interaction items across the board, no matter where you share to.

You can see I have a really good mix of topics in here; I'm getting all this really nice visual stuff. And you can see topic diversification here, right? We've got gardening in the front, then we move on to this... oh no, I'm in my favorites feed, so that's a bad example; that's only there because I actually favorited it today. A different red herring. But here's the same thing, right? You have design and interior design, which is something I'm really into; then San
Francisco is next, where I'm from; then again more interior design stuff; then here's something about China, which is an area I'm interested in. So it's doing a good job of flipping between different stuff. Here's something about sustainable energy, right? It's taking my different interests and diversifying, so I get a lot of the kinds of top stuff that I'm into, but it forces me not to consume only that. That's pretty cool.

Also, the navigation model here is interesting: okay, sustainable design, I want to go there; boom, I'm there. You see how fast that was? That was all dynamic, right there; it happened at request time. This whole feed was built and re-ranked for me dynamically when I issued that request. It's hard to stress how hard that is. It might not seem like a very big deal, but it's really hard to do, so I just want to make sure we're clear that that was awesome.

So that's really it, and then there's search. I'll do this one thing and then I'll be done. Search is giving me some suggestions already, right? But if I do, let's try, GOP... Gophers football? That's pretty weird, I don't know what that's doing, but the other things are what you would expect. The query expansion is along topic similarity, so you're going to get whatever in real time right now is most similar to the query that I issued. So I can get results for the election and come into that feed, and I can see what's going on in the election.

What we're trying to do here is encourage people to have that home-feed experience where you sit there and everything comes to you, which we've gotten used to from the Twitter and Facebook kind of world. But the problem we're trying to go after is to say: that's great, and the simplicity of it is beautiful, but if you're stuck only there, you're only going to see the stuff from your friends, you're going to keep getting caught in this bubble, you're going to be in
the echo chamber for the rest of your lives. So what we're trying to do here is give you these easy windows to explore out, right? And now I can come and explore campaign finance; I'm sure that's something I want to spend my time reading about. So yeah, that's it. Thank you very much. [Applause]
Video description
From InfoQ