Outbrain Techblog
https://www.outbrain.com/techblog

Yet another Hadoop migration – but from the human perspective
https://www.outbrain.com/techblog/2021/01/yet-another-hadoop-migration-but-from-the-human-perspective/
Wed, 20 Jan 2021
This blog is about our latest Hadoop migration, but from a different angle: instead of describing only its technical aspects (don’t worry, I will cover them too), I will focus on the human perspective, on how and why we took the decision to go from (attention, spoiler alert!!!) a commercial to a community solution.


But first, some context. Outbrain is the world’s leading content discovery platform. We serve over 400 billion content recommendations every month to over 1 billion users across the world. To support such a large scale, we have a backend system built of thousands of microservices running inside Kubernetes containers, spread over more than 7000 physical machines in 3 data centers and in public clouds (GCP and AWS). For the recommendations we supply to be valuable to our readers, we invest in personalization as much as possible. To achieve this goal, we run lots of machine learning algorithms in the background on top of the Hadoop ecosystem, which makes it a very critical system for our business. We have 2 flavors of Hadoop clusters:

  • Online – 2 clusters in full DR mode, running on bare metal machines and used for online serving activities.
  • Research – several clusters (per group and usage), running in GCP and used for research and offline activities.

Every day, each cluster ingests over 50TB of new data and runs over 10K job executions.

The Trigger

A few years ago we migrated our online Hadoop clusters to the MapR commercial solution (you can read more about this in the Migrating Elephants blog post). It brought many improvements, and we enjoyed the support we got. But over the following years our Hadoop usage increased dramatically, which made us very professional at supporting this system: we improved our knowledge and could handle issues on our own, instead of depending on external resources (which is one of the main benefits of a commercial solution). So, a few months before the license renewal, we wanted to decide on the best way for us to proceed with this system.

We knew it could take some time until we had a new solution in production, and we needed to make sure the current system would continue to operate. So we took the following actions to mitigate the time factor:

  • MapR also has a community version. We mapped the differences between it and the commercial version; there are few, but the main ones we had to solve were:
    • No HA solution (for the CLDB, the main management service) – we implemented our own custom solution for that
    • Support for only a single NFS endpoint – some of our use cases copy huge amounts of data from Hadoop into a separate DB, so we needed a scalable solution; we ended up with an internal FUSE implementation
  • Since we had 2 clusters in DR mode, we were able to test and run the community version in parallel to the commercial version
  • Just in case, we made sure we could extend the license for only several months instead of a 3-year commitment

Those actions enabled us to make the right decision without any pressure: we made sure we would be able to run the system on the community version until the new solution was in place.

The Alternatives

We came up with the following alternatives:

  1. Stay with the MapR commercial version – we didn’t want to stay with the MapR community version (it doesn’t have big user adoption or support)
  2. Migrate to the Cloudera commercial version
  3. Migrate to the Cloudera community version – this option was dropped immediately, since the number of nodes we had (>100) exceeded the community version’s limit
  4. Migrate to the Apache Hadoop community solution
  5. Migrate to Google Cloud – the same as we did with our research clusters

To be able to compare those alternatives, we first defined the criteria that would help us determine the chosen one, summarized in the following table:

There were other criteria, like cost and required additional headcount, but they were not factors in our decision.

The Decision

After the successful migration of the research cluster to GCP (you can read about this in the Hadoop Research Journey blog series), many of us assumed that once we needed to migrate the online clusters, they too would go to GCP. But as in real life, you need to handle each case by its own characteristics. I have 4 children, and I wish there were a single way to rule them all, but in reality each one requires a different approach and a different attitude. The same goes for our Hadoop clusters: each has its own characteristics, requirements, and needs; they differ in their usage, so what is good for one is not necessarily good for the other. Due to the nature of the online clusters (lots of compute, less storage), we realized we would not be able to benefit from the cloud features (elasticity and compute-storage separation) the way we had with the research cluster.

As said before, case by case: we needed another solution for our online clusters. Following this conclusion, it was clear that the chosen solution would be one of the bare metal alternatives. That left us to choose between the community and commercial versions. We compared the pros and cons of each option, but the main question was: do we want to pay for a license, or invest the money in training people and building better internal skills?

We figured that raising our professional level would have the biggest impact. It would be a win-win situation: our users (the engineering teams) would get better service, and we would invest in our people. Following the post’s title, we took the decision from the human perspective. Having said that, it was clear that the winning alternative was the Apache Hadoop community solution.

The Migration (in short)

The migration had to be done while the system continued to operate as a production system, meaning we needed to keep giving the same service to our users without affecting them. As we have 2 clusters in separate DCs, we migrated one cluster at a time, each in a rolling manner: we started with a small Apache cluster, and over several cycles we moved the workload from the MapR cluster into it.

Data migration

  • New data – we started ingesting new data into the Apache cluster by integrating it with our data delivery pipeline
  • Historical data – most workloads need some historical data for their processing; as the migration was done in cycles, we copied only the data required for each cycle
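The historical copies can be sketched with DistCp. The sketch below is hypothetical, not our actual pipeline: the paths, hosts, and partition layout are invented, and the commands are only printed (run them from an environment with the Hadoop clients installed):

```shell
# Hypothetical sketch: copy only the historical partitions that the jobs in
# the current migration cycle depend on, from the MapR cluster to the Apache
# cluster. Paths, hosts, and the partition layout are illustrative.
SRC="maprfs:///data/events"              # source on the MapR cluster
DST="hdfs://apache-nn:8020/data/events"  # destination on the Apache cluster

CMDS=()
for day in 2020-10-01 2020-10-02 2020-10-03; do
  # -update skips files that already exist with the same size, so a cycle can
  # be re-run safely; -p preserves permissions and ownership
  CMDS+=("hadoop distcp -update -p $SRC/dt=$day $DST/dt=$day")
done

# Dry run: print the commands; pipe to `bash` to actually execute them.
printf '%s\n' "${CMDS[@]}"
```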

Workload migration

We repeated the following steps until we managed to migrate the entire workload:

  • Moved some of the jobs from the MapR cluster to the Apache cluster
  • Stopped the migrated jobs in the MapR cluster
  • Moved machines from the MapR to the Apache cluster

We were able to perform those cycles thanks to the mapping we did: we defined groups of jobs that could be moved together in one piece, taking into consideration the dependencies between the jobs and the cluster capacity. The last cycle contained all the jobs that were tightly coupled to each other and could not be separated.

Of course there were lots of other technical details but they are a subject for a separate blog.


The migration was a huge success, with full cooperation of all the relevant engineering teams. After preparations that took a few months, we completed the migration itself in no time (~1 month per cluster).

Now we can summarize what we achieved:

  • The human factor – our people improved their technical skills and got more professional in supporting the Hadoop system 
  • We are using Open Source Software – there are lots of articles describing the benefits of using open source software which are: flexibility and agility, speed, cost effectiveness, attract better talents etc.
  • Improved cluster performance – with the same number of cores (actually a little fewer), all jobs finish their processing at least 2 times faster. For example, the graphs below show the processing time of the hourlyFactFlowFaliCreationInHiveJob on Apache vs. MapR
  • Cleanup – as a preparation for the migration we mapped the storage and workloads that will be migrated, this enabled us to perform cleanups where possible
    • Storage – we deleted more than 700TB, which reduced cluster usage from 73% to 49%; you can see the impact in the graph below
    • Workloads – we disabled more than 800 jobs, almost half of the total (today we have ~950 jobs). In addition, we brought some order to job ownership, so instead of 19 ownership groups we now have only 16
  • Technical debt – one advantage of moving to open source software is the ability to use the latest versions of all integrated packages, and to integrate even more of them; this ability is sometimes missing in a commercial solution. In our case we:
    • Upgraded our Spark jobs (to version 2.3)
    • Upgraded our Spark Streaming jobs (to version 2.4)
    • Implemented Presto as another service, supplying a high-performance query engine that can combine several data sources in a single query
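As an illustration of that last point, a federated Presto query joining a Hive table with an operational MySQL table might look like this. The query is hypothetical: all catalog, schema, and table names are invented for the example, and the CLI invocation is shown as a comment since it needs a live cluster:

```shell
# Hypothetical federated query: combine a Hive table with a MySQL table in a
# single Presto query. All catalog/schema/table names are illustrative.
QUERY=$(cat <<'SQL'
SELECT h.doc_id, m.campaign_name, count(*) AS views
FROM hive.default.page_views h
JOIN mysql.crm.campaigns m ON h.campaign_id = m.id
GROUP BY h.doc_id, m.campaign_name
SQL
)

# Against a live cluster this would run via the Presto CLI
# (the coordinator host is illustrative):
#   presto --server presto-coordinator:8080 --execute "$QUERY"
echo "$QUERY"
```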


The decision to migrate one of our critical systems from a commercial to a community version was not an easy one to make; it involved some amount of risk. But in keeping with the blog’s title, we made it from the human perspective: when you invest in the people, you get the investment back big time.

Taking such a brave decision requires strong leadership. By nature, there are always a few skeptical people who fear change, especially such a big change in such a critical system. We needed to be determined and express confidence in the people and in the process; everyone had to be committed and engaged with this decision.

To sum things up, the migration was a big success from both the human perspective and the technical aspects, a team effort whose outcome we are all enjoying. And regarding the different solution per Hadoop cluster: as in real life, try to follow the rule of case-by-case.

Faster release with Maven CI Friendly Versions and a customised flatten plugin
https://www.outbrain.com/techblog/2020/12/faster-release-with-maven-ci-friendly-versions-and-a-customised-flatten-plugin/
Mon, 14 Dec 2020

Fed up with waiting for the maven release? We’ve found a way to cut the release time by half. Each of our teams at Outbrain is responsible for its own service code in its own repository. However, our teams also share a large Maven-based repository that contains modules (libraries) that get released as Maven artifacts. After a module is released, it can be used by the teams. Thus, the shared repository — in contrast to service code, which is managed within individual team repositories — serves as a centralised place to manage team libraries.

Since our shared repository has hundreds of modules managed by multiple teams, we started to face failures during release, and release time increased dramatically. We decided to tackle this issue to boost efficiency.

The solution we came up with was to move to Maven CI Friendly Versions, which eliminate the race conditions. (The Maven Release Plugin includes a git commit phase that changes the pom.xml versions; if the local branch is behind the remote by the time of the push, the push is rejected and the build fails.)

Moreover, we stopped using the Maven Release Plugin, accelerating our release process.

Bye Bye, Release Plugin.

Until recently, we had been using the Maven Release Plugin in order to release our libraries. The plugin has a two-step process, with different commands involved (release:prepare, release:perform).

The “prepare” and “perform” goals involve building the project multiple times. Moreover, only one release can be triggered at a time (multiple releases cannot run simultaneously, due to the race conditions described earlier), so you must wait for the current release to complete before starting the next. For projects like ours, with long build times, this is a deal-breaker: the Maven Release Plugin simply took far too long to run.

Welcome, Maven CI-Friendly Versions

The approach we took is lightweight compared to the Maven Release Plugin and allows multiple releases to be triggered and run simultaneously.

The Maven CI-Friendly Setup

The structure of our monorepo (which follows a parent-child hierarchy) allowed us to easily change all our pom.xml files from hard-coded versions to the ${revision} property as the artifact version, which can also be overridden.

In order to avoid redefining the revision property for each module, we defined the revision property in the parent pom.xml.

Here is a child pom.xml:

<name>CI Friendly Child</name>

And this is the parent pom.xml:

<name>CI Friendly Parent</name>
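A minimal sketch of this parent/child pattern (only the `<name>` values come from the snippets above; the group and artifact ids are illustrative):

```xml
<!-- Parent pom.xml: the version is the ${revision} property, defined once -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>ci-friendly-parent</artifactId>
  <version>${revision}</version>
  <packaging>pom</packaging>
  <name>CI Friendly Parent</name>

  <properties>
    <revision>1.0.0-SNAPSHOT</revision>
  </properties>
</project>

<!-- Child pom.xml: inherits the ${revision} version from the parent -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>com.example</groupId>
    <artifactId>ci-friendly-parent</artifactId>
    <version>${revision}</version>
  </parent>
  <artifactId>ci-friendly-child</artifactId>
  <name>CI Friendly Child</name>
</project>
```

Note that the child’s `<parent>` block also uses `${revision}`; Maven 3.5+ resolves it at build time, which is what makes the version overridable with `-Drevision`.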

As you can see, we moved to CI Friendly Versions using the revision property, and we are now set up to run a local build to verify that the definition is correct.

To issue a local build, which will not be published, we invoked mvn clean package as usual. This resulted in the artifact version 1.0.0-SNAPSHOT.

Want to change the artifact version? Easy.
Use the following command:

mvn clean package -Drevision=<REPLACE_ME>

The Maven Release Plugin used the version placed in the pom.xml to define the next release version: if a development pom.xml holds the version 1.0-SNAPSHOT, the release version would be 1.0, and that value is then committed to the pom.xml file.
With CI Friendly Versions, we can finally avoid those commits and the hard-coded versions in pom.xml files.


The Maven CI Friendly Versions guidelines mention that the Flatten Maven Plugin is necessary if you want to deploy or install your artifacts; without it, the artifacts generated by the project cannot be used by other Maven projects.

This is true. But the problem is that the Flatten Maven Plugin, coupled with the “resolveCiFriendliesOnly” option, does not work as expected due to bugs. It is a somewhat over-engineered, overly complex plugin that did not fit our needs. As adherents of the Unix philosophy, we decided to create our own lightweight custom plugin, the CI Friendly Flatten Maven Plugin, which replaces only the ${revision}, ${sha1}, and ${changelist} properties.

The final pom.xml 

<name>CI Friendly Parent</name>
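The change to the parent pom is essentially the plugin declaration. Here is a sketch based on the open-source plugin’s published coordinates (verify the group id against the project and pick the latest released version):

```xml
<build>
  <plugins>
    <!-- Lightweight replacement for the flatten-maven-plugin: on install/
         deploy it only resolves ${revision}, ${sha1} and ${changelist} -->
    <plugin>
      <groupId>com.outbrain.swinfra</groupId>
      <artifactId>ci-friendly-flatten-maven-plugin</artifactId>
      <!-- use the latest released version -->
      <version>1.0.14</version>
      <executions>
        <execution>
          <goals>
            <goal>flatten</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```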

Building the bridge to Maven Ci Friendly Versions

We decided to transition to Maven CI Friendly Versions. As mentioned before, due to the bugs in the Flatten Maven Plugin, we first needed to develop our custom flatten plugin. In our case we use TeamCity to release our libraries, fetching the release version from git and tagging it (though the process suits other build systems as well).

So, what do the release steps look like?

  1. Fetch the latest git tag, increment it, and write the result to revision.txt file. This is the version we are going to release.
mvn ci-friendly-flatten:version

2. Set a system.version TeamCity parameter for our soon-to-be-released version, needed in order to use this version in the steps that follow.

#!/bin/bash
# Read the version computed in step 1 and expose it as a TeamCity parameter
REV=$(cat revision.txt)
echo "##teamcity[setParameter name='system.version' value='$REV']"

3. Deploy the jars with the new version.

mvn clean deploy -Drevision=%system.version%

4. Tag the current commit with the updated version and push the tag.

mvn ci-friendly-flatten:scmTag -Drevision=%system.version%
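Putting steps 1-4 together, a self-contained sketch of the release flow looks like the script below. The version value is simulated here (in the real pipeline, `mvn ci-friendly-flatten:version` derives it from the latest git tag), and the mvn commands are shown as comments since they need the repository and TeamCity context:

```shell
#!/bin/bash
set -euo pipefail

# Step 1 (normally: mvn ci-friendly-flatten:version): the plugin fetches the
# latest git tag, increments it, and writes the result to revision.txt.
# Simulated here so the sketch is self-contained.
echo "1.2.4" > revision.txt

# Step 2: expose the soon-to-be-released version as a TeamCity parameter.
REV=$(cat revision.txt)
echo "##teamcity[setParameter name='system.version' value='$REV']"

# Steps 3-4 (shown, not executed here): deploy the jars, then tag and push.
#   mvn clean deploy -Drevision=$REV
#   mvn ci-friendly-flatten:scmTag -Drevision=$REV
```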

The Maven Release Plugin performs two commits and thus triggers Clean/Compile/Test multiple times. In contrast, the new release process, relying on our custom flatten plugin, performs zero commits, effectively eliminating build failures caused by race conditions on committed code during the build.

Moreover, in this release process, the goals Clean/Compile/Test are executed only once.

Overall, our new approach slashed release time by 50%, from 6+ min to 3+ min.

As a last note, we’ve employed this approach in other projects as well. While the time saving on this project was about 3 minutes, in another project whose release originally took 22 minutes, the custom flatten plugin cut 11 minutes off the build.

So, take it from us: there is no need to get bogged down by the Flatten Maven Plugin. Save yourself the headache. All you need is to switch to Maven CI Friendly Versions with our custom CI Friendly Flatten Maven Plugin.

Taming the bear (metal)
https://www.outbrain.com/techblog/2020/11/taming-the-bear-metal/
Sun, 15 Nov 2020

Picture by hansjurgen007 on Unsplash

Introduction


In this blog post, I am going to share with you the story behind Namaste, an automated tool we built to manage our bare metal machine requests as part of our BMaaS approach (that’s Bare Metal as a Service).

Who are we?

Before we dive into the new tool we built, let me introduce our team.

We are a team of 4 DevOps engineers based in Israel, managing an on-premise environment of 7000 bare metal servers across 3 data centers in the US.

But there is a little bit more to it. When we say 7000 servers, it includes:

  • about 20 different server models
  • more than 100 different disk models
  • …memory sticks
  • …CPUs
  • …raid cards
  • …NICs
  • …PSUs
  • …and a combination of all of the above

So there are a lot of hardware configuration options and a lot of server reuse.

Since we are located in Israel and our data centers are in the US, each data center has a support team available 24/7 to handle everything on the physical side. They act as our “Remote Hands and Eyes” for any onsite work; I will refer to them as “Remote Hands” in this post.

What do we do?

We divide our work into 3 main types of tasks:

Planned changes – Our planned changes are managed in Jira and are mainly hardware related tickets opened by our customers – mostly engineers in the Cloud Platform group. We usually divide these tickets into 3 types of tasks:

  • hardware upgrades
  • service requests
  • machine requests (which we’ll discuss in more details later)

These can involve upgrading the resources of a server, investigating a suspected hardware issue, or building a new cluster altogether.

Unplanned changes – everything which has to be done now. During working hours it can be a task managed in Jira as a blocking/critical ticket; during off hours, a PagerDuty alert.

Internal projects – projects our team initiates and decides to manage. These are usually tasks our users don’t know or particularly care about, but they are more interesting and make both our users’ lives and ours much better. In essence, our internal projects prevent unplanned changes and make planned changes easier to perform. The new tool I’ll discuss falls under this category.

If you’d like to learn more about who we are and how we do things, I strongly encourage you to check out this great blog post by my colleague, Adib Daw.

The challenge

Looking back, a year ago our tasks were distributed like this:

Planned changes – 75%

Unplanned changes – 15%

Internal projects – 10%

This means 90% of the work was manual labor and only 10% went to automation projects. The result was relatively slow delivery, plenty of room for human error, and frustration due to the highly repetitive nature of most of the work. Simply put, it just didn’t scale.

Our solution to this problem was to formulate a vision and present it to our customers to get their buy-in. Our customers agreed to take a hit on the delivery times of their requests so that we could focus on the “internal projects” category, with the intention of investing in automation, and thus in the velocity and robustness of our execution.

We decided to push towards building as many automated processes as possible, in order to minimise the amount of time spent on planned changes and unplanned changes, since many of these tasks appeared to lend themselves quite well to automation.

The “before” picture

When we began our journey towards automation, the process of choosing the best suited servers for the machine request took us a lot of time.

It looked like this:

Picture by tjholowaychuk on Unsplash
  1. Check available hardware and verify we have enough free servers to fulfill the request
  2. Choose the best suited hardware by carefully (and manually) ensuring hardware constraints are met (suitable server form factor, support for required number of drives, RAID adapter, etc)
  3. Reserve the servers in a temporary allocation pool so no one else will use them for something else as they’re being worked on
  4. Allocate the relevant parts in our inventory system
  5. Submit the target configuration into the Remote Hands management system we’d built, so that the onsite technician would have the required information for the task (server setup and required parts)
  6. Open a ticket with the relevant information (server locations, serial number, etc) to the onsite Remote Hands team, to begin work on the hardware configuration changes

All of this had to be done before any actual hands-on work on the hardware even started.

In addition, we also had to deal with failures in the flow: 

  • configuration mistakes
  • hardware mismatches
  • human errors
  • faulty parts

The “after” picture

The process we had worked, but it took a lot of time, sometimes as much as hours per request, depending on how many servers were requested and what changes their configuration needed.

While we had a set of tools that helped bring us to this stage (a big improvement over the previous process, which could take days for large requests), it only kept us afloat and still required manual labor that seemed entirely unnecessary.

Our vision for this process was to simplify it, to look as follows:

Picture by etiennescamera on Unsplash
  1. Submit a request for the final server config, detailing number of servers per datacenter and their final spec
  2. Drink coffee while Namaste handles everything in the background, including opening a ticket for Remote Hands
  3. Get notified that the request was fulfilled

This simplified process would have code deal with the metadata matching, allocation and reservation, and leave us to handle fulfilment rejects, where human intervention is really necessary.

Behind the scenes

The flow of matching machines to a machine request:

  • Accept user input – We enter server requirements such as quantity, location and spec
  • Perform machine matching – An engine we built returns all the servers matching the requirements, some of which may be a partial match (missing a drive, less RAM, etc)
  • Prioritise – Prioritise from the matched server list, based on:
    • Rack awareness
    • Server form factor
    • Drive form factor
    • Number of cores closest to requested spec
    • Hardware changes required to bring machines to the requested spec
  • Reserve machines – Reserve the prioritised servers in a temporary pool to avoid double allocation 
  • Perform parts matching – Decide which parts to use to bring the reserved servers to the requested spec, wherever needed
  • Reserve parts – Reserve the selected parts in our inventory to avoid double allocation
  • Set machines to target spec – Mark the servers with the needed changes, so when the onsite support team begins working on fulfilling the work order, they will connect an iPad to each server which will tell them what changes need to be done.
  • Open a ticket – Open a ticket with all the relevant information to the onsite support team for fulfilment.

Return on investment

The most obvious gain from developing this mechanism is time saving, with the steps of matching hardware and reserving parts being the main points of friction removed. Time saving also shortens the wait for hardware request fulfillment, so both the team and the customers win.

Automated matching also reduces human errors, which has an impact on time saving as well as on customer experience. In addition, it allows us to use the most suitable hardware, which improves efficiency by assigning the best-matching resources to fulfill each request. Simply put, we no longer assign the “first matching” or “random” servers. Instead, we select the “most fitting” ones, as the matching algorithm is now explicit and repeatable.

A “side effect” of the improved matching process is that a whole class of issues is avoided completely: issues around server <-> part incompatibility, maximum server capacity, and generally everything Remote Hands might come across after the request had already been dispatched to them. These kinds of issues are the most expensive time-wise, as they require human-to-human communication across different time zones. We knew this would help, but the actual impact was astounding.

Finally, the user experience dramatically improved with the introduction of this system:

  • Users requesting hardware get their requests fulfilled quicker
  • Remote Hands come across fewer issues in the field
  • Our team deals with much less manual labor and far fewer rejects

As you can see, there are 3 types of customers here, and all come out on top.

A short summary

A year ago we embarked on a journey to build our BMaaS solution, where Namaste is just one part of the full picture. While this is still a work in progress and we expect it to continue evolving indefinitely, now is a good opportunity to look back on where we started, specifically on what drove us down this path.

At the beginning of this post, I mentioned what our work distribution looked like before we decided to change things around. To remind you, it was something along these lines:

Planned changes – 75%

Unplanned changes – 15%

Internal projects – 10%

A year into the process, the nature of our work has changed dramatically, and the distribution is now more along these lines:

Planned changes – 15%

Unplanned changes – <0.1%

Internal projects – 85%

It’s important to note that the actual volume of changes hasn’t decreased. In fact, we now handle more hardware fulfillment requests than ever before. But the actual mindful attention these changes require from us has dropped significantly. Instead, we focus mostly on building the system that gets the work done for all of us, shifting us from “moving hardware” to writing software.

What’s next

  • APIs – by making our BMaaS platform’s APIs open to our users, we expect new and unforeseen use cases to surface, where our users build their own solutions on top of the platform
  • Web UI – much of what we do is currently accessible via command-line tools. While this works, a web UI could improve the user experience even further and allow building various views which the system doesn’t support as of yet
  • Events & notifications – our team still serves as a communication pipeline in various parts of the process, as we’re validating different ideas and ensuring everything works as expected. We plan to introduce events and notifications that would remove us entirely from the flow and allow users to act upon certain events, in-person or via their own code
  • Coffee – maybe move to some decaf. Or tea.
Picture by asthetik on Unsplash

Our First Remote Hackathon
https://www.outbrain.com/techblog/2020/11/our-first-remote-hackathon/
Sun, 01 Nov 2020

Background
2020 was a unique year, in which most industries in the world were forced to work remotely for a long period of time due to the COVID-19 pandemic. Beyond health and employment concerns, staying at home for so long is not natural for the human race; we are communicative beings by nature, and this causes many challenges, one of the biggest being how to maintain the social fabric. This is true in general, and in particular for companies where most or all employees are working from home. It becomes more significant as you go up the organizational structure, from the team level to the group level and above.

As managers, we were very concerned about finding ways to maintain the social connection between teams, and we had several discussions about it. During one of those discussions we remembered that in a different life period, before the COVID-19 pandemic, we had planned to hold a Hackathon and even started working on it, but as usual, life had other plans. So why not do the Hackathon we had planned, just a little differently? Besides the remote limitation, a Hackathon was exactly what we needed to solve this “disconnection” challenge.

As we already had experience with running a “regular” Hackathon, we knew the basic things that needed to be done. But since this would be our first remote one, we understood that things needed to be done a little differently, especially on the administrative side. As a rule, participation in our Hackathons is not mandatory, and people’s engagement depends on the communication they receive. This is true for regular Hackathons and even more relevant for a remote one.

In this post I will describe the actions we took in preparation for and during the Hackathon itself, which led to a successful event. You can use it as a source of tips in case you plan to run your own remote Hackathon.


Most of the preparations for a remote Hackathon are the same as for a regular one, but some adjustments are needed due to the nature of working remotely:

Choose the theme

The Hackathon’s theme defines the Hackathon’s projects, so the first thing you need to do is define it. In our case we chose the theme YACHADNESS, derived from the Hebrew word YACHAD, which means TOGETHER, so YACHADNESS = TOGETHERNESS. We chose this theme because the goal of this Hackathon was to get people working together in order to overcome the lack of communication during the WFH period; this was the reason we ran the Hackathon in the first place, and it was more important for us to focus on this aspect than on any other technological subject. Given the selected theme, there were no boundaries for the projects: anything was welcome, and we let people be as free as they wanted with their ideas.

Set timeline

When planning the Hackathon’s timeline, you need to take the following milestones into account:

  • Define the Hackathon’s projects
  • Present the selected projects
  • Establish the project teams; everyone can choose to assign themselves to their favorite project
  • Define the winning KPIs
  • Establish the judges team
  • The Hackathon itself


As said before, this is true for regular Hackathons and even more so for a remote one. Since all communication is done via emails and Zoom meetings, you need to make sure people are informed about the current status: where we stand on the timeline and what the next steps are. In addition, due to the WFH situation, you need to invest more in people’s engagement; this can be done by sending relevant emails, mentioning the Hackathon during meetings, and setting up dedicated meetings for each milestone in the timeline.

Define the schedule

In most regular Hackathons, people gather in one place and work in groups on their projects. In a remote Hackathon people work from their homes, so the schedule needs to be prepared in advance, made known to all, and take into account all the limitations of working from home (kids, availability, etc.). In our schedule we also added fun activities, and we arranged giveaways that were delivered to people’s houses during the Hackathon itself.

My personal giveaway

Working area

There are some benefits to working remotely: instead of arranging meeting rooms for the project teams (one of the most challenging tasks in a regular Hackathon), you create a dedicated Zoom meeting and Slack channel for each team, to be used for all team communication during the Hackathon period. This way, everyone could jump into any project to say hi or check how things were progressing; our judges also used it for their periodic visits.

The Hackathon

The Hackathon itself was planned to take 2 days. Since we all worked remotely, we created a detailed schedule with check-in meetings and lots of fun activities, like a yoga session, a Tabata workout, and beer breaks; we wanted those 2 days to be used for fun in addition to the innovation work. All of this served our initial plan: to bring people closer even while working remotely.

So after a short kickoff session, each team started working on its project (through its Zoom and Slack channels). We followed the schedule with dedicated shared meetings until we reached the final one, the project presentations: each team was given 10 minutes to present its project, and in the end the winning project was declared, based on a public vote combined with the judges’ decision.

And in case you were wondering, the winning project was OBWho, a Slack bot that gives you a place to ask who owns any internal system. The main logic of the project is a crawler that scans all our Slack channels and builds a mapping with a machine learning algorithm that matches owners to systems, based on the tokens found in each channel.
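For flavor, here is a toy sketch of the kind of token-based owner mapping described above. The function, message format, and scoring logic are all illustrative assumptions, not OBWho’s actual implementation:

```python
# Hypothetical sketch: guess a system's owner by counting who mentions
# its tokens most across channel messages (illustrative only).
from collections import Counter

def map_owner(channel_messages, system_tokens):
    """Score each author by how often they mention the system's tokens."""
    scores = Counter()
    for author, text in channel_messages:
        hits = sum(text.lower().count(tok) for tok in system_tokens)
        if hits:
            scores[author] += hits
    return scores.most_common(1)[0][0] if scores else None

messages = [
    ("alice", "deployed the billing service again"),
    ("bob", "billing service is failing, restarting billing now"),
    ("carol", "lunch anyone?"),
]
print(map_owner(messages, ["billing"]))  # → bob
```

A real implementation would of course need smarter tokenization and a trained model, but the core idea of mapping channel activity to ownership is the same.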


All the hard work proved itself: we had a successful Hackathon with results that exceeded our expectations. The participation rate was very high, we had 6 great projects, and the distance limitation did not affect the people, who were as committed and engaged as in a regular Hackathon. Besides that, we had lots of fun with all the breaks and activities.

But most importantly, we achieved our initial goal: bringing people together to do innovative work combined with fun, despite the distance. For 2 days we were all YACHAD again.

To summarize everything in one place, the list below contains the additional actions to take into consideration when planning a remote Hackathon:

  • Lots of communication
  • Create slack channels for each project
  • Create zoom rooms for each project
  • Again, lots of communication
  • Define detailed schedule with sync zoom meetings
  • Have fun breaks through zoom
  • Send some giveaways to the people’s houses
  • In case it wasn’t clear by now, lots of communication
  • Collect feedback through survey

For us, this remote Hackathon made the impact we wanted, and we achieved exactly what we planned and more. I hope you can learn from our experience if you plan to hold your own successful remote Hackathon.


Faster Than a Missile: Intercepting Content Fraud https://www.outbrain.com/techblog/2020/07/intercepting-content-fraud/ https://www.outbrain.com/techblog/2020/07/intercepting-content-fraud/#respond Thu, 09 Jul 2020 09:09:30 +0000 http://wpadmin.outbrain.com/techblog/?p=2784 On May 21, 2020, thousands of Israeli websites were hacked by a group called “Hackers Of Savior”, probably including hackers from Turkey, The Gaza Strip, and North Africa. Not the first – and surely not the last – cyber-attack of


The post Faster Than a Missile: Intercepting Content Fraud appeared first on Outbrain Techblog.


On May 21, 2020, thousands of Israeli websites were hacked by a group called “Hackers Of Savior”, probably including hackers from Turkey, The Gaza Strip, and North Africa. Not the first – and surely not the last – cyber-attack of its kind, the hackers targeted websites stored on uPress, an Israeli web hosting company.

The websites’ content was replaced by a page with a statement against Israel  –  “The countdown to Israel’s destruction began a long time ago”  – and a disturbing video clip. The attack was timed to coincide with the Israeli national holiday “Jerusalem Day”, which celebrates the reunification of the city after the Six-Day War. It also coincided with the Iranian “Quds Day”, which was probably an additional motivation for the attack.

The attacker’s content as shown on the hacked websites

Beyond the visual defacement of the websites, the attackers also tried to delete all the websites’ data. In addition, the hackers prompted the browser to ask viewers for permission to take their photo with their computer’s webcam.

This attack was not your ordinary run-of-the-mill fraud event. It did not revolve around the usual advertiser-sourced content fraud. Rather, the source of the problem was an outside party and Outbrain was not even the target. The attack harmed not just the website publishers and end-users, but advertisers too.

Before diving in to see how Outbrain coped on the battleground during this event, let’s start at the beginning. What is “content fraud”, why is it such a challenge, and how does Outbrain protect its network and partners every day?


Content Fraud – The Major Challenge for Content Networks

Content fraud is an industry-wide problem that comes in many forms, including phishing, redirects, cloaking, malvertising, conspiracy-mongering and fake news. The internet is flooded with an endless amount of fake content, such as fake quotes, doctored images, and deepfake video, making it increasingly difficult to navigate towards an authentic online experience. And because content fraud is always evolving, it becomes ever harder to detect and prevent over time.

As a discovery company, part of Outbrain’s mission is to provide qualified content to its readers. While the overwhelming majority of content provided by advertisers is qualified, interesting and safe, Outbrain invests significant resources to ensure that no malicious or deceitful content makes its way onto our advertising platform.

When content fraud is detected, it comes from a specific and identifiable source that can be easily blocked and prevented from accessing Outbrain’s network. Generally, we use multiple technologies and people-powered techniques to verify that advertisers are legitimate and promoting qualified content. We look at a variety of properties, the identity of the advertiser, and the content itself. Based on the technologies Outbrain usually uses, the ability to respond is broad and fast.

In the recent event, the challenge for Outbrain was different.


Why and How Did Outbrain Respond to the “Hackers of Savior” Cyber Attack?

The cyber event of May 2020 presented our anti-fraud specialists with a different and unique set of obstacles.

Firstly, we were not battling against an entity that was part of our network. The threat did not come from one of our partners – rather, our publishers and advertisers were first in the line of fire.

Secondly, the attack was broad and not sourced from a specific website. We assumed that some websites in our network had been corrupted, however, we didn’t know which ones. Targeted and specific blocking is the most common anti-fraud solution, yet in this case, we had to find and block multiple domains that were yet unknown.

Thirdly, any action would be temporary. When we identify a fraudster on the network, they are permanently blocked and banished from the network. There is no going back. However, in this case, we knew that once the affected websites recovered, we had to immediately re-enable them and permit them back on the network.

Lastly, even though Outbrain was not the target, we had to stay on our toes, and to assume that the attack was still happening or could happen again at any time. Although we identified the characteristics when it occurred, we had to be ready to quickly detect any new method or pattern that the attackers could develop and deploy.

The response by our anti-fraud team was designed to protect all three pillars of the Outbrain ecosystem:

  • The first pillar – Users who need protection from threatening content.
  • The second pillar – Advertisers who need protection from being charged for harmful visits and damage to their brand image.
  • The third pillar – Publishers who were prevented from directing traffic to threatening content that would harm their reputation.


The Technology Challenge

In order to find the corrupted sites, we had to scan loads and loads of web pages. Once such a page was found, we had to disable it immediately so it would no longer be promoted on the Outbrain network. These scans had to be done quickly and repeatedly. If a cyber attack is ongoing, pages can be hacked even after we scan them, and if the pages have recovered, we need to know straight away in order to recommend them back to the network.

“The hacked pages shared the same attributes” – One of the attributes we discovered from crawler logs

How did we identify a hacked page? We had a few options:

  1. Published lists of the hacked websites:  In some cyber attacks, this can be a good solution, but not in this case. The attack was a rolling one, which meant not all the websites were hijacked at once. There was no single reliable source that contained all the hacked websites.
  2. Third-party solutions: In order to maintain the integrity of the Outbrain network, the Anti-Fraud team uses not only internal solutions that were developed in-house, but also external solutions, to flag problematic content, like malware, cloakers, redirects, and other fraud types. A third-party solution can be used to classify specific characteristics as a website “under attack”,  so it might have been an option in a case like this. However, making the adjustments necessary to detect this specific attack would take too long, and in this instance, the ability to react fast was super important.
  3. Outbrain’s in-house crawler: In order to recommend the best content to readers, Outbrain crawls the promoted content in its network, analyzing characteristics and extracting features that are used in the NLP and recommendation algorithms. Our in-house crawler was identified as the ideal solution for this challenge, as it is already connected to our databases. To get immediate results, we only had to define the population we wanted to crawl. Moreover, with some additional configuration and quick development by the AppServices team, we could tackle all the challenges at once, automatically and rapidly.


Crawling Our Way Out

So Outbrain’s in-house crawler was immediately mobilized for detecting which sites were hacked.

We started by analyzing several hacked sites found in multiple sources across the net. Using our crawler output logs, we discovered that the pages were identical in terms of both structure and visual content, meaning they shared the same attributes (title, text, links, keywords, etc.). By querying the crawler logs for those attributes, we were able to easily identify more hacked domains. The problem was that the picture we saw was only a partial one, since not all the pages were crawled yet.

We chose the relevant potentially-hacked domains list and set it to be crawled in cycles of 3 hours, using a configured crawler feature that was specially tailored for this problem. Once the crawler identified a new (specific) value for the attributes the attackers injected into the web pages, we could disable the hacked pages. Disabling a page meant that the hacked native ads were no longer available as recommendations on the Outbrain network, and were no longer accessible by our users.
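To illustrate the idea (this is a minimal sketch under assumed log and attribute formats, not Outbrain’s actual code), flagging crawled pages whose extracted attributes contain a known injected marker could look like this:

```python
# Illustrative sketch: partition crawled pages into hacked/clean based on
# attribute values the attackers injected (marker text is an example taken
# from the defacement message quoted above).
INJECTED_MARKERS = {"the countdown to israel's destruction"}

def is_hacked(record):
    """Flag a page when a known injected value appears in its attributes."""
    haystack = " ".join([record.get("title", ""), record.get("text", "")]).lower()
    return any(marker in haystack for marker in INJECTED_MARKERS)

def partition(records):
    hacked = [r["url"] for r in records if is_hacked(r)]
    clean = [r["url"] for r in records if not is_hacked(r)]
    return hacked, clean

# Hypothetical crawler-log records
records = [
    {"url": "https://site-a.example/post", "title": "Recipe blog",
     "text": "How to bake bread"},
    {"url": "https://site-b.example/", "title": "Hacked",
     "text": "The countdown to Israel's destruction began a long time ago"},
]
hacked, clean = partition(records)
```

In the real flow, the "hacked" list would feed the disable step, and pages moving back to "clean" on a later crawl cycle would be re-enabled.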

The ongoing scans also allowed us to monitor and understand the impact of the attack on our inventory in real-time and to communicate it to our partners. When web pages were recovered from the cyber attack, we used the crawler to re-enable them on the Outbrain network. With the help of our unique tech capabilities, we were able to come full circle out of this pernicious cyber attack.


Rapid Response Saves the Day

The attack began on the morning of May 21. According to uPress’s announcement, by May 22, 02:28 (Israel Time), 90% of the websites were recovered. As a result of our actions, we were able to prevent Outbrain users from being exposed to hacked web pages containing disturbing content.

This rapid reaction not only protected our users from threatening content, but it also enabled affected publishers and advertisers to return to their usual Outbrain traffic services very quickly after websites were recovered.


What Doesn’t Kill Us, Makes Us Stronger

During the cyber rescue operation, we were in direct contact with several teams in the organization, enabling us to present a united front and a holistic solution. Along the way, we benefited from the ability to pivot in several different directions as necessary, both internally and externally. It was a great lesson about how to use various tools to solve an array of problems.

With the rapid, coordinated response from Outbrain’s teams, we overcame the “Hackers of Savior”  attack, and also developed the infrastructure to manage similar cyberattacks in the future, if – and when – they occur.


Real Time Data Pipeline – More Than We Expected https://www.outbrain.com/techblog/2020/05/real-time-data-pipeline-more-than-we-expected/ https://www.outbrain.com/techblog/2020/05/real-time-data-pipeline-more-than-we-expected/#respond Wed, 27 May 2020 06:08:55 +0000 http://wpadmin.outbrain.com/techblog/?p=2747   When we were considering migrating our data delivery pipeline from batches of hourly files into a real time streaming system, the reasons we had in mind were the obvious ones; to reduce the latency of our processing system and


The post Real Time Data Pipeline – More Than We Expected appeared first on Outbrain Techblog.



When we were considering migrating our data delivery pipeline from batches of hourly files into a real time streaming system, the reasons we had in mind were the obvious ones; to reduce the latency of our processing system and to make the data available in real time. But as soon as we started to work on it, we realized that there are quite a few additional good reasons to embark on this complex project.


Our data delivery pipeline is designed to deliver data from various services into our Hadoop ecosystem. There it is processed using Spark and Hive in order to produce aggregated data to be used by our algorithms and machine learning (ML) implementations, business reports, debugging, and other analysis. We call each type of this data a “datum”. We have more than 60 datums, each representing a different type of data, for example: request, click, listings, etc.

Outbrain is the world’s leading content discovery platform. We serve over 400 billion content recommendations every month, to over 1 billion users across the world. In order to support such a large scale, we have a backend system built of thousands of microservices running inside Kubernetes containers, on top of an infrastructure of more than 7000 physical machines spread across 3 data centers and public clouds (GCP & AWS).

As you can imagine, our services produce lots of data that passes through our data delivery pipeline. On an average day, more than 40TB moves through it, and on peak days it can exceed 50TB.

And yes, in order to support this scale, the delivery pipeline needs to be scalable, reliable and lots of other words that end with “able”. 


Pipeline Architecture

As I wrote above, we have multiple data centers (DCs), but in order to simplify the diagram and explanation, I will use only 2 DCs to present our pipeline architecture. As we go through the two architectures (legacy hourly and current RT), you will notice that the only components which remain the same are the edges: the services that produce the data, and the Hadoop where the data ends up.

Another thing worth mentioning is that in order to have full disaster recovery (DR), we have 2 independent Hadoop clusters in separate locations. Each of them gets all the data and processes the same jobs independently; if one of them goes down, the other continues to work normally.


Hourly Files Pipeline – legacy

The data is collected by the services, and saved to files on the local file system of each service. From there, another process copies those files to our Data Processing System (DPS), which is an in house solution that collects all the data and copies it into the Hadoop. As you can see, it is a very simple architecture without a lot of components, and is quite reliable and robust since the data is saved in files which can be easily recovered in case of any malfunction.

The drawback of this pipeline is that the data moves in hourly chunks and is not accessible during the delivery window. It is only available for processing after each hour is completed and all the data is in the Hadoop. It also places a heavy burden on the network, because the data is transferred in spikes.


Real Time Streaming Pipeline

Replacing the local file system storage, we have a Kafka cluster. And instead of our in-house DPS system, we have a MirrorMaker, an aggregated Kafka cluster, and a SparkStreaming cluster.

In this pipeline, the data is written directly from the service to the Kafka cluster. From this point the data is accessible to anyone who wants to use it: analysts, algo developers, or any other service that finds it useful. Of course, it is also available to the MirrorMaker, as part of the pipeline.

The MirrorMaker’s job is to sync data between Kafka clusters; in our case, to make sure that each aggregated Kafka cluster has all the data from all DCs.

As before, the data in the aggregated cluster is available to all, especially to our SparkStreaming cluster, which runs various jobs that consume the data and write it to the Hadoop.

So it’s clear that with this pipeline, the data is available to everyone, and it reaches the Hadoop faster.
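As a rough illustration of the topology: Kafka, MirrorMaker and Spark Streaming are the real components, but the classes below are toy in-memory stand-ins meant only to show the data flow from DC-local topics through the mirror into one aggregated sink, not actual client code:

```python
# Toy model of the RT pipeline: per-DC topics, a MirrorMaker-like replicator,
# and a consumer that "writes to Hadoop" (here, just a dict).
from collections import defaultdict

class Topic:
    def __init__(self):
        self.messages = []
    def produce(self, msg):
        self.messages.append(msg)

def mirror(source_topics, aggregated):
    """MirrorMaker's role: replicate every DC-local topic into the aggregate."""
    for topic in source_topics:
        for msg in topic.messages:
            aggregated.produce(msg)

# One "click" topic per data center
dc1, dc2 = Topic(), Topic()
dc1.produce({"dc": "dc1", "datum": "click"})
dc2.produce({"dc": "dc2", "datum": "click"})

aggregated = Topic()
mirror([dc1, dc2], aggregated)

# The streaming job consumes the aggregate and lands each datum type
hadoop = defaultdict(list)
for msg in aggregated.messages:
    hadoop[msg["datum"]].append(msg)

print(len(hadoop["click"]))  # → 2 (both DCs' events land in one place)
```

The point the toy makes is the same one the architecture makes: every consumer downstream of the aggregated cluster sees the complete, cross-DC stream as soon as it is produced.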


Obvious Benefits

Now let’s go over the obvious benefits of the real time pipeline:

  • Data availability – since the data is passed through Kafka clusters, it is available to all implementations that need real time data. Here are 2 good examples:
    • Recommendation engines can now calculate their algo models with fresher data. This led to a lift of more than 16% in our Revenue Per Mille (RPM).
    • We added real time customer-facing dashboards that enable customers to take required actions faster, based on fresh data.
  • Reduced latency of the processing system – since the data reaches the Hadoop faster, the hourly jobs that wait for it can start sooner and complete much closer to the end of the hour. This reduced the overall latency of the processing system; business reports, for example, are now available in a shorter time.


Additional Benefits

And these are the unexpected benefits we gained from the real time pipeline:

  • Network capacity – In the hourly files pipeline, the data is moved in batches at the end of each hour. This means the network needs enough capacity to move all the data at once, forcing us to allocate bandwidth that is used only for a short time each hour, wasting expensive resources. The real time pipeline moves the same amount of data incrementally over the course of the hour.

    The graphs below demonstrate the bandwidth savings we made once we moved to the real time pipeline. For example, in the RX graph, you can see that we moved from peaks of 17 GBps to a flat bandwidth of 7 GBps, saving 10 GBps.

Total RX traffic

Total TX traffic

  • Disaster recovery – the fact that in the hourly based pipeline the data is saved in files on the local file system of the machines that run the services has some limitations:
    • Data loss – as we all know, machines have a 100% chance of failing at some point, and when you have thousands of machines, the odds are against you. Each time a machine goes down, you risk losing data, since the latest hourly files may not have been copied into the pipeline yet. In the real time pipeline, the data is written immediately to the Kafka cluster, which is more resilient, so the risk of data loss is reduced.
    • Late processing – if a machine recovers from a failure, and you were lucky enough to avoid data loss, the recovered data still needs to be processed, and in most cases this won’t happen within the time period the data belongs to. That time period then has to be processed again, which adds extra load on the Hadoop and may cause data delays, since the Hadoop has to process multiple time periods at the same time. As before, the real time pipeline’s benefit here is that the data reaches the Hadoop without delay, so no time period needs to be reprocessed.



With the real time pipeline, our lives became much simpler. Besides the planned goals (data availability and reduced latency), the extra goodies we got from this change made us less sensitive to network glitches and hardware malfunctions. In the past, each of those issues forced us to handle data delays and risk data loss; the real time pipeline, by its nature, solved all of them.

Yes, there are more components to maintain and monitor, but this cost is justified when compared to the great results we achieved by implementing this real time system for our data delivery pipeline.



4 Engineers, 7000 Servers, And One Global Pandemic https://www.outbrain.com/techblog/2020/05/4-engineers-7000-servers-and-one-global-pandemic/ Sun, 17 May 2020 07:41:18 +0000 http://wpadmin.outbrain.com/techblog/?p=2710 Inside one of our data centers (tiny COVIDs by CDC  on Unsplash)   If this title did not send a slight shiver down your spine, then you should either move on to the next post, or drop by our careers page


The post 4 Engineers, 7000 Servers, And One Global Pandemic appeared first on Outbrain Techblog.


Inside one of our data centers (tiny COVIDs by CDC  on Unsplash)


If this title did not send a slight shiver down your spine, then you should either move on to the next post, or drop by our careers page – we’d love to have a chat.

This post touches on the core of what we do – building up through abstraction. It will demonstrate how design principles, like separation of concerns and inversion of control, manifest in the macro, physical realm, and how software helps bind it all together into a unified, ever-developing, more reliable system.

While we’re only touching the tip of the iceberg, we intend to dive into the juicy details in future posts.


Who we are

We are a team of 4 penguins who enjoy writing code and tinkering with hardware. In our spare time, we are in charge of deploying, maintaining and operating a fleet of over 7000 physical servers running Linux, spread across 3 different DCs located in the US.

We also happen to do this 6,762 miles away, from the comfort of our very own cubicle, a short drive from the nearest beach resort overlooking the Mediterranean.



Challenges of Scale

While it may make sense for a start-up to host its infrastructure in a public cloud due to the relatively small initial investment, we at Outbrain choose to host our own servers. We do so because, once a certain scale is reached, the ongoing costs of public cloud infrastructure far surpass the costs of running our own hardware in colocated data centers, and doing so offers an unparalleled degree of control and fault recovery.

As we grow in scale, challenges are always a short distance away, and they usually come in droves. Managing a server’s life cycle must evolve to contain the rapid increase in the number of servers. Spreadsheet-native methods of managing server parts across data centers become, very quickly, cumbersome. Detecting, troubleshooting and resolving failures, while maintaining reasonable SLAs, becomes an act of juggling vastly diverse hardware arrays, different loads, timing of updates and the rest of the lovely things no one wants to care about. 


Master your Domains

To solve many of these challenges, we’ve broken down a server’s life cycle at Outbrain into its primary components, and we called them domains. For instance, one domain encompasses hardware demand, another the logistics surrounding inventory lifecycle, while a third is communication with the staff onsite. There is another one concerning hardware observability, but we won’t go into all of them at this time. The purpose of this is to examine and define the domains, so they can be abstracted away through code. Once a working abstraction is developed, it is translated into a manual process that is deployed, tested and improved, as a preparation for a coded algorithm. Finally, a domain is set up to integrate with other domains via APIs, forming a coherent, dynamic, and ever evolving hardware life cycle system that is deployable, testable and observable. Like all of our other production systems. 

Adopting this approach has allowed us to tackle many challenges the right way – by building tools and automations.


The Demand Domain

While emails and spreadsheets were an acceptable way to handle demand in the early days, they were no longer sustainable as the number of servers and volume of incoming hardware requests reached a certain point. In order to better organize and prioritize incoming requests in an environment of rapid expansion, we had to rely on a ticketing system that:

  • Could be customised to present only relevant fields (simple)
  • Exposed APIs (extendable)
  • Familiar to the team (sane)
  • Integrated with our existing workflows (unified)

Since we used JIRA to manage our sprints and internal tasks, we decided to create another project that would help our clients submit tickets and monitor their progress. Relying on JIRA for both incoming requests and internal task management, allowed us to set up a consolidated Kanban board that gave us a unified view. Our internal customers, on the other hand, had a view that presented only hardware request tickets, without exposing “less relevant details” of additional tasks within the team (such as enhancing tools, bug fixing and writing this blog post).

(JIRA Kanban board)

As a bonus, the fact that queues and priorities were now visible to all, gave visibility into “where in line” each request stood, what came before it, what stage it was in, and allowed owners to shift priorities within their own requests without having to talk to us. As simple as a “drag and drop”. It has also allowed us to estimate and continuously evaluate our SLAs according to the types of request, based on metrics generated by JIRA.
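For readers who want to wire up something similar, here is a hedged sketch of filing a hardware-request ticket through JIRA’s REST API. The `/rest/api/2/issue` endpoint and payload shape follow JIRA’s documented v2 API, but the project key, issue type, field values, and token are made-up examples, not Outbrain’s configuration:

```python
# Sketch: build a JIRA "create issue" request (illustrative values).
import json
import urllib.request

def build_request(base_url, auth_token, summary, description):
    payload = {
        "fields": {
            "project": {"key": "HWREQ"},     # hypothetical project key
            "issuetype": {"name": "Task"},
            "summary": summary,
            "description": description,
        }
    }
    return urllib.request.Request(
        f"{base_url}/rest/api/2/issue",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {auth_token}",
        },
        method="POST",
    )

req = build_request(
    "https://jira.example.com", "token123",
    "2 x 1.92TB SSD for dc1-rack14",
    "Needed for Kafka broker expansion",
)
# urllib.request.urlopen(req) would submit it against a real instance
```

The same endpoint family also exposes issue transitions and search, which is what makes the Kanban-style automation and SLA metrics described above possible.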


The Inventory Lifecycle Domain

You can only imagine the complexity of managing the parts that go into each server. To make things worse, many parts (memory, disk) travel back and forth between inventory and different servers, according to requirements. Lastly, when they fail, they are either decommissioned and swapped, or returned to the vendor for RMA. All of this, of course, needs to be communicated to the colocation staff onsite who do the actual physical labor. To tackle these challenges we created an internal tool called Floppy. Its job is to:

  • Abstract away the complexities of adding/removing parts whenever a change is introduced to a server
  • Manage communication with the onsite staff including all required information – via email / iPad (more on that later)
  • Update the inventory once work has been completed and verified

The inventory, in turn, is visualized through a set of Grafana dashboards, which is what we use for graphing all our other metrics. So we essentially use the same tool for inventory visualization as we do for all other production needs.


(Disk inventory management Grafana dashboard)


When a server is still under warranty, we invoke a different tool we’ve built, called Dispatcher. Its jobs are to:

  • Collect the system logs
  • Generate a report in the preferred vendor’s format
  • Open a claim with the vendor via API
  • Return a claim ID for tracking purposes

Once the claim is approved (typically within a business day), a replacement part is shipped to the relevant datacenter, and is handled by the onsite staff.

(Jenkins console output)
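A Dispatcher-like flow can be sketched as follows; every function name here and the vendor API are hypothetical stand-ins, since the actual tool and vendor integrations are internal:

```python
# Illustrative claim flow: collect logs -> format report -> open claim.
def collect_system_logs(host):
    # Stand-in for pulling hardware/system logs off the failed server
    return f"logs-from-{host}"

def build_vendor_report(host, part, logs):
    return {"host": host, "failed_part": part, "evidence": logs}

def open_claim(report, submit_fn):
    """submit_fn abstracts the vendor API call; returns a tracking claim ID."""
    return submit_fn(report)

# A fake vendor endpoint for the sketch
fake_vendor_api = lambda report: f"CLAIM-{abs(hash(report['host'])) % 10000}"

logs = collect_system_logs("dc2-node-042")
report = build_vendor_report("dc2-node-042", "DIMM A3", logs)
claim_id = open_claim(report, fake_vendor_api)
```

Keeping the vendor call behind `submit_fn` is the inversion-of-control idea from the intro: the flow stays the same while each vendor’s API format is swapped in underneath.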


The Communication Domain

In order to support the rapid increases in capacity, we’ve had to adapt the way we work with the onsite data center technicians. At first, growing in scale meant buying new servers; following a large consolidation project (powered by a move to Kubernetes), it became something entirely different. Our growth transformed from “placing racks” to “repurposing servers”. So instead of adding capacity, we started opening up servers and replacing their parts. To successfully make this paradigm shift, we had to stop thinking of Outbrain’s colocation providers as vendors, and begin seeing them as our clients. Adopting this approach meant designing and building the right toolbox to make the work of data center technicians:

  • Simple
  • Autonomous
  • Efficient
  • Reliable

All the while abstracting away the operational differences between our colocation providers and the varying seniority of the technicians at each location. We had to remove ourselves from the picture and let them communicate with the server without our intervention, and without making assumptions about workload, work hours, equipment at hand, and other factors.

To tackle this, we set up iPad rigs in each data center. Once a rig is plugged into a server, the following happens:

  • The mechanism validates that this is indeed a server that needs to be worked on
  • The application running on the server is shut down (where needed)
  • A set of work instructions is posted to a Slack channel explaining the required steps
  • Once work is completed, the tool validates that the final state is correct
  • And if needed, restarts the application

In addition, we have also introduced a Slack bot meant to serve the technicians, with an ever-expanding set of capabilities to make their work smoother and our lives easier. By doing so, we’ve turned most of the process of repurposing and servicing servers into an asynchronous one, removing ourselves from the loop.
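The steps above can be sketched as a small state-driven flow. Every name here is made up for illustration; the real mechanism drives Slack and the server's own management interface.

```python
# Toy sketch of the iPad-rig servicing flow: validate, stop the app,
# post instructions, verify the final state, restart if needed.
class Server:
    def __init__(self, hostname, app_running=True):
        self.hostname = hostname
        self.app_running = app_running
    def stop_app(self):
        self.app_running = False
    def start_app(self):
        self.app_running = True
    def final_state_ok(self):
        return True  # stand-in for the real final-state validation

def service_flow(server, work_orders, notify):
    # 1. validate that this is indeed a server that needs to be worked on
    if server.hostname not in work_orders:
        notify(f"{server.hostname}: no open work order, aborting")
        return False
    # 2. shut the application down (where needed)
    was_running = server.app_running
    if was_running:
        server.stop_app()
    # 3. post the work instructions to the Slack channel
    notify(f"{server.hostname}: {work_orders[server.hostname]}")
    # 4. once work is completed, validate the final state
    ok = server.final_state_ok()
    # 5. and if needed, restart the application
    if ok and was_running:
        server.start_app()
    return ok

msgs = []
srv = Server("node-17")
done = service_flow(srv, {"node-17": "replace DIMM A2"}, msgs.append)
print(done, srv.app_running)  # True True
```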

(iPad rig at one of our data centers)


The Hardware Observability Domain

Scaling our datacenter infrastructure reliably requires good visibility into every component of the chain, for instance:

  • Hardware fault detection
  • Server states (active, allocatable, zombie, etc.)
  • Power consumption
  • Firmware levels
  • Analytics on top of it all

Metric-based decisions allow us to decide how, where and when to procure hardware, at times before we even get the demand. They also allow us to better distribute resources such as power, by identifying the more demanding workloads. Thus, we can make informed decisions about server placement before a server is racked and plugged into power, throughout its maintenance cycles, and up to its eventual decommissioning.

(Rack power utilization dashboard in Grafana)
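As a toy example of one such metric-driven decision, consider choosing a rack for a new server by power headroom. The rack data, field names and numbers below are invented for illustration.

```python
# Pick the rack with the most power headroom that can still fit the
# server's expected draw; return None if no rack has room.
racks = [
    {"name": "A1", "power_capacity_w": 8000, "power_used_w": 7100},
    {"name": "B3", "power_capacity_w": 8000, "power_used_w": 5200},
    {"name": "C2", "power_capacity_w": 8000, "power_used_w": 6400},
]

def place(server_draw_w, racks):
    headroom = lambda r: r["power_capacity_w"] - r["power_used_w"]
    fits = [r for r in racks if headroom(r) >= server_draw_w]
    return max(fits, key=headroom)["name"] if fits else None

print(place(900, racks))  # B3
```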


And then came COVID-19…

At Outbrain, we build technologies that empower media companies and publishers across the open web by helping visitors discover relevant content, products and services that may be of interest to them. Therefore, our infrastructure is designed to absorb the traffic generated when major news events unfold.

The media coverage of the events surrounding COVID-19 coupled with the increase in visitors traffic has meant that we had to rapidly increase our capacity to cope with these volumes. To top it off, we had to do this facing a global crisis, where supply chains were disrupted, and a major part of the workforce was confined to their homes.

But, as we described, our model already assumes that:

  • The hardware in our data centers, for the most part, is not physically accessible to us
  • We rely on remote hands for almost all of the physical work
  • Remote hands work is done asynchronously, autonomously and at high volume
  • We meet hardware demand by method of “building from parts” vs “buying ready kits”
  • We keep inventory to be able to build, not just fix

So the global limitations preventing many companies from accessing their physical data centers had little effect on us. And as far as parts and servers go – yes, we scrambled to secure hardware just like everyone else, but it was to ensure that we don’t run out of hardware as we backfill our reserves rather than to meet the already established demand.

In summary, I hope that this glimpse into our world of datacenter operations shows that one can apply the same principles of good code design to the physical domains of datacenter management, and live to tell the tale.


Want to hear more about the way we run our data centers? Stay tuned for the next part!

The post 4 Engineers, 7000 Servers, And One Global Pandemic appeared first on Outbrain Techblog.

How you can set many spark jobs write to the same path https://www.outbrain.com/techblog/2020/03/how-you-can-set-many-spark-jobs-write-to-the-same-path/ Thu, 12 Mar 2020 12:05:33 +0000


One of the main responsibilities of the DataInfra team at Outbrain, of which I am a member, is to build a data delivery pipeline. Our pipeline is built on top of the Kafka and Spark Streaming frameworks, and we process up to 2 million requests per second through dozens of streaming jobs.

fig 1: Simplified data flow architecture


Problem Description

One of our requirements was to read data from different Kafka clusters and stream the data to the same path in HDFS. This doesn’t sound complicated: several identical jobs, each provided with its Kafka address as a parameter. In practice, two concurrent jobs may delete each other’s files. Here is how.

The trouble starts when we call the saveAsHadoopFile() action in the Spark program.

The save action is performed by SparkHadoopWriter, a helper that writes the data and, at the end, issues a commit for the entire job.

See the relevant documentation for SparkHadoopWriter.

The commit is the part most relevant to our problem. The class that performs the commit by default is FileOutputCommitter which, among other things, creates a ${mapred.output.dir}/_temporary subdirectory where the files are written and later, after being committed, moved to ${mapred.output.dir}.

In the end, the entire temporary folder is deleted. When two or more Spark jobs have the same output directory, mutual deletion of files is inevitable.

(OutputCommitter Documentation)
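The failure mode is easy to reproduce in a toy filesystem model: both jobs write under the same ${mapred.output.dir}/_temporary, and the first job to commit deletes the whole temporary tree, taking the other job's in-flight files with it. This is an illustration of the collision only, not Spark code.

```python
# Toy simulation: two "jobs" share one _temporary directory, so the first
# commit's cleanup destroys the second job's uncommitted files.
import shutil
import tempfile
from pathlib import Path

def write_attempt(output_dir: Path, job: str, part: str) -> Path:
    tmp = output_dir / "_temporary" / job      # SHARED temp tree
    tmp.mkdir(parents=True, exist_ok=True)
    f = tmp / part
    f.write_text(f"data from {job}")
    return f

def commit_job(output_dir: Path, job: str) -> None:
    for f in (output_dir / "_temporary" / job).iterdir():
        f.rename(output_dir / f.name)          # promote committed files
    shutil.rmtree(output_dir / "_temporary")   # cleanup deletes EVERYTHING

out = Path(tempfile.mkdtemp()) / "output"
file_a = write_attempt(out, "job_a", "part-00000")
file_b = write_attempt(out, "job_b", "part-00001")
commit_job(out, "job_a")                       # job_a finishes first
print(file_b.exists())  # False: job_b's uncommitted file is gone
```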

I tried to find an easy solution on Stack Overflow and in the Spark community but couldn’t find anything, except for suggestions to write to different locations and use DistCp later on, which would require additional compute resources I’d rather spare.

The solution:

We’ve created our own OBFileOutputCommitter, which is almost identical to the default FileOutputCommitter but supports changing the temporary directory through configuration. Fortunately, we can plug our own committer in through the Spark configuration. This way each job has its own temporary folder, so one job’s cleanup won’t delete another job’s data.

There is a catch, of course: you’ll have to use MultipleTextOutputFormat to make sure that the files have unique names. If you don’t, two jobs will produce the same default file names, which will collide.

Here is the link to the custom committer code. Add it to your project and follow the example below.

View the code on Gist.
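In the same toy model as before, the fix described above looks like this: each job gets its own temporary directory (the custom committer's role) and writes job-unique file names (MultipleTextOutputFormat's role). This is a sketch of the idea, not the actual committer code.

```python
# Toy simulation of the fix: per-job temp dirs plus job-unique file names,
# so one job's commit can no longer delete another job's files.
import shutil
import tempfile
from pathlib import Path

def write_attempt(output_dir: Path, job: str, part: str) -> Path:
    tmp = output_dir / f"_temporary_{job}"     # per-job temp dir
    tmp.mkdir(parents=True, exist_ok=True)
    f = tmp / f"{job}-{part}"                  # job-unique file name
    f.write_text(f"data from {job}")
    return f

def commit_job(output_dir: Path, job: str) -> None:
    tmp = output_dir / f"_temporary_{job}"
    for f in tmp.iterdir():
        f.rename(output_dir / f.name)
    shutil.rmtree(tmp)                         # cleans up ONLY this job's dir

out = Path(tempfile.mkdtemp()) / "output"
write_attempt(out, "job_a", "part-00000")
survivor = write_attempt(out, "job_b", "part-00000")
commit_job(out, "job_a")                       # job_a commits first
print(survivor.exists())  # True: job_b's file is untouched
```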


The Hadoop framework is flexible enough to amend and customize output committers. With proper documentation it could have been easier, but that shouldn’t stop you from trying to alter the framework to suit your needs.

The post How you can set many spark jobs write to the same path appeared first on Outbrain Techblog.

I was a developer. Now I’m a Product Manager. Why and how. https://www.outbrain.com/techblog/2020/03/i-was-a-developer-now-im-a-product-manager-why-and-how/ Wed, 11 Mar 2020 06:24:37 +0000


Photo by davisco on Unsplash

I love writing code. I love everything about it. Coding is like solving puzzles. You get a problem and you need to think how to solve it. Think up the best solution, optimize it, debug it, squash bugs. I love writing code so much that I spent 20 years working as a developer.  I even got used to waking up at night with a deep understanding that I just found a bug in my code. I love it! And despite having had several opportunities during my career to shift to other positions, I always decided to stick to coding. In a sense, I was afraid of not getting my hands dirty, of stepping away from the keyboard.


Product Hunt

When I joined Outbrain as a backend software developer in the Platform group, I took ownership of a backend service called Dyploma. Dyploma is a deployment gateway built on top of Kubernetes, enabling developers to build artifacts and deploy them to multiple data centers and environments with the click of a button. It makes our developers’ lives super easy. 

Dyploma is a product with 200 users. A live product which is being used constantly, with feature requests, bugs and direct impact on the production services of Outbrain. With so many users, requirements and impact, we realized we wanted a product oriented approach to managing and evolving it. But we had no prior experience managing products, nor any product managers in the group. Our solution to this challenge was to introduce the Product Leads.

Product leads were an attempt to get the best of both worlds – keep the technical people doing the tech stuff they love and excel at, but also allow them to take on Product Manager responsibilities part-time. Dyploma wasn’t the only product in the Platform group where we wanted to implement a product-oriented approach, we had others as well. So when Platform leadership invited people to volunteer for these Product Lead roles, due to my deep sense of involvement with the product I was developing, I decided to take the challenge head on.

I took it upon myself to be the product lead for Dyploma, in parallel to being its backend developer. I met with the users, heard their wishes and complaints, mapped the different flows they needed, the new features they wanted, prioritized them… then switched hats to “backend developer” to do the implementations. It was fun, exciting, difficult, satisfying – it was everything I loved about writing code, but… not necessarily writing code.

Since I started doing something I was not familiar with, I decided I should start learning about what it takes to be a product manager. I have worked with product managers but to actually be one is different, and it had (and still has) its learning curve.


The Lean Startup

I learned that there is a constant loop of Learn->Build->Measure (Lean Startup anyone?).


For us, the “Learn” stage included creating a focus group, meeting the customers, documenting their requests and feedback.

One of the big advantages of working on internal products is that you can meet your customers all the time (this should be handled with care, so it does not become a pain…). I would walk down the corridors and people would jump at me and start bombarding me with ideas and problems. Corridor chats are nice, but we needed more formalized communication channels to make communication scale, so I created focus groups and held periodic meetings with them. I used the documentation of these meetings when prioritizing features.

I used user stories to help me focus on what was really important: the Minimum Viable Product (MVP). I then sat with a focus group of developers and we reviewed these user stories for feedback and validation. A Slack channel served for more ad-hoc documentation (and a paper trail), and boy, was it busy…

Next, the ‘Build’ stage. I figured this would be easy, since I was also the developer. But working with the UI team and seeing how the backend and frontend code work together toward a common goal, required going beyond just writing code. It required synchronization and team work.

Once built, came the ‘Measure’ stage. At first, this didn’t seem important – “if you build it, they will come” felt intuitively right. But that wasn’t the case. After implementing one specific, large feature, we found that user adoption was there… but we didn’t get the value we wanted. Partly because we hadn’t defined the value clearly enough, or put KPIs in place to measure it. In fact, once we implemented the KPIs, it seemed we were moving in the opposite direction!

Our learnings here led us to implement a “KPIs first” approach, which wasn’t simple or necessarily intuitive. Instead of working on the feature, we would first work on the metrics and define the measurements, and only then move on to building the feature. This allowed us to “focus on the target” at all times and change direction quickly if needed. Another set of KPIs would focus on feature impact (for example, the number of affected users) and implementation effort (a high-level estimation is usually good enough). These reflect the feature’s perceived value and cost, so we could not only measure whether the feature moves the needle, but also understand how to prioritize the different features and manage our resources.

We applied KPIs to customer satisfaction as well – we started gathering customer feedback, and were happy to see that our customers played along. They felt they were being heard and given a mandate to influence the product’s direction.

To close the loop, following measurements, we returned to the ‘Learn’ stage, plotting the next steps in the product’s evolution.


Present day

Fast forward a year and a half – the Product Leads initiative was a partial success. Unfortunately, the number of Product Leads who kept the add-on role over time wasn’t high. It was… one. But wherever we had a Product Lead, we got what we asked for :-). We had good user interaction, a product roadmap and a regular release cadence centered around it. This was proof of the hypothesis we had in mind: as an infrastructure group, we build products! And every product, whether internal or external, should be managed as one.

From this point on, we decided to invest more in this direction. I moved on to a role of full time Product Manager in the Platform group. While writing these lines, I manage 5 different products, developed by multiple teams, with more to come!

Looking back, here are some of my insights on what made this adventure a successful one.  

  • Personal Skills – Like in many cases, a great human interface makes everything work better. The ability to build good relationships with different types of customers, make them feel they can approach you anytime and give feedback. Building these relationships will also enable you to facilitate better coordination and collaboration between the different teams. 
  • Being a good Listener – As any good Product Manager, you need to collect ideas, listen to problems and needs, get feedback – and for all of that to be done well, you need to be a good listener. This usually involves keeping quiet while others talk 🙂
  • Being a good Investigator – Sometimes your customers are “too nice”. It might be that they don’t feel comfortable passing criticism, or they simply did not put the effort to think about it. They might not even realize they feel uncomfortable with some aspects of your product. To overcome this obstacle, you need to learn how to ask the hard questions.  Look for the criticism, even if it doesn’t feel great, and focus on it. Dig deep, listen and look for hints.
  • Methodology – Working with different people, getting requirements all around, and choosing the right things to focus on – you must have a system if you want to get those right. When a VP has a feature request – is it more important than the feature request of the single developer? The system is more than just the priorities, it is also the KPIs, making sound decisions and holding conversations based on data and not on beliefs. The lighthouse should always be the value you bring to the product, and not necessarily the requestor’s role, how often they ask for it, or other “psychological bias factors”.

This past year was quite a ride – I learned a lot, gained exciting new experiences, failed and succeeded countless times. There’s still so much for me to learn, but one thing I already know – I love being a Product Manager!

The post I was a developer. Now I’m a Product Manager. Why and how. appeared first on Outbrain Techblog.

Trusting the Infrastructure https://www.outbrain.com/techblog/2020/02/trusting-the-infrastructure/ Mon, 10 Feb 2020 15:21:56 +0000


One of the biggest challenges facing infrastructure groups is how to build TRUST with their customers. If achieved, this trust will lead to a better working environment where colleagues have good relationships and trust each other.

And like any relationship, one of the ways to gain someone’s trust is by being transparent. Transparency shows that you have nothing to hide, and allows the free flow of information between the sides.

In our engineering context, we are talking about transparency of the micro services owned by development teams and the resources that the infrastructure groups provide them. 

Providing transparency into resource usage is a difficult feat even in a simple system. It is even more complex when resources are shared across development teams, since you need to show each team its own relevant usage rather than the overall usage of the whole physical resource.

At Outbrain we advocate for a full responsibility model – each development team has full ownership of its services, from development to production. This model gives the teams lots of power, as they have full freedom and independence to do their job (and they do it great!) without having the infrastructure teams as a bottleneck.

This model makes the need for transparency of resources (and especially shared resources) even more critical.

A Bit Of Nostalgia

In the past everything was simpler. I grew up in the Eighties (actually it was the Seventies, but we don’t have to be accurate all the time). We were young and beautiful, and everything was great and simple: there was no Internet, no mobile phones, and we played outside in the street with other real human beings.

Even the computer systems were simpler. We had the classic 3 Tier Architecture: Presentation, Application & Data layers. The Application layer was a big monolithic server and the Data layer was probably an Oracle RDBMS. The infrastructure teams had full control over all the systems – no changes were made in production unless they were the ones making them.

Back To Reality

The Eighties are gone, and we are not children anymore (but still beautiful…). On the tech side, the micro services architecture was introduced, the big monolithic servers were broken into many little micro services, and the big databases followed the same trend. Each micro service can have its own database and, as if that wasn’t complex enough, each micro service can even use a different type of database, depending on its own specific use case.

In addition, the amount of data and traffic grew exponentially, and the term BigData was coined. In order to support this growth, the systems became more and more complex; in some cases the architecture became so complex and dynamic that no single person can describe it, let alone map it in a flow chart.

The Challenge

Outbrain is the world’s leading content discovery platform. We serve over 300 billion content recommendations every month, to over 800 million users across the world. In order to support such a large scale, we have a backend system running hundreds of micro services on top of an infrastructure that is spread over more than 6000 physical machines spread between 3 data centers and in public clouds (GCP & AWS).

Our infrastructure contains, among other things, many data store clusters, data processing systems, data delivery pipelines and a BigData lake that is used by our recommendation ML algorithms, BI Engineers and analysts to produce business reports, etc.

And yes, the Cloud Platform Group, which is responsible for supplying all this infrastructure, faces the same challenge described at the beginning: how do we empower our users and give them transparency into how many resources they are using, and for what?


As complex as these computer systems are, the real complexity is actually the human factor. Every company has conflicts of two types: personal and structural.

As everyone knows, personal conflicts will always complicate engineering challenges. Structural conflicts, on the other hand, are between technical teams with different roadmaps, priorities and constraints. They can quickly lead to personal conflicts and to a work environment full of tension and frustration. Such an atmosphere will inevitably affect the quality of the computer systems…

In our case the structural conflict is between the R&D teams and the Infrastructure teams. For example, infrastructure teams often feel they are treated like warehouse workers: the R&D teams come with capacity or feature demands and the Infrastructure teams must fulfill them, while the R&D teams are not aware of, or concerned with, the costs of their requests or the effect they might have on a resource shared with other teams.

Structural conflicts will always exist, but they can be managed effectively as long as there is a relationship built between the teams.

The Solution

The base of every relationship is TRUST, and in our case the base for that is providing visibility into all the services and resources being used.


This visibility enables the infrastructure teams to know who is using the infrastructure and how. The same visibility helps the R&D teams understand the infrastructure capacity they are consuming. In fact, by providing this visibility we empower the R&D teams and take the Infrastructure teams out of the warehouse worker role.

We realized that we needed a single place that would enable us and our users to understand the footprint of the various resources in use, and for what purposes. This would also enable us to estimate the cost of each feature or project, leveraging our knowledge of the resources it uses.

At Outbrain we created a self-serve Usage Report system, which crawls our entire infrastructure and provides visibility and transparency of resources for everyone, through a set of dashboards and reports.


There are lots of resources that can be monitored by the Usage Report system, but we decided to focus on the following:

  • Micro services deployments
  • Physical machines
  • Public cloud resources
  • Data processing utilization
  • Storage
  • Network usage

As you might imagine, the most challenging part of the Usage Report system is the crawling process, which needs to correlate all the services with their users. This is done by adding an “owner” label to each service, to be used later in the aggregation. The Usage Report system had to support different types of labels, based on the capabilities of each resource.


While not all services are part of the Usage Report system yet, we have already started to see the benefits of this tool.

In the process of implementing this tool, we added metadata that identifies the resource usage of each development team, so we can now attribute usage to each team.

In addition, having this visibility lets us estimate the cost of a certain feature by summing the cost of all the resources it uses; that way we can know whether a feature is ROI positive or not.
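The aggregation this enables is straightforward once every usage record carries an owner label. A hedged sketch, with record shapes, team names and unit costs all invented for illustration:

```python
# Group resource usage by the "owner" label and price it with assumed
# per-unit costs, giving a per-team (or per-feature) cost estimate.
from collections import defaultdict

UNIT_COST = {"cpu_cores": 30.0, "storage_gb": 0.1}  # assumed monthly $ per unit

records = [
    {"resource": "cpu_cores",  "amount": 64,   "owner": "team-recs"},
    {"resource": "storage_gb", "amount": 5000, "owner": "team-recs"},
    {"resource": "cpu_cores",  "amount": 16,   "owner": "team-bi"},
]

def cost_by_owner(records):
    totals = defaultdict(float)
    for r in records:
        totals[r["owner"]] += r["amount"] * UNIT_COST[r["resource"]]
    return dict(totals)

print(cost_by_owner(records))  # {'team-recs': 2420.0, 'team-bi': 480.0}
```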

But the main benefit we gain is TRUST. We now have a real-time dashboard that enables everyone to understand their resource usage. This knowledge increases the trust between the R&D and Infrastructure teams, empowering everyone to be more efficient and to take our ownership model one step forward.

The post Trusting the Infrastructure appeared first on Outbrain Techblog.
