
How the Sports.ru and Tribuna.com data processing infrastructure works

In the first post about the Sports.ru and Tribuna.com analytics system, we talked about how we use our infrastructure in everyday life: we feed the recommendation system with content, monitor business metrics, look for diamonds among user-generated content, find answers to the questions "why does it work the way it does?" and "how can it be made better?", segment users for email campaigns and build beautiful reports on the company's activities. We modestly hid the entire technical part of the story behind a single diagram.

Readers rightfully demanded that the story with the funny cats be continued, and Oleg Bunin invited us to talk at RIT++ about everything hidden behind that diagram. So here are some of the technical details, as a continuation of a fun post.

Storage

As Sports.ru evolved from a news site into a full-fledged social network, new data analysis services (SimilarWeb, App Annie, Flurry, etc.) were added to the usual analytical tools such as Google Analytics and Yandex.Metrica. In order not to be torn between a dozen different applications, and to have a common coordinate system for measuring metrics, we decided to collect all the data on the traffic of all our sites, mobile applications and social media streams in one storage.

We needed a single service for working with heterogeneous data that would allow journalists to monitor the popularity of texts and authors, the social media editors to react quickly to emerging trends, and product managers to evaluate user engagement with our services and get the results of experiments when launching new functionality. We decided to store the collected data in a way that makes it easy to work with, and to present it in a form convenient for very different consumers: both people and systems.

What did we need to count? Traffic to the sites and mobile applications, the activity of registered users (comments, ratings, posts, subscriptions, registrations, etc.), the number of subscribers to our streams on social networks, the number of mobile application installs. Since the monthly traffic volume was approaching 250 million hits and kept growing, we decided not to use a conventional relational DBMS, but to choose a technology that could easily scale as the data volume grows. At the same time, we wanted the simplest and most convenient access to the data, for example in the form of SQL queries. As a result, we settled on Redshift, a SaaS data warehouse launched in the Amazon Web Services ecosystem in early 2013.

Redshift is a distributed columnar DBMS with a few peculiarities, such as the absence of integrity constraints like foreign keys or unique field values. Columnar DBMSs store records in such a way that queries with groupings and aggregate calculations over a large number of rows run as fast as possible. On the other hand, selecting all the fields of a single record by key (select * from table where ...), as well as modifying or inserting individual records, can take longer than in conventional databases. You can read more about how columnar DBMSs work in this post.

Redshift won us over with its easy access to data (you can work with it like regular PostgreSQL), easy scaling (as the data volume grows, new servers are added in a couple of clicks and the data is redistributed automatically) and low cost of adoption (compared, for example, with an Apache Hadoop + Hive stack). At the very beginning, by the way, we thought about our own Hadoop cluster, but abandoned the idea, thereby limiting ourselves: we lost the ability to build personalized real-time recommendations on top of our storage, to use Apache Mahout, and generally to run distributed algorithms that cannot be described in SQL. We did not really want to maintain a cluster of our own anyway.
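Since Redshift is wire-compatible with PostgreSQL, any Postgres client is enough for ad-hoc access to the storage. A minimal sketch with psycopg2; the endpoint, credentials and table name here are hypothetical, not our production ones:

    import psycopg2  # Redshift speaks the PostgreSQL protocol

    # Hypothetical cluster endpoint and credentials
    conn = psycopg2.connect(
        host='stats.xxxxxxxxxxxx.us-east-1.redshift.amazonaws.com',
        port=5439,  # Redshift's default port
        dbname='stats', user='analyst', password='...')

    with conn.cursor() as cur:
        # A typical columnar-friendly query: an aggregate over many rows
        cur.execute("""
            select trunc("timestamp") as day, count(*) as hits
            from clickstream
            group by 1
            order by 1
        """)
        for day, hits in cur.fetchall():
            print(day, hits)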


Collecting traffic data
To store page views and events, we use the client part of the Piwik open-source counter, which consists of a JavaScript tracker and a PHP/MySQL backend; the backend we threw away as unnecessary. The counter asynchronously sends requests to servers running Nginx, where they are written to standard logs. Requests from the counter contain all the necessary data, such as the page address, User-Agent, Referer, the traffic source, the date of the previous visit, a unique visitor ID from the cookie, and so on. Each request is also accompanied by a user ID (if known), by which we can join browsing data with the user activity data accumulated in the site database (comments, posts, upvotes and downvotes). In addition, nginx adds information about the visitor's geographic location to the logs (MaxMind + ngx_http_geoip_module).
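Each hit is, in essence, an asynchronous GET request whose query string carries the tracking fields. A rough example; the parameter names follow Piwik's tracker, the values are made up:

    GET /piwik.php?idsite=1&rec=1
        &url=http%3A//www.sports.ru/football/  # page address
        &urlref=http%3A//www.google.com/       # referrer
        &_id=108e36856acb1384                  # unique visitor ID from the cookie
        &_idvc=2                               # the visitor's visit count
        &_viewts=1395758520                    # date of the previous visit
        &uid=424242                            # site user ID, if known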
For writing to Amazon Redshift, the data is converted to a CSV-like form, sent to Amazon S3 and, optionally, split into several parts for faster loading. Parsing the log is done by a simple script that picks out the necessary data (see above) in one pass and makes simple transformations, such as translating the User-Agent into a Browser/OS pair. Since each request from the counter contains the unique visitor ID and the time of the last visit to the site, there is no need to compute user sessions at the data preparation stage. Moreover, pageview data from the same session can end up on different counter servers, so the hits of a session are glued together later, inside Redshift.
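A minimal sketch of that pipeline, assuming a simplified log format and hypothetical bucket and table names (the real script parses the full nginx log line, GeoIP fields included):

    import csv, gzip
    from urllib.parse import urlparse, parse_qs

    import boto3  # data gets into Redshift via S3 + COPY

    def parse_line(line):
        # Assume the request URI is the 7th space-separated field of the log line
        params = parse_qs(urlparse(line.split(' ')[6]).query)
        return [params.get('_id', [''])[0],      # unique visitor ID
                params.get('url', [''])[0],      # page address
                params.get('_viewts', [''])[0]]  # time of the previous visit

    with open('access.log') as src, gzip.open('hits.csv.gz', 'wt') as dst:
        writer = csv.writer(dst)
        for line in src:
            writer.writerow(parse_line(line))

    # Ship the part to S3, where Redshift picks it up in parallel
    boto3.client('s3').upload_file('hits.csv.gz', 'stats-bucket', 'hits/hits.csv.gz')

    # ...and on the Redshift side (run over a psycopg2 connection):
    # COPY clickstream FROM 's3://stats-bucket/hits/'
    # CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...' CSV GZIP;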
Thus, we collect basic metrics in our storage, such as the volume of impressions (inventory) or the number of unique users (audience), for any slice: site section, traffic source, browser, etc. A raw clickstream record carries, among other things, the hit timestamp, the visitor ID, the page URL, the visit counter, a new-visitor flag and the time of the previous visit:

    timestamp            id                url         visits_count  new_visitor  last_visit_ts
    3/25/2014 17:42:00   108e36856acb1384  /football/  1             1            NULL
    3/25/2014 17:43:00   108e36856acb1385  /cska/      1             1            NULL
    3/25/2014 23:42:00   108e36856acb1384  /football/  2             0            3/25/2014 17:42:00
    3/26/2014 0:42:00    108e36856acb1385  /cska/      4             0            3/25/2014 23:42:00

In most cases, when analyzing traffic, you do not need to find out something about one particular visitor or one of their sessions; what is needed are totals or averages, such as the number of unique visitors over a period or the average number of views per session. Therefore, it makes no sense to query the raw clickstream directly: it takes up a lot of space, and queries against it are not executed very quickly.

We pack the collected data into aggregates that are convenient for further analysis. For example, we roll an entire user session, which usually consists of several hits, into one record reflecting the main information about the visit: the duration and number of views in the session, the traffic source and other visit details:

    start               end                 id                last_visit_ts       pageviews
    3/25/2014 17:42:00  3/25/2014 17:53:00  108e36856acb1384  NULL                2
    3/25/2014 23:42:00  3/25/2014 23:53:00  108e36856acb1384  3/25/2014 17:42:00  4
    3/25/2014 17:43:00  3/25/2014 17:44:00  108e36856acb1385  NULL                2

From this data, aggregates of the next level can be built, for example daily totals per visitor:

    date       id                pageviews  sessions
    3/25/2014  108e36856acb1384  6          2
    3/25/2014  108e36856acb1385  2          1
    3/25/2014  a3e990633872eb74  12         3
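Rolling sessions up into such per-day aggregates is a single GROUP BY; a sketch over the illustrative schema above (not guaranteed to match our real table names):

    # conn: the psycopg2 connection from the sketch above
    with conn.cursor() as cur:
        cur.execute("""
            insert into daily_visitor_stats (date, id, pageviews, sessions)
            select trunc(start)   as date,       -- calendar day of the session
                   id,                           -- unique visitor ID
                   sum(pageviews) as pageviews,  -- total views that day
                   count(*)       as sessions    -- number of sessions that day
            from sessions
            group by 1, 2
        """)
    conn.commit()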

Working with the final aggregates is much faster than with the raw data: to get inventory numbers, it is enough to sum the view counts of all sessions, and to estimate the audience, to count the number of unique visitor IDs over a data array that is about 20 times smaller than the raw clickstream. To measure the reach of a particular section of the site, we create a separate profile (in Google Analytics terminology) in the form of a table, and precompute into this profile only the session data that satisfies the sampling conditions:
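For example, a profile covering only the football section could be precomputed like this (a sketch; the profile name and the condition are made up for illustration):

    # conn: the psycopg2 connection from the sketch above
    with conn.cursor() as cur:
        # Keep only the sessions that touched the /football/ section
        cur.execute("""
            insert into profile_football
            select s.*
            from sessions s
            where exists (select 1 from clickstream c
                          where c.id = s.id
                            and c."timestamp" between s.start and s."end"
                            and c.url like '/football/%')
        """)
    conn.commit()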

Grouping clickstream hits into sessions is performed by an SQL query once a day and takes a few seconds. Besides aggregating data by visits, we also precompute user activity by day, week and month, so that this data is instantly available to the relevant reports. We keep the clickstream for the last month in Redshift, so that we can run arbitrary queries over any slice and also have fresh data for the last hour.
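The gluing itself is a one-pass GROUP BY: hits of one visitor that carry the same last-visit timestamp belong, by construction, to the same session. A sketch over the illustrative schema (our production query also carries over the traffic source and other visit details):

    # conn: the psycopg2 connection from the sketch above
    with conn.cursor() as cur:
        cur.execute("""
            insert into sessions (start, "end", id, last_visit_ts, pageviews)
            select min("timestamp"),  -- session start
                   max("timestamp"),  -- session end
                   id,
                   last_visit_ts,
                   count(*)           -- pageviews in the session
            from clickstream
            where "timestamp" >= dateadd(day, -1, getdate())
            group by id, last_visit_ts
        """)
    conn.commit()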

We do not sample the raw data: we collect information about every page impression, so we always have the opportunity to compute statistics for any page in any dimension, even if that page collected only 10 hits in a month. Old clickstream data (older than a month) is kept in backup files and can be imported back into Redshift at any time.

Collection of user activity data

In order to be able to work simultaneously with page view data and user actions (upvotes for blog posts, comments, managing teams in fantasy tournaments, etc.), we import some of the user data from the site database into Redshift. The full texts of posts and news, as well as personal data (emails, names, logins, passwords), are not transferred to Redshift. User activity data is aggregated by day, week and month, so that engagement reports can be built quickly. For example, this is how you can track by day the number of new active users (those who upvoted, commented, started a team in a fantasy tournament or performed any other action on the site), or pull a list of active fans of the Lesotho team from Saratov for an email campaign: visitors who came at least twice in the last week and have been visiting the site for more than three weeks in a row (both queries are sketched below):
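Sketches of both queries, with made-up table and column names:

    # conn: the psycopg2 connection from the sketch above
    with conn.cursor() as cur:
        # New active users per day: users whose first action falls on that day
        cur.execute("""
            select first_action_date, count(*) as new_active_users
            from (select user_id, min(activity_date) as first_action_date
                  from user_activity_daily
                  group by user_id) t
            group by first_action_date
            order by first_action_date
        """)

        # Active Lesotho fans from Saratov; the "more than three weeks in a row"
        # condition is simplified here to "more than three distinct weeks"
        cur.execute("""
            select user_id
            from user_activity_weekly
            where tag = 'Lesotho' and city = 'Saratov'
            group by user_id
            having sum(case when week >= dateadd(week, -1, getdate())
                            then visits else 0 end) >= 2
               and count(distinct week) > 3
        """)
        fans = [row[0] for row in cur.fetchall()]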
As a result, we can observe user activity in different dimensions.

Integration with social network APIs

Once a day we collect data on our subscribers on Facebook, Twitter and VKontakte and, like all the other statistics, put it into Redshift. To get the number of subscribers of Facebook groups, the Graph API is used; the VKontakte API also lets you fetch group data, via the groups.getById method:
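Both APIs are plain HTTP. A sketch with hypothetical group identifiers; at the time, public subscriber counts could be read without an access token:

    import requests

    # Facebook Graph API: public info of a page/group (2014-era endpoint)
    fb = requests.get('https://graph.facebook.com/sportsru').json()
    fb_followers = fb.get('likes')

    # VKontakte API: groups.getById returns members_count when asked for it
    vk = requests.get('https://api.vk.com/method/groups.getById',
                      params={'group_id': 'sportsru',
                              'fields': 'members_count'}).json()
    vk_followers = vk['response'][0]['members_count']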


After updating its API, Twitter began to require authorization and tightened the rate limits, so collecting data on our five hundred or so accounts began to take too long; we now use several accounts and several sets of access token keys to collect data about our Twitter streams. To work with the Twitter API, we use Tweepy.
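On the Tweepy side, the token pool can be rotated round-robin; the keys and the account list below are placeholders:

    import itertools
    import tweepy

    # Several apps and tokens to spread the load under Twitter's rate limits
    credentials = [
        ('CONSUMER_KEY_1', 'CONSUMER_SECRET_1', 'TOKEN_1', 'TOKEN_SECRET_1'),
        ('CONSUMER_KEY_2', 'CONSUMER_SECRET_2', 'TOKEN_2', 'TOKEN_SECRET_2'),
    ]
    apis = []
    for consumer_key, consumer_secret, token, secret in credentials:
        auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(token, secret)
        apis.append(tweepy.API(auth))
    api_pool = itertools.cycle(apis)  # round-robin over the token pool

    for screen_name in ['cska_official', 'fc_zenit']:  # ~500 streams in reality
        followers = next(api_pool).get_user(screen_name).followers_count
        print(screen_name, followers)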

Having done all the imports and transformations, we get a table with the day-by-day dynamics of the number of subscribers:

    Date        Social network ID  Tag                        Followers
    2014-03-26  twitter            CSKA                       10315
    2014-03-26  vk                 Premier League (England)   133427

An interactive panel in Chart.io is built on top of this data, which lets us keep an eye on all our streams on social networks.

Data visualization

In order not to waste time developing a graphical interface, we decided to use a third-party service for drawing charts and tables, Chart.io. It has ready-made connectors for Redshift, Google Analytics and other data sources, and supports caching query results on its side. The data for a chart can be requested with an SQL query or through a convenient interface where you can pick metrics, dimensions and filters.

To connect Chart.io to the database, it is enough to specify the host, port, login and password for the Redshift connection, and in the Amazon Web Services settings to allow access to the data from the Chart.io servers.

To build a chart with the number of comments broken down by the type of commented content and its dynamics by day, it is enough to specify the fields to group the data by and the sampling conditions. Simple mathematical functions can be applied to the selected data. After that, you only need to choose the chart type, and the interactive panel is ready to use.

This is how, actually, we covered a wide range of analytical data with all sorts of charts and tables. Having spent less than 3 months on development, we became able to run arbitrary queries over data from all the sources we need, without being limited to the standard Google Analytics dashboards.

What else?


In addition to the basic analytical tools for the company's employees, we have started building data services for our users. For example, we estimate, with a certain accuracy, a visitor's interest in a particular club, athlete or tournament, and use this knowledge to build a personalized site menu with links to frequently visited pages. We use information about the tags of the materials the visitor reads, and sometimes enrich this knowledge with other signals (a referral from a club's fan site, or from a thematically targeted contextual ad).
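A toy version of such an interest score, just to show the idea; the decay constant and the signal weights are invented for the example:

    from collections import defaultdict

    def interest_scores(viewed_materials, extra_signals=()):
        # viewed_materials: iterable of (tags, days_ago), one per page view
        # extra_signals:    iterable of (tag, weight), e.g. a referral from
        #                   a club fan site or a thematic contextual ad
        scores = defaultdict(float)
        for tags, days_ago in viewed_materials:
            for tag in tags:
                scores[tag] += 0.9 ** days_ago  # older views weigh less
        for tag, weight in extra_signals:
            scores[tag] += weight
        return sorted(scores.items(), key=lambda kv: -kv[1])

    history = [(['CSKA', 'Premier League (England)'], 0), (['CSKA'], 3)]
    print(interest_scores(history, [('CSKA', 2.0)]))
    # CSKA comes out on top and drives the personalized menu entry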


One could talk endlessly about our data processing infrastructure. We will continue the conversation in detail in our session at RIT++, where we will cover the architecture and the cost of implementation and ownership, discuss user segmentation for mailing lists, real-time content recommendations and business metrics monitoring, as well as hunting for bots among users and building client profiles.
