How the Sports.ru and Tribuna.com data processing infrastructure works / Habr
In the first post about the Sports.ru and Tribuna.com analytics system, we talked about how we use our infrastructure in everyday life: we feed the recommendation system with content, watch business metrics, look for diamonds among user-generated content, find answers to the questions "how does it work?", "how could it work better?" and "why?", segment users for email campaigns, and build beautiful reports on the company's activities. We modestly hid the entire technical part of the story behind this diagram:
Readers rightfully demanded that we continue the story with funny cats, and Oleg Bunin invited us to talk at RIT++ about everything that was hidden there. Well, we will lay out some of the technical details in this continuation of that fun post.
Storage
As Sports.ru evolved from a news site into a full-fledged social network, new data analysis services (Similarweb, App Annie, Flurry, etc.) were added to the usual analytical tools such as Google Analytics and Yandex.Metrica. In order not to be torn between a dozen different applications, and to have a common coordinate system for measuring metrics, we decided to collect all the data on the traffic of all our sites, mobile applications and social feeds in one storage.
We needed a single service for working with heterogeneous data that would let journalists monitor the popularity of texts and authors, the social media editors react quickly to emerging trends, and product managers evaluate user engagement with our services and get the results of experiments when launching new functionality. We decided to store the collected data in a form that is easy to work with and convenient for a wide variety of consumers: both people and other systems.
What did we need to count? Site and mobile app traffic, the activity of registered users (comments, ratings, posts, subscriptions, registrations, etc.), the number of subscribers to our feeds on social networks, and installs of our mobile applications. Since monthly traffic was approaching 250 million hits and kept growing, we decided not to use a conventional relational DBMS, but to choose a technology that could easily scale as data volume grows. We also wanted the simplest and most convenient access to the data, for example in the form of SQL queries. As a result, we settled on Redshift, a SaaS storage launched in the Amazon Web Services ecosystem in early 2013.
Redshift is a distributed columnar DBMS with a few peculiarities, such as the absence of integrity constraints like foreign keys or unique field values.
Columnar DBMSs store records in such a way that queries with groupings and aggregate calculations over a large number of rows run as fast as possible. On the other hand, selecting all fields of a single record by key (`select * from table where ...`), as well as modifying or inserting a single record, can take longer than in a conventional database. You can read more about how columnar DBMSs are organized in this post.
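As a toy illustration of this trade-off (not Redshift's actual internals), here is a tiny table in a column-oriented layout: an aggregate touches a single column's array, while fetching one full record by key touches every column.

```python
# Toy illustration of columnar storage (not Redshift internals):
# one contiguous list per column instead of one dict per row.
rows = [
    {"url": "/", "pageviews": 3},
    {"url": "/football/", "pageviews": 5},
    {"url": "/cska/", "pageviews": 2},
]

# Columnar layout built from the same data.
columns = {
    "url": [r["url"] for r in rows],
    "pageviews": [r["pageviews"] for r in rows],
}

# An aggregate (SUM over all rows) reads only one column...
total = sum(columns["pageviews"])

# ...while reassembling a single record by key touches every column.
i = columns["url"].index("/cska/")
record = {name: col[i] for name, col in columns.items()}

print(total)   # 10
print(record)  # {'url': '/cska/', 'pageviews': 2}
```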
Redshift won us over with its ease of data access (you can work with it like regular PostgreSQL), ease of scaling (as the volume of data grows, new servers are added in a couple of clicks and all data is redistributed automatically) and low cost of implementation (compared, for example, with an Apache Hadoop + Hive stack). At the very beginning, by the way, we thought about running our own Hadoop cluster, but abandoned the idea, thereby limiting ourselves: we gave up the ability to build personalized real-time recommendations on top of our storage, to use Apache Mahout, and in general to run distributed algorithms that cannot be expressed in SQL.

Collecting traffic data
To store page views and events, we use the client part of the Piwik open-source counter, which consists of a JavaScript tracker and a PHP/MySQL backend; we threw the backend out as unnecessary. The counter asynchronously sends requests to servers running Nginx, where they are written to standard logs. The counter code looks like this:

Requests from the counter contain all the necessary data, such as the page address, User-Agent, Referer, referral source, date of the previous visit, unique visitor ID from the cookie, and so on. Each request is also accompanied by a user ID (if known), by which we can join browsing data with the user activity data accumulated in the site database (comments, posts, pluses and minuses). In addition, Nginx adds information about the visitor's geographic location to the logs (MaxMind + ngx_http_geoip_module).
For writing to Amazon Redshift, the data is converted to a CSV-like format, sent to Amazon S3, and optionally split into several parts for fast loading. The log is parsed by a simple script that selects the necessary fields in one pass (see above) and makes simple transformations, such as translating the User-Agent into a Browser/OS pair.

Since each request from the counter contains a unique visitor ID and the time of the last visit to the site, there is no need to compute user sessions at the data preparation stage. Moreover, pageview data from the same session can end up on different counter servers, so hits are glued into sessions inside Redshift. Thus, we collect basic metrics in our storage, such as the volume of impressions (inventory) or the number of unique users (audience) in any dimension: site section, traffic source, browser, etc.
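The parsing script itself is not included in the post; below is a hypothetical one-pass parser in the same spirit. It assumes a simplified tab-separated Nginx log line and Piwik-style query parameters (`url`, `_id`, `urlref`) — the real log layout and field names are not shown in the original.

```python
import csv
import io
from urllib.parse import urlparse, parse_qs

def ua_to_browser_os(user_agent):
    """Very rough User-Agent -> (browser, os) mapping, for illustration only."""
    browser = "Chrome" if "Chrome" in user_agent else "Other"
    os_name = "Windows" if "Windows" in user_agent else "Other"
    return browser, os_name

def parse_line(line):
    # Assumed log layout: ip \t timestamp \t request_uri \t user_agent
    ip, ts, request_uri, user_agent = line.rstrip("\n").split("\t")
    params = parse_qs(urlparse(request_uri).query)
    browser, os_name = ua_to_browser_os(user_agent)
    return [
        ts,
        params.get("_id", [""])[0],     # visitor ID from the cookie
        params.get("url", [""])[0],     # page address
        params.get("urlref", [""])[0],  # referrer
        browser,
        os_name,
    ]

line = ("1.2.3.4\t2014-03-25 17:42:00\t"
        "/piwik.php?url=/football/&_id=108e36856acb1384&urlref=/"
        "\tMozilla/5.0 (Windows NT 6.1) Chrome/33.0")

# Write the parsed fields as one pipe-delimited CSV row, ready for S3/COPY.
out = io.StringIO()
csv.writer(out, delimiter="|").writerow(parse_line(line))
print(out.getvalue().strip())
```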
[Example of raw clickstream records with columns: timestamp, id, url, new_visitor, visits_count, last_visit_ts]

In most cases, when analyzing traffic, there is no need to find out something about a particular visitor or one of his sessions; what is required are totals or averages, such as the number of unique visitors over a certain period or the average number of views per session. Therefore, it makes no sense to analyze the raw clickstream directly: it takes up a lot of space, and queries against it are not executed very quickly.
We pack the collected data into aggregates that are convenient for further analysis. For example, we collapse an entire user session, which usually consists of several hits, into one record that reflects the main information about the visit: its duration, the number of views in the session, the traffic source, and other details:
[Example of session records with columns: start, end, id, pageviews, last_visit_ts]
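The hits-to-session packing described above is actually done by an SQL query inside Redshift; here is a rough Python sketch of the same logic, with an assumed 30-minute session timeout and illustrative field names.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed session timeout

def pack_sessions(hits):
    """hits: list of (visitor_id, timestamp, url), pre-sorted by id, then time.

    Collapses consecutive hits of one visitor into single session records.
    """
    sessions = []
    for visitor_id, ts, url in hits:
        last = sessions[-1] if sessions else None
        if (last and last["id"] == visitor_id
                and ts - last["end"] <= SESSION_GAP):
            # Same visitor within the gap: extend the current session.
            last["end"] = ts
            last["pageviews"] += 1
        else:
            # New visitor or gap exceeded: start a new session record.
            sessions.append({"id": visitor_id, "start": ts, "end": ts,
                             "entry_url": url, "pageviews": 1})
    return sessions

hits = [
    ("108e36856acb1384", datetime(2014, 3, 25, 17, 42), "/"),
    ("108e36856acb1384", datetime(2014, 3, 25, 17, 43), "/football/"),
    ("108e36856acb1385", datetime(2014, 3, 25, 23, 42), "/cska/"),
]
for s in pack_sessions(hits):
    print(s["id"], s["pageviews"])
```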
From this data, aggregates of the next level can be built:
[Example of daily per-user aggregates with columns: date, id, pageviews, sessions]
Working with the final aggregates is much faster than with the raw data: to get inventory numbers, it is enough to sum the views over all sessions, and to estimate the audience, to count the number of unique visitor IDs in a data array that is about 20 times more compact than the raw clickstream. To measure reach in a particular section of the site, we create a separate profile (in GA terminology) in the form of a table and precompute into that profile only the session records that meet the sampling conditions:
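The actual profile query is not shown in the post; below is a minimal sketch of the idea, with an invented sampling condition (sessions in the /football/ section) and illustrative field names, showing how inventory and audience then fall out of the precomputed profile.

```python
# Illustrative session aggregate; not the post's actual schema.
sessions = [
    {"id": "108e36856acb1384", "section": "/football/", "pageviews": 2},
    {"id": "108e36856acb1384", "section": "/football/", "pageviews": 4},
    {"id": "108e36856acb1385", "section": "/cska/",     "pageviews": 1},
]

# The "profile" keeps only sessions matching the sampling condition.
profile = [s for s in sessions if s["section"] == "/football/"]

inventory = sum(s["pageviews"] for s in profile)  # total impressions
audience = len({s["id"] for s in profile})        # unique visitor IDs

print(inventory, audience)  # 6 1
```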
Grouping clickstream hits into sessions is performed by an SQL query once a day and takes a few seconds. In addition to aggregating data by visit, we also compute user activity by day, week and month, so that such data is instantly available for the relevant reports. We keep the clickstream for the last month in Redshift so that we can run arbitrary queries over any slices, and always have fresh data for the last hour.
We do not sample raw data: we collect information about each page impression, so we always have the opportunity to calculate statistics for any page in any dimension, even if such a page has collected only 10 hits in a month. We store old clickstream data (older than a month) in backup files and can import them into Redshift at any time.
Collecting user activity data

In order to work simultaneously with page view data and user actions (pluses for blog posts, comments, managing teams in fantasy tournaments, etc.), we import some user data from the site database into Redshift. At the same time, the full texts of posts and news, as well as personal data (emails, names, logins, passwords), are not transferred to Redshift.

User activity data is aggregated by day, week and month so that engagement reports can be built quickly. For example, this is how you can track by day the number of new active users, i.e. those who added posts, commented, started teams in fantasy tournaments, or performed any other actions on the site:

To get a list — for an email campaign — of active fans of the Lesotho team from Saratov who visited us at least twice in the last week and have visited the site for more than three weeks in a row, you can run the following query:
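The actual SQL of that query is not preserved in this copy of the post; below is a hypothetical Python sketch of the same segmentation logic over session aggregates. All field names and sample data are illustrative.

```python
from datetime import date, timedelta

today = date(2014, 3, 26)
week_ago = today - timedelta(days=7)

# (visitor id, session start date) pairs from the session aggregate,
# assumed to be already filtered to fans of the team and city of interest.
sessions = [
    ("u1", date(2014, 3, 3)), ("u1", date(2014, 3, 12)),
    ("u1", date(2014, 3, 17)), ("u1", date(2014, 3, 21)),
    ("u1", date(2014, 3, 24)),
    ("u2", date(2014, 3, 25)),
]

def consecutive_weeks(dates):
    """Consecutive ISO weeks with at least one visit, counting back from today."""
    weeks = {d.isocalendar()[:2] for d in dates}
    year, week = today.isocalendar()[:2]
    n = 0
    while (year, week) in weeks:  # simplification: ignores year boundaries
        n += 1
        week -= 1
    return n

by_user = {}
for uid, d in sessions:
    by_user.setdefault(uid, []).append(d)

# At least two visits in the last week AND more than three weeks in a row.
audience = [uid for uid, dates in by_user.items()
            if sum(d >= week_ago for d in dates) >= 2
            and consecutive_weeks(dates) > 3]
print(audience)  # ['u1']
```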
As a result, we can observe user activity in different dimensions:

Integration with social network APIs
We collect data on our subscribers on Facebook, Twitter and VKontakte once a day and, like all other statistics, put it into Redshift. To get the number of subscribers of Facebook groups, we use the Graph API; the VKontakte API also provides group data, via the groups.getById method:
After updating its API, Twitter began to require authorization and tightened its rate limits, so collecting data on our roughly five hundred accounts started to take too long; because of this, we use several accounts and access token keys for our Twitter feeds. To work with the Twitter API, we use Tweepy.
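The actual polling code is not shown in the post; as a small sketch, here is how follower counts could be extracted from a VK `groups.getById` response requested with `fields=members_count` (the payload below is canned example data, not real numbers).

```python
import json

# Canned example of a VK groups.getById response with fields=members_count.
payload = json.loads("""
{"response": [
  {"id": 1, "screen_name": "cska", "members_count": 10000},
  {"id": 2, "screen_name": "epl", "members_count": 25000}
]}
""")

# Map each group's screen name to its subscriber count.
followers = {g["screen_name"]: g["members_count"]
             for g in payload["response"]}
print(followers)  # {'cska': 10000, 'epl': 25000}
```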
After all the imports and transformations, we get a table with the day-by-day dynamics of subscriber counts:
[Example of the followers table with columns: Date, Social Network ID, Tag, Followers; e.g. rows for the CSKA and Premier League (England) tags on VK]
An interactive dashboard in Chart.io is built on top of this data, allowing us to monitor all our feeds on social networks: