In the second column, clause NTILE(2) ensures that the values are divided into two buckets, represented by numbers 1 and 2. The second and third column show how NTILE function assigns numbers to all values so that each bucket has an equal number of rows. Let’s see an example for the following new dataset of six values: It divides the partition as equally as possible by assigning numbers 1.num_buckets to rows. The window function NTILE(num_buckets) enables us to compute quantiles. If a dataset becomes large enough, we could be interested in some statistical measures. Notice the last periods for each user – both of them did not use the phone for more than a week. The following table shows the output of the query. The outer SELECT puts a threshold on the period length and filters only periods longer than 2 days. The next two columns are just differences of the first two columns in INT and INTERVAL type. Period of inactivity ends when the following activity begins, therefore LEAD function extracts time of the next activity in the sequence ordered by date and time partitioned by a user. The COALESCE function ensures that for the internet usage, the duration is zero since the function returns 0 anytime duration field is null. The beginning of a period is computed as a sum of the starting date, time and duration of the call. To get starts and ends of the periods, we will take advantage of the LEAD function, which behaves as the LAG function, however, instead of the previous rows, it looks up the following ones. This scenario is also very useful for identification and segmentation of sessions in a clickstream data, also known as sessionization. We will consider internet usage as an instant activity with no duration and calls as an activity with the specified duration. Let’s say we want to see all periods of inactivity longer than 2 days for each user. A good example for our dataset would be periods of inactivity between calls. Lead function: Periods of inactivityĪnytime we need to calculate values based on neighbouring rows, window functions should be the first thing that come to mind. Some other numbers took less than 1 percent of total call time. We can discover from the percentage column for example, that the first user spent almost half of his calling time by calling number 49508432274. Window functions produce an output row for each call, but similar to the previous example all rows with the same phone number contain the same aggregated information. The WHERE clause collects only rows with outgoing calls since we do not know the phone number of incoming calls.If the condition is false, there are only calls of zero length for the phone number, hence we return 0 as the total duration in this case. The CASE clause checks if the total duration is greater than 0 so that we do not divide by zero.After the division, the result is multiplied by 100 and rounded to get the percentage. The type of duration is INT, which would cause the division to give integer result, and therefore we need to cast the denominator to FLOAT before the division. To compute the percentage, we use the previously mentioned expression in the numerator and sum over all calls in the denominator.The same is applied to INTERVAL and INT type. The total duration over user and phone is computed by the window function SUM and partitioned by the mentioned fields.Let’s understand the query starting from the inner core section: Let’s also print the percentage of this sum relative to all user’s calls. In other words, we are interested in the total sum of call durations for the particular counterpart. Let’s find out who receives the longest calls. SUM function: Total duration over phone numbers Also note that to convert the number of seconds in INT to INTERVAL, we concatenate the number with ‘ secs’string and then cast to INTERVAL with a double colon. This expression is valid only for Redshift and equivalent in Postgres is DATE_TRUNC(‘day’, date_time). To extract a date from a timestamp, we use TRUNC function in the query. Using the traditional GROUP BY aggregation would require self-joining tables. The aggregate function MAX together with the partitioning by phone clause computes the longest call made by a user and puts this against all user’s calls so we have this repeating extra column available for various comparisons. See the comparison with the longest_today column. The duration time in bold marks the calls that were longest on a particular day. Notice that they share the same number_longest time. Rows with the same phone number are in the same colour.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |