您的小型存储库中有更多站点统计信息

通过分析该站点的统计信息,我们可以了解该站点所发生的情况。 我们将结果与有关产品或服务的其他知识进行比较,从而改善我们的体验。

当对第一个结果的分析完成时,信息已被理解并得出结论,下一阶段开始。 想法浮出水面:如果从另一个角度看数据会怎样?

此时,分析工具存在局限性。 这就是Google Analytics(分析)工具对我来说还不够的原因之一,即由于查看和操作数据的能力有限。

我一直想快速加载基本数据(主数据),添加另一级别的聚合或以其他方式解释现有值。

在基于access.log文件的小型存储库中,这很容易做到,而SQL语言就足够了。

那么,我想找到什么问题的答案?

网站上的更改内容和时间


始终关注基础数据(主数据)的更改历史。

图片

SQL报告请求
SELECT 1 as 'SideStackedBar: Content Updates by Months', strftime('%m/%Y', datetime(UPDATE_DT, 'unixepoch')) AS 'Day', COUNT(CASE WHEN PAGE_TITLE != 'na' THEN DIM_REQUEST_ID END) AS 'Web page updates', COUNT(CASE WHEN PAGE_DESCR = 'IMAGES' THEN DIM_REQUEST_ID END) AS 'Image uploads', COUNT(CASE WHEN PAGE_DESCR = 'VIDEO' THEN DIM_REQUEST_ID END) AS 'Video uploads', COUNT(CASE WHEN PAGE_DESCR = 'AUDIO' THEN DIM_REQUEST_ID END) AS 'Audio uploads' FROM DIM_REQUEST WHERE PAGE_TITLE != 'na' OR PAGE_DESCR != 'na' GROUP BY strftime('%m/%Y', datetime(UPDATE_DT, 'unixepoch')) ORDER BY UPDATE_DT 


例如,在某个时候,已经进行了搜索引擎优化或向网站添加了新内容,在这方面,预计流量会增加。

用户群


组的最简单示例是用户代理或操作系统名称。

用户代理维度已累积了大约一千条记录,我很想了解该组中代理分布的动态。

图片

SQL报告请求
 SELECT 1 AS 'SideStackedBar: User Agents', AGENT_OS AS 'OS', SUM(CASE WHEN AGENT_BOT = 'na' THEN 1 ELSE 0 END ) AS 'User Agent of Users', SUM(CASE WHEN AGENT_BOT != 'na' THEN 1 ELSE 0 END ) AS 'User Agent of Bots' FROM DIM_USER_AGENT WHERE DIM_USER_AGENT_ID != -1 GROUP BY AGENT_OS ORDER BY 3 DESC 


代理的各种组合中的大多数都来自Windows世界。 其中不确定的因素包括WhatsApp,PocketImageCache,PlayStation,SmartTV等。

每周用户组活动


通过组合一些组,我们可以观察其活动的分布。

例如,Linux群集的用户在站点上消耗的流量比其他任何人都多。

图片

SQL报告请求
 SELECT 1 as 'StackedBar: Traffic Volume by User OS and by Week', strftime('%W week', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Week', SUM(CASE WHEN USG.AGENT_OS IN ('Android', 'Linux') THEN FCT.BYTES ELSE 0 END)/1000 AS 'Android/Linux Users', SUM(CASE WHEN USG.AGENT_OS IN ('Windows') THEN FCT.BYTES ELSE 0 END)/1000 AS 'Windows Users', SUM(CASE WHEN USG.AGENT_OS IN ('Macintosh', 'iOS') THEN FCT.BYTES ELSE 0 END)/1000 AS 'Mac/iOS Users', SUM(CASE WHEN USG.AGENT_OS IN ('na', 'BlackBerry') THEN FCT.BYTES ELSE 0 END)/1000 AS 'Other' FROM FCT_ACCESS_USER_AGENT_DD FCT, DIM_USER_AGENT USG, DIM_HTTP_STATUS HST WHERE FCT.DIM_USER_AGENT_ID=USG.DIM_USER_AGENT_ID AND FCT.DIM_HTTP_STATUS_ID = HST.DIM_HTTP_STATUS_ID AND USG.AGENT_BOT = 'na' /* users only */ AND HST.STATUS_GROUP IN ('Successful') /* good pages */ AND datetime(FCT.EVENT_DT, 'unixepoch') > date('now', '-3 month') GROUP BY strftime('%W week', datetime(FCT.EVENT_DT, 'unixepoch')) ORDER BY FCT.EVENT_DT 


交通繁忙


该表显示了最活跃的用户组及其活动的日期。
最活跃的属于Linux集群。

图片

SQL报告请求
 SELECT 1 AS 'Table: User Agent with Havy Usage', strftime('%d.%m.%Y', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Day', ROUND(1.0*SUM(FCT.BYTES)/1000000, 1) AS 'Traffic MB', ROUND(1.0*SUM(FCT.IP_CNT)/SUM(1), 1) AS 'IPs', ROUND(1.0*SUM(FCT.REQUEST_CNT)/SUM(1), 1) AS 'Requests', USA.DIM_USER_AGENT_ID AS 'ID', MAX(USA.USER_AGENT_NK) AS 'User Agent', MAX(USA.AGENT_BOT) AS 'Bot' FROM FCT_ACCESS_USER_AGENT_DD FCT, DIM_USER_AGENT USA WHERE FCT.DIM_USER_AGENT_ID = USA.DIM_USER_AGENT_ID AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-30 day') GROUP BY USA.DIM_USER_AGENT_ID, strftime('%d.%m.%Y', datetime(FCT.EVENT_DT, 'unixepoch')) ORDER BY SUM(FCT.BYTES) DESC, FCT.EVENT_DT LIMIT 10 


使用属性日和座席ID,您可以快速查找和跟踪各个用户组的日期统计信息。 如有必要,您可以在阶段表中快速找到详细信息。

如何获得信息?


通过集成其他数据源并引入新级别的聚合和分组,可以使access.log文件中的信息更加有效。

基本数据和实体


基本数据包括有关实体的信息:网页,图片,视频和音频内容(对于商店而言)是产品。

实体本身起着维度的作用,保存属性更改的过程称为历史化。 在数据库中,此过程通常以尺寸变化缓慢(SCD)的形式实现。

各种系统都可以作为基本数据的来源,因此几乎总是需要将它们集成在一起。

尺寸变化缓慢


DIM_REQUEST维度将以历史形式包含有关网站查询的信息。

SCD2表
 CREATE TABLE DIM_REQUEST ( /* scd table for user requests */ DIM_REQUEST_ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, DIM_REQUEST_ID_HIST INTEGER NOT NULL DEFAULT -1, REQUEST_NK TEXT NOT NULL DEFAULT 'na', /* request without ?parameters */ PAGE_TITLE TEXT NOT NULL DEFAULT 'na', PAGE_DESCR TEXT NOT NULL DEFAULT 'na', PAGE_KEYWORDS TEXT NOT NULL DEFAULT 'na', DELETE_FLAG INTEGER NOT NULL DEFAULT 0, UPDATE_DT INTEGER NOT NULL DEFAULT 0, UNIQUE (REQUEST_NK, DIM_REQUEST_ID_HIST) ); INSERT INTO DIM_REQUEST (DIM_REQUEST_ID) VALUES (-1); 


除此以外,创建一个视图,该视图始终显示最后状态的所有记录。 有必要加载测量本身。

图片

SCD2当前视图
 /* Content: actual view on scd table */ SELECT HI.DIM_REQUEST_ID, HI.DIM_REQUEST_ID_HIST, HI.REQUEST_NK, HI.PAGE_TITLE, HI.PAGE_DESCR, HI.PAGE_KEYWORDS, NK.CNT AS HIST_CNT, HI.DELETE_FLAG, strftime('%d.%m.%Y %H:%M', datetime(HI.UPDATE_DT, 'unixepoch')) AS UPDATE_DT FROM ( SELECT REQUEST_NK, MAX(DIM_REQUEST_ID) AS DIM_REQUEST_ID, SUM(1) AS CNT FROM DIM_REQUEST GROUP BY REQUEST_NK ) NK, DIM_REQUEST HI WHERE 1 = 1 AND NK.REQUEST_NK = HI.REQUEST_NK AND NK.DIM_REQUEST_ID = HI.DIM_REQUEST_ID; 


以及每个记录的历史信息收集位置的视图。 有必要在历史上建立与事实的正确联系。

图片

SCD2的历史视图
 /* Content: actual view on scd table */ SELECT SCD.DIM_REQUEST_ID, SCD.DIM_REQUEST_ID_HIST, SCD.REQUEST_NK, SCD.PAGE_TITLE, SCD.PAGE_DESCR, SCD.PAGE_KEYWORDS, SCD.DELETE_FLAG, CASE WHEN HIS.UPDATE_DT IS NULL THEN 1 ELSE 0 END ACTIVE_FLAG, SCD.DIM_REQUEST_ID_HIST AS ID_FROM, SCD.DIM_REQUEST_ID AS ID_TO, CASE WHEN SCD.DIM_REQUEST_ID_HIST=-1 THEN 3600 ELSE IFNULL(SCD.UPDATE_DT,3600) END AS TIME_FROM, CASE WHEN HIS.UPDATE_DT IS NULL THEN 253370764800 ELSE HIS.UPDATE_DT END AS TIME_TO, CASE WHEN SCD.DIM_REQUEST_ID_HIST=-1 THEN STRFTIME('%d.%m.%Y %H:%M', DATETIME(3600, 'unixepoch')) ELSE STRFTIME('%d.%m.%Y %H:%M', DATETIME(IFNULL(SCD.UPDATE_DT,3600), 'unixepoch')) END AS ACTIVE_FROM, CASE WHEN HIS.UPDATE_DT IS NULL THEN STRFTIME('%d.%m.%Y %H:%M', DATETIME(253370764800, 'unixepoch')) ELSE STRFTIME('%d.%m.%Y %H:%M', DATETIME(HIS.UPDATE_DT, 'unixepoch')) END AS ACTIVE_TO FROM DIM_REQUEST SCD LEFT OUTER JOIN DIM_REQUEST HIS ON SCD.REQUEST_NK = HIS.REQUEST_NK AND SCD.DIM_REQUEST_ID = HIS.DIM_REQUEST_ID_HIST; 


资料汇整


压缩(聚合)使您可以评估更高级别的数据,并检测在详细报告中不可见的异常和趋势。

例如,在具有请求状态代码DIM_HTTP_STATUS的维中,添加组:
状态/组
0xx /无
1xx /信息性
2xx /成功
3xx /重定向
4xx /客户端错误
5xx /服务器错误
用户代理维DIM_USER_AGENT将包含组的AGENT_OS和AGENT_BOT属性。 可以在ETL过程中填写它们:

下载DIM_USER_AGENT
 /* Propagate the user agent from access log */ INSERT INTO DIM_USER_AGENT (USER_AGENT_NK, AGENT_OS, AGENT_ENGINE, AGENT_DEVICE, AGENT_BOT, UPDATE_DT) WITH CLS AS ( SELECT BROWSER FROM STG_ACCESS_LOG WHERE LENGTH(BROWSER)>1 GROUP BY BROWSER ) SELECT CLS.BROWSER AS USER_AGENT_NK, CASE WHEN INSTR(CLS.BROWSER,'Macintosh')>0 THEN 'Macintosh' WHEN INSTR(CLS.BROWSER,'iPhone')>0 OR INSTR(CLS.BROWSER,'iPad')>0 OR INSTR(CLS.BROWSER,'iPod')>0 OR INSTR(CLS.BROWSER,'Apple TV')>0 OR INSTR(CLS.BROWSER,'Darwin')>0 THEN 'iOS' WHEN INSTR(CLS.BROWSER,'Android')>0 THEN 'Android' WHEN INSTR(CLS.BROWSER,'X11;')>0 OR INSTR(CLS.BROWSER,'Wayland;')>0 OR INSTR(CLS.BROWSER,'linux-gnu')>0 THEN 'Linux' WHEN INSTR(CLS.BROWSER,'BB10;')>0 OR INSTR(CLS.BROWSER,'BlackBerry')>0 THEN 'BlackBerry' WHEN INSTR(CLS.BROWSER,'Windows')>0 THEN 'Windows' ELSE 'na' END AS AGENT_OS, -- OS CASE WHEN INSTR(CLS.BROWSER,'AppleCoreMedia')>0 THEN 'AppleWebKit' WHEN INSTR(CLS.BROWSER,') ')>1 AND LENGTH(CLS.BROWSER)>INSTR(CLS.BROWSER,') ') THEN COALESCE(SUBSTR(CLS.BROWSER, INSTR(CLS.BROWSER,') ')+2, LENGTH(CLS.BROWSER) - INSTR(CLS.BROWSER,') ')-1), 'N/A') ELSE 'na' END AS AGENT_ENGINE, -- Engine CASE WHEN INSTR(CLS.BROWSER,'iPhone')>0 THEN 'iPhone' WHEN INSTR(CLS.BROWSER,'iPad')>0 THEN 'iPad' WHEN INSTR(CLS.BROWSER,'iPod')>0 THEN 'iPod' WHEN INSTR(CLS.BROWSER,'Apple TV')>0 THEN 'Apple TV' WHEN INSTR(CLS.BROWSER,'Android ')>0 AND INSTR(CLS.BROWSER,'Build')>0 THEN COALESCE(SUBSTR(CLS.BROWSER, INSTR(CLS.BROWSER,'Android '), INSTR(CLS.BROWSER,'Build')-INSTR(CLS.BROWSER,'Android ')), 'na') WHEN INSTR(CLS.BROWSER,'Android ')>0 AND INSTR(CLS.BROWSER,'MIUI')>0 THEN COALESCE(SUBSTR(CLS.BROWSER, INSTR(CLS.BROWSER,'Android '), INSTR(CLS.BROWSER,'MIUI')-INSTR(CLS.BROWSER,'Android ')), 'na') ELSE 'na' END AS AGENT_DEVICE, -- Device CASE WHEN INSTR(LOWER(CLS.BROWSER),'yandex.com')>0 THEN 'yandex' WHEN INSTR(LOWER(CLS.BROWSER),'googlebot')>0 THEN 'google' WHEN INSTR(LOWER(CLS.BROWSER),'bingbot')>0 THEN 'microsoft' WHEN INSTR(LOWER(CLS.BROWSER),'ahrefsbot')>0 THEN 'ahrefs' WHEN INSTR(LOWER(CLS.BROWSER),'jobboersebot')>0 OR INSTR(LOWER(CLS.BROWSER),'jobkicks')>0 THEN 'job.de' WHEN INSTR(LOWER(CLS.BROWSER),'mail.ru')>0 THEN 'mail.ru' WHEN INSTR(LOWER(CLS.BROWSER),'baiduspider')>0 THEN 'baidu' WHEN INSTR(LOWER(CLS.BROWSER),'mj12bot')>0 THEN 'majestic-12' WHEN INSTR(LOWER(CLS.BROWSER),'duckduckgo')>0 THEN 'duckduckgo' WHEN INSTR(LOWER(CLS.BROWSER),'bytespider')>0 THEN 'bytespider' WHEN INSTR(LOWER(CLS.BROWSER),'360spider')>0 THEN 'so.360.cn' WHEN INSTR(LOWER(CLS.BROWSER),'compatible')>0 OR INSTR(LOWER(CLS.BROWSER),'http')>0 OR INSTR(LOWER(CLS.BROWSER),'libwww')>0 OR INSTR(LOWER(CLS.BROWSER),'spider')>0 OR INSTR(LOWER(CLS.BROWSER),'java')>0 OR INSTR(LOWER(CLS.BROWSER),'python')>0 OR INSTR(LOWER(CLS.BROWSER),'robot')>0 OR INSTR(LOWER(CLS.BROWSER),'curl')>0 OR INSTR(LOWER(CLS.BROWSER),'wget')>0 THEN 'other' ELSE 'na' END AS AGENT_BOT, -- Bot STRFTIME('%s','now') AS UPDATE_DT FROM CLS LEFT OUTER JOIN DIM_USER_AGENT TRG ON CLS.BROWSER = TRG.USER_AGENT_NK WHERE TRG.DIM_USER_AGENT_ID IS NULL 


资料整合


它包括从操作系统到报表系统的数据传输组织。 为此,请创建一个结构类似于源的阶段表。

有关网页的信息以插入请求的形式从CMS备份进入阶段。

使用基本数据加载历史记录DIM_REQUEST表需要三个步骤:加载新键和属性,更新现有键和属性以及修复已删除的记录。

下载新的SCD2条目
 /* Load request table SCD from master data */ INSERT INTO DIM_REQUEST (DIM_REQUEST_ID_HIST, REQUEST_NK, PAGE_TITLE, PAGE_DESCR, PAGE_KEYWORDS, DELETE_FLAG, UPDATE_DT) WITH CLS AS ( -- prepare keys SELECT '/' || NAME AS REQUEST_NK, TITLE AS PAGE_TITLE, CASE WHEN DESCRIPTION = '' OR DESCRIPTION IS NULL THEN 'na' ELSE DESCRIPTION END AS PAGE_DESCR, CASE WHEN KEYWORDS = '' OR KEYWORDS IS NULL THEN 'na' ELSE KEYWORDS END AS PAGE_KEYWORDS FROM STG_CMS_MENU WHERE CONTENT_TYPE != 'folder' -- only web pages AND PAGE_TITLE != 'na' -- master data which make sense ) /* new records from stage: CLS */ SELECT -1 AS DIM_REQUEST_ID_HIST, CLS.REQUEST_NK, CLS.PAGE_TITLE, CLS.PAGE_DESCR, CLS.PAGE_KEYWORDS, 0 AS DELETE_FLAG, STRFTIME('%s','now') AS UPDATE_DT FROM CLS LEFT OUTER JOIN ( SELECT DIM_REQUEST_ID, REQUEST_NK, PAGE_TITLE, PAGE_DESCR, PAGE_KEYWORDS FROM DIM_REQUEST_V_ACT ) TRG ON CLS.REQUEST_NK = TRG.REQUEST_NK WHERE TRG.REQUEST_NK IS NULL -- no such record in data mart 


更新SCD2属性
 /* Load request table SCD from master data */ INSERT INTO DIM_REQUEST (DIM_REQUEST_ID_HIST, REQUEST_NK, PAGE_TITLE, PAGE_DESCR, PAGE_KEYWORDS, DELETE_FLAG, UPDATE_DT) WITH CLS AS ( -- prepare keys SELECT '/' || NAME AS REQUEST_NK, TITLE AS PAGE_TITLE, CASE WHEN DESCRIPTION = '' OR DESCRIPTION IS NULL THEN 'na' ELSE DESCRIPTION END AS PAGE_DESCR, CASE WHEN KEYWORDS = '' OR KEYWORDS IS NULL THEN 'na' ELSE KEYWORDS END AS PAGE_KEYWORDS FROM STG_CMS_MENU WHERE CONTENT_TYPE != 'folder' -- only web pages AND PAGE_TITLE != 'na' -- master data which make sense ) /* updated records from stage: CLS and build reference to history: HIST */ SELECT HIST.DIM_REQUEST_ID AS DIM_REQUEST_ID_HIST, HIST.REQUEST_NK, CLS.PAGE_TITLE, CLS.PAGE_DESCR, CLS.PAGE_KEYWORDS, 0 AS DELETE_FLAG, STRFTIME('%s','now') AS UPDATE_DT FROM CLS, DIM_REQUEST_V_ACT TRG, DIM_REQUEST HIST WHERE CLS.REQUEST_NK = TRG.REQUEST_NK AND TRG.DIM_REQUEST_ID = HIST.DIM_REQUEST_ID AND ( CLS.PAGE_TITLE != HIST.PAGE_TITLE /* changes only */ OR CLS.PAGE_DESCR != HIST.PAGE_DESCR OR CLS.PAGE_KEYWORDS != HIST.PAGE_KEYWORDS ) 


删除的SCD2条目
 /* Load request table SCD from master data */ INSERT INTO DIM_REQUEST (DIM_REQUEST_ID_HIST, REQUEST_NK, PAGE_TITLE, PAGE_DESCR, PAGE_KEYWORDS, DELETE_FLAG, UPDATE_DT) WITH CLS AS ( -- prepare keys SELECT '/' || NAME AS REQUEST_NK, TITLE AS PAGE_TITLE FROM STG_CMS_MENU WHERE CONTENT_TYPE != 'folder' -- only web pages AND PAGE_TITLE != 'na' -- master data which make sense ) /* deleted records in data mart: TRG */ SELECT TRG.DIM_REQUEST_ID AS DIM_REQUEST_ID_HIST, TRG.REQUEST_NK, TRG.PAGE_TITLE, TRG.PAGE_DESCR, TRG.PAGE_KEYWORDS, 1 AS DELETE_FLAG, STRFTIME('%s','now') AS UPDATE_DT FROM ( SELECT DIM_REQUEST_ID, REQUEST_NK, PAGE_TITLE, PAGE_DESCR, PAGE_KEYWORDS FROM DIM_REQUEST_V_ACT WHERE PAGE_TITLE != 'na' -- track master data only AND DELETE_FLAG = 0 -- not already deleted ) TRG LEFT OUTER JOIN CLS ON TRG.REQUEST_NK = CLS.REQUEST_NK WHERE CLS.REQUEST_NK IS NULL -- no such record in stage 


每个数据源都必须附带一个正式的描述,例如在readme.txt文件中:
正式/技术上的收件人:姓名,电子邮件地址
数据提供者正式/技术上:名称,电子邮件
数据源:文件路径,服务名称
数据访问信息:用户和密码

数据移动方案将在维护和更新过程中提供帮助,例如以文本形式:
移动文件。 来源:ftp.domain.net:/logs/access.log目标:/var/www/access.log
在舞台上读书。 目标:STG_ACCESS_LOG
下载和转换。 目标:FCT_ACCESS_REQUEST_REF_HH
下载和转换。 目标:FCT_ACCESS_USER_AGENT_DD
报告。 目标:/var/www/report.html

结论


因此,本文介绍了诸如基本数据集成和新级别聚合的引入之类的机制。 在建立数据仓库时需要它们,以获取更多知识并提高信息质量。

Source: https://habr.com/ru/post/zh-CN463493/


All Articles