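Data Collection and Preprocessing
Data Collection Tools
- Flume: log collection
- Logstash: multi-source data collection and transformation
Data Cleaning and Transformation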
# Logstash configuration example
input { file { path => "/var/log/syslog" } }
filter { grok { match => { "message" => "%{SYSLOGBASE}" } } }
output { elasticsearch { hosts => ["localhost:9200"] } }
ETL Process
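# PySpark data cleaning
# Assumes `df` is an existing Spark DataFrame with an 'age' column.
from pyspark.sql import functions as F
# Cast 'age' to an integer column.
df = df.withColumn('age', F.col('age').cast('int'))
# Drop every row that contains at least one null value.
df = df.dropna()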
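
The two cleaning steps above (cast a column to an integer type, then drop incomplete rows) can be tried out without a Spark cluster. The following is a minimal plain-Python sketch of the same intent, not a reproduction of Spark's exact casting rules. The helper names `to_int` and `clean_rows` and the sample records are hypothetical, introduced only for illustration.

```python
from typing import Any, Optional


def to_int(value: Any) -> Optional[int]:
    """Return value as an int, or None when it cannot be read as one."""
    if value is None:
        return None
    try:
        return int(str(value).strip())
    except ValueError:
        return None


def clean_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Cast 'age' to int, then drop every row that has any missing field."""
    # Step 1: replace 'age' with its integer form (None if unreadable).
    casted = [{**row, "age": to_int(row.get("age"))} for row in rows]
    # Step 2: keep only rows in which no field is None.
    return [row for row in casted if all(v is not None for v in row.values())]


# Hypothetical sample records, invented for illustration only.
sample = [
    {"name": "a", "age": "31"},
    {"name": "b", "age": "unknown"},
    {"name": None, "age": "27"},
    {"name": "d", "age": None},
]
cleaned = clean_rows(sample)
print(cleaned)
```

Only the first record survives here: the others either have an age that cannot be read as an integer or already contain a missing field. Dropping rows this aggressively is simple but can discard a lot of data, so it may be worth counting the removed rows before adopting it in a real pipeline.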