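Data Collection and Preprocessing
Data Collection Tools
- Flume: log collection
- Logstash: multi-source data collection and transformation
Data Cleaning and Transformation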
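# Logstash configuration example
# Read the local syslog file
input {
  file { path => "/var/log/syslog" }
}
# Parse the standard syslog prefix out of each line
filter {
  grok { match => { "message" => "%{SYSLOGBASE}" } }
}
# Send the parsed events to a local Elasticsearch node
output {
  elasticsearch { hosts => ["localhost:9200"] }
}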
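
To make the grok step more concrete, the sketch below approximates in plain Python what a syslog prefix pattern pulls out of a line. The regular expression is a simplified stand-in and not the real definition of `SYSLOGBASE`; the function name `parse_syslog_line`, the `content` field, and the sample log line are all hypothetical, and the other field names are assumed to resemble those of the legacy grok pattern.

```python
import re

# Simplified stand-in for a syslog prefix pattern (an approximation,
# NOT the actual grok SYSLOGBASE definition).
SYSLOG_RE = re.compile(
    r"^(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<logsource>\S+) "
    r"(?P<program>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<content>.*)$"
)


def parse_syslog_line(line):
    """Return a dict of extracted fields, or None if the line does not match."""
    match = SYSLOG_RE.match(line)
    if match is None:
        return None
    # Drop optional groups that did not take part in the match (e.g. no pid).
    return {key: value for key, value in match.groupdict().items() if value is not None}


# Hypothetical sample line, for illustration only.
sample = "Mar  3 14:05:01 web01 sshd[1234]: Accepted publickey for deploy"
event = parse_syslog_line(sample)
print(event)
```

Lines that do not match return `None`, which loosely mirrors how an unparsed event would need separate handling downstream.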
ETL Pipeline
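# PySpark data cleaning
# Assumes `df` is an existing DataFrame with an `age` column
from pyspark.sql import functions as F

# Cast the age column to integer
df = df.withColumn('age', F.col('age').cast('int'))
# Drop rows that contain a null in any column
df = df.dropna()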
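
The same two cleaning steps can be sketched without a Spark cluster. This is a plain-Python analogy and not PySpark itself: it assumes that a value which cannot be converted becomes a null and that the row is then dropped, which is how the cast behaves when Spark's ANSI mode is off, as far as I understand it. The helper names `to_int_or_none` and `clean_rows` and the sample records are hypothetical.

```python
def to_int_or_none(value):
    """Convert a value to int, returning None when conversion is not possible."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def clean_rows(rows):
    """Cast 'age' to int, then keep only rows with no None in any field."""
    cleaned = []
    for row in rows:
        converted = {**row, "age": to_int_or_none(row.get("age"))}
        if all(value is not None for value in converted.values()):
            cleaned.append(converted)
    return cleaned


# Hypothetical sample records, for illustration only.
raw = [
    {"name": "a", "age": "30"},
    {"name": "b", "age": "n/a"},   # age cannot be converted
    {"name": None, "age": "25"},   # missing name
    {"name": "d", "age": None},    # missing age
]
result = clean_rows(raw)
print(result)
```

Working on small in-memory records like this can be a convenient way to check cleaning rules before applying them to a full DataFrame.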