Data Collection and Preprocessing

Data Collection Tools

  • Flume: log collection
  • Logstash: multi-source data collection and transformation

Data Cleaning and Transformation

# Logstash configuration example
input { file { path => "/var/log/syslog" } }
filter { grok { match => { "message" => "%{SYSLOGBASE}" } } }
output { elasticsearch { hosts => ["localhost:9200"] } }
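To make the grok step more concrete, here is a rough Python sketch of the kind of field extraction that %{SYSLOGBASE} performs on each line. The regular expression is a simplified stand-in written for illustration, not the actual grok definition: it assumes the classic "Mmm dd HH:MM:SS host program[pid]:" layout and ignores the optional facility field. The name parse_syslog_line is hypothetical.

```python
import re

# Simplified stand-in for grok's SYSLOGBASE (assumption: no facility field,
# classic "Mmm dd HH:MM:SS host program[pid]:" header layout).
SYSLOG_BASE = re.compile(
    r"^(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<logsource>\S+) "
    r"(?P<program>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?:"
)


def parse_syslog_line(line):
    """Return the header fields as a dict, or None if the line does not match."""
    match = SYSLOG_BASE.match(line)
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    sample = "Jan  5 14:03:22 web01 sshd[1234]: Accepted publickey for deploy"
    print(parse_syslog_line(sample))
```

In Logstash itself, a line that fails to match is not dropped; the grok filter tags the event with _grokparsefailure by default, whereas this sketch simply returns None.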

ETL Process

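# PySpark data cleaning (assumes `df` is an existing DataFrame with an 'age' column)
from pyspark.sql import functions as F

# Cast 'age' to integer
df = df.withColumn('age', F.col('age').cast('int'))
# Drop every row that contains a null in any column
df = df.dropna()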
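The interaction between the cast and dropna() is easy to miss: assuming Spark's lenient (non-ANSI) cast behaviour, a value that cannot be parsed as an integer becomes null, and dropna() then removes the whole row. The plain-Python sketch below is only a rough analogue of that behaviour on a list of dicts, not PySpark code; to_int_or_none and clean_records are hypothetical helper names.

```python
def to_int_or_none(value):
    """Lenient integer conversion: unparseable input becomes None."""
    try:
        return int(str(value).strip())
    except (TypeError, ValueError):
        return None


def clean_records(records):
    """Convert 'age' to int, then drop any record that contains a None value."""
    cleaned = []
    for record in records:
        row = dict(record)  # copy so the input is not mutated
        row["age"] = to_int_or_none(row.get("age"))
        if all(value is not None for value in row.values()):
            cleaned.append(row)
    return cleaned


if __name__ == "__main__":
    rows = [
        {"name": "a", "age": "30"},
        {"name": "b", "age": "abc"},  # fails the conversion, so the row is dropped
        {"name": None, "age": "25"},  # missing name, so the row is dropped
    ]
    print(clean_records(rows))
```

Because dropna() with no arguments considers every column, a bad value in 'age' costs the entire row. If only some columns are required, dropna(subset=[...]) narrows the check.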