FLAML-1-Introduction

FLAML is a lightweight Python library for efficient automation of machine learning and AI operations. It automates workflows built on large language models, machine learning models, and more, and optimizes their performance.

Key Features

  • FLAML enables building next-generation GPT-X applications based on multi-agent conversations with minimal effort. It simplifies the orchestration, automation, and optimization of complex GPT-X workflows, maximizing the performance of GPT-X models while compensating for their weaknesses.
  • For common machine learning tasks such as classification and regression, it quickly finds high-quality models for user-provided data, even with limited computational resources. It is easy to customize or extend.
  • It supports fast and economical automatic tuning, and can handle large search spaces with heterogeneous evaluation costs and complex constraints/guidance/early stopping.

Supplementary Notes

AutoGen enables next-generation large language model (LLM) applications through a multi-agent conversation framework: you describe the desired behavior in natural language and the agents carry it out automatically. (Not yet tried here.)

import os
import tempfile

from autogen import ConversableAgent  # pip install pyautogen

temp_dir = tempfile.gettempdir()

arithmetic_agent = ConversableAgent(
    name="arithmetic_agent",
    llm_config=False,
    human_input_mode="ALWAYS",
    # This agent always requires human input, to keep code execution safe.
    code_execution_config={"use_docker": False, "work_dir": temp_dir},
)

code_writer_agent = ConversableAgent(
    name="code_writer_agent",
    system_message="You are a code writer. You write Python scripts in Markdown code blocks.",
    llm_config={"config_list": [{"model": "gpt-4", "api_key": os.environ["OPENAI_API_KEY"]}]},
    human_input_mode="NEVER",
)

poetry_agent = ConversableAgent(
    name="poetry_agent",
    system_message="You are an AI poet.",
    llm_config={"config_list": [{"model": "gpt-4", "api_key": os.environ["OPENAI_API_KEY"]}]},
    human_input_mode="NEVER",
)
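
As a minimal sketch (assuming the standard autogen initiate_chat API; the message text is illustrative), a conversation between two of these agents could be started like this:

# The arithmetic agent asks the code writer for a script; because its
# human_input_mode is "ALWAYS", a human confirms before any code is executed.
chat_result = arithmetic_agent.initiate_chat(
    code_writer_agent,
    message="Write a Python script that prints the sum of the integers 1 to 100.",
)
print(chat_result.summary)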

Spark: a general-purpose big-data computing platform (used for parallel tuning below).

Usage

Installation

pip install flaml
# To run in a notebook, add the notebook extra (same for autogen)
pip install "flaml[notebook]"
# conda
conda install flaml -c conda-forge

Training

flaml.AutoML is the task-oriented AutoML class. It can be used as a scikit-learn style estimator with the standard fit and predict methods; X_train and y_train can be passed in numpy array or pandas DataFrame format. The pattern is as follows:

# Prepare training data
# ...
import pickle

from flaml import AutoML
from sklearn.linear_model import LogisticRegression

automl = AutoML()
settings = {
    "time_budget": 500,  # total time limit in seconds
    "estimator_list": ['lgbm', 'rf', 'catboost', 'xgboost', 'extra_tree', 'xgb_limitdepth'],
    "metric": 'roc_auc',  # candidates include: 'r2', 'rmse', 'mae', 'mse', 'accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'log_loss', 'mape', 'f1', 'ap', 'ndcg', 'micro_f1', 'macro_f1'
    "task": 'classification',  # task type
    "log_file_name": 'airlines_experiment.log',  # FLAML log file
    "seed": 7654321,  # random seed
    "eval_method": 'cv',
    "n_splits": 5,
    # "ensemble": True,
    "ensemble": {
        "final_estimator": LogisticRegression(),
        "passthrough": False,  # True (default) or False: whether to pass the original features to the stacker
        "best_individual_model": True,
    },
    "n_jobs": 4,
}
automl.fit(X_train, y_train, **settings)
# Save the model
with open("automl.pkl", "wb") as f:
    pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)

# At prediction time
with open("automl.pkl", "rb") as f:
    automl = pickle.load(f)
pred = automl.predict(X_test)

Predefined tasks (specified via task):

  • 'classification': classification with tabular data.
  • 'regression': regression with tabular data.
  • 'ts_forecast': time series forecasting.
  • 'ts_forecast_classification': time series forecasting for classification.
  • 'ts_forecast_panel': time series forecasting for panel datasets (multiple time series).
  • 'rank': learning to rank.
  • 'seq-classification': sequence classification.
  • 'seq-regression': sequence regression.
  • 'summarization': text summarization.
  • 'token-classification': token classification.
  • 'multichoice-classification': multichoice classification.
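
As an illustration of a non-default task, a time series forecast could be set up as below. This is a minimal sketch: the variable names and the horizon period=12 are assumptions, and X_train is expected to contain the timestamps.

from flaml import AutoML

automl = AutoML()
# period: the number of future points to forecast
automl.fit(X_train, y_train, task="ts_forecast", period=12, time_budget=60)
pred = automl.predict(X_test)  # X_test holds the future timestamps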

Built-in metrics (specified via metric):

  • 'accuracy': 1 - accuracy as the corresponding metric to minimize.
  • 'log_loss': default metric for multiclass classification.
  • 'r2': 1 - r2_score as the corresponding metric to minimize. Default metric for regression.
  • 'rmse': root mean squared error.
  • 'mse': mean squared error.
  • 'mae': mean absolute error.
  • 'mape': mean absolute percentage error.
  • 'roc_auc': minimize 1 - roc_auc_score. Default metric for binary classification.
  • 'roc_auc_ovr': minimize 1 - roc_auc_score with multi_class="ovr".
  • 'roc_auc_ovo': minimize 1 - roc_auc_score with multi_class="ovo".
  • 'roc_auc_weighted': minimize 1 - roc_auc_score with average="weighted".
  • 'roc_auc_ovr_weighted': minimize 1 - roc_auc_score with multi_class="ovr" and average="weighted".
  • 'roc_auc_ovo_weighted': minimize 1 - roc_auc_score with multi_class="ovo" and average="weighted".
  • 'f1': minimize 1 - f1_score.
  • 'micro_f1': minimize 1 - f1_score with average="micro".
  • 'macro_f1': minimize 1 - f1_score with average="macro".
  • 'ap': minimize 1 - average_precision_score.
  • 'ndcg': minimize 1 - ndcg_score.
  • 'ndcg@k': minimize 1 - ndcg_score@k. k is an integer.

You can also define a custom metric (a loss to minimize), for example:

def custom_metric(
    X_val,
    y_val,
    estimator,
    labels,
    X_train,
    y_train,
    weight_val=None,
    weight_train=None,
    *args,
):
    from sklearn.metrics import log_loss
    import time

    # Measure the per-instance prediction time on the validation set
    start = time.time()
    y_pred = estimator.predict_proba(X_val)
    pred_time = (time.time() - start) / len(X_val)
    val_loss = log_loss(y_val, y_pred, labels=labels, sample_weight=weight_val)
    y_pred = estimator.predict_proba(X_train)
    train_loss = log_loss(y_train, y_pred, labels=labels, sample_weight=weight_train)
    alpha = 0.5
    # Return (metric to minimize, dict of additional metrics to log)
    return val_loss * (1 + alpha) - alpha * train_loss, {
        "val_loss": val_loss,
        "train_loss": train_loss,
        "pred_time": pred_time,
    }

The metric to minimize is the validation loss penalized by the gap between validation and training loss. The user can additionally place constraints on one or more of the metrics reported in the returned dictionary, for example:

metric_constraints = [("train_loss", "<=", 0.1), ("val_loss", "<=", 0.1)]
automl.fit(
    X_train,
    y_train,
    metric=custom_metric,  # the constrained names must be keys in the custom metric's dict
    max_iter=100,
    train_time_limit=1,
    metric_constraints=metric_constraints,
)

Estimator list. The estimator list can contain one or more estimator names, each corresponding to a built-in or custom estimator. Each estimator has its own hyperparameter search space. FLAML supports both classical machine learning models and deep neural networks, and you can also define custom models and search spaces.

Built-in estimators:

  • 'lgbm': LGBMEstimator for task "classification", "regression", "rank", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, num_leaves, min_child_samples, learning_rate, log_max_bin (logarithm of (max_bin + 1) with base 2), colsample_bytree, reg_alpha, reg_lambda.
  • 'xgboost': XGBoostSkLearnEstimator for task "classification", "regression", "rank", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, max_leaves, min_child_weight, learning_rate, subsample, colsample_bylevel, colsample_bytree, reg_alpha, reg_lambda.
  • 'xgb_limitdepth': XGBoostLimitDepthEstimator for task "classification", "regression", "rank", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, max_depth, min_child_weight, learning_rate, subsample, colsample_bylevel, colsample_bytree, reg_alpha, reg_lambda.
  • 'rf': RandomForestEstimator for task "classification", "regression", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, max_features, max_leaves, criterion (for classification only). Starting from v1.1.0, it uses a fixed random_state by default.
  • 'extra_tree': ExtraTreesEstimator for task "classification", "regression", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, max_features, max_leaves, criterion (for classification only). Starting from v1.1.0, it uses a fixed random_state by default.
  • 'histgb': HistGradientBoostingEstimator for task "classification", "regression", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, max_leaves, min_samples_leaf, learning_rate, log_max_bin (logarithm of (max_bin + 1) with base 2), l2_regularization. It uses a fixed random_state by default.
  • 'lrl1': LRL1Classifier (sklearn.LogisticRegression with L1 regularization) for task "classification". Hyperparameters: C.
  • 'lrl2': LRL2Classifier (sklearn.LogisticRegression with L2 regularization) for task "classification". Hyperparameters: C.
  • 'catboost': CatBoostEstimator for task "classification" and "regression". Hyperparameters: early_stopping_rounds, learning_rate, n_estimators.
  • 'kneighbor': KNeighborsEstimator for task "classification" and "regression". Hyperparameters: n_neighbors.
  • 'prophet': Prophet for task "ts_forecast". Hyperparameters: changepoint_prior_scale, seasonality_prior_scale, holidays_prior_scale, seasonality_mode.
  • 'arima': ARIMA for task "ts_forecast". Hyperparameters: p, d, q.
  • 'sarimax': SARIMAX for task "ts_forecast". Hyperparameters: p, d, q, P, D, Q, s.
  • 'holt-winters': Holt-Winters (triple exponential smoothing) model for task "ts_forecast". Hyperparameters: seasonal_periods, seasonal, use_boxcox, trend, damped_trend.
  • 'transformer': Huggingface transformer models for task "seq-classification", "seq-regression", "multichoice-classification", "token-classification" and "summarization". Hyperparameters: learning_rate, num_train_epochs, per_device_train_batch_size, warmup_ratio, weight_decay, adam_epsilon, seed.
  • 'temporal_fusion_transformer': TemporalFusionTransformerEstimator for task "ts_forecast_panel". Hyperparameters: gradient_clip_val, hidden_size, hidden_continuous_size, attention_head_size, dropout, learning_rate. There is a known issue with pytorch-forecast logging.

Additional fit arguments can also be defined for each estimator:

automl_settings = {
    "task": "classification",
    "time_budget": 10,
    "estimator_list": ["catboost", "rf"],
    "fit_kwargs_by_estimator": {
        "catboost": {
            "verbose": True,  # set the verbosity of catboost to True
        }
    },
}
automl.fit(X_train, y_train, **automl_settings)

Determining the time budget

The time-related parameters include:

  • time_budget: the total time limit in seconds
  • max_iter (e.g. 30): the maximum number of models to try in the AutoML process
  • train_time_limit (e.g. 1): the training time limit in seconds
  • pred_time_limit (e.g. 1e-3): the prediction time limit per instance in seconds

If you want to constrain the running time, set the time_budget parameter. How large should it be? One option is to start with a short budget and check whether the log warns that it was too short:

WARNING - All estimator hyperparameters local search has converged at least once, and the total search time exceeds 10 times the time taken to find the best model.

Alternatively, set a long budget but additionally pass early_stop=True, so the search stops automatically once it has converged. To estimate roughly how much time is needed, first run with max_iter=2; the log then reports an estimated time budget:

INFO - Estimated sufficient time budget=145194s. Estimated necessary time budget=2118s.
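
A minimal sketch of the two probing strategies just described (the budget values are illustrative):

# Probe run: two iterations are enough for FLAML to log the estimated budget
automl.fit(X_train, y_train, task="classification", max_iter=2)

# Generous budget, stopping automatically once the search has converged
automl.fit(X_train, y_train, task="classification", time_budget=3600, early_stop=True)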

Parallelism

Both parameters default to 1: n_jobs is the number of threads each trial uses, and n_concurrent_trials is the number of trials run concurrently, which can be increased when multiple CPUs are available. There are two ways to tune in parallel:

1. Parallel tuning with Ray

pip install "flaml[ray,blendsearch]"

import ray

ray.init(num_cpus=16)
automl.fit(X_train, y_train, n_jobs=4, n_concurrent_trials=4)

2. Parallel tuning with Spark (experimental; GPU training is not supported).

pip install "flaml[spark,blendsearch]>=1.1.0"

automl.fit(X_train, y_train, n_concurrent_trials=4, use_spark=True)

Ensemble models

Uses scikit-learn's stacking approach:

automl.fit(
    X_train, y_train, task="classification",
    ensemble={
        "final_estimator": LogisticRegression(),  # the final estimator (stacker)
        "passthrough": False,  # whether to pass the original features to the stacker
    },
)

Resampling strategy

Set eval_method to "holdout" or "cv" for holdout or cross-validation. For holdout, you can optionally set split_ratio, the fraction of data reserved for validation (default 0.1), or pass a separate validation set via X_val and y_val. For cv, you can set n_splits, the number of folds (default 5).
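
For example (a minimal sketch using the defaults mentioned above):

# Holdout with 10% of the data reserved for validation
automl.fit(X_train, y_train, task="classification", eval_method="holdout", split_ratio=0.1)

# 5-fold cross-validation
automl.fit(X_train, y_train, task="classification", eval_method="cv", n_splits=5)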

Data split method

Different task types use different default split methods:

  • stratified split for classification;
  • uniform split for regression;
  • time-based split for time series forecasting;
  • group-based split for learning to rank.

Set split_type="uniform" to switch to a uniform split. When split_type is in ("uniform", "stratified"), the data is shuffled. For classification and regression tasks, you can also set split_type="time" or split_type="group".
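
For example, a group-based split could look like the sketch below; the groups array is an assumption here and must provide one group label per training row.

import numpy as np

# Rows sharing a group label are never split across training and validation
groups = np.random.randint(0, 10, size=len(X_train))  # hypothetical group labels
automl.fit(X_train, y_train, task="classification", split_type="group", groups=groups)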

Warm start (to avoid training from scratch every time)

automl1 = AutoML()
automl1.fit(X_train, y_train, time_budget=3600)
automl2 = AutoML()
automl2.fit(
    X_train,
    y_train,
    time_budget=7200,
    starting_points=automl1.best_config_per_estimator,
)

Results

Take the following call as an example:

automl.fit(X_train, y_train, task="regression")
print(automl.model)
# <flaml.automl.model.LGBMEstimator object at 0x7f9b502c4550>
# Best model
print(automl.best_estimator)
# lgbm
# Best hyperparameter configuration
print(automl.best_config)
# {'n_estimators': 148, 'num_leaves': 18, 'min_child_samples': 3, 'learning_rate': 0.17402065726724145, 'log_max_bin': 8, 'colsample_bytree': 0.6649148062238498, 'reg_alpha': 0.0009765625, 'reg_lambda': 0.0067613624509965}
# Best configuration per estimator
print(automl.best_config_per_estimator)
# {'lgbm': {'n_estimators': 148, 'num_leaves': 18, 'min_child_samples': 3, 'learning_rate': 0.17402065726724145, 'log_max_bin': 8, 'colsample_bytree': 0.6649148062238498, 'reg_alpha': 0.0009765625, 'reg_lambda': 0.0067613624509965}, 'rf': None, 'catboost': None, 'xgboost': {'n_estimators': 4, 'max_leaves': 4, 'min_child_weight': 1.8630223791106992, 'learning_rate': 1.0, 'subsample': 0.8513627344387318, 'colsample_bylevel': 1.0, 'colsample_bytree': 0.946138073111236, 'reg_alpha': 0.0018311776973217073, 'reg_lambda': 0.27901659190538414}, 'extra_tree': {'n_estimators': 4, 'max_features': 1.0, 'max_leaves': 4}}
# Training time of the best configuration
print(automl.best_config_train_time)
# 0.24841618537902832
# Iteration at which the best configuration was found
print(automl.best_iteration)
# 10
# Best loss
print(automl.best_loss)
# 0.15448622217577546
# Time taken to find the best model
print(automl.time_to_find_best_model)
# 0.4167296886444092
# Configuration history
print(automl.config_history)
# {0: ('lgbm', {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20, 'learning_rate': 0.09999999999999995, 'log_max_bin': 8, 'colsample_bytree': 1.0, 'reg_alpha': 0.0009765625, 'reg_lambda': 1.0}, 1.2300517559051514)}
# Meaning: at iteration 0, the config tried is {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20, 'learning_rate': 0.09999999999999995, 'log_max_bin': 8, 'colsample_bytree': 1.0, 'reg_alpha': 0.0009765625, 'reg_lambda': 1.0} for lgbm, and the wallclock time is 1.23s when this trial is finished.

flaml.automl.model.LGBMEstimator is a wrapper class; the underlying model can be accessed via its estimator attribute:

print(automl.model.estimator)
"""
LGBMRegressor(colsample_bytree=0.7610534336273627,
              learning_rate=0.41929025492645006, max_bin=255,
              min_child_samples=4, n_estimators=45, num_leaves=4,
              reg_alpha=0.0009765625, reg_lambda=0.009280655005879943,
              verbose=-1)
"""
# Plot feature importance
import matplotlib.pyplot as plt

plt.barh(
    automl.model.estimator.feature_name_, automl.model.estimator.feature_importances_
)
plt.show()

Plotting accuracy vs. time

Increasing the time budget may further improve accuracy:

from flaml.automl.data import get_output_from_log

# settings is the dict passed to automl.fit above (it contains "log_file_name")
(
    time_history,
    best_valid_loss_history,
    valid_loss_history,
    config_history,
    metric_history,
) = get_output_from_log(filename=settings["log_file_name"], time_budget=120)

import matplotlib.pyplot as plt
import numpy as np

plt.title("Learning Curve")
plt.xlabel("Wall Clock Time (s)")
plt.ylabel("Validation Accuracy")
plt.step(time_history, 1 - np.array(best_valid_loss_history), where="post")
plt.show()