Photovoltaic Power Forecasting Without Local Data: A Spatially-Aware Approach Using Neighboring Plants

Messias, Leonardo Alves; Domingos, José Luis; Mendes, Thiago Augusto; da Costa, Bruno Barzellay Ferreira; Maciel, Ana Carolina Fernandes; Mendes, Saymon Fonseca Santos; de Aquino Gomes, Raphael

doi:10.21926/jept.2602007

Open Access Original Research

Leonardo Alves Messias ¹, José Luis Domingos ¹, Thiago Augusto Mendes ¹, Bruno Barzellay Ferreira da Costa ², Ana Carolina Fernandes Maciel ³, Saymon Fonseca Santos Mendes ⁴, Raphael de Aquino Gomes ¹,*

¹ Instituto Federal de Goiás, Goiânia, Goiás, Brazil
² Universidade Federal do Rio de Janeiro, Instituto Politécnico, Macaé, Rio de Janeiro, Brazil
³ Universidade Federal de Uberlândia, Santa Mônica, Uberlândia, Minas Gerais, Brazil
⁴ Secretaria-Geral de Governo de Goiás, Goiânia, Goiás, Brazil

* Correspondence: Raphael de Aquino Gomes

Academic Editor: George Papadakis

Special Issue: Recent Advances in Energy-Efficient Building Technologies: Envisioning a Net-Zero Future

Received: October 17, 2025 | Accepted: March 25, 2026 | Published: April 03, 2026

Journal of Energy and Power Technology 2026, Volume 8, Issue 2, doi:10.21926/jept.2602007

Recommended citation: Messias LA, Domingos JL, Mendes TA, da Costa BBF, Maciel ACF, Mendes SFS, de Aquino Gomes R. Photovoltaic Power Forecasting Without Local Data: A Spatially-Aware Approach Using Neighboring Plants. Journal of Energy and Power Technology 2026; 8(2): 007; doi:10.21926/jept.2602007.

© 2026 by the authors. This is an open access article distributed under the conditions of the Creative Commons by Attribution License, which permits unrestricted use, distribution, and reproduction in any medium or format, provided the original work is correctly cited.

Abstract

The increasing demand for renewable energy sources has intensified the adoption of photovoltaic systems. This study proposes predictive models for solar power generation that operate without dependence on on-site meteorological stations. The proposed approach integrates generation data from geographically distributed plants and accounts for the distance to meteorological stations when constructing climatic variables. Two machine learning techniques, Random Forest (RF) and Long Short-Term Memory (LSTM) networks, were evaluated. The RF model achieved R² > 0.90 with lower MAE and RMSE values for 24-hour prediction windows, whereas the LSTM model demonstrated superior performance for extended horizons (48 hours). Moreover, the proposed models effectively identified anomalies and maintained robust predictive accuracy even when utilizing data from meteorological stations located up to 151.9 km away. Overall, the proposed approach produced results comparable to those reported for traditional models and recent state-of-the-art methods that rely on local meteorological data, as indicated by R², MAE, and RMSE values reported in the literature.

Keywords

Anomaly detection; machine learning; photovoltaic forecasting

1. Introduction

Electricity generation from fossil fuels remains a major contributor to greenhouse gas emissions. In this context, photovoltaic solar energy plays a strategic role, particularly in Brazil, where hydroelectric plants still represent approximately 48.4% of the national installed capacity [1].

In 2022, photovoltaic generation increased by 72.6% compared to the previous year, according to data from the Electric Energy Trading Chamber (CCEE) [1]. This expansion was driven by Normative Resolution No. 482/2012 [2], which established the regulatory framework for microgeneration (≤75 kW) and minigeneration (75 kW-3 MW), as well as the electricity compensation system.

Policies, tax incentives, and financing programs have further accelerated the expansion of installed capacity; however, challenges persist in system operation, maintenance, and forecasting. Factors such as dust accumulation, atmospheric pollutants, and surface residues degrade module performance, whereas variations in solar radiation and temperature directly influence energy generation.

In this context, generation forecasting based on environmental data has become a crucial practice for ensuring the efficiency and reliability of photovoltaic systems. Studies such as [3] analyze photovoltaic system efficiency from the perspective of generation forecasting, highlighting the estimation of energy potential based on climatic and operational conditions. Similarly, [4] presents research conducted at a renewable energy park in Kuwait, which evaluates meteorological and photovoltaic generation data using machine learning techniques to predict short-term energy production.

The rapid expansion of global PV capacity and the challenges posed by climate variability have driven a vast literature on power forecasting, including physical, statistical, machine learning (ML), and deep learning (DL) models [5,6,7]. Recent reviews classify these models by forecast horizon, architecture, and application, highlighting the growing role of data-driven and hybrid approaches [5,6,8,9].

Machine learning (ML)-based models, such as Random Forest (RF) and Long Short-Term Memory (LSTM) networks, have been widely employed for photovoltaic generation forecasting, demonstrating superior predictive performance relative to traditional statistical methods [4,10]. Recent comparative studies show that ML-based models outperform traditional statistical methods in different forecast horizons [11,12,13]. Ensembles and hybrid models have shown even greater robustness and stability, especially under variable weather conditions [5,14,15].

Integrated approaches to forecasting and fault detection have attracted attention, combining power-forecasting models with DL- or autoencoder-based anomaly-detection techniques [16,17,18]. These works show the potential of using deviations between predicted and observed generation to identify faults such as partial shading, potential-induced degradation (PID), and line faults [16,17]. However, they generally rely on meteorological data collected at or near the photovoltaic plant and do not explicitly account for multiple power plants or the distance between data sources. This dependence limits the applicability of such models in regions with low meteorological station density or insufficient monitoring infrastructure. Furthermore, increasing the distance between the meteorological station and the generation site is associated with higher average forecast errors.

Despite advances, most studies assume the availability of local meteorological measurements or dedicated plant sensors [5,11,19,20]. When this premise is not valid — due to low station density or incomplete monitoring — accuracy degrades rapidly with increasing distance between station and plant [21,22]. Recent studies are beginning to explore regional forecasting using aggregated or central-point meteorological data, without specific plant data, highlighting the relevance of models that explicitly address the spatial dimension and data scarcity [22].

Therefore, this study proposes a novel approach to energy forecasting that integrates photovoltaic generation data with meteorological variables from neighboring stations, explicitly accounting for the distance between the data-collection points and the forecast site.

The proposed approach seeks to expand the applicability of forecasting models in contexts with limited meteorological coverage and to enable the detection of anomalies and operational failures in photovoltaic systems, including hot spots, microcracks, delamination, and glass breakage in modules. The methodology considers generation data from multiple plants and meteorological variables from nearby stations, accounting for the spatial distance between the plants and the stations. Furthermore, the study includes a comparative evaluation of different algorithms, time horizons, and data volumes, contributing to the development of more robust and practical forecasting models. In addition to internal validation, a comparative analysis with recent literature was conducted, demonstrating that the proposed models outperformed traditional approaches in specific contexts and achieved competitive performance relative to state-of-the-art methods that use local meteorological data.

2. Methodology

The methodology involved collecting and preprocessing photovoltaic generation and meteorological data to construct input features for machine learning models. Several time window lengths—1, 6, 12, 24, and 48 hours—were evaluated to assess how the historical data window length influences model performance and to capture atmospheric patterns at different temporal scales. Similar approaches [23] have demonstrated effectiveness for environmental time series, particularly when using a 24-hour horizon, which is adopted in this study for short-term forecasting applications. This choice is consistent with the findings of Qin et al. [24], which report strong predictive performance and lower variability within this range.
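To make the role of the historical window concrete, the following is a minimal sketch of how an hourly series could be cut into fixed-length input windows paired with the next value. It is an illustration only: the function name `make_windows`, the one-step-ahead target, and the synthetic series are assumptions, not the authors' preprocessing code.

```python
def make_windows(series, window, horizon=1):
    """Pair each `window`-hour history with the value `horizon` hours later."""
    inputs, targets = [], []
    for start in range(len(series) - window - horizon + 1):
        end = start + window
        inputs.append(series[start:end])           # the last `window` hourly values
        targets.append(series[end + horizon - 1])  # the value to be predicted
    return inputs, targets


# Synthetic placeholder: 72 hourly readings (three days), not real plant data.
hourly_power = [float(h) for h in range(72)]

# The window lengths evaluated in the study.
for window in (1, 6, 12, 24, 48):
    inputs, targets = make_windows(hourly_power, window)
    print(f"window={window:>2} h -> {len(inputs)} training samples")
```

Longer windows give each sample more context but leave fewer samples from the same record, which is one reason the window length is worth comparing on a short dataset.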

The dataset comprises hourly meteorological and operational variables. Climatic data—including temperature, humidity, solar radiation, wind speed, and precipitation—were obtained from the INMET station located in Goiânia, Goiás, Brazil (16°38’24” S, 49°13’12” W), which represents the only regional source providing public, continuous, hourly records. Although this station is not co-located with the photovoltaic plants, its use reflects the typical Brazilian context, characterized by limited monitoring infrastructure.

The exclusive use of a single meteorological station limits the generalizability of the findings, as climatic variability in other Brazilian regions or countries with different topography, humidity patterns, or cloud dynamics may not be fully represented by the Goiânia station. Nonetheless, this configuration mirrors the reality of many distribution systems that rely on sparse monitoring networks, and the conclusions should therefore be interpreted as evidence of feasibility under similar infrastructure constraints rather than as universal performance guarantees.

Another limitation of this study is that the data collection period does not span the full annual cycle, potentially biasing the representation of seasonal variability. The dataset covers a continuous period of 3 months, from August 2024 to November 2024, which corresponds predominantly to the rainy season in the study region. As a consequence, the models are primarily calibrated under these seasonal conditions, and their performance metrics may not fully capture behavior across the annual cycle, particularly in underrepresented seasons.

The operational variable analyzed was the generated photovoltaic power (P_pv), measured in kilowatt-peak (kWp). Data were collected from five photovoltaic plants equipped with Growatt frequency inverters and photovoltaic modules from various manufacturers. This private dataset comprises power plants located in different municipalities within the state of Goiás, as presented in Table 1. The plants were selected according to two criteria: (i) data integrity and (ii) distance from the meteorological station, restricted to a radius of approximately 150 km.

Table 1 Distance and capacity of photovoltaic plants used in the research.

The selection of these plants was based on an analysis of distances between INMET meteorological stations in the state of Goiás, which form a sparse yet regionally representative observational network. The maximum distance between two stations in this network is approximately 142.78 km, indicating the typical spatial separation in the region. Therefore, adopting a radius of approximately 150 km ensures a representative sampling of geographic variability without substantially exceeding the spatial limits of the station distribution.
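The plant-to-station distance used for plant selection and as a model input can be obtained from geographic coordinates. The sketch below uses the haversine great-circle formula, which is an assumption here since the text does not state how distances were computed. The station coordinates are those given above for INMET Goiânia; the plant coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# INMET Goiânia station: 16°38'24" S, 49°13'12" W
station = (dms_to_decimal(16, 38, 24, "S"), dms_to_decimal(49, 13, 12, "W"))

# Hypothetical plant location, for illustration only.
plant = (-17.0, -49.5)

distance = haversine_km(*station, *plant)
within_radius = distance <= 150.0  # selection radius of approximately 150 km
print(f"{distance:.1f} km, within radius: {within_radius}")
```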

To evaluate the impact of different information sources on forecasting performance, three scenarios were defined. In all scenarios, one plant is designated as the target (output), and the remaining plants serve as inputs. The target plant is rotated, with each of the five plants predicted in turn, using the same scenario definition. The objective was to analyze how the availability of information provided to the model influences the accuracy of photovoltaic power generation forecasting.

In Scenario 1, the model receives as inputs: (i) the historical photovoltaic power of the neighboring plants (all plants except the current target), (ii) the meteorological variables measured at the INMET Goiânia station, and (iii) the distance between each plant and the station. The output is the target plant’s power generation.

In Scenario 2, only operational information from the neighboring plants is used. The inputs consist exclusively of historical photovoltaic power for all plants, excluding the current target, and the output remains the target plant’s power generation. Meteorological variables are not provided to the model in this scenario.

In Scenario 3, the inputs are restricted to the meteorological variables from the INMET Goiânia station and the distance between the station and the target plant. No power-generation data from other plants is used as input, and the model predicts only the target plant’s power generation.

This strategy enables evaluation of how different input datasets, individually and in combination, affect model performance when predicting the target plant’s power generation.
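The three scenarios and the rotation of the target plant can be summarized in a short sketch. The plant names, readings, and distances below are placeholders, and the function `build_inputs` is a hypothetical helper written for illustration rather than the authors' pipeline; it assembles a single hourly feature row, whereas the actual models use a history of such rows.

```python
WEATHER_VARS = ("temperature", "humidity", "solar_radiation", "wind_speed", "precipitation")


def build_inputs(scenario, target, plant_power, weather, distance_km):
    """Assemble one hourly feature row for the chosen scenario.

    plant_power: dict mapping plant name -> power at that hour
    weather:     dict mapping variable name -> station reading at that hour
    distance_km: dict mapping plant name -> distance to the station
    """
    neighbors = sorted(p for p in plant_power if p != target)
    neighbor_power = [plant_power[p] for p in neighbors]
    station_vars = [weather[v] for v in WEATHER_VARS]
    if scenario == 1:
        # Neighbor generation + station weather + each plant's distance.
        all_distances = [distance_km[p] for p in sorted(plant_power)]
        return neighbor_power + station_vars + all_distances
    if scenario == 2:
        # Neighbor generation only.
        return neighbor_power
    if scenario == 3:
        # Station weather + distance between the station and the target plant.
        return station_vars + [distance_km[target]]
    raise ValueError(f"unknown scenario: {scenario}")


# Placeholder values only; none of these are measurements from the study.
plant_power = {"P1": 2100.0, "P2": 5300.0, "P3": 4800.0, "P4": 1500.0, "P5": 1300.0}
weather = {"temperature": 29.5, "humidity": 41.0, "solar_radiation": 810.0,
           "wind_speed": 2.3, "precipitation": 0.0}
distance_km = {"P1": 10.0, "P2": 40.0, "P3": 80.0, "P4": 120.0, "P5": 150.0}

# Rotate the target: each plant is predicted in turn from the others.
for target in sorted(plant_power):
    sizes = {s: len(build_inputs(s, target, plant_power, weather, distance_km))
             for s in (1, 2, 3)}
    print(target, sizes)
```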

For quantitative evaluation of the results, three widely used metrics were employed: mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R²). The first two quantify absolute and quadratic deviations between predicted and observed values, allowing comparison of how differences in solar generation data and distances to the meteorological station affect mean errors. The R² value measures the proportion of variance explained by the model and serves as a robust indicator of the model’s goodness of fit.
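For reference, the three metrics can be written out directly from their standard definitions. The sketch below is a plain-Python illustration with made-up values, not the evaluation code used in the study.

```python
import math


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def r2(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up hourly values in watts, for illustration only.
observed = [0.0, 100.0, 200.0, 300.0]
predicted = [10.0, 90.0, 210.0, 290.0]

print(mae(observed, predicted), rmse(observed, predicted), r2(observed, predicted))
```

MAE and RMSE are expressed in the same unit as the power series, whereas R² is dimensionless, so only R² is directly comparable between plants of different capacity.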

3. Construction of Models

To clarify the tuning process, it is important to note that hyperparameters are configuration variables specified before training, which directly influence a model’s generalization ability, convergence rate, and overall performance. Unlike parameters that are learned by the algorithm, hyperparameters must be selected manually or determined through automated search methods.

To tune the hyperparameters of the Random Forest (RF) and Long Short-Term Memory (LSTM) models, we used GridSearchCV with cross-validation. This approach conducts an exhaustive search over predefined parameter combinations, using k-fold cross-validation to assess the predictive performance of each configuration. In this technique, the dataset is partitioned into k approximately equal-sized subsets; at each iteration, one subset is reserved for validation, and the remaining subsets are used for training. In this study, k = 5 was selected as it provides a balanced trade-off between the number of folds and statistical reliability, while reducing the risk of overfitting compared to a simple hold-out split.
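The partitioning step described above can be sketched without any library. The generator below splits sample indices into k contiguous folds whose sizes differ by at most one; contiguous, unshuffled folds are an assumption made for simplicity, and the function name is illustrative.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(23, k=5))
for i, (train, validation) in enumerate(folds, start=1):
    print(f"fold {i}: {len(train)} training samples, {len(validation)} validation samples")
```

Every sample is used for validation exactly once, so the averaged score depends less on a single lucky or unlucky split than a hold-out estimate would.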

For the RF model, the tuned hyperparameters included the number of trees in the forest (n_estimators), the maximum tree depth (max_depth), the minimum number of samples required to split a node (min_samples_split), and the minimum number of samples required per leaf (min_samples_leaf). Table 2 presents the ranges of values considered for each parameter during the search process.

Table 2 Numerical hyperparameter ranges used in the GridSearchCV procedure.

For the LSTM model, the same database used for the RF model was employed to ensure comparability between the two approaches. The tested hyperparameters were defined based on recommendations from the specialized literature [25], including the dropout rate (model_dropout_rate), number of neurons per layer (model_neurons), activation function (model_activation), optimizer (model_optimizer), number of training epochs (epochs), and batch size (batch_size). The values considered for the numerical parameters are presented in Table 2. The categorical hyperparameters for the LSTM model included the activation function (model_activation), which was tested with four options: relu, tanh, sigmoid, and linear. Additionally, the optimizer (model_optimizer) was set to Adam.

At the end of the process, the hyperparameter combinations that yielded the best predictive performance were selected. For the Random Forest (RF) model, the optimal configuration included n_estimators set to 100, max_depth equal to 20, min_samples_split set to 2, and min_samples_leaf equal to 1. For the LSTM model, the best-performing hyperparameters were a batch size of 16, epochs set to 64, tanh activation, a dropout rate of 0.0, 15 neurons, and the Adam optimizer.
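The RF search described above can be reproduced in outline with scikit-learn. The sketch below runs on synthetic data, and the grid values are assumptions chosen only so that the reported optimum (100 trees, depth 20, split 2, leaf 1) lies inside the grid; the actual ranges are those of Table 2. The `scoring="r2"` choice is likewise an assumption, since the scoring function used in the study is not stated.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the feature matrix and target; not the study's data.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 6))
y = 3.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + rng.normal(0.0, 0.05, size=200)

# Hypothetical grid; the real ranges are listed in Table 2.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,          # k = 5 folds, as in the study
    scoring="r2",  # assumed scoring function
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The best configuration found on synthetic data carries no meaning for the photovoltaic dataset; the point of the sketch is only the structure of the search.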

4. Results and Discussion

This section presents and discusses the results of applying machine learning models to predict photovoltaic power generation using different combinations of input variables. Although the case study was conducted in the state of Goiás, Brazil, the proposed approach aims to provide an adaptable solution for regions where meteorological stations are not installed at the exact locations of the power plants, thereby exploiting the spatial transferability of climatic and operational data.

Furthermore, because the data collection period does not span a full annual cycle, the reported performance should be interpreted as representative of the dominant seasonal conditions in the dataset rather than as a comprehensive characterization of all possible seasonal regimes. In particular, forecast accuracy and residual patterns may vary during months with markedly different cloud cover or irradiance profiles that are under‑represented in the current sample. Moreover, because all climatic inputs are derived from the INMET Goiânia station, the quantitative performance metrics reported here are representative of this specific regional context and station–plant configuration. As a result, the transfer of trained models to areas with markedly different meteorological regimes should be undertaken with caution and, ideally, supported by local validation.

Figure 1 illustrates the R² values obtained for each model and scenario.

Figure 1 Values of R² obtained across different scenarios and time scales of the input data: (A) Scenario 1 using RF; (B) Scenario 1 using LSTM; (C) Scenario 2 using RF; (D) Scenario 2 using LSTM; (E) Scenario 3 using RF; (F) Scenario 3 using LSTM.

Overall, Scenario 1 performed best for both models, confirming that combining neighboring-plant generation, station meteorological variables, and distance produces reliable estimates. The RF model demonstrated superior predictive performance for 24- and 48-hour windows. In contrast, the LSTM model performed better with longer historical periods, highlighting the LSTM’s sensitivity to the amount of available temporal information.

In Scenario 2, where the models were deprived of meteorological input data, the LSTM model outperformed the RF model in most plants and time windows, consistently achieving higher R² values, particularly for the 48- and 72-hour historical periods. This result underscores the neural network’s ability to identify nonlinear relationships and implicit correlations in historical generation data, thereby partially compensating for the absence of local meteorological variables. When meteorological variables from nearby stations were included, the model achieved higher R² values for all plants than in the scenario without these inputs, indicating that the network effectively extracted relevant information to improve forecast accuracy. These findings align with the study’s objective of developing a method that performs well under data-limited conditions.

Scenario 3, which included only meteorological variables and distance, performed worse than the others. Based on the R² results, the RF model showed less variation across all time windows, whereas the LSTM model required a larger dataset to achieve satisfactory performance. These findings indicate that modeling based solely on environmental variables can be useful but requires complementary generation history data to maximize predictive accuracy.

The analysis of the three scenarios demonstrates that the appropriate selection of models, input variables, and historical periods directly influences forecast performance. When the objective is to detect anomalies based on deviations from predicted generation, ensuring high predictive accuracy becomes crucial. In this context, the results confirm that Scenario 1 has the greatest practical potential, particularly for plants near meteorological stations.

Table 3 summarizes, for each target plant, the highest R² values obtained considering scenario, model, and historical window. In this table, only the scenario associated with the best result for each plant–model combination is reported to avoid ambiguity. For Plant 4 with the RF model, Scenarios 1 and 3 yielded the same R² value; therefore, Scenario 1 was retained in the table for brevity. The same happened for Plant 5 with the RF model, with Scenarios 1 and 2 yielding the same best R². For example, Plant 5 achieved the best overall performance (R² = 0.952) using an LSTM model with a 48-hour window, highlighting the advantage of longer input windows for recurrent neural network–based models. Across all plants, the LSTM yielded R² values above 0.917 for 48-hour windows, consistently surpassing the RF model, whose best result was R² = 0.944, also obtained for Plant 5.

Table 3 Best R² values by plant, scenario, and model.

These results indicate that the LSTM model is particularly effective in contexts with rich time series data and the inclusion of meteorological variables, making it a promising tool for automated monitoring and fault detection. However, its performance depends strongly on the quality and quantity of the input data, underscoring the importance of careful preprocessing and appropriate time-window selection.

The comparison between Scenarios 1 and 2 indicates that, even in the absence of local meteorological data, satisfactory forecasts can be achieved using information from distant plants or stations, provided that the data are properly contextualized, for instance, by incorporating the distance variable. This finding is essential for enabling the model’s application in remote locations or in systems without dedicated meteorological sensors.

Thus, the results corroborate the paper’s proposal by demonstrating that, even without local meteorological data, it is possible to forecast photovoltaic generation with good predictive accuracy, thereby enabling these forecasts to serve as a reference for identifying deviations and anomalies. The proposed methodology, based on data sharing among plants and accounting for distances to meteorological stations, is a concise, scalable, and low-cost alternative for diagnostic and predictive maintenance systems in solar plants.

4.1 Evaluation among Models

Analysis of the forecast residuals revealed a systematic underestimation by the LSTM model, as illustrated in Figure 2. This pattern suggests that the model was trained primarily on data from periods of lower solar radiation, leading to forecasts that underestimate actual values during periods of higher solar radiation. This limitation was particularly evident at Plant 2, where more than 59% of the residuals were negative, indicating a persistent tendency of the model to underestimate peak solar generation. Such behavior compromises accurate identification of relevant deviations and may hinder the detection of subtle operational failures, a central objective of this study.
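A bias check of this kind amounts to counting the sign of the residuals. The sketch below assumes the convention residual = predicted minus observed, which is consistent with negative residuals indicating underestimation but is not stated explicitly in the text; the values are invented.

```python
def negative_residual_share(observed, predicted):
    """Fraction of hours in which the forecast fell below the observed value."""
    residuals = [p - o for o, p in zip(observed, predicted)]  # assumed sign convention
    negatives = sum(1 for r in residuals if r < 0)
    return negatives / len(residuals)


# Invented hourly values in watts, for illustration only.
observed = [1000.0, 2500.0, 4000.0, 3000.0, 500.0]
predicted = [1100.0, 2300.0, 3500.0, 2900.0, 520.0]

share = negative_residual_share(observed, predicted)
print(f"{share:.0%} of residuals are negative")
```

A share close to 50% suggests no systematic direction in the errors, while a clearly larger share points to persistent underestimation.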

Figure 2 Comparison of actual and predicted values for the best R²: (A) Plant 1 - RF; (B) Plant 1 - LSTM; (C) Plant 2 - RF; (D) Plant 2 - LSTM; (E) Plant 3 - RF; (F) Plant 3 - LSTM; (G) Plant 4 - RF; (H) Plant 4 - LSTM; (I) Plant 5 - RF; (J) Plant 5 - LSTM.

Furthermore, an increase in error dispersion was observed as the forecast horizon extended, as shown in Figure 2. This behavior is associated with inherent uncertainty in climatic variables, which directly affects models’ ability to maintain accuracy over longer time scales. The LSTM model, in particular, exhibited greater instability during transitions between low and high generation, with residuals exhibiting more pronounced variation. This pattern arises from the LSTM’s sequential nature, which relies on historical data and tends to propagate errors over time, notably when the training dataset lacks sufficient representation of these critical periods.

The systematic underestimation observed in LSTM predictions during high-irradiance periods can be attributed to two factors. First, the training dataset contains a higher proportion of records corresponding to moderate or low generation levels, which biases the network toward these operating conditions. Second, the LSTM’s activation functions and temporal smoothing behavior tend to dampen abrupt variations, thereby reducing the model’s responsiveness to sudden peaks in solar irradiance or power output. As a result, extreme values are often predicted as closer to the mean, leading to an underestimation of generation peaks. In practical terms, this systematic bias affects anomaly detection by increasing the likelihood of false positives, situations where normal high-generation events are incorrectly flagged as anomalies because they exceed the model’s forecasted values. However, this tendency can also enhance the detection of true anomalies characterized by lower-than-expected generation, such as shading, inverter faults, or module degradation, since these deviations remain statistically significant relative to the predicted baseline. To mitigate this imbalance, future work could explore data augmentation strategies or hybrid modeling schemes that explicitly account for peak-generation dynamics during training.
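One way to act on this asymmetry is a one-sided rule that flags only hours in which observed generation falls well below the forecast. The following is a hedged sketch of such a rule, not the detection procedure used in the study: the function name, the 30% threshold, and the example values are all assumptions.

```python
def flag_shortfalls(observed_w, predicted_w, nominal_w, threshold=0.30):
    """Flag hours where generation is below forecast by more than threshold * nominal power."""
    flags = []
    for obs, pred in zip(observed_w, predicted_w):
        shortfall = pred - obs  # positive when the plant produced less than forecast
        flags.append(shortfall > threshold * nominal_w)
    return flags


# Invented values for a hypothetical 5 kWp plant (5000 W nominal).
observed_w = [3000.0, 1200.0, 4100.0]
predicted_w = [3200.0, 3100.0, 3900.0]

flags = flag_shortfalls(observed_w, predicted_w, nominal_w=5000.0)
print(flags)
```

Because the rule ignores hours in which the plant exceeds the forecast, peaks that the model underestimates are not reported as anomalies, at the cost of missing faults that would show up as excess readings.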

As shown in Figure 3, the RF model produced smaller, more stable residuals across all plants, demonstrating greater reliability for continuous monitoring applications. This stability is especially important for smaller plants, where deviations exceeding 30% of nominal power can substantially affect generation planning and grid operations. The robustness of the RF model arises from its ability to capture nonlinear patterns in solar generation while exhibiting reduced sensitivity to noise, making it well-suited for anomaly detection in environments characterized by high natural variability.

Figure 3 Comparison of residuals of hourly P_pv for the best R²: (A) Plant 1 - RF; (B) Plant 1 - LSTM; (C) Plant 2 - RF; (D) Plant 2 - LSTM; (E) Plant 3 - RF; (F) Plant 3 - LSTM; (G) Plant 4 - RF; (H) Plant 4 - LSTM; (I) Plant 5 - RF; (J) Plant 5 - LSTM.

Although the LSTM achieved higher R² values, indicating better overall variance explanation, this metric does not directly reflect the point-to-point predictive reliability that is crucial for fault detection. This behavior can be observed, for instance, at Plant 1 (5 kWp), where the RF model yielded residuals ranging from -3,919 W to 3,117 W, whereas the LSTM exhibited an even wider range, from -4,803 W to 3,193 W. The greater dispersion observed in the LSTM forecasts compromises its suitability as a reference for detecting significant deviations.

At Plant 2 (10 kWp), the RF model maintained residuals between -5,413 W and 4,399 W (54% and 44% of peak power, respectively). In contrast, the LSTM model exhibited deviations exceeding 88% of its capacity, with residuals ranging from -8,838 W to 7,818 W. This high variability is attributable to the LSTM’s strong dependence on the quality of the time series and data preprocessing, which are critical for ensuring its stability in operational forecasting tasks.

Similar results were observed at the other plants. At Plant 3 (10 kWp), the RF model produced residuals ranging from -7,939 W to 7,070 W (up to 79% of peak power), with a more symmetric distribution around zero. The LSTM model, in contrast, exhibited more pronounced fluctuations (-9,137 W to 6,983 W), compromising forecast consistency. At Plant 4 (3.5 kWp), the RF model yielded maximum residuals of 1,671 W (47.7%), while the LSTM model reached 2,046 W (58.4%). Finally, at Plant 5 (3 kWp), the RF model’s residuals reached 1,825 W (60.8%), whereas the LSTM model reached 2,508 W (83.6%).
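The percentages quoted alongside the residuals are the residual magnitude divided by the plant's nominal capacity. A short worked check, using three of the figures reported above and an illustrative helper function, is:

```python
def residual_share_of_capacity(residual_w, capacity_kwp):
    """Residual magnitude as a percentage of nominal capacity."""
    return 100.0 * abs(residual_w) / (capacity_kwp * 1000.0)


plant4_rf = residual_share_of_capacity(1671, 3.5)    # Plant 4, RF
plant5_rf = residual_share_of_capacity(1825, 3.0)    # Plant 5, RF
plant5_lstm = residual_share_of_capacity(2508, 3.0)  # Plant 5, LSTM

for label, value in (("Plant 4 RF", plant4_rf),
                     ("Plant 5 RF", plant5_rf),
                     ("Plant 5 LSTM", plant5_lstm)):
    print(f"{label}: {value:.1f}% of nominal capacity")
```

Normalizing by capacity is what allows residuals from a 3 kWp plant and a 10 kWp plant to be compared on the same scale.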

The comparative analysis indicates that the LSTM consistently achieved lower RMSE values across all five plants. For instance, at Plant 1, the RMSE decreased from 567.57 kWp with the RF model to 491.85 kWp with the LSTM. At Plant 2, it declined from 1146.58 kWp to 914.70 kWp. Similar trends were observed at Plants 3, 4, and 5, where the LSTM achieved reductions from 1163.87 to 1046.53 kWp, from 275.37 to 242.34 kWp, and from 256.86 to 238.71 kWp, respectively. These results demonstrate the LSTM’s superior ability to capture temporal dependencies, particularly in Scenario 1 with 48-hour historical windows, supporting its use in applications focused on general trend analysis.

In contrast, when comparing MAE values, the RF model outperformed the LSTM in four of the five plants. Specifically, the RF achieved lower MAE at Plant 1 (234.24 vs. 281.56 kWp), Plant 2 (546.94 vs. 569.50 kWp), Plant 4 (110.38 vs. 144.29 kWp), and Plant 5 (106.73 vs. 140.42 kWp). Only at Plant 3 did the LSTM yield a slightly higher error (626.90 kWp) compared to the RF (554.60 kWp). These results indicate that the RF model tends to produce more stable and less dispersed predictions in terms of absolute deviations, particularly in Scenarios 1 and 2 with 24-hour windows, making it more suitable for operational contexts where minimizing mean absolute error is critical for detecting anomalous deviations.
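MAE and RMSE can rank two models differently because RMSE squares the errors and therefore weights occasional large deviations more heavily. The sketch below illustrates this with two invented error series; it is a generic illustration and does not represent the residuals of either model in the study.

```python
import math


def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean square error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Invented residuals in watts.
model_a = [0.0, 0.0, 0.0, 400.0]        # mostly exact, one large miss
model_b = [150.0, 150.0, 150.0, 150.0]  # consistently moderate errors

print("model A:", mae(model_a), rmse(model_a))
print("model B:", mae(model_b), rmse(model_b))
```

Here model A has the lower MAE and model B the lower RMSE, so the choice of metric should follow the application: MAE reflects typical deviation, while RMSE penalizes rare large misses.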

Overall, the results confirm that the LSTM model effectively explains generation variability, especially when trained on larger, more diverse datasets. However, its sensitivity to the training data can compromise robustness in real-world applications. The RF model, in contrast, exhibits more stable residuals and lower susceptibility to outliers, making it more suitable for developing decision-support systems for failure identification. It also shows potential for use in plants lacking local meteorological sensors, provided that generation and meteorological data from nearby photovoltaic systems are available.

4.2 Comparison with Related Work

The results of this study were compared with recent research employing machine learning techniques to forecast photovoltaic power generation. The Random Forest (RF) model developed in this work achieved a coefficient of determination greater than 0.90 in 24-hour windows, which is higher than the R² = 0.8111 reported by Olcay et al. [26] for their RF model, suggesting competitive performance in the evaluated case study. However, these differences were not assessed with formal statistical significance tests. In addition, Pisal et al. [27] reported that, in their study on floating photovoltaic systems, the best traditional model, Gradient Boosting, achieved an R² value of 0.6893. Similarly, Rangelov et al. [28] found that RF and deep neural network (DNN) models exhibited superior performance. In the present study, model instability under conditions of high climatic variability was reduced, as reflected in greater consistency, particularly when incorporating data from neighboring plants rather than relying solely on local meteorological stations.

Regarding LSTM networks, the studies by Olcay et al. [26] and Balal et al. [29] reported superior performance, with correlation and R² values approaching 0.98. In contrast, the present study observed the highest LSTM performance for forecast horizons longer than 48 hours. This outcome remains competitive in the application context, as the proposed model operates without local meteorological data — an often-unavailable resource in many regions. Although the absolute R² values were slightly lower than those reported by Balal et al. [29], who achieved R² = 0.977 for RF and R² = 0.975 for LSTM, the approach presented here enhances the practical applicability of photovoltaic power generation forecasting in scenarios with limited monitoring infrastructure.

Table 4 summarizes the results, indicating that the approach based on data from neighboring plants attains accuracy levels similar to or higher than those reported in several studies employing traditional methodologies across different contexts. These comparisons are based on published point estimates of R², MAE, and RMSE and should be interpreted as indicative rather than as statistically validated superiority, as no hypothesis testing was performed. Moreover, the proposed approach provides greater operational robustness by reducing dependence on local data, thereby advancing solar energy forecasting.
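
For reference, the three point estimates used in these comparisons can be computed as in the minimal sketch below. It uses the standard definitions of R², MAE, and RMSE; the function name `regression_metrics` and the sample values are hypothetical and are not taken from the study's data or repository.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R2, MAE, RMSE) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R2 is undefined when y_true is constant")
    r2 = 1.0 - ss_res / ss_tot
    return r2, mae, rmse

# Hypothetical hourly generation values, for illustration only.
measured = [0.0, 1.0, 2.0, 3.0, 4.0]
predicted = [0.0, 1.0, 2.0, 3.0, 3.0]
r2, mae, rmse = regression_metrics(measured, predicted)
print(round(r2, 3), round(mae, 3), round(rmse, 3))  # 0.9 0.2 0.447
```

Because MAE and RMSE carry the units of the target variable, they are only comparable across studies when plant capacities and normalization are similar, whereas R² is dimensionless.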

Table 4 Comparison of prediction results in different studies.

Although some studies listed in Table 4 report slightly higher R² values, particularly those employing dense local meteorological data or long multi-year datasets, these results were obtained under conditions of extensive instrumentation and data availability that are uncommon in many real-world photovoltaic installations. In contrast, the present approach is specifically designed for low-data scenarios and requires only one regional meteorological station and operational data from nearby plants. This configuration markedly reduces hardware costs, data-acquisition complexity, and maintenance demands, while still achieving competitive accuracy (R² > 0.90) through spatially informed modeling. Therefore, even if the absolute R² values are slightly lower than those of data-rich studies, the proposed method offers greater practical applicability in contexts with sparse monitoring networks, where traditional models relying on local sensors cannot be deployed effectively.

5. Final Remarks

This study investigated photovoltaic power generation forecasting using machine learning techniques to provide an effective solution for scenarios without a weather station at the plant site. To this end, Random Forest (RF) and Long Short-Term Memory (LSTM) models were evaluated and applied to different photovoltaic systems, input scenarios, and time windows. The main innovation of the proposed approach lies in using data from neighboring plants and the distance to the nearest weather station as explanatory variables, enabling accurate predictions even in the absence of local meteorological measurements.
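
As an illustration of how such explanatory variables might be combined, the sketch below assembles one model input vector from neighboring-plant output, regional station readings, and the station distance. The function `build_feature_row`, the variable names, and all numeric values except the 30.1 km distance listed for plant 4 (Trindade) are hypothetical; the actual feature layout used in the study may differ.

```python
def build_feature_row(neighbor_generation_kw, station_weather, station_distance_km):
    """Concatenate neighbor output, regional weather, and distance into one input vector."""
    row = list(neighbor_generation_kw)
    # Sort the weather keys so the column order is stable across rows.
    row.extend(station_weather[name] for name in sorted(station_weather))
    row.append(station_distance_km)
    return row

# Hypothetical readings for a single hour; only the distance comes from the paper.
row = build_feature_row(
    neighbor_generation_kw=[2.1, 6.4],
    station_weather={"irradiance_w_m2": 640.0, "temperature_c": 29.5},
    station_distance_km=30.1,
)
print(row)  # [2.1, 6.4, 640.0, 29.5, 30.1]
```

Keeping the distance as an explicit column lets a single model be trained on rows from plants at different distances from the station, rather than fitting one model per plant.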

The results showed that the RF model exhibited greater stability and lower mean absolute error (MAE) across several plants, with coefficients of determination (R²) ranging from 0.903 to 0.944, indicating its robustness under limited or noisy data. Although the LSTM model achieved slightly higher R² values in some cases (up to 0.954), it exhibited greater residual variability, indicating greater sensitivity to the volume and quality of historical data. These findings highlight the advantage of the RF model for applications in operational environments with restricted meteorological data and computational resources.

The residual analysis revealed that the LSTM model tends to underestimate generation during periods of high irradiance, particularly when the training data are concentrated in lower-generation intervals. Moreover, error dispersion increased with longer forecast horizons, reflecting the influence of climatic uncertainty. Intermediate 48-hour windows provided a good balance between performance and stability for the LSTM model, whereas 24-hour windows were more effective for the RF model.
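
The effect of window length on the amount of usable training data can be illustrated with a minimal sketch. It assumes that a window denotes the number of past hourly samples supplied to the model, which this section does not state explicitly; `make_windows` and the placeholder series are hypothetical.

```python
def make_windows(series, window_hours):
    """Pair each run of `window_hours` past samples with the sample that follows it."""
    if window_hours <= 0 or window_hours >= len(series):
        raise ValueError("window_hours must be positive and shorter than the series")
    inputs, targets = [], []
    for end in range(window_hours, len(series)):
        inputs.append(series[end - window_hours:end])  # the preceding window
        targets.append(series[end])                    # the value to predict
    return inputs, targets

# 96 hypothetical hourly samples (four days); the values are placeholders.
hourly = [float(h % 24) for h in range(96)]
for w in (24, 48, 72):
    X, y = make_windows(hourly, w)
    print(w, len(X))
# 24 72
# 48 48
# 72 24
```

Under this assumption, a longer window leaves fewer input-target pairs from the same record, which matters when the data-collection period is short.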

The comparison between the models revealed that there is no single optimal solution applicable to all contexts. Nevertheless, the results confirm that, even under infrastructure constraints, reliable forecasts can be obtained by integrating historical generation data, regional meteorological variables, and distance information. This approach enables predictive models to serve as references for anomaly and fault detection, thereby eliminating the need for local sensors and offering an economically viable, scalable alternative for distributed power plants.
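
One simple way to use a forecast as a reference for anomaly detection is to flag the hours in which measured output falls well below the predicted value. The sketch below is an assumption for illustration: the one-sided rule, the fixed threshold, and the function `flag_anomalies` are hypothetical and do not reproduce a detection procedure specified in this study.

```python
def flag_anomalies(measured, predicted, threshold_kw):
    """Return indices where measured output is below the forecast by more than threshold_kw."""
    if len(measured) != len(predicted):
        raise ValueError("series must have equal length")
    return [
        i for i, (m, p) in enumerate(zip(measured, predicted))
        if p - m > threshold_kw
    ]

# Hypothetical hourly values (kW); index 3 simulates a generation shortfall.
measured = [0.0, 1.9, 3.1, 0.4, 2.0]
predicted = [0.0, 2.0, 3.0, 2.9, 2.1]
suspect_hours = flag_anomalies(measured, predicted, threshold_kw=1.0)
print(suspect_hours)  # [3]
```

In practice the threshold would need to reflect the model's typical residual spread, for example a multiple of the RMSE observed on validation data.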

In addition to the internal analysis, a comparison with related literature was conducted. The proposed RF model achieved higher R² values than those reported in recent works employing traditional methodologies [26,27] and exhibited greater consistency than models applied in contexts with high climatic variability [28]. Conversely, other studies [26,29] reported R² values slightly higher than those obtained in the present work. Nonetheless, the methodological advantage of the proposed approach lies in achieving competitive results without relying on local meteorological data, thereby expanding the practical applicability of forecasting in regions with limited monitoring infrastructure.

The limitations of this study include the use of meteorological data from a single station and a restricted data-collection period that does not encompass the full annual cycle, which may limit the generalizability of the results. In particular, relying solely on the INMET Goiânia station means that the reported error metrics reflect a specific combination of regional climate, station–plant distances, and orographic conditions. Consequently, while the proposed methodology is conceptually applicable to other regions with sparse monitoring networks, its predictive performance should be reassessed whenever different meteorological stations or climates are considered. Future research could address these limitations by integrating multiple meteorological data sources (including remote-sensing data, such as satellite observations), extending the historical dataset, and employing hybrid machine learning models that combine different techniques.

Acknowledgments

Fundação Sitawi supported the work of Leonardo Alves Messias through a permanence scholarship (Commitment Term n. CRF_ALPB_2023_108), and Fundação de Amparo à Pesquisa do Estado de Goiás (FAPEG) supported the work of Raphael de Aquino Gomes through a grant (Process n. 202410267000906).

Author Contributions

L.A.M.: conceptualization, methodology, investigation, software, writing, validation; J.L.D.: review, supervision; T.A.M.: corrections, validation; B.B.F.C.: technical feedback, suggestions; A.C.F.M.: review, corrections; S.F.S.M.: review, corrections, data curation; R.A.G.: conceptualization, methodology, investigation, writing, validation, project administration, funding acquisition.

Funding

This research was supported by Fundação Sitawi through a permanence scholarship (Commitment Term n. CRF_ALPB_2023_108) and by Fundação de Amparo à Pesquisa do Estado de Goiás (FAPEG).

Competing Interests

The authors have declared that no competing interests exist.

Data Availability Statement

The datasets and source code supporting the findings of this study are publicly available at https://github.com/raphaeldeaquino/photovoltaic-forecasting. The repository includes all scripts used for data preprocessing, model training, and analysis, as well as the anonymized, processed photovoltaic generation data.

AI-Assisted Technologies Statement

The authors used Grammarly, an AI-assisted language editing tool, exclusively to improve spelling, grammar, and overall readability of the manuscript. The tool was not used to generate or paraphrase text, nor to contribute to the study’s scientific content, analysis, or conclusions. The authors carefully reviewed all suggestions and take full responsibility for the integrity and originality of the manuscript’s content.

References

1. Operador Nacional do Sistema Elétrico. O sistema em números, 2024 [Internet]. Operador Nacional do Sistema Elétrico; 2024 [cited 2024 September 2]. Available from: https://www.ons.org.br/paginas/sobre-o-sin/o-sistema-em-numeros.
2. ANEEL. ANEEL Normative Resolution No. 482/2012 (in Portuguese) [Internet]. ANEEL; 2012 [cited 2024 December 18]. Available from: https://atosoficiais.com.br/aneel/resolucao-normativa-n-482-2012-estabelece-as-condicoes-gerais-para-o-acesso-de-microgeracao-e-minigeracao-distribuida-aos-sistemas-de-distribuicao-de-energia-eletrica-o-sistema-de-compensacao-de-energia-eletrica-e-da-outras-providencias-2023-02-07-versao-compilada?origin=instituicao#.
3. Kim GG, Choi JH, Park SY, Bhang BG, Nam WJ, Cha HL, et al. Prediction model for PV performance with correlation analysis of environmental variables. IEEE J Photovolt. 2019; 9: 832-841.
4. Haupt SE, McCandless TC, Lee JC, Kosović B, Alessandrini S, Dettling S, et al. Combining physical modeling with artificial intelligence for solar power forecasting. Proceedings of the 47th IEEE Photovoltaic Specialists Conference (PVSC); 2020 June 15-August 21; Calgary, AB, Canada. New York, NY: IEEE.
5. Scott C, Ahsan M, Albarbar A. Machine learning for forecasting a photovoltaic (PV) generation system. Energy. 2023; 278: 127807.
6. Mellit A, Massi Pavan A, Ogliari E, Leva S, Lughi V. Advanced methods for photovoltaic output power forecasting: A review. Appl Sci. 2020; 10: 487.
7. Iheanetu KJ. Solar photovoltaic power forecasting: A review. Sustainability. 2022; 14: 17005.
8. Di Leo P, Ciocia A, Malgaroli G, Spertino F. Advancements and challenges in photovoltaic power forecasting: A comprehensive review. Energies. 2025; 18: 2108.
9. Rajagukguk RA, Ramadhan RA, Lee HJ. A review on deep learning models for forecasting time series data of solar irradiance and photovoltaic power. Energies. 2020; 13: 6623.
10. Tieghi CP, de Lima Caneppele F, Dal Pai A, Godinho EZ, Almeida CF, Malagueta DC, et al. Applications and challenges of artificial intelligence in solar radiation forecasting: A systematic review (in Portuguese). Rev Bras Climatol. 2025; 36: 170-201.
11. Gaboitaolelwe J, Zungeru AM, Yahya A, Lebekwe CK, Vinod DN, Salau AO. Machine learning based solar photovoltaic power forecasting: A review and comparison. IEEE Access. 2023; 11: 40820-40845.
12. Babu AR, Kumar NB, Narasipuram RP, Periyannan S, Hosseinpour A, Flah A. Solar energy forecasting using machine learning techniques for enhanced grid stability. IEEE Access. 2025; 13: 93735-93754.
13. Ledmaoui Y, El Maghraoui A, El Aroussi M, Saadane R, Chebak A, Chehri A. Forecasting solar energy production: A comparative study of machine learning algorithms. Energy Rep. 2023; 10: 1004-1012.
14. Khan W, Walker S, Zeiler W. Improved solar photovoltaic energy generation forecast using deep learning-based ensemble stacking approach. Energy. 2022; 240: 122812.
15. Liu H, Cai C, Li P, Tang C, Zhao M, Zheng X, et al. Hybrid prediction method for solar photovoltaic power generation using normal cloud parrot optimization algorithm integrated with extreme learning machine. Sci Rep. 2025; 15: 6491.
16. Abdelsattar M, AbdelMoety A, Emad-Eldeen A. Advanced machine learning techniques for predicting power generation and fault detection in solar photovoltaic systems. Neural Comput Appl. 2025; 37: 8825-8844.
17. Qiu Z, Ye J, Lu J, Zhu N. A novel hybrid model integrating CEEMDAN decomposition, dispersion entropy and LSTM for photovoltaic power forecasting and anomaly detection. Sci Rep. 2025; 15: 39386.
18. Park T, Song K, Jeong J, Kim H. Convolutional autoencoder-based anomaly detection for photovoltaic power forecasting of virtual power plants. Energies. 2023; 16: 5293.
19. Hassan AA, Atia DM, El-Madany HT, ElGhannam F. Multi-label machine learning for power forecasting of a grid-connected photovoltaic solar plant over multiple time horizons. Sci Rep. 2025; 15: 32676.
20. Elsaraiti M, Merabet A. Solar power forecasting using deep learning techniques. IEEE Access. 2022; 10: 31692-31698.
21. de Campos BN, Maionchi DD, da Silva JG, Biudes MS, Oliveira NN, Palácios RD. Photovoltaic energy modeling using machine learning applied to meteorological variables. Sustainability. 2025; 17: 7506.
22. Tucci M, Piazzi A, Thomopulos D. Machine learning models for regional photovoltaic power generation forecasting with limited plant-specific data. Energies. 2024; 17: 2346.
23. Yang G, Lee H, Lee G. A hybrid deep learning model to forecast particulate matter concentration levels in Seoul, South Korea. Atmosphere. 2020; 11: 348.
24. Qin S, Liu Z, Qiu R, Luo Y, Wu J, Zhang B, et al. Short-term global solar radiation forecasting based on an improved method for sunshine duration prediction and public weather forecasts. Appl Energy. 2023; 343: 121205.
25. Bischl B, Binder M, Lang M, Pielok T, Richter J, Coors S, et al. Hyperparameter optimization: Foundations, algorithms, best practices, and open challenges. Wiley Interdiscip Rev Data Min Knowl Discov. 2023; 13: e1484.
26. Olcay K, Tunca SG, Özgür MA. Forecasting and performance analysis of energy production in solar power plants using long short-term memory (LSTM) and random forest models. IEEE Access. 2024; 12: 103299-103312.
27. Pisal MV, Shinde AV. Application of machine learning model for forecasting of floating PV cell. Int J Sci Res Eng Manage. 2025; 9. doi: 10.55041/IJSREM44475.
28. Rangelov D, Boerger M, Tcholtchev N, Lämmel P, Hauswirth M. Design and development of a short-term photovoltaic power output forecasting method based on random forest, deep neural network and LSTM using readily available weather features. IEEE Access. 2023; 11: 41578-41595.
29. Balal AT, Jafarabadi YP, Demir AT, Igene MT, Giesselmann MT, Bayne ST. Forecasting solar power generation utilizing machine learning models in Lubbock. Emerg Sci J. 2023; 7: 1052-1062.

Power plant	Municipality	Distance (km)	Inverter power (kW)	Module power (kW)
1	Carmo do Rio Verde	151.9	5	4.92
2	Santa Cruz de Goiás	108.5	10	12.22
3	Gameleira de Goiás	62.7	10	14.84
4	Trindade	30.1	3	3.68
5	Goiânia	4.0	3	3.88

Study	Models Evaluated	RF	LSTM	Other Models
Our proposal	RF, LSTM	R² > 0.90, lowest MAE/RMSE in 24 h	Best on 48 h+ horizons	-
Olcay et al. (2024) [26]	RF, LSTM	R² = 0.8111	R² = 0.9759	-
Pisal et al. (2025) [27]	RF, LSTM, SVR, GB	R² = 0.6893 (GB)	MAPE = 2.5%	GB: RMSE = 0.4510, MAE = 0.3945
Balal et al. (2023) [29]	RF, LSTM	R² = 0.977, MSE = 2.06%	R² = 0.975, MSE = 2.23%	-
Rangelov et al. (2023) [28]	RF, DNN, LSTM	RF/DNN performs well but is unstable	Unstable in a dynamic climate	Competitive DNN

Parameter	Min. value	Median value	Max. value
Random Forest hyperparameters
n_estimators	50	100	200
max_depth	10	20	∞
min_samples_split	2	5	10
min_samples_leaf	1	2	4
LSTM hyperparameters
model_neurons	8	16	32
model_dropout_rate	0.0	0.4	0.7
batch_size	16	32	-
epochs	64	128	244
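
To give a sense of the size of the Random Forest search space listed above, the sketch below enumerates every combination of the three tabulated values per hyperparameter. Treating the table as an exhaustive three-value grid is an assumption, since the search strategy is not stated here; `expand_grid` is a hypothetical helper, and `None` stands in for the unbounded `max_depth` (shown as ∞), following scikit-learn's convention.

```python
from itertools import product

# Values copied from the Random Forest rows of the hyperparameter table.
rf_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, None],       # None = no depth limit
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

def expand_grid(grid):
    """Yield one dict per combination of the listed values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

candidates = list(expand_grid(rf_grid))
print(len(candidates))  # 81
print(candidates[0])
```

Each resulting dictionary uses scikit-learn's parameter names, so it could be unpacked into `RandomForestRegressor(**params)` if that library is used for training.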

Plant	Scenario	Model	Best R²	Historical window (h)
1	1	LSTM	0.928	72
1	1	RF	0.903	24
2	1	LSTM	0.940	72
2	1	RF	0.904	72
3	1	LSTM	0.931	48
3	1	RF	0.915	24
4	1	LSTM	0.931	72
4	1	RF	0.919	48
5	1	LSTM	0.952	48
5	1	RF	0.944	24, 48