Progress: Radial Velocity: Using Machine Learning to Analyze Exoplanetary Data

The script below analyzes exoplanetary data. Exoplanets are planets that orbit stars outside our solar system, and scientists want to understand characteristics such as their atmospheric temperature, pressure, and composition. The script reads a CSV file of exoplanetary data, cleans and processes it, clusters the exoplanets into groups based on their atmospheric features, and then trains a small neural network to predict water content from those same features.

One of the first steps in the script is to handle missing values. Missing or infinite values can break the analysis or model training later on, so the script converts infinities to NaN and then drops every row that contains a missing value.
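
As a rough illustration of that cleaning step, here is a minimal sketch on a made-up table. The column names and values are hypothetical and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy table (made-up values) with one NaN entry and one infinite entry.
toy = pd.DataFrame({
    "temperature": [280.0, np.nan, 320.0, 300.0],
    "pressure": [1.2, 1.5, np.inf, 1.1],
})

# Treat infinities as missing, then drop every incomplete row.
clean = toy.replace([np.inf, -np.inf], np.nan).dropna()

print(f"kept {len(clean)} of {len(toy)} rows")
```

Dropping rows is the simplest option. If too many rows disappear this way, imputing the missing values may be worth considering instead.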

Next, the script converts the values in the ‘exoplanetary atmospheric composition’ column to floats, because the machine learning model expects numerical input. Any value that cannot be parsed as a number is coerced to NaN, so rows with such values also have to be dropped before modeling.
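
To see what the coercion does, here is a small sketch on a hypothetical column of strings. The values are invented for illustration.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as text, plus one unparsable entry.
raw = pd.Series(["0.78", "0.21", "unknown", "0.01"])

# errors="coerce" converts what it can and turns everything else into NaN.
numeric = pd.to_numeric(raw, errors="coerce")
n_bad = int(numeric.isna().sum())

# Drop the coerced NaN values so only usable numbers remain.
numeric = numeric.dropna()

print(f"{n_bad} value(s) could not be parsed")
```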

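After this, the script scales the ‘exoplanetary atmospheric temperature’, ‘exoplanetary atmospheric pressure’, and ‘exoplanetary atmospheric composition’ columns using the StandardScaler class from scikit-learn, so that each column has zero mean and unit variance. Scaling matters here because KMeans relies on Euclidean distances: without it, a feature with a large numeric range would dominate the clustering. It also tends to help the neural network train more smoothly.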
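
The effect of scaling can be checked on a tiny made-up array. This is only a sketch, and the numbers are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows of (temperature, pressure); the two columns have very different ranges.
X_raw = np.array([[250.0, 0.5], [300.0, 1.0], [350.0, 1.5]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)

# After scaling, each column has mean 0 and standard deviation 1.
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))

# New data must go through the same fitted scaler, not a freshly fitted one.
new_scaled = scaler.transform(np.array([[300.0, 1.0]]))
```

Reusing the fitted scaler for new data matters because the model only ever saw values in the scaled coordinate system.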

Once the data has been cleaned and processed, the script uses the KMeans algorithm from scikit-learn to cluster the exoplanets into 3 groups based on their atmospheric temperature, pressure, and composition. The script then adds a new column to the original dataframe called ‘cluster’ which stores the cluster labels for each exoplanet.
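
Here is a minimal sketch of the clustering step on synthetic data. The three well-separated blobs are generated purely for illustration, and the `n_init` and `random_state` settings are choices made for this example rather than values from the script.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: 20 points around each of three well-separated centers.
rng = np.random.default_rng(0)
centers = np.array([[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]])
X_toy = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])

# Fit KMeans and get one cluster label per point in a single call.
kmeans_toy = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans_toy.fit_predict(X_toy)

# Count how many points landed in each cluster.
counts = np.bincount(labels, minlength=3)
print(counts)
```

Note that the label numbers themselves are arbitrary: cluster 0 in one run has no inherent meaning beyond grouping points together.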

After this, the script one-hot encodes the ‘exoplanetary water content’ column into 101 classes, which presumes the column holds integer values from 0 to 100, and defines a neural network model with two dense layers. The model is compiled with the ‘categorical_crossentropy’ loss function and the ‘adam’ optimizer, and is then fit to the scaled features for 10 epochs with a batch size of 32.
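
One-hot encoding turns each integer label into a row with a single 1 in the position of its class. The sketch below reproduces that idea with plain NumPy instead of Keras so it runs without a deep learning install. The labels and the class count of 3 are made up for readability.

```python
import numpy as np

# Made-up integer class labels and a small class count for readability.
labels = np.array([0, 2, 1, 2])
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(one_hot)
```

This layout is what the ‘categorical_crossentropy’ loss expects: one column per class, with each row summing to 1.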

Finally, the script defines a mapping from exoplanetary atmospheric composition names to integers and uses the trained model to predict the water content for a single new exoplanet and for multiple new exoplanets. The script also visualizes the clusters by plotting the exoplanets in a scatterplot, with different colors for each cluster.
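
A softmax output layer returns one probability per class, and the predicted label is the index of the largest one. The sketch below decodes a hypothetical batch of model outputs. The probabilities are invented and do not come from the trained model.

```python
import numpy as np

# Hypothetical softmax outputs for two exoplanets over three classes.
probs = np.array([
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
])

# The predicted class is the position of the highest probability in each row.
predicted = np.argmax(probs, axis=1)

# The highest probability itself gives a rough sense of how confident the model is.
confidence = probs.max(axis=1)

print(predicted, confidence)
```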

Some potential findings that could be derived from this script include:

  1. Identifying groups of exoplanets with similar atmospheric characteristics.
  2. Predicting the water content of new exoplanets based on their atmospheric features.
  3. Seeing if there is a relationship between exoplanetary atmospheric characteristics and water content.
  4. Visualizing the clusters of exoplanets to gain a better understanding of their characteristics.

The full script:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from keras.layers import Dense
from keras.models import Sequential
from keras.utils import to_categorical

# Load the dataset
df = pd.read_csv('data.csv')

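# Check for and handle missing values (infinities are treated as missing)
df = df.replace([np.inf, -np.inf], np.nan).dropna()

# Convert the values in the 'exoplanetary atmospheric composition' column to floats
df['exoplanetary atmospheric composition'] = pd.to_numeric(df['exoplanetary atmospheric composition'], errors='coerce')
# errors='coerce' turns unparsable values into NaN, so drop those rows as well
df = df.dropna(subset=['exoplanetary atmospheric composition'])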

# Scale the data
scaler = StandardScaler()
features = ['exoplanetary atmospheric temperature', 'exoplanetary atmospheric pressure', 'exoplanetary atmospheric composition']
X = scaler.fit_transform(df[features])

# Use an unsupervised learning model to cluster exoplanets based on their atmospheric temperature, pressure, and composition
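kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)  # fixed seed for reproducible clusters
cluster_labels = kmeans.fit_predict(X)  # fit the model and assign labels in one call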

# Add the cluster labels as a new column to the original dataframe
df['cluster'] = cluster_labels

# Print the number of exoplanets in each cluster
print(df['cluster'].value_counts())

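# One-hot encode the labels
# Assumption: 'exoplanetary water content' holds integer values from 0 to 100,
# which is what 101 classes implies; to_categorical requires integer class labels
num_classes = 101
y = to_categorical(df['exoplanetary water content'], num_classes=num_classes)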

# Define the model
model = Sequential()
model.add(Dense(3, input_shape=(3,)))
model.add(Dense(num_classes, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model to the training data
num_epochs = 10
batch_size = 32
model.fit(X, y, epochs=num_epochs, batch_size=batch_size)

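# Define the composition_mapping variable
# Note: these integer codes are only meaningful if the composition column in
# data.csv was encoded the same way before training
composition_mapping = {'H2O': 0, 'CO2': 1, 'O2': 2}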

# Use the model to predict water content for a new exoplanet
new_exoplanet = np.array([[280, 1.2, composition_mapping['H2O']]])
new_exoplanet = scaler.transform(new_exoplanet)  # Scale the new exoplanet
prediction = model.predict(new_exoplanet)[0]
predicted_label = np.argmax(prediction)

# Use the model to predict water content for multiple new exoplanets
new_exoplanets = np.array([
    [280, 1.2, composition_mapping['H2O']], 
    [300, 1.5, composition_mapping['CO2']], 
    [320, 1.8, composition_mapping['O2']]
])
new_exoplanets = scaler.transform(new_exoplanets)  # Scale the new exoplanets
predictions = model.predict(new_exoplanets)
predicted_labels = np.argmax(predictions, axis=1)

# Visualize the clusters by plotting the exoplanets in a scatterplot
import matplotlib.pyplot as plt
colors = {0: 'red', 1: 'blue', 2: 'green'}
for cluster, color in colors.items():
    mask = df['cluster'] == cluster
    # Use separate names so the training labels stored in y are not overwritten
    temperature = df.loc[mask, 'exoplanetary atmospheric temperature']
    pressure = df.loc[mask, 'exoplanetary atmospheric pressure']
    plt.scatter(temperature, pressure, color=color, label=f'Cluster {cluster}')
plt.xlabel('Atmospheric Temperature (K)')
plt.ylabel('Atmospheric Pressure (bar)')
plt.legend()
plt.show()
