Run MLZ¶

Input file¶

A brief explanation on how to run MLZ. The main code is included as a executable and can be called directly or its directory, its content can be viewed here

A self-explanatory view of the Input file template is helpful to look at before running the code. This file can be used as a template for other files, the parameters can be checked in advance by setting CheckOnly to yes.

Note that the names of the variables are case insensitive but all of them need to be present.

Prepare data¶

Both the training file and test file must have the attributes (magnitudes, colors, etc.) and (optimally) their errors. If errors are not present assume a very small value is used. For now ascii files and numpy files (.npy) are valid. Spectroscopic redshifts must be included on the training file, if present on the test file they can be used for testing the performance of MLZ, although it is not required.

Add the full path relative to the working directory of these file to the input file and define a output folder for the results.

There are 3 very important variables on the input file to specify the columns and the attributes to use by separating them using comas. Make sure to indicate the spectroscopic redshift by its name in the KeyAtt variable. Also always indicate the name of the error columns by adding the letter e in front of the name of the attribute (see the Input file template for an example)

In the Att variable, indicate the attributes to use to make a compute photo-z, you can add or remove attributes but make sure they are present on the columns names. Order is not important, but order in columns name are important

Some hidden parameters¶

In order to make the Input file template not too busy, there are some hidden parameters in the utils/utils_mlz.py file that are not frequently used and can be manually modified, among the most important ones:

oobfraction: fraction of data used for cross-validation, default is 1/3

writepdf: Write the PDF? default if yes, if not needed can be set to no

Run the code¶

Check the Running a test for a example use of the code on SDSS data including with the distribution.

To run the code, if using mpi4py from the main folder type:
$ mpirun -n <cores> ./runMLZ <input file>
Where <cores> is the number of processors desired to use and <input file> is the name of the Input file template. If not using mpi4py, type:
$ ./runMLZ <input file>
Or if distribution is build or installed using pip, just type:
$ runMLZ <input file>
This will create two folder on the output directory, one named trees (or maps) where several files for trees or maps are stored for further analysis and the other folder named results where the main results are stored as well as the parameters used. The .mlz file contains 7 columns (zspec, zmode, zmean, zconf_mode, zcond_mean, error_mode, error_mean) which summarizes the results if no PDF is further needed. The PDF for all the galaxies are also stored in the same folder.

Optional arguments¶

Since version 1.2 there are some extra options from the command line at the moment of execution, these are:
--help          --> Show list of optional commands and help

--version       --> Shows current MLZ version

--no_train      --> It skips the training stage assuming it was already done

--no_test       --> It skips the testing stage on test data (it only does training)

--no_pdfs       --> It doesn't write or produce photo-z PDFs only single point estimates

--check_only    --> Do a quick run to check everything is OK before a long run

--print_keys    --> Print all the keys from inputs file

--modify MODIFY --> Modify a parameter from command line, e.g., --modify maxz=1.0 ntrees=2 testfile=data.fits

--replace       --> Replace output filenames (trees, random catalogs,maps)
For example, if only a training is needed but with a modification of the number of trees it can be done with:
$ mpirun -n <cores> ./runMLZ <input file> --no_test --modify ntrees=10
Now, suppose we have trained and we want to apply the solution to a test file defined in the inputs file and we only want the single point estimation (not the PDFs), we do:
$ mpirun -n <cores> ./runMLZ <input file> --no_train --no_pdfs
Let’s assume we have a second test file and we want to apply the same solution but with PDFs this time, we can do:
$ mpirun -n <cores> ./runMLZ <input file> --no_train --modify testfile=<newfile> finalfilename=<new name>
The options –print_keys and –check_only are very useful for preparing and checking before a big job sumbission

Machine learning approach¶

MLZ can be used through TPZ or SOMz and whichever is used is set on the Input file template under the PredictionMode variable. Whether is a classification or a Regression problem this is set on the PredictionClass variable. There are some variables common for both approaches and other exclusively used by one of them. For classification labels you can must use integers can can use the variable MinZ and MaxZ to enclose the range of values. OOB and cross-validation data are computed when the variable OobError is set to yes and a ranking of variable importance can be computed if the variable VarImportance is set to yes.

Preview of results¶

Some routines are provided to preview some results. See the Running a test and plotting for more information and some examples of figures that can be created