1. What is PySpark? – PySpark is the Python API for Apache Spark used for distributed data processing.
2. What is Microsoft Fabric? – A unified analytics platform integrating data engineering, data science, and Power BI.
3. What is the role of Spark in Fabric? – Spark provides distributed data processing within Fabric notebooks.
4. How to view Spark session in Fabric? – Use the spark variable; it's auto-created in Fabric notebooks.
5. How to read a CSV file? – df = spark.read.csv('path', header=True, inferSchema=True) (end-to-end sketch after this list)
6. How to read a Parquet file? – df = spark.read.parquet('path')
7. How to show top rows? – df.show(5)
8. How to print schema? – df.printSchema()
9. How to count total rows? – df.count()
10. How to display columns? – df.columns
11. How to describe data? – df.describe().show()
12. How to select columns? – df.select('col1', 'col2')
13. How to filter rows? – df.filter(df.age > 30)
14. How to use SQL in PySpark? – df.createOrReplaceTempView('table'); spark.sql('SELECT * FROM table') (SQL sketch after this list)
15. What is a DataFrame? – A distributed collection of data organized into columns.
16. What is an RDD? – Resilient Distributed Dataset; low-level distributed data structure in Spark.
17. How to convert DataFrame to RDD? – df.rdd
18. How to add a new column? – df.withColumn('newCol', df.col1 + 10)
19. How to drop a column? – df.drop('colName')
20. How to rename a column? – df.withColumnRenamed('old', 'new')
21. How to get distinct values? – df.select('col').distinct()
22. How to sort data? – df.sort('col') or df.orderBy('col')
23. How to group and aggregate? – df.groupBy('col').agg({'sales': 'sum'})
24. What is a transformation? – Operation that creates a new DataFrame (lazy).
25. What is an action? – Operation that triggers computation (e.g., show, collect).
26. How to cache a DataFrame? – df.cache()
27. How to unpersist a DataFrame? – df.unpersist()
28. How to join two DataFrames? – df1.join(df2, 'key', 'inner') (join sketch after this list)
29. Types of joins? – inner, left, right, full, cross.
30. How to save DataFrame as Parquet? – df.write.parquet('path')
31. How to overwrite existing data? – df.write.mode('overwrite').csv('path')
32. How to append data? – df.write.mode('append').csv('path')
33. What is a Lakehouse in Fabric? – A Fabric item that combines data lake storage with warehouse-style tables; data lives in OneLake and tables are stored in Delta format.
34. How to write to Lakehouse table? – df.write.format('delta').saveAsTable('lakehouse.table') (Delta sketch after this list)
35. How to read a Delta table? – spark.read.format('delta').load('path')
36. What is a Delta table? – An open table format built on Parquet that adds ACID transactions; it is the default table format for Fabric Lakehouses.
37. How to drop duplicates? – df.dropDuplicates()
38. How to replace null values? – df.fillna({'col': 0}) (null-handling sketch after this list)
39. How to check for nulls? – df.filter(df.col.isNull())
40. How to union DataFrames? – df1.union(df2)
41. How to sample data? – df.sample(0.1)
42. How to get unique count? – from pyspark.sql.functions import countDistinct; df.select(countDistinct('col')).show()
43. How to use when/otherwise? – from pyspark.sql.functions import when; df.withColumn('new', when(df.col > 0, 1).otherwise(0)) (column-function sketch after this list)
44. How to repartition data? – df.repartition(4)
45. How to coalesce partitions? – df.coalesce(1)
46. What is lazy evaluation? – Spark waits until an action is called to execute transformations.
47. How to get DataFrame schema? – df.schema
48. What is broadcast join? – A join where the smaller DataFrame is copied to every executor so the large side avoids a shuffle, e.g. df1.join(broadcast(df2), 'key').
49. What is the default file format in Spark? – Parquet.
50. How to register a temp view? – df.createOrReplaceTempView('view')
51. How to execute SQL query? – spark.sql('SELECT * FROM view')
52. How to measure execution time? – Use %time or Python’s time module.
53. How to read JSON? – spark.read.json(‘path’)
54. How to save JSON? – df.write.json(‘path’)
55. How to check Spark version? – spark.version
56. How to show Spark config? – spark.sparkContext.getConf().getAll()
57. What is a Fabric notebook? – An interactive environment in Fabric for writing Spark, SQL, and Python code.
58. What is autoscale in Fabric Spark? – Automatically adjusts resources based on workload.
59. How to optimize joins? – Use broadcast joins for small tables.
60. How to improve performance? – Cache reused data, use partitioning, avoid shuffles.
61. How to create a Delta table in Fabric? – spark.sql("CREATE TABLE tbl USING DELTA LOCATION 'path'")
62. How to show tables? – spark.sql('SHOW TABLES')
63. How to delete table? – spark.sql('DROP TABLE table_name')
64. What is schema inference? – Spark automatically detects column types from data.
65. What is a partition? – Logical division of data for parallel processing.
66. What is a shuffle? – Data redistribution between partitions during wide transformations.
67. What is collect()? – Returns all rows as a list to the driver.
68. What is take()? – Returns first n rows to the driver.
69. How to convert DataFrame to Pandas? – df.toPandas()
70. What is show() used for? – Displays top rows of a DataFrame.
71. How to read Excel in Fabric PySpark? – Use pandas.read_excel() in a notebook cell, then spark.createDataFrame() to convert if a Spark DataFrame is needed.
72. How to stop Spark session? – spark.stop()
73. What is the advantage of Delta format? – Supports ACID, versioning, and time travel.
74. What is display() in Fabric? – Renders DataFrame output in notebook cell with visualization options.
75. How to check number of partitions? – df.rdd.getNumPartitions()
76. How to repartition by column? – df.repartition('col')
77. What is collect_set()? – Returns unique values as array per group.
78. What is a window function? – A function that performs operations across a window of rows.
79. How to use rank()? – from pyspark.sql.window import Window; from pyspark.sql.functions import rank; rank().over(Window.partitionBy('col').orderBy('val')) (window sketch after this list)
80. What is persist()? – Caches DataFrame in memory or disk for reuse.
81. How to check data types? – df.dtypes
82. What is a checkpoint? – Saves RDD lineage to break dependency chain.
83. How to drop rows with nulls? – df.na.drop()
84. How to replace nulls with value? – df.na.fill('N/A')
85. What is explode()? – Converts array or map column into multiple rows.
86. How to convert string to date? – to_date(df.col, 'yyyy-MM-dd')
87. How to write to SQL table? – df.write.jdbc(url, table, mode, properties)
88. How to read from SQL table? – spark.read.jdbc(url, table, properties)
89. How to rename multiple columns? – Use reduce or loop with withColumnRenamed.
90. How to get summary statistics? – df.summary().show()
91. What is collect_list()? – Aggregates values into a list per group.
92. What is countDistinct()? – Returns count of unique values.
93. What is broadcast()? – A method to broadcast a variable to all worker nodes.
94. What is df.limit()? – Limits rows in output.
95. How to cache table in SQL? – spark.sql('CACHE TABLE table_name')
96. What is MLlib? – Spark’s machine learning library.
97. How to install packages in Fabric notebook? – Use %pip install package_name.
98. How to run Python code in Fabric? – Use Python cell in notebook.
99. How to schedule a notebook in Fabric? – Use Fabric Data Pipeline or Job scheduling feature.
100. How to monitor Spark jobs in Fabric? – Check Fabric Monitoring hub under Spark Job runs.
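A few consolidated sketches below tie many of these one-liners together. First, the end-to-end flow behind questions 5–32: read a CSV, filter, add and rename columns, aggregate, and write Parquet. The path Files/sales.csv and the columns age, amount, and region are hypothetical placeholders; spark is the session Fabric creates automatically (question 4).

```python
from pyspark.sql import functions as F

# Read a CSV with a header and inferred types (Q5), then inspect the schema (Q8).
df = spark.read.csv("Files/sales.csv", header=True, inferSchema=True)
df.printSchema()

# Filter rows (Q13), derive a column (Q18), and rename one (Q20).
df = (
    df.filter(F.col("age") > 30)
      .withColumn("amount_with_bonus", F.col("amount") + 10)
      .withColumnRenamed("region", "sales_region")
)

# Group and aggregate (Q23), sort (Q22), and show the top rows (Q7).
summary = df.groupBy("sales_region").agg(F.sum("amount").alias("total_amount"))
summary.orderBy(F.desc("total_amount")).show(5)

# Write the result as Parquet, overwriting any previous output (Q30-31).
summary.write.mode("overwrite").parquet("Files/sales_summary")
```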
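Next, registering a temp view and querying it with Spark SQL (questions 14, 50–51, 62, 95); the view name sales and the sample rows are invented for illustration.

```python
# Build a small DataFrame and expose it to SQL as a temp view (Q50).
df = spark.createDataFrame(
    [("EMEA", 100.0), ("EMEA", 300.0), ("APAC", 200.0)],
    schema="region STRING, amount DOUBLE",
)
df.createOrReplaceTempView("sales")

# Cache the view (Q95) and run a SQL query against it (Q14, Q51).
spark.sql("CACHE TABLE sales")
spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
    ORDER BY total_amount DESC
""").show()

# List the tables and views visible to the session (Q62).
spark.sql("SHOW TABLES").show()
```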
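A join sketch covering questions 28–29, 48, 59, and 93; both DataFrames are made up, and broadcasting only pays off when the broadcast side is genuinely small.

```python
from pyspark.sql.functions import broadcast

orders = spark.createDataFrame(
    [(1, "US", 100.0), (2, "DE", 250.0), (3, "US", 75.0)],
    schema="order_id INT, country_code STRING, amount DOUBLE",
)
countries = spark.createDataFrame(
    [("US", "United States"), ("DE", "Germany")],
    schema="country_code STRING, country_name STRING",
)

# Plain inner join on a shared key column (Q28).
joined = orders.join(countries, "country_code", "inner")

# Broadcast hint: copy the small table to every executor to avoid shuffling the large side (Q48, Q59, Q93).
hinted = orders.join(broadcast(countries), "country_code", "left")
hinted.show()
```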
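Null handling and duplicate removal (questions 37–39, 83–84), again on invented data.

```python
raw = spark.createDataFrame(
    [(1, "Alice", None), (1, "Alice", None), (2, None, 300.0)],
    schema="id INT, name STRING, amount DOUBLE",
)

# Find rows where a column is null (Q39).
raw.filter(raw["name"].isNull()).show()

# Drop exact duplicates (Q37), then fill nulls per column (Q38, Q84).
clean = (
    raw.dropDuplicates()
       .fillna({"amount": 0.0})
       .na.fill("N/A", subset=["name"])
)
clean.show()
```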
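The column functions from questions 42–43, 77, 85, and 91–92 in one place; the events DataFrame is hypothetical.

```python
from pyspark.sql import functions as F

events = spark.createDataFrame(
    [("u1", 5, ["a", "b"]), ("u1", -2, ["b"]), ("u2", 7, ["c"])],
    schema="user STRING, score INT, tags ARRAY<STRING>",
)

# Conditional column with when/otherwise (Q43).
events.withColumn("is_positive", F.when(F.col("score") > 0, 1).otherwise(0)).show()

# Count of unique users (Q42, Q92).
events.select(F.countDistinct("user").alias("unique_users")).show()

# collect_list keeps duplicates, collect_set keeps only unique values (Q91, Q77).
events.groupBy("user").agg(
    F.collect_list("score").alias("all_scores"),
    F.collect_set("score").alias("distinct_scores"),
).show()

# explode turns each array element into its own row (Q85).
events.select("user", F.explode("tags").alias("tag")).show()
```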
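A window-function sketch for questions 78–79: rank revenue within each region. Table and column names are placeholders.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

sales = spark.createDataFrame(
    [("EMEA", "p1", 100), ("EMEA", "p2", 300), ("APAC", "p3", 200)],
    schema="region STRING, product STRING, revenue INT",
)

# Define the window (partition + ordering), then apply rank() over it.
w = Window.partitionBy("region").orderBy(F.desc("revenue"))
sales.withColumn("rank_in_region", F.rank().over(w)).show()
```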
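A Lakehouse/Delta sketch for questions 33–36, 61, and 73, assuming the notebook has a default Lakehouse attached; the table name sales_summary is a placeholder.

```python
summary = spark.createDataFrame(
    [("EMEA", 400.0), ("APAC", 200.0)],
    schema="region STRING, total DOUBLE",
)

# Save as a managed Delta table in the attached Lakehouse (Q34).
summary.write.format("delta").mode("overwrite").saveAsTable("sales_summary")

# Read it back as a table (or via spark.read.format('delta').load(path) for a folder, Q35).
spark.read.table("sales_summary").show()

# Delta keeps table versions, enabling time travel (Q73),
# assuming version 0 exists and the runtime's Delta version supports VERSION AS OF.
previous = spark.sql("SELECT * FROM sales_summary VERSION AS OF 0")
```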
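Finally, partitioning and caching basics (questions 26–27, 44–46, 66, 75, 80); the partition counts here are purely illustrative, not tuning advice.

```python
# A synthetic million-row DataFrame.
df = spark.range(1_000_000)
print(df.rdd.getNumPartitions())   # current partition count (Q75)

# repartition triggers a full shuffle (Q44, Q66); coalesce merges partitions without one (Q45).
df8 = df.repartition(8)
df1 = df8.coalesce(1)

# cache() marks the DataFrame for reuse; the cache is only filled when an action runs (Q26, Q46).
df8.cache()
df8.count()
df8.unpersist()                    # release the cached data (Q27)
```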
Thanks for reading this post! I hope you found it helpful. Feel free to share it with others or your teammates so they can benefit from it too.